Bolder: trying to extract tags from HTML page

Sunday, 8 September 2013

trying to extract tags from HTML page

I need to find all nodes of an HTML page that have the structure <h2><span
class="mw-headline" ...> ... </span></h2>, with the <h2>...</h2> pairs marking
the beginning and end of each node. I tried to find the nodes like so:
string raw_code = doc.DocumentNode.SelectNodes("/")[0].WriteTo(); // can there be more than 1 node there?
string[] lines = raw_code.Split('\n');
foreach (HtmlNode hdr in doc.DocumentNode.SelectNodes("//span[@class = \"mw-headline\"]"))
{
    int line_number = hdr.Line;
    int line_position = hdr.LinePosition;
    string font_tag = lines[line_number].Substring(line_position - font_tag_length, line_position);
    MessageBox.Show(lines[line_number]); // returns div c
}
Frankly speaking, MessageBox.Show() displays anything but what it is meant
to show, including <div class="thumb tright"> and <p>Mostly flat plains or
gently rolling hills in north and west.</p>.
What have I done wrong?
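
One possible direction, offered only as a sketch: rather than mapping Line and LinePosition back onto re-serialized text, the headings could be selected directly with XPath and their following siblings collected up to the next <h2>. This assumes the HtmlNode and SelectNodes calls in the question come from Html Agility Pack (the NuGet package HtmlAgilityPack). The sample HTML, the class name HeadlineSections, and the idea that a "node" means everything between two headings are assumptions for illustration, not taken from the real page.

```csharp
using System;
using System.Text;
using HtmlAgilityPack; // assumed library; install via NuGet

class HeadlineSections
{
    static void Main()
    {
        // Hypothetical sample input standing in for the real page.
        string html =
            "<html><body>" +
            "<h2><span class=\"mw-headline\" id=\"Geography\">Geography</span></h2>" +
            "<p>Mostly flat plains or gently rolling hills in north and west.</p>" +
            "<h2><span class=\"mw-headline\" id=\"Climate\">Climate</span></h2>" +
            "<p>Temperate.</p>" +
            "</body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Select every <h2> that has a direct <span class="mw-headline"> child.
        HtmlNodeCollection headers =
            doc.DocumentNode.SelectNodes("//h2[span[@class='mw-headline']]");
        if (headers == null) return; // SelectNodes returns null when nothing matches

        foreach (HtmlNode h2 in headers)
        {
            var section = new StringBuilder();
            // Collect sibling nodes until the next <h2> or the end of the parent.
            for (HtmlNode n = h2.NextSibling; n != null && n.Name != "h2"; n = n.NextSibling)
            {
                section.Append(n.OuterHtml);
            }
            Console.WriteLine(h2.InnerText.Trim() + ": " + section);
        }
    }
}
```

Two details of the original snippet are also worth checking independently of this sketch: String.Substring takes (startIndex, length), not (start, end), and the text produced by WriteTo() is not guaranteed to have the same line layout as the source that Line and LinePosition refer to.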