Re: Help to extract data from a web page
- From: Joe Kesselman <keshlam-nospam@xxxxxxxxxxx>
- Date: Fri, 24 Aug 2007 23:21:46 -0400
(Despite its name, microsoft.public.xsl doesn't let me post to it, so you're only going to get an answer in comp.text.xml.)
XSLT is set up to process XML, not HTML. Your HTML document will not go through an XML parser. So the first thing you'll need to do is put it through an HTML-to-XHTML conversion layer, such as the W3C's "tidy" tool. (Alternatively you could feed the output of an HTML-to-XML parser, such as NekoHTML, into an XSLT processor... but that will require a bit more programming to hook those tools to each other.)
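To make the "HTML in, XML tree out" step concrete, here is a rough sketch using only Python's standard library. It is not tidy and not NekoHTML: it is a stand-in that feeds html.parser events into an ElementTree builder, and the class name HtmlToTree is made up for this example. It closes void elements such as <br> and any tags left open at the end, but it does not repair implied end tags the way a real cleanup tool does, so treat it as an illustration of the idea only.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML elements that never take a closing tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class HtmlToTree(HTMLParser):
    """Hypothetical helper: turn HTML parser events into an XML element tree."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.builder = ET.TreeBuilder()
        self.open_tags = []  # mirrors the builder's stack of open elements

    def handle_starttag(self, tag, attrs):
        # Valueless attributes (e.g. "checked") get their own name as value.
        clean = {k: (v if v is not None else k) for k, v in attrs}
        self.builder.start(tag, clean)
        if tag in VOID:
            self.builder.end(tag)  # close immediately, XML-style
        else:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray end tag: ignore it
        # Close any still-open children first, then the element itself.
        while self.open_tags:
            top = self.open_tags.pop()
            self.builder.end(top)
            if top == tag:
                break

    def handle_data(self, data):
        if self.open_tags:  # drop text outside the root element
            self.builder.data(data)

    def finish(self):
        """Flush the parser, close whatever is still open, return the root."""
        self.close()
        while self.open_tags:
            self.builder.end(self.open_tags.pop())
        return self.builder.close()


parser = HtmlToTree()
parser.feed('<html><body><p class="title">Hello<br>world</p><hr></body></html>')
root = parser.finish()
print(ET.tostring(root, encoding="unicode"))
```

The serialized result is well-formed XML, so it can be handed to any XML parser or XSLT processor.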
After doing that... what do you mean by "extract article data"? You're writing a program, so you need to be explicit about what it's supposed to do. Page title and article title are easy; look for <p> elements with the appropriate class attribute, using XPaths with predicates.
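As a sketch of what "XPaths with predicates" looks like in practice, the Python below selects <p> elements by class attribute. The sample markup and the class names "pagetitle" and "articletitle" are invented, since I'm guessing at your page's structure; substitute whatever your page really uses. The same path expressions should work as select attributes in XSLT.

```python
import xml.etree.ElementTree as ET

# Hypothetical XHTML; the class names are assumptions for illustration.
PAGE = """<html><body>
<p class="pagetitle">Daily News</p>
<p class="articletitle">XSLT in Practice</p>
<p class="other">Something else</p>
</body></html>"""


def text_of_p_with_class(root, class_name):
    """Return the text of the first <p> whose class equals class_name."""
    node = root.find(f".//p[@class='{class_name}']")
    if node is None:
        return None
    return "".join(node.itertext()).strip()


root = ET.fromstring(PAGE)
print(text_of_p_with_class(root, "pagetitle"))
print(text_of_p_with_class(root, "articletitle"))
```

One thing to watch for: if your converted document declares the XHTML namespace, unprefixed names such as p will not match, and the paths will need a namespace prefix bound to it.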
Article date is more of a pain since you need to search for the <td> with the appropriate text value, then retrieve its following sibling's value... unless you can count on the fact that it will always be in the first <tr>, in which case you search for the second td of that tr.
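Both approaches to the date can be sketched as follows. The table layout and the "Date:" label are assumptions on my part. In full XPath 1.0 the first approach would be something like //td[normalize-space(.)='Date:']/following-sibling::td[1]; Python's ElementTree has no following-sibling axis, so a small loop stands in for it here.

```python
import xml.etree.ElementTree as ET

# Hypothetical table; the "Date:" label is an assumption.
TABLE = """<table>
<tr><td>Date:</td><td>24 Aug 2007</td></tr>
<tr><td>Article body goes here.</td></tr>
</table>"""


def cell_text(cell):
    """Text of a cell, including nested markup, with outer whitespace trimmed."""
    return "".join(cell.itertext()).strip()


def value_after_label(table, label):
    """Find the <td> whose text equals label; return the next <td> in its row."""
    for row in table.iter("tr"):
        cells = row.findall("td")
        for i, cell in enumerate(cells[:-1]):
            if cell_text(cell) == label:
                return cell_text(cells[i + 1])
    return None


def second_cell_of_first_row(table):
    """Positional shortcut: only safe if the date is always in the first row."""
    cell = table.find("./tr[1]/td[2]")
    return None if cell is None else cell_text(cell)


table = ET.fromstring(TABLE)
print(value_after_label(table, "Date:"))
print(second_cell_of_first_row(table))
```

The label search survives rows being reordered; the positional version is shorter but breaks as soon as the layout shifts.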
Content -- Can you count on that being the second tr? If so, just copying the contents of that seems to meet your need.
Author -- Again assuming that it's reliably going to be the third tr, this is more of a pain because you're going to have to do string manipulation to extract the author's name.
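For the string manipulation, XPath 1.0 gives you normalize-space(), substring-after() and substring-before(). The sketch below mimics those in Python so you can see the logic. The byline format ("Posted by NAME, ROLE") is purely a guess, so adjust the delimiters to match your page.

```python
import xml.etree.ElementTree as ET

# Hypothetical table; the byline wording in the third row is an assumption.
TABLE = """<table>
<tr><td>Date:</td><td>24 Aug 2007</td></tr>
<tr><td>Article body goes here.</td></tr>
<tr><td>Posted by   Jane Doe, Staff Writer</td></tr>
</table>"""


def normalize_space(s):
    """Like XPath normalize-space(): trim and collapse runs of whitespace."""
    return " ".join(s.split())


def substring_after(s, sep):
    """Like XPath substring-after(): empty string if sep does not occur."""
    _, found, tail = s.partition(sep)
    return tail if found else ""


def substring_before(s, sep):
    """Like XPath substring-before(): empty string if sep does not occur."""
    head, found, _ = s.partition(sep)
    return head if found else ""


def extract_author(table):
    """Pull the name out of the third row, assuming 'by NAME, ROLE'."""
    row = table.find("./tr[3]")
    if row is None:
        return None
    text = normalize_space("".join(row.itertext()))
    rest = substring_after(text, "by ")
    # Fall back to everything after "by " when there is no trailing role.
    name = substring_before(rest, ",") or rest
    return name.strip() or None


table = ET.fromstring(TABLE)
print(extract_author(table))
```

In a stylesheet the equivalent would be nested substring-before(substring-after(...)) calls on the third row's string value.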
Having broken it down to this point, you really ought to be able to complete the task yourself by consulting a good intro-to-XSLT tutorial. Try it, and if you run into trouble come back with specific questions.
--
() ASCII Ribbon Campaign | Joe Kesselman
/\ Stamp out HTML e-mail! | System architexture and kinetic poetry