Re: Help to extract data from a web page



(Despite its name, microsoft.public.xsl doesn't let me post to it, so you're only going to get an answer in comp.text.xml.)

XSLT is designed to process XML, not HTML, and a typical HTML document will not make it through an XML parser. So the first thing you'll need to do is run it through an HTML-to-XHTML conversion layer, such as the W3C's "tidy" tool. (Alternatively, you could feed the output of an HTML-aware parser, such as NekoHTML, into an XSLT processor... but that will require a bit more programming to hook those tools together.)

After doing that... what do you mean by "extract article data"? You're writing a program, so you need to be explicit about what it's supposed to do. Page title and article title are easy; look for <p> elements with the appropriate class attribute, using XPaths with predicates.
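
To make the predicate idea concrete, here is a minimal sketch in Python, using the standard library's ElementTree, which understands a small subset of XPath. I haven't seen your real page, so the markup and the class names "pagetitle" and "articletitle" are invented for illustration. In a stylesheet the same path expressions would go in a select attribute. Namespaces are ignored here for brevity; real tidy output usually puts elements in the XHTML namespace, which you would have to account for.

```python
import xml.etree.ElementTree as ET

# Hypothetical XHTML, roughly as it might look after tidying.
# The class names are assumptions, not taken from the real page.
SAMPLE = """<html><body>
<p class="pagetitle">Daily News</p>
<p class="articletitle">XSLT for Beginners</p>
<p class="other">Ignore me</p>
</body></html>"""


def text_of_p_with_class(root, class_name):
    """Return the text of the first <p> whose class attribute equals class_name."""
    # The [@class='...'] part is the predicate: it filters <p> elements by attribute value.
    node = root.find(f".//p[@class='{class_name}']")
    return node.text if node is not None else None


root = ET.fromstring(SAMPLE)
page_title = text_of_p_with_class(root, "pagetitle")
article_title = text_of_p_with_class(root, "articletitle")
print(page_title, "/", article_title)
```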

Article date is more of a pain since you need to search for the <td> with the appropriate text value, then retrieve its following sibling's value... unless you can count on the fact that it will always be in the first <tr>, in which case you search for the second td of that tr.
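
Both approaches to the date can be sketched as follows, again in Python with ElementTree and again against made-up markup (the "Date:" label and the cell order are my guesses). ElementTree has no following-sibling axis, so the label search walks the row's cells by hand; in real XPath 1.0 it would be something like td[normalize-space(.)='Date:']/following-sibling::td[1].

```python
import xml.etree.ElementTree as ET

# Hypothetical table layout; the "Date:" label and the cell order are assumptions.
SAMPLE = """<table>
<tr><td>Date:</td><td>2006-05-01</td></tr>
<tr><td colspan="2">Article body goes here.</td></tr>
</table>"""


def date_by_label(table, label="Date:"):
    """Find the <td> whose text is the label; return the text of the next <td> in that row."""
    for row in table.iter("tr"):
        cells = row.findall("td")
        # Skip the last cell as a candidate label: it has no following sibling.
        for i, cell in enumerate(cells[:-1]):
            if (cell.text or "").strip() == label:
                return cells[i + 1].text
    return None


def date_by_position(table):
    """Shortcut: second <td> of the first <tr>. Only safe if the layout never changes."""
    cell = table.find("./tr[1]/td[2]")
    return cell.text if cell is not None else None


table = ET.fromstring(SAMPLE)
print(date_by_label(table), date_by_position(table))
```

The positional version is shorter but fails silently if a row is ever added above the date, which is why the label search is the safer default.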

Content -- Can you count on that being the second tr? If so, just copying the contents of that seems to meet your need.

Author -- Again assuming that it's reliably going to be the third tr, this is more of a pain because you're going to have to do string manipulation to extract the author's name.
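
The string manipulation might look roughly like this. I don't know how your author line is actually worded, so the "Written by " prefix and the comma terminator below are placeholders. In XSLT 1.0 the matching tools are substring-after() and substring-before(); this Python sketch is only an analogue of that pairing.

```python
def extract_author(cell_text, prefix="Written by ", terminator=","):
    """Return the name between prefix and terminator, or None if the prefix is absent.

    Roughly analogous to substring-before(substring-after(text, prefix), terminator),
    except that a missing terminator keeps the whole remainder instead of yielding "".
    """
    _, found, rest = cell_text.partition(prefix)
    if not found:
        return None
    name, _, _ = rest.partition(terminator)
    return name.strip()


# Hypothetical content of the third row's cell.
author = extract_author("Written by Jane Doe, staff reporter")
print(author)
```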


Having broken it down to this point, you really ought to be able to complete the task yourself by consulting a good intro-to-XSLT tutorial. Try it, and if you run into trouble come back with specific questions.


--
() ASCII Ribbon Campaign | Joe Kesselman
/\ Stamp out HTML e-mail! | System architexture and kinetic poetry