Re: Parsing an HTML page in Java??



Nico Render wrote:

I have a lot of HTML pages that are all structured much the same way. Each
page consists of a headline, a subtitle, a date, and a description text.
In the display, however, only the headline is shown, as a link. Clicking
the link shows the headline, subtitle, date, and description text in full.
My question: is there a parser that can take such a page apart?

I did exactly this a while ago and had good experiences with the combination of TagSoup and XPath.

TagSoup is a SAX parser that copes with ordinary, even partly broken, HTML; its output can be read into a DOM tree. You can then navigate that DOM tree with XPath and extract the text passages you are interested in.

TagSoup is available here:
http://home.ccil.org/~cowan/XML/tagsoup/

XPath (javax.xml.xpath) has been part of the JRE since Java 5 (1.5).
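
To show the JDK side on its own, here is a minimal sketch that needs no TagSoup at all: it parses a small, well-formed fragment with the built-in DocumentBuilder and pulls out the four parts from the question via javax.xml.xpath. The class name XPathSketch, the helper text() and the sample markup are made up for illustration; for real-world HTML you would build the DOM from TagSoup's output instead.

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; the markup below is invented sample data.
class XPathSketch {

    static final String SAMPLE =
        "<html><body>"
        + "<h1>Headline</h1>"
        + "<h2>Subtitle</h2>"
        + "<p class='date'>2006-01-01</p>"
        + "<p class='text'>Description text.</p>"
        + "</body></html>";

    // Parses well-formed markup with the JDK parser and returns the
    // string value of the given XPath expression.
    static String text(String xml, String expr) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expr, doc);
    }

    public static void main(String[] args) throws Exception {
        System.out.println("headline: " + text(SAMPLE, "//h1"));
        System.out.println("subtitle: " + text(SAMPLE, "//h2"));
        System.out.println("date:     " + text(SAMPLE, "//p[@class='date']"));
        System.out.println("text:     " + text(SAMPLE, "//p[@class='text']"));
    }
}
```

Re-parsing the input for every expression keeps the sketch short; in real code you would parse once and reuse the Document.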


Here is an example that takes a Heise front page saved as tagsoup.html and extracts the list of news items together with their teasers (which are unfortunately missing from the RSS feed):

package de.jnana.scratch.tagsoup;

import java.util.List;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.TransformerFactoryConfigurationError;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXSource;

import org.ccil.cowan.tagsoup.Parser;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;

import de.jnana.mrelay.util.XPathHelper;

public class TagSoupDemo {

    // Runs the HTML through TagSoup (a SAX parser) and copies the resulting
    // SAX events into the given empty DOM document via an identity transform.
    private static Document parseTagSoup(InputSource is, Document node)
            throws TransformerFactoryConfigurationError, TransformerException {
        Parser parser = new Parser();
        try {
            parser.setFeature(Parser.bogonsEmptyFeature, false);
            // Namespaces off, so that the unprefixed XPath expressions
            // in main() match the elements.
            parser.setFeature(Parser.namespacesFeature, false);
        } catch (SAXNotRecognizedException e) {
            e.printStackTrace();
        } catch (SAXNotSupportedException e) {
            e.printStackTrace();
        }
        SAXSource ss = new SAXSource(parser, is);

        DOMResult dr = new DOMResult(node);

        // A Transformer created without a stylesheet is the identity transform.
        Transformer t = TransformerFactory.newInstance().newTransformer();
        // t.setOutputProperty("method", "xml");
        // t.setOutputProperty("standalone", "yes");
        // t.setOutputProperty("indent", "yes");
        t.transform(ss, dr);
        return (Document) dr.getNode();
    }

    public static void main(String[] args) throws Exception {
        XPathHelper xph = new XPathHelper();

        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        DocumentBuilder db = dbf.newDocumentBuilder();

        // tagsoup.html must sit next to this class on the classpath.
        InputSource is = new InputSource();
        is.setSystemId(TagSoupDemo.class.getResource("tagsoup.html").toURI()
                .toString());
        Document doc = parseTagSoup(is, db.newDocument());
        System.out.println("doc: " + doc.getNodeType());

        // Each news item is an h3 with class "anriss"; link and subject sit
        // in the contained a element, the teaser in the following p sibling.
        List<Node> nodes = xph.listFor("//heisetext//h3[@class='anriss']", doc);
        for (Node node : nodes) {
            String link = xph.textFor("a/@href", node);
            String subject = xph.textFor("a", node);
            String teaser = xph.textFor("following-sibling::p/text()", node);

            System.out.println("l=" + link + " s=" + subject + " t=" + teaser);
        }
    }
}
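
The loop above evaluates relative XPath expressions against each h3 node rather than against the whole document. A self-contained sketch of just that step, using only the JDK and an invented fragment that mimics the h3/p structure (the class name RelativeXPathSketch and the sample data are assumptions, not the real Heise markup):

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; the markup is invented sample data.
class RelativeXPathSketch {

    static final String SAMPLE =
        "<div>"
        + "<h3 class='anriss'><a href='/a.html'>First</a></h3><p>Teaser one</p>"
        + "<h3 class='anriss'><a href='/b.html'>Second</a></h3><p>Teaser two</p>"
        + "</div>";

    // Returns one "link | subject | teaser" row per matching h3 element.
    static List<String> extract(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();

        NodeList heads = (NodeList) xpath.evaluate(
            "//h3[@class='anriss']", doc, XPathConstants.NODESET);

        List<String> rows = new ArrayList<String>();
        for (int i = 0; i < heads.getLength(); i++) {
            Node h3 = heads.item(i);
            // No leading slash: these paths start at the current h3 node.
            String link = xpath.evaluate("a/@href", h3);
            String subject = xpath.evaluate("a", h3);
            // [1] picks the nearest following p sibling explicitly.
            String teaser = xpath.evaluate("following-sibling::p[1]/text()", h3);
            rows.add(link + " | " + subject + " | " + teaser);
        }
        return rows;
    }

    public static void main(String[] args) throws Exception {
        for (String row : extract(SAMPLE)) {
            System.out.println(row);
        }
    }
}
```

Without the [1], the string result is still taken from the first matching node in document order, so both forms give the same teaser here; the predicate only makes the intent visible.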


And here is the XPathHelper used by the code above:

package de.jnana.mrelay.util;

import java.util.AbstractList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;

import javax.xml.namespace.NamespaceContext;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import javax.xml.xpath.XPathFactoryConfigurationException;

import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class XPathHelper {

    private final XPath xpath;

    // prefix -> namespace URI; must be initialized, otherwise every
    // namespace lookup fails with a NullPointerException
    private final Map<String, String> map = new HashMap<String, String>();

    public XPathHelper() {
        this(XPathFactory.newInstance());
    }

    public XPathHelper(String uri) throws XPathFactoryConfigurationException {
        this(XPathFactory.newInstance(uri));
    }

    public XPathHelper(XPathFactory factory) {
        xpath = factory.newXPath();
        // Delegate all namespace lookups to the prefix map of this helper.
        xpath.setNamespaceContext(new NamespaceContext() {

            public String getNamespaceURI(String prefix) {
                return XPathHelper.this.getNamespaceURI(prefix);
            }

            public String getPrefix(String namespaceURI) {
                return XPathHelper.this.getPrefix(namespaceURI);
            }

            public Iterator<String> getPrefixes(String namespaceURI) {
                return XPathHelper.this.getPrefixes(namespaceURI);
            }
        });
    }

    public void addNamespace(String prefix, String namespaceURI) {
        if ((prefix == null) || (namespaceURI == null)) {
            throw new NullPointerException();
        }
        map.put(prefix, namespaceURI);
    }

    public void removeNamespace(String prefix, String namespaceURI) {
        if ((prefix == null) || (namespaceURI == null)) {
            throw new NullPointerException();
        }
        String oldUri = map.get(prefix);
        // Comparing from the argument side also handles an unknown prefix
        // (oldUri == null) without a NullPointerException.
        if (namespaceURI.equals(oldUri)) {
            map.remove(prefix);
        } else {
            throw new IllegalArgumentException();
        }
    }

    public String getNamespaceURI(String prefix) {
        return map.get(prefix);
    }

    public String getPrefix(String namespaceURI) {
        Iterator<String> i = getPrefixes(namespaceURI);
        if (i.hasNext()) {
            return i.next();
        }
        return null;
    }

    // Lazily iterates over all prefixes bound to the given namespace URI.
    public Iterator<String> getPrefixes(final String namespaceURI) {
        final Iterator<Map.Entry<String, String>> entries = map.entrySet()
                .iterator();
        return new Iterator<String>() {

            // next matching entry, found by hasNext() but not yet returned
            Map.Entry<String, String> pending;

            public boolean hasNext() {
                if (pending != null) {
                    return true;
                }
                while (entries.hasNext()) {
                    pending = entries.next();
                    if (pending.getValue().equals(namespaceURI)) {
                        return true;
                    }
                }
                pending = null;
                return false;
            }

            public String next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                String rv = pending.getKey();
                pending = null;
                return rv;
            }

            public void remove() {
                throw new UnsupportedOperationException();
            }
        };
    }

    // Evaluates expr relative to item and returns the string value.
    public String textFor(String expr, Object item) {
        try {
            return (String) xpath.evaluate(expr, item, XPathConstants.STRING);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Evaluates expr relative to item and returns the first matching node.
    public Node nodeFor(String expr, Object item) {
        try {
            return (Node) xpath.evaluate(expr, item, XPathConstants.NODE);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Evaluates expr relative to item and returns all matching nodes.
    public NodeList nodesetFor(String expr, Object item) {
        try {
            return (NodeList) xpath
                    .evaluate(expr, item, XPathConstants.NODESET);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public List<Node> listFor(String expr, Object item) {
        return asList(nodesetFor(expr, item));
    }

    // Wraps a NodeList in a read-only List view so it works with for-each.
    public static List<Node> asList(final NodeList nodeset) {
        return new AbstractList<Node>() {

            @Override
            public Node get(int index) {
                if (nodeset == null) {
                    throw new IndexOutOfBoundsException("Index: " + index
                            + ", Size: 0");
                }
                return nodeset.item(index);
            }

            @Override
            public int size() {
                return (nodeset == null ? 0 : nodeset.getLength());
            }
        };
    }
}
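
The demo never calls addNamespace because it switches TagSoup's namespace feature off. If the elements do end up in a namespace (as far as I know TagSoup uses the XHTML namespace when the feature is left on), unprefixed expressions match nothing and a prefix has to be bound through a NamespaceContext, which is what the helper's prefix map is for. A self-contained sketch of that effect with the JDK parser only; the class name NamespaceSketch, the prefix "h" and the sample markup are assumptions for illustration:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; the markup is invented sample data.
class NamespaceSketch {

    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    static final String SAMPLE =
        "<html xmlns='http://www.w3.org/1999/xhtml'>"
        + "<body><h3>Namespaced headline</h3></body></html>";

    // Parses namespace-aware and evaluates expr with the prefix "h"
    // bound to the XHTML namespace.
    static String text(String xml, String expr) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        Document doc = dbf.newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(new NamespaceContext() {
            public String getNamespaceURI(String prefix) {
                return "h".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
            }

            public String getPrefix(String uri) {
                return XHTML_NS.equals(uri) ? "h" : null;
            }

            public Iterator<String> getPrefixes(String uri) {
                List<String> prefixes = new ArrayList<String>();
                String prefix = getPrefix(uri);
                if (prefix != null) {
                    prefixes.add(prefix);
                }
                return prefixes.iterator();
            }
        });
        return xpath.evaluate(expr, doc);
    }

    public static void main(String[] args) throws Exception {
        // Unprefixed name tests only match elements without a namespace.
        System.out.println("[" + text(SAMPLE, "//h3") + "]");
        // With the bound prefix the element is found.
        System.out.println("[" + text(SAMPLE, "//h:h3") + "]");
    }
}
```

With XPathHelper the equivalent would be a call to addNamespace("h", "http://www.w3.org/1999/xhtml") before evaluating expressions such as //h:h3.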