[XOM-interest] using xpath and querying html

Discussion:

Jason Novotny

2009-10-29 00:23:03 UTC

Hi,

I'm completely new to XPath, but after finding an interesting post on
using XOM, Saxon and TagSoup, I put together the simple example below:

    import nu.xom.Builder;
    import nu.xom.Document;
    import nu.xom.Nodes;
    import nu.xom.XPathContext;
    import org.xml.sax.XMLReader;
    import org.xml.sax.helpers.XMLReaderFactory;

    // Parse (possibly malformed) HTML through TagSoup and build a XOM tree
    XMLReader tagsoup =
        XMLReaderFactory.createXMLReader("org.ccil.cowan.tagsoup.Parser");
    tagsoup.setFeature("http://xml.org/sax/features/namespace-prefixes", true);
    Builder builder = new Builder(tagsoup);

    String url = "http://myurl.com";
    Document doc = builder.build(url);

    // Bind the "html" prefix to the XHTML namespace for use in queries
    XPathContext context =
        new XPathContext("html", "http://www.w3.org/1999/xhtml");

    // THIS ONE WORKS
    Nodes estimated =
        doc.query("//html:span[@class='small_title'][2]", context);
    for (int i = 0; i < estimated.size(); i++) {
        System.err.println(estimated.get(i).toXML());
    }

    // THIS DOESN'T
    Nodes results = doc.query(
        "//html/body/table/tbody/tr/td/table/tbody/tr/td/table[2]/tbody/tr/td[2]/table/tbody/tr/td/table[4]/tbody/tr[3]/td[3]",
        context);
    for (int i = 0; i < results.size(); i++) {
        System.err.println(results.get(i).toXML());
    }

So my problem is the syntax: the first query works fine but the second
one doesn't. I pulled the second query string from Firebug using
"Copy XPath"...

Any guidance on libraries I can use, or on the proper syntax for
querying HTML, would be greatly appreciated!

Thanks, Jason

Christophe Marchand

2009-10-29 07:01:15 UTC

Since you defined "html" as the prefix, your query should be:

/html:html/html:body/html:table/html:tbody/html:tr/html:td/html:table/html:tbody/html:tr/html:td/html:table[2]/html:tbody/html:tr/html:td[2]/html:table/html:tbody/html:tr/html:td/html:table[4]/html:tbody/html:tr[3]/html:td[3]

You can also add a mapping between the default prefix "" and
"http://www.w3.org/1999/xhtml" to your XPathContext

Christophe
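
To illustrate why the prefix is needed on every step, here is a minimal sketch. It uses the JDK's own javax.xml.xpath API rather than XOM so that it runs without extra libraries; the class name, the tiny sample document, and the count helper are all made up for the illustration. The rule it shows comes from XPath 1.0 itself: an unprefixed name in a path only matches elements in no namespace, so it never matches XHTML elements.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of XOM or TagSoup.
class NamespacedXPathDemo {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Small stand-in document: every element is in the XHTML namespace.
    static final String SAMPLE = "<html xmlns='" + XHTML + "'>"
            + "<body><table><tr><td>cell</td></tr></table></body></html>";

    /** Counts the nodes an XPath 1.0 expression selects, with "html" bound to XHTML. */
    static int count(String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // JAXP parsers ignore namespaces by default
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(SAMPLE)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override public String getNamespaceURI(String prefix) {
                    return "html".equals(prefix) ? XHTML : XMLConstants.NULL_NS_URI;
                }
                @Override public String getPrefix(String uri) {
                    return XHTML.equals(uri) ? "html" : null;
                }
                @Override public Iterator<String> getPrefixes(String uri) {
                    return XHTML.equals(uri)
                            ? Collections.singletonList("html").iterator()
                            : Collections.<String>emptyIterator();
                }
            });
            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Unprefixed steps look for elements in no namespace: prints 0
        System.out.println(count("/html/body/table/tr/td"));
        // Prefixed steps match the XHTML elements: prints 1
        System.out.println(count("/html:html/html:body/html:table/html:tr/html:td"));
    }
}
```

The same rule should apply to a XOM query with an XPathContext, since both evaluate XPath 1.0 expressions.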

_______________________________________________
XOM-interest mailing list
XOM-interest at lists.ibiblio.org
http://lists.ibiblio.org/mailman/listinfo/xom-interest

Jason Novotny

2009-10-29 07:53:56 UTC

Thanks!

How do I add a mapping between "" and "http://www.w3.org/1999/xhtml"
exactly?
I tried various combinations of context.addNamespace(..,...) but nothing
seemed to work.

Also, the URL I'm using in the search below (as a sample) is:

String url =
"http://collection.fng.fi/wandora/w?museumfilter=&lang=en&artworkclassfilter=&action=gen&timefilter=&si=&query=kuopassa&qtype=http%3A%2F%2Fwww.wandora.net%2Fartwork&ifilter=all#listmod";

There are several nested tables, and as per Firebug's "Copy XPath" I tried
the following:

"/html:html/html:body/html:table/html:tbody/html:tr/html:td/html:table/html:tbody/html:tr/html:td/html:table[2]/html:tbody/html:tr/html:td[2]/html:table/html:tbody/html:tr/html:td/html:table[4]/html:tbody/html:tr[3]/html:td[3]"

and nothing seems to come back. Hacking around, it seems the deepest I can
go is

"/html:html/html:body/html:table/html:tr/html:td/html:table"

I even tried

"/html:table[@class='content_container']" but that doesn't work either,
although there is clearly a table with class content_container in the
HTML. I wonder if it has anything to do with a comment that is included
there, i.e. part of the HTML looks like this:

</div>
</td></tr></table>

<table class="content_container"><tr><td colspan="1" rowspan="1"
class="content_container_margin"> </td><td colspan="1" rowspan="1"
class="content_container_column">

Not sure if that's a problem...
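
One possible explanation for the long path returning nothing, offered as a guess rather than a diagnosis: browsers add an implicit tbody element to tables in their own DOM, and Firebug copies its path from that DOM, while the tree built from the TagSoup parse may not contain those elements. The shorter path that does work above has no tbody step, which is consistent with this. The sketch below uses a hypothetical helper (FirebugPathRewriter and rewrite are invented names) that prefixes each step and can drop the tbody steps. It only handles plain child steps with optional [n] predicates.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of XOM, TagSoup, or Firebug.
class FirebugPathRewriter {

    /**
     * Rewrites a simple absolute path such as "/html/body/table[2]/tbody/tr"
     * by putting "prefix:" in front of every step and, if requested, removing
     * "tbody" steps. Axes, attributes, functions and "//" are not supported.
     */
    static String rewrite(String path, String prefix, boolean dropTbody) {
        List<String> steps = new ArrayList<>();
        for (String step : path.split("/")) {
            if (step.isEmpty()) {
                continue; // the leading "/" produces an empty first token
            }
            int bracket = step.indexOf('[');
            String name = bracket >= 0 ? step.substring(0, bracket) : step;
            if (dropTbody && name.equals("tbody")) {
                continue; // skip the step the browser may have added
            }
            steps.add(prefix + ":" + step);
        }
        return "/" + String.join("/", steps);
    }

    public static void main(String[] args) {
        String copied = "/html/body/table/tbody/tr/td/table[2]/tbody/tr[3]/td[3]";
        System.out.println(rewrite(copied, "html", true));
    }
}
```

Dropping a tbody step is only safe when each table has a single implicit tbody; with several row groups the tr[n] positions would no longer mean the same thing, so the result is worth checking against the parsed document.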

Thanks, Jason
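
On the "/html:table[@class='content_container']" attempt: in XPath 1.0 a path that starts with a single "/" begins at the document root, so "/html:table" only matches a table that is the root element itself. A comment next to the table should not matter, because element steps skip comment nodes. The sketch below shows both points using the JDK's javax.xml.xpath API instead of XOM, so it runs without extra libraries; the class name and the sample markup are invented for the example.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; the sample markup only imitates the page in question.
class RootVersusDescendantDemo {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    // A nested table preceded by a comment, all in the XHTML namespace.
    static final String SAMPLE = "<html xmlns='" + XHTML + "'><body><div/>"
            + "<!-- some comment -->"
            + "<table class='content_container'><tr><td/></tr></table>"
            + "</body></html>";

    /** Counts the nodes an XPath 1.0 expression selects, with "html" bound to XHTML. */
    static int count(String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for the prefix to resolve
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(SAMPLE)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override public String getNamespaceURI(String prefix) {
                    return "html".equals(prefix) ? XHTML : XMLConstants.NULL_NS_URI;
                }
                @Override public String getPrefix(String uri) {
                    return XHTML.equals(uri) ? "html" : null;
                }
                @Override public Iterator<String> getPrefixes(String uri) {
                    return XHTML.equals(uri)
                            ? Collections.singletonList("html").iterator()
                            : Collections.<String>emptyIterator();
                }
            });
            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Looks for a root element named table: prints 0
        System.out.println(count("/html:table[@class='content_container']"));
        // Searches all descendants of the root: prints 1
        System.out.println(count("//html:table[@class='content_container']"));
    }
}
```

With XOM the equivalent change would be to pass the "//html:table[...]" form to doc.query together with the existing context.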
