[XOM-interest] Parsing an XML document from a URL

Discussion:

PG-Dillingham, Iain

2010-05-07 20:53:16 UTC

Hello all,

Apologies if this question has been asked (and answered) before. I checked the archives, but only for the last year or so.

I'm trying to parse an XML document from a URL, as explained in the tutorial (http://www.xom.nu/tutorial.xhtml#d0e417). Although this works perfectly with http://www.cafeconleche.org/, many other URLs result in an IOException (e.g. http://en.wikipedia.org/wiki/Main_Page). The exception suggests the problem lies with DTDs hosted on www.w3.org not responding (response code 503).

I have tried calling Builder.build(...) with an InputStream, Reader and String but get the same error each time. I am not using a validating parser.

Any advice would be much appreciated; I'm still learning the Java ropes!

Thanks in advance,

Iain

Elliotte Rusty Harold

2010-05-08 00:12:44 UTC

On Fri, May 7, 2010 at 4:53 PM, PG-Dillingham, Iain
<Iain.Dillingham.1 at city.ac.uk> wrote:
> Hello all,
>
> Apologies if this question has been asked (and answered) before. I checked the archives, but only for the last year or so.
>
> I'm trying to parse an XML document from a URL, as explained in the tutorial (http://www.xom.nu/tutorial.xhtml#d0e417). Although this works perfectly with http://www.cafeconleche.org/, many other URLs result in an IOException (e.g. http://en.wikipedia.org/wiki/Main_Page). The exception suggests the problem lies with DTDs hosted on www.w3.org not responding (response code 503).
>

The W3C has begun rejecting requests for the XHTML DTDs that it
doesn't like. The Cafe con Leche pages point to local pages instead, so
they don't have this problem.

--
Elliotte Rusty Harold
elharo at ibiblio.org

PG-Dillingham, Iain

2010-05-08 10:04:30 UTC

Thanks for your prompt reply.

I appreciate that the issue here lies with the W3C rejecting requests and noted the local DTD at Cafe con Leche. However, I wonder if you -- or another member of this list -- could suggest a work-around? I don't wish to validate the XML, merely parse and extract certain elements; ideally without storing the XML locally. Using XOM, is this possible for XML where the DTD is not accessible? Should I use SAX or DOM instead?

As I mentioned in my email I'm a Java beginner: XOM was highlighted as an excellent XML API for people like me!

Thanks for your time,

Iain

Elliotte Rusty Harold

2010-05-08 10:38:36 UTC

On Sat, May 8, 2010 at 6:04 AM, PG-Dillingham, Iain
<Iain.Dillingham.1 at city.ac.uk> wrote:
> Thanks for your prompt reply.
>
> I appreciate that the issue here lies with the W3C rejecting requests and noted the local DTD at Cafe con Leche. However, I wonder if you -- or another member of this list -- could suggest a work-around? I don't wish to validate the XML, merely parse and extract certain elements; ideally without storing the XML locally. Using XOM, is this possible for XML where the DTD is not accessible? Should I use SAX or DOM instead?
>

http://www.w3.org/blog/systeam/2008/02/08/w3c_s_excessive_dtd_traffic

--
Elliotte Rusty Harold
elharo at ibiblio.org

Dave Pawson

2010-05-08 10:45:52 UTC

On 8 May 2010 11:38, Elliotte Rusty Harold <elharo at ibiblio.org> wrote:
> On Sat, May 8, 2010 at 6:04 AM, PG-Dillingham, Iain
> <Iain.Dillingham.1 at city.ac.uk> wrote:
>> Thanks for your prompt reply.
>>
>> I appreciate that the issue here lies with the W3C rejecting requests and noted the local DTD at Cafe con Leche. However, I wonder if you -- or another member of this list -- could suggest a work-around? I don't wish to validate the XML, merely parse and extract certain elements; ideally without storing the XML locally. Using XOM, is this possible for XML where the DTD is not accessible? Should I use SAX or DOM instead?
>>
>
> http://www.w3.org/blog/systeam/2008/02/08/w3c_s_excessive_dtd_traffic

Mean old Elliotte :-)

That's why they don't like it.
How do you fix it?

Your choice.
1. Get rid of the DTD reference in your XML instance.
2. Copy the DTD from the given URL to your hard disk,
then change the reference in the XML instance to point at the local copy.
3. Learn how to use XML catalogs [1], which
basically tell XOM that when it sees URL X it should go to file Y instead.

[1] http://www.oasis-open.org/committees/entity/spec-2001-08-06.html

http://www.dpawson.co.uk/docbook/catalogs.html might help you
understand the use with XSLT.
I haven't used one with XOM; perhaps Elliotte can tell you
how to 'redirect' the resolver to use your catalog.

HTH

--
Dave Pawson
XSLT XSL-FO FAQ.
Docbook FAQ.
http://www.dpawson.co.uk
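
The "when it sees URL X, go to file Y" idea from option 3 can be approximated without a catalog file by a plain SAX EntityResolver. What follows is only a minimal sketch under assumptions: the class name LocalDtdResolver and the contents of the map are hypothetical, and it uses nothing beyond the JDK's own SAX parser. XOM's Builder has a constructor that accepts an XMLReader, so a reader configured this way could be handed to it.

```java
import java.util.Map;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

/** Hypothetical helper: redirects known DTD URLs to local copies. */
class LocalDtdResolver implements EntityResolver {

    // Maps a remote system ID ("URL X") to a local location ("file Y").
    private final Map<String, String> localCopies;

    LocalDtdResolver(Map<String, String> localCopies) {
        this.localCopies = localCopies;
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        String local = localCopies.get(systemId);
        if (local == null) {
            return null; // unknown ID: the parser fetches it as usual
        }
        InputSource source = new InputSource(local);
        source.setPublicId(publicId);
        return source;
    }

    /** Builds a non-validating SAX reader that consults the resolver. */
    static XMLReader newReader(Map<String, String> localCopies)
            throws ParserConfigurationException, SAXException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();
        reader.setEntityResolver(new LocalDtdResolver(localCopies));
        return reader;
    }

    // With XOM on the classpath, the reader would be used roughly as:
    //   new nu.xom.Builder(LocalDtdResolver.newReader(map)).build(url);
}
```

The map key must match the system identifier exactly as it appears in the document's DOCTYPE, and the value should be a URL such as file:///path/to/xhtml1-strict.dtd. Note that the XHTML DTDs pull in further entity files, which would need entries of their own.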

Elliotte Rusty Harold

2010-05-08 11:00:39 UTC

On Sat, May 8, 2010 at 6:45 AM, Dave Pawson <dave.pawson at gmail.com> wrote:

> http://www.dpawson.co.uk/docbook/catalogs.html might help you
> understand the use with XSLT.
> I haven't used one with XOM; perhaps Elliotte can tell you
> how to 'redirect' the resolver to use your catalog.
>

I've considered adding catalog support to XOM, but there hasn't been any
real demand for it. You should be able to use a SAX-based XML Catalog
resolver as the base parser for XOM:

http://xml.apache.org/commons/components/resolver/resolver-article.html

--
Elliotte Rusty Harold
elharo at ibiblio.org
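
As a rough illustration of a catalog-aware SAX parser serving as the base parser, here is a sketch that swaps the Apache resolver for the javax.xml.catalog API built into the JDK. That is an assumption on two counts: it needs Java 9 or later, which postdates this thread, and the class name CatalogReaders and the catalog location are hypothetical. The resulting XMLReader is the kind of object XOM's Builder(XMLReader) constructor accepts.

```java
import java.net.URI;
import javax.xml.catalog.CatalogFeatures;
import javax.xml.catalog.CatalogManager;
import javax.xml.catalog.CatalogResolver;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

/** Hypothetical helper: a SAX reader that resolves DTDs via an XML catalog. */
class CatalogReaders {

    static XMLReader newCatalogReader(URI catalogFile)
            throws ParserConfigurationException, SAXException {
        // "continue" means: if the catalog has no matching entry, fall back
        // to the parser's normal behaviour instead of throwing an exception
        // (the JDK default for this feature is "strict").
        CatalogFeatures features = CatalogFeatures.builder()
                .with(CatalogFeatures.Feature.RESOLVE, "continue")
                .build();
        CatalogResolver resolver =
                CatalogManager.catalogResolver(features, catalogFile);

        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();
        // CatalogResolver is also an org.xml.sax.EntityResolver.
        reader.setEntityResolver(resolver);
        return reader;
    }

    // With XOM on the classpath, roughly:
    //   new nu.xom.Builder(CatalogReaders.newCatalogReader(uri)).build(url);
}
```

A matching catalog file could contain a single entry such as <system systemId="http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd" uri="xhtml1-strict.dtd"/> inside a <catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog"> element, where the uri is resolved relative to the catalog file itself.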

Dave Pawson

2010-05-08 11:41:59 UTC

On 8 May 2010 12:00, Elliotte Rusty Harold <elharo at ibiblio.org> wrote:
> On Sat, May 8, 2010 at 6:45 AM, Dave Pawson <dave.pawson at gmail.com> wrote:
>
>> http://www.dpawson.co.uk/docbook/catalogs.html might help you
>> understand the use with XSLT.
>> I haven't used one with XOM; perhaps Elliotte can tell you
>> how to 'redirect' the resolver to use your catalog.
>>
>
> I've considered adding catalog support to XOM, but there hasn't been any
> real demand for it. You should be able to use a SAX-based XML Catalog
> resolver as the base parser for XOM:
>
> http://xml.apache.org/commons/components/resolver/resolver-article.html
>
> --
> Elliotte Rusty Harold
> elharo at ibiblio.org

java -cp .:/sgml:/myjava/saxon655.jar:/myjava/xercesImple.jar:/myjava/resolver.jar \
  com.icl.saxon.StyleSheet -o $3 \
  -x org.apache.xml.resolver.tools.ResolvingXMLReader \
  -y org.apache.xml.resolver.tools.ResolvingXMLReader \
  -r org.apache.xml.resolver.tools.CatalogResolver \
  -w1 $1 $2 "saxon.extensions=1" $4 $5 $6

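The -x, -y and -r options are what give Saxon access to the resolver: -x and -y set the SAX parser class for the source and the stylesheet, and -r sets the URIResolver class.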
IMHO it's a worthwhile addition... redirection?

<grin/> Feature request Elliotte.

regards

--
Dave Pawson
XSLT XSL-FO FAQ.
Docbook FAQ.
http://www.dpawson.co.uk

Peter Murray-Rust

2010-05-08 11:42:37 UTC

On Sat, May 8, 2010 at 11:45 AM, Dave Pawson <dave.pawson at gmail.com> wrote:
>
> That's why they don't like it.
> How do you fix it?
>
> Your choice.
> 1. Get rid of the DTD reference in your xml instance.

I read a lot of HTML files from a lot of sites and DTD references can
cause a lot of trouble. So routinely when I process HTML (often not
XHTML, of course) I strip the DTD reference automatically. This has to
be done by something other than XOM even if the document is well
formed. Then I pass it to Tidy or TagSoup and thence to XOM.

--
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
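
Stripping the DTD reference before the markup reaches the parser can be sketched with a regular expression. This is an illustration under assumptions, not the code Peter uses: the class name DoctypeStripper is hypothetical, and the pattern only copes with a DOCTYPE whose optional internal subset contains no "]" character.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: removes the DOCTYPE declaration from markup text. */
class DoctypeStripper {

    // Matches <!DOCTYPE ...> including an optional [internal subset].
    // Assumption: the internal subset, if present, contains no ']'.
    private static final Pattern DOCTYPE = Pattern.compile(
            "<!DOCTYPE[^>\\[]*(\\[[^\\]]*\\])?[^>]*>",
            Pattern.CASE_INSENSITIVE);

    /** Returns the markup with its first DOCTYPE declaration removed. */
    static String strip(String markup) {
        return DOCTYPE.matcher(markup).replaceFirst("");
    }
}
```

Once the DOCTYPE is gone, named entities that the DTD defined (such as &nbsp; in XHTML) are no longer declared, so a strict XML parser will reject a document that still uses them. Passing the result through Tidy or TagSoup first, as described above, sidesteps that.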

PG-Dillingham, Iain

2010-05-08 12:50:15 UTC

Elliotte, thank you for the W3C link.

Dave and Elliotte, thank you for the suggestions. I will investigate these further. There's plenty to learn here!

Peter, ditto. This suggestion is more at my ability-level: much appreciated.

Iain