[XOM-interest] Illegal character in W3CDOM passed to XOM

Discussion:

Peter Murray-Rust

2012-05-17 22:42:46 UTC

I am using the XOM DOMConverter to try to convert a W3CDOM.

public static nu.xom.Element convertW3CDocument(org.w3c.dom.Element w3cElement) {
    return nu.xom.converters.DOMConverter.convert(w3cElement);
}

and getting an illegal character exception:

... 4 more
Caused by: nu.xom.IllegalCharacterDataException: 0x14 is not allowed in XML content
at nu.xom.Verifier.throwIllegalCharacterDataException(Verifier.java:154)
at nu.xom.Verifier.checkPCDATA(Verifier.java:205)
at nu.xom.Text._setValue(Text.java:126)
at nu.xom.Text.<init>(Text.java:62)
at nu.xom.converters.DOMConverter.convert(DOMConverter.java:217)
at nu.xom.converters.DOMConverter.convert(DOMConverter.java:166)
at nu.xom.converters.DOMConverter.convert(DOMConverter.java:354)

I am sure this is not XOM's fault or problem, but I am wondering how the
W3C DOM comes to output illegal XML.

This comes at the end of a long chain: PDFBox converts PDF to Java
Graphics2D, which is then converted by Batik to a W3C DOM. I'm not surprised
that there are problems with encoding (or the lack of it).

It would appear that the W3C DOM has illegal characters in its Text
nodes. I assume the DOM does no checking here, so it is possible to
construct invalid XML. I'm simply wondering if anyone has had similar
experiences, and whether there are quick fixes or whether the characters/
encoding in Batik's DOM have to be fixed.
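
One possible quick fix, sketched below, is to filter the character data before XOM sees it. This is only a sketch under assumptions: the class and method names (XmlCharFilter, isLegalXmlChar, stripIllegal) are made up for illustration and belong to neither XOM nor Batik. The ranges are those of the Char production in XML 1.0, which excludes 0x14.

```java
/** Hypothetical helper (not part of XOM or Batik): drops code points XML 1.0 forbids. */
final class XmlCharFilter {
    private XmlCharFilter() {}

    /** True if cp matches the XML 1.0 Char production. */
    static boolean isLegalXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns s with every illegal code point removed; unpaired surrogates are dropped too. */
    static String stripIllegal(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);   // reads a full surrogate pair when one is present
            i += Character.charCount(cp);
            if (isLegalXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }
}
```

Something like this could be applied to each org.w3c.dom.Text node (getData, then setData) before the DOM is handed to DOMConverter.convert. Silently dropping characters loses information, so substituting U+FFFD instead may be preferable if the positions matter.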

P.

--
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069

Elliotte Rusty Harold

2012-05-19 13:27:33 UTC

Your analysis is correct. I suspect the original PDF contains the bad
character, (well, bad for XML; probably not bad for PDF) and XOM is
the first part of the chain to notice it.
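
To see where in the tree the bad character sits before XOM rejects it, a walk over the W3C DOM along the following lines might help. It is a sketch only: DomCharScanner and findIllegal are hypothetical names, it uses nothing beyond the JDK's org.w3c.dom interfaces, and it looks only at text, CDATA and comment nodes, not at attribute values.

```java
import java.util.ArrayList;
import java.util.List;
import org.w3c.dom.CharacterData;
import org.w3c.dom.Node;

/** Hypothetical diagnostic (not part of XOM): lists code points XML 1.0 forbids. */
final class DomCharScanner {
    private DomCharScanner() {}

    /** True if cp matches the XML 1.0 Char production. */
    static boolean isLegalXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns one message per illegal code point found at or below root. */
    static List<String> findIllegal(Node root) {
        List<String> hits = new ArrayList<>();
        scan(root, hits);
        return hits;
    }

    private static void scan(Node node, List<String> hits) {
        // CharacterData is the common supertype of Text, CDATASection and Comment.
        if (node instanceof CharacterData) {
            String data = ((CharacterData) node).getData();
            Node parent = node.getParentNode();
            String where = (parent == null) ? "(detached)" : parent.getNodeName();
            for (int i = 0; i < data.length(); ) {
                int cp = data.codePointAt(i);
                if (!isLegalXmlChar(cp)) {
                    hits.add(String.format("0x%X at offset %d under <%s>", cp, i, where));
                }
                i += Character.charCount(cp);
            }
        }
        // Recurse through the children; very deep trees would need an explicit stack.
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            scan(c, hits);
        }
    }
}
```

Calling findIllegal on the Batik-produced document (or element) before DOMConverter.convert should show which elements carry the offending data, which may help trace it back to the PDF.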

--
Elliotte Rusty Harold
elharo at ibiblio.org

Peter Murray-Rust

2012-05-19 14:10:17 UTC

On Sat, May 19, 2012 at 2:27 PM, Elliotte Rusty Harold

Post by Elliotte Rusty Harold
Your analysis is correct. I suspect the original PDF contains the bad
character, (well, bad for XML; probably not bad for PDF) and XOM is
the first part of the chain to notice it.

Thanks. I agree. There could be all sorts of character encodings and image
data.

I noticed that XOM would prevent people from doing this (good for XOM!) and
that the W3C engine was too lax and allowed invalid XML to be output. Ah well,
I'll probably cut out the W3C bit.

P.

_______________________________________________
XOM-interest mailing list
XOM-interest at lists.ibiblio.org
http://lists.ibiblio.org/mailman/listinfo/xom-interest


Michael Kay

2012-05-19 16:05:37 UTC

Post by Peter Murray-Rust
I noticed that XOM would prevent people doing this - good for XOM! and
that the W3C engine was too lax and allowed invalid XML to be output.
Ah well, I'll probably cut out the W3C bit.

If you're still using DOM, then it's high time you (and the millions of
other people who use it thinking that if it's a W3C standard and in the
JDK then it must be good) moved off it to something better.

Michael Kay
Saxonica

Peter Murray-Rust

2012-05-19 16:25:26 UTC

Post by Michael Kay

Post by Peter Murray-Rust
I noticed that XOM would prevent people doing this - good for XOM! and
that the W3C engine was too lax and allowed invalid XML to be output.
Ah well, I'll probably cut out the W3C bit.

If you're still using DOM,

I'm not! It's in the Batik package. What I meant is that I will probably
rewrite the bits of Batik using XOM.

Post by Michael Kay
then it's high time you (and the millions of
other people who use it thinking that if it's a W3C standard and in the
JDK then it must be good) moved off it to something better.

I moved off 7 years ago.