[XOM-interest] Illegal character in W3CDOM passed to XOM
Peter Murray-Rust
2012-05-17 22:42:46 UTC
I am using the XOM DOMConverter to try to convert a W3C DOM:

// Convert a W3C DOM element (here, one produced by Batik) to a XOM element.
public static nu.xom.Element convertW3CDocument(org.w3c.dom.Element w3cElement) {
    return nu.xom.converters.DOMConverter.convert(w3cElement);
}
and getting an illegal character exception:

... 4 more
Caused by: nu.xom.IllegalCharacterDataException: 0x14 is not allowed in XML content
at nu.xom.Verifier.throwIllegalCharacterDataException(Verifier.java:154)
at nu.xom.Verifier.checkPCDATA(Verifier.java:205)
at nu.xom.Text._setValue(Text.java:126)
at nu.xom.Text.<init>(Text.java:62)
at nu.xom.converters.DOMConverter.convert(DOMConverter.java:217)
at nu.xom.converters.DOMConverter.convert(DOMConverter.java:166)
at nu.xom.converters.DOMConverter.convert(DOMConverter.java:354)

I am sure this is not XOM's fault/problem, but I am wondering how the
W3C DOM comes to output illegal XML.

This comes at the end of a long chain: PDFBox converts PDF to Java
Graphics2D, which is then converted by Batik to W3C DOM. I'm not surprised
that there are problems with encoding (or lack of it).

It would appear that the W3C DOM has illegal characters in its Text
nodes. I assume the DOM does no checking when text is set, so it is possible
to construct invalid XML. I'm simply wondering if anyone has had similar
experiences, and whether there are quick fixes or whether the
characters/encoding in Batik's DOM have to be fixed.
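
One possible quick fix, sketched here under assumptions rather than taken from XOM or Batik: walk the W3C DOM before conversion and drop every code point outside the XML 1.0 Char production (0x9, 0xA, 0xD, 0x20-0xD7FF, 0xE000-0xFFFD, 0x10000-0x10FFFF). The class name XmlCharFilter and its methods are hypothetical; only standard org.w3c.dom calls are used.

```java
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Hypothetical helper (not part of XOM or Batik): strips characters that XML 1.0 forbids. */
public final class XmlCharFilter {

    private XmlCharFilter() {}

    /** True if the code point matches the XML 1.0 Char production. */
    public static boolean isLegalXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns a copy of s without illegal code points (unpaired surrogates are dropped too). */
    public static String stripIllegal(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (isLegalXmlChar(cp)) {
                sb.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    /** Recursively cleans text, CDATA, comment and attribute values in place. */
    public static void sanitize(Node node) {
        short type = node.getNodeType();
        if (type == Node.TEXT_NODE
                || type == Node.CDATA_SECTION_NODE
                || type == Node.COMMENT_NODE) {
            node.setNodeValue(stripIllegal(node.getNodeValue()));
        }
        NamedNodeMap attrs = node.getAttributes(); // null for non-element nodes
        if (attrs != null) {
            for (int i = 0; i < attrs.getLength(); i++) {
                Node attr = attrs.item(i);
                attr.setNodeValue(stripIllegal(attr.getNodeValue()));
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            sanitize(children.item(i));
        }
    }
}
```

Calling XmlCharFilter.sanitize(w3cElement) before DOMConverter.convert(w3cElement) should then avoid the IllegalCharacterDataException. Note that this silently discards data: a 0x14 in the text may be a sign of a font or encoding mapping problem further up the chain, which dropping the character hides rather than fixes.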

P.
--
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
Elliotte Rusty Harold
2012-05-19 13:27:33 UTC
Your analysis is correct. I suspect the original PDF contains the bad
character, (well, bad for XML; probably not bad for PDF) and XOM is
the first part of the chain to notice it.
--
Elliotte Rusty Harold
elharo at ibiblio.org
Peter Murray-Rust
2012-05-19 14:10:17 UTC
On Sat, May 19, 2012 at 2:27 PM, Elliotte Rusty Harold
Post by Elliotte Rusty Harold
Your analysis is correct. I suspect the original PDF contains the bad
character, (well, bad for XML; probably not bad for PDF) and XOM is
the first part of the chain to notice it.
Thanks. I agree. There could be all sorts of character encodings and image
data.

I noticed that XOM would prevent people doing this (good for XOM!) and that
the W3C engine was too lax and allowed invalid XML to be output. Ah well,
I'll probably cut out the W3C bit.

P.
_______________________________________________
XOM-interest mailing list
XOM-interest at lists.ibiblio.org
http://lists.ibiblio.org/mailman/listinfo/xom-interest
Michael Kay
2012-05-19 16:05:37 UTC
Post by Peter Murray-Rust
I noticed that XOM would prevent people doing this (good for XOM!) and
that the W3C engine was too lax and allowed invalid XML to be output.
Ah well, I'll probably cut out the W3C bit.

If you're still using DOM, then it's high time you (and the millions of
other people who use it thinking that if it's a W3C standard and in the
JDK then it must be good) moved off it to something better.

Michael Kay
Saxonica
Peter Murray-Rust
2012-05-19 16:25:26 UTC
Post by Michael Kay
Post by Peter Murray-Rust
I noticed that XOM would prevent people doing this (good for XOM!) and
that the W3C engine was too lax and allowed invalid XML to be output.
Ah well, I'll probably cut out the W3C bit.
If you're still using DOM,
I'm not! It's in the Batik package. What I meant is that I will probably
rewrite those bits of Batik using XOM.
Post by Michael Kay
then it's high time you (and the millions of
other people who use it thinking that if it's a W3C standard and in the
JDK then it must be good) moved off it to something better.
I moved off 7 years ago.