[XOM-interest] Inserting entities directly?
n***@io7m.com
2016-05-27 13:41:35 UTC
Hello.

I'm dealing with a text format that allows characters that are not
allowed by XML. I'm referring to those characters in particular areas
of the BMP [U+0001, U+0009], etc:

https://en.wikipedia.org/wiki/Valid_Characters_in_XML#XML_1.0

I'm trying to serialize the text as XML 1.0 and therefore obviously
need to escape some characters. XOM transparently escapes <, &, etc,
and this is fine. However, it raises an exception if I try to append
text to a child that contains forbidden codepoints such as U+0001.
If I try to manually escape characters myself by writing &#0001 and so
on, XOM escapes the ampersand and I end up with &amp;#0001.

What's the correct way to insert the characters myself such that
they'll be escaped?

M
Michael Kay
2016-05-27 14:27:07 UTC
From the XOM FAQ:

http://www.xom.nu/faq.xhtml#d0e186
Does XOM support XML 1.1?
No. XML 1.1 is an abomination. You don't need it and you shouldn't use it.

You're out of luck. XOM is very very strict about validating content according to the XML rules, and you can't cheat. If you want something that isn't XML 1.0, or that is more liberal in what it accepts, then you'll need a different library.

And don't even bother to ask Elliotte: if he thinks jam is bad for you, he won't give you jam however much you plead. He has firm views, and they are usually right.

Michael Kay
Saxonica
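
The strictness described above comes from XML 1.0 itself: a numeric character reference must also expand to a character the Char production allows, so writing &#1; by hand does not help. A minimal sketch that checks this with the JDK's built-in parser rather than XOM (the class and method names are hypothetical, chosen for illustration):

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: asks the JDK's XML 1.0 parser whether a document is well-formed.
final class CharRefCheck {

    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Well-formedness violations surface as SAXParseException.
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("tab reference:    " + isWellFormed("<a>&#9;</a>"));
        System.out.println("U+0001 reference: " + isWellFormed("<a>&#1;</a>"));
    }
}
```

A reference to U+0009 is accepted because tab is a legal XML 1.0 character, while a reference to U+0001 is rejected as a well-formedness error.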
n***@io7m.com
2016-05-27 14:59:47 UTC
'Lo.

On 2016-05-27T15:27:07 +0100
Post by Michael Kay
> From the XOM FAQ
> http://www.xom.nu/faq.xhtml#d0e186
> Does XOM support XML 1.1?
> No. XML 1.1 is an abomination. You don't need it and you shouldn't use it.

No XML 1.1 here, only 1.0.

Post by Michael Kay
> You're out of luck. XOM is very very strict about validating content according to the XML rules, and you can't cheat.

And I appreciate XOM for it. Correctness is typically a vague
afterthought in modern software.

Post by Michael Kay
> If you want something that isn't XML 1.0, or that is more liberal in what it accepts, then you'll need a different library.

I wasn't aware that what I was trying to do was forbidden by the XML
specification (I assumed the restrictions only applied to raw
characters appearing in the document as opposed to characters resulting
from the expansion of hex entities).

I guess I'll be transforming everything to U+FFFD.

M
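
The U+FFFD replacement mentioned above could look roughly like the following. This is only a sketch using the JDK alone: the class and method names are hypothetical, and the ranges are the Char production from the XML 1.0 specification.

```java
// Hypothetical helper: replaces every code point that XML 1.0 forbids with U+FFFD.
final class XmlCharSanitizer {

    /** True if cp matches the XML 1.0 Char production. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns text with every forbidden code point replaced by U+FFFD. */
    static String replaceInvalid(String text) {
        StringBuilder out = new StringBuilder(text.length());
        // codePoints() yields unpaired surrogates as their own value,
        // which falls outside the allowed ranges and is replaced too.
        text.codePoints()
            .forEach(cp -> out.appendCodePoint(isXml10Char(cp) ? cp : 0xFFFD));
        return out.toString();
    }

    public static void main(String[] args) {
        String cleaned = replaceInvalid("bell" + (char) 7 + "here");
        System.out.println(cleaned.indexOf(0xFFFD)); // position of the replacement
    }
}
```

The cleaned string can then be appended to a XOM element as usual. The replacement is lossy: the original code points cannot be recovered from the output.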
Michael Kay
2016-05-27 15:33:18 UTC
Post by n***@io7m.com
> I guess I'll be transforming everything to U+FFFD.

If there's a strong need to support non-XML characters and they occur only occasionally, then using processing instructions can be a handy solution because it doesn't affect content without such characters, and doesn't get in the way of validation:

<a>Some text with <?U FFFD?> special characters</a>

Michael Kay
Saxonica
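
One possible way to produce the processing-instruction form suggested above, sketched with the JDK's built-in DOM rather than XOM so that it is self-contained. The class and method names are hypothetical, the "U" target with a hexadecimal code point follows the example in the message, and whatever reads the document back is assumed to decode these instructions itself.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper: legal text becomes text nodes, each forbidden
// code point becomes a <?U XXXX?> processing instruction.
final class PiEscaper {

    /** True if cp matches the XML 1.0 Char production. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Creates a new document with a root element of the given name. */
    static Element newElement(String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
            Element root = doc.createElement(name);
            doc.appendChild(root);
            return root;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Appends text to parent, encoding forbidden code points as PIs. */
    static void appendEscaped(Element parent, String text) {
        Document doc = parent.getOwnerDocument();
        StringBuilder run = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isXml10Char(cp)) {
                run.appendCodePoint(cp);
            } else {
                flush(doc, parent, run);
                parent.appendChild(
                    doc.createProcessingInstruction("U", String.format("%04X", cp)));
            }
        }
        flush(doc, parent, run);
    }

    /** Emits the pending run of legal characters as one text node. */
    private static void flush(Document doc, Element parent, StringBuilder run) {
        if (run.length() > 0) {
            parent.appendChild(doc.createTextNode(run.toString()));
            run.setLength(0);
        }
    }

    public static void main(String[] args) {
        Element a = newElement("a");
        appendEscaped(a, "Some text with " + (char) 1 + " special characters");
        System.out.println(a.getChildNodes().getLength());
    }
}
```

With XOM the same loop should carry over by appending nu.xom.ProcessingInstruction nodes and plain strings to a nu.xom.Element instead of DOM nodes.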