Discussion:
[XOM-interest] What's an elegant way to write symbolic entities
David Collier-Brown
2011-01-09 16:44:35 UTC
I'm starting a mini-project to get some input documents into a good,
editable xml.
One of the things I desire to do is write entities in their symbolic
form, so that a
non-breaking space will be written as &nbsp; and e-acute as &eacute;

This is in principle doable by sending the output not through a
serializer, but instead through an XSLT processor that uses a list of
character maps, which in turn enumerate the entire list of entities
from my DTD.
However, this looks a lot like "going via Snarey's Corners", a term
from my childhood symbolizing going insanely far out of one's way to
do something.

Is there a simple, straightforward and perhaps even elegant way
to produce output containing symbolic entities?

--dave
--
David Collier-Brown, | Always do right. This will gratify
System Programmer and Author | some people and astonish the rest
davecb at spamcop.net | -- Mark Twain
(416) 223-8968
Michael Kay
2011-01-09 16:58:25 UTC
Post by David Collier-Brown
Is there a simple, straightforward and perhaps even elegant way
to produce output containing symbolic entities?
--dave
You could write a Java Writer that filters the output stream by looking
at each character and converting it accordingly - so long as you know
that special characters won't occur in element/attribute names, and so
long as you don't care what happens in comments and processing instructions.
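
A minimal sketch of such a filtering Writer, under the caveats given above (it rewrites characters everywhere, including names, comments and processing instructions). The class name EntityWriter and the character-to-name map are hypothetical; in practice the map would be built from the DTD's entity declarations, and wiring the Writer to a particular serializer is left out.

```java
import java.io.FilterWriter;
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: rewrites selected characters as named entity references. */
class EntityWriter extends FilterWriter {
    private final Map<Character, String> names;

    EntityWriter(Writer out, Map<Character, String> names) {
        super(out);
        this.names = names;
    }

    @Override
    public void write(int c) throws IOException {
        String name = names.get((char) c);
        if (name == null) {
            out.write(c);                 // ordinary character: pass through
        } else {
            out.write("&" + name + ";"); // mapped character: symbolic form
        }
    }

    @Override
    public void write(char[] buf, int off, int len) throws IOException {
        for (int i = off; i < off + len; i++) {
            write(buf[i]);
        }
    }

    @Override
    public void write(String s, int off, int len) throws IOException {
        for (int i = off; i < off + len; i++) {
            write(s.charAt(i));
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumed mapping; a real one would enumerate the DTD's entities.
        Map<Character, String> map = new HashMap<>();
        map.put('\u00A0', "nbsp");
        map.put('\u00E9', "eacute");

        StringWriter sink = new StringWriter();
        try (Writer w = new EntityWriter(sink, map)) {
            w.write("<p>caf\u00E9\u00A0au lait</p>");
        }
        System.out.println(sink);
    }
}
```

Because the filter works on individual chars, characters outside the Basic Multilingual Plane (surrogate pairs) would need extra handling.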

Michael Kay
Saxonica
Elliotte Rusty Harold
2011-01-10 00:22:50 UTC
Post by David Collier-Brown
Is there a simple, straightforward and perhaps even elegant way
to produce output containing symbolic entities?
No, there isn't.

Writing an editor is the one major use case that is explicitly outside
the goals of XOM (and implicitly outside the goals of every other
major XML API).

If you're going to write an editor you probably need to start by
writing your own parser and libraries, because the usual ones just
don't support this use case.
--
Elliotte Rusty Harold
elharo at ibiblio.org
David Collier-Brown
2011-01-10 00:36:49 UTC
Post by Elliotte Rusty Harold
Post by David Collier-Brown
Is there a simple, straightforward and perhaps even elegant way
to produce output containing symbolic entities?
No, there isn't.
Writing an editor is the one major use case that is explicitly outside
the goals of XOM (and implicitly outside the goals of every other
major XML API).
If you're going to write an editor you probably need to start by
writing your own parser and libraries, because the usual ones just
don't support this use case.
Ok, thanks: I'm actually producing input data for an editor, but that
looks as if it's effectively the same thing.

--dave
David Collier-Brown
2011-02-06 01:08:58 UTC
This is perhaps as much a tag soup question as an XOM one, but I'm
trying to read a file created in a Windows character set and then copied
to our production Linux system.

Has anyone experience in explicitly setting the input character set so
that the sax parser, in this case tag soup, can interpret it properly?
I've experimented with setting the encoding with an InputSource, but
with no visible effect.

--dave
Peter Taylor
2011-02-06 10:25:08 UTC
The XML declaration should have defined the encoding, but as a workaround try using this constructor:

http://download.oracle.com/javase/6/docs/api/java/io/InputStreamReader.html#InputStreamReader%28java.io.InputStream,%20java.lang.String%29

Use it to adapt the InputStream, with an explicit charset, into a Reader, then pass that Reader to the InputSource.
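
A minimal sketch of that workaround. To stay self-contained it uses the JDK's built-in SAX parser on a tiny well-formed document rather than TagSoup; the assumption is that any SAX XMLReader handed an InputSource carrying a character stream will skip its own byte-level encoding detection. The class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class CharsetRead {
    /** Decodes the bytes with the given charset and returns all character data. */
    static String parseText(InputStream in, String charset) throws Exception {
        // The Reader does the decoding, so the parser never guesses the encoding.
        Reader reader = new InputStreamReader(in, charset);
        InputSource source = new InputSource(reader);

        StringBuilder text = new StringBuilder();
        SAXParserFactory.newInstance().newSAXParser().parse(source, new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // "<p>caf?</p>" where ? is byte 0xE9, which is e-acute in windows-1252.
        byte[] bytes = {'<', 'p', '>', 'c', 'a', 'f', (byte) 0xE9, '<', '/', 'p', '>'};
        System.out.println(parseText(new ByteArrayInputStream(bytes), "windows-1252"));
    }
}
```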


-----Original Message-----
From: xom-interest-bounces at lists.ibiblio.org on behalf of David Collier-Brown
Sent: Sun 06/02/2011 01:08
To: XOM API for Processing XML with Java
Cc: davecb at spamcop.net
Subject: [SPAM(hdr)] - [XOM-interest] Has anyone experience using tag Soup to read Windows files on a Linux system?

David Collier-Brown
2011-02-06 16:06:26 UTC
Thanks for the pointer, that solved the problem.

Nothing solves the underlying one, though, which is that I'm reading
Microsoft-produced dreck with TagSoup, and the encoding is buried
several tags deep in a <meta http-equiv="Content-Type"
content="text/html; charset=windows-1252"/> tag. This breaks most sane
recognition schemes, so I have to specify it in the code (;-))
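
One possible alternative to hard-coding the charset is to sniff it from the meta tag before parsing. This is only a sketch under assumptions: the class name is hypothetical, the caller supplies the first few kilobytes of the file, and the markup up to the meta tag is assumed to be ASCII-compatible, which holds for windows-1252.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaCharsetSniffer {
    // Matches charset=NAME, optionally quoted; NAME must start with a letter or digit.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9][A-Za-z0-9_.:+-]*)",
            Pattern.CASE_INSENSITIVE);

    /** Returns the charset named in the head of the file, or the fallback. */
    static Charset sniff(byte[] head, Charset fallback) {
        // ISO-8859-1 maps every byte to one char, so decoding cannot fail.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        Matcher m = CHARSET.matcher(text);
        if (m.find() && Charset.isSupported(m.group(1))) {
            return Charset.forName(m.group(1));
        }
        return fallback;
    }

    public static void main(String[] args) {
        String head = "<html><head><meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=windows-1252\"/></head>";
        byte[] bytes = head.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(sniff(bytes, StandardCharsets.UTF_8));
    }
}
```

The sniffed Charset can then be passed to an InputStreamReader over a fresh stream on the same file.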

The solution for the real problem is to switch the company to using
InDesign or the like, and UTF-8.

--dave
Tatu Saloranta
2011-01-10 17:48:01 UTC
Post by David Collier-Brown
Post by Elliotte Rusty Harold
If you're going to write an editor you probably need to start by
writing your own parser and libraries, because the usual ones just
don't support this use case.
Ok, thanks: I'm actually producing input data for an editor, but that
looks as if it's effectively the same thing.
I would soften Elliotte's suggestion by noting that while handling of
DTD-provided entities (or access to them) is generally well supported,
some libraries provide more access than others.
So I would suggest trying to use an existing parser (and generator) as
a base. Woodstox, for example, provides more access to "exotic" parts
than many others, especially via the stax2 extension API.
It also reports accurate location information, and is used by tools
like editors (judging by feature requests).

It may be necessary to have a customized version, but writing a parser
is a rather non-trivial task, and part of the reason for the lack of
access to DTD aspects is the complexity of DTD handling itself.
And even if you do need to implement a higher-level abstraction (like
the tree model that XOM et al. provide) you would save quite a bit of
work.
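
On the generator side, a small sketch of writing a symbolic entity reference with the standard StAX API that ships with the JDK (javax.xml.stream). It does not use the Woodstox-specific stax2 extensions mentioned above, and the class name is hypothetical. Note that the writer emits the reference as given; it does not check that the entity is declared, and no DOCTYPE is written here.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

class EntityRefDemo {
    /** Writes a single element whose text contains an entity reference. */
    static String write() throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        w.writeStartElement("p");
        w.writeCharacters("caf");
        w.writeEntityRef("eacute"); // emitted in symbolic form, not as a character
        w.writeEndElement();
        w.flush();
        w.close();
        return out.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(write());
    }
}
```

The caller still has to decide which characters become references, so a tree model built on top of this would need its own character-to-entity mapping.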

-+ Tatu +-