Jason McKendry
2010-09-30 17:51:34 UTC
Elliotte, Michael,
Thank you both so much. I'm as grateful as I am embarrassed! I have never
worked at this level with anything to do with encoding, and I was bleeding
out my ears trying to keep all the new information in my head. It never
even occurred to me to try putting the actual original characters in my XML
and tinker with the encoding declarations wherever they were. Now that
you've spelled it out for me, I suppose it should have been obvious instead
of frustrating when I saw that every "&" I sent through came back as "&amp;"
instead.
This does lead me to one new question for future consideration, though.
Right now, I have this line hard-coded in my .java file:
Serializer serializer = new Serializer(out, "UTF-8");
I also have a .ini file from which the application reads in some settings.
Would it make sense to make encoding available as a setting in the .ini
file? This is also my first .java application, so some finer, or rather
"not totally broad" points are fuzzy for me.
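On the .ini question, here is a minimal sketch of one way it could be done; it is not something from the thread. It assumes the settings file contains a plain `encoding = ...` line that `java.util.Properties` can parse (the key name `encoding` is hypothetical), and it falls back to UTF-8 when the value is missing or names a charset the JVM does not know. The class and method names are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.util.Properties;

final class EncodingSetting {
    // Same value that is currently hard-coded in the Serializer call.
    static final String DEFAULT_ENCODING = "UTF-8";

    /** Parses simple key=value settings text (assumed format, not a full INI parser). */
    static Properties parse(String settingsText) {
        Properties settings = new Properties();
        try {
            settings.load(new StringReader(settingsText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return settings;
    }

    /** Returns the configured encoding name, or UTF-8 if it is missing or unsupported. */
    static String resolveEncoding(Properties settings) {
        // "encoding" is a hypothetical key name chosen for this sketch.
        String name = settings.getProperty("encoding", DEFAULT_ENCODING).trim();
        try {
            if (!name.isEmpty() && Charset.isSupported(name)) {
                return name;
            }
        } catch (IllegalArgumentException e) {
            // Syntactically illegal charset name: fall through to the default.
        }
        return DEFAULT_ENCODING;
    }

    public static void main(String[] args) {
        String encoding = resolveEncoding(parse("encoding = ISO-8859-1\n"));
        System.out.println(encoding);
        // The resolved name would then replace the literal:
        // Serializer serializer = new Serializer(out, encoding);
    }
}
```

XOM's `Serializer(OutputStream, String)` constructor declares `UnsupportedEncodingException`, so that still needs handling even after the JVM-level check, since XOM may not support every charset the JVM does.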
One last question (hopefully for a while), I know you're both very busy
people; any excellent books or sites to read about encoding that would help
someone who hasn't done a lot with it become more proficient? I read this
old article (http://www.joelonsoftware.com/articles/Unicode.html) by Joel
Spolsky, and it was very helpful.
Thank you both for your help, I really appreciate it.
Jason McKendry
namelessOperation();
Message: 5
Date: Thu, 30 Sep 2010 11:18:39 +0100
From: Michael Kay <mike at saxonica.com>
To: xom-interest at lists.ibiblio.org
Subject: Re: [XOM-interest] (no subject)
Message-ID: <4CA463FF.5090301 at saxonica.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
I think you're confused in your requirements.
XOM is a tree model, an abstract view of an XML document. In the tree
view of XML, characters are simply characters (things that correspond
one-to-one with unicode codepoints). The copyright symbol is one
character, one codepoint, and it is represented as such; you aren't
concerned with expansions such as &#169; because those don't exist in
the tree view, they are only devices for serializing the XML within the
constraints of a restricted character repertoire. Creating entity or
character references is something that should only happen when you
serialize from the tree model to lexical XML, you should never attempt
to have such references present in the tree itself.
Michael Kay
Saxonica
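The point about a "restricted character repertoire" can be illustrated with a small sketch. This is not XOM code and not how any particular serializer is implemented; it only shows the idea, using the standard `CharsetEncoder.canEncode` check: a character stays as itself when the target charset can represent it and becomes a numeric character reference only when it cannot. The class and method names are hypothetical, and markup characters such as `&` and `<` are deliberately not handled here.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

final class CharRefSketch {
    /**
     * Replaces each code point the charset cannot encode with a decimal
     * character reference; everything else is copied through unchanged.
     */
    static String escapeForCharset(String text, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            String ch = new String(Character.toChars(codePoint));
            if (encoder.canEncode(ch)) {
                out.append(ch);
            } else {
                out.append("&#").append(codePoint).append(';');
            }
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "\u00A9 2010"; // copyright sign: one character, one code point
        System.out.println(escapeForCharset(text, StandardCharsets.US_ASCII));
        System.out.println(escapeForCharset(text, StandardCharsets.UTF_8));
    }
}
```

In the tree the string holds the copyright sign itself in both cases; only the serialized form differs depending on the output encoding.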
------------------------------
Message: 6
Date: Thu, 30 Sep 2010 07:01:08 -0400
From: Elliotte Rusty Harold <elharo at ibiblio.org>
To: XOM API for Processing XML with Java
<xom-interest at lists.ibiblio.org>
Subject: Re: [XOM-interest] (no subject)
Message-ID: <AANLkTikKtk2h7T8u7wfPkO+-cEiMzskOyQ28+xfPEQkg at mail.gmail.com>
Content-Type: text/plain; charset=UTF-8
On Thu, Sep 30, 2010 at 5:48 AM, Jason McKendry wrote:
> Hello everyone,
> I have built an application on top of XOM, and the last thing I haven't
> been able to iron out over the past few weeks has been a problem with
> character references and escape sequences to represent non-standard
> characters in my XML data files. I found information about how to use
> the xsl:output tag to add a character map, but since the result of each
> transformation is a XOM Nodes object, the xsl:output tag is never
> processed. I had to solve a similar problem using the DocType object,
> but I wasn't having any luck figuring out how to use that knowledge to
> solve this problem.
You should never have to worry about character references. Just use
the characters you want to use like © in your text and let XOM decide
how to encode them on output. Unless by "non-standard characters" you
mean characters that aren't even in Unicode--e.g. Klingon--in which
case there's not a lot XOM can do for you.
--
Elliotte Rusty Harold
elharo at ibiblio.org
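The behaviour described in this thread can be reproduced without XOM. The sketch below swaps in the JDK's built-in DOM and identity `Transformer` as a stand-in for a tree model plus serializer, so it is an analogy rather than XOM's own API; the class name and the `notice` element are made up. It puts text into a tree and serializes it in a chosen encoding. Text containing the literal characters `&#169;` should come back with the ampersand escaped as `&amp;`, whereas the real character should be written directly in UTF-8 and turned into a character reference by the serializer only when the output encoding (here US-ASCII) cannot represent it.

```java
import java.io.ByteArrayOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class TreeVersusSerialization {
    /** Builds a one-element tree holding the text and serializes it in the given encoding. */
    static String serialize(String text, String encoding) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element root = doc.createElement("notice"); // hypothetical element name
            root.setTextContent(text);
            doc.appendChild(root);

            // A Transformer with no stylesheet copies the tree to the output as-is.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.ENCODING, encoding);
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            identity.transform(new DOMSource(doc), new StreamResult(bytes));
            return bytes.toString(encoding);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // A reference typed into the text is just six ordinary characters to the tree.
        System.out.println(serialize("&#169; 2010", "UTF-8"));
        // The real character: the serializer decides how to write it per encoding.
        System.out.println(serialize("\u00A9 2010", "US-ASCII"));
        System.out.println(serialize("\u00A9 2010", "UTF-8"));
    }
}
```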