[XOM-interest] XSLTransform class API

Discussion:

Dmitry Katsubo

2010-09-16 19:01:51 UTC

Dear XOM developers!

I have a few questions concerning nu.xom.xslt.XSLTransform. I would be
very pleased if someone could provide any feedback or opinion. Everything
below is just my opinion, which may differ from the majority's:

* Having the XSLTransform(Source source) constructor private is too
restrictive. One may wish to pass an XSLT template as an InputStream
without pre-building it as a XOM document model. I don't see any good
reason why there is only a nu.xom.Document constructor.

* It looks like the XSLTransform class itself has only one added value:
handling of exceptions. Moreover, in this very case XOM throws a checked
exception, while in the rest of the library unchecked ones are preferred.

* Having XOMResult/XOMSource classes public adds flexibility to XOM, as
one can use them separately from XSLTransform.

* Returning Nodes as the result of a transformation is a bit odd (I took
this extract from the tutorial [1]):

Nodes output = transform.transform(input);
Document result = XSLTransform.toDocument(output);

I fully agree that the result should be a list of nodes, but I think
returning DocumentFragment better matches the result type. Maybe
it wouldn't be simple then...

Thank you for any comments in advance!

[1] http://www.xom.nu/tutorial.xhtml#d0e1875
[2] http://lists.ibiblio.org/pipermail/xom-interest/2005-May/002272.html

--
With best regards,
Dmitry

Elliotte Rusty Harold

2010-09-16 22:24:44 UTC

Post by Dmitry Katsubo
All below is just my opinion, which may differ from majority's:
XOM is designed quite explicitly not to do everything in every
possible way. I find that adding every possible variation of a given
operation simply makes an API too confusing, hard to learn, and hard
to use. New methods are added only if they have a compelling
justification; i.e. if they add functionality that does not already
exist.

Post by Dmitry Katsubo
* Having XSLTransform(Source source) constructor private is too
restrictive. One may wish to pass XSLT template as InputStream without
pre-building it as XOM document model. I don't see any good reason, why
there is only nu.xom.Document constructor.

There might be some justification for doing this, but it's not
self-evident. The existing code does not reparse the document, and
something's going to have to parse it at some point. You'd have to
show there was a significant amount of overhead to building the XSL
document as a XOM tree.
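
For comparison, here is a minimal sketch of what loading a stylesheet directly from a stream looks like in plain JAXP/TrAX, with no XOM involved. It only illustrates the kind of call a Source-based constructor would wrap; the class name, helper names and the sample stylesheet are made up for this example and are not part of XOM.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.StringReader;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import javax.xml.transform.Templates;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class StreamStylesheetDemo {

    // Compiles the stylesheet straight from a stream (the caller builds no
    // intermediate tree) and applies it to the given XML text.
    static String apply(InputStream stylesheet, String inputXml) {
        try {
            Templates templates = TransformerFactory.newInstance()
                    .newTemplates(new StreamSource(stylesheet));
            StringWriter out = new StringWriter();
            templates.newTransformer().transform(
                    new StreamSource(new StringReader(inputXml)),
                    new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    // Hypothetical sample stylesheet: emits the text of the root element.
    static InputStream sampleStylesheet() {
        String xsl = "<xsl:stylesheet version='1.0'"
                + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
                + "<xsl:output method='text'/>"
                + "<xsl:template match='/'>"
                + "<xsl:value-of select='/greeting'/>"
                + "</xsl:template>"
                + "</xsl:stylesheet>";
        return new ByteArrayInputStream(xsl.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // Prints: hello
        System.out.println(apply(sampleStylesheet(), "<greeting>hello</greeting>"));
    }
}
```

Either way the stylesheet is parsed exactly once; the open question in the thread is only whether building the XOM tree first costs anything measurable.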

Post by Dmitry Katsubo
* It looks like XSLTransform class itself has only one added value:
handling of exceptions. Moreover, in this very example XOM is throwing
checked exception while in the rest of the library unchecked are preferred.
* Having XOMResult/XOMSource classes public adds flexibility to XOM, as
one can use them separately from XSLTransform.

No, you can't. The Source and Result interfaces are not properly
designed for reuse. Not a lot I can do about that. For that matter,
neither are XOMResult and XOMSource, but that's a deliberate decision
and why they are private.

Post by Dmitry Katsubo
* Returning Nodes as result of transformation is a bit odd (I took
extract from tutorial [1]):
Nodes output = transform.transform(input);
Document result = XSLTransform.toDocument(output);
I fully agree that result should be a list of nodes, but I think
returning DocumentFragment better matches the return result type. Maybe
it wouldn't be simple then...

There is no DocumentFragment type in XOM. Why would you need it when
you have Nodes?

--
Elliotte Rusty Harold
elharo at ibiblio.org

Dmitry Katsubo

2010-09-17 11:11:22 UTC

Hi Elliotte!

Thank you again for your comments. I do appreciate them a lot.

Post by Elliotte Rusty Harold
XOM is designed quite explicitly not to do everything in every
possible way. I find that adding every possible variation of a given
operation simply makes an API too confusing, hard to learn, and hard
to use. New methods are added only if they have a compelling
justification; i.e. if they add functionality that does not already
exist.

I agree with this statement. I have read your vision concerning the XOM
library, your interview at [2] and also some chapters from your book "XML
in a Nutshell" [1] (BTW, thank you for your work!). However, I would
like to express my opinion, which may change XOM for the better if
you agree with me.

Post by Elliotte Rusty Harold
There might be some justification for doing this, but it's not
self-evident. The existing code does not reparse the document, and
something's going to have to parse it at some point. You'd have to
show there was a significant amount of overhead to building the XSL
document as a XOM tree.

I fully agree with the statement that parsing the XSLT into a XOM tree is
fast and should not be considered a memory or time loss at all. However,
imagine that I receive the stylesheet neither as an InputStream, nor as
a File or String. It is passed to me from some 3rd-party library X as a
javax.xml.stream.XMLEventReader or as an org.xmlpull.v1.XmlPullParser.
Of course, I agree that this is bad design in library X (which should
probably return the commonly used InputStream), but connecting it to the
XOM library becomes troublesome. I would love to see all library APIs in
the world harmonized and following the same ground principles, but
unfortunately the XOM API cannot change the world. So it has two ways:
either to ignore this (and stay as it is) or to adapt.

Post by Elliotte Rusty Harold

Post by Dmitry Katsubo
* Having XOMResult/XOMSource classes public adds flexibility to XOM, as
one can use them separately from XSLTransform.

I read your message here as "(a) TrAX API is bad, that is why we (b)
should not support it and (c) suppress any attempts to add support for
it in XOM". I might agree with (a) and (b), but doing (c) frustrates me.

I personally think that a "good API" is also an "extendable API" (taken
from page 4 of the Google presentation [3]), so if somebody derives a
class from XSLTransform, what is wrong with that? I agree that one can misuse
the parent XSLTransform class, break things and do a lot of harm, but if
I am to choose between "let people extend and re-use the classes and
*probably* make mistakes (and learn from them)" and "do not allow people
extend and re-use the classes (and get a headache)", I would choose the
first one. I have stressed the word *probably* because one needs to be a
real Java hacker to invent something to break XSLTransform class
functionality (maybe via reflection? byte-code injection?
java.lang.instrument.ClassFileTransformer?).

Post by Elliotte Rusty Harold

There is no DocumentFragment type in XOM. Why would you need it when
you have Nodes?

There is a nu.xom.DocumentFragment class, but it is (again) private. I
try to imagine the probable ways to use the result of a transformation.
I think that in most cases applications serialize the result of the
transformation into a String/OutputStream/Writer (and pipe it to
another module, application or DB). So I would expect to simply say:

String xml = transform.transform(input).toXML();
... and send String to next consumer in a pipe ...

So my points here:

1) If I use XSLTransform.toDocument(transform.transform(input)).toXML()
I get an additionally post-transformed tree with the first Element as
root and all other nodes as its children. This post-transformation is
not evident and not natural, but maybe OK for most transformations. I
would like to keep the original document as is after the transformation.
2) So I need to write a loop over all nodes and serialize them. Not a
big deal in general, but if the use case in almost all cases is to loop
over all nodes, isn't that a signal to improve the API?
nu.xom.Serializer also cannot write Nodes...
3) If transform() method returns DocumentFragment, I expect
DocumentFragment.toXML() not to break down in
UnsupportedOperationException, but correctly serialize all nodes in a
loop. Yes, in this case toXML() returns a non-complete (non-valid) XML,
but toXML() is not supposed to produce a valid XML, right? (e.g.
Text.toXML()). So the caller knows, what can be in output.

Also, as we have touched on the question of serialization, why does
nu.xom.Serializer not have a constructor taking a Writer? Internally it
uses a writer. The only added value I see in nu.xom.Serializer is to
protect the user from broken EBCDIC-family output streams. If I
use a broken JDK OutputStream implementation, I won't blame XOM, really.
And if I have only a Writer, I need to think how to convert it to an
OutputStream for XOM, which will convert it back to a Writer :)
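
The workaround alluded to here can be sketched with the JDK alone: buffer the serializer's bytes, decode them, and hand the text to the Writer. The `StreamSerializer` interface below is a hypothetical stand-in for any serializer that only accepts an OutputStream, and the sketch assumes that serializer emits UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class WriterBridge {

    // Hypothetical stand-in for a serializer that only writes to an OutputStream.
    interface StreamSerializer {
        void writeTo(OutputStream out) throws IOException;
    }

    // Buffers the serialized bytes, decodes them as UTF-8, and forwards the
    // resulting text to the target Writer.
    // Assumption: the serializer was configured to emit UTF-8.
    static void copyToWriter(StreamSerializer serializer, Writer target) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            serializer.writeTo(buffer);
            target.write(new String(buffer.toByteArray(), StandardCharsets.UTF_8));
            target.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        StringWriter sink = new StringWriter();
        copyToWriter(
                out -> out.write("<a>\u03B1</a>".getBytes(StandardCharsets.UTF_8)),
                sink);
        System.out.println(sink);
    }
}
```

Any encoding declaration in the serialized text still names UTF-8, whatever the Writer does with the characters downstream.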

XOM provides a nice nu.xom.EBCDICWriter but again, it is not public.
This forces programmers to copy-paste the code rather than re-use it.
Why prevent it from being used outside XOM?

Just for my information: can you please provide a reference to the Sun
JDK bug tracker concerning the EBCDIC problem? I would like to know which
JDK versions it is relevant for. I tried to locate something relevant in
[4], but I failed.

Thank you a lot in advance for any comments.

[1] http://books.google.com/books?id=NBwnSfoCStAC&printsec=frontcover&#v=onepage
[2] http://www.artima.com/intv/xmlapis.html
[3] http://www.slideshare.net/guestbe92f4/how-to-design-a-good-a-p-i-and-why-it-matters-g-o-o-g-l-e
[4] http://bugs.sun.com/bugdatabase/

--
With best regards,
Dmitry

Elliotte Rusty Harold

2010-09-17 12:20:37 UTC

Post by Dmitry Katsubo
I fully agree on the statement that parsing the XSLT into XOM tree is
fast and should not be considered as memory or time loss at all. However
we can imagine, that I cannot receive the stylesheet neither as
InputStream, File or String. It is passed to me from some other 3rd
party library X as javax.xml.stream.XMLEventReader or as
org.xmlpull.v1.XmlPullParser.

Supporting hypothetical use cases leads to bloated APIs. If there's a
compelling need for XMLEventReader or XmlPullParser I'll look into
it. So far I've never seen such a thing. Even you are not really
saying you need it, just that you think it should be there.

Post by Dmitry Katsubo
I read your message here as "(a) TrAX API is bad, that is why we (b)
should not support it and (c) suppress any attempts to add support for
it in XOM". I might agree with (a) and (b), but doing (c) frustrates me.

No, the message is that the TrAX API is so bad that XOM CANNOT support
it. It is simply not possible even if I wanted to do this. TrAX does
not provide a usable abstraction of sources and results. If the TrAX
API actually allowed interoperability between representations I would
have supported it years ago. It doesn't.
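
One way to read this objection (an interpretation, not something spelled out in the thread) is that `javax.xml.transform.Source` carries little beyond a system ID, so a consumer has to recognise concrete implementation classes to reach any XML at all. The sketch below uses only JDK classes; `SourceKinds` and its methods are made-up names for illustration.

```java
import javax.xml.transform.Source;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamSource;

class SourceKinds {

    // The Source interface offers no generic way to read the XML it stands
    // for, so a consumer ends up testing for the classes it knows about.
    static String describe(Source source) {
        if (source instanceof DOMSource) {
            return "DOM tree";
        }
        if (source instanceof SAXSource) {
            return "SAX event stream";
        }
        if (source instanceof StreamSource) {
            return "byte or character stream";
        }
        return "unrecognised: only the system ID is available";
    }

    // Hypothetical custom Source, e.g. one that would wrap another object model.
    static Source customSource() {
        return new Source() {
            private String systemId;

            @Override
            public void setSystemId(String id) {
                systemId = id;
            }

            @Override
            public String getSystemId() {
                return systemId;
            }
        };
    }

    public static void main(String[] args) {
        System.out.println(describe(new DOMSource()));
        System.out.println(describe(new StreamSource()));
        System.out.println(describe(customSource()));
    }
}
```

A transformer that has never heard of the custom class is in the same position as `describe`: it can ask for the system ID and nothing else.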

Post by Dmitry Katsubo
I personally think that "good API" also is "extendable API" (taken from
page 4 in Google presentation [3]), so if somebody inherits its class
from XSLTransform what is wrong with that? I agree that one can misuse
the parent XSLTransform class, break things and do a lot of harm, but if
I am to choose between "let people extend and re-use the classes and
*probably* make mistakes (and learn from them)" and "do not allow people
extend and re-use the classes (and get a headache)", I would choose the
first one. I have stressed the word *probably* because one needs to be a
real Java hacker to invent something to break XSLTransform class
functionality (maybe via reflection? byte-code injection?
java.lang.instrument.ClassFileTransformer?).

Extensible is hard. One of the hard parts is making something
extensible without making it fragile. XOM goes to a lot of trouble in
this area.

Post by Dmitry Katsubo
There is nu.xom.DocumentFragment class, but it is (again) private. I try
to imagine what are the probable ways to use the result of
transformation. I think, in most cases applications serialize the result
of transformation into String/OutputStream/Writer (and pipe it to
another module or application or DB). So I would expect to simply say:
String xml = transform.transform(input).toXML();
... and send String to next consumer in a pipe ...

I've thought about adding a toXML method to Nodes. That might be useful.

Post by Dmitry Katsubo
1) If I use XSLTransform.toDocument(transform.transform(input)).toXML()
I get additionally post-transformed tree with first Element in root and
all other nodes as its children.
This post-transformation is not evident
and not natural but maybe OK for most transformations.

I beg to differ. No way is XOM going to move nodes around into other
nodes behind the user's back. If the user wants that (and I can't see
why they would) they'll have to write code to do it.

Post by Dmitry Katsubo
I would like to
keep the original document as is after the transformation.

That's as it is now.

Post by Dmitry Katsubo
2) So I need to write a loop over all nodes and serialize them. Not a
big deal in general, but if API use case for almost all case is to loop
over all nodes, isn't it a signal to improve it? nu.xom.Serializer also
cannot write Nodes...

Java is not a list-based language. The use of loops is deliberate in
the design of both Java and XOM.

Post by Dmitry Katsubo
3) If transform() method returns DocumentFragment, I expect
DocumentFragment.toXML() not to break down in
UnsupportedOperationException, but correctly serialize all nodes in a
loop. Yes, in this case toXML() returns a non-complete (non-valid) XML,
but toXML() is not supposed to produce a valid XML, right? (e.g.
Text.toXML()). So the caller knows, what can be in output.

This is why DocumentFragment is not public. :-)

Post by Dmitry Katsubo
Also as we have touched the question of serialization, why
nu.xom.Serializer does not have a constructor with Writer? Internally it
uses a writer. The only added value I see in nu.xom.Serializer is to
protect the user from using broken EBCDIC-family output streams. If I
use broken JDK OutputStream implementation, I won't blame XOM, really.
And if I have only a Writer, I need to think how to convert it to
OutputStream for XOM, who will convert it to Writer :)

This is painful but ties directly to another design flaw in Java, not
XOM. Writers do not allow the client to determine the underlying
encoding of the text (UTF-8, ISO-8859-1, etc.). Therefore with only a
Writer I can't guarantee well-formed output. To answer your question
in another thread this is one (of several) ways you can get malformed
output from JDOM.
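
A small JDK-only sketch of the limitation described here: the `Writer` base class declares no way to ask for an encoding, `OutputStreamWriter` does, and the information is hidden again as soon as the writer is wrapped. `WriterEncodingProbe` and `encodingOf` are illustrative names, not part of XOM.

```java
import java.io.BufferedWriter;
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.StringWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

class WriterEncodingProbe {

    // Best-effort lookup: only OutputStreamWriter (and subclasses such as
    // FileWriter) report an encoding; Writer itself has no such method.
    static Optional<String> encodingOf(Writer writer) {
        if (writer instanceof OutputStreamWriter) {
            return Optional.ofNullable(((OutputStreamWriter) writer).getEncoding());
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Writer direct = new OutputStreamWriter(
                new ByteArrayOutputStream(), StandardCharsets.ISO_8859_1);
        Writer wrapped = new BufferedWriter(direct);

        System.out.println(encodingOf(direct).isPresent());             // true
        System.out.println(encodingOf(wrapped).isPresent());            // false
        System.out.println(encodingOf(new StringWriter()).isPresent()); // false
    }
}
```

A library that accepts an arbitrary Writer therefore cannot rely on such a probe; at best it works for the unwrapped case.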

Post by Dmitry Katsubo
XOM provides a nice nu.xom.EBCDICWriter but again, it is not public.
This forces programmers to copy-paste the code, not to re-use. Why
preventing it from being used even outside XOM?

Good sense. No one needs to, or should be, writing EBCDIC XML files in 2010.

--
Elliotte Rusty Harold
elharo at ibiblio.org

Dmitry Katsubo

2010-09-29 11:53:19 UTC

Dear Elliotte,

Thank you for your comments.

Post by Elliotte Rusty Harold

I suggest waiting for a few more requests from several users, and then
making an appropriate extension to the API. No pressure from my side.
Well, the drawback of extending the API *on request* is that some
developers will not come to this mailing list to discuss probable
improvements, because they are too lazy. Instead they will bloat the
application code that uses XOM with workarounds. So if you look at the
final application as a whole, you will notice a bright and shining XOM
library in the center and black patches here and there where it is glued
in. So the overall picture is gray. And knowing that the majority of
programmers will implement something "just to make it work" and never
come back for improvements, you will never know about the traps in using
XOM. It is just my feeling :)

Post by Elliotte Rusty Harold

OK, I have looked at the source code again. I see that the text escapers,
which define the rules per charset, are also an added value.

Sorry, I don't understand the root of the problem here. If I pass a
Writer to XOM, it is true that you cannot determine the output encoding:
it needs to be supplied as a 2nd argument, the same way as for
OutputStream. But XOM needs to feed the Writer with Unicode strings
(restricted to what the given encoding allows) and it is up to the Writer
(or String) to convert the characters to the output encoding later on.
There is a kind of danger here, but I don't think it is a showstopper.

I attach a test case that serializes two XML models via XOM and DOMv3
with the same result. Please point me to the exact line where I am wrong,
or which has a pitfall that most programmers will hit.

Thank you!

--
With best regards,
Dmitry
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: SerializationTest.java
Url: http://lists.ibiblio.org/pipermail/xom-interest/attachments/20100929/4c9605e8/attachment.pl

Elliotte Rusty Harold

2010-09-29 13:00:57 UTC

Post by Dmitry Katsubo
I attach a test case, that serializes two XML models via XOM and DOMv3
with same result. Please, point me exactly the line I am wrong or which
has a pitfall that most of programmers will hit.

Your code uses a StringWriter. Problems arise with something like this:

Writer out = new OutputStreamWriter(new ByteArrayOutputStream(), "ISO-8859-1");

or even just, depending on platform:

Writer out = new OutputStreamWriter(new ByteArrayOutputStream());

The problem is I have no good way to tell which characters do and do
not need to be escaped given the Writer. A StringWriter can write
anything, as can a CharArrayWriter, but most other writers can't.

--
Elliotte Rusty Harold
elharo at ibiblio.org

Dmitry Katsubo

2010-09-29 18:15:42 UTC

Dear Elliotte,

Thanks for great remarks!

Post by Elliotte Rusty Harold

Writer out = new OutputStreamWriter(new ByteArrayOutputStream(), "ISO-8859-1");

This case worked fine for me. If you specify the encoding for output
stream, it works OK. Diff is attached.

Post by Elliotte Rusty Harold
Writer out = new OutputStreamWriter(new ByteArrayOutputStream());

I agree. This may trigger problems. I would rather say it is not a
problem of the library, but a misuse of Writer. But why would I need to
wrap an OutputStream into a Writer if I can pass the OutputStream
directly to the XOM API? And on the other hand, if I write code like you
suggest, I know what I am doing, so XOM is not to blame for what happens
to the Writer after the serialization is done (whether the stream goes
to a byte array or string or pipe or ...).

Post by Elliotte Rusty Harold
The problem is I have no good way to tell which characters do and do
not need to be escaped given the Writer. A StringWriter can write
anything, as can a CharArrayWriter, but most other writers can't.

I think the same characters should be escaped as for OutputStream (no
difference). Unescaped characters should be left in UTF-8.

--
With best regards,
Dmitry
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: SerializationTest.java.diff
Url: http://lists.ibiblio.org/pipermail/xom-interest/attachments/20100929/6b2dc4eb/attachment.pl

Elliotte Rusty Harold

2010-09-30 11:07:58 UTC

Post by Dmitry Katsubo
Dear Elliotte,

Post by Elliotte Rusty Harold
Writer out = new OutputStreamWriter(new ByteArrayOutputStream(), "ISO-8859-1");

This case worked fine for me. If you specify the encoding for output
stream, it works OK. Diff is attached.

The problem comes when you attempt to write a character such as the
Greek letter alpha onto this writer. That can't be represented in
8859-1 but XOM has no way of knowing that.
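
The failure mode is easy to reproduce with the JDK alone. In this minimal sketch (class and method names are made up for the example) no exception is thrown: the writer's encoder silently substitutes a question mark for the alpha.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class LossyWriterDemo {

    // Writes the text through an ISO-8859-1 writer and returns the bytes
    // that actually reached the underlying stream.
    static byte[] writeLatin1(String text) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (Writer out = new OutputStreamWriter(bytes, StandardCharsets.ISO_8859_1)) {
                out.write(text);
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] result = writeLatin1("<a>\u03B1</a>"); // element content is Greek alpha
        // Prints: <a>?</a>
        System.out.println(new String(result, StandardCharsets.ISO_8859_1));
    }
}
```

The caller sees a successful write, and the original character cannot be recovered from the output.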

I suppose I could just escape all non-ASCII characters when writing to
a Writer, but that's really ugly and even that doesn't work if the tag
names contain non-ASCII.

Post by Dmitry Katsubo
I think, the same characters should be escaped as for OutputStream (no
difference). Unescaped characters should be left in UTF8.

You can't leave characters in UTF-8 if the writer doesn't write UTF-8.
With an output stream, XOM knows which characters it must escape and
which ones it doesn't have to escape. With a writer XOM doesn't know
that.
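
To illustrate why knowing the encoding matters, here is a minimal sketch of charset-aware escaping built on `java.nio.charset.CharsetEncoder`. It is not XOM's actual implementation; `CharsetAwareEscaper` and `escapeText` are hypothetical names, and the sketch handles text content only.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

class CharsetAwareEscaper {

    // Escapes text content for a known target charset: markup characters
    // become entity references, and any character the charset cannot encode
    // becomes a numeric character reference.
    static String escapeText(String text, Charset target) {
        CharsetEncoder encoder = target.newEncoder();
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            String chunk = new String(Character.toChars(codePoint));
            if (codePoint == '<') {
                out.append("&lt;");
            } else if (codePoint == '>') {
                out.append("&gt;");
            } else if (codePoint == '&') {
                out.append("&amp;");
            } else if (encoder.canEncode(chunk)) {
                out.append(chunk);
            } else {
                out.append("&#x")
                   .append(Integer.toHexString(codePoint).toUpperCase())
                   .append(';');
            }
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "\u03B1 & \u00E9"; // Greek alpha, ampersand, e with acute
        // ISO-8859-1 cannot encode alpha, so it is escaped: &#x3B1; &amp; é
        System.out.println(escapeText(text, StandardCharsets.ISO_8859_1));
        // UTF-8 can encode everything, so only the ampersand is escaped
        System.out.println(escapeText(text, StandardCharsets.UTF_8));
    }
}
```

The decision depends entirely on the `target` argument, which is exactly the piece of information a bare Writer does not supply. Character references are also only legal in text and attribute values, so this approach cannot rescue element names.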

--
Elliotte Rusty Harold
elharo at ibiblio.org