Discussion:
[XOM-interest] StreamingPathFilter trims space-only nodes
Tóth Tamás
2012-07-10 09:12:37 UTC
Hi,

I use the XOM library to parse and process .docx documents. MS Word
stores text content in runs (<w:r>) inside paragraph elements (<w:p>),
and often splits the text across several runs. Sometimes every word and
every space between words is in a separate run. When I load a run
containing only a single space, the parser removes that space and
treats the run as an empty element (i.e. Element.getValue() returns a
null or empty string), so the output contains the text without spaces.
Why does this happen? How can I force the parser to keep all the spaces?

This is how I call the parser:

StreamingPathFilter filter = new StreamingPathFilter("/w:document/w:body/*:*", prefixes);
Builder builder = new Builder(filter.createNodeFactory(null, contentTransform));
builder.build(documentFile);
...

StreamingTransform contentTransform = new StreamingTransform() {

    @Override
    public Nodes transform(nu.xom.Element node) {
        <...process XML and output text...>
    }
};

Thank you for your help!

Tamas
Elliotte Rusty Harold
2012-07-10 10:59:49 UTC
StreamingPathFilter is not part of XOM. I suggest you don't use it,
and then see where you are.
_______________________________________________
XOM-interest mailing list
XOM-interest at lists.ibiblio.org
http://lists.ibiblio.org/mailman/listinfo/xom-interest
--
Elliotte Rusty Harold
elharo at ibiblio.org
Tóth Tamás
2012-07-10 12:00:47 UTC
Hi,

Many thanks for the hint! Indeed, I just realized that
StreamingPathFilter belongs to the nux.xom package, not nu.xom.
I changed my code to use the default Builder constructor and moved
the methods from the StreamingTransform class into the main class.
That seems to work: the missing spaces now appear in the output.

Regards,
Tamas
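
For anyone who lands on this thread with the same symptom: one way to confirm that the space-only runs are really in the input and survive an ordinary, non-streaming parse is to concatenate the <w:t> contents with a plain parser. The sketch below is only an illustration, not code from the thread. It deliberately uses the JDK's built-in DOM parser instead of XOM or Nux so that it runs without extra jars, and the class name RunText, the method name text, and the sample paragraph are all made up for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of XOM or Nux.
class RunText {
    // WordprocessingML main namespace, normally bound to the w: prefix.
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Concatenates the content of every w:t element in document order. */
    static String text(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for getElementsByTagNameNS
            DocumentBuilder parser = factory.newDocumentBuilder();
            Document doc = parser.parse(new InputSource(new StringReader(xml)));

            NodeList texts = doc.getElementsByTagNameNS(W_NS, "t");
            StringBuilder out = new StringBuilder();
            for (int i = 0; i < texts.getLength(); i++) {
                // getTextContent() returns whitespace-only text unchanged.
                out.append(texts.item(i).getTextContent());
            }
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Made-up paragraph: two words with the space in its own run.
        String xml = "<w:p xmlns:w=\"" + W_NS + "\">"
            + "<w:r><w:t>Hello</w:t></w:r>"
            + "<w:r><w:t xml:space=\"preserve\"> </w:t></w:r>"
            + "<w:r><w:t>world</w:t></w:r>"
            + "</w:p>";
        System.out.println("[" + text(xml) + "]"); // prints "[Hello world]"
    }
}
```

If the space shows up here but disappears in your own pipeline, the whitespace is being dropped by a layer on top of the parser rather than being absent from the file.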