Tóth Tamás
2012-07-10 09:12:37 UTC
Hi,
I use the XOM library to parse and process .docx documents. MS Word
stores text content in runs (<w:r>) inside the paragraph tags (<w:p>),
and often breaks the text into several runs. Sometimes every word and
every space between them is in a separate run. When I try to load a run
containing only a single space, the parser removes that space and
treats the element as empty (i.e. Element.getValue() returns null or an
empty string). As a result, the output contains the text without spaces.
Why does this happen? How can I force the parser to keep all the spaces?
This is how I call the parser:
    StreamingPathFilter filter = new StreamingPathFilter("/w:document/w:body/*:*", prefixes);
    Builder builder = new Builder(filter.createNodeFactory(null, contentTransform));
    builder.build(documentFile);
    ...
    StreamingTransform contentTransform = new StreamingTransform() {
        @Override
        public Nodes transform(nu.xom.Element node) {
            <...process XML and output text...>
        }
    };
Thank you for your help!
Tamas