[XOM-interest] Performance problem when processing large attribute values

Discussion:

Peter Murray-Rust

2012-05-31 12:28:38 UTC

I am reading SVG files with XOM, some of which have very long strings (e.g.
4 MB) for attribute values. For example, images (bitmaps) are encoded as
attribute values, such as

<image x="0" y="0" transform="matrix(0.144,0,0,0.1439,251.521,271.844)"
clip-path="url(#clipPath2)" width="1797"
xlink:href="data:image/png;
base64,iVBORw0KGgoAAAANSUhEUgAABwUAAAV4CAMAAAB2DvLsAAADAFBM...
...JRU5ErkJggg==" height="1400" preserveAspectRatio="none"
stroke-width="0" xmlns:xlink="http://www.w3.org/1999/xlink"/>

My code is

Document doc = new Builder().build(file);

For a file with one attribute value of 3.9 MB, parsing takes 9 seconds,
while if the same string is PCDATA content it takes 0.1 seconds.

Is this expected, and is there anything I can do to improve the parsing
performance? I don't actually want the value - I simply want to read it in
and throw it away.
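
The comparison above can be reproduced without XOM, which helps show whether the difference already appears at the SAX level. The following is a minimal sketch, assuming the JDK's default JAXP SAX parser; the class name AttributeTiming, the element and attribute names, and the synthetic all-'A' payload are illustrative assumptions, not taken from the original SVG files.

```java
import java.io.StringReader;
import java.util.Arrays;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class AttributeTiming {

    /** Counts the characters reported as attribute values and as text content. */
    static final class CountingHandler extends DefaultHandler {
        long attributeChars = 0;
        long textChars = 0;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            for (int i = 0; i < atts.getLength(); i++) {
                attributeChars += atts.getValue(i).length();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Text may arrive in several chunks; only the total is recorded.
            textChars += length;
        }
    }

    /** Builds a synthetic payload of n identical characters (hypothetical test data). */
    static String payload(int n) {
        char[] chars = new char[n];
        Arrays.fill(chars, 'A');
        return new String(chars);
    }

    static String asAttribute(int n) {
        return "<image href=\"" + payload(n) + "\"/>";
    }

    static String asText(int n) {
        return "<image>" + payload(n) + "</image>";
    }

    /** Parses the document with the JDK default SAX parser and returns the counts. */
    static CountingHandler parse(String xml) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            CountingHandler handler = new CountingHandler();
            parser.parse(new InputSource(new StringReader(xml)), handler);
            return handler;
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    /** Returns the wall-clock parse time in milliseconds. */
    static long timeMillis(String xml) {
        long start = System.nanoTime();
        parse(xml);
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        int n = 4_000_000; // roughly the attribute size mentioned above
        System.out.println("attribute: " + timeMillis(asAttribute(n)) + " ms");
        System.out.println("text:      " + timeMillis(asText(n)) + " ms");
    }
}
```

The absolute numbers depend on the JDK version and on JIT warm-up, so it is worth running each case several times before comparing them.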

--
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069

Elliotte Rusty Harold

2012-05-31 12:43:12 UTC

Post by Peter Murray-Rust
I am reading SVG files with XOM, some of which have very long strings (e.g.
4 MB) for attribute values. For example, images (bitmaps) are encoded as
attribute values, such as
<image x="0" y="0" transform="matrix(0.144,0,0,0.1439,251.521,271.844)"
clip-path="url(#clipPath2)" width="1797"
xlink:href="data:image/png;
base64,iVBORw0KGgoAAAANSUhEUgAABwUAAAV4CAMAAAB2DvLsAAADAFBM...
...JRU5ErkJggg==" height="1400" preserveAspectRatio="none"
stroke-width="0" xmlns:xlink="http://www.w3.org/1999/xlink"/>
My code is
Document doc = new Builder().build(file);
For a file with one attribute value of 3.9 MB, parsing takes 9 seconds,
while if the same string is PCDATA content it takes 0.1 seconds.
Is this expected, and is there anything I can do to improve the parsing
performance? I don't actually want the value - I simply want to read it in
and throw it away.

Interesting. I'll have to look at that. Off the top of my head I
expect the problem is in the underlying parser. XOM just gets these
from the parser as a Java String and treats them as such. You might
try switching parsers, but that's just a guess.
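
One way to try a different parser is to create the SAX XMLReader explicitly instead of relying on the default lookup. The following is a minimal sketch using only JAXP; the class name ParserChoice and the helper readerFor are hypothetical. With XOM on the classpath, the resulting reader can be handed to XOM through the Builder(XMLReader) constructor, as noted in the comments.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.XMLReader;

public class ParserChoice {

    /**
     * Returns an XMLReader from the JAXP default factory when factoryClassName
     * is null, otherwise from the named SAXParserFactory implementation, which
     * must be on the classpath.
     */
    static XMLReader readerFor(String factoryClassName) {
        try {
            SAXParserFactory factory = (factoryClassName == null)
                    ? SAXParserFactory.newInstance()
                    : SAXParserFactory.newInstance(
                            factoryClassName, ParserChoice.class.getClassLoader());
            factory.setNamespaceAware(true);
            return factory.newSAXParser().getXMLReader();
        } catch (Exception e) {
            throw new IllegalStateException("could not create XMLReader", e);
        }
    }

    public static void main(String[] args) {
        // Pass a factory class name as the first argument to pick another parser,
        // e.g. org.apache.xerces.jaxp.SAXParserFactoryImpl if xercesImpl.jar is present.
        XMLReader reader = readerFor(args.length > 0 ? args[0] : null);
        System.out.println("Using parser: " + reader.getClass().getName());

        // With XOM available, the reader would then be used like this:
        //   Document doc = new nu.xom.Builder(reader).build(file);
    }
}
```

Printing the reader's class name confirms which implementation is actually in use before any timings are compared.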

--
Elliotte Rusty Harold
elharo at ibiblio.org

Elliotte Rusty Harold

2012-05-31 12:48:13 UTC

I bet if you profile this you'll find that the parser is using a
StringBuilder (or, worse yet, a StringBuffer) to build up the
attribute. And that it's spending a lot of time resizing the buffer
since it probably starts with a pretty low capacity like 16.

By contrast, when working with PCDATA the parser is using an array
that it doesn't have to resize because it can pass it to XOM in
chunks.
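
The resizing described above can be observed directly on a StringBuilder. The following is a minimal sketch that counts how often the capacity changes while a long value is appended in chunks; the class name BufferGrowth, the 1024-character chunk size and the 4,000,000-character total are illustrative assumptions. It only illustrates StringBuilder behaviour and does not show what any particular parser does internally.

```java
import java.util.Arrays;

public class BufferGrowth {

    /**
     * Appends `total` characters to `sb` in pieces of at most `chunk`
     * characters and returns how many times the capacity changed.
     */
    static int countResizes(StringBuilder sb, int total, int chunk) {
        char[] piece = new char[chunk];
        Arrays.fill(piece, 'A');
        int resizes = 0;
        int capacity = sb.capacity();
        for (int written = 0; written < total; written += chunk) {
            sb.append(piece, 0, Math.min(chunk, total - written));
            if (sb.capacity() != capacity) {
                // The builder had to allocate a larger array and copy its contents.
                resizes++;
                capacity = sb.capacity();
            }
        }
        return resizes;
    }

    public static void main(String[] args) {
        int total = 4_000_000;
        int chunk = 1024;
        // Default constructor: initial capacity of 16 characters.
        System.out.println("default capacity: "
                + countResizes(new StringBuilder(), total, chunk) + " resizes");
        // Pre-sized builder: capacity already covers the whole value.
        System.out.println("pre-sized:        "
                + countResizes(new StringBuilder(total), total, chunk) + " resizes");
    }
}
```

Comparing the two counts with a profiler's output indicates how much of the parse time buffer growth could account for.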

--
Elliotte Rusty Harold
elharo at ibiblio.org

Peter Murray-Rust

2012-05-31 13:03:04 UTC

On Thu, May 31, 2012 at 1:48 PM, Elliotte Rusty Harold wrote:

Post by Elliotte Rusty Harold
I bet if you profile this you'll find that the parser is using a
StringBuilder (or, worse yet, a StringBuffer) to build up the
attribute. And that it's spending a lot of time resizing the buffer
since it probably starts with a pretty low capacity like 16.

I'm using the default parsers in Java 1.7 and XOM 1.1.

Post by Elliotte Rusty Harold
By contrast, when working with PCDATA the parser is using an array
that it doesn't have to resize because it can pass it to XOM in
chunks.

Understood.

The file is
https://bitbucket.org/petermr/svg2semantic/src/eb9d5dd237e5/src/test/resources/org/xmlcml/graphics/pdf2svg/pages/page5.svg

P.

I wonder if it's changed between Java 1.6 and 1.7?

_______________________________________________
XOM-interest mailing list
XOM-interest at lists.ibiblio.org
http://lists.ibiblio.org/mailman/listinfo/xom-interest

--
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069