[XOM-interest] Parsing large element content in streaming mode

Discussion:

Arshad Noor

2010-11-22 06:50:29 UTC

Hi,

I'm new to XOM and am experimenting using it to generate and parse
XMLEncryption (XENC) documents (http://www.w3.org/TR/xmlenc-core/).

While the XENC tree is fairly shallow, one particular element - the
CipherValue element - presents some challenges:

1) The first 12 or 24 bytes (depending on the encryption algorithm)
of the CipherValue content are only Base64-encoded;
2) The remainder of the CipherValue content, concatenated to the first
12 or 24 bytes from step #1, is encrypted *and* Base64-encoded;
3) 99.99% of the document's content is *inside* this single CipherValue
element.
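
The layout in points 1 and 2 can be read in more than one way. The Java sketch below assumes one reading: the first 12 or 24 characters form a self-contained Base64 block (12 characters with one padding character decode to 8 bytes, 24 with two decode to 16), and everything after them is a separate Base64 run carrying the encrypted bytes. The class and method names are hypothetical, decryption is left out, and only JDK classes are used (java.util.Base64), not XOM.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Base64;

// Hypothetical helper, not part of XOM or of any XENC library.
final class CipherValueSplitter {

    // Reads exactly buf.length bytes or throws EOFException.
    private static void readFully(InputStream in, byte[] buf) throws IOException {
        int off = 0;
        while (off < buf.length) {
            int n = in.read(buf, off, buf.length - off);
            if (n < 0) {
                throw new EOFException("stream ended after " + off + " bytes");
            }
            off += n;
        }
    }

    /**
     * Assumption: the first prefixChars characters of content are a complete
     * Base64 block, and the rest is a second, independent Base64 run.
     * Returns the decoded prefix and copies the decoded remainder to rest,
     * at most chunkSize bytes at a time, so memory use stays bounded.
     */
    static byte[] splitPrefix(InputStream content, int prefixChars,
                              OutputStream rest, int chunkSize) throws IOException {
        byte[] prefixText = new byte[prefixChars];
        readFully(content, prefixText);
        byte[] prefix = Base64.getDecoder().decode(prefixText);

        // The MIME decoder skips line breaks inside the Base64 text.
        InputStream decoded = Base64.getMimeDecoder().wrap(content);
        byte[] chunk = new byte[chunkSize];
        int n;
        while ((n = decoded.read(chunk)) != -1) {
            rest.write(chunk, 0, n);
        }
        return prefix;
    }
}
```

A caller would pass the raw character content of CipherValue as an InputStream and, in place of the plain OutputStream, something that feeds each chunk to the decryption step.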

While DOM/XOM/Stax work fine for reading and writing documents of a
reasonable size (100MB), some documents are expected to be multiple
gigabytes in size. Trying to create a document tree in memory with
such a large file, predictably, crashes the JVM.

Based on sample XOM code, I have extended the NodeFactory and am
trying to process the CipherValue content using the startMakingElement()
method. However, not having the position of the underlying stream
in this method (or the Builder/NodeFactory) makes it difficult to
process the CipherValue element's content.

Given my inexperience with XOM, I may be missing a solution obvious
to the experts; are there any suggestions on how I can solve this
problem? Thanks.

Arshad Noor
StrongAuth, Inc.

Elliotte Rusty Harold

2010-11-22 10:18:00 UTC

Doesn't sound like you're missing anything. I suspect your use case is
just extreme enough that you may be hitting some genuine limitations.

XML encryption is something I've thought about exploring for XOM. Can
you describe in a little more detail what methods you'd need to make
this work? I.e. if there's one method or two extra methods on Node or
Element or NodeFactory or some such that would make your job easier
then maybe I can add them. Though if you really need the possibility
to put gigabytes of text in one text node, that's going to require
somewhat more major surgery. Perhaps not impossible but not trivial.

--
Elliotte Rusty Harold
elharo at ibiblio.org

Arshad Noor

2010-11-22 22:52:14 UTC

I appreciate the prompt response, Elliotte.

While I should think about this in a little more detail, off the top of
my head, I think what will help is a subclass of Node called ByteStream
(or TextStream) with the following properties:

1) The class will be used to read/write content that is too large to
fit into main memory;

2) The class will set/get values using byte-arrays. (Since encryption
only deals with byte-arrays, text must always be converted to bytes.
However, I can see the benefit of dealing with String-streams too
for non-crypto software);

3) The class will allow the programmer to set parameters like the size
of buffer to return when getting the Value-content, the estimated
amount of data remaining within the Node's value-content (based on
the original file-size and what has been read so far), etc.

Example:

    int    getValue(byte[] buffer, int offset, int length);
    int    getNextValue(byte[] buffer, int offset, int length);
    byte[] getValue(int length);
    byte[] getNextValue(int length);
    int    getRemainder();
    void   setBufferSize(int length);

Programmer provides the "byte[] buf" in the first two calls; XOM
creates and returns it in the next two; offset is where the parser
starts storing (length) data in the buffer.

4) In the NodeFactory, a makeByteStream() or makeTextStream() method
(or both) will allow software to use that method repeatedly to
process the voluminous data. Or, maybe a single makeStream() call
must, in turn, make repeated getValue() calls (as shown in #3) until
the element's content is completely processed.
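
One possible reading of the proposal in #3 and #4, sketched in Java. Nothing here exists in XOM: the ByteStreamNode interface, its stand-in implementation, and the consumer class are all hypothetical, and the convention that getNextValue() returns -1 once the content is exhausted is an assumption borrowed from java.io.InputStream.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

/** Hypothetical interface; it only mirrors the getNextValue(byte[], int, int) idea. */
interface ByteStreamNode {
    /**
     * Stores up to length bytes starting at buffer[offset].
     * Returns the number of bytes stored, or -1 when no content remains
     * (assumed convention, not stated in the proposal).
     */
    int getNextValue(byte[] buffer, int offset, int length) throws IOException;
}

/** Stand-in backed by any InputStream, for illustration only. */
final class InputStreamByteStreamNode implements ByteStreamNode {
    private final InputStream in;

    InputStreamByteStreamNode(InputStream in) {
        this.in = in;
    }

    @Override
    public int getNextValue(byte[] buffer, int offset, int length) throws IOException {
        return in.read(buffer, offset, length);
    }
}

final class ByteStreamConsumer {
    /** Pulls the node's content through one caller-owned buffer and returns the byte count. */
    static long drain(ByteStreamNode node, OutputStream sink, int bufferSize) throws IOException {
        byte[] buffer = new byte[bufferSize];
        long total = 0;
        int n;
        while ((n = node.getNextValue(buffer, 0, buffer.length)) != -1) {
            sink.write(buffer, 0, n);
            total += n;
        }
        return total;
    }
}
```

The caller owns the single buffer, so memory use is set by bufferSize rather than by the size of the element's content. The total is kept in a long because a 5GB payload does not fit in an int.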

I'm not sure this fits into your design requirements; if it does, I'm
more than happy to help test the new code to validate that it works (our
test-case is a 5GB file). If this is too much to do in the short
term, here's what will help at a minimum:

- The ability to read the underlying stream, at the start of the
element's content, in a repeated loop till the end-tag. The only
setting I would like to specify in such a method is the number of
bytes to return in each iteration. With that, I should be able to
do what I need.
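
For comparison, a loop of this shape can be approximated outside XOM with the JDK's StAX API (javax.xml.stream), which is a different technique from the NodeFactory approach discussed above. The sketch assumes a non-coalescing parser, which may report one long run of text as several CHARACTERS events. The parser, not the caller, chooses the size of each piece, and whether memory stays bounded for multi-gigabyte content depends on the parser implementation and should be measured. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.Writer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper built on StAX; not part of XOM.
final class ElementContentStreamer {

    /**
     * Copies the character content of every element whose local name is
     * localName to sink, one parser-delivered piece at a time.
     * Returns the number of characters copied. Assumes the target element
     * has no child elements with the same name.
     */
    static long streamContent(InputStream xml, String localName, Writer sink)
            throws XMLStreamException, IOException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Do not merge adjacent text events into one large string.
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.FALSE);
        // DTDs are not needed here; disabling them avoids entity expansion.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);

        XMLStreamReader reader = factory.createXMLStreamReader(xml);
        long total = 0;
        boolean inside = false;
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    if (localName.equals(reader.getLocalName())) {
                        inside = true;
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    if (localName.equals(reader.getLocalName())) {
                        inside = false;
                    }
                } else if (inside && (event == XMLStreamConstants.CHARACTERS
                        || event == XMLStreamConstants.CDATA)) {
                    // Write straight from the parser's buffer without building a String.
                    sink.write(reader.getTextCharacters(),
                               reader.getTextStart(), reader.getTextLength());
                    total += reader.getTextLength();
                }
            }
        } finally {
            reader.close();
        }
        return total;
    }
}
```

In a real pipeline the Writer would be replaced by something that Base64-decodes and decrypts each piece as it arrives.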

Thank you.

Arshad Noor
StrongAuth, Inc.
