[XOM-interest] Processing XML file of 4GB

Discussion:

Geet Gangwar

2012-04-30 11:42:41 UTC

Hi,

I need to process large XML files of about 4 GB, and I am using the NUX
distribution to do so.
But when I run an XQuery against such a file, I get a Java heap space
error.

I am using a 64-bit JDK 1.6, and I have increased the JVM heap size to 5 GB
with -Xmx5G.

Please help me solve this problem.

My sample code is pasted below.

String path = "/xbrli:xbrl/";
Map prefixes = new HashMap();
prefixes.put("xbrli", "http://www.xbrl.org/2003/instance");
prefixes.put("link", "http://www.xbrl.org/2003/linkbase");
prefixes.put("xlink", "http://www.w3.org/1999/xlink");
prefixes.put("xsd", "http://www.w3.org/2001/XMLSchema");

StreamingTransform myTransform = new StreamingTransform() {
    private Nodes NONE = new Nodes();

    // execute XQuery against each element matching location path
    public Nodes transform(Element subtree) {
        Nodes results = XQueryUtil.xquery(subtree, "xbrli:context[id = 'c-01']");

        for (int i=0; i < results.size(); i++) {
            // do something useful with query results; here we just print them
            System.out.println(XOMUtil.toPrettyXML(results.get(i)));
        }
        return NONE; // current subtree becomes subject to garbage collection
        // returning empty node list removes current subtree from document being built.
        // returning new Nodes(subtree) retains the current subtree.
        // returning new Nodes(some other nodes) replaces the current subtree with
        // some other nodes.
        // if you want (SAX) parsing to terminate at this point, simply throw an exception
    }
};

// parse document with a filtering Builder
StreamingPathFilter filter = new StreamingPathFilter(path, prefixes);
Builder builder = new Builder(filter.createNodeFactory(null, myTransform));
Document doc = builder.build(new File("F:/sample/cntxt.xml"));
System.out.println(doc.getRootElement().getValue());
System.out.println("doc.size()=" + doc.getRootElement().getChildCount());
System.out.println(XOMUtil.toPrettyXML(doc));

Regards

Geet

Bruno Oliveira

2012-04-30 11:50:32 UTC


Hi,
5 GB of memory isn't enough. On average, an XML file needs 4 or 5 times
as much memory as its size on disk. For example, for a file of 4 GB you
will probably need at least 16-20 GB of memory (depending on the number
of nested nodes per node). If you want to process the entire file, I
recommend using a streaming API, e.g. StAX, or the XOM stream-based
approach: http://www.xom.nu/faq.xhtml
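
A minimal sketch of the StAX route mentioned above, using only the JDK's
javax.xml.stream package. It reads the input once, front to back, and never
builds a tree, so memory use does not grow with the file size. The class
name StaxContextScan, the method countContexts and the inline sample
document are made up for illustration; matching on the id attribute is an
assumption about what the original query was meant to select.

```java
import java.io.Reader;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of XOM or NUX.
class StaxContextScan {
    static final String XBRLI_NS = "http://www.xbrl.org/2003/instance";

    /** Counts xbrli:context elements whose id attribute equals wantedId. */
    static int countContexts(Reader in, String wantedId) {
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(in);
            int matches = 0;
            try {
                while (reader.hasNext()) {
                    // Only the current event is held in memory.
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && XBRLI_NS.equals(reader.getNamespaceURI())
                            && "context".equals(reader.getLocalName())
                            && wantedId.equals(reader.getAttributeValue(null, "id"))) {
                        matches++;
                    }
                }
            } finally {
                reader.close();
            }
            return matches;
        } catch (XMLStreamException e) {
            throw new IllegalStateException("XML could not be parsed", e);
        }
    }

    public static void main(String[] args) {
        // Tiny stand-in for the real 4 GB file.
        String sample = "<xbrli:xbrl xmlns:xbrli=\"" + XBRLI_NS + "\">"
                + "<xbrli:context id=\"c-01\"/>"
                + "<xbrli:context id=\"c-02\"/>"
                + "</xbrli:xbrl>";
        System.out.println(countContexts(new StringReader(sample), "c-01"));
    }
}
```

For a real file, one would probably hand an InputStream to
createXMLStreamReader instead of a Reader, so that the parser can detect
the encoding from the XML declaration.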

Best Regards,
Bruno Oliveira

_______________________________________________
XOM-interest mailing list
XOM-interest at lists.ibiblio.org
http://lists.ibiblio.org/mailman/listinfo/xom-interest

Michael Kay

2012-04-30 12:11:21 UTC


NUX implements what one might call "chunked streaming": it selects a
sequence of subtrees of your document and applies XQuery processing to
each one in turn. (This is effectively the same as saxon:stream() in
Saxon.) You need to have enough memory to hold the largest of these
subtrees. Because your path /xbrli:xbrl effectively selects the whole
document, your largest subtree is the same as the document size, so you
aren't really streaming at all.
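
The idea of chunked streaming can be sketched without NUX, using only the
JDK. The sketch below streams the input with StAX and materializes one
child of the root element at a time as a small DOM tree, hands it to a
callback, and then drops it, so peak memory is bounded by the largest
chunk rather than by the whole document. This is an illustration of the
principle, not NUX's implementation; the class and method names are
hypothetical, and attribute namespaces are dropped to keep it short. In
NUX terms it corresponds to a location path one level deeper than
/xbrli:xbrl, for example /xbrli:xbrl/xbrli:context, which is an assumption
about the intended query.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.function.Consumer;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical illustration of chunked streaming; not NUX code.
class ChunkedStreaming {

    /** Calls handler once per child element of the root; returns the chunk count. */
    static int forEachChildOfRoot(Reader in, Consumer<Element> handler) {
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(in);
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            int depth = 0;
            int chunks = 0;
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    depth++;
                    if (depth == 2) {
                        // A fresh, small document per chunk; the previous
                        // one becomes garbage once the handler returns.
                        Document doc = dbf.newDocumentBuilder().newDocument();
                        Element chunk = readSubtree(reader, doc);
                        doc.appendChild(chunk);
                        handler.accept(chunk);
                        chunks++;
                        depth--; // readSubtree consumed the matching end tag
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    depth--;
                }
            }
            reader.close();
            return chunks;
        } catch (XMLStreamException | ParserConfigurationException e) {
            throw new IllegalStateException("XML could not be processed", e);
        }
    }

    /** Copies the element the reader is positioned on, with its content, into doc. */
    private static Element readSubtree(XMLStreamReader reader, Document doc)
            throws XMLStreamException {
        Element root = startElement(reader, doc);
        Node current = root;
        while (true) {
            switch (reader.next()) {
                case XMLStreamConstants.START_ELEMENT: {
                    Element child = startElement(reader, doc);
                    current.appendChild(child);
                    current = child;
                    break;
                }
                case XMLStreamConstants.CHARACTERS:
                case XMLStreamConstants.CDATA:
                    current.appendChild(doc.createTextNode(reader.getText()));
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    if (current == root) {
                        return root;
                    }
                    current = current.getParentNode();
                    break;
                default:
                    break; // comments and processing instructions are skipped
            }
        }
    }

    private static Element startElement(XMLStreamReader reader, Document doc) {
        String prefix = reader.getPrefix();
        String qName = (prefix == null || prefix.isEmpty())
                ? reader.getLocalName()
                : prefix + ":" + reader.getLocalName();
        Element e = doc.createElementNS(reader.getNamespaceURI(), qName);
        for (int i = 0; i < reader.getAttributeCount(); i++) {
            // Simplification: attribute namespaces are ignored.
            e.setAttribute(reader.getAttributeLocalName(i), reader.getAttributeValue(i));
        }
        return e;
    }

    public static void main(String[] args) {
        String sample = "<xbrli:xbrl xmlns:xbrli=\"http://www.xbrl.org/2003/instance\">"
                + "<xbrli:context id=\"c-01\"><xbrli:entity>A</xbrli:entity></xbrli:context>"
                + "<xbrli:context id=\"c-02\"/>"
                + "</xbrli:xbrl>";
        int chunks = forEachChildOfRoot(new StringReader(sample),
                e -> System.out.println(e.getTagName() + " id=" + e.getAttribute("id")));
        System.out.println("chunks=" + chunks);
    }
}
```

If the callback were invoked at depth 1 instead of depth 2, the single
chunk would be the entire document, which is the situation described
above for the path /xbrli:xbrl.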

If you can describe the query/transformation you want to carry out, then
I might be able to suggest a way of doing it in a streamed manner either
using NUX, or using native facilities of Saxon (whose streaming
capabilities have advanced considerably since NUX was last released).

Michael Kay
Saxonica
