Discussion:
[XOM-interest] Processing XML file of 4GB
Geet Gangwar
2012-04-30 11:42:41 UTC
Permalink
Hi,

I need to process large XML files of about 4 GB. I am using the NUX
distribution to process the files.
But when I run an XQuery against such a file, I get a Java heap space
error.

I am using 64-bit JDK 1.6, and I have increased the JVM heap size to 5 GB
with -Xmx5G.

Please help me solve this problem.

My sample code is pasted below.

String path = "/xbrli:xbrl/";
Map prefixes = new HashMap();
prefixes.put("xbrli", "http://www.xbrl.org/2003/instance");
prefixes.put("link", "http://www.xbrl.org/2003/linkbase");
prefixes.put("xlink", "http://www.w3.org/1999/xlink");
prefixes.put("xsd", "http://www.w3.org/2001/XMLSchema");

StreamingTransform myTransform = new StreamingTransform() {
    private Nodes NONE = new Nodes();

    // execute XQuery against each element matching location path
    public Nodes transform(Element subtree) {
        Nodes results = XQueryUtil.xquery(subtree,
                "xbrli:context[id = 'c-01']");

        for (int i = 0; i < results.size(); i++) {
            // do something useful with query results; here we just print them
            System.out.println(XOMUtil.toPrettyXML(results.get(i)));
        }
        return NONE; // current subtree becomes subject to garbage collection
        // returning empty node list removes current subtree from document being built.
        // returning new Nodes(subtree) retains the current subtree.
        // returning new Nodes(some other nodes) replaces the current subtree with
        // some other nodes.
        // if you want (SAX) parsing to terminate at this point, simply throw an exception
    }
};

// parse document with a filtering Builder
StreamingPathFilter filter = new StreamingPathFilter(path, prefixes);
Builder builder = new Builder(filter.createNodeFactory(null, myTransform));
Document doc = builder.build(new File("F:/sample/cntxt.xml"));
System.out.println(doc.getRootElement().getValue());
System.out.println("doc.size()=" + doc.getRootElement().getChildCount());
System.out.println(XOMUtil.toPrettyXML(doc));

Regards

Geet
Bruno Oliveira
2012-04-30 11:50:32 UTC
Permalink
Hi,
5 GB of memory isn't enough. On average, an XML file needs 4 or 5 times
as much memory as its size on disk. For example, for a 4 GB file you will
probably need at least 12-15 GB of memory (depending on the number of
nested nodes per node). If you want to process the entire file, I
recommend using a streaming API such as StAX, or the XOM stream-based
approach: http://www.xom.nu/faq.xhtml

Best Regards,
Bruno Oliveira
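
A minimal sketch of the streaming approach recommended above, using the
StAX API that ships with the JDK (javax.xml.stream). It walks the input one
event at a time and counts xbrli:context elements, so memory use should not
grow with the file size. The class name StaxContextCounter, the
countElements helper and the tiny inline sample document are assumptions
made for illustration; they are not part of NUX or XOM.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class StaxContextCounter {
    public static final String XBRLI_NS = "http://www.xbrl.org/2003/instance";

    // Counts elements with the given namespace URI and local name.
    // Only the current parser event is held in memory at any time.
    public static long countElements(InputStream in, String namespaceUri, String localName)
            throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        long count = 0;
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && localName.equals(reader.getLocalName())
                        && namespaceUri.equals(reader.getNamespaceURI())) {
                    count++;
                }
            }
        } finally {
            reader.close();
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical two-context sample; a real run would pass a FileInputStream.
        String xml = "<xbrli:xbrl xmlns:xbrli=\"" + XBRLI_NS + "\">"
                + "<xbrli:context id=\"c-01\"/><xbrli:context id=\"c-02\"/>"
                + "</xbrli:xbrl>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(countElements(in, XBRLI_NS, "context"));
    }
}
```

For a real file, the ByteArrayInputStream would be replaced by a buffered
FileInputStream; the counting step is where per-element validation would go.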

_______________________________________________
XOM-interest mailing list
XOM-interest at lists.ibiblio.org
http://lists.ibiblio.org/mailman/listinfo/xom-interest
Michael Kay
2012-04-30 12:11:21 UTC
Permalink
NUX implements what one might call "chunked streaming": it selects a
sequence of subtrees of your document and applies XQuery processing to
each one in turn. (This is effectively the same as saxon:stream() in
Saxon.) You need to have enough memory to hold the largest of these
subtrees. Because your path /xbrli:xbrl effectively selects the whole
document, your largest subtree is the same as the document size, so you
aren't really streaming at all.

If you can describe the query/transformation you want to carry out, then
I might be able to suggest a way of doing it in a streamed manner either
using NUX, or using native facilities of Saxon (whose streaming
capabilities have advanced considerably since NUX was last released).

Michael Kay
Saxonica


Geet Gangwar
2012-04-30 12:37:00 UTC
Permalink
Hi Michael,

Basically I need to perform some validation on the XML files. My
requirement is to pick a particular block from the XML and validate that
block. For example, say I want to pull all contexts from the document,
i.e. everything matching an XPath expression like

/xbrli:xbrl/xbrli:context

and validate them.

Regards

Geet
Michael Kay
2012-04-30 13:03:32 UTC
Permalink
Follow-up questions then:

(1) how big is an xbrli:context object?

(2) is the validation of each xbrli:context object independent of the
others?

(3) what is the logic of the validation that needs to be performed?

Michael Kay
Saxonica
Geet Gangwar
2012-04-30 13:16:40 UTC
Permalink
There are multiple occurrences of the context element, say up to 10 million.

The validation logic is the same for each object. I need to pick one
context, perform the validation logic, then pick another and perform the
same logic.

Regards

Geet
Michael Kay
2012-04-30 13:42:37 UTC
Permalink
In that case I would suggest starting with an initial path of
/xbrli:xbrl/xbrli:context rather than /xbrli:xbrl.

Michael Kay
Saxonica
Geet Gangwar
2012-05-02 13:21:15 UTC
Permalink
Hi Michael,

As you suggested, I have used /xbrli:xbrl/xbrli:context to get all the
contexts from the file. But apart from the contexts, /xbrli:xbrl has other
child nodes, around 50 million in number, and I want a list of all those
elements selected by attribute. I am using the XPath
/xbrli:xbrl/[@contextRef] but I am getting heap space errors. The problem
is that these nodes have no common node name; all the names are different.


Regards

Geet
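
One possible way to pick out the children of the root element that carry a
contextRef attribute, whatever their names, is to stream over the file with
StAX and track the element depth. This is only a sketch under assumptions:
ContextRefScanner, the FactHandler callback and the inline sample are
hypothetical names introduced here, and the approach sidesteps NUX rather
than showing how NUX itself would do it. Matching elements are handed to a
callback one at a time instead of being collected, since a list of 50
million items could itself exhaust the heap.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class ContextRefScanner {

    // Hypothetical callback, invoked once per matching element.
    public interface FactHandler {
        void onFact(String namespaceUri, String localName, String contextRef);
    }

    // Reports every direct child of the root element that has a contextRef attribute.
    public static void scan(InputStream in, FactHandler handler) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        int depth = 0; // 1 = root element, 2 = children of the root
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    depth++;
                    if (depth == 2) {
                        // A null namespace argument means the attribute namespace is not checked.
                        String contextRef = reader.getAttributeValue(null, "contextRef");
                        if (contextRef != null) {
                            handler.onFact(reader.getNamespaceURI(),
                                    reader.getLocalName(), contextRef);
                        }
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    depth--;
                }
            }
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample: two differently named facts plus one context.
        String xml = "<xbrli:xbrl xmlns:xbrli=\"http://www.xbrl.org/2003/instance\""
                + " xmlns:t=\"http://example.com/taxonomy\">"
                + "<xbrli:context id=\"c-01\"/>"
                + "<t:Assets contextRef=\"c-01\">100</t:Assets>"
                + "<t:Revenue contextRef=\"c-01\">40</t:Revenue>"
                + "</xbrli:xbrl>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        scan(in, new FactHandler() {
            public void onFact(String ns, String name, String ref) {
                System.out.println(name + " -> " + ref);
            }
        });
    }
}
```

The depth check plays the role of the /xbrli:xbrl/ step of the path, and
the attribute test plays the role of the [@contextRef] predicate.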
Post by Michael Kay
In that case I would suggest starting with an initial path of
/xbrli:xbrl/xbrli:context rather than /xbrli:xbrl.
Michael Kay
Saxonica
Post by Geet Gangwar
There multiple occurrences of contexts element say upto 10 million.
Validation logic for each object is same. I need to pick one context
perform the validation logic, pick another perform the same logic.
Regards
Geet
Post by Michael Kay
(1) how big is an xbrli:context object?
(2) is the validation of each xbrli:context object independent of the
others?
(3) what is the logic of the validation that needs to be performed?
Michael Kay
Saxonica
Post by Geet Gangwar
Hi Michael,
Basically I need to perform some validation on the xml files. So my
requirement is to pick particular block from xml and validate that
block.
Post by Geet Gangwar
Post by Michael Kay
Post by Geet Gangwar
For example say I want to pull all context i.e
XPATH expression like : /xbrli:xbrl/xbrli:context
from the document and validate them.
Regards
Geet
On Mon, Apr 30, 2012 at 5:41 PM, Michael Kay<mike at saxonica.com>
Post by Michael Kay
NUX implements what one might call "chunked streaming": it selects a
sequence of subtrees of your document and applies XQuery processing to
each one in turn. (This is effectively the same as saxon:stream() in
Saxon.) You need to have enough memory to hold the largest of these
subtrees. Because your path /xbrli:xbrl effectively selects the whole
document, your largest subtree is the same as the document size, so
you
Post by Geet Gangwar
Post by Michael Kay
Post by Geet Gangwar
Post by Michael Kay
aren't really streaming at all.
If you can describe the query/transformation you want to carry out,
then
Post by Geet Gangwar
Post by Michael Kay
Post by Geet Gangwar
Post by Michael Kay
I might be able to suggest a way of doing it in a streamed manner
either
Post by Geet Gangwar
Post by Michael Kay
Post by Geet Gangwar
Post by Michael Kay
using NUX, or using native facilities of Saxon (whose streaming
capabilities have advanced considerably since NUX was last released).
Michael Kay
Saxonica
_______________________________________________
XOM-interest mailing list
XOM-interest at lists.ibiblio.org
http://lists.ibiblio.org/mailman/listinfo/xom-interest
Christophe Marchand
2012-05-02 14:11:11 UTC
Permalink
Sometimes writing your own SAX handler to code your checks is quicker
than anything else; and if your code is clean, it will work on very,
very large files.
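
A minimal sketch of what such a handler could look like, using only the SAX parser that ships with the JDK. The class name AttributeMatchCounter, the rule "direct children of the root" and the attribute name passed in are assumptions for illustration; none of them come from the thread. It counts the children of the root element that carry a given attribute value, whatever their element names are, and keeps nothing else in memory.

```java
import java.io.InputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Counts direct children of the root element that carry a given attribute value. */
final class AttributeMatchCounter extends DefaultHandler {
    private final String attrName;   // hypothetical attribute to test, e.g. "contextRef"
    private final String attrValue;  // value the attribute must have
    private int depth = 0;           // 1 = root element, 2 = child of the root
    private long matches = 0;

    AttributeMatchCounter(String attrName, String attrValue) {
        this.attrName = attrName;
        this.attrValue = attrValue;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        depth++;
        // Unprefixed attributes live in no namespace, hence the empty URI.
        if (depth == 2 && attrValue.equals(atts.getValue("", attrName))) {
            matches++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        depth--;
    }

    /** Streams the input once and returns the number of matching elements. */
    static long count(InputStream in, String attrName, String attrValue) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            AttributeMatchCounter handler = new AttributeMatchCounter(attrName, attrValue);
            factory.newSAXParser().parse(in, handler);
            return handler.matches;
        } catch (Exception e) {
            // Wrapped so callers do not have to declare checked exceptions.
            throw new IllegalStateException("XML scan failed", e);
        }
    }
}
```

It could be called as AttributeMatchCounter.count(new FileInputStream(file), "contextRef", "c-01"). Real validation logic would go where the counter is incremented.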

Regards,
Christophe
Post by Geet Gangwar
Hi Michael,
As you suggested, I have used /xbrli:xbrl/xbrli:context to get all
contexts from the file. But apart from the contexts, /xbrli:xbrl has other
child nodes, around 50 million of them, and I want a list of all those
elements based on an attribute. I am using XPath because the nodes have
no common name; all the names are different.
Regards
Geet
Post by Michael Kay
In that case I would suggest starting with an initial path of
/xbrli:xbrl/xbrli:context rather than /xbrli:xbrl.
Michael Kay
Saxonica
Post by Geet Gangwar
There are multiple occurrences of the context element, say up to 10 million.
The validation logic for each object is the same. I need to pick one context,
perform the validation logic, then pick another and perform the same logic.
Regards
Geet
Post by Michael Kay
(1) how big is an xbrli:context object?
(2) is the validation of each xbrli:context object independent of the
others?
(3) what is the logic of the validation that needs to be performed?
Michael Kay
Saxonica
Post by Geet Gangwar
Hi Michael,
Basically I need to perform some validation on the XML files. So my
requirement is to pick a particular block from the XML and validate that
block.
For example, say I want to pull all contexts, i.e. an
XPath expression like /xbrli:xbrl/xbrli:context,
from the document and validate them.
Regards
Geet
On Mon, Apr 30, 2012 at 5:41 PM, Michael Kay<mike at saxonica.com>
Post by Michael Kay
NUX implements what one might call "chunked streaming": it selects a
sequence of subtrees of your document and applies XQuery processing to
each one in turn. (This is effectively the same as saxon:stream() in
Saxon.) You need to have enough memory to hold the largest of these
subtrees. Because your path /xbrli:xbrl effectively selects the whole
document, your largest subtree is the same as the document size, so you
aren't really streaming at all.
If you can describe the query/transformation you want to carry out, then
I might be able to suggest a way of doing it in a streamed manner either
using NUX, or using native facilities of Saxon (whose streaming
capabilities have advanced considerably since NUX was last released).
Michael Kay
Saxonica
Michael Kay
2012-05-02 14:27:21 UTC
Permalink
Post by Geet Gangwar
Hi Michael,
As you suggested, I have used /xbrli:xbrl/xbrli:context to get all
contexts from the file. But apart from the contexts, /xbrli:xbrl has other
child nodes, around 50 million of them, and I want a list of all those
elements based on an attribute. I am using XPath because the nodes have
no common name; all the names are different.
I'm afraid I can't offer any detailed technical support on NUX streaming
features: it's not my product and I don't know it in depth.

If you want to try out streaming in Saxon-EE, I'll be happy to help.

Michael Kay
Saxonica

Geet Gangwar
2012-04-30 12:30:09 UTC
Permalink
Thanks Bruno for your quick reply. I read on the NUX site that it is an
extension of XOM. I have also used NUX's StreamingTransform class.

I have multiple XML files to process which are interlinked with each other
through XLink, so having XPath is necessary for me. I don't see how to
approach this using only the StAX APIs.
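
One way the two can be combined, sketched here with JDK classes only: let StAX walk the file, copy each xbrli:context subtree into a small DOM, and run XPath against that chunk alone. The class and method names are hypothetical, and the identity-transformer copy via StAXSource is a swapped-in technique, not something NUX or XOM provides. Memory use is bounded by the largest single subtree rather than by the file.

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.stax.StAXSource;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;

final class ChunkedXPath {
    static final String XBRLI_NS = "http://www.xbrl.org/2003/instance";

    /** Evaluates an XPath expression against each xbrli:context subtree in turn. */
    static List<String> evaluatePerContext(InputStream in, String expression) {
        List<String> results = new ArrayList<String>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
            Transformer copier = TransformerFactory.newInstance().newTransformer();
            XPath xpath = XPathFactory.newInstance().newXPath();
            while (reader.hasNext()) {
                if (reader.isStartElement()
                        && "context".equals(reader.getLocalName())
                        && XBRLI_NS.equals(reader.getNamespaceURI())) {
                    // Copy just this subtree into a throwaway DOM document.
                    DOMResult chunk = new DOMResult();
                    copier.transform(new StAXSource(reader), chunk);
                    results.add(xpath.evaluate(expression, chunk.getNode()));
                    // No reader.next() here: the copy has already consumed the
                    // subtree, so the current event must be re-examined.
                } else {
                    reader.next();
                }
            }
            reader.close();
        } catch (Exception e) {
            // Wrapped so callers do not have to declare checked exceptions.
            throw new IllegalStateException("chunked XPath evaluation failed", e);
        }
        return results;
    }
}
```

For example, evaluatePerContext(stream, "string(/*/@id)") would return the id of every context. Expressions that use prefixed names would additionally need a NamespaceContext set on the XPath object, and a real run would validate each chunk instead of collecting strings.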

Regards

Geet

On Mon, Apr 30, 2012 at 5:20 PM, Bruno Oliveira <bruno.oliveira360 at gmail.com
Post by Bruno Oliveira
Hi,
5 GB of memory isn't enough. On average, an XML document held in memory
needs 4 or 5 times as much memory as the file size. For example, for a 4 GB
file you will probably need at least 12-15 GB of memory (depending on the
number of nested nodes per node). If you want to process the entire file, I
recommend a streaming API, e.g. StAX, or the XOM
stream-based approach: http://www.xom.nu/faq.xhtml
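
A minimal StAX sketch of that idea, using only the JDK. The class and method names are hypothetical, and it assumes the context id is an attribute (the original query tests a child element named id). The cursor reads one event at a time, so memory use stays roughly constant whatever the file size.

```java
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

final class StaxContextScan {
    static final String XBRLI_NS = "http://www.xbrl.org/2003/instance";

    /** Counts xbrli:context elements whose id attribute equals the given value. */
    static long countContextsWithId(InputStream in, String id) {
        long count = 0;
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
            while (reader.hasNext()) {
                // Only start tags are inspected; every other event is skipped.
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "context".equals(reader.getLocalName())
                        && XBRLI_NS.equals(reader.getNamespaceURI())
                        && id.equals(reader.getAttributeValue(null, "id"))) {
                    count++;
                }
            }
            reader.close();
        } catch (Exception e) {
            // Wrapped so callers do not have to declare checked exceptions.
            throw new IllegalStateException("StAX scan failed", e);
        }
        return count;
    }
}
```

A call such as StaxContextScan.countContextsWithId(new FileInputStream(file), "c-01") would then scan the file in a single pass.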
Best Regards,
Bruno Oliveira