
IBM Support for XML files of 1 GB or more

Posted: Thu Mar 11, 2010 10:06 am
by dsdevper
Hi

We are facing abnormal termination of the XML Input stage when using a 1 GB file; the job runs smoothly for files smaller than 250 MB.
We have opened a ticket with IBM to find out exactly what file size the XML stages can handle, i.e. the limit of the stages. I'm at a bit of a loss about what else to ask. It may be a dumb question, but can anyone please tell me what I should ask the IBM support people regarding the XML stages?

They have asked for some information before calling me:

1. The version.xml from the server and client machines.
2. The dsenv, uvconfig and DSParams files from the project.
3. The job dsx.


Thanks

Posted: Thu Mar 11, 2010 10:31 am
by chulett
There's not much to ask. You cannot process XML files that large, and people who send them out like that should be shot. IMHO. :wink:

You'll need a 'pre-processing' step to break them up into digestible pieces, or a third-party parsing tool that doesn't need to suck the whole thing into memory first. I would imagine there are 'stream parsers' out there somewhere.
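
For what it's worth, here is a minimal sketch of what such a streaming splitter could look like in Java using StAX, which never holds more than one record in memory. The file name "big.xml", the "records"/"record" element names and the chunk size are hypothetical placeholders, not anything DataStage-specific:

// Sketch only: split <records><record>...</record>...</records> into smaller
// files of PER_FILE records each, streaming with StAX so that only one record
// is ever held in memory. All names here are hypothetical.
import javax.xml.stream.*;
import javax.xml.stream.events.XMLEvent;
import java.io.*;

public class XmlSplitter {
    public static void main(String[] args) throws Exception {
        XMLInputFactory inF = XMLInputFactory.newInstance();
        XMLOutputFactory outF = XMLOutputFactory.newInstance();
        XMLEventFactory evF = XMLEventFactory.newInstance();
        XMLEventReader reader = inF.createXMLEventReader(new FileInputStream("big.xml"));

        final int PER_FILE = 10000;            // records per output chunk
        int chunk = 0, inChunk = 0, depth = 0;
        XMLEventWriter writer = null;

        while (reader.hasNext()) {
            XMLEvent ev = reader.nextEvent();
            if (ev.isStartElement() && depth == 0
                    && "record".equals(ev.asStartElement().getName().getLocalPart())) {
                if (writer == null) {          // open a new chunk file with its own root
                    writer = outF.createXMLEventWriter(
                            new FileOutputStream("chunk_" + chunk + ".xml"), "UTF-8");
                    writer.add(evF.createStartDocument());
                    writer.add(evF.createStartElement("", "", "records"));
                }
                depth = 1;
            } else if (depth > 0 && ev.isStartElement()) {
                depth++;                       // element nested inside a record
            }

            if (depth > 0) writer.add(ev);     // copy everything inside a record

            if (depth > 0 && ev.isEndElement()) {
                depth--;
                if (depth == 0 && ++inChunk == PER_FILE) {
                    writer.add(evF.createEndElement("", "", "records"));
                    writer.add(evF.createEndDocument());
                    writer.close();            // chunk complete, roll to the next file
                    writer = null; inChunk = 0; chunk++;
                }
            }
        }
        if (writer != null) {                  // flush the final partial chunk
            writer.add(evF.createEndElement("", "", "records"));
            writer.add(evF.createEndDocument());
            writer.close();
        }
        reader.close();
    }
}

Each chunk file then gets its own root element, so downstream jobs can read the chunks like any other small XML file.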

Paging Dr Ostic, Dr Fine, Dr Ostic

Posted: Thu Mar 11, 2010 11:03 am
by chulett
If you really want to ask them something, how about asking them for their recommendation as to a 'best practice' for processing huge XML files in DataStage? I'd be curious what they say.

Posted: Thu Mar 11, 2010 12:08 pm
by dsdevper
Thanks chulett. From reading so many posts on this forum I came to the conclusion that it cannot process a file that big, but our company wants to take this to IBM and hear the answer from them.

I was expecting a reply from them along the lines of "we cannot process such a big file" as soon as they saw the ticket.

But instead they asked for all this file information, so I wondered whether there was anything else I should ask them.

I will make sure to ask them what you have suggested.

Posted: Thu Mar 11, 2010 2:03 pm
by eostic
Late reply. Been on a long plane ride from Asia.... IBM is working on it, but we have another quarter or two to wait. We're seeing more and more large XML......

In the meantime, you would need to break it up... I've seen people do it creatively in Java, or using something like XMLMax (a Windows-based tool that can help).

Alternatively, if you have it available, you could use WebSphere TX, by itself or via MapStage from within DataStage. It uses a SAX-style reader that can handle the larger docs.

Ernie

Posted: Thu Mar 11, 2010 4:37 pm
by dsdevper
Hi, I do not know what to say, but here is the reply I got from IBM support by mail. I'm still hoping they will call me.

""Thank you for the information. Found that the XML Input stage requires, on average, 5-7 times the size of the file in memory to process the document. The memory usage is based on
the actual structure and data within the document, and on the XPATH that defined in the job.
There is a risk of random job failures with large files, even when the memory usage is optimally configured.
The recommended solution is the input XML files should be kept as small as possible. The guideline is 100 MB or less for each file.""

They said there is a risk of random job failures without giving any reason for it.

They didn't give an actual size limit for the XML stages, although by their own figure our 1 GB file would need somewhere between 5 and 7 GB of memory just to be parsed.

Any thoughts?


Thanks

Posted: Thu Mar 11, 2010 4:55 pm
by chulett
There are too many variables; a file that's "too large" on my system may process fine on yours, and all that matters is your limit, not mine. About all I can suggest is that you write something to generate test XML files of varying sizes and see how big you can push it until a reader job falls over dead.
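
For example, a quick-and-dirty Java sketch like the following could generate a test file of roughly the size you ask for (the file name "test.xml" and the record layout are made up for illustration; adjust them to resemble your real data):

// Rough sketch: write a <records> file of approximately the requested size in MB.
// "test.xml" and the record layout are hypothetical placeholders.
import java.io.*;

public class XmlGenerator {
    public static void main(String[] args) throws Exception {
        long targetBytes = Long.parseLong(args[0]) * 1024L * 1024L;  // size in MB
        try (Writer w = new BufferedWriter(new FileWriter("test.xml"))) {
            w.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<records>\n");
            long written = 60;   // rough running byte count (ASCII content)
            int id = 0;
            while (written < targetBytes) {
                String rec = "  <record id=\"" + (id++) + "\">"
                        + "<name>test name</name><value>12345</value></record>\n";
                w.write(rec);
                written += rec.length();
            }
            w.write("</records>\n");
        }
    }
}

Run it as, say, "java XmlGenerator 250", then 500, 750 and so on, and feed each file to a minimal read-and-discard job until it dies.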

Or start looking into something to chunk them up and try to stay under a limit; their "100MB or less" rule is a good general one. Heck, when I was building files for Google, they had that as a strict rule - as many files as we wanted to send, provided none of them was a byte over 100MB. One bad apple and the whole bushel basket was rejected.

Posted: Fri Mar 12, 2010 12:50 am
by vmcburney
The problem is that the DataStage XML Input stage sees the entire file as one XML document, so it tries to validate that entire document before it starts XML processing - that's what breaks the memory limits. This is fixed in DataStage 8.5, due out in a couple of months, but for now you can break the file up into smaller pieces, or read it through a Sequential File stage and parse the XML in a Transformer.

You might be able to stage the file in a DB2 PureXML database. Have a look at this article on a benchmark for processing a terabyte of XML documents:
http://www.ibm.com/developerworks/data/ ... index.html

And this one on retrieving data from PureXML using the DataStage DB2 Connector, which supports XQueries and can shred the data without having to read all the XML at startup:
http://www.ibm.com/developerworks/data/ ... epurexml1/

Posted: Fri Mar 12, 2010 2:53 am
by lstsaur
The XML Input stage uses a DOM parser, which builds a DOM tree in memory for the whole XML document. Any time you run it against a large XML document, the XML Input stage crashes.

So what you can do is use a StAX parser to divide the large XML file into smaller DOM subtrees, and then evaluate each subtree individually with XPath. It's no easy task, but I got it working.
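
In outline - and only as a sketch of the idea, with made-up file and element names; this is plain Java SE, not a DataStage stage - the trick is that javax.xml.transform can turn the StAX stream for a single element into a small DOM tree, which XPath then sees in isolation:

// Sketch of the StAX-to-DOM-subtree approach: stream the big file, materialize
// one <record> at a time as a tiny DOM tree, and run XPath on just that tree.
// "big.xml", "record" and the XPath expression are hypothetical placeholders.
import javax.xml.stream.*;
import javax.xml.transform.*;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.stax.StAXSource;
import javax.xml.xpath.*;
import java.io.FileInputStream;

public class SubtreeXPath {
    public static void main(String[] args) throws Exception {
        XMLInputFactory f = XMLInputFactory.newInstance();
        XMLStreamReader r = f.createXMLStreamReader(new FileInputStream("big.xml"));
        Transformer t = TransformerFactory.newInstance().newTransformer();
        XPathExpression xp =
                XPathFactory.newInstance().newXPath().compile("string(record/name)");

        while (r.hasNext()) {
            if (r.next() == XMLStreamConstants.START_ELEMENT
                    && "record".equals(r.getLocalName())) {
                DOMResult subtree = new DOMResult();
                t.transform(new StAXSource(r), subtree);  // consumes one <record> subtree
                // XPath only ever sees this small tree, never the whole document
                System.out.println(xp.evaluate(subtree.getNode()));
            }
        }
        r.close();
    }
}

The whole document is never in memory at once; each subtree becomes garbage once its XPath evaluation is done.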

Posted: Fri Mar 12, 2010 8:10 am
by chulett
vmcburney wrote: This is fixed in DataStage 8.5
Now, that is interesting news. Any idea about the nature - the how - of the fix?

Posted: Fri Mar 12, 2010 1:38 pm
by ray.wurlod
No, it was just a bullet point in the DataStage roadmap presentation at the IOD 2009 conference last October.

Posted: Wed Nov 03, 2010 12:01 pm
by Nagin
lstsaur,
Can you please let me know how I can use this workaround? Which technology are the StAX parser and the DOM subtrees built in?

Thanks,
Nagin.

Posted: Wed Nov 03, 2010 1:05 pm
by ray.wurlod
You can get version 8.5, which does handle very large XML files using a totally redesigned technique that streams the document rather than trying to store the entire XML file in memory. This new stage is only available in parallel jobs, however.

Posted: Wed Nov 03, 2010 1:53 pm
by eostic
No.... it's fully available in Server jobs in 8.5 also!

If you are not able to prepare for 8.5, the answer is above in the thread... you will need to break up the document externally (I've heard of creative solutions using Java, and I know our own lab services offers help with this), and tools like XMLMax can do it......... or you need to read it with another tool such as WebSphere TX.......

Ernie

Posted: Wed Nov 10, 2010 12:42 pm
by Nagin
Looks like we can't go to 8.5 yet. I am leaning towards splitting up the file with the help of a shell script.

But I just heard about XSLT. If I use a stylesheet, do you think DataStage will still read the entire XML file into memory?

In the job I have seen, it looks like we provide the XSLT file and the XML source file to the XML Transformer stage, and all the parsing happens there - apparently on Unix itself.

I think with this approach the entire XML does not need to be loaded into memory.

Any ideas?