Job process dies at around 1GB RAM usage

ian_bennett
Participant
Posts: 15
Joined: Sun Sep 01, 2002 6:56 pm
Location: Australia

Job process dies at around 1GB RAM usage

Post by ian_bennett »

When the job process hits around the 1GB RAM mark it soon aborts - no error message is given.

This is occurring when using the XML reader to parse large files, and we have noticed it with a few jobs.

Anyone else experienced this behaviour?
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne

Post by vmcburney »

Yep, the XML reader is a dog. It tries to read the entire XML file as a single document in one very big text field, so very large XML files cannot be handled by the XML input stage - they simply run out of memory. You are better off reading the file with a Sequential File stage and parsing it in a Transformer. You will find it won't hit a memory limit and it will run at least 100 times faster. BASIC routines can help you handle tags, and stage variables can handle complex transactions by saving values between rows and building up a flat record from the XML structure.
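
For what it's worth, here is roughly what the routine side of that can look like. A minimal sketch only - the routine name, arguments and one-element-per-line assumption are mine, not anything from the jobs above - for a DS BASIC routine you would call from a Transformer derivation on each line the Sequential File stage reads:

    * ExtractTagValue(Arg1, Arg2) - hypothetical routine sketch.
    * Arg1 = one line of the XML file read by the Sequential File stage.
    * Arg2 = the element name to pull out (without the angle brackets).
    * Returns the text between <Arg2> and </Arg2>, or "" if the element is not on this line.
    OpenTag = "<" : Arg2 : ">"
    CloseTag = "</" : Arg2 : ">"
    Ans = ""
    StartPos = Index(Arg1, OpenTag, 1)
    If StartPos > 0 Then
       StartPos = StartPos + Len(OpenTag)
       EndPos = Index(Arg1, CloseTag, 1)
       If EndPos > StartPos Then
          Ans = Arg1[StartPos, EndPos - StartPos]
       End
    End

A stage variable can then collect the returned values row by row and release one flat record when the closing tag of the repeating element comes through.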

Spent a few days doing this for some 8GB-plus XML files and it was worth the effort.

DataStage TX is better at XML processing than DataStage Server or Enterprise.
aartlett
Charter Member
Posts: 152
Joined: Fri Apr 23, 2004 6:44 pm
Location: Australia

Post by aartlett »

I would have to agree about the XML reader in DS Server. Haven't tried TX.

When I last had to play with XML files, especially when I had to extract multiple files (record types) out of the one XML file, we generated them externally using a script and, I think, the Saxon XSLT processor. This was a Java system, so we took a performance hit because it was Java. We tried the C++ version, but the sysadmins where I was were very draconian and wouldn't allow an upgrade to some libraries; I could easily get around those problems in the Java version. :)
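
If you want to wire that external split into the job stream rather than run it by hand, a before-job routine can shell out to it. Again just a sketch - the Saxon class name, file paths and stylesheet are placeholders for whatever a particular install uses, not the actual script from the system above:

    * Hypothetical before-job routine: run Saxon to split one big XML file
    * into per-record-type outputs before the DataStage jobs start.
    Cmd = "java com.icl.saxon.StyleSheet /data/in/big_feed.xml /data/xsl/split_record_types.xsl"
    Call DSExecute("UNIX", Cmd, CmdOutput, SysRet)
    If SysRet <> 0 Then
       Call DSLogFatal("Saxon split failed: " : CmdOutput, "SplitXmlBeforeJob")
    End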

I've had to support XML that is processed as sequential files and Transformers. It is very quick but a bugger when it comes to handling changes (I think it is yours I'm supporting, Vince).

I hold to the maxims: DataStage is not ETL, PX is not DataStage. Use the best tool for the job. I get into a bit of trouble sometimes, but generally I get support for this.
Andrew

Think outside the Datastage you work in.

There is no True Way, but there are true ways.

Post by vmcburney »

That's the drawback: sequential file processing is fast, but it can be hard to support, especially if you need to change or add XML elements.

Post by vmcburney »

So how do people feel about large repeating XML files? Seems to me a misuse of the format. An XML file with one million repeating records also holds one million duplicate definitions of those records, which take up as much if not more space than the data itself. Is this an incorrect use of XML? Should the data source be producing these volumes in a delimited or complex flat file format? XML seems ideal for small, complex real-time transactions but a big pain for large batch processing.
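
As a rough illustration of what I mean (a made-up record, not from any of the feeds discussed here), the same row in both formats:

    <customer><id>10042</id><name>Smith</name><state>VIC</state></customer>
    10042|Smith|VIC

That is roughly 70 characters of XML against 15 of delimited data, and every one of the million records repeats the same tag names.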

I know that on the project Andrew and I worked on, they probably would have saved a lot of time and money by rewriting the code that produced the XML to produce delimited files instead - a heap of disk space saved, and all the jobs built in a fraction of the time.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

If there is a choice in the matter, then I whole-heartedly agree with you Vince. :wink:

However, in my recent introduction to the Wonderful World of XML there was no choice given - feeds in and out to business partners like Yahoo and other search engines: build this huge zillion-row XML file, read this huge zillion-row XML file. Yark.

Starting to do smaller 'payloads' of XML for RTI/SOA jobs as well, but right now we've been having fun dealing with something that would have worked much better as a flat file IMHO - but since XML is The Way Things Should Be Done nowadays... :roll:
-craig

"You can never have too many knives" -- Logan Nine Fingers

Post by aartlett »

One of the problems with the XML we are getting, as anyone who has processed weblog data will know, is the amount of redundant and unneeded information. We are trying to get the source system to provide us with the data in a simple flat file.

One problem we had was that the originating source changed the tag for email links. They neglected to pass this down the line - who needed to know anyway? :) - and we stopped getting the data we needed. If they had provided flat files, a change to their XML wouldn't have affected us, all would be happy, and I wouldn't be trying to modify Vince's code :).

One advantage of XML is that if the file format changes (new fields are added), the XSLT will ignore them, as they haven't been asked for in the extraction. This is what happened at my last gig: the source changed the feed and we had zero impact until we decided we wanted to use the new fields.
Andrew

Think outside the Datastage you work in.

There is no True Way, but there are true ways.

Post by vmcburney »

The standard web log reader rocks - you could churn through a big web log in no time with all the handy routines that come with the web log import, such as URL parsers. It's when you need non-standard browse and click event information from your web site that you get into trouble.