Large XML Files

Post questions here related to DataStage Server Edition, covering areas such as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

RelianceDss
Premium Member
Posts: 27
Joined: Wed Apr 11, 2007 12:53 am

Large XML Files

Post by RelianceDss »

Hi,

We are facing problems reading large XML files (250 MB). We get the following error:
PLU_XML..XML_PLU: XML input document value is empty or NULL. Column Name = "Record"
I have tried both the two-parameter and the single-parameter (URL path) approach.
Reading the XML is also quite slow.
1. Are there any other methods for reading XML more reliably?
2. How can we improve performance while reading XML?
3. Can anyone give me the names of XML-to-flat-file converter utilities?

I have read in many posts that large XML files are insane to work with, but this is what we are getting from the source and we need to process them at our end.
eostic
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

Hi...

How far do you get? Are you certain it's a size issue? How large an instance (of this same structured document) are you able to read successfully? 50 MB? 100 MB? 200 MB?

Ernie
wnogalski
Charter Member
Posts: 54
Joined: Thu Jan 06, 2005 10:49 am
Location: Warsaw

Post by wnogalski »

DataStage has problems with XML files larger than about 200 MB.

A better solution is to write a simple program which parses the file and creates, for example, a comma-delimited sequential file from the XML.
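The suggestion above can be sketched in Python rather than a compiled program: a streaming parse (so the whole file never sits in memory) that writes one CSV row per record. The element name `Record` (taken from the error message earlier in the thread) and the assumption that each record carries its fields as XML attributes are both guesses about the actual document layout.

```python
import csv
import io
import xml.etree.ElementTree as ET

def xml_to_csv(xml_source, record_tag, out_file):
    """Stream records out of a large XML file and write them as CSV rows.

    Uses iterparse so memory stays roughly constant regardless of file size.
    Assumes every record element carries the same set of attributes; the
    header is taken from the first record seen.
    """
    writer = None
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == record_tag:
            if writer is None:
                fieldnames = sorted(elem.attrib)
                writer = csv.DictWriter(out_file, fieldnames=fieldnames)
                writer.writeheader()
            writer.writerow(elem.attrib)
            elem.clear()  # drop the parsed subtree so memory is reclaimed

# usage with a tiny in-memory sample (a real run would pass a file path)
sample = io.StringIO('<plu><Record id="1" name="a"/><Record id="2" name="b"/></plu>')
out = io.StringIO()
xml_to_csv(sample, "Record", out)
print(out.getvalue())  # header line "id,name" followed by one row per Record
```

For a 250 MB file you would pass the path directly (`xml_to_csv("plu.xml", "Record", open("plu.csv", "w", newline=""))`); the `elem.clear()` call is what keeps memory flat.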

HTH
Regards,
Wojciech Nogalski
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

I don't know how anyone can make a blanket statement about 200 MB being a limit for DataStage; there are way too many variables involved. I've successfully processed files nearly twice that size on my system.

Plus, whenever I've had 'size' problems they weren't nearly so nice - the job just falls over dead. Is this one large file that gives this message? Many 'large' files? If it is just one, I'd wonder if the problem lies in the file itself and not its size. :?

And not everyone can 'write a simple program' to parse XML... what would you suggest to do that? Seeing as some companies make a living delivering XML tools, I'm not really sure how 'simple' it would be. [shrug]
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Back to basics, can you guarantee that the XML is well-formed?
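One quick way to answer that question for a file too large to open in an editor is a streaming parse that checks syntax only and discards content as it goes. A minimal sketch in Python (an assumption of tooling; the thread itself names no checker):

```python
import io
import xml.etree.ElementTree as ET

def is_well_formed(source):
    """Stream-parse the document; return True iff it parses end to end.

    Because iterparse reads incrementally and we clear each element,
    this works on files far larger than available memory.
    """
    try:
        for _event, elem in ET.iterparse(source, events=("end",)):
            elem.clear()  # we only care about syntax, not content
        return True
    except ET.ParseError:
        return False

print(is_well_formed(io.StringIO("<a><b/></a>")))  # True
print(is_well_formed(io.StringIO("<a><b></a>")))   # False: mismatched tag
```

A command-line equivalent, where available, is `xmllint --stream --noout file.xml`.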
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
RelianceDss
Premium Member
Posts: 27
Joined: Wed Apr 11, 2007 12:53 am

Post by RelianceDss »

ray.wurlod wrote:Back to basics, can you guarantee that the XML is well-formed?
--The XML files are well formed for sure.
--I am able to process a 100 MB XML file, but anything larger fails. The tag we are trying to read has 69 attributes.
It gives the following error message:
"Abnormal termination of stage Xml_test..XML_Input_1 detected"
The box is a 4-CPU machine with 16 GB of physical memory. It's a sun4u SPARC SUNW,Sun-Fire-V440 running SunOS 5.10.

While testing I make sure that nothing else is running on the box.
Can we somehow avoid using the Folder stage with the XML Input stage?
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

If you have used the Folder stage with one column passing only the filename, and then set the XML Input stage to 'URL/Filepath', the issue is not with the Folder stage.

Post your question to your support provider. I think you'll find the issue is your O/S, which has... quirks... with XML, including fun things like the 'square root of a negative number' problem, which from what I recall is unique to Sun.
-craig

"You can never have too many knives" -- Logan Nine Fingers
mystuff
Premium Member
Posts: 200
Joined: Wed Apr 11, 2007 2:06 pm

Post by mystuff »

I don't know how anyone can make a blanket statement about 200MBs being a limit for DataStage, there's way too many variables involved.
a) Can you give me an idea of the variables involved? All I know is the size of the file, which presumably is limited by the physical memory available.

b) Could there be a rough estimate based on available physical memory? For example, with 16 GB, no more than 200 MB per XML file - something like that?
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

I've never seen any metric for that. It seems to vary widely enough that all you can do is process files on your system until you reach your particular limit. :?

Perhaps Ernie will come along and shed more light on this.
-craig

"You can never have too many knives" -- Logan Nine Fingers
eostic
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

I haven't seen any good way to calculate it, unfortunately. Perhaps it's something that could be dug up on the web (Apache's C++ Xerces and Xalan libraries are used deep inside the Stage). I suspect it's wildly variable depending on hierarchy, data types, element and attribute name lengths, values, etc.

Ernie
RelianceDss
Premium Member
Posts: 27
Joined: Wed Apr 11, 2007 12:53 am

Post by RelianceDss »

Hi,

We have made a directional decision not to use the XML Input plugin. We are flattening the XML files using a Perl program, in which we have also added some business rules.
Our testing on big XML files (more than 100 MB) was giving inconsistent results, and the performance was also pathetic.

Anyway, thanks a lot for all the suggestions.

Regards