How to improve throughput or performance of XML parser job?

eq8547
Premium Member
Posts: 4
Joined: Mon Feb 25, 2013 9:39 am
Location: SAN ANTONIO

Post by eq8547 »

Hi,

I have a parallel job that has the following stages:
External Source==> XML ==> Dataset

Number of nodes used = 8. (I also tried 2 and 4 nodes, but 8 nodes gave the fastest throughput.)

Average XML size is approximately 500 KB.

Current performance/throughput=1,000 XMLs per minute.

Target performance (i.e. throughput)= 2,500 XMLs per minute or better.
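
For a rough sense of scale, the figures above can be turned into a data rate. This is only a back-of-the-envelope sketch: it assumes decimal units (1 MB = 1000 KB) and treats the 500 KB average as exact, neither of which is stated in the post.

```python
# Figures quoted in the post; decimal units (1 MB = 1000 KB) are an assumption.
AVG_XML_KB = 500
CURRENT_PER_MIN = 1_000
TARGET_PER_MIN = 2_500


def mb_per_second(docs_per_min: int, avg_kb: int = AVG_XML_KB) -> float:
    """Convert a documents-per-minute rate into MB of XML per second."""
    return docs_per_min * avg_kb / 1000 / 60


def required_speedup(current: int, target: int) -> float:
    """Factor by which throughput has to grow to reach the target."""
    return target / current


print(f"current: {mb_per_second(CURRENT_PER_MIN):.1f} MB/s")
print(f"target:  {mb_per_second(TARGET_PER_MIN):.1f} MB/s")
print(f"speedup needed: {required_speedup(CURRENT_PER_MIN, TARGET_PER_MIN):.1f}x")
```

Under those assumptions the job currently moves roughly 8 MB of XML per second across all 8 nodes and would need roughly 21 MB/s, a 2.5x improvement.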

Schema (xsd of input XML) used:

<xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Call">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="index">
          <xs:complexType>
            <xs:sequence>
              <xs:element type="xs:dateTime" name="startTime"/>
              <xs:element type="xs:string" name="callID"/>
              <xs:element type="xs:string" name="appName"/>
              <xs:element type="xs:string" name="appLanguage"/>
              <xs:element type="xs:string" name="appRegion"/>
              <xs:element type="xs:string" name="ivrName"/>
              <xs:element type="xs:string" name="ivrPort"/>
              <xs:element type="xs:string" name="codeRelease"/>
              <xs:element type="xs:string" name="dataRelease"/>
              <xs:element type="xs:dateTime" name="endTime"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
        <xs:element name="rptTag" maxOccurs="unbounded" minOccurs="0">
          <xs:complexType>
            <xs:sequence>
              <xs:element type="xs:string" name="name"/>
              <xs:element name="attrib" maxOccurs="unbounded" minOccurs="0">
                <xs:complexType>
                  <xs:sequence>
                    <xs:element type="xs:string" name="name"/>
                    <xs:element type="xs:string" name="value"/>
                  </xs:sequence>
                </xs:complexType>
              </xs:element>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
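
To make the shape of the data concrete, here is a minimal sketch of one way a document matching this schema could be flattened into rows, with the index fields repeated on every row. It uses Python's standard xml.etree.ElementTree rather than the DataStage XML stage, and the sample document, its values, and the output column names (rptTag, attribName, attribValue) are all hypothetical, made up for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; only the element names come from the schema.
SAMPLE = """<Call>
  <index>
    <startTime>2013-02-25T09:39:00</startTime>
    <callID>C-0001</callID>
    <appName>demoApp</appName>
    <appLanguage>en</appLanguage>
    <appRegion>US</appRegion>
    <ivrName>ivr01</ivrName>
    <ivrPort>12</ivrPort>
    <codeRelease>1.0</codeRelease>
    <dataRelease>1.0</dataRelease>
    <endTime>2013-02-25T09:41:30</endTime>
  </index>
  <rptTag>
    <name>menu</name>
    <attrib><name>choice</name><value>1</value></attrib>
    <attrib><name>retries</name><value>0</value></attrib>
  </rptTag>
  <rptTag>
    <name>transfer</name>
  </rptTag>
</Call>"""


def flatten_call(xml_text):
    """Return one dict per (rptTag, attrib) pair, carrying every index field."""
    root = ET.fromstring(xml_text)
    index = {child.tag: child.text for child in root.find("index")}
    rows = []
    for tag in root.findall("rptTag"):
        tag_name = tag.findtext("name")
        attribs = tag.findall("attrib")
        if not attribs:
            # rptTag without attrib children still yields one row.
            rows.append({**index, "rptTag": tag_name,
                         "attribName": None, "attribValue": None})
        for attrib in attribs:
            rows.append({**index, "rptTag": tag_name,
                         "attribName": attrib.findtext("name"),
                         "attribValue": attrib.findtext("value")})
    if not rows:
        # A Call with no rptTag at all still yields its index fields once.
        rows.append({**index, "rptTag": None,
                     "attribName": None, "attribValue": None})
    return rows


for row in flatten_call(SAMPLE):
    print(row["callID"], row["rptTag"], row["attribName"], row["attribValue"])
```

Because both rptTag and attrib are unbounded, the number of output rows per document depends on how many of each the document contains.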

I also tried to enforce parallelism per IBM's documentation at http://pic.dhe.ibm.com/infocenter/iisin ... rsing.html, but without success.

In addition, I need every element (startTime, callID, etc.) in my output, so I'm not sure XML parser parallelism will help me even if I can get it to work.


Any tips on how to improve performance and meet the target throughput of 2,500 XMLs per minute or better are greatly appreciated.


Thanks,
Edgar
eostic
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

Hard to say, and we don't know anything about your machine, but some things to consider:

a) Be sure that validation is set to "minimal". XSD validation is slow, no matter what tool you are using. For that matter, XML parsing is never a screamer, but you should certainly be able to improve things.
b) Be sure that ALL of your output columns are varchar (it's text anyway). You don't want DataStage doing any sort of implicit datatype translation.
c) Start studying parallelism at the DataStage Job/Stage level, not within the XML Stage. Learn about the config file. You'll need to be careful about multiple readers at the External Source Stage, but you should be able to make gains by allowing the XML Stage to run in parallel. Can the XML docs be processed in any order?
d) Consider also, though it is not as elegant, using multi-instancing. Then you could give each instance a different filename pattern at the External Source for the files that are sent downstream. You would then be parallelizing the parsing activity at the Job execution level instead of via the config file.

The first two should be absolutes. Ultimately, all of these ideas could be used in combination.
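
As an illustration of the multi-instance idea in (d), the input files could be divided into one list per job instance before the instances are started. The sketch below is plain Python, not DataStage; the file names and the choice of a round-robin split are assumptions made for the example, and how each list is then handed to an instance's External Source is left open.

```python
def split_round_robin(paths, instances):
    """Deal file paths out to `instances` buckets, one bucket per job instance."""
    if instances < 1:
        raise ValueError("instances must be at least 1")
    buckets = [[] for _ in range(instances)]
    # Sorting first makes the assignment repeatable between runs.
    for i, path in enumerate(sorted(paths)):
        buckets[i % instances].append(path)
    return buckets


# Hypothetical file names, for illustration only.
files = [f"call_{n:04d}.xml" for n in range(10)]
for instance_no, bucket in enumerate(split_round_robin(files, 4), start=1):
    print(f"instance {instance_no}: {len(bucket)} files")
```

A round-robin split keeps the buckets within one file of each other in count; if document sizes vary a lot, balancing by total bytes instead may be the better choice.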

Ernie
Ernie Ostic

eq8547
Premium Member
Posts: 4
Joined: Mon Feb 25, 2013 9:39 am
Location: SAN ANTONIO

Post by eq8547 »

Hi Ernie,

I have tried suggestions a, b, and c with no gain in throughput.

Here are the results:
a. With 5K input XMLs and minimal validation, the job completed in 5:57 minutes. No gain compared with strict validation.
b. My schema is already all varchars, the same as my output columns.
c. I think I understand the parallelism concept. I tried running this job on 2, 4, 8, and even 16 nodes. With 2 nodes, it took over 45 minutes to process 5K!
d. I split the 5K input into two sets of 2.5K and ran 2 instances, each processing 2.5K XMLs. One instance completed in 5:33 minutes and the second completed in 5:16 minutes.
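
Putting those timings on a common per-minute basis, as a rough sketch: this assumes that "5K" means exactly 5,000 documents and that the two instances in (d) ran at the same time, neither of which is spelled out above.

```python
def seconds(mmss: str) -> int:
    """Convert an 'M:SS' elapsed time into seconds."""
    minutes, secs = mmss.split(":")
    return int(minutes) * 60 + int(secs)


def per_minute(docs: int, elapsed_s: int) -> float:
    """Documents processed per minute."""
    return docs / elapsed_s * 60


single = per_minute(5000, seconds("5:57"))       # result (a): one job, 5K docs
instance_1 = per_minute(2500, seconds("5:33"))   # result (d), first instance
instance_2 = per_minute(2500, seconds("5:16"))   # result (d), second instance
# Assumption: both instances started together, so wall time is the slower one.
combined = per_minute(5000, max(seconds("5:33"), seconds("5:16")))

print(f"single job:     {single:.0f} XMLs/min")
print(f"instance 1:     {instance_1:.0f} XMLs/min")
print(f"instance 2:     {instance_2:.0f} XMLs/min")
print(f"two instances:  {combined:.0f} XMLs/min combined")
```

Under those assumptions the single job ran at about 840 XMLs per minute and the two instances together at about 901, so each instance handled half the data in nearly the same elapsed time as the full run.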

I'm not sure what other options there are to try.

Thanks,
Edgar