Performance problem while processing xml file

HemaV
Participant
Posts: 63
Joined: Wed Jan 09, 2008 1:38 am
Location: Bangalore

Post by HemaV »

I have a sequential file containing 50,000 XML records. I am splitting the XML tags into columns and writing them to a sequential file, with a job laid out as: Sequential File ---- XML Input stage ---- Transformer ---- target Sequential File. It is taking more than 5 hours to process.

Can anyone let me know how I can improve the performance?

Thanks in advance.
eostic
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

Without knowing things about your machine, it's not possible to be sure that this will help, but consider that there may be some benefit to using Inter-process row buffering in combination with manual parallelism to send the rows to independent XMLInput Stages....and then re-collect them later. If the performance issue is really on the XMLStages, running multiples of them in parallel may help. This is a fairly detailed concept in and of itself, so do some reading on it.
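As a rough illustration of the partition-and-recollect idea above, here is a Python sketch (not DataStage job code). The names `partition`, `parse_stream` and `parse_partitioned` are made up for this example, and the `orderid` element is only an assumed field. Threads are used here just to show the shape of the flow; in a Server job each XMLInput stage would sit in its own process behind inter-process row buffering.

```python
from concurrent.futures import ThreadPoolExecutor
import xml.etree.ElementTree as ET


def partition(records, n):
    """Round-robin split of the rows into n streams (the Link Partitioner role)."""
    return [records[i::n] for i in range(n)]


def parse_stream(stream):
    """Stand-in for one XMLInput stage: extract one field from each record."""
    return [ET.fromstring(xml_text).findtext("orderid") for xml_text in stream]


def parse_partitioned(records, n=3):
    """Parse n streams concurrently, then re-collect rows in their original order."""
    streams = partition(records, n)
    with ThreadPoolExecutor(max_workers=n) as pool:
        results = list(pool.map(parse_stream, streams))
    collected = [None] * len(records)
    for i, part in enumerate(results):
        # Stream i held rows i, i+n, i+2n, ... so put them back in those slots.
        collected[i::n] = part
    return collected
```

Whether splitting helps depends on where the time actually goes; if the parse itself is not the bottleneck, the extra streams will not buy much.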

XML is not a screamer, though, so expect it to be slower than other functions. If the xml strings coming from the file are really long with lots of elements, it may simply be the time required to load it up into memory. Also, avoid validation...if you have that checked it will slow things down even further.

How variable are your values in XML? It's rare, but I saw a case once (just like this) where the strings in the sequential file were "fixed records" stored in XML......consequently, every element and attribute value was the same length. We spent the time to define the entire file as fixed length, just skipping all of the predictable tags as "filler" in the meta data and eliminating xml processing altogether.
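To make the fixed-length trick concrete, here is a small Python sketch. It only applies when every value has the same width in every record. The record layout, the tag names and the `slice_fixed` helper are all hypothetical; in DataStage the equivalent would be fixed-width column definitions with the tags declared as filler.

```python
# Hypothetical layout for records shaped like:
#   <rec><id>AB1234</id><dt>20090808</dt></rec>
# Each entry is (column name, start offset, length); everything else is filler.
LAYOUT = [
    ("id", 9, 6),   # skips "<rec><id>" (9 characters)
    ("dt", 24, 8),  # skips "</id><dt>" after the id value
]


def slice_fixed(record, layout=LAYOUT):
    """Cut the values out by position, with no XML parsing at all."""
    return {name: record[start:start + length] for name, start, length in layout}
```

This breaks as soon as a single value changes length, so it is only worth considering if the producer of the file guarantees the widths.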

Ernie
Ernie Ostic

blogit!
<a href="https://dsrealtime.wordpress.com/2015/0 ... ere/">Open IGC is Here!</a>
HemaV
Participant
Posts: 63
Joined: Wed Jan 09, 2008 1:38 am
Location: Bangalore

Post by HemaV »

I am not doing validation in the XML Input stage; that option was already unchecked.

I added InterProcess stages between the Sequential File and XML Input stages, like this:
SequentialFile ---- InterProcess ---- Transformer ---- InterProcess ---- XMLInputStage ----. From the XML Input stage I pass the data down three different links, each with different logic. Processing is now very slow between InterProcess ---- XMLInputStage ---- 3 links.

The strings in the sequential file are variable length.

Please let me know of any other way to do this.
Sreenivasulu
Premium Member
Posts: 892
Joined: Thu Oct 16, 2003 5:18 am

Post by Sreenivasulu »

Hi,

DataStage loads flat files fast, but not XML files. XML Input stages are used to load status information and exchange information, but not the input data.

Regards
Sreeni
HemaV
Participant
Posts: 63
Joined: Wed Jan 09, 2008 1:38 am
Location: Bangalore

Post by HemaV »

Hi,
But I need to split each XML record coming from a sequential file into different fields based on the XSD structure and write them to a target sequential file.
Is there any other way of processing an XML record?
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Answer Ernie's questions about your xml. Without the gory details, no-one can provide much more help. Five hours for that many records does seem too long for just file-to-file processing, unless there are a crapload of elements to parse. :?
-craig

"You can never have too many knives" -- Logan Nine Fingers
eostic
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

Have you tried sending the data from the flat file, through a Transformer, to many XMLInput stages? If it's such a painful XML document to parse, perhaps you will have gains by using inter-process buffering and having those XML documents parsed concurrently. This might cause you bottlenecks later on, but it's worth looking into.

Ernie
HemaV
Participant
Posts: 63
Joined: Wed Jan 09, 2008 1:38 am
Location: Bangalore

Post by HemaV »

I am running the DS jobs on DataStage Server version 7.5.

When running for one XML record, I put the InterProcess stage before the XMLInput stage, but the output of the XMLInput stage goes into three different links with different logic, and at the end I need to do lookups on these three links to get one XML record in the target file.

When I ran the job with InterProcess, Link Partitioner and Link Collector stages, and with only one of the three links coming from the XMLInput stage, it took 1 hour to process 50K records.

I think it is taking a long time because I am taking three output links from one XMLInput stage.

In addition, I got a suggestion: "We had to tweak the XSLT logic in XSL file." But I don't know what this XSLT logic is.

Kindly give me your suggestions on this.
eostic
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

You still should try having "multiple" XMLInput Stage instances.... not just multiple links coming from one. Get the XML to be processed in separate processes and see what impact that has...

Ernie
HemaV
Participant
Posts: 63
Joined: Wed Jan 09, 2008 1:38 am
Location: Bangalore

Post by HemaV »

When I try to run with multiple links for XML tags like these:
<servicechar><name>AAA</name><value>111</value></servicechar>
<servicechar><name>BBB</name><value>222</value></servicechar>
<servicechar><name>CCC</name><value>333</value></servicechar>
<servicechar><name>DDD</name><value>444</value></servicechar>
<servicechar><name>EEE</name><value>555</value></servicechar>

I am having trouble splitting these tags into different columns.

I am unable to process the above XML using the parallel mechanism in DataStage Server. It is again running at less than 1 row/sec for these tags.

Can you please suggest how I can read such tags faster?
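For reference, the target shape being described (repeating name/value pairs turned into one column per name) can be sketched outside DataStage in a few lines of Python. This is only an illustration of the pivot, not a fix for the XMLInput stage; `pairs_to_columns` is a made-up helper, and the wrapper `<root>` element is added because the fragment above has no single root.

```python
import xml.etree.ElementTree as ET


def pairs_to_columns(fragment, tag="servicechar"):
    """Pivot repeating <name>/<value> pairs into a {name: value} row."""
    # The pairs arrive as siblings without a parent, so wrap them first.
    root = ET.fromstring("<root>" + fragment + "</root>")
    return {elem.findtext("name"): elem.findtext("value") for elem in root.iter(tag)}


sample = (
    "<servicechar><name>AAA</name><value>111</value></servicechar>"
    "<servicechar><name>BBB</name><value>222</value></servicechar>"
    "<servicechar><name>CCC</name><value>333</value></servicechar>"
)
row = pairs_to_columns(sample)
# row == {"AAA": "111", "BBB": "222", "CCC": "333"}
```

If the same name can repeat within one record, a plain dictionary would keep only the last value, so that case would need a list per name instead.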
HemaV
Participant
Posts: 63
Joined: Wed Jan 09, 2008 1:38 am
Location: Bangalore

Post by HemaV »

My XML looks like this:

<soap>
<orderid>AB1234</orderid>
<orderdate>20090808</orderdate>
<eDateLst>
<eDate><name>ABC</name><value>20090817</value></eDate>
<eDate><name>CDE</name><value>20090827</value></eDate>
<eDate><name>EFG</name><value>20090837</value></eDate>
<eDate><name>GHI</name><value>20090847</value></eDate>
</eDateLst>
<eName>Hello</eName>
<eAddr>12 4th Street</eAddr>
<service><name>AAA</name><value>1111</value></service>
<service><name>BBB</name><value>2222</value></service>
<service><name>CCC</name><value>3333</value></service>
<service><name>DDD</name><value>4444</value></service>
<service><name>EEE</name><value>5555</value></service>
<srcd>
<servicecode>
<sequence>1</sequence>
<cpair><ap>hello</ap><dr>hai</dr></cpair>
</servicecode>
<servicecode>
<sequence>2</sequence>
<cer><ap>hello</ap><dr>hai</dr></cer>
------------------
-----------------
-------------------
<servicecode>
<sequence>n</sequence>
<cers><ap>hello</ap><dr>hai</dr></cers>
</servicecode>
</srcd>
</soap>

Can you please suggest how I can improve performance in DataStage Server jobs when reading this XML and dividing it into different fields?
The tag
<servicecode>
<sequence>1</sequence>
<cpair><ap>hello</ap><dr>hai</dr></cpair>
</servicecode>
occurs N times, and each occurrence gets inserted into a child table.

Can you please advise on this?
Is there any other way I can read this XML and split it across different fields? I also need to handle the sequence tags.
ragasambath
Participant
Posts: 12
Joined: Wed Oct 03, 2007 9:11 am
Location: London

Post by ragasambath »

Hello HemaV,

The XML stage can't handle huge volumes of input data and is not efficient.

The workaround is to convert the XML into a text file outside DataStage, using an XSLT processor such as Xalan (or a Java XML binding library such as JAXB). With XSLT, you have to design the XSL file that transforms the XML.

You can then read the flat file easily through the Sequential File stage.

We have already done this exercise; we are loading nearly 100,000 records (approx. 4,000,000 lines of XML) in 45 seconds.
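To show the general idea of flattening before loading, here is a minimal Python sketch. It uses the standard-library ElementTree parser as a stand-in, not XSLT or Xalan, and it is not the process described above. The element names come from the sample XML earlier in the thread; `flatten_order`, `to_delimited` and the pipe delimiter are assumptions made for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET


def flatten_order(xml_text):
    """Return one parent row plus one child row per <servicecode> entry."""
    doc = ET.fromstring(xml_text)
    order_id = doc.findtext("orderid")
    parent = [order_id, doc.findtext("orderdate"),
              doc.findtext("eName"), doc.findtext("eAddr")]
    children = []
    for code in doc.iter("servicecode"):
        seq = code.findtext("sequence")
        for pair in code:
            if pair.tag == "sequence":
                continue
            # The pair element name varies (cpair, cer, cers), so keep it as a column.
            children.append([order_id, seq, pair.tag,
                             pair.findtext("ap"), pair.findtext("dr")])
    return parent, children


def to_delimited(rows, delimiter="|"):
    """Render rows as delimited text that a Sequential File stage could read."""
    buf = io.StringIO()
    csv.writer(buf, delimiter=delimiter, lineterminator="\n").writerows(rows)
    return buf.getvalue()
```

The child rows carry the order id and sequence number, so they can be loaded into a child table and joined back to the parent row later.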

Regards

Ragasambath
