Performance problem while processing xml file
I have a sequential file with 50,000 XML records in it. I am splitting the XML tags into columns and writing them to a sequential file, like this: Sequential File ---- XML Input stage ---- Transformer ---- target Sequential File. It is taking more than 5 hours to process.
Can anyone let me know how I can improve the performance?
Thanks in advance.
Without knowing things about your machine, it's not possible to be sure that this will help, but there may be some benefit to using inter-process row buffering in combination with manual parallelism: send the rows to independent XMLInput Stages and then re-collect them later. If the performance issue is really in the XML stages, running several of them in parallel may help. This is a fairly detailed concept in and of itself, so do some reading on it.
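To make the split / process / re-collect idea concrete, here is a minimal sketch in Python rather than DataStage. It only illustrates the pattern, not how the Link Partitioner or IPC stages are implemented. The function names are hypothetical, and the element names are borrowed from the sample XML posted further down the thread.

```python
import xml.etree.ElementTree as ET
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def parse_record(xml_text):
    """Parse one XML record and return the fields of interest."""
    root = ET.fromstring(xml_text)
    return {
        "orderid": root.findtext("orderid"),
        "orderdate": root.findtext("orderdate"),
    }


def parse_partition(records):
    """Work done by one parser instance: parse its share of the rows."""
    return [parse_record(r) for r in records]


def parse_in_parallel(records, workers=4, executor_cls=ProcessPoolExecutor):
    """Round-robin rows across workers, parse them, then re-collect in order."""
    partitions = [records[i::workers] for i in range(workers)]
    with executor_cls(max_workers=workers) as pool:
        parsed = list(pool.map(parse_partition, partitions))
    # Undo the round-robin split so output order matches input order.
    collected = [None] * len(records)
    for i, part in enumerate(parsed):
        collected[i::workers] = part
    return collected


if __name__ == "__main__":
    rows = [
        f"<soap><orderid>AB{n}</orderid><orderdate>20090808</orderdate></soap>"
        for n in range(10)
    ]
    by_process = parse_in_parallel(rows, workers=3)
    # Threads give the same result and are handy for a quick sanity check.
    by_thread = parse_in_parallel(rows, workers=3, executor_cls=ThreadPoolExecutor)
    print(by_process == by_thread)
    for row in by_process:
        print(row)
```

Whether this pays off depends on where the time actually goes: if parsing is the bottleneck, several parsers help; if the disk or a downstream stage is the bottleneck, they will not.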
XML is not a screamer, though, so expect it to be slower than other functions. If the xml strings coming from the file are really long with lots of elements, it may simply be the time required to load it up into memory. Also, avoid validation...if you have that checked it will slow things down even further.
How variable are your values in XML? It's rare, but I saw a case once (just like this) where the strings in the sequential file were "fixed records" stored in XML......consequently, every element and attribute value was the same length. We spent the time to define the entire file as fixed length, just skipping all of the predictable tags as "filler" in the meta data and eliminating xml processing altogether.
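A minimal sketch of that fixed-length trick, in Python for illustration only. It assumes every value has exactly the same width in every record, which is the rare case described above. The record layout and names here are made up for the example.

```python
# Hypothetical record in which every value always has the same length.
SAMPLE = "<soap><orderid>AB1234</orderid><orderdate>20090808</orderdate></soap>"

# (column name, width). A name of None marks "filler": the predictable tag text
# that is skipped instead of parsed.
LAYOUT = [
    (None, len("<soap><orderid>")),
    ("orderid", 6),
    (None, len("</orderid><orderdate>")),
    ("orderdate", 8),
    (None, len("</orderdate></soap>")),
]


def slice_record(line, layout=LAYOUT):
    """Cut a fixed-width record into columns by position, with no XML parsing."""
    row, pos = {}, 0
    for name, width in layout:
        if name is not None:
            row[name] = line[pos:pos + width]
        pos += width
    if pos != len(line):
        # A length mismatch means the record is not really fixed width.
        raise ValueError(f"expected {pos} characters, got {len(line)}")
    return row


if __name__ == "__main__":
    print(slice_record(SAMPLE))
```

The length check matters: the approach silently produces garbage as soon as one value is longer or shorter than expected, so it is only safe when the data really is fixed width.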
Ernie
Ernie Ostic
blogit!
<a href="https://dsrealtime.wordpress.com/2015/0 ... ere/">Open IGC is Here!</a>
I am not doing validation in the XML Input stage; it was already unchecked.
I used the Inter-process (IPC) stage between the Sequential File and XML Input stages, like this:
SequentialFile----Interprocess----Transformer----Interprocess----XMLInputStage. From the XML Input stage I am passing the data to three different links with three different sets of logic. Now it is processing very slowly between Interprocess----XMLInputStage----3 links.
The strings in the sequential file are variable length.
Please let me know if there is any other way to do this.
Answer Ernie's questions about your XML. Without the gory details, no one can provide much more help. Five hours for that many records does seem too long for just file-to-file processing, unless there are a crapload of elements to parse.
-craig
"You can never have too many knives" -- Logan Nine Fingers
Have you tried sending the data from the flat file, through a Transformer, to many XML stages? If it's such a painful XML document to parse, perhaps you will see gains by using inter-process and having those XML documents processed concurrently. This might cause you bottlenecks later on, but it's worth looking into.
Ernie
I am running the DS jobs on DataStage Server version 7.5.
When running for one XML record, I put the Inter-process stage before the XML Input stage, but the output of the XML Input stage goes into three different links with different logic, and at the end I need to do lookups on these three links to get one XML record in the target file.
When I ran the job with the Inter-process, Link Partitioner, and Link Collector stages, and with only one of the three links coming from the XML Input stage, it took 1 hour to process 50K records.
I think it is taking a long time because I am taking three output links from one XML Input stage.
In addition, I got a suggestion: "We had to tweak the XSLT logic in XSL file." But I don't know what this XSLT logic is.
Could you kindly give me your suggestions on this?
You still should try having "multiple" XMLInput Stage instances.... not just multiple links coming from one. Get the XML to be processed in separate processes and see what impact that has...
Ernie
When I try to run with multiple links for XML tags like these:
<servicechar><name>AAA</name><value>111</value></servicechar>
<servicechar><name>BBB</name><value>222</value></servicechar>
<servicechar><name>CCC</name><value>333</value></servicechar>
<servicechar><name>DDD</name><value>444</value></servicechar>
<servicechar><name>EEE</name><value>555</value></servicechar>
I am having trouble splitting these tags into different columns.
I am unable to process the above XML using the parallel mechanism in DataStage Server. It is again running at less than 1 row/sec for these tags.
Can you please suggest how I can read such tags faster?
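For reference, the transformation being asked for (turning repeating name/value pairs into one row with a column per name) can be sketched in a few lines of Python. This is an illustration outside DataStage, not the XML Input stage's behaviour, and `pivot_pairs` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

FRAGMENT = (
    "<servicechar><name>AAA</name><value>111</value></servicechar>"
    "<servicechar><name>BBB</name><value>222</value></servicechar>"
    "<servicechar><name>CCC</name><value>333</value></servicechar>"
    "<servicechar><name>DDD</name><value>444</value></servicechar>"
    "<servicechar><name>EEE</name><value>555</value></servicechar>"
)


def pivot_pairs(xml_fragment, tag="servicechar"):
    """Turn repeating <name>/<value> pairs into one {name: value} row."""
    # The fragment has several top-level elements, so wrap it in a dummy root.
    root = ET.fromstring(f"<root>{xml_fragment}</root>")
    return {el.findtext("name"): el.findtext("value") for el in root.iter(tag)}


if __name__ == "__main__":
    print(pivot_pairs(FRAGMENT))
```

Each distinct `name` becomes a column, so the set of output columns is only stable if the names are known in advance.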
My XML looks like this:
<soap>
<orderid>AB1234</orderid>
<orderdate>20090808</orderdate>
<eDateLst>
<eDate><name>ABC</name><value>20090817</value></eDate>
<eDate><name>CDE</name><value>20090827</value></eDate>
<eDate><name>EFG</name><value>20090837</value></eDate>
<eDate><name>GHI</name><value>20090847</value></eDate>
</eDateLst>
<eName>Hello</eName>
<eAddr>12 4th Street</eAddr>
<service><name>AAA</name><value>1111</value></service>
<service><name>BBB</name><value>2222</value></service>
<service><name>CCC</name><value>3333</value></service>
<service><name>DDD</name><value>4444</value></service>
<service><name>EEE</name><value>5555</value></service>
<srcd>
<servicecode>
<sequence>1</sequence>
<cpair><ap>hello</ap><dr>hai</dr></cpair>
</servicecode>
<servicecode>
<sequence>2</sequence>
<cer><ap>hello</ap><dr>hai</dr></cer>
------------------
-----------------
-------------------
<servicecode>
<sequence>n</sequence>
<cers><ap>hello</ap><dr>hai</dr></cers>
</servicecode>
</srcd>
</soap>
Can you please suggest how I can improve performance in DataStage Server jobs when reading this XML and dividing it into different fields?
The tag
<servicecode>
<sequence>1</sequence>
<cpair><ap>hello</ap><dr>hai</dr></cpair>
</servicecode>
occurs N times, and each occurrence gets inserted into a child table.
Is there any other way I can read this XML and split it across different fields? I also need to handle the sequence tags.
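As a rough sketch of the parent/child split described above, the following Python reads one order and returns a parent row plus one child row per `<servicecode>`. It assumes a well-formed, trimmed version of the sample (all tags closed). The function name and the `kind` column are hypothetical.

```python
import xml.etree.ElementTree as ET

SAMPLE = (
    "<soap><orderid>AB1234</orderid><orderdate>20090808</orderdate>"
    "<eName>Hello</eName><eAddr>12 4th Street</eAddr>"
    "<srcd>"
    "<servicecode><sequence>1</sequence>"
    "<cpair><ap>hello</ap><dr>hai</dr></cpair></servicecode>"
    "<servicecode><sequence>2</sequence>"
    "<cer><ap>hello</ap><dr>hai</dr></cer></servicecode>"
    "</srcd></soap>"
)


def split_order(xml_text):
    """Return (parent_row, child_rows) for one order document."""
    root = ET.fromstring(xml_text)
    order_id = root.findtext("orderid")
    parent = {
        "orderid": order_id,
        "orderdate": root.findtext("orderdate"),
        "eName": root.findtext("eName"),
        "eAddr": root.findtext("eAddr"),
    }
    children = []
    for sc in root.iterfind("srcd/servicecode"):
        # The element next to <sequence> has a varying name (cpair, cer, cers).
        detail = next((el for el in sc if el.tag != "sequence"), None)
        children.append({
            "orderid": order_id,  # key back to the parent row
            "sequence": sc.findtext("sequence"),
            "kind": detail.tag if detail is not None else None,
            "ap": detail.findtext("ap") if detail is not None else None,
            "dr": detail.findtext("dr") if detail is not None else None,
        })
    return parent, children


if __name__ == "__main__":
    parent_row, child_rows = split_order(SAMPLE)
    print(parent_row)
    for child in child_rows:
        print(child)
```

Carrying `orderid` on every child row is what lets the child table be joined back to the parent later.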
Hello HemaV,
The XML stage is not efficient and can't handle a huge volume of input data.
The workaround is to convert the XML into a text file outside DataStage, using an XSLT processor such as Xalan (or a binding library such as JAXB). You have to design the XSL file that transforms the XML.
You can then read the flat file easily through the Sequential File stage.
We have already done this exercise; we are loading nearly 100,000 records (approx. 4,000,000 lines of XML) in 45 seconds.
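The same idea, flattening the XML to delimited text before DataStage reads it, can be sketched without XSLT. The example below is not the XSL-based solution described above. It swaps in Python's standard-library parser and assumes one complete XML record per input line. The column list and function name are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical set of top-level elements to keep as flat-file columns.
COLUMNS = ["orderid", "orderdate", "eName", "eAddr"]


def flatten(records, out, columns=COLUMNS):
    """Write one pipe-delimited line per XML record; return the row count."""
    writer = csv.writer(out, delimiter="|", lineterminator="\n")
    count = 0
    for line in records:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        root = ET.fromstring(line)
        # Missing elements become empty fields rather than failing the run.
        writer.writerow([root.findtext(col, default="") for col in columns])
        count += 1
    return count


if __name__ == "__main__":
    source = [
        "<soap><orderid>AB1234</orderid><orderdate>20090808</orderdate>"
        "<eName>Hello</eName><eAddr>12 4th Street</eAddr></soap>",
    ]
    target = io.StringIO()
    print(flatten(source, target), "record(s) written")
    print(target.getvalue(), end="")
```

In practice `records` would be an open input file and `out` an open output file; the resulting text can then be read by a plain Sequential File stage.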
Regards
Ragasambath