Lookup stage performance versus Merge stage performance

Posted: Wed Jan 12, 2011 5:14 pm
by Nagin
Hi,
I have a job which parses an XML file and does a lookup against a dataset (a table dump): if the key already exists it returns the existing key, otherwise it generates a new key and writes it to a dataset.
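The key-assignment logic described above can be sketched roughly as follows. This is a hypothetical illustration in Python, not DataStage code; the function and variable names are invented for the example.

```python
def assign_keys(incoming, reference):
    """Assign surrogate keys to incoming natural keys.

    reference: dict mapping natural key -> existing surrogate key.
    Returns a list of (natural key, surrogate key) pairs; new keys
    are added to `reference` as they are generated.
    """
    next_key = max(reference.values(), default=0) + 1
    out = []
    for natural in incoming:
        if natural in reference:
            # Key already exists: reuse its surrogate key.
            out.append((natural, reference[natural]))
        else:
            # New key: allocate the next surrogate value.
            reference[natural] = next_key
            out.append((natural, next_key))
            next_key += 1
    return out
```

For example, with a reference of `{"a": 1}`, the stream `["a", "b", "a"]` reuses key 1 for "a" and generates key 2 for "b".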

My data volumes are really huge, so once the lookup dataset got close to 3.5 GB the job started failing due to lack of temp space. I thought that replacing the Lookup stage with a Merge stage would help this situation and improve the performance of the job as well.

But I don't see any improvement. In fact, the job with the Lookup runs 30 seconds faster. This is with a volume a little above 1 million rows.

Are there any specific parameters I need to enable for Merge to perform faster?

Thanks for your help.

Posted: Wed Jan 12, 2011 6:35 pm
by vmcburney
What is the slow part of your job, the lookup/merge or parsing the XML file? If you have the new XML assembly that can be added to DataStage 8.5 you should get massive XML processing improvements. If you are parsing it using a sequential file stage you could try multiple readers.

Posted: Wed Jan 12, 2011 6:50 pm
by Nagin
vmcburney wrote:What is the slow part of your job, the lookup/merge or parsing the XML file? If you have the new XML assembly that can be added to DataStage 8.5 you should get massive XML processing improvements. If you are parsing it using a sequential file stage you could try multiple readers.
Throughput after XML parsing and up to the Merge is over 7,000 rows per second, but after the Merge it drops to 1,100 rows per second.

Also, we are on 8.1 here. Can we get the new XML assembly as a patch for 8.1? Do you know?

Posted: Thu Jan 13, 2011 3:26 am
by Sreenivasulu
As far as I know, the XML assembly patch is available only for 8.5.

Regards
Sreeni

Posted: Thu Jan 13, 2011 4:44 am
by ray.wurlod
Relative performance between Lookup and Merge stages is irrelevant, because they perform different tasks.
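To make the "different tasks" point concrete: a Lookup-style join builds an in-memory table from the entire reference input and probes it row by row, while a Merge-style join walks two key-sorted streams with roughly constant memory. The sketch below is a simplified Python illustration of that contrast, not DataStage internals; the function names are invented for the example.

```python
def hash_lookup(stream, reference):
    """Lookup-style: hold the whole reference in memory, probe per row."""
    table = dict(reference)  # entire reference resident in memory
    return [(k, table.get(k)) for k in stream]

def sorted_merge(stream, reference):
    """Merge-style: both inputs must already be sorted on the key."""
    out, i, ref = [], 0, list(reference)
    for k in stream:
        # Advance the reference cursor to the first key >= k.
        while i < len(ref) and ref[i][0] < k:
            i += 1
        if i < len(ref) and ref[i][0] == k:
            out.append((k, ref[i][1]))
        else:
            out.append((k, None))  # unmatched stream row
    return out
```

Both produce the same matches on sorted inputs, but their costs differ: the hash probe needs memory proportional to the reference (hence the temp-space failure at 3.5 GB), while the merge needs both inputs sorted first, which is where the engine-inserted sorts come in.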

Posted: Thu Jan 13, 2011 11:17 am
by jwiles
Along the lines of different tasks: if sort and partition insertion has not been disabled, the engine likely inserted a sort and partition on each input to the Merge, if they weren't already in the job design. No sorts would be inserted for a Lookup stage, which typically uses only Entire partitioning on the reference inputs. With inserted sorts, no data flows out of a sort until the full stream has been sorted, and that can affect the rows/sec displayed in the monitor/performance statistics (the displayed value is essentially the average since the job began processing data, not an instantaneous value for the stage itself).
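The averaging effect is easy to see with a small hypothetical calculation (the numbers below are invented for illustration): a stage that is blocked during an upstream sort emits no rows for a while, and that idle period keeps dragging the displayed average down even after the stage reaches full speed.

```python
def average_rate(total_rows, elapsed_seconds):
    """Monitor-style rows/sec: cumulative rows over total elapsed time."""
    return total_rows / elapsed_seconds

# Suppose a stage sat blocked for 60 s while an inserted sort consumed
# its input, then emitted 1,000,000 rows over the next 100 s
# (10,000 rows/s instantaneous). The monitor would show the average:
avg = average_rate(1_000_000, 60 + 100)  # 6250.0 rows/s displayed
```

So a Merge job with inserted sorts can show a much lower rows/sec figure than a Lookup job even when its steady-state throughput is comparable.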

Regards,