Substitution for Lookup Stage in DS Job

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

creatingfusion
Participant
Posts: 46
Joined: Tue Jul 20, 2010 1:26 pm
Location: USA
Contact:

Substitution for Lookup Stage in DS Job

Post by creatingfusion »

I have a DataStage job which uses a Lookup stage to look up some key values coming from two links.
The volume of incoming records is quite high, around 50 million, so job performance is badly hampered.

I need suggestions from the group on how the job's performance can be improved, and on what could be used in place of the Lookup stage.

Thanks to all in advance.
Abhijit
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

What type of lookup are you performing?

Which input to the lookup stage is receiving 50000000 records?

Are you performing any partitioning and sorting on the inputs to the lookup stage?

And, most importantly, what do you mean by "performance is badly hampered"? How long does the job run?

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
creatingfusion
Participant
Posts: 46
Joined: Tue Jul 20, 2010 1:26 pm
Location: USA
Contact:

Post by creatingfusion »

It's a simple lookup with Entire partitioning, and the job sometimes handles about 50M rows on the input link, so its runtime is around 3 hours.

So please suggest a solution.
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

Something else I should've asked :) -- How many partitions are you running the job with and how large is the reference data?

Generally, a normal lookup is very quick once the reference table has been loaded, running about as quickly as data can be fed to it from upstream and accepted downstream. It's very unlikely that the lookup itself is your bottleneck in this case.
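
As a rough illustration of why that is, here is a minimal sketch in plain Python (not DataStage/OSH, and the field names cust_id and region are invented for the example): the reference link is loaded into an in-memory table once, and each incoming row then costs only a single probe, so the stage keeps pace with whatever rate rows arrive at -- as long as the reference data fits in RAM.

def normal_lookup(stream_rows, reference_rows):
    # Build phase: the whole reference link is held in memory, keyed on
    # the lookup key, so memory use grows with the size of the reference data.
    ref_table = {ref["cust_id"]: ref for ref in reference_rows}

    # Probe phase: each incoming row is a single hash probe, so throughput
    # is normally limited by upstream/downstream speed, not the lookup itself.
    for row in stream_rows:
        match = ref_table.get(row["cust_id"])
        row["region"] = match["region"] if match else None
        yield row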

Depending upon the processing capacity of your server, 50mm rows in 3 hours may be ok. Do other jobs processing the same amount of data run much quicker?

If you feel the performance is bad, consider the following:
1) The source of your 50mm records - what is it and how quickly can the records be supplied to the lookup? For example, 50mm records produced by a complex SQL query can take a while to begin entering the lookup stage.
2) What else does your job do before and after the lookup? One or more of these functions may be the bottleneck.

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

It would be useful to run the Performance Analysis tool over this job.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
iHijazi
Participant
Posts: 45
Joined: Sun Oct 24, 2010 12:05 am

Post by iHijazi »

I'd suggest using the Join stage. A Lookup stage with so much data can lead to memory leaks/corrupted data. Also, is the data coming from the reference link bigger than the main stream? If it is, that's an additional reason to use the Join stage.

In my overall experience, with big-data cases like this one, it's usually better to use the Join stage.

Let me know how it goes.

Cheers.
Not only thoughts, but a little bit of experience.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

iHijazi wrote:Look up stage with so much data can lead to memory leaks/corrupted data.
Can you please provide some proof of this assertion? It is not something I have ever encountered.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
iHijazi
Participant
Posts: 45
Joined: Sun Oct 24, 2010 12:05 am

Post by iHijazi »

Sure buddy,

Check this out from the technical documentation:
"a Lookup stage might thrash because the reference data sets might not fit in RAM along with everything else that has to be in RAM. This results in very slow performance since each lookup operation can, and typically does, cause a page fault and an I/O operation.

I know you are going to point out the memory leaks. A memory leak, basically, is when "dynamically allocated memory has become unreachable". And I have seen that happen: I was monitoring certain jobs about a month ago, and it turned out very ugly, my friend :)

on the other hand: "A join does a high-speed sort on the driving and reference data sets. This can involve I/O if the data is big enough, but the I/O is all highly optimized and sequential. After the sort is over, the join processing is very fast and never involves paging or other I/O."
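
For contrast, a matching sketch of the sort-merge idea behind the Join stage (again plain Python with the same invented field names, not DataStage/OSH; it assumes both inputs are already sorted on the key and that the reference keys are unique): the two inputs are merged in one forward, sequential pass, so neither side ever has to sit in memory.

def merge_join(sorted_stream, sorted_reference):
    ref_iter = iter(sorted_reference)
    ref = next(ref_iter, None)
    for row in sorted_stream:
        # Advance the reference cursor until it reaches the current stream key;
        # both inputs are read strictly forward, so the I/O stays sequential.
        while ref is not None and ref["cust_id"] < row["cust_id"]:
            ref = next(ref_iter, None)
        # Left-outer behaviour: attach the matching reference data, else None.
        if ref is not None and ref["cust_id"] == row["cust_id"]:
            row["region"] = ref["region"]
        else:
            row["region"] = None
        yield row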

Not going to make a CS class here, but hope that helps.

Cheers.
Not only thoughts, but a little bit of experience.
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

All good reasons to be aware of the resources available on the target system and design accordingly. While memory swapping is definitely a performance killer (something I'm sure we've all seen on undersized/overutilized systems), so is sort work file I/O when not optimized for the size of your data and sort requirements.

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.