Substitution for Lookup Stage in DS Job

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

creatingfusion
Participant
Posts: 46
Joined: Tue Jul 20, 2010 1:26 pm
Location: USA
Contact:

Substitution for Lookup Stage in DS Job

Post by creatingfusion »

I have a DataStage job which uses a Lookup stage to look up some key values coming from two links.
The volume of incoming records is quite high, around 50 million, so job performance is badly hampered.

I need suggestions from the group on how the job's performance can be improved, and on what could be used in place of the Lookup stage.

Thanks to all in advance.
Abhijit
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

What type of lookup are you performing?

Which input to the lookup stage is receiving 50000000 records?

Are you performing any partitioning and sorting on the inputs to the lookup stage?

And, most importantly, what do you mean by "performance is badly hampered"? How long does the job run?

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
creatingfusion
Participant
Posts: 46
Joined: Tue Jul 20, 2010 1:26 pm
Location: USA
Contact:

Post by creatingfusion »

It's a simple lookup with Entire partitioning, and the job sometimes handles about 50M rows on the input link, so its runtime is around 3 hours.

So please suggest a solution.
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

Something else I should've asked :) -- How many partitions are you running the job with and how large is the reference data?

Generally, a normal lookup is very quick once the reference table has been loaded, running about as quickly as data can be fed to it from upstream and accepted downstream. It's very unlikely that the lookup itself is your bottleneck in this case.
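
As a rough illustration of why that is, here is a minimal sketch in plain Python (not DataStage/OSH, and the field names cust_id and region are invented for the example): the reference link is loaded into an in-memory table once, and each incoming row then costs only a single probe, so the stage keeps pace with whatever rate rows arrive at -- as long as the reference data fits in RAM.

def normal_lookup(stream_rows, reference_rows):
    # Build phase: the whole reference link is held in memory, keyed on
    # the lookup key, so memory use grows with the size of the reference data.
    ref_table = {ref["cust_id"]: ref for ref in reference_rows}

    # Probe phase: each incoming row is a single hash probe, so throughput
    # is normally limited by upstream/downstream speed, not the lookup itself.
    for row in stream_rows:
        match = ref_table.get(row["cust_id"])
        row["region"] = match["region"] if match else None
        yield row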

Depending upon the processing capacity of your server, 50mm rows in 3 hours may be ok. Do other jobs processing the same amount of data run much quicker?

If you feel the performance is bad, consider the following:
1) The source of your 50mm records - what is it and how quickly can the records be supplied to the lookup? For example, 50mm records produced by a complex SQL query can take a while to begin entering the lookup stage.
2) What else does your job do before and after the lookup? One or more of these functions may be the bottleneck.

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

It would be useful to run the Performance Analysis tool over this job.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
iHijazi
Participant
Posts: 45
Joined: Sun Oct 24, 2010 12:05 am

Post by iHijazi »

I'd suggest using the Join stage. A Lookup stage with so much data can lead to memory leaks/corrupted data. Also, is the data coming from the reference link bigger than the main stream? If it is, that's an additional reason to use the Join stage.

In my overall experience, with big-data cases like this one, it's usually better to use the Join stage.

Let me know how it goes.

Cheers.
Not only thoughts, but a little bit of experience.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

iHijazi wrote:Look up stage with so much data can lead to memory leaks/corrupted data.
Can you please provide some proof of this assertion? It is not something I have ever encountered.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
iHijazi
Participant
Posts: 45
Joined: Sun Oct 24, 2010 12:05 am

Post by iHijazi »

Sure buddy,

Check this out from the technical documentation:
"a Lookup stage might thrash because the reference data sets might not fit in RAM along with everything else that has to be in RAM. This results in very slow performance since each lookup operation can, and typically does, cause a page fault and an I/O operation.

I know you are going to point out the memory leaks. A memory leak, basically, is when "dynamically allocated memory has become unreachable". And I have seen that happen: I was monitoring certain jobs about a month ago, and it turned out very ugly, my friend :)

on the other hand: "A join does a high-speed sort on the driving and reference data sets. This can involve I/O if the data is big enough, but the I/O is all highly optimized and sequential. After the sort is over, the join processing is very fast and never involves paging or other I/O."
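
For contrast, a matching sketch of the sort-merge idea behind the Join stage (again plain Python with the same invented field names, not DataStage/OSH; it assumes both inputs are already sorted on the key and that the reference keys are unique): the two inputs are merged in one forward, sequential pass, so neither side ever has to sit in memory.

def merge_join(sorted_stream, sorted_reference):
    ref_iter = iter(sorted_reference)
    ref = next(ref_iter, None)
    for row in sorted_stream:
        # Advance the reference cursor until it reaches the current stream key;
        # both inputs are read strictly forward, so the I/O stays sequential.
        while ref is not None and ref["cust_id"] < row["cust_id"]:
            ref = next(ref_iter, None)
        # Left-outer behaviour: attach the matching reference data, else None.
        if ref is not None and ref["cust_id"] == row["cust_id"]:
            row["region"] = ref["region"]
        else:
            row["region"] = None
        yield row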

Not going to make a CS class here, but hope that helps.

Cheers.
Not only thoughts, but a little bit of experience.
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

All good reasons to be aware of the resources available on the target system and design accordingly. While memory swapping is definitely a performance killer (something I'm sure we've all seen on undersized/overutilized systems), so is sort work file I/O when not optimized for the size of your data and sort requirements.

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.