Substitution for Lookup Stage in DS Job
Moderators: chulett, rschirm, roy
-
- Participant
- Posts: 46
- Joined: Tue Jul 20, 2010 1:26 pm
- Location: USA
- Contact:
Substitution for Lookup Stage in DS Job
I have a DataStage job which uses a Lookup stage to look up on some key values coming from two links.
The volume of incoming records is quite high, around 50 million, so the job's performance is badly hampered.
I need suggestions from the group on ways the job's performance can be improved and on what can be done in place of the Lookup stage.
Thanks to all in advance.
Abhijit
What type of lookup are you performing?
Which input to the Lookup stage is receiving 50,000,000 records?
Are you performing any partitioning and sorting on the inputs to the Lookup stage?
And, most important, what do you mean by "performance is badly hampered"? How long does the job run?
Regards,
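To illustrate what I mean by partitioning the inputs consistently, here's a rough sketch in plain Python (not DataStage internals, just the general idea): if both links are hash-partitioned on the lookup key, matching rows from the stream and the reference are guaranteed to land in the same partition, so each partition can perform its lookups independently.

```python
# Illustrative sketch of key-based hash partitioning (plain Python,
# not DataStage). Names like hash_partition are made up for this example.

def hash_partition(rows, key, n_parts):
    """Distribute rows into n_parts buckets by hashing the key column."""
    parts = [[] for _ in range(n_parts)]
    for row in rows:
        parts[hash(row[key]) % n_parts].append(row)
    return parts

stream = [{"id": 1, "val": "a"}, {"id": 2, "val": "b"}, {"id": 3, "val": "c"}]
reference = [{"id": 1, "desc": "one"}, {"id": 3, "desc": "three"}]

stream_parts = hash_partition(stream, "id", 2)
ref_parts = hash_partition(reference, "id", 2)

# Because both sides hash the same key the same way, any id present on
# both links appears in the same partition on both sides.
matches = {p: sorted({r["id"] for r in stream_parts[p]} & {r["id"] for r in ref_parts[p]})
           for p in range(2)}
print(matches)
```

If the two inputs were partitioned differently (say, round-robin on one side), matching keys could end up in different partitions and the per-partition lookup would silently miss rows.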
- james wiles
All generalizations are false, including this one - Mark Twain.
-
- Participant
- Posts: 46
- Joined: Tue Jul 20, 2010 1:26 pm
- Location: USA
- Contact:
Something else I should've asked
-- How many partitions are you running the job with and how large is the reference data?
Generally, a normal lookup is very quick once the reference table has been loaded, running about as quickly as data can be fed to it from upstream and accepted downstream. It's very unlikely that the lookup itself is your bottleneck in this case.
Depending upon the processing capacity of your server, 50mm rows in 3 hours may be OK. Do other jobs processing the same amount of data run much quicker?
If you feel the performance is bad, consider the following:
1) The source of your 50mm records: what is it, and how quickly can the records be supplied to the lookup? For example, 50mm records produced by a complex SQL query can take a while to begin entering the Lookup stage.
2) What else does your job do before and after the lookup? One or more of those operations may be the bottleneck.
Regards,
- james wiles
All generalizations are false, including this one - Mark Twain.
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
I'd suggest using the Join stage. A Lookup stage with so much data can lead to memory leaks/corrupted data. Also, is the data coming from the reference link bigger than the main stream? If yes, then that is an additional reason to use the Join stage.
In my overall experience, with big-data cases like this one, it's usually better to use the Join stage.
Let me know how it goes.
Cheers.
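For what it's worth, here's a conceptual sketch in plain Python (not DataStage's actual implementation) of why a normal lookup's memory footprint grows with the reference data: the entire reference link is loaded into an in-memory table before the stream is probed row by row.

```python
# Conceptual sketch of a normal (in-memory) lookup, plain Python.
# The function name and rows are invented for illustration.

def lookup_join(stream, reference, key):
    # The whole reference set is materialized in RAM up front; this is
    # what can start paging when the reference data is very large.
    ref_table = {row[key]: row for row in reference}
    for row in stream:
        match = ref_table.get(row[key])   # O(1) probe per stream row
        if match is not None:
            yield {**row, **match}

stream = [{"id": i} for i in range(5)]
reference = [{"id": 1, "desc": "one"}, {"id": 3, "desc": "three"}]
print(list(lookup_join(stream, reference, "id")))
```

The probe itself is cheap; the cost is holding `ref_table` in memory, which is why the reference-link size, not the stream size, decides whether a normal lookup is viable.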
Not only thoughts, but a little bit of experience.
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
iHijazi wrote: Look up stage with so much data can lead to memory leaks/corrupted data.
Can you please provide some proof of this assertion? It is not something I have ever encountered.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Sure buddy,
Check this out, from the technical documentation:
"A Lookup stage might thrash because the reference data sets might not fit in RAM along with everything else that has to be in RAM. This results in very slow performance, since each lookup operation can, and typically does, cause a page fault and an I/O operation."
I know you are going to point out the memory leaks. A memory leak, basically, is "dynamically allocated memory that has become unreachable". And I have seen that happen: I was monitoring certain jobs about a month ago, and it turned out very ugly, my friend.
On the other hand: "A join does a high-speed sort on the driving and reference data sets. This can involve I/O if the data is big enough, but the I/O is all highly optimized and sequential. After the sort is over, the join processing is very fast and never involves paging or other I/O."
Not going to turn this into a CS class, but I hope that helps.
Cheers.
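The sort-merge behaviour that quote describes can be sketched like this in plain Python (illustrative only, assuming unique keys and both inputs already sorted on the key): once sorted, the join walks both inputs in lockstep and never holds more than the current row from each side in memory.

```python
# Sketch of a sort-merge join over two key-sorted inputs (plain Python,
# unique keys assumed for brevity; merge_join is a made-up name).

def merge_join(left, right, key):
    left, right = iter(left), iter(right)
    l, r = next(left, None), next(right, None)
    while l is not None and r is not None:
        if l[key] < r[key]:
            l = next(left, None)          # advance the smaller side
        elif l[key] > r[key]:
            r = next(right, None)
        else:
            yield {**l, **r}              # keys match: emit joined row
            l, r = next(left, None), next(right, None)

driving = sorted(({"id": i} for i in (3, 1, 4, 2)), key=lambda x: x["id"])
reference = sorted([{"id": 2, "d": "two"}, {"id": 4, "d": "four"}],
                   key=lambda x: x["id"])
print(list(merge_join(driving, reference, "id")))
```

The memory cost is constant during the merge; all the heavy lifting happens in the sort, and that I/O is sequential rather than the random page faults an oversized in-memory lookup can trigger.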
Not only thoughts, but a little bit of experience.
All good reasons to be aware of the resources available on the target system and design accordingly. While memory swapping is definitely a performance killer (something I'm sure we've all seen on undersized/overutilized systems), so is sort work file I/O when not optimized for the size of your data and sort requirements.
Regards,
- james wiles
All generalizations are false, including this one - Mark Twain.