Lookup vs. Join

TonyInFrance
Premium Member
Posts: 288
Joined: Tue May 27, 2008 3:42 am
Location: Luxembourg

Lookup vs. Join

Post by TonyInFrance »

Hello world

This is a question that involves the use of lookups and joins - so basic DataStage fundamentals.

This is where my current knowledge on the subject stands, i.e. when to use what.

Considering that I want a left outer join: if the number of lines in my lookup dataset (the right-hand dataset for a join) is NOT more than a million lines, I can use a Lookup stage; otherwise it should be a Join. Am I correct?

When using a Lookup stage, the reference dataset should be partitioned using the 'Entire' partition method, while the stream should be hash partitioned and sorted on the lookup key(s). Am I correct?

When using a Join stage, both datasets should be hash partitioned and sorted on the join key(s). Am I correct?
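
To make sure we're talking about the same patterns, here's a rough Python sketch (not DataStage code - the 4-node layout and field names are made up for illustration) of the two partitioning strategies I mean:

Code: Select all
NUM_NODES = 4

def hash_partition(rows, key):
    # Key-based: rows sharing a key value always land on the same node.
    nodes = [[] for _ in range(NUM_NODES)]
    for row in rows:
        nodes[hash(row[key]) % NUM_NODES].append(row)
    return nodes

def entire_partition(rows):
    # Entire: every node receives a full copy of the reference data.
    return [list(rows) for _ in range(NUM_NODES)]

stream = [{"id": i, "val": i * 10} for i in range(8)]
reference = [{"id": i, "desc": "ref-%d" % i} for i in range(8)]

# Lookup pattern: hash the stream, 'Entire' the reference.
stream_nodes = hash_partition(stream, "id")
ref_nodes = entire_partition(reference)

# Join pattern: hash BOTH links on the join key, so matching keys
# meet on the same node (each node then sorts its partition).
left_nodes = hash_partition(stream, "id")
right_nodes = hash_partition(reference, "id")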

Assuming that the above are correct (if not, please correct me), I'm auditing 5-year-old jobs (DataStage v8.5 with a 4-node configuration file, where each node has its resource disk and resource scratchdisk defined on separate directories on the file system) in which I find that Lookup stages have been used with reference datasets as large as 100 million lines, partitioned using 'Entire'.

Thinking that I'd replace these with a Join stage, I did so, hash partitioning and sorting both the left and right datasets on the join keys.

Surprisingly, this further slows down the job, which now takes almost double the time it did earlier.

I had always thought this would optimize the job and make it zippier, since 100 million lines are no longer being copied to each node, and hash partitioning (and sorting) each dataset (left and right) on the key ensures that matching data (with respect to the key) from both datasets lands on the same node, making the actual join faster.

Where does it look like I'm going wrong? Should I just keep the hash partitioning and avoid sorting? My first guess is that sorting 100 million lines is where the unnecessary time is consumed. Is sorting necessary over and above hash partitioning for the Join stage?
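
For reference, my mental model of the Join stage is a sort-merge join along these lines (a minimal Python sketch with hypothetical field names, not DataStage internals) - which is why I assumed both inputs must be sorted:

Code: Select all
def merge_left_outer(left, right, key):
    # Both inputs MUST already be sorted on the join key for this
    # single forward pass to work.
    result, j = [], 0
    for l in left:
        while j < len(right) and right[j][key] < l[key]:
            j += 1                      # advance the (sorted) right side
        k, matched = j, False
        while k < len(right) and right[k][key] == l[key]:
            result.append({**l, **right[k]})
            matched = True
            k += 1
        if not matched:                 # left outer: keep unmatched left rows
            result.append({**l, "desc": None})
    return result

left = sorted([{"id": 2}, {"id": 1}, {"id": 3}], key=lambda r: r["id"])
right = sorted([{"id": 1, "desc": "a"}, {"id": 3, "desc": "c"}], key=lambda r: r["id"])
print(merge_left_outer(left, right, "id"))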

Thanks

Tony
Tony
BI Consultant - Datastage
TonyInFrance
Premium Member
Posts: 288
Joined: Tue May 27, 2008 3:42 am
Location: Luxembourg

Post by TonyInFrance »

Hi Thomas

Thanks very much for that.

I thus understand that it's not the total number of records on the reference link that decides whether a Join should be used in place of a Lookup; rather, it's the ratio between the number of lines on the reference and stream links that plays the decisive role? In that case, when should I prefer a Join instead of a Lookup?

And when I do use a Join, is it advisable to hash partition and sort each (left and right) link?

I will try out the round-robin method with the Lookup - but you're sure that for around 15 million rows on the reference link an 'Entire' partition is advisable? That means 15 million rows would be copied to each node... :-O
Tony
BI Consultant - Datastage
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

If you are in an SMP environment, "Entire" partitioning is managed via a single copy in shared memory, so there's no "copy all rows to all nodes" involved.
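
As a rough analogy only (Python, assuming a POSIX 'fork' start method - this illustrates the shared-memory idea, not how DataStage is implemented):

Code: Select all
import multiprocessing as mp

# A large read-only reference table, built ONCE in the parent process.
REFERENCE = {i: "ref-%d" % i for i in range(1_000_000)}

def lookup(key):
    # Forked workers see the parent's pages copy-on-write, so the table
    # is not shipped to each worker separately.
    return REFERENCE.get(key)

if __name__ == "__main__":
    mp.set_start_method("fork")     # POSIX only; the sharing hinges on fork
    with mp.Pool(4) as pool:
        print(pool.map(lookup, [1, 42, 999_999]))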

To use a Lookup stage the only restriction is that the reference data must fit in memory. You can time shift building of the index by pre-loading the reference data into a Lookup File Set. (Otherwise it's done during the main job's run time.)
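
The idea, sketched in Python (file name and field names are hypothetical - this just illustrates the time shift, not the Lookup File Set format):

Code: Select all
import pickle

def build_index(rows, key, path):
    # Run at an otherwise idle time (the "prep" job): build and persist
    # the keyed index so the main job doesn't pay for it.
    index = {r[key]: r for r in rows}
    with open(path, "wb") as f:
        pickle.dump(index, f)

def load_index(path):
    # Run inside the main job: just load the prebuilt index.
    with open(path, "rb") as f:
        return pickle.load(f)

reference = [{"id": i, "desc": "ref-%d" % i} for i in range(100)]
build_index(reference, "id", "lookup_fileset.pkl")   # earlier, idle time
index = load_index("lookup_fileset.pkl")             # main job run time
print(index[42]["desc"])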

The perfect scenario is to partition the reference data exactly as the stream data are partitioned. But it's so much easier - especially in an SMP environment where it's no cost - just to go with (Auto), which gives you Entire.

Do NOT use a partitioning algorithm on the reference data that is not Entire and not key-based. That is, use only Entire, Hash or Modulus. Any other will cause you to miss some lookups.
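
A tiny Python sketch of why (node count and field names assumed for illustration): with round robin, the reference row for a key can end up on a different node than the stream row that needs it.

Code: Select all
NUM_NODES = 4

def round_robin(rows):
    # NOT key-based: a row's node depends on arrival order, not its key.
    nodes = [[] for _ in range(NUM_NODES)]
    for i, row in enumerate(rows):
        nodes[i % NUM_NODES].append(row)
    return nodes

# Reference rows arriving in arbitrary order, round-robin partitioned.
reference = [{"id": k, "desc": "ref-%d" % k} for k in (5, 2, 7, 0, 3, 6, 1, 4)]
ref_nodes = round_robin(reference)       # the wrong choice for a reference link

misses = 0
for key in range(8):                     # stream rows, key-partitioned
    node = key % NUM_NODES               # stand-in for hash(key) % NUM_NODES
    if not any(r["id"] == key for r in ref_nodes[node]):
        misses += 1                      # matching reference row is elsewhere
print("%d of 8 lookups missed" % misses) # here: all 8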
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
TonyInFrance
Premium Member
Posts: 288
Joined: Tue May 27, 2008 3:42 am
Location: Luxembourg

Post by TonyInFrance »

Thanks Ray.

I'll try and find out what kind of environment we're in (SMP/MPP), but I'll go by the assumption that an 'Entire' partition is not going to affect the runtime too adversely.

I admit I'm not too familiar with the Lookup File Set stage - however, if what I understand is correct, this is not refreshed at job runtime? If that's true, then what if the lookup data changes between two runs of the main job?

Auto partitioning is something I've religiously tried to avoid, but if you say that in an SMP environment it isn't dangerous, I guess I can relent a bit... :-)

However, I do wonder: if the perfect scenario (as you put it) is to partition the reference data exactly as the stream data are partitioned, then why does convention state that for the 'Lookup' stage we should use 'Entire' partition for reference data? Shouldn't the rule/norm be that, whether it's Lookup, Join or Merge, the data on BOTH links should be hash partitioned and sorted? And why would an 'Auto' partition in this case translate to 'Entire' and not Hash?

As for your last tip about using only Entire or a key-based partitioning algorithm on reference data, that's what I strive to do - but if I'm on SMP, then even an 'Auto' here is automatically understood by DataStage as Entire?
Tony
BI Consultant - Datastage
asorrell
Posts: 1707
Joined: Fri Apr 04, 2003 2:00 pm
Location: Colleyville, Texas

Post by asorrell »

TonyInFrance wrote:
however if what I understand is correct, this is not refreshed at job runtime?
You are correct - Lookup File Sets are static lookup tables that must be refreshed manually.
TonyInFrance wrote:
Auto partitioning is something I've religiously tried to avoid
I understand this and used to do the same, but for the most part DataStage is pretty good at picking the correct settings for Auto partitioning. The danger it entails is that most developers use Auto because they don't really understand how partitioning works. They build poorly designed jobs that require lots of re-partitioning and then blame DataStage for the horrible performance. If you DO understand partitioning, and design the job to minimize re-partitioning, then Auto isn't a bad thing.
TonyInFrance wrote:
why does convention state that for the 'Lookup' stage we should use 'Entire' partition for reference data
Well - it was a "hidden" change done a few versions ago that really wasn't documented. And if I remember correctly, you only got the shared memory structure if you used "Auto" and let it set the "Entire" partitioning auto-magically... though I'm not certain that's still the case for the latest versions.
TonyInFrance wrote:
I'm on SMP then even an 'Auto' here is automatically understood by DataStage as Entire?
Yes - as long as it can figure out how to use a key-based look-up.
Andy Sorrell
Certified DataStage Consultant
IBM Analytics Champion 2009 - 2020
TonyInFrance
Premium Member
Premium Member
Joined: Tue May 27, 2008 3:42 am
Location: Luxembourg

Post by TonyInFrance »

asorrell wrote:
however if what I understand is correct, this is not refreshed at job runtime?
You are correct - Lookup File Sets are static lookup tables that must be refreshed manually.
Which means this is useful only if my lookup data doesn't change too often. In my case my reference (lookup) data changes every day, so if I were to use a Lookup File Set I'd have to refresh it before each run - which for me would be equivalent to using a Lookup stage within the job, since it serves the same purpose. Correct?
asorrell wrote:
Auto partitioning is something I've religiously tried to avoid
I understand this and used to do the same, but for the most part DataStage is pretty good at picking the correct settings for Auto partitioning. The danger it entails is that most developers use Auto because they don't really understand how partitioning works. They build poorly designed jobs that require lots of re-partitioning and then blame DataStage for the horrible performance. If you DO understand partitioning, and design the job to minimize re-partitioning, then Auto isn't a bad thing.
I'm not an authority on partitioning, but I think I know enough about at least the partitioning methods I most commonly use. That said, my current scenario is a source stage that has to procure reference data from multiple reference tables/datasets, where each lookup is on a different key value(s). Here repartitioning is inevitable. Should I therefore use Auto in this case, or 'Entire' for each reference link and hash/sort on each stream link, where I'm forced to specify the different lookup key combinations?

When it comes to choosing lookup vs. join (the original topic of this conversation), I guess I should consider each time whether the total lookup dataset fits in memory - which, as Ray said, is the SOLE criterion?
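
As a back-of-envelope check (all numbers below are assumptions for illustration, not measurements):

Code: Select all
# Rows x average row width vs. available memory - a rough feasibility
# test for "does the reference data fit in memory".
rows = 100_000_000        # reference rows, as in the jobs I'm auditing
avg_row_bytes = 60        # ASSUMED average width (key + payload)
ram_budget_gb = 8         # ASSUMED memory available to the lookup

needed_gb = rows * avg_row_bytes / 1024**3
print("need ~%.1f GB vs budget %d GB" % (needed_gb, ram_budget_gb))
# -> need ~5.6 GB vs budget 8 GB: borderline, so a Join looks safer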
Tony
BI Consultant - Datastage
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Your comments on the Lookup File Set are spot on, IMHO, but they do come in handy if you have a situation where the same set of lookup data is needed by multiple jobs. Build once, leverage many. And Ray mentioned the 'time shifting' aspect of them, if they are static enough the build could be shifted earlier to, let's say, an otherwise idle time so the build time doesn't impact the eventual runtime. Now, if it is changing right up to the moment the job runs and only this job needs that set of data then yes, not that useful.

As to the lookup data fitting into memory, seeing as how the job falls over dead when it doesn't (from what I recall), that does seem like a pretty important criterion to me. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers