What's happening when we use join stage instead of lookup ?

kaps
Participant
Posts: 452
Joined: Tue May 10, 2005 12:36 pm

What's happening when we use join stage instead of lookup ?

Post by kaps »

I am just wondering what happens in the background when we use a Join stage instead of a Lookup stage. We had a failure saying "insufficient disk space" when we used a Lookup stage against a DB2 table. Although the message says it's a space issue, it shouldn't be, as we have plenty of disk space and plenty of memory. Anyway, we changed the job design to use a Join stage instead of the Lookup, and when I watch it run, the Join stage waits until all the records are read from the reference table before it starts processing. So does the Join stage not hold records in memory? Doesn't the Lookup do the same?

Job design: a sequential file and a DB2 stage feed the Join stage, which goes to a Transformer and then to a DB2 stage.

Appreciate your input...
asorrell
Posts: 1707
Joined: Fri Apr 04, 2003 2:00 pm
Location: Colleyville, Texas

Post by asorrell »

You haven't given us all the info required, namely the sorting and partitioning being used, but I believe I can tell what is happening.

The Join stage requires its inputs to be sorted on the join keys. If the incoming data isn't sorted correctly, DataStage is probably inserting a tsort operator, which is causing some of the issues you are seeing.

If you are truly short on disk space, make sure both of your inputs are pre-sorted. Put an ORDER BY clause on the DB2 SELECT, and use the UNIX sort command with an ExecSH before-job command to pre-sort the sequential file.
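
For example, and only as a sketch (the delimiter, key positions and file names below are made up for illustration), the before-job ExecSH command could run something like:

    # pre-sort the pipe-delimited input on the first two columns (the join keys)
    sort -t '|' -k1,1 -k2,2 -o /data/in/input_sorted.txt /data/in/input.txt

Then point the Sequential File stage at the sorted copy.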

See if that helps eliminate both the road block at the join stage and the resource problem.
Andy Sorrell
Certified DataStage Consultant
IBM Analytics Champion 2009 - 2020
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Actually, putting an ORDER BY into the extraction SQL is not sufficient.

You also have to tell DataStage that the inputs to the Join are already sorted, to prevent it from inserting a tsort operator. You can do that either with a Sort stage set to "don't sort, already sorted" on each input, or by suppressing the insertion of tsort operators altogether with the APT_NO_SORT_INSERTION environment variable in your job.
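
If you want to try the environment variable route, here is a minimal sketch, assuming you set it project-wide in the engine's dsenv file (the more usual approach is to add $APT_NO_SORT_INSERTION as a job-level environment variable parameter instead):

    # dsenv (project-wide) -- or add $APT_NO_SORT_INSERTION as a job parameter instead
    APT_NO_SORT_INSERTION=1
    export APT_NO_SORT_INSERTION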
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
kaps
Participant
Posts: 452
Joined: Tue May 10, 2005 12:36 pm

Post by kaps »

First, Thanks for the replies.

Andy/Ray - I chose Auto so that it does the sorting and key partitioning by itself. So can I say that when we select Auto as the partitioning method, the reference records get loaded into memory before moving to the downstream stages? If that's the case, how does it differ from the Lookup stage?

Is it not good practice to let DataStage sort the data sets? Should we sort them ourselves and tell DataStage not to sort them?

Thanks
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm

Post by jwiles »

You say you have plenty of disk space, but your lookup job is dying because the Lookup stage ran out of space. How much space is allocated for:
-- scratch usage (resource scratch in config files)
-- disk usage (parallel dataset data files) (resource disk in config files)
-- temp usage (/tmp or $TMPDIR)

And, of course, how big is the lookup table from the database (number of bytes)? How does that compare to your available disk space and system memory?
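
If you're not sure of those numbers, a quick check from the engine host is easy enough. The paths below are placeholders; substitute whatever your configuration file actually lists for resource disk and scratchdisk:

    cat $APT_CONFIG_FILE                               # the config file used by the job; shows resource disk and scratchdisk paths
    df -k /your/resource/disk /your/scratch/disk /tmp  # free space on each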

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
kaps
Participant
Posts: 452
Joined: Tue May 10, 2005 12:36 pm

Post by kaps »

Scratch and resource disk both point to /datastage on the server, which has over 200 GB free. /tmp has almost 1 GB free. The server has 8 GB of memory.

The table has 23 million records, and I am only selecting two fields from it.
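
Just to put rough numbers on that (the 40 bytes per row below is an assumption about the two columns plus overhead, not taken from the actual table definition):

    # back-of-envelope size of one copy of the reference data
    echo $(( 23000000 * 40 / 1024 / 1024 ))    # roughly 877 MB

If the Lookup stage keeps a full copy of the reference data on each node (Entire partitioning), that figure is multiplied by the node count, and the per-record overhead of the in-memory lookup table can be considerably larger than the raw column widths, so it's not hard to see how 8 GB of RAM and a nearly full 1 GB /tmp could start to hurt.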
asorrell
Posts: 1707
Joined: Fri Apr 04, 2003 2:00 pm
Location: Colleyville, Texas

Post by asorrell »

Ray,

Thanks for putting the bit about the APT variable in your reply. It hit me later that I had forgotten to mention it, and when I came back to correct that I saw your reply. Thanks for covering!

Kaps - there is a good overview of Lookup vs. Join in the documentation. I suggest you check it out.

Lookups do not require sorted input streams; joins do. However, lookups may consume large amounts of memory depending on how they are set up.

From the Parallel Developer's Guide:
In all cases you are concerned with the size of the reference datasets. If these take up a large amount of memory relative to the physical RAM memory size of the computer you are running on, then a lookup stage might thrash because the reference datasets might not fit in RAM along with everything else that has to be in RAM. This results in very slow performance since each lookup operation can, and typically does, cause a page fault and an I/O operation.

So, if the reference datasets are big enough to cause trouble, use a join. A join does a high-speed sort on the driving and reference datasets. This can involve I/O if the data is big enough, but the I/O is all highly optimized and sequential. Once the sort is over the join processing is very fast and never involves paging or other I/O.
Andy Sorrell
Certified DataStage Consultant
IBM Analytics Champion 2009 - 2020