What's happening when we use join stage instead of lookup ?

kaps
Participant
Posts: 452
Joined: Tue May 10, 2005 12:36 pm

What's happening when we use join stage instead of lookup ?

Post by kaps »

I am just wondering what happens in the background when we use a Join stage instead of a Lookup stage. We had a failure saying "insufficient disk space" when we used a Lookup stage against a DB2 table. Although the message says it's a space issue, it shouldn't be, as we have plenty of disk space and plenty of memory. Anyway, we changed the job design to use a Join stage instead of the Lookup, and when I watch it run, the Join stage waits until all the records are read from the reference table before it starts processing. So does the Join stage not hold records in memory? Doesn't the Lookup do the same?

Job design: a sequential file and a DB2 stage feed the Join stage, which goes to a Transformer and then to a DB2 stage.

Appreciate your input...
asorrell
Posts: 1707
Joined: Fri Apr 04, 2003 2:00 pm
Location: Colleyville, Texas

Post by asorrell »

You haven't given us all the info required, namely the sorting and partitioning being used, but I believe I can tell what is happening.

The Join stage requires its inputs to be sorted on the join keys. If the incoming data isn't sorted correctly, DataStage is probably inserting a tsort operator, which is causing some of the issues you are seeing.

If you are truly short on disk space, make sure both of your inputs are pre-sorted. Put an ORDER BY clause on the DB2 SELECT, and use the UNIX sort command with an ExecSH before-job command to pre-sort the sequential file.
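
For example, and only as a sketch (the delimiter, key positions and file names below are made up for illustration), the before-job ExecSH command could run something like:

    # pre-sort the pipe-delimited input on the first two columns (the join keys)
    sort -t '|' -k1,1 -k2,2 -o /data/in/input_sorted.txt /data/in/input.txt

Then point the Sequential File stage at the sorted copy.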

See if that helps eliminate both the road block at the join stage and the resource problem.
Andy Sorrell
Certified DataStage Consultant
IBM Analytics Champion 2009 - 2020
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Actually, putting an ORDER BY into the extraction SQL is not sufficient.

You also have to tell DataStage that the inputs to the Join are already sorted, to prevent it from inserting a tsort operator. You can do that either with a Sort stage set to "don't sort, already sorted" on each input, or by suppressing the insertion of tsort operators altogether with the APT_NO_SORT_INSERTION environment variable in your job.
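
If you want to try the environment variable route, here is a minimal sketch, assuming you set it project-wide in the engine's dsenv file (the more usual approach is to add $APT_NO_SORT_INSERTION as a job-level environment variable parameter instead):

    # dsenv (project-wide) -- or add $APT_NO_SORT_INSERTION as a job parameter instead
    APT_NO_SORT_INSERTION=1
    export APT_NO_SORT_INSERTION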
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
kaps
Participant
Posts: 452
Joined: Tue May 10, 2005 12:36 pm

Post by kaps »

First, Thanks for the replies.

Andy/Ray - I chose Auto so that it does the sorting and key partitioning by itself. So can I say that when we select Auto as the partitioning method, the reference records get loaded into memory before moving to the downstream stages? If that's the case, how does it differ from the Lookup stage?

Is it not good practice to let DataStage sort the data sets? Should we sort them ourselves and tell DataStage not to sort them?

Thanks
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm

Post by jwiles »

You say you have plenty of disk space, but your lookup job is dying because the Lookup stage ran out of space. How much space is allocated for:
-- scratch usage (resource scratch in config files)
-- disk usage (parallel dataset data files) (resource disk in config files)
-- temp usage (/tmp or $TMPDIR)

And, of course, how big is the lookup table from the database (number of bytes)? How does that compare to your available disk space and system memory?
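
If you're not sure of those numbers, a quick check from the engine host is easy enough. The paths below are placeholders; substitute whatever your configuration file actually lists for resource disk and scratchdisk:

    cat $APT_CONFIG_FILE                               # the config file used by the job; shows resource disk and scratchdisk paths
    df -k /your/resource/disk /your/scratch/disk /tmp  # free space on each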

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
kaps
Participant
Posts: 452
Joined: Tue May 10, 2005 12:36 pm

Post by kaps »

Scratch and resource disk both point to /datastage on the server, which has over 200 GB free. /tmp has almost 1 GB free. The server has 8 GB of memory.

The table has 23 million records, and I am only selecting two fields from it.
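
Just to put rough numbers on that (the 40 bytes per row below is an assumption about the two columns plus overhead, not taken from the actual table definition):

    # back-of-envelope size of one copy of the reference data
    echo $(( 23000000 * 40 / 1024 / 1024 ))    # roughly 877 MB

If the Lookup stage keeps a full copy of the reference data on each node (Entire partitioning), that figure is multiplied by the node count, and the per-record overhead of the in-memory lookup table can be considerably larger than the raw column widths, so it's not hard to see how 8 GB of RAM and a nearly full 1 GB /tmp could start to hurt.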
asorrell
Posts: 1707
Joined: Fri Apr 04, 2003 2:00 pm
Location: Colleyville, Texas

Post by asorrell »

Ray,

Thanks for putting the bit about the APT variable in your reply. It hit me later that I had forgotten to mention it, and when I came back to correct that I saw your reply. Thanks for covering!

Kaps - there is a good overview of Lookup vs. Join in the documentation. I suggest you check it out.

Lookups do not require sorted input streams; joins do. However, lookups may consume large amounts of memory depending on how they are set up.

From the Parallel Developer's Guide:
In all cases you are concerned with the size of the reference datasets. If these take up a large amount of memory relative to the physical RAM memory size of the computer you are running on, then a lookup stage might thrash because the reference datasets might not fit in RAM along with everything else that has to be in RAM. This results in very slow performance since each lookup operation can, and typically does, cause a page fault and an I/O operation.

So, if the reference datasets are big enough to cause trouble, use a join. A join does a high-speed sort on the driving and reference datasets. This can involve I/O if the data is big enough, but the I/O is all highly optimized and sequential. Once the sort is over the join processing is very fast and never involves paging or other I/O.
Andy Sorrell
Certified DataStage Consultant
IBM Analytics Champion 2009 - 2020