Why join is faster?

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
kpavan2004
Participant
Posts: 9
Joined: Sun Oct 19, 2008 7:09 am

Why join is faster?

Post by kpavan2004 »

Hi,

As per the parallel job developer's guide, the Join stage is preferable to the Lookup stage for large volumes of reference data. I have the queries below.

1. How does it process large volumes of data so fast?
As per the guide, "A join does a high-speed sort on the driving and reference datasets" - so is this the primary reason for its performance?

2. Will it query the database for every input row, or store the entire output in memory? I am not aware of this mechanism.


Could anyone please explain in detail the internal mechanism behind the Join stage and its memory usage for large volumes? Thanks in advance.
Pavan
priyadarshikunal
Premium Member
Posts: 1735
Joined: Thu Mar 01, 2007 5:44 am
Location: Troy, MI

Post by priyadarshikunal »

The PDFs which come with DataStage explain the memory usage for the Lookup, Join and Merge stages. Have a look at them.
Priyadarshi Kunal

Genius may have its limitations, but stupidity is not thus handicapped. :wink:
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Using data sorted on the join keys is particularly fast and is also light on memory usage since only the data for a key grouping must be kept in memory during processing.
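To illustrate the point (this is a minimal Python sketch of the general sort-merge join technique, not DataStage's actual implementation), note how only one reference key group is buffered at a time while both sorted streams are walked in step:

```python
# Sketch of a sort-merge join over two streams already sorted on the key.
# Only the rows for the current key grouping are held in memory.
import itertools

def merge_join(driving, reference, key):
    """Inner-join two iterables pre-sorted on `key`; yields (driving, reference) pairs."""
    ref_groups = itertools.groupby(reference, key=key)
    ref_key, ref_rows = next(ref_groups, (None, []))
    ref_rows = list(ref_rows)  # one key group buffered, never the whole dataset
    for d_key, d_rows in itertools.groupby(driving, key=key):
        # Advance the reference stream until its key catches up.
        while ref_key is not None and ref_key < d_key:
            ref_key, ref_rows = next(ref_groups, (None, []))
            ref_rows = list(ref_rows)
        if ref_key == d_key:
            for d in d_rows:
                for r in ref_rows:
                    yield d, r
```

Because each key group is discarded as soon as the driving stream moves past it, memory usage stays proportional to the largest key grouping rather than to the reference dataset size.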

Contrast that with the method needed for a normal lookup, where the complete lookup reference data is held in memory. I won't go into sparse lookups: those are inherently slow, since they require a query against the reference database for each input row.
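For contrast, a normal (in-memory) lookup can be sketched like this - again a hypothetical Python illustration of the general hash-lookup technique, not DataStage internals. The entire reference set is loaded up front, but the source stream needs no sorting:

```python
# Sketch of an in-memory hash lookup: the whole reference table is
# built into a dict first, then the (possibly unsorted) source streams through.
def hash_lookup(source, reference, key):
    """Yield (source, reference) pairs for matching keys; drops non-matches."""
    ref_map = {key(r): r for r in reference}  # entire reference held in memory
    for row in source:
        match = ref_map.get(key(row))
        if match is not None:
            yield row, match
```

The trade-off mirrors the stages being discussed: memory cost scales with the full reference dataset, but there is no sort and no requirement on input ordering.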

When the reference data is small and/or the source data volume is large and (relatively) unsorted, then a Lookup stage can be significantly faster than a join. The actual point at which a join becomes more efficient than a lookup is not easy to define, as it depends upon a number of factors including row byte size, number of key columns, relative size of lookup data, available memory, and system load type (I/O, disk, CPU, memory), to name just a few.

Sorting data where the key is already in some non-random order (partially sorted) is much faster than sorting randomly spread keys. A sort like this will be a blocking stage in processing, causing the whole job pipeline to come to a halt while the sort is performed. The lookup stage doesn't perform blocking of the main data stream.
kpavan2004
Participant
Posts: 9
Joined: Sun Oct 19, 2008 7:09 am

Post by kpavan2004 »

ArndW wrote:Using data sorted on the join keys is particularly fast and is also light on memory usage since only the data for a key grouping must be kept in memory during processing.
Thanks, ArndW, for the helpful information. So as per my understanding, the Join stage also stores data in memory, but only the data for one key grouping if we use sorted datasets, and hence it is faster. Correct me if I am wrong.
Pavan
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

While it is true that the join uses much less memory than a Lookup stage for similar reference data volumes, the conclusion that this is why it is faster is not necessarily correct. On a system which would not otherwise use the additional memory required by a Lookup stage, it would make little difference.
If the source data needs to be sorted and is very large, and the lookup dataset is small, then a Join stage, despite being light on memory usage, will most certainly be slower.
Post Reply