Why join is faster?

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
kpavan2004
Participant
Posts: 9
Joined: Sun Oct 19, 2008 7:09 am

Why join is faster?

Post by kpavan2004 »

Hi,

As per the parallel job developer's guide, the Join stage is preferable to the Lookup stage for large volumes of reference data. I have the queries below.

1. How does it process large volumes of data so fast?
As per the guide, "A join does a high-speed sort on the driving and reference datasets" - so is this the primary reason for its performance?

2. Will it query the database for every input row, or store the entire output in memory? I am not aware of this mechanism.


Could anyone please explain in detail the internal mechanism behind the Join stage and its memory usage for large volumes? Thanks in advance.
Pavan
priyadarshikunal
Premium Member
Posts: 1735
Joined: Thu Mar 01, 2007 5:44 am
Location: Troy, MI

Post by priyadarshikunal »

The PDFs which come with DataStage explain the memory usage for the Lookup, Join and Merge stages. Have a look at them.
Priyadarshi Kunal

Genius may have its limitations, but stupidity is not thus handicapped. :wink:
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Using data sorted on the join keys is particularly fast and is also light on memory usage since only the data for a key grouping must be kept in memory during processing.
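To illustrate the point (this is a minimal Python sketch of the general sort-merge join technique, not DataStage's actual implementation), note how only one reference key group is buffered at a time while both sorted streams are walked in step:

```python
# Sketch of a sort-merge join over two streams already sorted on the key.
# Only the rows for the current key grouping are held in memory.
import itertools

def merge_join(driving, reference, key):
    """Inner-join two iterables pre-sorted on `key`; yields (driving, reference) pairs."""
    ref_groups = itertools.groupby(reference, key=key)
    ref_key, ref_rows = next(ref_groups, (None, []))
    ref_rows = list(ref_rows)  # one key group buffered, never the whole dataset
    for d_key, d_rows in itertools.groupby(driving, key=key):
        # Advance the reference stream until its key catches up.
        while ref_key is not None and ref_key < d_key:
            ref_key, ref_rows = next(ref_groups, (None, []))
            ref_rows = list(ref_rows)
        if ref_key == d_key:
            for d in d_rows:
                for r in ref_rows:
                    yield d, r
```

Because each key group is discarded as soon as the driving stream moves past it, memory usage stays proportional to the largest key grouping rather than to the reference dataset size.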

Contrast that with the method needed for a normal lookup, where the complete lookup reference data is held in memory. I won't go into sparse lookups: those are inherently slow, since they require a query against the reference database for each input row.
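For contrast, a normal (in-memory) lookup can be sketched like this - again a hypothetical Python illustration of the general hash-lookup technique, not DataStage internals. The entire reference set is loaded up front, but the source stream needs no sorting:

```python
# Sketch of an in-memory hash lookup: the whole reference table is
# built into a dict first, then the (possibly unsorted) source streams through.
def hash_lookup(source, reference, key):
    """Yield (source, reference) pairs for matching keys; drops non-matches."""
    ref_map = {key(r): r for r in reference}  # entire reference held in memory
    for row in source:
        match = ref_map.get(key(row))
        if match is not None:
            yield row, match
```

The trade-off mirrors the stages being discussed: memory cost scales with the full reference dataset, but there is no sort and no requirement on input ordering.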

When the reference data is small and/or the source data volume is large and (relatively) unsorted, then a Lookup stage can be significantly faster than a join. The actual point at which a join becomes more efficient than a lookup is not easy to define, as it depends upon a number of factors including row byte size, number of key columns, relative size of lookup data, available memory, and system load type (I/O, disk, CPU, memory), to name just a few.

Sorting data where the key is already in some non-random order (partially sorted) is much faster than sorting randomly spread keys. A sort like this will be a blocking stage in processing, causing the whole job pipeline to come to a halt while the sort is performed. The lookup stage doesn't perform blocking of the main data stream.
kpavan2004
Participant
Posts: 9
Joined: Sun Oct 19, 2008 7:09 am

Post by kpavan2004 »

ArndW wrote:Using data sorted on the join keys is particularly fast and is also light on memory usage since only the data for a key grouping must be kept in memory during processing.
Thanks, ArndW, for the helpful information. So as per my understanding, the Join stage also stores data in memory, but only the data for one key grouping if we use sorted datasets, and hence it is faster. Correct me if I am wrong.
Pavan
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

While it is true that the join uses much less memory than a Lookup stage for similar reference data volumes, the conclusion that this is why it is faster is not necessarily correct. On a system which would not otherwise use the additional memory required by a Lookup stage, it would make little difference.
If the source data needs to be sorted and is very large, and the lookup dataset is small, then a Join stage, despite being light on memory usage, will most certainly be slower.
Post Reply