working of join stage!
Moderators: chulett, rschirm, roy
-
- Premium Member
- Posts: 536
- Joined: Thu Oct 11, 2007 1:48 am
- Location: Bangalore
working of join stage!
Hi all,
Can anyone tell me how the Join stage really works?
Suppose I have two input datasets partitioned on four nodes and I am performing a join on one key column.
1) Will the Join stage first perform the join on both input datasets on the first node, then the second, and so on, passing records to downstream operators as soon as they match?
or
Will it first read all records from the left link into a buffer and then match records from the right link, passing them to downstream operators as soon as they match?
Please correct me if I am wrong.
The join stage will first sort the input streams according to your join key(s), unless that has been already done or explicitly turned off. That way the join mechanism can process the data without having to store large amounts of interim data in memory or on disk.
Each Node is completely distinct and separate from the other nodes, so if the data is not partitioned, either explicitly or implicitly, on the join key then the resulting output might not be what you expect.
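The sort-merge behaviour described above can be sketched roughly as follows (a minimal Python illustration, not DataStage code): because both inputs are sorted on the join key, only the rows for the current key value need to be held at any time.

```python
# Illustrative sketch of a sort-merge inner join over two inputs that
# are already sorted on the join key, as the Join stage expects.

def merge_join(left, right, key):
    """Inner join of two key-sorted lists of dicts on `key`."""
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1                      # advance the stream with the smaller key
        elif lk > rk:
            j += 1
        else:
            # Collect the run of equal keys on the right, join each
            # left row in the run against it, then move past both runs.
            j_start = j
            while j < len(right) and right[j][key] == lk:
                j += 1
            while i < len(left) and left[i][key] == lk:
                for r in right[j_start:j]:
                    out.append({**left[i], **r})
                i += 1
    return out

left = [{"id": 1, "a": "x"}, {"id": 2, "a": "y"}]
right = [{"id": 2, "b": "p"}, {"id": 2, "b": "q"}, {"id": 3, "b": "r"}]
print(merge_join(left, right, "id"))
# → [{'id': 2, 'a': 'y', 'b': 'p'}, {'id': 2, 'a': 'y', 'b': 'q'}]
```

Note how neither stream is ever re-read from the start; once a key's run is consumed, its rows are no longer needed, which is why sorted input keeps memory use small.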
-
- Premium Member
- Posts: 536
- Joined: Thu Oct 11, 2007 1:48 am
- Location: Bangalore
ArndW wrote: The join stage will first sort the input streams according to your join key(s), unless that has been already done or explicitly turned off. That way the join mechanism can process the data without hav ...
OK, that means it sorts the data on the input links and performs the join on the sorted data. But if the sort conditions on the two links are not the same, will the output still be correct, as I expect?
-
- Premium Member
- Posts: 536
- Joined: Thu Oct 11, 2007 1:48 am
- Location: Bangalore
DataStage will detect that difference and re-sort implicitly (turn on APT_DUMP_SCORE to check). If you disable this functionality, e.g. through APT_DISABLE_SORT_INSERTION, and sort your two links differently as you posted, the job will abort.
-
- Premium Member
- Posts: 536
- Joined: Thu Oct 11, 2007 1:48 am
- Location: Bangalore
So in the case of Join and Merge, we always need to hash-partition and explicitly sort (if not internally sorted) on the input links, right? ArndW wrote:DataStage will detect that difference and resort implicitly (turn on APT_DUMP_SCORE to check). If you disable this functionality, i.e. through APT_DISABLE_SORT_INSERTION and sort your two links differ ...
No, not quite. The partitioning algorithm is not a factor; what matters is that all the records you wish to join are in the same partition on all links, which effectively means you should use the same partitioning method and key on all links.
The topic of using explicit or implicit sorts has been discussed a couple of times recently; it is a matter of taste whether to use explicit sorts or let DataStage do it for you. I personally prefer an explicit sort, even going so far as to put in a "dummy" sort stage that does nothing but specify on which columns the data has been previously sorted, and also turning off insertion with $APT_SORT_INSERTION_CHECK_ONLY. But it is by no means necessary.
-
- Premium Member
- Posts: 132
- Joined: Tue Sep 04, 2007 11:38 am
- Location: NOIDA
As you have said, the Join stage processes data without holding intermediate data in memory or on disk. But I have read that some stages (Sort, Join, Aggregator) must read their entire input datasets into memory until the end of data / end of data group has been processed before outputting a single record. So, without using a buffer, how does it handle the end of data? ArndW wrote:The join stage will first sort the input streams according to your join key(s), unless that has been already done or explicitly turned off. That way the join mechanism can process the data without having to store large amounts of interim data in memory or on disk.
Each Node is completely distinct and separate from the other nodes, so if the data is not partitioned, either explicitly or implicitly, on the join key then the resulting output might not be what you expect.
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
DataStage jobs manage end of data via a token that is passed through the job. I believe this will be exposed as a system variable in a near future release (@END.OF.DATA ?).
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Stages such as the Aggregator will store the minimum amount of data necessary. If the input stream is sorted by Columns A and B, then an aggregate function on Column B only needs to store/process data for one value of Column B; once that value changes, it knows that no further identical values of Column B can occur, so it can compute the aggregate function and output a value. In the case of unsorted data, the function would need to get an end-of-file signal before being able to do its computations. Similarly, a Sort stage on Columns A and B, when the input stream is already sorted on Column A but not on Column B, could, if the specification for Column A states "don't sort, previously sorted", perform small sorts on the data for one value of Column A; once that value changes, the sort is done for all Column B values and the data is passed downstream, thus not completely blocking the data flow.
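The sorted-input aggregation described above can be sketched as follows (a minimal Python illustration, not an actual DataStage operator): with input pre-sorted on the grouping column, only one group's running state is held at a time, and a result is emitted as soon as the key changes.

```python
# Streaming aggregation over a key-sorted input: each group's result is
# emitted the moment a new key value appears, so memory holds only the
# running total for the current group.

def streaming_sum(rows, group_col, value_col):
    """Yield (group, sum) pairs from rows pre-sorted on group_col."""
    current, total = None, 0
    for row in rows:
        if row[group_col] != current:
            if current is not None:
                yield current, total    # key changed: the group is complete
            current, total = row[group_col], 0
        total += row[value_col]
    if current is not None:
        yield current, total            # flush the final group

rows = [{"k": "A", "v": 1}, {"k": "A", "v": 2}, {"k": "B", "v": 5}]
print(list(streaming_sum(rows, "k", "v")))
# → [('A', 3), ('B', 5)]
```

With unsorted input the same function would give wrong answers, which mirrors the point above: without sorting, the operator has to see end-of-data before it can finalize any group.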
-
- Premium Member
- Posts: 730
- Joined: Tue Nov 04, 2008 10:14 am
- Location: Bangalore
The one question that has been bugging me for a long time; now I have a chance to ask it. ArndW wrote:The join stage will first sort the input streams according to your join key(s), unless that has been already done or explicitly turned off. That way the join mechanism can process the data without having to store large amounts of interim data in memory or on disk.
How does sorting the data at the input of the Join eliminate the need to store large amounts of interim data in memory or on disk?
Whenever it was said that Join needs sorted data, I had two views in mind:
1) For each record on the left link, the Join searches the right link, and the search is faster when the data is sorted; that is why sorting is necessary.
2) The Join stage takes a chunk of data from each link into memory and then performs the join; in this case sorting ensures that equal key values from both links are present in the respective chunks at each instance.
I may be wrong and would like to be corrected.
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
It's your first choice, with the added benefit that it can process only one key value at a time from the driver input, and therefore can (because both inputs are sorted) free that memory once all the rows from the probe input have been collected and joined. And THEY are all adjacent because the data are sorted on the join key.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Premium Member
- Posts: 730
- Joined: Tue Nov 04, 2008 10:14 am
- Location: Bangalore
I was looking more into the working of the join operation and am reviving this two-year-old post.
Here is what I infer:
1) A block of data is loaded into memory from the left link.
2) A block of data is loaded into memory from the right link.
3) The first record from the left link is taken out of the block and its key field value is read.
4) A scan is performed on the block of data from the right link until a match is found.
When a match is found:
5.1) If a match is found, the records are joined.
5.2) If the next record on the right link has the same key value, the records are joined again; otherwise the operator moves to the next record from the left link.
When no match is found:
6.1) The search on the right link continues by loading the next block of right-link data into memory, until end of data is reached.
7) When the scan for the left-link record is complete, the operator moves to the next record of the left link, reloads the first block of data from the right link, and follows the same process.
8) When all records from the left link in the block loaded into memory have been processed, the next block of data is loaded from the left link.
Sorting benefits:
With sorting, the operator stops scanning when it identifies that there cannot be any more records that would satisfy the join, which makes it more efficient.
Please let me know if the above, in its entirety or partly, is incorrect.
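The block-scan steps listed above can be sketched roughly like this (block size and record layout are illustrative, not DataStage internals): with key-sorted data, the right-link scan can stop as soon as it sees a key greater than the one currently being probed.

```python
# Block nested-loop join over key-sorted inputs, following the numbered
# steps: load right-link data block by block, scan for matches, and use
# the sort order to stop early once keys on the right exceed the probe key.

def block_scan_join(left, right, key, block_size=2):
    out = []
    for l in left:                                   # steps 3/7/8: probe each left record
        for start in range(0, len(right), block_size):
            block = right[start:start + block_size]  # steps 2/6.1: load right blocks
            stop = False
            for r in block:
                if r[key] == l[key]:
                    out.append({**l, **r})           # steps 5.1/5.2: join matching records
                elif r[key] > l[key]:
                    stop = True                      # sorted data: no later match possible
                    break
            if stop:
                break
    return out

left = [{"id": 1, "a": "x"}, {"id": 2, "a": "y"}]
right = [{"id": 1, "b": "p"}, {"id": 2, "b": "q"}, {"id": 3, "b": "r"}]
print(block_scan_join(left, right, "id"))
# → [{'id': 1, 'a': 'x', 'b': 'p'}, {'id': 2, 'a': 'y', 'b': 'q'}]
```

The early `stop` is exactly the "Sorting benefits" point: on sorted data the scan never has to run to end-of-data for a key that has already been passed.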
- Zulfi
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Step 3 should be "all records with the same Join Key value are read from the left link". After all, the same join criteria will obtain all of them. This may require more than one block (buffer) of data to be read. Similarly more than one block (buffer) of data may be needed from the right link.
There are slight variations for outer join types.
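The corrected step 3 (reading the whole run of records that share the current join-key value, across block boundaries if necessary) can be sketched like this, using Python's itertools.groupby as a stand-in for the block/buffer reads:

```python
# Reading a sorted stream as runs of equal join-key values: each run is
# gathered in full before joining, regardless of how many blocks it spans.

from itertools import groupby
from operator import itemgetter

def key_groups(sorted_rows, key):
    """Yield (key_value, [rows]) runs from a stream sorted on `key`."""
    for value, run in groupby(sorted_rows, key=itemgetter(key)):
        yield value, list(run)

rows = [{"id": 1}, {"id": 1}, {"id": 2}]
print(list(key_groups(rows, "id")))
# → [(1, [{'id': 1}, {'id': 1}]), (2, [{'id': 2}])]
```

Pairing up one such run from the left link with the matching run from the right link is what lets the stage free each run's memory before moving to the next key value.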
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.