Working of the Join stage!

Post questions here related to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

prasson_ibm
Premium Member
Posts: 536
Joined: Thu Oct 11, 2007 1:48 am
Location: Bangalore

Working of the Join stage!

Post by prasson_ibm »

Hi all,
Can anyone tell me how the Join stage really works?
Suppose I have two input datasets, partitioned on four nodes, and I am performing a join on one key column.
1) Will the Join stage first perform the join on both input datasets on the first node, then the second, and so on, passing records to downstream operators as soon as they match?

or

2) Or will it first read all records from the left link into a buffer and then match the records from the right link, passing them to downstream operators as soon as a match is found?
Please correct me if I am wrong.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

The join stage will first sort the input streams according to your join key(s), unless that has been already done or explicitly turned off. That way the join mechanism can process the data without having to store large amounts of interim data in memory or on disk.
Each node is completely distinct and separate from the other nodes, so if the data is not partitioned, either explicitly or implicitly, on the join key then the resulting output might not be what you expect.
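
A minimal Python sketch of why the partitioning matters (illustration only, not DataStage code; the helper names and the four-node count are my own assumptions): when both links are partitioned by the same function of the join key, every matching pair lands in the same partition, so joining partition by partition finds all matches. Partition one link on a different column and matches can silently go missing.

# Sketch only, not DataStage internals: per-partition joins are only complete
# when both links are partitioned on the join key.

def hash_partition(rows, key, nodes=4):
    """Route each row to a partition based on a hash of the given column."""
    parts = [[] for _ in range(nodes)]
    for row in rows:
        parts[hash(row[key]) % nodes].append(row)
    return parts

def join_partition(left, right, key):
    """Naive equi-join of two row lists within a single partition."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]

left  = [{"id": i, "l_val": i * 10} for i in range(8)]
right = [{"id": i, "r_val": i * 100} for i in range(8)]

# Both links partitioned on the join key "id": the per-partition joins
# together recover every match.
lp = hash_partition(left, "id")
rp = hash_partition(right, "id")
good = [row for n in range(4) for row in join_partition(lp[n], rp[n], "id")]
print(len(good))   # 8 -- one joined row per id

# Partition the right link on a different column instead: matching ids can
# now sit on different nodes, so rows quietly drop out of the result.
rp_bad = hash_partition(right, "r_val")
bad = [row for n in range(4) for row in join_partition(lp[n], rp_bad[n], "id")]
print(len(bad))    # typically fewer than 8 -- matches split across partitions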
prasson_ibm
Premium Member
Posts: 536
Joined: Thu Oct 11, 2007 1:48 am
Location: Bangalore

Post by prasson_ibm »

ArndW wrote: The join stage will first sort the input streams according to your join key(s), unless that has been already done or explicitly turned off. That way the join mechanism can process the data without hav ...

OK, that means it sorts the data on the input links and performs the join on the sorted data. But if the sort conditions are not the same, will the output be the correct result that I expect?
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

"but if sorting condition will not be same"? Sorry, what?
-craig

"You can never have too many knives" -- Logan Nine Fingers
prasson_ibm
Premium Member
Posts: 536
Joined: Thu Oct 11, 2007 1:48 am
Location: Bangalore

Post by prasson_ibm »

chulett wrote: "but if the sort conditions are not the same"? Sorry, what?
Well, I mean to say: if on the left input link I explicitly sort the key column in descending order, and on the other link I change it to ascending, will the output be correct in that case?
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

DataStage will detect that difference and re-sort implicitly (turn on APT_DUMP_SCORE to check). If you disable this functionality, e.g. through APT_DISABLE_SORT_INSERTION, and sort your two links differently as you posted, the job will abort.
prasson_ibm
Premium Member
Premium Member
Posts: 536
Joined: Thu Oct 11, 2007 1:48 am
Location: Bangalore

Post by prasson_ibm »

ArndW wrote:DataStage will detect that difference and resort implicitly (turn on APT_DUMP_SCORE to check). If you disable this functionality, i.e. through APT_DISABLE_SORT_INSERTION and sort your two links differ ...
So in the case of Join and Merge, we always need to hash-partition and explicitly sort (if not already sorted) on the input links, right?
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

No, not quite. The partitioning algorithm is not a factor; it is just that all the records that you wish to join need to be in the same partition on all links, which effectively means you should use the same partitioning, keyed on the join columns, on all links.

The topic of using explicit or implicit sorts has been discussed a couple of times recently; it is a matter of taste whether to use explicit sorts or let DataStage do it for you. I personally prefer an explicit sort, even to the point of putting in a "dummy" sort stage that does nothing but specify on which columns the data has already been sorted, and also reducing the inserted sorts to order checks with $APT_SORT_INSERTION_CHECK_ONLY. But it is by no means necessary.
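
As an analogy for that check-only behaviour (a Python sketch with my own naming, not DataStage internals): a "check-only" sort simply passes records through while verifying the declared key order, rather than buffering and re-sorting them.

# Analogy only: pass records straight through, merely verifying that they
# arrive in the declared sort order. The function name is made up.

def assert_sorted(rows, key):
    """Yield rows unchanged, raising if the declared sort order is violated."""
    previous = None
    for row in rows:
        current = row[key]
        if previous is not None and current < previous:
            raise ValueError(f"input not sorted on {key!r}: {current!r} after {previous!r}")
        previous = current
        yield row

already_sorted = [{"id": 1}, {"id": 2}, {"id": 2}, {"id": 5}]
for row in assert_sorted(already_sorted, "id"):
    pass   # flows straight through: nothing buffered, nothing re-sorted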
meet_deb85
Premium Member
Posts: 132
Joined: Tue Sep 04, 2007 11:38 am
Location: NOIDA

Post by meet_deb85 »

ArndW wrote:The join stage will first sort the input streams according to your join key(s), unless that has been already done or explicitly turned off. That way the join mechanism can process the data without having to store large amounts of interim data in memory or on disk.
Each node is completely distinct and separate from the other nodes, so if the data is not partitioned, either explicitly or implicitly, on the join key then the resulting output might not be what you expect.
As you have said, the Join stage processes data without holding intermediate data in memory or on disk, but I have read that some stages (Sort, Join, Aggregator) must read the entire input dataset into memory, until the end of data or the end of a data group has been processed, before outputting a single record. So without using a buffer, how does it handle the end of data?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

DataStage jobs manage end of data via a token that is passed through the job. I believe this will be exposed as a system variable in a near future release (@END.OF.DATA ?).
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Stages such as the Aggregator will store the minimum amount of data necessary. If the input stream is sorted by Columns A and B, then an aggregate function on Column B would only need to store/process data for one value of Column B; once that value changes, it knows that no further identical values of Column B can occur, so it can compute the aggregate function and output a value. In the case of unsorted data, the function would need to get an end-of-file signal before being able to do its computations. Similarly, a sort stage on Columns A and B, when the input stream is already sorted on Column A but not on Column B, could (if the specification for Column A states "don't sort, previously sorted") perform small sorts on the data for one value of Column A at a time; once that value changes, the sort is complete for all Column B values and the data is passed downstream, thus not completely blocking the data flow.
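
A rough Python sketch of that "hold only the current group" behaviour (an illustration assuming input pre-sorted on the grouping column, not DataStage internals):

# Streaming aggregation over input already sorted on the grouping column:
# only the running state for the current key is held, and a result row is
# emitted as soon as the key changes.

def running_sum_by_group(rows, group_col, value_col):
    """Yield (group, sum) pairs from rows pre-sorted on group_col."""
    current_key, total = None, 0
    for row in rows:
        key = row[group_col]
        if current_key is not None and key != current_key:
            yield current_key, total          # group finished: flush and free
            total = 0
        current_key = key
        total += row[value_col]
    if current_key is not None:
        yield current_key, total              # flush the final group

rows = [{"grp": "A", "amt": 1}, {"grp": "A", "amt": 2},
        {"grp": "B", "amt": 5}, {"grp": "C", "amt": 7}]
print(list(running_sum_by_group(rows, "grp", "amt")))
# [('A', 3), ('B', 5), ('C', 7)]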
zulfi123786
Premium Member
Posts: 730
Joined: Tue Nov 04, 2008 10:14 am
Location: Bangalore

Post by zulfi123786 »

ArndW wrote:The join stage will first sort the input streams according to your join key(s), unless that has been already done or explicitly turned off. That way the join mechanism can process the data without having to store large amounts of interim data in memory or on disk.
One question has been bugging me for a long time, and now I have got a chance to ask it:

How does sorting the data at the input of the join eliminate the need to store large amounts of interim data in memory or on disk?

Whenever it was said that a join needs sorted data, I had two views in my mind:
1) For each record on the left link, the join searches the right link, and the search is faster when the data is sorted; that is why sorting is necessary.
2) The Join stage takes a chunk of data from each link into memory and then performs the join; in this case the sorting makes sure that values which are equal on both links are present in the respective chunks selected at each step.

I may be wrong and would like to be corrected.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

It's your first choice, with the added benefit that it needs to process only one key value at a time from the driver input, and therefore can (because both inputs are sorted) free that memory once all the rows from the probe input have been collected and joined. And THEY are all adjacent because the data are sorted on the join key.
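
A small Python sketch of that idea (an illustration with my own function names, not the actual Join operator): because both inputs arrive sorted on the key, only the probe rows for the current driver key ever need to be buffered, and they can be discarded the moment the driver key moves on.

# Sketch only: inner equi-join of two inputs already sorted on the join key.
# Only the probe rows for the current key value are ever held in memory.
from itertools import groupby

def merge_join(driver, probe, key):
    """Inner equi-join of two iterables pre-sorted ascending on `key`."""
    probe_groups = groupby(probe, key=lambda r: r[key])
    probe_key, probe_rows = next(probe_groups, (None, None))
    for drv_key, drv_rows in groupby(driver, key=lambda r: r[key]):
        # Advance the probe side until its key catches up with the driver key.
        while probe_key is not None and probe_key < drv_key:
            probe_key, probe_rows = next(probe_groups, (None, None))
        if probe_key == drv_key:
            buffered = list(probe_rows)       # only this key's probe rows
            for d in drv_rows:
                for p in buffered:
                    yield {**d, **p}
            # `buffered` is released here: memory for this key group is freed.

left  = [{"id": 1, "l": "a"}, {"id": 2, "l": "b"}, {"id": 2, "l": "c"}]
right = [{"id": 2, "r": "x"}, {"id": 2, "r": "y"}, {"id": 3, "r": "z"}]
print(list(merge_join(left, right, "id")))
# id 2 joins 2 x 2 = 4 rows; ids 1 and 3 drop out (inner join)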
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
zulfi123786
Premium Member
Posts: 730
Joined: Tue Nov 04, 2008 10:14 am
Location: Bangalore

Post by zulfi123786 »

I was looking more into the working of the join operation and am reviving this two-year-old post :)

Here is what I infer:

1) A block of data is loaded into memory from the left link.

2) A block of data is loaded into memory from the right link.

3) The first record from the left link is taken from the block and its key field value is read.

4) A scan is performed on the block of data from the right link until a match is found.

When a match is found:
5.1) The matching records are joined.
5.2) If the next record on the right link happens to have the same key value, the records are joined again; otherwise the operator moves on to the next record from the left link.

When no match is found:
6.1) The search on the right link continues by loading the next block of right-link data into memory, until the end of data is reached.

7) When the scan for the left-link record is complete, the operator moves to the next record of the left link, reloads the first block of data from the right link, and the same process is followed.

8) When all records from the left link in the block loaded into memory have been processed, the next block of data is loaded from the left link.

Sorting benefits:
With sorting, the operator stops scanning the data as soon as it identifies that there won't be any more records that could satisfy the join, which makes it efficient.

Please let me know if the above, in whole or in part, is incorrect.
- Zulfi
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Step 3 should be "all records with the same join key value are read from the left link". After all, the same join criteria will obtain all of them. This may require more than one block (buffer) of data to be read. Similarly, more than one block (buffer) of data may be needed from the right link.

There are slight variations for outer join types.
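
Continuing the earlier Python sketch (still an illustration with made-up names, not the real operator): the corrected step reads the whole key group from each link, which may span more than one buffer, and the left outer variation keeps unmatched left rows padded with nulls.

# Sketch only: read *all* records with the same key value from each link,
# join the two groups, and for a left outer join emit unmatched left rows
# with the right-hand columns set to None.
from itertools import groupby

def left_outer_merge_join(left, right, key, right_cols):
    """Left outer equi-join of two iterables pre-sorted ascending on `key`."""
    nulls = {c: None for c in right_cols}
    right_groups = groupby(right, key=lambda r: r[key])
    r_key, r_rows = next(right_groups, (None, None))
    for l_key, l_group in groupby(left, key=lambda r: r[key]):
        l_rows = list(l_group)                 # the whole left key group
        while r_key is not None and r_key < l_key:
            r_key, r_rows = next(right_groups, (None, None))
        if r_key == l_key:
            r_buffer = list(r_rows)            # the whole right key group
            for l in l_rows:
                for r in r_buffer:
                    yield {**l, **r}
        else:                                  # outer-join variation:
            for l in l_rows:                   # keep the left row anyway,
                yield {**l, **nulls}           # right columns become None

left  = [{"id": 1, "l": "a"}, {"id": 2, "l": "b"}]
right = [{"id": 2, "r": "x"}, {"id": 2, "r": "y"}]
print(list(left_outer_merge_join(left, right, "id", right_cols=["r"])))
# [{'id': 1, 'l': 'a', 'r': None}, {'id': 2, 'l': 'b', 'r': 'x'}, {'id': 2, 'l': 'b', 'r': 'y'}]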
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Post Reply