working of join stage!
Moderators: chulett, rschirm, roy
-
- Premium Member
- Posts: 536
- Joined: Thu Oct 11, 2007 1:48 am
- Location: Bangalore
working of join stage!
Hi all,
Can anyone tell me how the Join stage really works?
Suppose I have two input datasets partitioned on four nodes and I am performing a join on one key column.
1) Will the Join stage first perform the join on both input datasets on the first node, then the second, and so on, passing records to downstream operators as soon as they match?
or
Will it first read all records from the left link into a buffer and then match records from the right link, passing them to downstream operators as soon as they match?
Please correct me if I am wrong.
The join stage will first sort the input streams according to your join key(s), unless that has been already done or explicitly turned off. That way the join mechanism can process the data without having to store large amounts of interim data in memory or on disk.
Each Node is completely distinct and separate from the other nodes, so if the data is not partitioned, either explicitly or implicitly, on the join key then the resulting output might not be what you expect.
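The sort-merge behaviour described above can be sketched roughly as follows (a minimal Python illustration, not DataStage code): because both inputs are sorted on the join key, only the rows for the current key value need to be held at any time.

```python
# Illustrative sketch of a sort-merge inner join over two inputs that
# are already sorted on the join key, as the Join stage expects.

def merge_join(left, right, key):
    """Inner join of two key-sorted lists of dicts on `key`."""
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1                      # advance the stream with the smaller key
        elif lk > rk:
            j += 1
        else:
            # Collect the run of equal keys on the right, join each
            # left row in the run against it, then move past both runs.
            j_start = j
            while j < len(right) and right[j][key] == lk:
                j += 1
            while i < len(left) and left[i][key] == lk:
                for r in right[j_start:j]:
                    out.append({**left[i], **r})
                i += 1
    return out

left = [{"id": 1, "a": "x"}, {"id": 2, "a": "y"}]
right = [{"id": 2, "b": "p"}, {"id": 2, "b": "q"}, {"id": 3, "b": "r"}]
print(merge_join(left, right, "id"))
# → [{'id': 2, 'a': 'y', 'b': 'p'}, {'id': 2, 'a': 'y', 'b': 'q'}]
```

Note how neither stream is ever re-read from the start; once a key's run is consumed, its rows are no longer needed, which is why sorted input keeps memory use small.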
-
- Premium Member
- Posts: 536
- Joined: Thu Oct 11, 2007 1:48 am
- Location: Bangalore
ArndW wrote: The join stage will first sort the input streams according to your join key(s), unless that has been already done or explicitly turned off. That way the join mechanism can process the data without hav ...
OK, that means it sorts the data on the input links and performs the join on the sorted data. But if the sort conditions on the two links are not the same, will the output still be correct, as I expect?
-
- Premium Member
- Posts: 536
- Joined: Thu Oct 11, 2007 1:48 am
- Location: Bangalore
DataStage will detect that difference and re-sort implicitly (turn on APT_DUMP_SCORE to check). If you disable this functionality, e.g. through APT_DISABLE_SORT_INSERTION, and sort your two links differently as you posted, the job will abort.
-
- Premium Member
- Posts: 536
- Joined: Thu Oct 11, 2007 1:48 am
- Location: Bangalore
So in the case of Join and Merge, we always need to hash-partition and explicitly sort (if not internally sorted) on the input links, right? ArndW wrote:DataStage will detect that difference and resort implicitly (turn on APT_DUMP_SCORE to check). If you disable this functionality, i.e. through APT_DISABLE_SORT_INSERTION and sort your two links differ ...
No, not quite. The partitioning algorithm is not a factor; what matters is that all the records you wish to join are in the same partition on all links, which effectively means you should use the same partitioning method and key on all links.
The topic of using explicit or implicit sorts has been discussed a couple of times recently; it is a matter of taste whether to use explicit sorts or let DataStage do it for you. I personally prefer an explicit sort, even going so far as to put in a "dummy" sort stage that does nothing but specify on which columns the data has been previously sorted, and also turning off insertion with $APT_SORT_INSERTION_CHECK_ONLY. But it is by no means necessary.
-
- Premium Member
- Posts: 132
- Joined: Tue Sep 04, 2007 11:38 am
- Location: NOIDA
As you have said, the Join stage processes data without holding intermediate data in memory or on disk. But I have read that some stages (Sort, Join, Aggregator) must read their entire input datasets into memory until the end of data / end of data group has been processed before outputting a single record. So, without using a buffer, how does it handle the end of data? ArndW wrote:The join stage will first sort the input streams according to your join key(s), unless that has been already done or explicitly turned off. That way the join mechanism can process the data without having to store large amounts of interim data in memory or on disk.
Each Node is completely distinct and separate from the other nodes, so if the data is not partitioned, either explicitly or implicitly, on the join key then the resulting output might not be what you expect.
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
DataStage jobs manage end of data via a token that is passed through the job. I believe this will be exposed as a system variable in a near future release (@END.OF.DATA ?).
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Stages such as the Aggregator will store the minimum amount of data necessary. If the input stream is sorted by Columns A and B, then an aggregate function on Column B only needs to store/process data for one value of Column B; once that value changes, it knows that no further identical values of Column B can occur, so it can compute the aggregate function and output a value. In the case of unsorted data, the function would need to get an end-of-file signal before being able to do its computations. Similarly, a Sort stage on Columns A and B, when the input stream is already sorted on Column A but not on Column B, could, if the specification for Column A states "don't sort, previously sorted", perform small sorts on the data for one value of Column A; once that value changes, the sort is done for all Column B values and the data is passed downstream, thus not completely blocking the data flow.
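The sorted-input aggregation described above can be sketched as follows (a minimal Python illustration, not an actual DataStage operator): with input pre-sorted on the grouping column, only one group's running state is held at a time, and a result is emitted as soon as the key changes.

```python
# Streaming aggregation over a key-sorted input: each group's result is
# emitted the moment a new key value appears, so memory holds only the
# running total for the current group.

def streaming_sum(rows, group_col, value_col):
    """Yield (group, sum) pairs from rows pre-sorted on group_col."""
    current, total = None, 0
    for row in rows:
        if row[group_col] != current:
            if current is not None:
                yield current, total    # key changed: the group is complete
            current, total = row[group_col], 0
        total += row[value_col]
    if current is not None:
        yield current, total            # flush the final group

rows = [{"k": "A", "v": 1}, {"k": "A", "v": 2}, {"k": "B", "v": 5}]
print(list(streaming_sum(rows, "k", "v")))
# → [('A', 3), ('B', 5)]
```

With unsorted input the same function would give wrong answers, which mirrors the point above: without sorting, the operator has to see end-of-data before it can finalize any group.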
-
- Premium Member
- Posts: 730
- Joined: Tue Nov 04, 2008 10:14 am
- Location: Bangalore
The one question that has been bugging me for a long time; now I have a chance to ask it. ArndW wrote:The join stage will first sort the input streams according to your join key(s), unless that has been already done or explicitly turned off. That way the join mechanism can process the data without having to store large amounts of interim data in memory or on disk.
How does sorting the data at the input of the Join eliminate the need to store large amounts of interim data in memory or on disk?
Whenever it was said that Join needs sorted data, I had two views in mind:
1) For each record on the left link, the Join searches the right link, and the search is faster when the data is sorted; that is why sorting is necessary.
2) The Join stage takes a chunk of data from each link into memory and then performs the join; in this case sorting ensures that equal key values from both links are present in the respective chunks at each instance.
I may be wrong and would like to be corrected.
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
It's your first choice, with the added benefit that it can process only one key value at a time from the driver input, and therefore can (because both inputs are sorted) free that memory once all the rows from the probe input have been collected and joined. And THEY are all adjacent because the data are sorted on the join key.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Premium Member
- Posts: 730
- Joined: Tue Nov 04, 2008 10:14 am
- Location: Bangalore
I was looking more into the working of the join operation and am reviving this two-year-old post.
Here is what I infer:
1) A block of data is loaded into memory from the left link.
2) A block of data is loaded into memory from the right link.
3) The first record from the left link is taken out of the block and its key field value is read.
4) A scan is performed on the block of data from the right link until a match is found.
When a match is found:
5.1) If a match is found, the records are joined.
5.2) If the next record on the right link has the same key value, the records are joined again; otherwise the operator moves to the next record from the left link.
When no match is found:
6.1) The search on the right link continues by loading the next block of right-link data into memory, until end of data is reached.
7) When the scan for the left-link record is complete, the operator moves to the next record of the left link, reloads the first block of data from the right link, and follows the same process.
8) When all records from the left link in the block loaded into memory have been processed, the next block of data is loaded from the left link.
Sorting benefits:
With sorting, the operator stops scanning when it identifies that there cannot be any more records that would satisfy the join, which makes it more efficient.
Please let me know if the above, in its entirety or partly, is incorrect.
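The block-scan steps listed above can be sketched roughly like this (block size and record layout are illustrative, not DataStage internals): with key-sorted data, the right-link scan can stop as soon as it sees a key greater than the one currently being probed.

```python
# Block nested-loop join over key-sorted inputs, following the numbered
# steps: load right-link data block by block, scan for matches, and use
# the sort order to stop early once keys on the right exceed the probe key.

def block_scan_join(left, right, key, block_size=2):
    out = []
    for l in left:                                   # steps 3/7/8: probe each left record
        for start in range(0, len(right), block_size):
            block = right[start:start + block_size]  # steps 2/6.1: load right blocks
            stop = False
            for r in block:
                if r[key] == l[key]:
                    out.append({**l, **r})           # steps 5.1/5.2: join matching records
                elif r[key] > l[key]:
                    stop = True                      # sorted data: no later match possible
                    break
            if stop:
                break
    return out

left = [{"id": 1, "a": "x"}, {"id": 2, "a": "y"}]
right = [{"id": 1, "b": "p"}, {"id": 2, "b": "q"}, {"id": 3, "b": "r"}]
print(block_scan_join(left, right, "id"))
# → [{'id': 1, 'a': 'x', 'b': 'p'}, {'id': 2, 'a': 'y', 'b': 'q'}]
```

The early `stop` is exactly the "Sorting benefits" point: on sorted data the scan never has to run to end-of-data for a key that has already been passed.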
- Zulfi
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Step 3 should be "all records with the same Join Key value are read from the left link". After all, the same join criteria will obtain all of them. This may require more than one block (buffer) of data to be read. Similarly more than one block (buffer) of data may be needed from the right link.
There are slight variations for outer join types.
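The corrected step 3 (reading the whole run of records that share the current join-key value, across block boundaries if necessary) can be sketched like this, using Python's itertools.groupby as a stand-in for the block/buffer reads:

```python
# Reading a sorted stream as runs of equal join-key values: each run is
# gathered in full before joining, regardless of how many blocks it spans.

from itertools import groupby
from operator import itemgetter

def key_groups(sorted_rows, key):
    """Yield (key_value, [rows]) runs from a stream sorted on `key`."""
    for value, run in groupby(sorted_rows, key=itemgetter(key)):
        yield value, list(run)

rows = [{"id": 1}, {"id": 1}, {"id": 2}]
print(list(key_groups(rows, "id")))
# → [(1, [{'id': 1}, {'id': 1}]), (2, [{'id': 2}])]
```

Pairing up one such run from the left link with the matching run from the right link is what lets the stage free each run's memory before moving to the next key value.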
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.