Join stage is not giving expected result.
Moderators: chulett, rschirm, roy
The Join stage has two inputs.
The left set is -
B
C
D
E
F
H
M
N
O
Q
I have right set as -
A,B
B,C
C,D
D,E
E,F
F,G
F,H
H,I
H,J
E,K
E,L
A,M
M,N
N,O
O,P
N,Q
Q,E
Q,R
After Join I expect data as -
B,C
C,D
D,E
E,F
F,G
F,H
H,I
H,J
E,K
E,L
M,N
N,O
O,P
N,Q
Q,E
Q,R
That is, I want to retrieve the right-set records whose first column matches a value in the left set.
Before the Join stage, I sort the data with an explicit sort on the 1st column. In the Join stage I apply an inner join, and the key is the 1st column. The data arriving at the Join stage is hash partitioned and sorted. But the output of the Join stage is not the expected result.
Can anyone please guide me on where I am going wrong?
Thanks with regards,
videsh.
I have a single-node configuration file only. Below are the configuration details for the job, which I extracted from the Director log.
Code:
node "node01"
{
    fastname "machine-name"
    pools ""
    resource disk "/data/ds/node01/resource" {pools ""}
    resource scratchdisk "/data/ds/node01/scratch" {pools ""}
    resource scratchdisk "/data/ds/node01/buffer" {pools "buffer"}
}
Thanks with regards,
videsh.
The output will not come out in exactly the order you listed; it should match the Expected column below.
The discrepancy appears only where a key has several matching rows, i.e. where a cross product is needed. The Join stage should produce that cross product correctly. It is the Merge stage that produces this kind of manipulated output, i.e. the first available update row repeated for all the master data.
Code:
Expected   Actual
B,C        B,C
C,D        C,D
D,E        D,E
E,F        E,F
E,K        E,F
E,L        E,F
F,G        F,G
F,H        F,G
H,I        H,I
H,J        H,I
M,N        M,N
N,O        N,O
N,Q        N,O
O,P        O,P
Q,E        Q,E
Q,R        Q,E
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
Duplication of results can also result from improper partitioning. The Join stage requires its inputs to be identically partitioned using a key-based partitioning algorithm, and sorted on at least these keys.
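The requirement above can be sketched in Python as follows: both inputs go through the same key-based partitioner, each partition is sorted on the key, and a sort-merge join runs partition by partition. This is only an illustration under stated assumptions: `NUM_PARTITIONS`, `partition_of`, `hash_partition` and `merge_join` are made-up names, and `zlib.crc32` merely stands in for whatever hash the engine really uses.

```python
import zlib

NUM_PARTITIONS = 4  # hypothetical partition count, for illustration only

left = ["B", "C", "D", "E", "F", "H", "M", "N", "O", "Q"]
right = [
    ("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("E", "F"), ("F", "G"),
    ("F", "H"), ("H", "I"), ("H", "J"), ("E", "K"), ("E", "L"), ("A", "M"),
    ("M", "N"), ("N", "O"), ("O", "P"), ("N", "Q"), ("Q", "E"), ("Q", "R"),
]


def partition_of(key, n=NUM_PARTITIONS):
    """Deterministic key-based partition number (crc32 is a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % n


def hash_partition(rows, key_of, n=NUM_PARTITIONS):
    """Split rows by key hash, then sort each partition on the key."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[partition_of(key_of(row), n)].append(row)
    return [sorted(part, key=key_of) for part in parts]


def merge_join(left_keys, right_rows):
    """Sort-merge inner join of one partition; both inputs sorted on the key."""
    out, j = [], 0
    for key in left_keys:
        # Skip right rows whose key sorts before the current left key.
        while j < len(right_rows) and right_rows[j][0] < key:
            j += 1
        # Emit every right row that carries the current key.
        k = j
        while k < len(right_rows) and right_rows[k][0] == key:
            out.append(right_rows[k])
            k += 1
    return out


left_parts = hash_partition(left, key_of=lambda k: k)
right_parts = hash_partition(right, key_of=lambda r: r[0])

joined = [
    row
    for lp, rp in zip(left_parts, right_parts)
    for row in merge_join(lp, rp)
]
print(sorted(joined))
```

Because the same function decides the partition for both inputs, rows with equal keys always land in the same partition, so the per-partition joins together return the same rows as a join over the whole data.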
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Hi.
Are you explicitly sorting the datasets before sending them to the Join Stage?
If yes, then are you clearing the partitioning before sending them in for sorting? If not, clear the partitioning in the stage immediately before the sort in each case, and then set the partitioning method in the sort stage.
This will ensure that both the datasets are sorted AND partitioned on the same key.
If you aren't explicitly sorting the datasets, try doing that: although the Join stage is also capable of sorting the data itself, that may not be as effective as an explicit sort.
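To illustrate why both datasets need to be partitioned on the same key, here is a small Python sketch, not DataStage code. It compares a per-partition join when the left input keeps a leftover round-robin partitioning against the case where both inputs are repartitioned on the key. The helper names are hypothetical, and `ord(key) % n` is only a toy key-based partitioner.

```python
left = ["B", "C", "D", "E", "F", "H", "M", "N", "O", "Q"]
right = [
    ("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("E", "F"), ("F", "G"),
    ("F", "H"), ("H", "I"), ("H", "J"), ("E", "K"), ("E", "L"), ("A", "M"),
    ("M", "N"), ("N", "O"), ("O", "P"), ("N", "Q"), ("Q", "E"), ("Q", "R"),
]


def round_robin(rows, n):
    """Deal rows out in turn; the key plays no part in the placement."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts


def by_key(rows, key_of, n):
    """Toy key-based partitioner: equal keys always share a partition."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[ord(key_of(row)[0]) % n].append(row)
    return parts


def join_partitions(left_parts, right_parts):
    """Inner join run independently inside each pair of partitions."""
    out = []
    for lp, rp in zip(left_parts, right_parts):
        keys = set(lp)
        out.extend(row for row in rp if row[0] in keys)
    return out


right_by_key = by_key(right, lambda r: r[0], 2)

# Left keeps an unrelated partitioning: some matching keys never meet.
mismatched = join_partitions(round_robin(left, 2), right_by_key)

# Both inputs partitioned on the join key: every match is found.
matched = join_partitions(by_key(left, lambda k: k, 2), right_by_key)

print(len(matched), len(mismatched))
```

With the mismatched partitioning, a left key and its right rows can sit in different partitions, so that part of the join silently goes missing. On a single-node configuration there is only one partition, so this particular effect would not apply there.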
Regards,
Vivek D. Reddy
__________________________________________
If knowledge can create problems, it is not through ignorance that we can solve them. - Isaac Asimov