Page 1 of 2

Join stage is not giving expected result.

Posted: Tue Mar 06, 2007 7:10 am
by videsh77
To the Join stage :

I have left set as -

B
C
D
E
F
H
M
N
O
Q


I have right set as -

A,B
B,C
C,D
D,E
E,F
F,G
F,H
H,I
H,J
E,K
E,L
A,M
M,N
N,O
O,P
N,Q
Q,E
Q,R

After Join I expect data as -

B,C
C,D
D,E
E,F
F,G
F,H
H,I
H,J
E,K
E,L
M,N
N,O
O,P
N,Q
Q,E
Q,R

i.e. Retrieving those records which has common values as first column.

So before join stage I have data sorted by using explicit sort on 1st column. In the join stage I have applied inner join. Key used is 1st column. Data coming to join stage is hash sorted. But in the output of this join stage I am not getting the expected result.

Can anyone please guide me, where I am going wrong?

Posted: Tue Mar 06, 2007 7:46 am
by DSguru2B
Everything looks ok to me. What is your result? Also try it in sequential order to see if the data is really getting joined or not? You might be having a leading trailing spaces.

Posted: Tue Mar 06, 2007 10:33 pm
by videsh77
Data I am having is in the DataSets. Also I checked the Sequential files, before they are passed on to DataSet.

Checked with cat -v filename in AIX. File seemed to be Ok, didnt find any hidden chars.

Any other hint .... ?

Posted: Tue Mar 06, 2007 10:50 pm
by kumar_s
What is the output that you have got?
Are they both key are of same datatype and length? Varchar type will make most of the hidden issues resolved.

Posted: Tue Mar 06, 2007 11:03 pm
by videsh77
Output which I am getting, is something similar to this -

B,C
C,D
D,E
E,F
E,F
E,F
F,G
F,G
H,I
H,I
M,N
N,O
N,O
O,P
Q,E
Q,E

You may notice, there is duplication of records.

Posted: Tue Mar 06, 2007 11:11 pm
by kumar_s
Are you sure its sorted and Hash partitioned on the keys and maintained as SAME partition to join stage?
Could you post the list of stages used and the partition used in each stage in the job.

Posted: Wed Mar 07, 2007 12:39 am
by videsh77
Exhibit I am having for this job as below :

Right DataSet -> Sort \
Join -> Output DataSet
Left DataSet -> Sort /

Before both the sorts, I have data Hash partitioned on the key col1+col2 for right dataset & col1 for left dataset.

Posted: Wed Mar 07, 2007 12:42 am
by kumar_s
You need to do Hash Partition on Col1 for both right and left dataset on sort stage. And maintain SAME partition in Join stage for both the link.

Posted: Wed Mar 07, 2007 3:51 am
by videsh77
I precisely have the same settings. Is there any intermediate buffer is becoming the culprit ?

Posted: Wed Mar 07, 2007 4:40 am
by kumar_s
What happens if you run in sequential mode or in single node configuration using APT_CONFIG_FILE variable for this particular job?

Posted: Wed Mar 07, 2007 6:26 am
by videsh77
I have one node configuration file only. Below is configuration details for the job, which I extracted from the director log.

Code: Select all

 
node "node01"
{
                fastname "machine-name"
                pools "" 
                resource disk "/data/ds/node01/resource" {pools "" }
                resource scratchdisk "/data/ds/node01/scratch" {pools "" }
                resource scratchdisk "/data/ds/node01/buffer" {pools "buffer"}
        }

Posted: Wed Mar 07, 2007 6:41 am
by kumar_s
Though the output should not be exactly as you expect, it should be following

Code: Select all

Expected Actual
B,C	B,C
C,D	C,D
D,E	D,E
E,F	E,F
E,K	E,F
E,L	E,F
F,G	F,G
F,H	F,G
H,I	H,I
H,J	H,I
M,N	M,N
N,O	N,O
N,Q	N,O
O,P	O,P
Q,E	Q,E
Q,R 	Q,E
The problem was only during cross product. But join should do cross product perfectly. Merge stage does this kind of manipulated output. i.e., output of first available update row for all the Master data.

Posted: Wed Mar 07, 2007 1:18 pm
by ray.wurlod
Duplication of results can also result from improper partitioning. The Join stage requires inputs to be identically partitioned using a key-based partitioning algorithm. And sorted on at least these keys.

Posted: Wed Mar 07, 2007 5:56 pm
by kumar_s
Partitioning in single node is same as running in sequential mode. Why can't you give a try running the job in Sequential mode.

Posted: Fri Mar 09, 2007 5:30 am
by vivekreddy
Hi.

Are you explicitly sorting the datasets before sending them to the Join Stage?

If yes, then are you clearing the partitioning before sending them in for sorting? If not, clear the partitioning in the stage immediately before the sort in each case, and then set the partitioning method in the sort stage.
This will ensure that both the datasets are sorted AND partitioned on the same key.

If you aren't explicitly sorting the datasets, try doing that, as it may be possible that while Join stage also is capable of sorting the data, it may not be as effective as an explicit sort.