Join stage is not giving expected result.

videsh77 · Post by **videsh77** » Tue Mar 06, 2007 7:10 am

To the Join stage :

I have left set as -

B
C
D
E
F
H
M
N
O
Q

I have right set as -

A,B
B,C
C,D
D,E
E,F
F,G
F,H
H,I
H,J
E,K
E,L
A,M
M,N
N,O
O,P
N,Q
Q,E
Q,R

After Join I expect data as -

B,C
C,D
D,E
E,F
F,G
F,H
H,I
H,J
E,K
E,L
M,N
N,O
O,P
N,Q
Q,E
Q,R

i.e. Retrieving those records which has common values as first column.

So before join stage I have data sorted by using explicit sort on 1st column. In the join stage I have applied inner join. Key used is 1st column. Data coming to join stage is hash sorted. But in the output of this join stage I am not getting the expected result.

Can anyone please guide me, where I am going wrong?

DSguru2B · Post by **DSguru2B** » Tue Mar 06, 2007 7:46 am

Everything looks ok to me. What is your result? Also try it in sequential order to see if the data is really getting joined or not? You might be having a leading trailing spaces.

videsh77 · Post by **videsh77** » Tue Mar 06, 2007 10:33 pm

Data I am having is in the DataSets. Also I checked the Sequential files, before they are passed on to DataSet.

Checked with cat -v filename in AIX. File seemed to be Ok, didnt find any hidden chars.

Any other hint .... ?

kumar_s · Post by **kumar_s** » Tue Mar 06, 2007 10:50 pm

What is the output that you have got?
Are they both key are of same datatype and length? Varchar type will make most of the hidden issues resolved.

videsh77 · Post by **videsh77** » Tue Mar 06, 2007 11:03 pm

Output which I am getting, is something similar to this -

B,C
C,D
D,E
E,F
E,F
E,F
F,G
F,G
H,I
H,I
M,N
N,O
N,O
O,P
Q,E
Q,E

You may notice, there is duplication of records.

kumar_s · Post by **kumar_s** » Tue Mar 06, 2007 11:11 pm

Are you sure its sorted and Hash partitioned on the keys and maintained as SAME partition to join stage?
Could you post the list of stages used and the partition used in each stage in the job.

videsh77 · Post by **videsh77** » Wed Mar 07, 2007 12:39 am

Exhibit I am having for this job as below :

Right DataSet -> Sort \
Join -> Output DataSet
Left DataSet -> Sort /

Before both the sorts, I have data Hash partitioned on the key col1+col2 for right dataset & col1 for left dataset.

kumar_s · Post by **kumar_s** » Wed Mar 07, 2007 12:42 am

You need to do Hash Partition on Col1 for both right and left dataset on sort stage. And maintain SAME partition in Join stage for both the link.

videsh77 · Post by **videsh77** » Wed Mar 07, 2007 3:51 am

I precisely have the same settings. Is there any intermediate buffer is becoming the culprit ?

kumar_s · Post by **kumar_s** » Wed Mar 07, 2007 4:40 am

What happens if you run in sequential mode or in single node configuration using APT_CONFIG_FILE variable for this particular job?

videsh77 · Post by **videsh77** » Wed Mar 07, 2007 6:26 am

I have one node configuration file only. Below is configuration details for the job, which I extracted from the director log.

Code: Select all

 
node "node01"
{
                fastname "machine-name"
                pools "" 
                resource disk "/data/ds/node01/resource" {pools "" }
                resource scratchdisk "/data/ds/node01/scratch" {pools "" }
                resource scratchdisk "/data/ds/node01/buffer" {pools "buffer"}
        }

kumar_s · Post by **kumar_s** » Wed Mar 07, 2007 6:41 am

Though the output should not be exactly as you expect, it should be following

Code: Select all

Expected Actual
B,C	B,C
C,D	C,D
D,E	D,E
E,F	E,F
E,K	E,F
E,L	E,F
F,G	F,G
F,H	F,G
H,I	H,I
H,J	H,I
M,N	M,N
N,O	N,O
N,Q	N,O
O,P	O,P
Q,E	Q,E
Q,R 	Q,E

The problem was only during cross product. But join should do cross product perfectly. Merge stage does this kind of manipulated output. i.e., output of first available update row for all the Master data.

ray.wurlod · Post by **ray.wurlod** » Wed Mar 07, 2007 1:18 pm

Duplication of results can also result from improper partitioning. The Join stage requires inputs to be identically partitioned using a key-based partitioning algorithm. And sorted on at least these keys.

kumar_s · Post by **kumar_s** » Wed Mar 07, 2007 5:56 pm

Partitioning in single node is same as running in sequential mode. Why can't you give a try running the job in Sequential mode.

vivekreddy · Post by **vivekreddy** » Fri Mar 09, 2007 5:30 am

Hi.

Are you explicitly sorting the datasets before sending them to the Join Stage?

If yes, then are you clearing the partitioning before sending them in for sorting? If not, clear the partitioning in the stage immediately before the sort in each case, and then set the partitioning method in the sort stage.
This will ensure that both the datasets are sorted AND partitioned on the same key.

If you aren't explicitly sorting the datasets, try doing that, as it may be possible that while Join stage also is capable of sorting the data, it may not be as effective as an explicit sort.