Page 1 of 1

Join stage output quantity mismatch query

Posted: Thu Feb 08, 2007 7:31 am
by vivekreddy
I have a job wherein I am joining two datasets using a join stage and join type as left outer.
I have sorted and partitioned both datasets on the join key.
However, still I am not getting the desired output.

I have around 6670 rows as input to the join stage, but as output, I get only 2099. I thought that 6670 remains the minimum possible when performing left-outer join.

Any suggestions on what I should do?

Re: Join stage output quantity mismatch query

Posted: Thu Feb 08, 2007 9:10 am
by vijayrc
vivekreddy wrote:I have a job wherein I am joining two datasets using a join stage and join type as left outer.
I have sorted and partitioned both datasets on the join key.
However, still I am not getting the desired output.

I have around 6670 rows as input to the join stage, but as output, I get only 2099. I thought that 6670 remains the minimum possible when performing left-outer join.

Any suggestions on what I should do?
Though you have mentioned LEFT OUTER, make sure the files are Link Ordered properly [Left and Right]

Posted: Thu Feb 08, 2007 3:41 pm
by ray.wurlod
It may also be possible that your data are not partitioned based upon the join key (in addition to any other potential cause). Check that also.

Posted: Thu Feb 08, 2007 11:28 pm
by vivekreddy
All done, still not working :oops:

Posted: Thu Feb 08, 2007 11:32 pm
by kumar_s
What type of partition is used?
Just be aware that, 6670 will be output, only if its a Outer Join on the dataset which has 6670 number of records.

Posted: Thu Feb 08, 2007 11:58 pm
by vivekreddy
Entire

Posted: Fri Feb 09, 2007 12:40 am
by kumar_s
Entire wont be a prescribed partition for Join stage. But still this will increase the number of resultant rows and not decrease.
Now explain more on what are the keys, and what is the partition that used on which stage and especially on the join stage, for both the input.
Basically need more details on job design.

Posted: Fri Feb 09, 2007 12:51 am
by vivekreddy
The key is a character field of length 2. In one dataset, partitioning method is Auto, whereas from the other, the left link, it is entire.

Posted: Fri Feb 09, 2007 12:53 am
by vivekreddy
The partitioning method on the Join stage is Entire.

Posted: Fri Feb 09, 2007 2:30 am
by kumar_s
I would suggest, do a hash partition on the Key, well before join, i.e,. the stages where you sort the data. And use same partition till Join stage.
Check if by any chance, you have any unique sort option enabled, and it removes duplicates.