Join stage output quantity mismatch query

vivekreddy · Post by **vivekreddy** » Thu Feb 08, 2007 7:31 am

I have a job in which I join two datasets using a Join stage with the join type set to left outer.
I have sorted and partitioned both datasets on the join key.
However, I am still not getting the expected output.

I have around 6670 rows as input to the Join stage, but I get only 2099 rows as output. I expected 6670 to be the minimum possible output for a left outer join.

Any suggestions on what I should do?

vijayrc · Post by **vijayrc** » Thu Feb 08, 2007 9:10 am

Even though you have specified LEFT OUTER, make sure the link ordering of the two inputs (Left and Right) is correct.
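
To illustrate why link order matters, here is a minimal Python sketch of a left outer join. It is not DataStage code; the function `left_outer_join` and the sample datasets `big` and `small` are made up for illustration. The point is only that the row count is bounded below by whichever input is treated as the left one.

```python
from collections import defaultdict

def left_outer_join(left, right, key):
    """Keep every left row; pair it with each matching right row, or with None."""
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    out = []
    for lrow in left:
        matches = index.get(lrow[key])
        if matches:
            out.extend((lrow, rrow) for rrow in matches)
        else:
            out.append((lrow, None))  # unmatched left row is still emitted
    return out

# Hypothetical data: a large input and a small lookup-like input.
big = [{"k": f"{i % 50:02d}", "id": i} for i in range(6670)]
small = [{"k": f"{i:02d}", "desc": f"code {i}"} for i in range(20)]

as_left = left_outer_join(big, small, "k")   # big input on the left link
swapped = left_outer_join(small, big, "k")   # links in the wrong order

print(len(as_left), len(swapped))
```

With the large input on the left, every one of its rows survives. With the links swapped, only rows whose key exists in the small input come through, so the output can be much smaller than the large input.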

ray.wurlod · Post by **ray.wurlod** » Thu Feb 08, 2007 3:41 pm

It is also possible that your data are not actually partitioned on the join key (in addition to any other potential cause). Check that as well.

vivekreddy · Post by **vivekreddy** » Thu Feb 08, 2007 11:28 pm

All done, still not working.

kumar_s · Post by **kumar_s** » Thu Feb 08, 2007 11:32 pm

What type of partitioning is used?
Be aware that 6670 rows will be output only if the outer join is on the dataset that has 6670 records (that is, if that dataset is the outer side of the join).

vivekreddy · Post by **vivekreddy** » Thu Feb 08, 2007 11:58 pm

Entire

kumar_s · Post by **kumar_s** » Fri Feb 09, 2007 12:40 am

Entire is not a recommended partitioning method for the Join stage. Even so, it would increase the number of output rows, not decrease it.
Please explain what the keys are, and which partitioning method is used on which stage, especially on both inputs of the Join stage.
Basically, we need more details on the job design.
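
The claim that Entire inflates rather than shrinks the output can be sketched in plain Python. This is a simulation under assumptions, not DataStage behaviour verified against the engine: `partition_entire`, `partition_by_key`, the 4-way partition count and the sample rows are all hypothetical. The left input is copied to every partition, the right input is split by key, and a left outer join runs independently in each partition.

```python
import zlib

def partition_entire(rows, n):
    """Entire: every partition receives a full copy of the data."""
    return [list(rows) for _ in range(n)]

def partition_by_key(rows, key, n):
    """Key-based split; crc32 is used so the result is deterministic."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[zlib.crc32(row[key].encode()) % n].append(row)
    return parts

def left_outer_join(left, right, key):
    out = []
    for lrow in left:
        matches = [r for r in right if r[key] == lrow[key]]
        if matches:
            out.extend((lrow, r) for r in matches)
        else:
            out.append((lrow, None))  # no match in THIS partition
    return out

left = [{"k": k, "id": i} for i, k in enumerate(["AA", "AB", "AC", "AA", "AD", "AB"])]
right = [{"k": "AA", "d": 1}, {"k": "AB", "d": 2}, {"k": "AC", "d": 3}]

n = 4
result = []
for lpart, rpart in zip(partition_entire(left, n), partition_by_key(right, "k", n)):
    result.extend(left_outer_join(lpart, rpart, "k"))

print(len(left), len(result))
```

Because each left row is present in every partition, it is emitted at least once per partition, either matched or padded with nulls. In this sketch the output is therefore larger than the left input, never smaller.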

vivekreddy · Post by **vivekreddy** » Fri Feb 09, 2007 12:51 am

The key is a character field of length 2. On one input link the partitioning method is Auto, whereas on the other, the left link, it is Entire.

vivekreddy · Post by **vivekreddy** » Fri Feb 09, 2007 12:53 am

The partitioning method on the Join stage is Entire.

kumar_s · Post by **kumar_s** » Fri Feb 09, 2007 2:30 am

I would suggest hash partitioning on the key well before the join, i.e. in the stages where you sort the data, and keeping that same partitioning up to the Join stage.
Also check whether a unique sort option is enabled anywhere, since that would remove duplicates.
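
Both suggestions can be sketched in plain Python. Again this is an illustration rather than DataStage code, and `hash_partition`, `sort_unique` and the sample rows are assumptions. The first part hash-partitions both inputs on the key and joins each partition separately; the second shows how a sort that keeps only unique keys shrinks a dataset before it ever reaches the join.

```python
import zlib

def hash_partition(rows, key, n):
    """Rows with the same key always land in the same partition."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[zlib.crc32(row[key].encode()) % n].append(row)
    return parts

def left_outer_join(left, right, key):
    out = []
    for lrow in left:
        matches = [r for r in right if r[key] == lrow[key]]
        if matches:
            out.extend((lrow, r) for r in matches)
        else:
            out.append((lrow, None))
    return out

def sort_unique(rows, key):
    """Sort on the key and keep only the first row for each key value."""
    out = []
    for row in sorted(rows, key=lambda r: r[key]):
        if not out or out[-1][key] != row[key]:
            out.append(row)
    return out

left = [{"k": k, "id": i} for i, k in enumerate(["AA", "AB", "AA", "AC", "AB", "AA", "ZZ"])]
right = [{"k": "AA", "desc": "a"}, {"k": "AB", "desc": "b"}]

# 1. Hash partition both inputs on the key, then join partition by partition.
n = 4
joined = []
for lpart, rpart in zip(hash_partition(left, "k", n), hash_partition(right, "k", n)):
    joined.extend(left_outer_join(lpart, rpart, "k"))

# 2. A unique sort on the key collapses the left input to one row per key.
deduped = sort_unique(left, "k")

print(len(left), len(joined), len(deduped))
```

With key-based partitioning on both sides, every left row appears exactly once in the output here, since the right keys are unique. The unique sort, by contrast, reduces the seven left rows to four, which is the kind of silent row loss worth ruling out when a short key has many repeated values.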
