Join stage not recognising match

vivekreddy · Post by **vivekreddy** » Mon Jan 15, 2007 10:59 pm

I have a parallel job wherein I am expected to check whether a record in source data contains valid data.
The format of the record is as follows:
ACCESSID BIGINT 18 NOT NULL
USERID VARCHAR 6 NOT NULL
PURPOSEID SMALLINT 5 NOT NULL
CLIENTID INTEGER 10 NOT NULL
MASTERID INTEGER 10 NOT NULL

Now, I have to check whether the userid, purposeid, clientid and masterid stated in this source data is present in lookup tables, and have used join stage (left outer join mode) to check the same.
Initially, to check whether the job worked for valid data combinations, I populated the source table with values from the lookup tables themselves. However, despite doing this, the output shows that there is no match, and all records in source are rejected, when the expected result is that all be accepted as valid

.

Could anyone tell me what may possibly be the problem?

ray.wurlod · Post by **ray.wurlod** » Mon Jan 15, 2007 11:52 pm

Welcome aboard. :D

Before Join stage will work properly both its inputs much be identically partitioned and sorted on the join key(s). Is this the case in your job?

vivekreddy · Post by **vivekreddy** » Tue Jan 16, 2007 12:27 am

ray.wurlod wrote:Welcome aboard. :D

Before Join stage will work properly both its inputs much be identically partitioned and sorted on the join key(s). Is this the case in your job? ...

Hi.

The inputs are sorted on the join key(s), but the problem still persists.

ray.wurlod · Post by **ray.wurlod** » Tue Jan 16, 2007 12:34 am

What about partitioned on the join key(s)?

vivekreddy · Post by **vivekreddy** » Tue Jan 16, 2007 12:54 am

ray.wurlod wrote:What about partitioned on the join key(s)? ...

Thanks for your assistance.
Partitioning on the join keys worked. But I am not able to understand as to why it is necessary or rather why my not having partitioned the input data set(s) on the join key(s) was causing a problem. Could you please explain the reason(s) for this problem?

kumar_s · Post by **kumar_s** » Tue Jan 16, 2007 2:09 am

Non partitioned or Auto partitioned data will not assure you that identical key data will flow in the same partition.
Say if you have 4 partitions in your system (as decided by you config file), the data will be split in to 4 streams in each active stage. So Join stage will have 4 streams in left and right input. The data will be joined only if the identical key flows in corresponding partition. Thats the reason, Entire partition was recomended for Lookup, to get a match for the key irrespective of the partition.
We have discussed this a lot many times. If you do search, you can find humpty no of posts.