Page 1 of 2

Join behaves strangely

Posted: Tue May 25, 2010 7:56 am
by mjgmc
Hello,

I have a strange behaviour with a Join Stage.

Two input links (Hashed and sorted on one Key column). We have 500 000 records on the left link and over 23 000 000 on the right one (that's why we use a join and not a lookup). After the Join, we realise some records don't match. I can't understand why. We tried to pass on sequential (both links sorted) and we still have the same behaviour (in production server) and we can't reproduce the error in development environment.

The only solution was to pre-filter the right linkwith a lookup on the left before passing it as a reference link to a second Lookup.

First Job (the one who looses information):

| DS1 |
|
|
|
\/
| Left Outer Join | <----------| DS2 |
|
|
|
\/
_____
| DS3 |
-----

Where DS1 and DS2 are two datasets hashed and sorted.

What I tried and worked is:

| DS1 |
|
|
|
\/
| Copy | -------> | Lookup1 |<----------| DS2 |
| |
| |
| |
\/ |
| Lookup2| <---------|
|
|
|
\/
| DS3 |

Where DS2 is the stream input to Lookup1 (Drop) and DS2 is the stream input to Lookup2 (Continue). Entre Copy and Lookup1 there's a Remode Duplicate.

It's not easy to design here, I hope you understand my jobs.

Does anyone has an explanation to the error on my first Job?

Thank you all.[/quote]

Re: Join behaves strangely

Posted: Tue May 25, 2010 8:16 am
by Sainath.Srinivasan
mjgmc wrote:Two input links (Hashed and sorted on one Key column).
How many key columns do you have in total in the dataset ?

Did you sort the data just before the join or somewhere else ?

Re: Join behaves strangely

Posted: Tue May 25, 2010 8:25 am
by mjgmc
Sainath.Srinivasan wrote:
mjgmc wrote:Two input links (Hashed and sorted on one Key column).
How many key columns do you have in total in the dataset ?

Did you sort the data just before the join or som ...
I only have one Key Column both sides. Both datasets are hashed and sorted on previous jobs. And in this job we keep partionned mode "SAME" both sides. But we tried to put Join Stage in Sequential with a sort at the Join Stage (both sides) and it didn't work.

I can't read the rest of your message, so I don't know if you asked something else.

Posted: Tue May 25, 2010 8:57 am
by agpt
are the keys for lookup1 and lookup2 are same?

when you say in the end data don't match, what exactly u mean? do u mean to say you are getting data which you shouldn't or you are not getting those records which should have come into ouput?

Would be good if you can give some data example here

Re: Join behaves strangely

Posted: Tue May 25, 2010 4:07 pm
by vinnz
mjgmc wrote:Both datasets are hashed and sorted on previous jobs. And in this job we keep partionned mode "SAME" both sides.
Are you sure no other job is messing with your datasets after they are hashed and sorted in your previous job and before you read them in the curent job?What happens if you add hash partioning and sorting in your current job instead of using same?

Posted: Wed May 26, 2010 2:20 am
by mjgmc
agpt wrote:are the keys for lookup1 and lookup2 are same?

when you say in the end data don't match, what exactly u mean? do u mean to say you are getting data which you shouldn't or you are not getting those records which should have come into ouput?

Would be good if you can give some data example here
DS1 has plainty of columns, so just a sample:
Code|Group|Detail|
123|"A"|"Hello"
123|"B"|"Bye"
300|"C"|
900|"A"|"World"

DS2 has only two columns:
Code|Key_Col
123|1
300|2
900|3

I want to have on my output:
Code|Group|Detail|Key_Col
123|"A"|"Hello"|1
123|"B"|"Bye"|1
300|"C"||2
900|"A"|"World"|3

The problem is that some of the records have the Key_Col with Null When the Code exists in DS1 and DS2.

Re: Join behaves strangely

Posted: Wed May 26, 2010 2:21 am
by mjgmc
vinnz wrote:
mjgmc wrote:Both datasets are hashed and sorted on previous jobs. And in this job we keep partionned mode "SAME" both sides.
Are you sure no other job is messing with your datasets after they are hashed and sorted in your previous job and before you read them in the curent job?What happens if you add hash partioning and sorting in your current job instead of using same?
We even tried putting the Join Stage Sequential with Sort in both input links and it didn't work

Posted: Wed May 26, 2010 4:07 am
by ray.wurlod
What partitioning are you using on the inputs of the Join stage?

Posted: Wed May 26, 2010 5:11 am
by mjgmc
ray.wurlod wrote:What partitioning are you using on the inputs of the Join stage? ...
At the beginning we had the "Same" partitionning, but as I sad before, we tried everything even passing the Join Stage in Sequential with both sides sorted.

Posted: Wed May 26, 2010 10:29 am
by agpt
Did you try incorporating Null handling for Key_Col?

Posted: Wed May 26, 2010 6:45 pm
by ray.wurlod
As a test, try explicitly specifying Hash as the partitioning algorithm on both inputs of the Join stage, using the join key as the partitioning key (and sorting on these).

Posted: Fri May 28, 2010 2:09 am
by mjgmc
agpt wrote:Did you try incorporating Null handling for Key_Col?
It's a not null column in DS2.

Posted: Fri May 28, 2010 2:24 am
by antonyraj.deva
mjgmc wrote: DS1 has plainty of columns, so just a sample:
Code|Group|Detail|
123|"A"|"Hello"
123|"B"|"Bye"
300|"C"|
900|"A"|"World"

DS2 has only two columns:
Code|Key_Col
123|1
300|2
900|3

I want to have on my output:
Code|Group|Detail|Key_Col
123|"A"|"Hello"|1
123|"B"|"Bye"|1
300|"C"||2
900|"A"|"World"|3

The problem is that some of the records have the Key_Col with Null When the Code exists in DS1 and DS2.
Which column in the sample data is used as the KEY for Join?

Posted: Fri May 28, 2010 6:52 am
by ray.wurlod
Code.

It's the only one with common domain.

Posted: Wed Jun 09, 2010 11:05 am
by agpt
mjgmc wrote:
ray.wurlod wrote:What partitioning are you using on the inputs of the Join stage? ...
At the beginning we had the "Same" partitionning, but as I sad before, we tried everything even passing the Join Stage in Sequential with both sides sorted.
but you only said
"The problem is that some of the records have the Key_Col with Null When the Code exists in DS1 and DS2."

so if it is defined as not null, how there are some values with NULL? :?