Join behaves strangely

mjgmc · Post by **mjgmc** » Tue May 25, 2010 7:56 am

Hello,

I have a strange behaviour with a Join Stage.

Two input links (Hashed and sorted on one Key column). We have 500 000 records on the left link and over 23 000 000 on the right one (that's why we use a join and not a lookup). After the Join, we realise some records don't match. I can't understand why. We tried to pass on sequential (both links sorted) and we still have the same behaviour (in production server) and we can't reproduce the error in development environment.

The only solution was to pre-filter the right linkwith a lookup on the left before passing it as a reference link to a second Lookup.

First Job (the one who looses information):

| DS1 |
|
|
|
\/
| Left Outer Join | <----------| DS2 |
|
|
|
\/
_____
| DS3 |
-----

Where DS1 and DS2 are two datasets hashed and sorted.

What I tried and worked is:

| DS1 |
|
|
|
\/
| Copy | -------> | Lookup1 |<----------| DS2 |
| |
| |
| |
\/ |
| Lookup2| <---------|
|
|
|
\/
| DS3 |

Where DS2 is the stream input to Lookup1 (Drop) and DS2 is the stream input to Lookup2 (Continue). Entre Copy and Lookup1 there's a Remode Duplicate.

It's not easy to design here, I hope you understand my jobs.

Does anyone has an explanation to the error on my first Job?

Thank you all.[/quote]

Sainath.Srinivasan · Post by **Sainath.Srinivasan** » Tue May 25, 2010 8:16 am

mjgmc wrote:Two input links (Hashed and sorted on one Key column).

How many key columns do you have in total in the dataset ?

Did you sort the data just before the join or somewhere else ?

mjgmc · Post by **mjgmc** » Tue May 25, 2010 8:25 am

Sainath.Srinivasan wrote:
mjgmc wrote:Two input links (Hashed and sorted on one Key column).
How many key columns do you have in total in the dataset ?

Did you sort the data just before the join or som ...

I only have one Key Column both sides. Both datasets are hashed and sorted on previous jobs. And in this job we keep partionned mode "SAME" both sides. But we tried to put Join Stage in Sequential with a sort at the Join Stage (both sides) and it didn't work.

I can't read the rest of your message, so I don't know if you asked something else.

agpt · Post by **agpt** » Tue May 25, 2010 8:57 am

are the keys for lookup1 and lookup2 are same?

when you say in the end data don't match, what exactly u mean? do u mean to say you are getting data which you shouldn't or you are not getting those records which should have come into ouput?

Would be good if you can give some data example here

vinnz · Post by **vinnz** » Tue May 25, 2010 4:07 pm

mjgmc wrote:Both datasets are hashed and sorted on previous jobs. And in this job we keep partionned mode "SAME" both sides.

Are you sure no other job is messing with your datasets after they are hashed and sorted in your previous job and before you read them in the curent job?What happens if you add hash partioning and sorting in your current job instead of using same?

mjgmc · Post by **mjgmc** » Wed May 26, 2010 2:20 am

agpt wrote:are the keys for lookup1 and lookup2 are same?

when you say in the end data don't match, what exactly u mean? do u mean to say you are getting data which you shouldn't or you are not getting those records which should have come into ouput?

Would be good if you can give some data example here

DS1 has plainty of columns, so just a sample:
Code|Group|Detail|
123|"A"|"Hello"
123|"B"|"Bye"
300|"C"|
900|"A"|"World"

DS2 has only two columns:
Code|Key_Col
123|1
300|2
900|3

I want to have on my output:
Code|Group|Detail|Key_Col
123|"A"|"Hello"|1
123|"B"|"Bye"|1
300|"C"||2
900|"A"|"World"|3

The problem is that some of the records have the Key_Col with Null When the Code exists in DS1 and DS2.

mjgmc · Post by **mjgmc** » Wed May 26, 2010 2:21 am

vinnz wrote:
mjgmc wrote:Both datasets are hashed and sorted on previous jobs. And in this job we keep partionned mode "SAME" both sides.
Are you sure no other job is messing with your datasets after they are hashed and sorted in your previous job and before you read them in the curent job?What happens if you add hash partioning and sorting in your current job instead of using same?

We even tried putting the Join Stage Sequential with Sort in both input links and it didn't work

ray.wurlod · Post by **ray.wurlod** » Wed May 26, 2010 4:07 am

What partitioning are you using on the inputs of the Join stage?

mjgmc · Post by **mjgmc** » Wed May 26, 2010 5:11 am

ray.wurlod wrote:What partitioning are you using on the inputs of the Join stage? ...

At the beginning we had the "Same" partitionning, but as I sad before, we tried everything even passing the Join Stage in Sequential with both sides sorted.

agpt · Post by **agpt** » Wed May 26, 2010 10:29 am

Did you try incorporating Null handling for Key_Col?

ray.wurlod · Post by **ray.wurlod** » Wed May 26, 2010 6:45 pm

As a test, try explicitly specifying Hash as the partitioning algorithm on both inputs of the Join stage, using the join key as the partitioning key (and sorting on these).

mjgmc · Post by **mjgmc** » Fri May 28, 2010 2:09 am

agpt wrote:Did you try incorporating Null handling for Key_Col?

It's a not null column in DS2.

antonyraj.deva · Post by **antonyraj.deva** » Fri May 28, 2010 2:24 am

mjgmc wrote: DS1 has plainty of columns, so just a sample:
Code|Group|Detail|
123|"A"|"Hello"
123|"B"|"Bye"
300|"C"|
900|"A"|"World"

DS2 has only two columns:
Code|Key_Col
123|1
300|2
900|3

I want to have on my output:
Code|Group|Detail|Key_Col
123|"A"|"Hello"|1
123|"B"|"Bye"|1
300|"C"||2
900|"A"|"World"|3

The problem is that some of the records have the Key_Col with Null When the Code exists in DS1 and DS2.

Which column in the sample data is used as the KEY for Join?

ray.wurlod · Post by **ray.wurlod** » Fri May 28, 2010 6:52 am

Code.

It's the only one with common domain.

agpt · Post by **agpt** » Wed Jun 09, 2010 11:05 am

mjgmc wrote:
ray.wurlod wrote:What partitioning are you using on the inputs of the Join stage? ...
At the beginning we had the "Same" partitionning, but as I sad before, we tried everything even passing the Join Stage in Sequential with both sides sorted.

but you only said
"The problem is that some of the records have the Key_Col with Null When the Code exists in DS1 and DS2."

so if it is defined as not null, how there are some values with NULL?

DSXchange

Join behaves strangely

Join behaves strangely

Re: Join behaves strangely

Re: Join behaves strangely

Re: Join behaves strangely

Re: Join behaves strangely