Join behaves strangely

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

mjgmc
Participant
Posts: 52
Joined: Thu Nov 25, 2004 8:06 am

Join behaves strangely

Post by mjgmc »

Hello,

I have a strange behaviour with a Join Stage.

Two input links (Hashed and sorted on one Key column). We have 500 000 records on the left link and over 23 000 000 on the right one (that's why we use a join and not a lookup). After the Join, we realise some records don't match. I can't understand why. We tried to pass on sequential (both links sorted) and we still have the same behaviour (in production server) and we can't reproduce the error in development environment.

The only solution was to pre-filter the right linkwith a lookup on the left before passing it as a reference link to a second Lookup.

First Job (the one who looses information):

| DS1 |
|
|
|
\/
| Left Outer Join | <----------| DS2 |
|
|
|
\/
_____
| DS3 |
-----

Where DS1 and DS2 are two datasets hashed and sorted.

What I tried and worked is:

| DS1 |
|
|
|
\/
| Copy | -------> | Lookup1 |<----------| DS2 |
| |
| |
| |
\/ |
| Lookup2| <---------|
|
|
|
\/
| DS3 |

Where DS2 is the stream input to Lookup1 (Drop) and DS2 is the stream input to Lookup2 (Continue). Entre Copy and Lookup1 there's a Remode Duplicate.

It's not easy to design here, I hope you understand my jobs.

Does anyone has an explanation to the error on my first Job?

Thank you all.[/quote]
Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Re: Join behaves strangely

Post by Sainath.Srinivasan »

mjgmc wrote:Two input links (Hashed and sorted on one Key column).
How many key columns do you have in total in the dataset ?

Did you sort the data just before the join or somewhere else ?
mjgmc
Participant
Posts: 52
Joined: Thu Nov 25, 2004 8:06 am

Re: Join behaves strangely

Post by mjgmc »

Sainath.Srinivasan wrote:
mjgmc wrote:Two input links (Hashed and sorted on one Key column).
How many key columns do you have in total in the dataset ?

Did you sort the data just before the join or som ...
I only have one Key Column both sides. Both datasets are hashed and sorted on previous jobs. And in this job we keep partionned mode "SAME" both sides. But we tried to put Join Stage in Sequential with a sort at the Join Stage (both sides) and it didn't work.

I can't read the rest of your message, so I don't know if you asked something else.
agpt
Participant
Posts: 151
Joined: Sun May 16, 2010 12:53 am

Post by agpt »

are the keys for lookup1 and lookup2 are same?

when you say in the end data don't match, what exactly u mean? do u mean to say you are getting data which you shouldn't or you are not getting those records which should have come into ouput?

Would be good if you can give some data example here
vinnz
Participant
Posts: 92
Joined: Tue Feb 17, 2004 9:23 pm

Re: Join behaves strangely

Post by vinnz »

mjgmc wrote:Both datasets are hashed and sorted on previous jobs. And in this job we keep partionned mode "SAME" both sides.
Are you sure no other job is messing with your datasets after they are hashed and sorted in your previous job and before you read them in the curent job?What happens if you add hash partioning and sorting in your current job instead of using same?
mjgmc
Participant
Posts: 52
Joined: Thu Nov 25, 2004 8:06 am

Post by mjgmc »

agpt wrote:are the keys for lookup1 and lookup2 are same?

when you say in the end data don't match, what exactly u mean? do u mean to say you are getting data which you shouldn't or you are not getting those records which should have come into ouput?

Would be good if you can give some data example here
DS1 has plainty of columns, so just a sample:
Code|Group|Detail|
123|"A"|"Hello"
123|"B"|"Bye"
300|"C"|
900|"A"|"World"

DS2 has only two columns:
Code|Key_Col
123|1
300|2
900|3

I want to have on my output:
Code|Group|Detail|Key_Col
123|"A"|"Hello"|1
123|"B"|"Bye"|1
300|"C"||2
900|"A"|"World"|3

The problem is that some of the records have the Key_Col with Null When the Code exists in DS1 and DS2.
mjgmc
Participant
Posts: 52
Joined: Thu Nov 25, 2004 8:06 am

Re: Join behaves strangely

Post by mjgmc »

vinnz wrote:
mjgmc wrote:Both datasets are hashed and sorted on previous jobs. And in this job we keep partionned mode "SAME" both sides.
Are you sure no other job is messing with your datasets after they are hashed and sorted in your previous job and before you read them in the curent job?What happens if you add hash partioning and sorting in your current job instead of using same?
We even tried putting the Join Stage Sequential with Sort in both input links and it didn't work
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

What partitioning are you using on the inputs of the Join stage?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
mjgmc
Participant
Posts: 52
Joined: Thu Nov 25, 2004 8:06 am

Post by mjgmc »

ray.wurlod wrote:What partitioning are you using on the inputs of the Join stage? ...
At the beginning we had the "Same" partitionning, but as I sad before, we tried everything even passing the Join Stage in Sequential with both sides sorted.
agpt
Participant
Posts: 151
Joined: Sun May 16, 2010 12:53 am

Post by agpt »

Did you try incorporating Null handling for Key_Col?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

As a test, try explicitly specifying Hash as the partitioning algorithm on both inputs of the Join stage, using the join key as the partitioning key (and sorting on these).
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
mjgmc
Participant
Posts: 52
Joined: Thu Nov 25, 2004 8:06 am

Post by mjgmc »

agpt wrote:Did you try incorporating Null handling for Key_Col?
It's a not null column in DS2.
antonyraj.deva
Premium Member
Premium Member
Posts: 138
Joined: Wed Jul 16, 2008 9:51 pm
Location: Kolkata

Post by antonyraj.deva »

mjgmc wrote: DS1 has plainty of columns, so just a sample:
Code|Group|Detail|
123|"A"|"Hello"
123|"B"|"Bye"
300|"C"|
900|"A"|"World"

DS2 has only two columns:
Code|Key_Col
123|1
300|2
900|3

I want to have on my output:
Code|Group|Detail|Key_Col
123|"A"|"Hello"|1
123|"B"|"Bye"|1
300|"C"||2
900|"A"|"World"|3

The problem is that some of the records have the Key_Col with Null When the Code exists in DS1 and DS2.
Which column in the sample data is used as the KEY for Join?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Code.

It's the only one with common domain.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
agpt
Participant
Posts: 151
Joined: Sun May 16, 2010 12:53 am

Post by agpt »

mjgmc wrote:
ray.wurlod wrote:What partitioning are you using on the inputs of the Join stage? ...
At the beginning we had the "Same" partitionning, but as I sad before, we tried everything even passing the Join Stage in Sequential with both sides sorted.
but you only said
"The problem is that some of the records have the Key_Col with Null When the Code exists in DS1 and DS2."

so if it is defined as not null, how there are some values with NULL? :?
Post Reply