Join stage is not giving expected result.

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

videsh77
Premium Member
Premium Member
Posts: 97
Joined: Thu Dec 02, 2004 10:43 am
Contact:

Join stage is not giving expected result.

Post by videsh77 »

To the Join stage :

I have left set as -

B
C
D
E
F
H
M
N
O
Q


I have right set as -

A,B
B,C
C,D
D,E
E,F
F,G
F,H
H,I
H,J
E,K
E,L
A,M
M,N
N,O
O,P
N,Q
Q,E
Q,R

After Join I expect data as -

B,C
C,D
D,E
E,F
F,G
F,H
H,I
H,J
E,K
E,L
M,N
N,O
O,P
N,Q
Q,E
Q,R

i.e. Retrieving those records which has common values as first column.

So before join stage I have data sorted by using explicit sort on 1st column. In the join stage I have applied inner join. Key used is 1st column. Data coming to join stage is hash sorted. But in the output of this join stage I am not getting the expected result.

Can anyone please guide me, where I am going wrong?
Thanks with regards,
videsh.
DSguru2B
Charter Member
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

Everything looks ok to me. What is your result? Also try it in sequential order to see if the data is really getting joined or not? You might be having a leading trailing spaces.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
videsh77
Premium Member
Premium Member
Posts: 97
Joined: Thu Dec 02, 2004 10:43 am
Contact:

Post by videsh77 »

Data I am having is in the DataSets. Also I checked the Sequential files, before they are passed on to DataSet.

Checked with cat -v filename in AIX. File seemed to be Ok, didnt find any hidden chars.

Any other hint .... ?
Thanks with regards,
videsh.
kumar_s
Charter Member
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

What is the output that you have got?
Are they both key are of same datatype and length? Varchar type will make most of the hidden issues resolved.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
videsh77
Premium Member
Premium Member
Posts: 97
Joined: Thu Dec 02, 2004 10:43 am
Contact:

Post by videsh77 »

Output which I am getting, is something similar to this -

B,C
C,D
D,E
E,F
E,F
E,F
F,G
F,G
H,I
H,I
M,N
N,O
N,O
O,P
Q,E
Q,E

You may notice, there is duplication of records.
Thanks with regards,
videsh.
kumar_s
Charter Member
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Are you sure its sorted and Hash partitioned on the keys and maintained as SAME partition to join stage?
Could you post the list of stages used and the partition used in each stage in the job.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
videsh77
Premium Member
Premium Member
Posts: 97
Joined: Thu Dec 02, 2004 10:43 am
Contact:

Post by videsh77 »

Exhibit I am having for this job as below :

Right DataSet -> Sort \
Join -> Output DataSet
Left DataSet -> Sort /

Before both the sorts, I have data Hash partitioned on the key col1+col2 for right dataset & col1 for left dataset.
Thanks with regards,
videsh.
kumar_s
Charter Member
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

You need to do Hash Partition on Col1 for both right and left dataset on sort stage. And maintain SAME partition in Join stage for both the link.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
videsh77
Premium Member
Premium Member
Posts: 97
Joined: Thu Dec 02, 2004 10:43 am
Contact:

Post by videsh77 »

I precisely have the same settings. Is there any intermediate buffer is becoming the culprit ?
Thanks with regards,
videsh.
kumar_s
Charter Member
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

What happens if you run in sequential mode or in single node configuration using APT_CONFIG_FILE variable for this particular job?
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
videsh77
Premium Member
Premium Member
Posts: 97
Joined: Thu Dec 02, 2004 10:43 am
Contact:

Post by videsh77 »

I have one node configuration file only. Below is configuration details for the job, which I extracted from the director log.

Code: Select all

 
node "node01"
{
                fastname "machine-name"
                pools "" 
                resource disk "/data/ds/node01/resource" {pools "" }
                resource scratchdisk "/data/ds/node01/scratch" {pools "" }
                resource scratchdisk "/data/ds/node01/buffer" {pools "buffer"}
        }
Thanks with regards,
videsh.
kumar_s
Charter Member
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Though the output should not be exactly as you expect, it should be following

Code: Select all

Expected Actual
B,C	B,C
C,D	C,D
D,E	D,E
E,F	E,F
E,K	E,F
E,L	E,F
F,G	F,G
F,H	F,G
H,I	H,I
H,J	H,I
M,N	M,N
N,O	N,O
N,Q	N,O
O,P	O,P
Q,E	Q,E
Q,R 	Q,E
The problem was only during cross product. But join should do cross product perfectly. Merge stage does this kind of manipulated output. i.e., output of first available update row for all the Master data.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Duplication of results can also result from improper partitioning. The Join stage requires inputs to be identically partitioned using a key-based partitioning algorithm. And sorted on at least these keys.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
kumar_s
Charter Member
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Partitioning in single node is same as running in sequential mode. Why can't you give a try running the job in Sequential mode.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
vivekreddy
Participant
Posts: 43
Joined: Mon Jan 15, 2007 10:53 pm

Post by vivekreddy »

Hi.

Are you explicitly sorting the datasets before sending them to the Join Stage?

If yes, then are you clearing the partitioning before sending them in for sorting? If not, clear the partitioning in the stage immediately before the sort in each case, and then set the partitioning method in the sort stage.
This will ensure that both the datasets are sorted AND partitioned on the same key.

If you aren't explicitly sorting the datasets, try doing that, as it may be possible that while Join stage also is capable of sorting the data, it may not be as effective as an explicit sort.
Regards,
Vivek D. Reddy

__________________________________________
If knowledge can create problems, it is not through ignorance that we can solve them. - Isaac Asimov
Post Reply