Strange problem about join stage

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

xinhuang66
Participant
Posts: 161
Joined: Wed Aug 02, 2006 4:30 am

Strange problem about join stage

Post by xinhuang66 »

I ran into a very strange problem today with the Join stage.

Using an inner join:

The left input, dataset A, has 5M rows.
I copy dataset A to an Aggregator, group by keya (counting rows, since I need to filter some of them out).

The output goes to the right input, dataset B, which has 830 rows. I also ran a Remove Duplicates stage on it, and it still has 830 rows.

The inner join of dataset A and dataset B outputs 5.1M rows? I couldn't believe it: if there are no duplicates in the right input, the output cannot exceed 5M rows.

And if I check the Unique option in the Sort stage for the right dataset B, the result set is correct: 5M rows.

It looks as though there are some duplicates in the right dataset B, but I did remove duplicates and still get the same number of rows.

And if the right input dataset B has no duplicates, what difference does checking or unchecking Unique in the Sort stage for the right input make?

Can anybody help? Thanks a lot!
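To illustrate the effect (a minimal plain-Python sketch of inner-join semantics, not my actual DataStage job):

```python
from collections import defaultdict

# Hypothetical illustration (not DataStage): an inner join emits one output
# row per matching (left, right) pair, so a single duplicated key on the
# right side duplicates every left row carrying that key.

def inner_join(left, right, key):
    """Join two lists of dicts on `key` using a hash index on the right side."""
    index = defaultdict(list)
    for r in right:
        index[r[key]].append(r)
    return [{**l, **r} for l in left for r in index[l[key]]]

left = [{"keya": k} for k in ("A", "B", "C")] * 2   # 6 left rows
right_unique = [{"keya": "A"}, {"keya": "B"}, {"keya": "C"}]
right_dup = right_unique + [{"keya": "A"}]          # one duplicate key

print(len(inner_join(left, right_unique, "keya")))  # 6 - matches left count
print(len(inner_join(left, right_dup, "keya")))     # 8 - inflated by the dup
```

So even one stray duplicate key on the 830-row side would explain the extra ~100K output rows.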
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Are you partitioning and sorting on the join keys? This is necessary if the join stage is to produce correct results.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
xinhuang66
Participant
Posts: 161
Joined: Wed Aug 02, 2006 4:30 am

Post by xinhuang66 »

Yes, I use hash partitioning and sort on the join key column.

The very strange thing is that if I choose Unique for the right link in the Sort stage, the result is correct.

But duplicates were already removed from the right input link by a Remove Duplicates stage.
michaeld
Premium Member
Premium Member
Posts: 88
Joined: Tue Apr 04, 2006 8:42 am
Location: Toronto, Canada

Post by michaeld »

Maybe you have case-sensitive matching selected in the Remove Duplicates stage but not in the Join stage.
Mike
xinhuang66
Participant
Posts: 161
Joined: Wed Aug 02, 2006 4:30 am

Post by xinhuang66 »

I spent around 3 hours, used a Change Capture stage, and found the problem.

I have some duplicate keys (actually only 1 row) in my right input, which is supposed to be unique. Two strange things:

1) I used a Remove Duplicates stage on the right input and it didn't remove the duplicate, but if I check the Unique option in the Sort stage, it works.

2) Why is the output of the Aggregator not unique on the key? I can't understand it.

Can somebody help?
xinhuang66
Participant
Posts: 161
Joined: Wed Aug 02, 2006 4:30 am

Post by xinhuang66 »

An update on the problem. I am still working on it and have found some hints.

If I put a Remove Duplicates stage in front of the Aggregator, the output is what I want.

But if I instead rely on the Unique option of the sort on the Aggregator's input link, the Aggregator's output is not right: there is one duplicate key row.

So what is the difference between these two methods of getting key-unique input for the Aggregator?

Can somebody help?

Thanks a lot!
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Aggregator will only remove duplicates if you group on that column AND ONLY THAT COLUMN.

Does that help to explain?
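In other words (a rough Python analogy of grouping behaviour, not the parallel Aggregator itself):

```python
# Rough analogy (not DataStage): grouping emits one output row per distinct
# grouping-key combination, so rows are only "deduplicated" with respect to
# the full set of group keys, not any single column.
from collections import Counter

rows = [("1", "a"), ("1", "a"), ("1", "b")]

# Group on id1 only: one output row per id1 value.
by_id1 = Counter(id1 for id1, _ in rows)
print(by_id1)          # Counter({'1': 3}) - a single row for id1 = 1

# Group on (id1, id2): one output row per distinct pair - the two
# ('1', 'a') inputs collapse, but id1 = 1 still appears twice.
by_pair = Counter(rows)
print(by_pair)         # Counter({('1', 'a'): 2, ('1', 'b'): 1})
```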
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
xinhuang66
Participant
Posts: 161
Joined: Wed Aug 02, 2006 4:30 am

Post by xinhuang66 »

My case is like this:
id1, id2
1, a
1, a
1, b
1, b
1, c
2, a
3, a
4, c

I need to remove any id1 that has more than 2 different id2 values. It is a little bit tricky.

What I did first is remove duplicates based on (id1, id2), hoping to get:
id1,id2
1,a
1,b
1,c
2,a
3,a
4,c

Then I use an Aggregator to count rows grouped by id1, hoping to get:
1, 3
2, 1
3, 1
4, 1

Then I can filter out any id1 with rowCount > 2 and keep the remaining id1 values.
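The whole intended flow, as a rough plain-Python sketch (not the actual parallel job):

```python
# Sketch of the intended job logic:
# 1) remove duplicate (id1, id2) pairs, 2) count rows per id1,
# 3) drop any id1 with more than 2 distinct id2 values.
from collections import Counter

rows = [("1", "a"), ("1", "a"), ("1", "b"), ("1", "b"),
        ("1", "c"), ("2", "a"), ("3", "a"), ("4", "c")]

distinct_pairs = sorted(set(rows))                   # step 1: remove duplicates
counts = Counter(id1 for id1, _ in distinct_pairs)   # step 2: count per id1
keep = sorted(id1 for id1, n in counts.items() if n <= 2)  # step 3: filter

print(distinct_pairs)  # [('1','a'), ('1','b'), ('1','c'), ('2','a'), ('3','a'), ('4','c')]
print(dict(counts))    # {'1': 3, '2': 1, '3': 1, '4': 1}
print(keep)            # ['2', '3', '4']
```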

My problem is actually in step 1. When I remove duplicates with a Remove Duplicates stage, it is fine.

But if I use the Unique option of the sort on the Aggregator's input link to get uniqueness on both id1 and id2, the Aggregator output always has one duplicate row on id1. I don't know why; it took me a lot of time to find it.

Can anybody explain and help?

I hope I have described my case clearly.