Strange problem about join stage

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

xinhuang66
Participant
Posts: 161
Joined: Wed Aug 02, 2006 4:30 am

Strange problem about join stage

Post by xinhuang66 »

I ran into a very strange problem today with the Join stage.

Using an inner join:

The left input, dataset A, has 5M rows.
I copy dataset A to an Aggregator, group by keya (counting rows, since I need to filter some of them out).

The output goes to the right input, dataset B, which has 830 rows. I also ran a Remove Duplicates stage on it, and it still has 830 rows.

The inner join of dataset A and dataset B outputs 5.1M rows? I couldn't believe it: if there are no duplicates in the right input, the output cannot exceed 5M rows.

And if I check the Unique option in the Sort stage for the right dataset B, the result set is correct: 5M rows.

It looks as though there are some duplicates in the right dataset B, but I did remove duplicates and still get the same number of rows.

And if the right input dataset B has no duplicates, what difference does checking or unchecking Unique in the Sort stage for the right input make?

Can anybody help? Thanks a lot!
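To illustrate the effect (a minimal plain-Python sketch of inner-join semantics, not my actual DataStage job):

```python
from collections import defaultdict

# Hypothetical illustration (not DataStage): an inner join emits one output
# row per matching (left, right) pair, so a single duplicated key on the
# right side duplicates every left row carrying that key.

def inner_join(left, right, key):
    """Join two lists of dicts on `key` using a hash index on the right side."""
    index = defaultdict(list)
    for r in right:
        index[r[key]].append(r)
    return [{**l, **r} for l in left for r in index[l[key]]]

left = [{"keya": k} for k in ("A", "B", "C")] * 2   # 6 left rows
right_unique = [{"keya": "A"}, {"keya": "B"}, {"keya": "C"}]
right_dup = right_unique + [{"keya": "A"}]          # one duplicate key

print(len(inner_join(left, right_unique, "keya")))  # 6 - matches left count
print(len(inner_join(left, right_dup, "keya")))     # 8 - inflated by the dup
```

So even one stray duplicate key on the 830-row side would explain the extra ~100K output rows.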
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Are you partitioning and sorting on the join keys? This is necessary if the join stage is to produce correct results.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
xinhuang66
Participant
Posts: 161
Joined: Wed Aug 02, 2006 4:30 am

Post by xinhuang66 »

Yes, I use hash partitioning and sort on the join key column.

The very strange thing is that if I choose Unique for the right link in the Sort stage, the result is correct.

But duplicates were already removed from the right input link by a Remove Duplicates stage.
michaeld
Premium Member
Premium Member
Posts: 88
Joined: Tue Apr 04, 2006 8:42 am
Location: Toronto, Canada

Post by michaeld »

Maybe you have case-sensitive matching selected in the Remove Duplicates stage but not in the Join stage.
Mike
xinhuang66
Participant
Posts: 161
Joined: Wed Aug 02, 2006 4:30 am

Post by xinhuang66 »

I spent around 3 hours, used a Change Capture stage, and found the problem.

I have some duplicate keys (actually only 1 row) in my right input, which is supposed to be unique. Two strange things:

1) I used a Remove Duplicates stage on the right input and it didn't remove the duplicate, but if I check the Unique option in the Sort stage, it works.

2) Why is the output of the Aggregator not unique on the key? I can't understand it.

Can somebody help?
xinhuang66
Participant
Posts: 161
Joined: Wed Aug 02, 2006 4:30 am

Post by xinhuang66 »

An update on the problem. I am still working on it and have found some hints.

If I put a Remove Duplicates stage in front of the Aggregator, the output is what I want.

But if I instead rely on the Unique option of the sort on the Aggregator's input link, the Aggregator's output is not right: there is one duplicate key row.

So what is the difference between these two methods of getting key-unique input for the Aggregator?

Can somebody help?

Thanks a lot!
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Aggregator will only remove duplicates if you group on that column AND ONLY THAT COLUMN.

Does that help to explain?
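In other words (a rough Python analogy of grouping behaviour, not the parallel Aggregator itself):

```python
# Rough analogy (not DataStage): grouping emits one output row per distinct
# grouping-key combination, so rows are only "deduplicated" with respect to
# the full set of group keys, not any single column.
from collections import Counter

rows = [("1", "a"), ("1", "a"), ("1", "b")]

# Group on id1 only: one output row per id1 value.
by_id1 = Counter(id1 for id1, _ in rows)
print(by_id1)          # Counter({'1': 3}) - a single row for id1 = 1

# Group on (id1, id2): one output row per distinct pair - the two
# ('1', 'a') inputs collapse, but id1 = 1 still appears twice.
by_pair = Counter(rows)
print(by_pair)         # Counter({('1', 'a'): 2, ('1', 'b'): 1})
```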
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
xinhuang66
Participant
Posts: 161
Joined: Wed Aug 02, 2006 4:30 am

Post by xinhuang66 »

My case is like this:
id1, id2
1, a
1, a
1, b
1, b
1, c
2, a
3, a
4, c

I need to remove any id1 that has more than 2 different id2 values. It is a little bit tricky.

What I did first is remove duplicates based on (id1, id2), hoping to get:
id1,id2
1,a
1,b
1,c
2,a
3,a
4,c

Then I use an Aggregator to count rows grouped by id1, hoping to get:
1, 3
2, 1
3, 1
4, 1

Then I can filter out any id1 with rowCount > 2 and keep the remaining id1 values.
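The whole intended flow, as a rough plain-Python sketch (not the actual parallel job):

```python
# Sketch of the intended job logic:
# 1) remove duplicate (id1, id2) pairs, 2) count rows per id1,
# 3) drop any id1 with more than 2 distinct id2 values.
from collections import Counter

rows = [("1", "a"), ("1", "a"), ("1", "b"), ("1", "b"),
        ("1", "c"), ("2", "a"), ("3", "a"), ("4", "c")]

distinct_pairs = sorted(set(rows))                   # step 1: remove duplicates
counts = Counter(id1 for id1, _ in distinct_pairs)   # step 2: count per id1
keep = sorted(id1 for id1, n in counts.items() if n <= 2)  # step 3: filter

print(distinct_pairs)  # [('1','a'), ('1','b'), ('1','c'), ('2','a'), ('3','a'), ('4','c')]
print(dict(counts))    # {'1': 3, '2': 1, '3': 1, '4': 1}
print(keep)            # ['2', '3', '4']
```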

My problem is actually in step 1. When I remove duplicates with a Remove Duplicates stage, it is fine.

But if I use the Unique option of the sort on the Aggregator's input link to get uniqueness on both id1 and id2, the Aggregator output always has one duplicate row on id1. I don't know why; it took me a lot of time to find it.

Can anybody explain and help?

I hope I have described my case clearly.