Remove Duplicates not removing duplicates

Posted: Sat Jul 10, 2010 6:52 pm
by pillip
Hi,

I have an issue with the Remove Duplicates stage in my job. It is not removing duplicates based on the key, even though identical key values are coming in.
Could you please let me know why this could happen?

Thanks

Posted: Sat Jul 10, 2010 7:21 pm
by chulett
Is your data sorted properly?

Re: Remove Duplicates not removing duplicates

Posted: Sat Jul 10, 2010 7:25 pm
by kwwilliams
Two ways I can think of:

1. The data is not hash partitioned on the key, so records with the same key value do not end up on the same partition.
2. The data is not sorted properly. The Remove Duplicates stage only removes duplicates when the data is sorted by the key, because it essentially removes duplicate records that are located one after another.

So is the data sorted and hash partitioned correctly?
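
To illustrate point 2, here is a minimal Python sketch of "remove duplicates that sit one after another". It is not DataStage code; the function name `remove_adjacent_duplicates` and the sample rows are made up for the example, and it only assumes the stage compares each record with its neighbour on the key.

```python
from itertools import groupby

def remove_adjacent_duplicates(rows, key):
    """Keep the first row of each run of consecutive rows with the same key."""
    return [next(group) for _, group in groupby(rows, key=key)]

# Hypothetical sample data: the first field is the key.
rows = [("A", 1), ("B", 2), ("A", 3)]
get_key = lambda row: row[0]

# Unsorted input: the two "A" rows are not adjacent, so both survive.
unsorted_result = remove_adjacent_duplicates(rows, get_key)

# Sorted input: the "A" rows become adjacent, so only the first is kept.
sorted_result = remove_adjacent_duplicates(sorted(rows, key=get_key), get_key)

print(unsorted_result)
print(sorted_result)
```

With unsorted input all three rows come out; once the input is sorted on the key, only one row per key remains.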

Posted: Sun Jul 11, 2010 3:31 am
by ray.wurlod
Keith has it. Your data are not partitioned correctly or not sorted correctly or both.

Posted: Sun Jul 11, 2010 6:49 am
by chulett
Figured I'd start with sorted and go from there. :wink:

Posted: Sun Jul 11, 2010 7:31 pm
by pillip
chulett wrote: Figured I'd start with sorted and go from there. :wink:
Should the data be sorted and hash partitioned or just sorted?


Thanks

Posted: Sun Jul 11, 2010 8:11 pm
by chulett
Unless you are running on a single node, you need both.
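
A rough Python sketch of why more than one node needs both. This is a simulation, not DataStage code: `run`, `dedup_partition` and the byte-sum "hash" are assumptions made for illustration. Each partition is sorted and de-duplicated on its own, the way each node would only see its own rows.

```python
from itertools import groupby

def dedup_partition(rows):
    """Sort one partition by key, then keep the first row per key."""
    ordered = sorted(rows, key=lambda r: r[0])
    return [next(g) for _, g in groupby(ordered, key=lambda r: r[0])]

def run(rows, nodes, partitioner):
    """Split rows across nodes, de-duplicate each partition independently."""
    partitions = [[] for _ in range(nodes)]
    for i, row in enumerate(rows):
        partitions[partitioner(i, row) % nodes].append(row)
    out = []
    for part in partitions:
        out.extend(dedup_partition(part))
    return out

# Hypothetical sample data: the first field is the key.
rows = [("A", 1), ("A", 2), ("B", 3), ("B", 4)]

# Round robin: rows with the same key land on different nodes.
round_robin = run(rows, 2, lambda i, row: i)

# Key-based (hash-style): rows with the same key share a node.
key_based = run(rows, 2, lambda i, row: sum(row[0].encode()))

# Single node: everything is in one partition, so sorting alone is enough.
single_node = run(rows, 1, lambda i, row: i)

print(len(round_robin), len(key_based), len(single_node))
```

In this sketch the round robin run keeps all four rows even though every partition is sorted, while the key-based and single node runs each keep one row per key.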