Remove Duplicates

Posted: Fri Oct 21, 2016 7:08 pm
by India2000
Hi,

I have a scenario where I need to remove duplicates using the complete record, based on a Y or N indicator. A few columns in the record contain nulls. How should I partition the data?

This is what I have done: I sorted the input rows using a Sort stage with the indicator in descending order, and partitioned the data on all columns except the indicator. Then I used the Remove Duplicates stage with the same partitioning, sorted on the indicator and the other key columns (text columns).

The Remove Duplicates stage is not working correctly: sometimes it works and sometimes it doesn't. Can anyone tell me where exactly I'm going wrong?

Thanks

Posted: Fri Oct 21, 2016 10:22 pm
by chulett
You've lost me, but any time I see a "sometimes works, sometimes doesn't" complaint I have to ask: does it "always work" if you run it on a single node?

Posted: Mon Oct 24, 2016 10:29 am
by asorrell
Neither Craig nor I understand your problem description. Are you removing duplicate records based on the entire record (all columns) being duplicated? If so, what role does the indicator column play?

Posted: Mon Nov 14, 2016 3:57 pm
by abc123
I am assuming that what he is saying is, other than the indicator column, he wants to find duplicates of the rest of the columns. If there is a Y in the indicator column of a row, it is a duplicate of the previous row in all columns other than the indicator column.

Posted: Tue Nov 15, 2016 9:16 am
by UCDI
I don't think you can sort on the indicator. I think you need to sort on the data columns that you expect to be identical, and you should hash-partition on the same columns (one or more of those you expect to be identical).
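To illustrate the idea outside DataStage, here is a rough Python sketch that simulates hash partitioning on the data columns, sorting within each partition, and keeping the first row of each group. The sample rows, the column layout, and the choice to keep the "Y" row are all assumptions made for illustration. This is not how the Remove Duplicates stage is implemented.

```python
from collections import defaultdict
from zlib import crc32

# Hypothetical rows: (name, city, indicator). None stands in for a null column.
rows = [
    ("alice", None, "N"),
    ("bob", "Pune", "Y"),
    ("alice", None, "Y"),
    ("bob", "Pune", "N"),
    ("carol", "Delhi", "N"),
]

def data_key(row):
    """Columns expected to be identical across duplicates (all but the indicator)."""
    return row[:-1]

def sortable(key):
    """Make nulls comparable so they sort together instead of raising TypeError."""
    return tuple((0, "") if v is None else (1, v) for v in key)

def partition(rows, nodes):
    """Hash-partition on the data columns only, so duplicates share a node."""
    parts = defaultdict(list)
    for row in rows:
        parts[crc32(repr(data_key(row)).encode()) % nodes].append(row)
    return parts

def dedupe(rows, nodes=2):
    kept = []
    for part in partition(rows, nodes).values():
        # Data columns first; indicator ("Y" before "N") only as a tie-breaker.
        part.sort(key=lambda r: (sortable(data_key(r)), r[-1] != "Y"))
        seen = set()
        for row in part:
            if data_key(row) not in seen:  # keep the first row of each group
                seen.add(data_key(row))
                kept.append(row)
    return sorted(kept, key=lambda r: sortable(data_key(r)))

print(dedupe(rows, nodes=1))
print(dedupe(rows, nodes=3))
```

Because the indicator is left out of the partitioning key and only breaks ties in the sort, the result should be the same regardless of the node count, which is the behaviour the single-node question above is probing for.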

If there are many columns, you might want to compute a checksum over them and hash/sort on that.
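A minimal sketch of the checksum idea in Python, assuming rows are dictionaries and the indicator column is literally named "indicator" (both are hypothetical choices for illustration). Nulls get an explicit marker so they hash consistently and are not confused with empty strings.

```python
import hashlib

def row_checksum(row, skip=("indicator",)):
    """MD5 over all columns except those in `skip`; nulls get an explicit marker."""
    parts = []
    for name in sorted(row):  # fixed column order so equal rows hash equally
        if name in skip:
            continue
        value = row[name]
        # Length-prefix each value so a separator inside the data cannot
        # make two different rows produce the same joined string.
        parts.append("N" if value is None else f"V{len(str(value))}:{value}")
    return hashlib.md5("|".join(parts).encode("utf-8")).hexdigest()

a = {"name": "alice", "city": None, "indicator": "N"}
b = {"name": "alice", "city": None, "indicator": "Y"}
c = {"name": "alice", "city": "", "indicator": "N"}

print(row_checksum(a) == row_checksum(b))  # same data columns, different indicator
print(row_checksum(a) == row_checksum(c))  # null versus empty string
```

The resulting single column can then serve as both the hash-partitioning key and the leading sort key, with the indicator as a secondary sort key if it decides which copy to keep.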