remove duplicates using Transformer

neena · Post by **neena** » Thu Sep 10, 2009 11:21 am

Hi,

I am trying to remove duplicates using only a Transformer. On the Transformer's input tab I am using hash partitioning with Perform Sort checked. I partition and sort on one key, sort ascending on a second key, and sort descending on another column.

Key1 (sorting, partitioning)
Key2 (sorting Asc)
column1 (sorting Desc)

Then I checked the Stable and Unique check boxes on the tab, expecting the first record to be retained when there are duplicates, but I don't see any duplicate records getting dropped.
Could anyone please let me know how Stable and Unique work? The documentation says that if I check both Stable and Unique, the first duplicate record will be retained. Please let me know if I am missing anything, or if there are any other posts regarding this.
Any help would be really appreciated.

ArndW · Post by **ArndW** » Thu Sep 10, 2009 11:24 am

Are all 3 keys supposed to denote the duplicates or just the first or second keys?

neena · Post by **neena** » Thu Sep 10, 2009 11:29 am

It's the first and second keys, both of them.

ArndW · Post by **ArndW** » Thu Sep 10, 2009 11:33 am

But since the comparison is done on all 3 sorted columns you won't get duplicates...
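
A rough Python sketch of this point, not DataStage code: `sort_unique` is a hypothetical helper that models "sort, then keep the first row of each group of equal sort keys", and the sample rows are made up. The descending order on column1 is ignored here because it does not change which rows count as duplicates.

```python
from itertools import groupby
from operator import itemgetter

# Made-up sample rows: the first two share Key1 and Key2 but differ in column1.
rows = [
    {"key1": "A", "key2": 1, "column1": 30},
    {"key1": "A", "key2": 1, "column1": 10},
    {"key1": "B", "key2": 2, "column1": 20},
]

def sort_unique(rows, keys):
    """Stable sort on `keys`, then keep the first row of each run of equal keys."""
    key_fn = itemgetter(*keys)
    ordered = sorted(rows, key=key_fn)  # Python's sort is stable
    return [next(group) for _, group in groupby(ordered, key=key_fn)]

# Unique over all three sort columns: no two rows compare equal, nothing is dropped.
all_three = sort_unique(rows, ["key1", "key2", "column1"])

# Unique over Key1 and Key2 only: the two "A", 1 rows collapse to the first one.
two_keys = sort_unique(rows, ["key1", "key2"])
```

With uniqueness evaluated on all three columns every row is distinct, which matches the behaviour described in the first post.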

neena · Post by **neena** » Thu Sep 10, 2009 11:46 am

Thank you very much, you are right. I tested with only Key1 and Key2 and it worked just fine, removing the duplicates. I guess I have to use a Remove Duplicates stage and retain the first record.
After the Transformer stage I will use the same partitioning in the Remove Duplicates stage and retain the first record. Please let me know if that's not the correct approach.

betterthanever · Post by **betterthanever** » Thu Sep 10, 2009 12:10 pm

neena wrote: Thank you very much, you are right. I tested with only Key1 and Key2 and it worked just fine, removing the duplicates. I guess I have to use a Remove Duplicates stage and retain the first record.
After the Transformer stage I will use the same partitioning in the Remove Duplicates stage and retain the first record. Please let me know if that's not the correct approach.

By default, the Remove Duplicates stage inserts the sort operator again...

neena · Post by **neena** » Thu Sep 10, 2009 12:16 pm

The reason I was trying to avoid the Remove Duplicates stage is that this is existing code and I am trying to avoid adding stages.
What I did was this: in the Transformer I used hash partitioning with Perform Sort, but didn't check the Stable and Unique check boxes.

Key1 (sorting, partitioning)
Key2 (sorting Asc)
column1 (sorting Desc)

The next stage after this Transformer is a Copy stage, so I used "Same" partitioning in the Copy stage, checked the Perform Sort, Stable and Unique check boxes, and selected Key1 and Key2.

Key1 (sorting Asc)
Key2 (sorting Asc)

It worked fine, but please let me know if there are any downsides to doing this.
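
The two-step approach can be modelled in plain Python as a rough check of the logic. This is only a sketch of the semantics, not how DataStage implements them. It assumes a single partition (hash partitioning on Key1 keeps rows with equal keys together), and the tuples of (Key1, Key2, column1) are made-up sample data.

```python
from itertools import groupby

# Made-up sample rows as (Key1, Key2, column1).
rows = [
    ("A", 1, 10),
    ("B", 2, 5),
    ("A", 1, 30),
    ("A", 2, 7),
    ("B", 2, 9),
]

def pair(row):
    """The duplicate-defining key: Key1 and Key2 only."""
    return (row[0], row[1])

# Step 1 (Transformer input): sort Key1 asc, Key2 asc, column1 desc; no Unique.
step1 = sorted(rows, key=lambda r: (r[0], r[1], -r[2]))

# Step 2 (Copy input): stable sort on Key1 and Key2 only, so the column1 order
# from step 1 is preserved within each key pair, then keep the first row per pair.
step2 = sorted(step1, key=pair)
result = [next(group) for _, group in groupby(step2, key=pair)]
```

Because the second sort is stable, the surviving row for each Key1/Key2 pair is the one with the highest column1 from the first sort. Without a stable sort in the second step, that ordering within a key pair would not be guaranteed in this model.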