Capturing the Duplicates

ravij · Post by **ravij** » Mon Dec 26, 2005 4:23 am

Hi,

I am using the Remove Duplicates stage to eliminate duplicates. In the stage properties there is an option called "Duplicate To Retain", which can be set to either First or Last.

I want to capture the Duplicate data. How can I achieve this?

Any answer is appreciated.
thanks in advance

balajisr · Post by **balajisr** » Mon Dec 26, 2005 4:40 am

Hi ravi

I do not think it is possible to capture the duplicate data in the Remove Duplicates stage.

The Remove Duplicates stage takes a single dataset as input and outputs a single dataset with the duplicates removed.

When two records are duplicates of each other, by default the first record is retained and the other is discarded. The "Duplicate To Retain" option lets you retain the last record rather than the first.

kumar_s · Post by **kumar_s** » Mon Dec 26, 2005 5:24 am

Hi rajiv,

There is no option to capture duplicates from the Remove Duplicates stage, and no reject option either.
One approach is to take the difference between the original dataset and the final dataset from which the duplicates have been removed.
Another option is to use sort to capture the change in the key, and a transformer to collect the duplicates.
A further workaround is to join/merge the two datasets and extract the unmatched rows, which is what the difference does internally.

-Kumar
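The difference idea can be sketched outside DataStage in a few lines of Python. This is only an illustration under assumptions: rows are modelled as whole-row tuples, and `captured_duplicates` is a made-up helper name, not a DataStage function. The difference has to count repeats, because a plain set difference of fully identical rows would come back empty.

```python
from collections import Counter


def captured_duplicates(original, deduplicated):
    """Return the rows of `original` that did not survive de-duplication."""
    remaining = Counter(deduplicated)  # how many copies of each row were kept
    dropped = []
    for row in original:
        if remaining[row] > 0:
            remaining[row] -= 1  # this occurrence is the one that was kept
        else:
            dropped.append(row)  # an extra occurrence, i.e. a removed duplicate
    return dropped


# Hypothetical sample data: (key, value) rows before and after de-duplication.
original = [(1000, "a"), (1000, "a"), (2000, "b"),
            (3000, "c"), (3000, "c"), (3000, "c")]
deduplicated = [(1000, "a"), (2000, "b"), (3000, "c")]

duplicates = captured_duplicates(original, deduplicated)
print(duplicates)
```

With this sample input the helper returns one extra row for key 1000 and two for key 3000.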

ravij · Post by **ravij** » Tue Jan 10, 2006 4:37 am

Hi Kumar,

"use sort to capture the change in the key, and a transformer to collect the duplicates."

Can you explain the quoted suggestion in detail, please?

djm · Post by **djm** » Tue Jan 10, 2006 4:41 am

If the data is in a flat file and you are happy executing UNIX commands (e.g. ExecSh in a "before-job subroutine"), you may want to consider the UNIX "uniq" command. It has an option (-d) that outputs only the duplicated rows. Note that uniq compares adjacent lines, so the file needs to be sorted first. Try "man uniq" at the UNIX command line for more information.

David
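To make it concrete what reporting only the duplicated rows means, here is a rough Python imitation of that behaviour. It is a sketch, not the real command: `repeated_lines` is a made-up name, and the input is assumed to be sorted already, since only adjacent lines are compared.

```python
from itertools import groupby


def repeated_lines(lines):
    """Yield one copy of each line that appears more than once in a row."""
    for line, group in groupby(lines):
        if sum(1 for _ in group) > 1:  # the run holds at least two equal lines
            yield line


# Hypothetical, already sorted file contents (one key per line).
sorted_lines = ["1000", "1000", "2000", "3000", "3000", "3000"]

repeats = list(repeated_lines(sorted_lines))
print(repeats)
```

Each duplicated value is reported once, however many times it repeats, so this tells you which keys are duplicated rather than giving you every discarded row.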

kumar_s · Post by **kumar_s** » Tue Jan 10, 2006 5:23 am

Hi,
In the Sort stage there is an option called Create Key Change Column.
Set it to True. It adds a column that tells you where the key column changes,
i.e.,

Key    KeyChangeCol
1000      1       
1000      2
2000      1
3000      1
3000      2
3000      3

You can use a transformer to collect the rows which are greater than 1 in KeyChangeCol.

-Kumar

balajisr · Post by **balajisr** » Tue Jan 10, 2006 5:37 am

Kumar,

 Key    KeyChangeCol 
1000      1        
1000      2 
2000      1 
3000      1 
3000      2 
3000      3

I have a doubt.

Won't KeyChangeCol be 0 for duplicates rather than 2, 3, etc.?
In that case the transformer should be changed to accept the rows where KeyChangeCol is 0, right?
Correct me if I am wrong.

--Balaji S.R
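For anyone checking the values: as far as I know, the key change column is 1 on the first row of each key group and 0 on the rows that follow, which is what the question above suggests. A minimal Python sketch of that flagging, outside DataStage and assuming the rows are already sorted on the key (`add_key_change` is a made-up name):

```python
def add_key_change(rows, key):
    """Pair each row with 1 if its key differs from the previous row's key, else 0."""
    flagged = []
    previous = object()  # sentinel that never equals a real key
    for row in rows:
        current = key(row)
        flagged.append((row, 1 if current != previous else 0))
        previous = current
    return flagged


# Same keys as in the example table, already sorted.
keys = [1000, 1000, 2000, 3000, 3000, 3000]

flagged = add_key_change(keys, key=lambda k: k)
duplicates = [row for row, flag in flagged if flag == 0]
print(flagged)
print(duplicates)
```

Under that assumption the duplicates are the rows flagged 0, so the transformer constraint would select on 0 rather than on values above 1.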

kumar_s · Post by **kumar_s** » Tue Jan 10, 2006 6:37 am

Sorry, I didn't check before posting.

-Kumar

balajisr · Post by **balajisr** » Tue Jan 10, 2006 6:59 am

Kumar,

"Sorry, I didn't check before posting."

Not a problem, Kumar.

--Balaji S.R

kwwilliams · Post by **kwwilliams** » Thu Jan 12, 2006 4:01 pm

To then capture the duplicates, you could use a Filter stage with the condition key change column = 1. Set Output Rejects = True, hang a reject link off the filter, and send the rejected rows (the duplicates) wherever you want.
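The filter-plus-reject-link pattern amounts to splitting one stream in two. A small Python sketch of the idea, with hypothetical names throughout (`filter_with_reject`, and `keyChange` as the name of the key change column):

```python
def filter_with_reject(rows, predicate):
    """Split rows into those that satisfy the predicate and those that do not."""
    passed, rejected = [], []
    for row in rows:
        (passed if predicate(row) else rejected).append(row)
    return passed, rejected


# Hypothetical rows as they might leave a sort with a key change column.
rows = [
    {"key": 1000, "keyChange": 1},
    {"key": 1000, "keyChange": 0},
    {"key": 2000, "keyChange": 1},
    {"key": 3000, "keyChange": 1},
    {"key": 3000, "keyChange": 0},
]

unique_rows, duplicate_rows = filter_with_reject(rows, lambda r: r["keyChange"] == 1)
print(unique_rows)
print(duplicate_rows)
```

The main output carries the first row of each key group, and everything that fails the condition lands on the reject side.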

somu_june · Post by **somu_june** » Mon Feb 13, 2006 1:16 pm

Hi,

I am capturing duplicate records with the Sort stage key change column. To achieve this I am using two Sort stages: one sorts the records on key1, key2, key3 and key4 (price), and the other creates the key change column. My requirement is to find duplicates on key1, key2 and key3, with key4 (price) sorted descending so that I capture the maximum price. If I use only one Sort stage, records with different key4 (price) values each get a key change column value of 1, so I get both price records even though they have the same key1, key2 and key3. That is why I am using the second Sort stage. I am able to achieve my target, but a warning message says that the data is already sorted on the keys. How can I eliminate this warning message?

Thanks,
somaraju
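The logic described here (order by all four columns, but detect the key change on the first three only) can be sketched in Python as follows. This illustrates the intended result, not the DataStage job itself; the sample rows and the name `split_max_price` are made up.

```python
from itertools import groupby
from operator import itemgetter

# Duplicates are defined on these three columns only; price is excluded.
group_key = itemgetter("key1", "key2", "key3")


def split_max_price(rows):
    """Keep the highest-priced row per (key1, key2, key3); return the rest separately."""
    # Sort on the three keys ascending, then on price descending.
    ordered = sorted(rows, key=lambda r: (group_key(r), -r["price"]))
    kept, duplicates = [], []
    for _, group in groupby(ordered, key=group_key):
        first, *rest = group  # first row of each group has the maximum price
        kept.append(first)
        duplicates.extend(rest)
    return kept, duplicates


# Hypothetical sample rows.
rows = [
    {"key1": "A", "key2": "X", "key3": 1, "price": 10.0},
    {"key1": "A", "key2": "X", "key3": 1, "price": 25.0},
    {"key1": "B", "key2": "Y", "key3": 2, "price": 5.0},
]

kept, duplicates = split_max_price(rows)
print(kept)
print(duplicates)
```

The point is that the grouping key leaves out the price, so two rows with the same key1, key2 and key3 but different prices fall into one group.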

somu_june · Post by **somu_june** » Mon Feb 13, 2006 1:37 pm

Hi Williams,

Thanks for solving my problem.

Thanks,
Somaraju