Capturing the Duplicates

Posted: Mon Dec 26, 2005 4:23 am
by ravij
Hi,

I am using the Remove Duplicates stage to eliminate duplicates. In the properties of the stage there is an option called "Duplicate To Retain", which can be set to First or Last.

I want to capture the duplicate rows themselves. How can I achieve this?

Any answer is appreciated.
thanks in advance

Posted: Mon Dec 26, 2005 4:40 am
by balajisr
Hi ravi

I do not think it is possible to capture the duplicate data in the Remove Duplicates stage.

The Remove Duplicates stage takes a single dataset as input and outputs a single dataset with the duplicates removed.

When two records are duplicates of each other, by default the first record is retained and the other is discarded. The "Duplicate To Retain" option lets you retain the last record rather than the first.

Posted: Mon Dec 26, 2005 5:24 am
by kumar_s
Hi rajiv,

There is no option to capture duplicates from the Remove Duplicates stage, and no reject link either.
One approach is to take the difference between the original dataset and the final dataset from which the duplicates have been removed.
Another option is to use a Sort stage to flag the change in the key, and a Transformer to collect the duplicates.
A further workaround is to join or merge the two datasets and extract the unmatched rows, which is what a difference does internally.

-Kumar
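
The difference approach can be illustrated outside DataStage. This is only a sketch in Python, not a DataStage job: the function name `dropped_duplicates` and the sample rows are made up for illustration. It assumes whole rows are compared and uses a multiset difference, so a row that appears three times in the input and once in the deduplicated output is reported twice.

```python
from collections import Counter

def dropped_duplicates(original, deduplicated):
    """Return the rows of `original` that did not survive into `deduplicated`.

    Hypothetical helper: Counter subtraction is a multiset difference,
    so repeated identical rows are counted rather than collapsed.
    """
    remaining = Counter(original) - Counter(deduplicated)
    return sorted(remaining.elements())

# Made-up sample data: (key, value) rows.
rows = [(1000, "a"), (1000, "a"), (2000, "b"),
        (3000, "c"), (3000, "c"), (3000, "c")]
unique_rows = sorted(set(rows))            # stands in for the deduplicated output
duplicates = dropped_duplicates(rows, unique_rows)
print(duplicates)
```

A plain set difference would return nothing here, because every distinct row also exists in the deduplicated output; that is why the sketch counts occurrences.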

Posted: Tue Jan 10, 2006 4:37 am
by ravij
Hi Kumar,
"use sort to capture the change in the key, and a transformer to collect the duplicates."
Can you explain the quoted part in detail, please?

Posted: Tue Jan 10, 2006 4:41 am
by djm
If the data is in a flat file and you are happy executing UNIX commands (e.g. ExecSH in a before-job subroutine), you may want to consider the UNIX "uniq" command. It has an option (-d) that outputs only the duplicated rows. Note that uniq compares adjacent lines only, so the file needs to be sorted first. Try "man uniq" at the UNIX command line for more information.

David
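
To show what "output only the duplicated rows" means for adjacent lines, here is a small Python sketch of the same idea. It is not the uniq implementation; the function name `repeated_lines` and the sample lines are assumptions for illustration. Like uniq, it compares whole lines and only notices repeats that sit next to each other.

```python
from itertools import groupby

def repeated_lines(lines):
    """Yield one copy of each line that occurs more than once in a row.

    Hypothetical helper: groupby only groups adjacent equal items, so the
    input must already be sorted for all duplicates to be found.
    """
    for line, run in groupby(lines):
        if sum(1 for _ in run) > 1:
            yield line

# Made-up, already sorted sample lines.
lines = ["1000,a", "1000,a", "2000,b", "3000,c", "3000,c", "3000,c"]
result = list(repeated_lines(lines))
print(result)
```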

Posted: Tue Jan 10, 2006 5:23 am
by kumar_s
Hi,
In the Sort stage there is an option called Create Key Change Column.
Set it to True. It adds a column that tells you where the key column changes,
i.e.,

Key    KeyChangeCol
1000      1       
1000      2
2000      1
3000      1
3000      2
3000      3
You can use a Transformer to collect the rows whose KeyChangeCol value is greater than 1.

-Kumar

Posted: Tue Jan 10, 2006 5:37 am
by balajisr
Kumar,

 Key    KeyChangeCol 
1000      1        
1000      2 
2000      1 
3000      1 
3000      2 
3000      3
I have a doubt.

Won't the KeyChangeCol be 0 for duplicates, rather than 2, 3, etc.?
In that case the Transformer should be changed to accept rows greater than 0, right?
Correct me if I am wrong.

--Balaji S.R

Posted: Tue Jan 10, 2006 6:37 am
by kumar_s
Sorry, I didn't check before posting.

-Kumar

Posted: Tue Jan 10, 2006 6:59 am
by balajisr
Kumar,

"Sorry, I didn't check before posting."
Not a problem, Kumar.

--Balaji S.R

Posted: Thu Jan 12, 2006 4:01 pm
by kwwilliams
To then capture the duplicates, you could use a Filter stage that filters on the key change column = 1. Set Output Rejects = True, hang a reject link off the Filter, and send the rejected rows wherever you want.
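
The logic of the key change column plus a filter with a reject link can be sketched in Python. This is an illustration only, not DataStage code: the function names `add_key_change` and `split_on_key_change` and the sample rows are hypothetical. It assumes the flag is 1 for the first row of each key group and 0 for the rows that follow, as discussed earlier in the thread.

```python
from itertools import groupby
from operator import itemgetter

def add_key_change(rows, key):
    """Sort rows by `key` and pair each row with a key-change flag.

    Hypothetical helper: flag is 1 for the first row of a key group,
    0 for every later row in that group.
    """
    ordered = sorted(rows, key=key)  # stable: input order is kept within a group
    flagged = []
    for _, group in groupby(ordered, key=key):
        for position, row in enumerate(group):
            flagged.append((row, 1 if position == 0 else 0))
    return flagged

def split_on_key_change(flagged):
    """Mimic a filter on flag == 1 with a reject output for the rest."""
    kept = [row for row, flag in flagged if flag == 1]
    rejected = [row for row, flag in flagged if flag != 1]
    return kept, rejected

# Made-up sample data: (key, value) rows, unsorted.
rows = [(1000, "a"), (3000, "x"), (1000, "b"),
        (2000, "c"), (3000, "y"), (3000, "z")]
flagged = add_key_change(rows, key=itemgetter(0))
kept, duplicates = split_on_key_change(flagged)
print(kept)
print(duplicates)
```

Here `kept` plays the role of the main output link and `duplicates` the role of the reject link.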

Posted: Mon Feb 13, 2006 1:16 pm
by somu_june
Hi,

I am capturing duplicate records with the Sort stage key change column. To achieve this I am using two Sort stages: one to sort the records on key1, key2, key3 and key4 (price), and another for the key change column. My actual requirement is to find duplicates on key1, key2 and key3, and to sort key4 (price) descending so that I get the maximum price and capture that. If I use only one Sort stage, rows with different key4 (price) values each get a key change column value of one, so I get both price records even though they have the same key1, key2 and key3. That is why I am using the second Sort stage. I am able to achieve my goal, but a warning message says that the Sort stage is already sorted on the keys. How can I eliminate this warning message?


Thanks,
somaraju
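
The requirement described above (group on key1, key2 and key3, keep the row with the highest price, capture the rest) can be sketched in Python. This only illustrates the logic and does not address the DataStage warning: the function name `split_max_price` and the sample rows are hypothetical. The point is that the sort uses all four columns, while the key change is evaluated on the first three only.

```python
from itertools import groupby

def split_max_price(rows):
    """Split (key1, key2, key3, price) rows into kept rows and duplicates.

    Hypothetical helper: sort on the three grouping keys ascending and on
    price descending, then treat the first row of each group as the one
    to keep and every later row in the group as a duplicate.
    """
    def group_key(row):
        return row[:3]

    ordered = sorted(rows, key=lambda r: (group_key(r), -r[3]))
    kept, duplicates = [], []
    for _, group in groupby(ordered, key=group_key):
        first, *rest = group      # first row has the highest price in the group
        kept.append(first)
        duplicates.extend(rest)
    return kept, duplicates

# Made-up sample data.
rows = [
    ("A", "x", 1, 10.0),
    ("A", "x", 1, 25.0),
    ("B", "y", 2, 7.5),
    ("A", "x", 1, 15.0),
]
kept, duplicates = split_max_price(rows)
print(kept)
print(duplicates)
```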

Posted: Mon Feb 13, 2006 1:37 pm
by somu_june
Hi Williams,

Thanks for solving my problem.

Thanks,
Somaraju