Capturing the Duplicates

ravij
Premium Member
Posts: 170
Joined: Mon Oct 10, 2005 7:04 am
Location: India

Post by ravij »

Hi,

I am using the Remove Duplicates stage to eliminate duplicates. In the stage properties there is an option called "Duplicate To Retain", which can be set to First or Last.

I want to capture the Duplicate data. How can I achieve this?

Any answer is appreciated.
thanks in advance
Ravi
balajisr
Charter Member
Posts: 785
Joined: Thu Jul 28, 2005 8:58 am

Post by balajisr »

Hi ravi

I do not think it is possible to capture the duplicate data in the Remove Duplicates stage.

The Remove Duplicates stage takes a single dataset as input and outputs a single dataset with the duplicates removed.

When two records are duplicates of each other, by default the first record is retained and the others are discarded. The "Duplicate To Retain" option lets you retain the last record rather than the first.
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Hi Ravi,

There is no option to capture duplicates from the Remove Duplicates stage, and there is no reject option either.
A better approach is to take the difference between the original dataset and the final dataset from which the duplicates have been removed.
Or you have another option: use sort to capture the change in the key, and a transformer to collect the duplicates.
Another workaround is to join/merge the two datasets and extract the unmatched rows, which is what a difference does internally.
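
To make the difference idea concrete outside DataStage, here is a minimal Python sketch. It is only an illustration: `remove_duplicates` and `difference` are hypothetical helper names, the sample rows are made up, and it assumes the first record per key is the one retained.

```python
from collections import Counter

def remove_duplicates(rows, key):
    """Keep the first row seen for each key value (retain First)."""
    seen = set()
    kept = []
    for row in rows:
        k = key(row)
        if k not in seen:
            seen.add(k)
            kept.append(row)
    return kept

def difference(original, deduplicated):
    """Return the rows of `original` that did not survive deduplication."""
    remaining = Counter(deduplicated)
    dropped = []
    for row in original:
        if remaining[row] > 0:
            remaining[row] -= 1   # this copy is the one that was retained
        else:
            dropped.append(row)   # this copy was discarded as a duplicate
    return dropped

# Made-up sample data: (key, payload)
rows = [(1000, "a"), (1000, "b"), (2000, "c"),
        (3000, "d"), (3000, "e"), (3000, "f")]
kept = remove_duplicates(rows, key=lambda r: r[0])
duplicates = difference(rows, kept)
print(duplicates)
```

The difference is taken as a multiset so that fully identical rows are still reported once for every extra copy.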

-Kumar
ravij
Premium Member
Posts: 170
Joined: Mon Oct 10, 2005 7:04 am
Location: India

Post by ravij »

Hi Kumar,
"use sort to capture the change in the key, and a transformer to collect the duplicates."
Can you explain the quoted suggestion in detail, please?
Ravi
djm
Participant
Posts: 68
Joined: Wed Mar 02, 2005 3:42 am
Location: N.Z.

Post by djm »

If the data is in a flat file and you are happy executing UNIX commands (e.g. ExecSh in a "before-job subroutine"), you may want to consider the UNIX "uniq" command. It has an option that outputs only the duplicated rows. Try "man uniq" at the UNIX command line for more information.
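
For readers who want to see the logic rather than the command, here is a rough Python sketch of what the repeated-lines option of uniq (`-d`) does: it compares adjacent lines and prints one copy of each line that repeats. The function name `repeated_lines` and the sample values are made up for illustration. As with uniq itself, the input has to be sorted first so that duplicates sit next to each other.

```python
from itertools import groupby

def repeated_lines(lines):
    """Yield one copy of each line that occurs more than once in a row."""
    for line, group in groupby(lines):
        if sum(1 for _ in group) > 1:
            yield line

# Sort first: only adjacent equal lines are treated as duplicates.
sample = sorted(["1000", "3000", "2000", "1000", "3000", "3000"])
repeated = list(repeated_lines(sample))
print(repeated)
```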

David
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Hi,
In the Sort stage there is an option called Create Key Change Column.
Set it to True. It adds a column that tells you where the value of the key column changes.
For example:

Key    KeyChangeCol
1000      1       
1000      2
2000      1
3000      1
3000      2
3000      3
You can use a Transformer to collect the rows whose KeyChangeCol value is greater than 1.

-Kumar
balajisr
Charter Member
Posts: 785
Joined: Thu Jul 28, 2005 8:58 am

Post by balajisr »

Kumar,

 Key    KeyChangeCol 
1000      1        
1000      2 
2000      1 
3000      1 
3000      2 
3000      3
I have a doubt.

Won't KeyChangeCol be 0 for the duplicates, rather than 2, 3, etc.?
In that case the Transformer should collect the rows where KeyChangeCol is 0 rather than greater than 1, right?
Correct me if I am wrong.
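
A small Python sketch of that flag logic, assuming the key change column behaves as described here (1 on the first row of each key group, 0 on the rows that follow). `add_key_change` is a hypothetical helper, not a DataStage function, and the input must already be sorted on the key.

```python
def add_key_change(rows, key):
    """Pair each row of an already sorted list with a key-change flag.

    The flag is 1 when the key differs from the previous row's key
    (first row of a group) and 0 otherwise (a duplicate of that key).
    """
    flagged = []
    previous = object()  # sentinel that never equals a real key value
    for row in rows:
        current = key(row)
        flagged.append((row, 1 if current != previous else 0))
        previous = current
    return flagged

keys = [1000, 1000, 2000, 3000, 3000, 3000]  # sorted sample keys
flagged = add_key_change(keys, key=lambda r: r)
flags = [flag for _, flag in flagged]
duplicates = [row for row, flag in flagged if flag == 0]
print(flags)
print(duplicates)
```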

--Balaji S.R
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Sorry, I didn't check before posting.

-Kumar
balajisr
Charter Member
Posts: 785
Joined: Thu Jul 28, 2005 8:58 am

Post by balajisr »

Kumar,

"Sorry, I didn't check before posting."
Not a problem, Kumar.

--Balaji S.R
kwwilliams
Participant
Posts: 437
Joined: Fri Oct 21, 2005 10:00 pm

Post by kwwilliams »

To then capture the duplicates, you could use a Filter stage that filters on key change column = 1. Set Output Rejects = True, hang a reject link off the Filter stage, and send the rejected rows (the duplicates) wherever you want.
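
The same split can be pictured in plain Python. This is only a sketch of the idea, not how the Filter stage is implemented: `filter_with_rejects` is a hypothetical helper, the rows are made up, and the `key_change` field is assumed to be 1 for the first row of each key and 0 for the duplicates.

```python
def filter_with_rejects(rows, where):
    """Split rows into those matching `where` and the rejected remainder."""
    accepted, rejected = [], []
    for row in rows:
        (accepted if where(row) else rejected).append(row)
    return accepted, rejected

# Made-up rows that already carry a key-change flag.
flagged = [
    {"key": 1000, "key_change": 1},
    {"key": 1000, "key_change": 0},
    {"key": 2000, "key_change": 1},
    {"key": 3000, "key_change": 1},
    {"key": 3000, "key_change": 0},
    {"key": 3000, "key_change": 0},
]
uniques, duplicates = filter_with_rejects(flagged, lambda r: r["key_change"] == 1)
print([r["key"] for r in uniques])
print([r["key"] for r in duplicates])
```

The main output carries one row per key, and everything that fails the condition lands on the reject side.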
somu_june
Premium Member
Posts: 439
Joined: Wed Sep 14, 2005 9:28 am
Location: 36p,reading road

Post by somu_june »

Hi ,

I am capturing duplicate records with the Sort stage's key change column. To achieve this I am using two Sort stages: the first sorts the records on key1, key2, key3 and key4 (price), and the second creates the key change column. My actual requirement is to find duplicates on key1, key2 and key3, with key4 (price) sorted descending so that I capture the record with the maximum price. If I use only one Sort stage, rows with different key4 (price) values each get a key change value of 1, so I get both price records even though they have the same key1, key2 and key3. That is why I am using the second Sort stage. I am able to achieve my target, but a warning message says that the data is already sorted on the keys. How can I eliminate this warning message?
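
The requirement above (group on key1, key2 and key3, keep the highest price, capture the rest) can be sketched in Python as follows. This only illustrates the intended result and says nothing about how to configure the Sort stages or remove the warning. `split_max_price` and the sample rows are made up for the example.

```python
from itertools import groupby
from operator import itemgetter

def split_max_price(rows):
    """Return (rows with the maximum price per key1..key3, remaining duplicates)."""
    group_key = itemgetter("key1", "key2", "key3")
    # Sort on the three grouping keys, then on price descending,
    # so the first row of every group carries the maximum price.
    ordered = sorted(rows, key=lambda r: (group_key(r), -r["price"]))
    best, duplicates = [], []
    for _, group in groupby(ordered, key=group_key):
        first, *rest = group
        best.append(first)
        duplicates.extend(rest)
    return best, duplicates

rows = [
    {"key1": "A", "key2": 1, "key3": "x", "price": 10.0},
    {"key1": "A", "key2": 1, "key3": "x", "price": 12.5},
    {"key1": "B", "key2": 2, "key3": "y", "price": 7.0},
]
best, duplicates = split_max_price(rows)
print([r["price"] for r in best])
print([r["price"] for r in duplicates])
```

The key point is that price takes part in the ordering but not in the grouping, which is the effect the two Sort stages are being used for.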


Thanks,
somaraju
somu_june
Premium Member
Posts: 439
Joined: Wed Sep 14, 2005 9:28 am
Location: 36p,reading road

Post by somu_june »

Hi Williams,

Thanks for solving my problem.

Thanks,
Somaraju