Remove duplicates

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

uppalapati2003
Participant
Posts: 70
Joined: Thu Nov 09, 2006 2:14 am

Remove duplicates

Post by uppalapati2003 »

Hello All,

In my source I have duplicates. If there are any duplicates in the source, I want to reject both of those records.
Kindly help me with that.

Thanks
Srini
keshav0307
Premium Member
Posts: 783
Joined: Mon Jan 16, 2006 10:17 pm
Location: Sydney, Australia

Post by keshav0307 »

This has been discussed many, many times in this forum. Try a search.
keshav0307
Premium Member
Posts: 783
Joined: Mon Jan 16, 2006 10:17 pm
Location: Sydney, Australia

Post by keshav0307 »

Your question is not very clear to me.

"if any duplicates in the source i want reject those two records"

Do you want to reject both of the records,

or

only remove the duplicate?
uppalapati2003
Participant
Posts: 70
Joined: Thu Nov 09, 2006 2:14 am

Post by uppalapati2003 »

I want to remove both records.
Srini
sreddy
Participant
Posts: 144
Joined: Sun Oct 21, 2007 9:13 am

Re: Remove duplicates

Post by sreddy »

Uppalapati
  • Use a Sort stage instead of the Remove Duplicates stage. The Sort stage has more grouping options and sort indicator options.

    Sort the records on the key field and, in the Sort stage, set the key change column option to True. Zero will then be assigned to the duplicate records. Downstream, put a condition that sends any record whose key change value is zero to the reject link, as in the sketch below.
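
Outside DataStage, the key change idea works roughly like this. This is only an illustrative Python sketch (not DataStage code), using the sample data posted later in this thread; note that it flags only the second and later occurrences of a key, not the first one.

# Python sketch of the Sort stage key change logic (illustration only).
# Assumes the rows are already sorted on the key column.
rows = [(10, "AAA"), (20, "BBB"), (20, "BBB"), (30, "CCC")]

prev_key = None
for key, value in rows:
    key_change = 1 if key != prev_key else 0   # 1 = first row of the key group
    if key_change == 0:
        print("reject:", key, value)           # duplicate occurrence
    else:
        print("output:", key, value)
    prev_key = key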
-------------------------------------------------------------------------------------

The Remove Duplicates stage doesn't have a reject option, nor does the Sort stage with remove duplicates checked.

To capture rejected duplicates, use a Transformer. Partition and sort on your primary key. In the Transformer, keep the primary key stored in a stage variable and compare the incoming primary key to it. If it is the same, output the incoming row as a duplicate; if it is different, output the row as unique and save the new primary key.

You need at least two stage variables, one to do the comparison and the other to store the key value:

Variable: Derivation
IsDuplicate: input.keyfield = SavedKey
SavedKey: input.keyfield
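
For illustration only, the same stage-variable evaluation order can be sketched in Python (not DataStage code; "keyfield" stands in for your real key column). IsDuplicate is evaluated before SavedKey is overwritten, so the comparison always uses the previous row's key.

# Python sketch of the Transformer stage-variable logic (illustration only).
rows = [(10, "AAA"), (20, "BBB"), (20, "BBB"), (30, "CCC")]

saved_key = None
for keyfield, value in rows:
    is_duplicate = (keyfield == saved_key)   # IsDuplicate: input.keyfield = SavedKey
    saved_key = keyfield                     # SavedKey: input.keyfield
    if is_duplicate:
        print("duplicate link:", keyfield, value)
    else:
        print("unique link:", keyfield, value)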


uppalapati2003 wrote:Hello All,

in my source i am having duplicates,if any duplicates in the source i want reject those two records
kindly help me on that

Thanks
SReddy
dwpractices@gmail.com
Analyzing Performance
uppalapati2003
Participant
Posts: 70
Joined: Thu Nov 09, 2006 2:14 am

Post by uppalapati2003 »

First of all, thanks for your response.
I am not sure whether, in this scenario, both records will be rejected or only a single record.

For example, I have data like this:

10,AAA
20,BBB
20,BBB
30,CCC

My output should be:

10,AAA
30,CCC

The two records with ID 20 have to go to the rejected file.

Thanks
Srini
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

You need a "fork join" design. Use a Copy stage to send the first column through an aggregator to get counted, then join back to the detail rows with a Join stage. You will have the count along with each detail row. Then filter based on the value of the count.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.