Remove Duplicates - Retain both Duplicates

jzajde1 · Post by **jzajde1** » Tue Feb 10, 2015 2:58 pm

Hello,

Is there a way I can retain both duplicates from a stage in DataStage?
The primary key is column 1.

Ex.

Column1|Column2|Column3

111|EA|203
111|EA|201
112|EA|200
113|EA|200

I want to remove both records where column 1 = 111.

Please advise.

Thanks

qt_ky · Post by **qt_ky** » Tue Feb 10, 2015 3:46 pm

Could you clarify whether you want to retain (keep) the duplicates, remove them, or something in between, such as routing one or both to their own separate stage?

ray.wurlod · Post by **ray.wurlod** » Tue Feb 10, 2015 4:24 pm

Create a fork-join to identify the count for each key. Downstream of the Join, create a filter that passes only those key values for which the count is 1.
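
For anyone who wants to see the fork-join logic outside of DataStage, here is a minimal Python sketch of the same idea. It is only an illustration, not a DataStage job: the function name `split_by_key_count` and the in-memory row list are assumptions made for the example, and the rows are the sample data from the first post. In a parallel job the equivalent is usually built from stages such as Copy, Aggregator, Join and Filter, but treat that mapping as an assumption about your design.

```python
from collections import Counter

# Sample rows from the first post: (Column1, Column2, Column3)
rows = [
    ("111", "EA", "203"),
    ("111", "EA", "201"),
    ("112", "EA", "200"),
    ("113", "EA", "200"),
]


def split_by_key_count(rows, key_index=0):
    """Hypothetical helper: split rows into (unique, duplicated) by key count."""
    # Fork: one branch counts the rows per key value.
    counts = Counter(row[key_index] for row in rows)

    unique, duplicated = [], []
    for row in rows:
        # Join: attach the per-key count back to each original row.
        # Filter: count == 1 goes one way, count > 1 goes the other.
        if counts[row[key_index]] == 1:
            unique.append(row)
        else:
            duplicated.append(row)
    return unique, duplicated


unique_rows, duplicate_rows = split_by_key_count(rows)
```

With the sample data, both 111 rows end up in `duplicate_rows` and the 112 and 113 rows end up in `unique_rows`, so a second filter output on count > 1 is what would route both duplicates to their own stage.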

chulett · Post by **chulett** » Tue Feb 10, 2015 4:48 pm

Yup, fork that join.

jzajde1 · Post by **jzajde1** » Wed Feb 11, 2015 6:50 am

qt_ky:

I want to retain (keep) the records and route them to their own stage.

chulett & ray.wurlod: thank you for your post. I will test the fork join and reply.

ShaneMuir · Post by **ShaneMuir** » Wed Feb 11, 2015 8:19 am

Just as a question, what is the data source in this process? If it's a database, there might be ways of avoiding a fork-join by incorporating the identification of potential duplicates into your select query.
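
As a rough illustration of what such a select query could look like, the sketch below uses Python's built-in `sqlite3` module with an in-memory table. The table name `src`, the lowercase column names and the `key_count` alias are all assumptions made for the example; the actual SQL would depend on your database.

```python
import sqlite3

# Hypothetical in-memory table holding the sample rows from the first post.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src (column1 TEXT, column2 TEXT, column3 TEXT)")
conn.executemany(
    "INSERT INTO src VALUES (?, ?, ?)",
    [
        ("111", "EA", "203"),
        ("111", "EA", "201"),
        ("112", "EA", "200"),
        ("113", "EA", "200"),
    ],
)

# Attach the per-key row count to every row inside the select itself,
# so no separate fork-join is needed downstream.
query = """
SELECT s.column1, s.column2, s.column3, k.key_count
FROM src AS s
JOIN (
    SELECT column1, COUNT(*) AS key_count
    FROM src
    GROUP BY column1
) AS k ON k.column1 = s.column1
ORDER BY s.column1, s.column3
"""
flagged = conn.execute(query).fetchall()
conn.close()

# Rows whose key occurs more than once vs. exactly once.
duplicates = [row[:3] for row in flagged if row[3] > 1]
uniques = [row[:3] for row in flagged if row[3] == 1]
```

The job would then only need a filter on `key_count` to send the duplicated keys to one link and the unique keys to another.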

jzajde1 · Post by **jzajde1** » Wed Feb 11, 2015 9:56 am

ShaneMuir:

The source is a sequential file.