Removing all instances of duplicates

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Post Reply
asadi
Participant
Posts: 10
Joined: Sun Nov 02, 2003 9:18 pm

Removing all instances of duplicates

Post by asadi »

I am trying to remove all instances of duplicates from my data, yet retain the duplicates as rejects. Since the Remove Duplicates stage does not support reject links and has to retain either the first or the last record of each duplicate group, I can't use it.

Has anyone implemented something like this?
Teej
Participant
Posts: 677
Joined: Fri Aug 08, 2003 9:26 am
Location: USA

Re: Removing all instances of duplicates

Post by Teej »

asadi wrote:Has anyone implemented something like this?
Nah. But if the data size is small enough, a Copy stage before the Remove Duplicates, and a lookup afterward, can give you what you want -- the records that were dropped.

Yeah, Remove Duplicates should have a drop/reject link, but until that is implemented, this is the best I could come up with at short notice.

-T.J.
Developer of DataStage Parallel Engine (Orchestrate).
vzoubov
Participant
Posts: 28
Joined: Tue Feb 05, 2002 12:30 pm
Location: Boston, MA

Re: Removing all instances of duplicates

Post by vzoubov »

Teej wrote:
asadi wrote:Has anyone implemented something like this?
Nah. But if the data size is small enough, a Copy stage before the Remove Duplicates, and a lookup afterward, can give you what you want -- the records that were dropped.

Yeah, Remove Duplicates should have a drop/reject link, but until that is implemented, this is the best I could come up with at short notice.

-T.J.
I'd use Change Capture instead of a lookup for this type of processing.
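A rough sketch of that design, assuming the original stream feeds the before link and the deduplicated stream feeds the after link (stage names are illustrative):

Code: Select all

copy -----> Rem Dup -----> copy -----> Unique Records.
 \                           |
  \                          V  (after link)
   ---------------------> Change Capture ----> Duplicates.
        (before link)

Rows present in the before stream but missing from the after stream come out flagged as deletes (change_code = 2 by default); if you set the stage options to drop copy records, the output is exactly the set of rows Rem Dup removed.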

Vitali.
asadi
Participant
Posts: 10
Joined: Sun Nov 02, 2003 9:18 pm

Post by asadi »

Thanks T.J., but that would be only part of the solution; I would still have one of those duplicates hanging around in my data. The objective is to remove all instances. I have the Aggregator and Transformer stages doing this for me, but I can't capture all the rejected rows. I might have to use Change Capture with a Copy. I was looking for a simple solution, but that doesn't seem likely.
bigpoppa
Participant
Posts: 190
Joined: Fri Feb 28, 2003 11:39 am

Removing all instances of duplicates

Post by bigpoppa »

You just need a transform and another filter/transform.

First hash and sort by the dup keys. Then use one transform to assign a sequential # to each record in a dup group. Then, in the second transform/filter, use the sequential # as a constraint/filter (see the sketch below):

if the # = 1, then that's the record to retain;
if the # <> 1, then send it to a reject link.
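
A minimal sketch of the first transform's stage variables, assuming the input is hash-partitioned and sorted on the dup key; the column and variable names are illustrative, and svPrevKey must be declared below svSeq so it still holds the previous row's key when svSeq is evaluated:

Code: Select all

(stage variable)   (derivation)
svSeq              If In.DupKey = svPrevKey Then svSeq + 1 Else 1
svPrevKey          In.DupKey

Initialize svPrevKey to a value that cannot occur in the data, write svSeq out as a column (say SeqNum), and constrain the downstream links on SeqNum = 1 and SeqNum <> 1.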

- BP
Teej
Participant
Posts: 677
Joined: Fri Aug 08, 2003 9:26 am
Location: USA

Post by Teej »

asadi wrote:Thanks T.J., but that would be only part of the solution; I would still have one of those duplicates hanging around in my data.
No you won't.

Code: Select all

copy -----> Rem Dup -----> copy -----> Unique Records.
 \                           |
  \                          V
   ---------------------> lookup ----> dead end
                             \
                              \
                               ------> Duplicates.
The downward link from the second Copy is a reference link, and the downward link from the lookup is the reject/drop link.

Using a Transformer stage as BigPoppa suggested would be faster than this solution, since you will have to hash and sort before the Rem Dup stage anyway. However, it would be tricky if you have multiple keys of different types. Use stage variables to do the constraints.

-T.J.
Developer of DataStage Parallel Engine (Orchestrate).
Teej
Participant
Posts: 677
Joined: Fri Aug 08, 2003 9:26 am
Location: USA

Post by Teej »

asadi wrote:I would still have one of those duplicates hanging around in my data.
As an aside, if you are hashing and sorting, make sure the non-key fields you are sorting on (to get the first record) are set to "Sorting" only, not "Sorting, Partitioning"; otherwise you run the risk of duplicates surviving. By non-key fields, I mean the fields you are not using in that stage to define duplicates.
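For example, on the input Partitioning tab of the sort (column names are illustrative):

Code: Select all

CUST_ID    Usage: Sorting, Partitioning   <- defines the duplicate group
LOAD_DATE  Usage: Sorting                 <- only orders rows within a group

If LOAD_DATE were also marked for partitioning, rows sharing the same CUST_ID could land in different partitions, and Rem Dup would treat them as distinct.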

-T.J.
Developer of DataStage Parallel Engine (Orchestrate).
Post Reply