Removing all instances of duplicates

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Post Reply
asadi
Participant
Posts: 10
Joined: Sun Nov 02, 2003 9:18 pm

Removing all instances of duplicates

Post by asadi »

I am trying to remove all instances of duplicates from my data, yet retain the duplicates as rejects. Since the Remove Duplicates stage does not support reject links and has to retain either the first or the last record of each duplicate group, I can't use it.

Has anyone implemented something like this?
Teej
Participant
Posts: 677
Joined: Fri Aug 08, 2003 9:26 am
Location: USA

Re: Removing all instances of duplicates

Post by Teej »

asadi wrote:Has anyone implemented something like this?
Nah. But if the data size is small enough, a Copy stage before the Remove Duplicates, and a lookup afterward, can give you what you want -- the records that were dropped.

Yeah, Remove Duplicates should have a drop/reject link, but until that is implemented, this is the best I could come up with at short notice.

-T.J.
Developer of DataStage Parallel Engine (Orchestrate).
vzoubov
Participant
Posts: 28
Joined: Tue Feb 05, 2002 12:30 pm
Location: Boston, MA

Re: Removing all instances of duplicates

Post by vzoubov »

Teej wrote:
asadi wrote:Has anyone implemented something like this?
Nah. But if the data size is small enough, a Copy stage before the Remove Duplicates, and a lookup afterward, can give you what you want -- the records that were dropped.

Yeah, Remove Duplicates should have a drop/reject link, but until that is implemented, this is the best I could come up with at short notice.

-T.J.
I'd use Change Capture instead of a lookup for this type of processing.
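A rough sketch of that design, assuming the original stream feeds the before link and the deduplicated stream feeds the after link (stage names are illustrative):

Code: Select all

copy -----> Rem Dup -----> copy -----> Unique Records.
 \                           |
  \                          V  (after link)
   ---------------------> Change Capture ----> Duplicates.
        (before link)

Rows present in the before stream but missing from the after stream come out flagged as deletes (change_code = 2 by default); if you set the stage options to drop copy records, the output is exactly the set of rows Rem Dup removed.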

Vitali.
asadi
Participant
Posts: 10
Joined: Sun Nov 02, 2003 9:18 pm

Post by asadi »

Thanks T.J., but that would be only part of the solution; I would still have one of those duplicates hanging around in my data. The objective is to remove all instances. I have the Aggregator and Transformer stages doing this for me, but I can't capture all the rejected rows. I might have to use Change Capture with a Copy. I was looking for a simple solution, but that doesn't seem likely.
bigpoppa
Participant
Posts: 190
Joined: Fri Feb 28, 2003 11:39 am

Removing all instances of duplicates

Post by bigpoppa »

You just need a transform and another filter/transform.

First hash and sort by the dup keys. Then use one transform to assign a sequential # to each record in a dup group. Then, in the second transform/filter, use the sequential # as a constraint/filter (see the sketch below):

if the # = 1, then that's the record to retain;
if the # <> 1, then send it to a reject link.
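
A minimal sketch of the first transform's stage variables, assuming the input is hash-partitioned and sorted on the dup key; the column and variable names are illustrative, and svPrevKey must be declared below svSeq so it still holds the previous row's key when svSeq is evaluated:

Code: Select all

(stage variable)   (derivation)
svSeq              If In.DupKey = svPrevKey Then svSeq + 1 Else 1
svPrevKey          In.DupKey

Initialize svPrevKey to a value that cannot occur in the data, write svSeq out as a column (say SeqNum), and constrain the downstream links on SeqNum = 1 and SeqNum <> 1.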

- BP
Teej
Participant
Posts: 677
Joined: Fri Aug 08, 2003 9:26 am
Location: USA

Post by Teej »

asadi wrote:Thanks T.J., but that would be only part of the solution; I would still have one of those duplicates hanging around in my data.
No you won't.

Code: Select all

copy -----> Rem Dup -----> copy -----> Unique Records.
 \                           |
  \                          V
   ---------------------> lookup ----> dead end
                             \
                              \
                               ------> Duplicates.
The downward link from the second Copy is a reference link, and the downward link from the lookup is the reject/drop link.

Using a Transformer stage as BigPoppa suggested would be faster than this solution, since you will have to hash and sort before the Rem Dup stage anyway. However, it would be tricky if you have multiple keys of different types. Use stage variables to do the constraints.

-T.J.
Developer of DataStage Parallel Engine (Orchestrate).
Teej
Participant
Posts: 677
Joined: Fri Aug 08, 2003 9:26 am
Location: USA

Post by Teej »

asadi wrote:I would still have one of those duplicates hanging around in my data.
As an aside, if you are hashing and sorting, make sure the non-key fields you are sorting on (to get the first record) are set to "Sorting" only, not "Sorting, Partitioning"; otherwise you run the risk of duplicates surviving. By non-key fields, I mean the fields you are not using in that stage to define duplicates.
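For example, on the input Partitioning tab of the sort (column names are illustrative):

Code: Select all

CUST_ID    Usage: Sorting, Partitioning   <- defines the duplicate group
LOAD_DATE  Usage: Sorting                 <- only orders rows within a group

If LOAD_DATE were also marked for partitioning, rows sharing the same CUST_ID could land in different partitions, and Rem Dup would treat them as distinct.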

-T.J.
Developer of DataStage Parallel Engine (Orchestrate).
Post Reply