I am trying to remove all instances of duplicates from my data, yet retain the duplicates as rejects. Since the Remove Duplicates stage does not support reject links and has to retain either the first or last record of each duplicate group, I can't use it.
Has someone implemented something like this ??
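For clarity, the semantics being asked for can be sketched outside DataStage. This is a minimal Python illustration (not DataStage code; the function and field names are invented for the example): any row whose key occurs more than once goes to the reject stream, and only true singletons are retained.

```python
from collections import Counter

def split_dupes(rows, key):
    """Drop every row whose key occurs more than once; the dropped
    rows (ALL instances of each duplicate key) become the rejects."""
    counts = Counter(key(r) for r in rows)
    kept = [r for r in rows if counts[key(r)] == 1]
    rejects = [r for r in rows if counts[key(r)] > 1]
    return kept, rejects

rows = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 1, "v": "c"}]
kept, rejects = split_dupes(rows, key=lambda r: r["id"])
# kept    -> only the id=2 row
# rejects -> both id=1 rows
```

Note how this differs from a Remove Duplicates stage, which would keep one of the id=1 rows.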
Removing all instances of duplicates
Re: Removing all instances of duplicates
asadi wrote: Has someone implemented something like this ??

Nah. But if the data size is small enough, a Copy stage before the Remove Duplicates and a lookup afterward can give you what you want -- the records that were dropped.
Yeah, Remove Duplicate should have a drop/reject link, but until that is implemented, this is the best I could come up with at short notice.
-T.J.
Developer of DataStage Parallel Engine (Orchestrate).
Re: Removing all instances of duplicates
Teej wrote:
asadi wrote: Has someone implemented something like this ??
Nah. But if the data size is small enough, a copy stage before the remove duplicate and a lookup afterward can give you what you want -- the records that were dropped. Yeah, Remove Duplicate should have a drop/reject link, but until that is implemented, this is the best I could come up with at short notice.
-T.J.

I'd use Change Capture instead of a lookup for this type of processing.
Vitali.
Thanks T.J., but that would be only part of the solution; I would still have one of those duplicates hanging around in my data. The objective is to remove all instances. I have the Aggregator and Transformer stages doing this for me, but I can't get all the rejected rows. I might have to use Change Capture with a Copy. I was looking for a simpler solution, but that doesn't seem likely.
Removing all instances of duplicates
You just need a transform and another filter/transform.
First hash and sort by the dup keys. Then use one transform to assign a sequential # to each record in a dup group. Then in the second transform/filter, use the sequential # as a constraint/filter:
if the # = 1, then that's the record to retain,
if the # <> 1, then send it to a reject link.
- BP
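BP's two-transform recipe can be sketched in Python, with sorted() standing in for the hash/sort and itertools.groupby standing in for the stage-variable logic that numbers records within a duplicate group (names here are illustrative, not DataStage syntax). Note that, as described, it retains record #1 of each group; to drop all instances of a duplicate, the constraint would test group size instead.

```python
from itertools import groupby

def dedupe_with_rejects(rows, key):
    # Sort by the dup keys (stands in for the hash-and-sort step).
    rows = sorted(rows, key=key)
    kept, rejects = [], []
    for _, group in groupby(rows, key=key):
        # Assign a sequential # within each dup group, then filter:
        # seq == 1 -> retain, seq <> 1 -> reject link.
        for seq, row in enumerate(group, start=1):
            (kept if seq == 1 else rejects).append(row)
    return kept, rejects

rows = [{"id": 1, "v": "a"}, {"id": 1, "v": "b"}, {"id": 2, "v": "c"}]
kept, rejects = dedupe_with_rejects(rows, key=lambda r: r["id"])
# kept    -> first id=1 row and the id=2 row
# rejects -> second id=1 row
```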
asadi wrote: Thanks T.J., but that would be only part of the solution; I would still have one of those duplicates hanging around in my data.

No you won't.
copy -----> Rem Dup -----> copy -----> Unique Records.
  \                          |
   \                         V
    ---------------------> lookup ----> dead end
                              \
                               \
                                ------> Duplicates.
Using a Transform stage as BigPoppa suggested would be faster than this solution, since you would still have to hash and sort before the Rem Dup stage anyway. However, it would be tricky if you have multiple keys of different types. Use stage variables to do the constraints.
-T.J.
Developer of DataStage Parallel Engine (Orchestrate).
asadi wrote: I would still have one of those duplicates hanging around in my data.

As an aside, if you are hashing and sorting, ensure that the non-key fields you are sorting on (to get the first record) are set to "Sorting" only, not "Sorting, Partitioning"; otherwise you run the risk of leaving duplicates behind. When I say non-key fields, I mean the fields you are not using in that stage to define duplicates.
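The reason for that warning can be shown with a toy partitioner in Python (an illustration of the pitfall, not DataStage's actual partitioning code; the hash function and field names are invented): hash-partitioning on a non-key field can send two rows with the same duplicate key to different partitions, and a per-partition remove-duplicates then keeps both.

```python
def toy_hash(value):
    # Deterministic stand-in for a hash partitioner
    # (Python's built-in hash() is salted per run).
    return sum(ord(c) for c in str(value))

def hash_partition(rows, field, n):
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[toy_hash(row[field]) % n].append(row)
    return parts

def dedupe_per_partition(parts, key_field):
    # Each partition removes duplicates independently, as a
    # parallel Remove Duplicates would.
    out = []
    for part in parts:
        seen = set()
        for row in part:
            if row[key_field] not in seen:
                seen.add(row[key_field])
                out.append(row)
    return out

rows = [{"id": 1, "v": "a"}, {"id": 1, "v": "b"}]
parts = hash_partition(rows, "v", 2)          # partition on the NON-key field
survivors = dedupe_per_partition(parts, "id")  # dedup on the key field
# Both id=1 rows survive, because they landed in different partitions.
```

Partitioning on the duplicate key itself (here "id") avoids this, since all rows sharing a key land in the same partition.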
-T.J.
Developer of DataStage Parallel Engine (Orchestrate).