Removing Dups

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

vzmz
Participant
Posts: 36
Joined: Sun Nov 23, 2003 12:10 pm
Location: Dallas

Removing Dups

Post by vzmz »

Hi All,
I have a sequential file with fields f1, f2, f3. In the transformer I am checking f1 and f2 for null, and then I have to remove duplicates, so I am placing a Remove Duplicates stage, where you have to specify the key on which duplicates are checked.
I have to delete records that are duplicates, but not on the stand-alone fields f1 or f2: the combination of f1 and f2 together must be unique.
Thanks
1stpoint
Participant
Posts: 165
Joined: Thu Nov 13, 2003 2:10 pm
Contact:

Post by 1stpoint »

Here is how I understand your problem:

Seq File
F1     F2     F3
-----  -----  -----
A      A      A
A      A      B
<nul>  A      C
<nul>  <nul>  D
<nul>  <nul>  <nul>

I would use an Aggregator stage. The question is: with multiple occurrences of F1 and F2, which F3 do you want? First, Min, Max?


Here is the resulting design:

SeqStage -> Aggregator -> Output
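The Aggregator approach above can be sketched in Python as a group-by on (F1, F2) with a chosen reduction over F3. This is only an illustration of the logic, not DataStage itself; the `max` rule is an assumption standing in for whichever of First/Min/Max you pick.

```python
from collections import OrderedDict

def aggregate_dedup(rows, pick=max):
    """Group rows by (F1, F2) and keep one F3 per group via `pick`.

    `pick` stands in for the Aggregator's derivation (First/Min/Max).
    """
    groups = OrderedDict()  # preserves first-seen key order
    for f1, f2, f3 in rows:
        groups.setdefault((f1, f2), []).append(f3)
    return [(f1, f2, pick(f3s)) for (f1, f2), f3s in groups.items()]

rows = [
    ("A", "A", "A"),
    ("A", "A", "B"),
    (None, "A", "C"),
]
print(aggregate_dedup(rows))  # [('A', 'A', 'B'), (None, 'A', 'C')]
```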
Teej
Participant
Posts: 677
Joined: Fri Aug 08, 2003 9:26 am
Location: USA

Re: Removing Dup's

Post by Teej »

Okay. Here's what I believe you are doing right now:

Input -> Transform -> Remove Duplicates

The input has F1, F2, and F3.

F1 and F2 are considered keys. They need to be unique.

* * *

What you should do is remove duplicates with TWO keys: F1 and F2, in that particular vertical order. If only we were allowed to post screenshots here. :)

Also, on the input you need to hash-partition AND sort using the keys. You can do that on the input's Partitioning tab. Alternatively, you can use a Sort stage before, and set the Remove Duplicates input partitioning to "Same".
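The sort-then-dedupe flow described above can be sketched as a plain function: sort on the key columns, then keep only the first row of each key group, which mirrors Sort followed by Remove Duplicates with "retain first" behaviour. This is an illustrative sketch, not DataStage code.

```python
def remove_duplicates(rows, key_cols=(0, 1)):
    """Sort on the key columns, then keep the first row per key,
    mimicking Sort -> Remove Duplicates (retain first)."""
    keyfn = lambda row: tuple(row[c] for c in key_cols)
    out = []
    last_key = object()  # sentinel that matches no real key
    for row in sorted(rows, key=keyfn):
        k = keyfn(row)
        if k != last_key:       # first row of a new key group
            out.append(row)
            last_key = k
    return out

data = [("A", "A", "A"), ("A", "A", "B"), ("B", "A", "C")]
print(remove_duplicates(data))  # [('A', 'A', 'A'), ('B', 'A', 'C')]
```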

-T.J.
Developer of DataStage Parallel Engine (Orchestrate).
1stpoint
Participant
Posts: 165
Joined: Thu Nov 13, 2003 2:10 pm
Contact:

Post by 1stpoint »

I believe the Aggregator stage will accomplish both of these objectives.
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

You want the F1 and F2 combination to be unique, but how do you decide which F3 values to toss out and which one to keep? You need a business rule that states which duplicate rows to remove. You can use the Aggregator stage if your rule is "take the highest F3 value". If your rule is more complex, such as "take the F3 value that is not null and less than 1000...", then you might need to use a hash file lookup instead.
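The point above is that the survivor rule can be arbitrary. A minimal sketch, using the hypothetical rule quoted in the post ("non-null and less than 1000", falling back to the highest value); both the rule and the function names are assumptions for illustration:

```python
def pick_f3(candidates):
    """Hypothetical business rule: prefer the first non-null F3 below
    1000; otherwise fall back to the highest non-null value."""
    valid = [v for v in candidates if v is not None and v < 1000]
    if valid:
        return valid[0]
    return max(v for v in candidates if v is not None)

def dedup_with_rule(rows):
    """Group by (F1, F2) and apply the survivor rule to each group."""
    groups, order = {}, []
    for f1, f2, f3 in rows:
        k = (f1, f2)
        if k not in groups:
            order.append(k)
        groups.setdefault(k, []).append(f3)
    return [(f1, f2, pick_f3(groups[(f1, f2)])) for f1, f2 in order]

rows = [("A", "A", 2000), ("A", "A", 5), ("B", "B", 7)]
print(dedup_with_rule(rows))  # [('A', 'A', 5), ('B', 'B', 7)]
```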
vzmz
Participant
Posts: 36
Joined: Sun Nov 23, 2003 12:10 pm
Location: Dallas

How to get the rejected rows

Post by vzmz »

I have given both fields (f1 & f2) as keys in the Remove Duplicates stage and it's working fine.
I also have to store all the rejected records in a separate file. I tried adding a reject link, but the Remove Duplicates stage takes only one output.
How can I get the rejected records?
Thanks
MZ
Teej
Participant
Posts: 677
Joined: Fri Aug 08, 2003 9:26 am
Location: USA

Re: How to get the rejected rows

Post by Teej »

I posted a solution elsewhere just yesterday: use Copy stages before and after the Remove Duplicates, with a Lookup stage on the key fields to determine which records have been dropped.
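The before/after comparison described above can be sketched as a single pass that routes the first row per key to the kept output and every later row with the same key to a reject list. This only illustrates the idea; in the actual job the reject set would come from looking up the de-duplicated output against the original stream.

```python
def split_dedup(rows, key_cols=(0, 1)):
    """Keep the first row per key; route later rows with the same key
    to a reject list, mimicking a before/after lookup to find drops."""
    seen = set()
    kept, rejected = [], []
    for row in rows:
        k = tuple(row[c] for c in key_cols)
        if k in seen:
            rejected.append(row)   # duplicate key: goes to the reject file
        else:
            seen.add(k)
            kept.append(row)       # first occurrence: survives dedup
    return kept, rejected

data = [("A", "A", "A"), ("A", "A", "B"), ("B", "A", "C")]
kept, rejected = split_dedup(data)
print(kept)      # [('A', 'A', 'A'), ('B', 'A', 'C')]
print(rejected)  # [('A', 'A', 'B')]
```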

I will have to ask some folks at Ascential today about the viability of having a reject link on all stages, to preserve any dropped records for other purposes.

-T.J.
Developer of DataStage Parallel Engine (Orchestrate).