Removing Dups

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

vzmz
Participant
Posts: 36
Joined: Sun Nov 23, 2003 12:10 pm
Location: Dallas

Removing Dups

Post by vzmz »

Hi All,
I have a sequential file with fields f1, f2, f3. In the transformer I am checking f1 and f2 for null, and then I have to remove duplicates, so I am placing a Remove Duplicates stage, where you have to specify the key on which duplicates are checked.
I have to delete records that are duplicates, but not on the stand-alone fields f1 or f2: the combination of f1 and f2 together must be unique.
Thanks
1stpoint
Participant
Posts: 165
Joined: Thu Nov 13, 2003 2:10 pm
Contact:

Post by 1stpoint »

Here is how I understand your problem:

Seq File
F1     F2     F3
-----  -----  -----
A      A      A
A      A      B
<nul>  A      C
<nul>  <nul>  D
<nul>  <nul>  <nul>

I would use an Aggregator stage. The question is: with multiple occurrences of F1 and F2, which F3 do you want? First, Min, Max?


Here is the resulting design:

SeqStage -> Aggregator -> Output
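The Aggregator approach above can be sketched in Python as a group-by on (F1, F2) with a chosen reduction over F3. This is only an illustration of the logic, not DataStage itself; the `max` rule is an assumption standing in for whichever of First/Min/Max you pick.

```python
from collections import OrderedDict

def aggregate_dedup(rows, pick=max):
    """Group rows by (F1, F2) and keep one F3 per group via `pick`.

    `pick` stands in for the Aggregator's derivation (First/Min/Max).
    """
    groups = OrderedDict()  # preserves first-seen key order
    for f1, f2, f3 in rows:
        groups.setdefault((f1, f2), []).append(f3)
    return [(f1, f2, pick(f3s)) for (f1, f2), f3s in groups.items()]

rows = [
    ("A", "A", "A"),
    ("A", "A", "B"),
    (None, "A", "C"),
]
print(aggregate_dedup(rows))  # [('A', 'A', 'B'), (None, 'A', 'C')]
```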
Teej
Participant
Posts: 677
Joined: Fri Aug 08, 2003 9:26 am
Location: USA

Re: Removing Dup's

Post by Teej »

Okay. Here's what I believe you are doing right now:

Input -> Transform -> Remove Duplicates

The input has F1, F2, and F3.

F1 and F2 are considered keys. They need to be unique.

* * *

What you should do is remove duplicates with TWO keys: F1 and F2, in that particular vertical order. If only we were allowed to post screenshots here. :)

Also, on the input you need to hash-partition AND sort using the keys. You can do that on the input's Partitioning tab. Alternatively, you can use a Sort stage before, and set the Remove Duplicates input partitioning to "Same".
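The sort-then-dedupe flow described above can be sketched as a plain function: sort on the key columns, then keep only the first row of each key group, which mirrors Sort followed by Remove Duplicates with "retain first" behaviour. This is an illustrative sketch, not DataStage code.

```python
def remove_duplicates(rows, key_cols=(0, 1)):
    """Sort on the key columns, then keep the first row per key,
    mimicking Sort -> Remove Duplicates (retain first)."""
    keyfn = lambda row: tuple(row[c] for c in key_cols)
    out = []
    last_key = object()  # sentinel that matches no real key
    for row in sorted(rows, key=keyfn):
        k = keyfn(row)
        if k != last_key:       # first row of a new key group
            out.append(row)
            last_key = k
    return out

data = [("A", "A", "A"), ("A", "A", "B"), ("B", "A", "C")]
print(remove_duplicates(data))  # [('A', 'A', 'A'), ('B', 'A', 'C')]
```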

-T.J.
Developer of DataStage Parallel Engine (Orchestrate).
1stpoint
Participant
Posts: 165
Joined: Thu Nov 13, 2003 2:10 pm
Contact:

Post by 1stpoint »

I believe the Aggregator stage will accomplish both of these objectives.
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

You want the F1 and F2 combination to be unique, but how do you decide which F3 values to toss out and which one to keep? You need a business rule that states which duplicate rows to remove. You can use the Aggregator stage if your rule is "take the highest F3 value". If your rule is more complex, such as "take the F3 value that is not null and less than 1000...", then you might need to use a hash file lookup instead.
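The point above is that the survivor rule can be arbitrary. A minimal sketch, using the hypothetical rule quoted in the post ("non-null and less than 1000", falling back to the highest value); both the rule and the function names are assumptions for illustration:

```python
def pick_f3(candidates):
    """Hypothetical business rule: prefer the first non-null F3 below
    1000; otherwise fall back to the highest non-null value."""
    valid = [v for v in candidates if v is not None and v < 1000]
    if valid:
        return valid[0]
    return max(v for v in candidates if v is not None)

def dedup_with_rule(rows):
    """Group by (F1, F2) and apply the survivor rule to each group."""
    groups, order = {}, []
    for f1, f2, f3 in rows:
        k = (f1, f2)
        if k not in groups:
            order.append(k)
        groups.setdefault(k, []).append(f3)
    return [(f1, f2, pick_f3(groups[(f1, f2)])) for f1, f2 in order]

rows = [("A", "A", 2000), ("A", "A", 5), ("B", "B", 7)]
print(dedup_with_rule(rows))  # [('A', 'A', 5), ('B', 'B', 7)]
```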
vzmz
Participant
Posts: 36
Joined: Sun Nov 23, 2003 12:10 pm
Location: Dallas

How to get the rejected rows

Post by vzmz »

I have given both fields (f1 & f2) as keys in the Remove Duplicates stage and it's working fine.
I also have to store all the rejected records in a separate file. I tried adding a reject link, but the Remove Duplicates stage takes only one output.
How can I get the rejected records?
Thanks
MZ
Teej
Participant
Posts: 677
Joined: Fri Aug 08, 2003 9:26 am
Location: USA

Re: How to get the rejected rows

Post by Teej »

I posted a solution elsewhere just yesterday: use Copy stages before and after the Remove Duplicates, with a Lookup stage on the key fields to determine which records have been dropped.
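The before/after comparison described above can be sketched as a single pass that routes the first row per key to the kept output and every later row with the same key to a reject list. This only illustrates the idea; in the actual job the reject set would come from looking up the de-duplicated output against the original stream.

```python
def split_dedup(rows, key_cols=(0, 1)):
    """Keep the first row per key; route later rows with the same key
    to a reject list, mimicking a before/after lookup to find drops."""
    seen = set()
    kept, rejected = [], []
    for row in rows:
        k = tuple(row[c] for c in key_cols)
        if k in seen:
            rejected.append(row)   # duplicate key: goes to the reject file
        else:
            seen.add(k)
            kept.append(row)       # first occurrence: survives dedup
    return kept, rejected

data = [("A", "A", "A"), ("A", "A", "B"), ("B", "A", "C")]
kept, rejected = split_dedup(data)
print(kept)      # [('A', 'A', 'A'), ('B', 'A', 'C')]
print(rejected)  # [('A', 'A', 'B')]
```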

I will have to ask some folks at Ascential today about the viability of having a reject link on all stages, to preserve any dropped records for other purposes.

-T.J.
Developer of DataStage Parallel Engine (Orchestrate).