Hi all,
I have a sequential file with fields f1, f2, and f3. In a Transformer I check f1 and f2 for nulls, and then I need to remove duplicates, so I am placing a Remove Duplicates stage, where I have to specify the key on which duplicates are checked.
I have to delete the duplicate records, but not on the standalone fields f1 or f2: the combination of f1 and f2 together must be unique.
Thanks
Removing Dup's
Here is how I understand your problem:
Seq File
F1 F2 F3
------ ----- -----
A A A
A A B
<nul> A C
<nul> <nul> D
<nul> <nul> <nul>
I would use an Aggregator stage. The question is: for the multiple occurrences of F1 and F2, which F3 do you want? First, Min, Max?
Here is the resulting design:
SeqStage -> Aggregator -> Output
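The Aggregator approach above can be sketched in plain Python: group on (F1, F2) and keep one F3 per group. The sample rows and the "take the max F3" rule are illustrative assumptions only.

```python
# Sketch of the Aggregator idea: one output row per (F1, F2) key,
# keeping the "max" F3 (null treated as lowest). Data is the sample above.
rows = [
    ("A", "A", "A"),
    ("A", "A", "B"),
    (None, "A", "C"),
    (None, None, "D"),
    (None, None, None),
]

groups = {}
for f1, f2, f3 in rows:
    key = (f1, f2)
    current = groups.get(key)
    # Replace the stored F3 if this row's F3 compares higher.
    if current is None or (f3 or "") > (current or ""):
        groups[key] = f3

deduped = [(f1, f2, f3) for (f1, f2), f3 in groups.items()]
```

With the sample data this yields one row per distinct (F1, F2) pair, which is what the Aggregator would produce once you pick a reduction (First, Min, Max) for F3.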
Re: Removing Dup's
Okay. Here's what I believe you are doing right now:
Input -> Transform -> Remove Duplicates
The input has F1, F2, and F3. F1 and F2 are considered keys; they need to be unique.
* * *
What you should do is remove duplicates with TWO keys, F1 and F2, in that particular order. If only we were allowed to post screenshots here.
Also, on the input you need to hash-partition AND sort using the keys. You can do that on the Input/Partition tab. Alternatively, you can use a Sort stage before it and set the partition type on the Remove Duplicates input to "Same".
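A rough Python analogue of that flow ("sort on the two keys, then keep the first row per (F1, F2) pair") is below; the field names and sample data are assumptions for illustration.

```python
# Mimic hash/sort on (F1, F2) followed by Remove Duplicates keeping
# the first row per key. None is ordered before real values here.
rows = [
    ("A", "A", "B"),
    ("A", "A", "A"),
    (None, "A", "C"),
]

# Sort on the two keys, with a sentinel so None compares cleanly.
rows.sort(key=lambda r: ((r[0] is None, r[0] or ""), (r[1] is None, r[1] or "")))

seen = set()
kept = []
for f1, f2, f3 in rows:
    if (f1, f2) not in seen:
        seen.add((f1, f2))
        kept.append((f1, f2, f3))
```

In DataStage the sort and the dedup are separate concerns, which is why the partitioning must match: an unsorted or mis-partitioned input can leave duplicates in different partitions where the stage never sees them together.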
-T.J.
Developer of DataStage Parallel Engine (Orchestrate).
You want the F1 and F2 combination to be unique, so how do you decide which F3 values to toss out and which one to keep? You need a business rule that states which duplicate rows to remove. You can use the Aggregator stage if your rule is "take the highest F3 value". If your rule is more complex, such as "take the F3 value that is not null and less than 1000", then you might need to use a hash file lookup instead.
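A custom rule like the one mentioned can be sketched as a comparison function applied per key. The rule ("prefer an F3 that is not null and below 1000") and the data are illustrative assumptions, not a prescribed implementation.

```python
# Sketch: choose which duplicate's F3 survives under a business rule.
rows = [
    ("A", "A", 1500),
    ("A", "A", 200),
    ("A", "A", None),
]

def better(new, old):
    """Return True if `new` F3 should replace `old` under the rule:
    prefer values that are not null and less than 1000."""
    def ok(v):
        return v is not None and v < 1000
    return ok(new) and not ok(old)

best = {}
for f1, f2, f3 in rows:
    key = (f1, f2)
    if key not in best or better(f3, best[key]):
        best[key] = f3
```

This is the kind of logic that does not fit the Aggregator's built-in reductions, which is why a lookup-based design (or a Transformer with stage variables) gets suggested for complex rules.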
Certus Solutions
Blog: Tooling Around in the InfoSphere
Twitter: @vmcburney
LinkedIn:Vincent McBurney LinkedIn
How to get the rejected rows
I have given both fields (f1 & f2) as keys in the Remove Duplicates stage and it is working fine.
I have to store all the rejected records in a separate file. I tried adding a reject link, but the Remove Duplicates stage takes only one output.
How can I get the rejected records?
Thanks
MZ
Re: How to get the rejected rows
I posted a solution elsewhere just yesterday: use Copy stages before and after the dedup on the key fields, with a Lookup stage to determine which records were dropped.
I will have to ask some folks at Ascential today about the viability of having a reject link on all stages, to preserve dropped records for other purposes.
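The copy-and-lookup idea can be sketched as follows: keep a copy of the full input, dedupe on (F1, F2), and the rejects are the input rows whose record did not survive. The data is an assumption; in a real job duplicate whole-record copies would need a row identifier to distinguish them.

```python
# Sketch: derive the "reject" set as input minus the deduped output.
rows = [
    ("A", "A", "A"),
    ("A", "A", "B"),
    (None, "A", "C"),
]

# Dedup on (F1, F2), keeping the first row per key.
seen, kept = set(), []
for r in rows:
    key = (r[0], r[1])
    if key not in seen:
        seen.add(key)
        kept.append(r)

# "Lookup" step: any input row not present in the kept output is a reject.
kept_set = set(kept)
rejects = [r for r in rows if r not in kept_set]
```

In job terms that is: Copy -> Remove Duplicates -> Lookup (input stream against the deduped stream), with the lookup misses routed to the reject file.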
-T.J.
Developer of DataStage Parallel Engine (Orchestrate).