Remove duplicate keys
-
- Premium Member
- Posts: 43
- Joined: Wed Oct 18, 2006 6:03 am
Hello!
I would like to identify duplicate dealnumbers. If a dealnumber appears only once it's OK. If the same dealnumber is in two or more rows, I want to remove those rows and insert them into another output for further use.
This seems like it should be easy, but I can't figure out how.
Thank you for your help,
Mattias
It's really not all that complicated. All you need is input sorted on the key, two stage variables and two output links. The stage variables will compare the current key to the previous one and flag it. Use the output link constraints to constrain the output.
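The transformer logic described above is DataStage-specific, so it can't be shown runnable here; the following is a minimal Python sketch of the same idea (function and variable names are mine, not from the post). It assumes the rows are already sorted on the key:

```python
# Sketch of the suggested transformer logic: compare each row's key
# to the previous row's key and route on the resulting flag.
# Assumes input rows are (key, value) pairs sorted on key.

def split_on_repeat(rows):
    """First sight of a key goes to the 'unique' output;
    rows repeating the previous key go to the 'duplicate' output."""
    unique_out, duplicate_out = [], []
    prev_key = None
    for key, value in rows:
        if key == prev_key:                  # same key as previous row
            duplicate_out.append((key, value))
        else:                                # new key
            unique_out.append((key, value))
        prev_key = key                       # remember key for next row
    return unique_out, duplicate_out
```

Note that this routes the first occurrence of a repeated key to the unique output, which is exactly the objection raised in the follow-up post below.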
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
But I think my problem is a bit more complicated than that.
If a dealnumber is repeated, I want to remove all the rows containing that dealnumber; I don't want to keep any of them. If I only compare each dealnumber with the row below, I will always be left with one row per dealnumber, or am I wrong?
10 100
10 200
20 300
30 400
should generate:
first output:
20 300
30 400
second output:
10 100
10 200
thx,
Mattias
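The split Mattias describes (every row of a repeated key goes to the second output) can be sketched in Python with a two-pass count; this is just an illustration of the requirement, not the DataStage solution given later in the thread:

```python
from collections import Counter

def split_by_key_count(rows):
    """Keys seen exactly once go to the first output; every row
    of a repeated key goes to the second output."""
    counts = Counter(key for key, _ in rows)   # pass 1: count each key
    first_output = [(k, v) for k, v in rows if counts[k] == 1]
    second_output = [(k, v) for k, v in rows if counts[k] > 1]
    return first_output, second_output
```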
Do this.
Sort the incoming data on your key. Define two stage variables in the transformer, say condFlag and prevVal. They will detect duplicates and flag them. Both will be initialized to 0. Their derivations will be as follows:
Code: Select all
condFlag | if (prevVal <> src.key) then 'X' else 'Y'
prevVal | src.key
Have a link coming out of here to a hashed file keyed on src.key (your first column).
Constraint for buildHash: condFlag = 'Y'
In the same job, or perhaps a second job, feed the same source file again and do a lookup against this hashed file keyed on your first column (key). Use the constraint reflink.NOTFOUND for one output link, where reflink is your reference link name. This link will give you the "first output" you mentioned in your previous post: all the records whose key is not duplicated.
Have a second output link with constraint NOT(reflink.NOTFOUND). This will give you your "second output".
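The two-pass design above (build a hashed file of duplicated keys, then look the source up against it) can be sketched in Python, using a set in place of the hashed file; names and structure here are illustrative, not DataStage syntax:

```python
def build_duplicate_key_set(sorted_rows):
    """Pass 1: mimic the transformer + hashed file. Rows are sorted
    on key; when the current key equals the previous one (the
    condFlag = 'Y' case), record the key in the 'hashed file'."""
    dup_keys = set()
    prev = None
    for key, _ in sorted_rows:
        if key == prev:
            dup_keys.add(key)
        prev = key
    return dup_keys

def route_with_lookup(rows, dup_keys):
    """Pass 2: look each source row up against the hashed file.
    NOTFOUND (key absent)      -> first output (unique keys);
    NOT(NOTFOUND) (key present) -> second output (duplicates)."""
    first_output = [r for r in rows if r[0] not in dup_keys]
    second_output = [r for r in rows if r[0] in dup_keys]
    return first_output, second_output
```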
You won't be deleting anything. The hashed file will contain only the keys which have duplicates in your source file. Then you run your source file again against this hashed file: NOT(NOTFOUND) gives you the duplicate records and NOTFOUND gives you the non-duplicate records. No data will be deleted or missed.
"Step by step"
I feel like the new kid on the block, needing this much help!
Now it works and it is beautiful. Thank you very much. If you ever come by Santiago or Stockholm, let me buy you a beer ;-)
//Mattias