Check Duplicate without using Remove Duplicates
Moderators: chulett, rschirm, roy
-
- Premium Member
- Posts: 729
- Joined: Tue Apr 28, 2009 10:49 pm
Check Duplicate without using Remove Duplicates
All,
We receive a security return file from an external vendor on a monthly basis. If the file contains duplicates (key columns: securty_no, month_end_date), we don't want to process the file at all. We need to ensure the file doesn't contain a single duplicate. I think this is a kind of quality check. Do I need to use QualityStage / Standardize / Investigate? Do I need to create a new rule set? I don't know how or where to start. Can anybody help me?
Karthik
Re: Check Duplicate without using Remove Duplicates
Use an Aggregator stage and group by your key columns. If any group has a count > 1, you can evaluate that as an error. For example, use a Lookup stage: do a lookup of the fixed value 1 against the count from the Aggregator and keep the default Lookup Failure setting of 'Fail'. If there is a duplicate key in the file, the job will abort.
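The same "group by key, fail on count > 1" idea can be sketched outside DataStage as a quick shell check. This is only an illustration: the sample data, file path, and comma-delimited layout (securty_no and month_end_date as the first two fields) are assumptions, not anything from the actual feed.

```shell
# Shell sketch of the Aggregator-style duplicate-key check.
# Sample data is illustrative; adjust -d/-f to the real layout.
cat > /tmp/security_return.csv <<'EOF'
1001,2011-04-30,5.12
1002,2011-04-30,3.87
1001,2011-04-30,5.12
EOF

# Group by the key columns and keep only keys seen more than once
DUPES=$(cut -d',' -f1,2 /tmp/security_return.csv | sort | uniq -d | wc -l)

if [ "$DUPES" -gt 0 ]; then
    echo "Found $DUPES duplicate key(s) -- file rejected"
else
    echo "No duplicate keys -- OK to process"
fi
```

Note this checks only the key columns, which is the same semantics as the Aggregator approach; it flags rows that share a key even when the rest of the row differs.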
Within the Sequence calling this job, you should then set the execution action for the job to "Reset if required, then run", because you are making the job abort under certain conditions defined by business rules.
"It is not the lucky ones who are grateful.
It is the grateful ones who are happy." Francis Bacon
-
- Participant
- Posts: 47
- Joined: Thu Dec 09, 2010 4:37 am
Re: Check Duplicate without using Remove Duplicates
You can use the sed / sort Unix commands in a before-job routine and remove them.
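A minimal sketch of that before-job cleanup, assuming a plain text file (the file name and sample rows are made up for illustration). sort -u keeps one copy of each identical line, so it removes only rows that are exact duplicates of each other.

```shell
# Before-job cleanup sketch: sort -u drops exact duplicate rows.
cat > /tmp/security_return.dat <<'EOF'
1001,2011-04-30,5.12
1001,2011-04-30,5.12
1002,2011-04-30,3.87
EOF

sort -u /tmp/security_return.dat -o /tmp/security_return.dat
wc -l < /tmp/security_return.dat    # 2 rows remain
```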
DS User
pandeesh wrote: Use uniq -d and if it returns any result, simply do not process the file.
But this only helps if the rows are exact duplicates of each other. To check for duplicate keys, see above. There are likely to be other solutions, as almost always in DataStage...
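Spelled out, the uniq -d suggestion looks like this (file name and data are illustrative). uniq -d prints only repeated lines and requires sorted input, so any output at all means "do not process the file" — with the caveat above that only whole-row duplicates are caught.

```shell
# uniq -d prints repeated lines; input must be sorted first.
cat > /tmp/monthly.dat <<'EOF'
1001,2011-04-30,5.12
1001,2011-04-30,5.12
1002,2011-04-30,3.87
EOF

if [ -n "$(sort /tmp/monthly.dat | uniq -d)" ]; then
    echo "Exact duplicate rows found -- skipping file"
fi
```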
"It is not the lucky ones who are grateful.
It is the grateful ones who are happy." Francis Bacon
Not intending this to be a good solution, but I needed something quickly and cooked up a way to use Transformer constraints to capture duplicates (they needed to be saved in a rejects file). I also check some fields for validity.
It took complete "coverage" in the constraint logic: failed edits to one link, duplicates to another link, and carefully written "reverse" logic for rows that pass and are to be processed in the next steps.
Franklin Evans
"Shared pain is lessened, shared joy increased. Thus do we refute entropy." -- Spider Robinson
Using mainframe data FAQ: viewtopic.php?t=143596 Using CFF FAQ: viewtopic.php?t=157872
Or just use a Sort stage: group and sort by the key and add a key-change column to the output.
In the next stage you can use a Transformer or a Filter to check the key-change column. The column is set to 1 for the first row in each group, where the value of the sort key changes; subsequent rows in the group have the column set to 0. You can then output the respective rows as you require.
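The key-change column's behaviour can be mimicked in awk for a quick test outside DataStage (sample data and the two-column key layout are illustrative, not the Sort stage itself): the flag is 1 on the first row of each key group and 0 on later rows.

```shell
# Mimic the Sort stage's key-change column: prepend a 1/0 flag
# that is 1 whenever the key (fields 1-2) differs from the
# previous row's key, 0 otherwise.
printf '1001,2011-04-30,a\n1001,2011-04-30,b\n1002,2011-04-30,c\n' |
sort -t',' -k1,2 |
awk -F',' '{ key = $1 FS $2; print (key != prev) "," $0; prev = key }'
```

Downstream you would keep the rows flagged 1 (or reject the whole file if any row is flagged 0, matching the original requirement).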