Check Duplicate without using Remove Duplicates

Posted: Fri Sep 30, 2011 2:02 am
by karthi_gana
All,

We receive a security-return file from an external vendor on a monthly basis. If the file contains duplicates (key columns: security_no, month_end_date), we don't want to process the file at all. We need to ensure that the file doesn't contain a single duplicate; I think this is a kind of quality check. Do I need to use QualityStage (Standardize/Investigate)? Do I need to create a new rule set? I don't know how or where to start. Can anybody help me?

Posted: Fri Sep 30, 2011 2:20 am
by arvind_ds
You can also use the Sort stage in DataStage parallel jobs to remove duplicates.

Re: Check Duplicate without using Remove Duplicates

Posted: Fri Sep 30, 2011 2:23 am
by BI-RMA
Use an Aggregator and group by your key columns. If there is a count > 1, you can treat that as an error. For example, use a Lookup stage, look up the fixed value 1 against the count from the Aggregator, and keep the default setting for Lookup Failure as 'Fail' - if there is a duplicate key in the file, the job will abort.

Within the Sequence calling this job, you should then set the execution action for the job to "Reset if required, then run", because you are making the job abort under certain conditions defined by business rules.
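
For comparison, here is the same duplicate-key check done outside DataStage. A minimal sketch, assuming a comma-delimited file with security_no and month_end_date in the first two fields (the file name is illustrative; adjust the delimiter and field positions to the real layout):

cut -d',' -f1,2 security_returns.csv | sort | uniq -d

If this prints anything at all, the file contains duplicate keys and should be rejected.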

Posted: Fri Sep 30, 2011 2:49 am
by vishal_rastogi
With the help of stage variables in a Transformer you can remove the duplicates.

Re: Check Duplicate without using Remove Duplicates

Posted: Fri Sep 30, 2011 8:11 pm
by SURA
You can use the sed or sort Unix commands in a before-job routine to remove them.
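
For example, a minimal before-job sketch using sort alone (assuming a comma-delimited file keyed on the first two fields; the file name is illustrative):

sort -t',' -k1,2 -u security_returns.csv > security_returns_dedup.csv

With -u, sort writes only the first row of each run of equal keys, so the output is duplicate-free on (security_no, month_end_date).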

DS User

Posted: Sat Oct 01, 2011 2:32 am
by pandeesh
Use uniq -d, and if it returns any result, simply do not process the file.
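
A sketch of that check as a before-job script - note that uniq -d only reports duplicates on sorted input, so sort first (the file name is illustrative):

if [ -n "$(sort security_returns.csv | uniq -d)" ]; then
    echo "Duplicate rows found - rejecting file" >&2
    exit 1
fi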

Posted: Sat Oct 01, 2011 4:32 am
by BI-RMA
pandeesh wrote: Use uniq -d, and if it returns any result, simply do not process the file.
But this only helps if the rows are exact duplicates of each other. To check for duplicate keys, see above. There are likely to be other solutions, as almost always in DataStage...

Posted: Mon Oct 03, 2011 8:33 am
by FranklinE
Not intending this to be a good solution, but I needed something quickly and cooked up a way to use Transformer constraints to capture duplicates (which needed to be saved in a rejects file). I also check some fields for validity.

It took complete "coverage" in the constraint logic: failed edits go to one link, duplicates to another link, and carefully written "reverse" logic passes the remaining rows on to the next steps, as in the sketch below.
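
A sketch of that "complete coverage" pattern, with hypothetical variable and link names, assuming the input is sorted on the key:

svValid : <field validity edits, e.g. Len(in.security_no) > 0>
svDup : If in.security_no : in.month_end_date = svPrev Then 1 Else 0
svPrev : in.security_no : in.month_end_date

Constraints (mutually exclusive, covering every row):
failed-edits link : svValid = 0
duplicates link : svValid = 1 And svDup = 1
output link : svValid = 1 And svDup = 0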

Posted: Wed Nov 16, 2011 8:25 am
by sanjayS
First sort the records on the key column, then follow the Sort with a Transformer stage.

Then use stage variables and a constraint to remove the duplicates.

sv1 : Key column
sv2 : If sv1<>sv3 Then 1 Else 0
sv3 : sv1

Use the constraint sv2 = 1.
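
This works because stage variables are evaluated top to bottom, so sv3 still holds the previous row's key when sv2 compares against it. For comparison, the same first-row-per-key logic sketched as an awk one-liner outside DataStage (assuming a comma-delimited file keyed on the first two fields; the file name is illustrative):

sort -t',' -k1,2 security_returns.csv | awk -F',' '{ key = $1 "," $2; if (key != prev) print; prev = key }'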

-----------------
sanjay

Posted: Wed Nov 16, 2011 9:35 am
by ShaneMuir
Or just use a Sort stage: sort by the key and add a key change column to the output.

In the next stage you can use a Transformer or a Filter to check the key change column. The column is set to 1 for the first row in each group, where the value of the sort key changes; subsequent records in the group have it set to 0. You can output the respective rows as you require.
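
A minimal sketch of that design, assuming the default name of the generated column:

Sort stage : Sorting Keys = security_no, month_end_date; Create Key Change Column = True
Transformer constraint : keyChange = 1 (keeps only the first row of each key group)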