Check Duplicate without using Remove Duplicates

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

karthi_gana
Premium Member
Posts: 729
Joined: Tue Apr 28, 2009 10:49 pm

Check Duplicate without using Remove Duplicates

Post by karthi_gana »

All,

We receive a security return file from an external vendor on a monthly basis. If the file contains duplicates (key columns: security_no, month_end_date), we don't want to process the file at all. We need to ensure that the file doesn't have even a single duplicate. I think this is a kind of quality check. Do I need to use QualityStage / Standardize / Investigate? Do I need to create a new rule set? I don't know how or where to start. Can anybody help me?
Karthik
arvind_ds
Participant
Posts: 428
Joined: Thu Aug 16, 2007 11:38 pm
Location: Manali

Post by arvind_ds »

You can also use the Sort stage in DataStage parallel jobs to remove duplicates.
Arvind
BI-RMA
Premium Member
Posts: 463
Joined: Sun Nov 01, 2009 3:55 pm
Location: Hamburg

Re: Check Duplicate without using Remove Duplicates

Post by BI-RMA »

Use an Aggregator and group by your key columns. If any group has a count > 1, you can evaluate that as an error. For example, use a Lookup stage, look up the fixed value 1 against the count from the Aggregator, and keep the default setting for Lookup Failure as 'Fail': if there is a duplicate key in the file, the job will abort.

Within the Sequence calling this job, you should then set the execution action for the job to "Reset if required, then run", because you make the job abort under certain conditions defined by business rules.
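The same count-per-key check can be sketched outside DataStage with awk. This is only an illustration, assuming a pipe-delimited file whose first two fields are security_no and month_end_date; the file name securities.txt is made up:

```shell
# Count rows per (security_no, month_end_date) key, like the Aggregator.
# If any key occurs more than once, report it and exit non-zero, which
# mirrors the job aborting on Lookup Failure = 'Fail'.
awk -F'|' '{ cnt[$1 FS $2]++ }
           END { for (k in cnt)
                     if (cnt[k] > 1) { print "duplicate key:", k; bad = 1 }
                 exit bad }' securities.txt \
  && echo "file is clean" \
  || echo "duplicates found - do not process"
```

The exit status carries the result, so the same one-liner could drive a before-job check as well.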
"It is not the happy who are grateful.
It is the grateful who are happy." Francis Bacon
vishal_rastogi
Participant
Posts: 47
Joined: Thu Dec 09, 2010 4:37 am

Post by vishal_rastogi »

With the help of stage variables in a Transformer you can remove the duplicates.
Vish
SURA
Premium Member
Posts: 1229
Joined: Sat Jul 14, 2007 5:16 am
Location: Sydney

Re: Check Duplicate without using Remove Duplicates

Post by SURA »

You can use the sed / sort Unix commands in a before-job routine and remove them.
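One way this could look in a before-job ExecSH command, as a sketch only: compare the line count with and without sort -u restricted to the key fields. The delimiter, key positions, and file name securities.txt are assumptions to adjust to the real layout:

```shell
# Sort on the key columns (fields 1-2 of a pipe-delimited file), then
# count lines with and without -u (unique on the sort key). A difference
# means duplicate keys exist.
total=$(sort -t'|' -k1,2 securities.txt | wc -l)
unique=$(sort -t'|' -k1,2 -u securities.txt | wc -l)
if [ "$total" -ne "$unique" ]; then
    echo "duplicate keys found" >&2
    exit 1    # a non-zero return from a before-routine stops the job
fi
```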

DS User
pandeesh
Premium Member
Posts: 1399
Joined: Sun Oct 24, 2010 5:15 am
Location: CHENNAI, TAMIL NADU

Post by pandeesh »

Use uniq -d, and if it returns any result, simply do not process the file.
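A minimal sketch of this check (the file name securities.txt is made up). Note that uniq only detects adjacent duplicates, so the file must be sorted first:

```shell
# uniq -d prints only lines that are repeated; any output at all means
# the file has exact duplicate rows and should be skipped.
if [ -n "$(sort securities.txt | uniq -d)" ]; then
    echo "exact duplicate rows found - do not process the file"
fi
```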
pandeeswaran
BI-RMA
Premium Member
Posts: 463
Joined: Sun Nov 01, 2009 3:55 pm
Location: Hamburg

Post by BI-RMA »

pandeesh wrote: Use uniq -d, and if it returns any result, simply do not process the file.
But this only helps if the rows are exact duplicates of each other. To check for duplicate keys, see above. There are likely to be other solutions, as almost always in DataStage...
"It is not the happy who are grateful.
It is the grateful who are happy." Francis Bacon
FranklinE
Premium Member
Posts: 739
Joined: Tue Nov 25, 2008 2:19 pm
Location: Malvern, PA

Post by FranklinE »

Not intending this to be a good solution, but I needed something quickly and cooked up a way to use Transformer constraints to capture duplicates (which needed to be saved in a rejects file). I also check some fields for validity.

It took complete "coverage" in the constraints logic: failed edits to one link, duplicates to another link, and carefully written "reverse" logic for rows that pass and are to be processed in the next steps.
Franklin Evans
"Shared pain is lessened, shared joy increased. Thus do we refute entropy." -- Spider Robinson

Using mainframe data FAQ: viewtopic.php?t=143596 Using CFF FAQ: viewtopic.php?t=157872
sanjayS
Participant
Posts: 16
Joined: Mon Apr 18, 2011 10:56 pm

Post by sanjayS »

First sort the records on the key column, followed by a Transformer stage.

Then use stage variables and a constraint to remove the duplicates.

sv1 : Key column
sv2 : If sv1<>sv3 Then 1 Else 0
sv3 : sv1

use constraint as sv2 = 1
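The stage-variable pattern above relies on top-down evaluation: sv3 still holds the previous row's key when sv2 is computed. The same logic can be sketched in awk on a sorted file, with prev playing the role of sv3 (delimiter, key fields, and file name securities.txt are assumptions):

```shell
# Keep only the first row of each (field 1, field 2) key group,
# exactly like the sv1/sv2/sv3 constraint in the Transformer.
sort -t'|' -k1,2 securities.txt |
awk -F'|' '{ key = $1 FS $2          # sv1: this row'\''s key
             if (key != prev) print  # sv2 = 1: key changed, keep the row
             prev = key }'           # sv3: remember the key for the next row
```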

-----------------
sanjay
ShaneMuir
Premium Member
Posts: 508
Joined: Tue Jun 15, 2004 5:00 am
Location: London

Post by ShaneMuir »

Or just use a Sort stage: group and sort by the key and add a key change column to the output.

As the next stage you can use a transformer or a filter to check the keychange column. This column is set to 1 for the first row in each group where the value of the sort key changes. Subsequent records in the group have the column set to 0. You can output the respective rows as you require.
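What the Sort stage's key change column produces can be sketched in awk: a 1 appended to the first row of each key group and a 0 to the rest, which a downstream filter can test. The pipe delimiter, key fields, and file name securities.txt are assumptions for the illustration:

```shell
# Append a key-change flag as an extra column: 1 when the key differs
# from the previous row (first row of the group), 0 otherwise.
sort -t'|' -k1,2 securities.txt |
awk -F'|' '{ key = $1 FS $2
             print $0 FS (key != prev ? 1 : 0)
             prev = key }'
```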