Check Duplicate without using Remove Duplicates

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

karthi_gana
Premium Member
Posts: 729
Joined: Tue Apr 28, 2009 10:49 pm

Check Duplicate without using Remove Duplicates

Post by karthi_gana »

All,

We receive a security return file from an external vendor on a monthly basis. If the file contains duplicates (key columns: security_no, month_end_date), we don't want to process the file at all. We need to ensure that the file doesn't have even a single duplicate. I think this is a kind of quality check. Do I need to use QualityStage / Standardize / Investigate? Do I need to create a new rule set? I don't know how or where to start. Can anybody help me?
Karthik
arvind_ds
Participant
Posts: 428
Joined: Thu Aug 16, 2007 11:38 pm
Location: Manali

Post by arvind_ds »

You can also use the Sort stage in DataStage parallel jobs to remove duplicates.
Arvind
BI-RMA
Premium Member
Posts: 463
Joined: Sun Nov 01, 2009 3:55 pm
Location: Hamburg

Re: Check Duplicate without using Remove Duplicates

Post by BI-RMA »

Use an Aggregator and group by your key columns. If any group has a count > 1, you can evaluate that as an error. For example, use a Lookup stage, look up the fixed value 1 against the count from the Aggregator, and keep the default setting for Lookup Failure as 'Fail': if there is a duplicate key in the file, the job will abort.

Within the Sequence calling this job, you should then set the execution action for the job to "Reset if required, then run", because you make the job abort under certain conditions defined by business rules.
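The same count-per-key check can be sketched outside DataStage with awk. This is only an illustration, assuming a pipe-delimited file whose first two fields are security_no and month_end_date; the file name securities.txt is made up:

```shell
# Count rows per (security_no, month_end_date) key, like the Aggregator.
# If any key occurs more than once, report it and exit non-zero, which
# mirrors the job aborting on Lookup Failure = 'Fail'.
awk -F'|' '{ cnt[$1 FS $2]++ }
           END { for (k in cnt)
                     if (cnt[k] > 1) { print "duplicate key:", k; bad = 1 }
                 exit bad }' securities.txt \
  && echo "file is clean" \
  || echo "duplicates found - do not process"
```

The exit status carries the result, so the same one-liner could drive a before-job check as well.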
"It is not the happy who are grateful.
It is the grateful who are happy." Francis Bacon
vishal_rastogi
Participant
Posts: 47
Joined: Thu Dec 09, 2010 4:37 am

Post by vishal_rastogi »

With the help of stage variables in a Transformer you can remove the duplicates.
Vish
SURA
Premium Member
Posts: 1229
Joined: Sat Jul 14, 2007 5:16 am
Location: Sydney

Re: Check Duplicate without using Remove Duplicates

Post by SURA »

You can use the sed / sort Unix commands in a before-job routine and remove them.
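One way this could look in a before-job ExecSH command, as a sketch only: compare the line count with and without sort -u restricted to the key fields. The delimiter, key positions, and file name securities.txt are assumptions to adjust to the real layout:

```shell
# Sort on the key columns (fields 1-2 of a pipe-delimited file), then
# count lines with and without -u (unique on the sort key). A difference
# means duplicate keys exist.
total=$(sort -t'|' -k1,2 securities.txt | wc -l)
unique=$(sort -t'|' -k1,2 -u securities.txt | wc -l)
if [ "$total" -ne "$unique" ]; then
    echo "duplicate keys found" >&2
    exit 1    # a non-zero return from a before-routine stops the job
fi
```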

DS User
pandeesh
Premium Member
Posts: 1399
Joined: Sun Oct 24, 2010 5:15 am
Location: CHENNAI, TAMIL NADU

Post by pandeesh »

Use uniq -d, and if it returns any result, simply do not process the file.
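A minimal sketch of this check (the file name securities.txt is made up). Note that uniq only detects adjacent duplicates, so the file must be sorted first:

```shell
# uniq -d prints only lines that are repeated; any output at all means
# the file has exact duplicate rows and should be skipped.
if [ -n "$(sort securities.txt | uniq -d)" ]; then
    echo "exact duplicate rows found - do not process the file"
fi
```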
pandeeswaran
BI-RMA
Premium Member
Posts: 463
Joined: Sun Nov 01, 2009 3:55 pm
Location: Hamburg

Post by BI-RMA »

pandeesh wrote: Use uniq -d, and if it returns any result, simply do not process the file.
But this only helps if the rows are exact duplicates of each other. To check for duplicate keys, see above. There are likely to be other solutions, as almost always in DataStage...
"It is not the happy who are grateful.
It is the grateful who are happy." Francis Bacon
FranklinE
Premium Member
Posts: 739
Joined: Tue Nov 25, 2008 2:19 pm
Location: Malvern, PA

Post by FranklinE »

Not intending this to be a good solution, but I needed something quickly and cooked up a way to use Transformer constraints to capture duplicates (which needed to be saved in a rejects file). I also check some fields for validity.

It took complete "coverage" in the constraints logic: failed edits to one link, duplicates to another link, and carefully written "reverse" logic for rows that pass and are to be processed in the next steps.
Franklin Evans
"Shared pain is lessened, shared joy increased. Thus do we refute entropy." -- Spider Robinson

Using mainframe data FAQ: viewtopic.php?t=143596 Using CFF FAQ: viewtopic.php?t=157872
sanjayS
Participant
Posts: 16
Joined: Mon Apr 18, 2011 10:56 pm

Post by sanjayS »

First sort the records on the key column, followed by a Transformer stage.

Then use stage variables and a constraint to remove the duplicates.

sv1 : Key column
sv2 : If sv1<>sv3 Then 1 Else 0
sv3 : sv1

use constraint as sv2 = 1
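The stage-variable pattern above relies on top-down evaluation: sv3 still holds the previous row's key when sv2 is computed. The same logic can be sketched in awk on a sorted file, with prev playing the role of sv3 (delimiter, key fields, and file name securities.txt are assumptions):

```shell
# Keep only the first row of each (field 1, field 2) key group,
# exactly like the sv1/sv2/sv3 constraint in the Transformer.
sort -t'|' -k1,2 securities.txt |
awk -F'|' '{ key = $1 FS $2          # sv1: this row'\''s key
             if (key != prev) print  # sv2 = 1: key changed, keep the row
             prev = key }'           # sv3: remember the key for the next row
```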

-----------------
sanjay
ShaneMuir
Premium Member
Posts: 508
Joined: Tue Jun 15, 2004 5:00 am
Location: London

Post by ShaneMuir »

Or just use a Sort stage: group and sort by the key and add a key change column to the output.

As the next stage you can use a transformer or a filter to check the keychange column. This column is set to 1 for the first row in each group where the value of the sort key changes. Subsequent records in the group have the column set to 0. You can output the respective rows as you require.
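What the Sort stage's key change column produces can be sketched in awk: a 1 appended to the first row of each key group and a 0 to the rest, which a downstream filter can test. The pipe delimiter, key fields, and file name securities.txt are assumptions for the illustration:

```shell
# Append a key-change flag as an extra column: 1 when the key differs
# from the previous row (first row of the group), 0 otherwise.
sort -t'|' -k1,2 securities.txt |
awk -F'|' '{ key = $1 FS $2
             print $0 FS (key != prev ? 1 : 0)
             prev = key }'
```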