How can identify invalid duplicates?

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
tbtcust
Premium Member
Premium Member
Posts: 230
Joined: Tue Mar 04, 2008 9:07 am

How can identify invalid duplicates?

Post by tbtcust »

I have a file that has a primary key that is duplicated. The file also has a field1 and field2. I need to capture the records that have the same primary key and field1 and/or field 2 is different. I tried searching for "capture duplicate" with no luck for this situation.

Example input
001 AAA
001 AAA
001 AAA
002 BBB
002 CCC
002 AAA

I'd like to have the following in an exception file.
002 BBB
002 CCC
002 AAA


Thanks in advance for any help
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Not sure what you mean by "invalid". What makes 002 BBB an exception, for example? Why 002 AAA rather than 001 AAA?

Does the Change Key column in the Sort stage suffice for a solution? If this is 0 then you have a duplicate.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

:? Confusing. Do you only want the records that are not duplicated within the file?
-craig

"You can never have too many knives" -- Logan Nine Fingers
tbtcust
Premium Member
Premium Member
Posts: 230
Joined: Tue Mar 04, 2008 9:07 am

Post by tbtcust »

Sorry for the confusion guys.

001 and 002 is in the keyed field1. AAA, BBB, and CCC are in field2. When field1 is a duplicate the data in field2 should have the same value.

I need to capture all the records that have duplicates in field1 where field2 is different. In the following example input:

001 AAA
001 AAA
001 AAA
002 BBB
002 CCC
002 AAA

I'd like to capture the following in an file
002 BBB
002 CCC
002 AAA
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

"When field1 is a duplicate the data in field2 should have the same value" - as what, the first key value found? :?

What would be your expected output for this input:

001 AAA
001 AAA
001 AAA
001 BBB
001 BBB
002 BBB
002 CCC
002 CCC
002 AAA
-craig

"You can never have too many knives" -- Logan Nine Fingers
tbtcust
Premium Member
Premium Member
Posts: 230
Joined: Tue Mar 04, 2008 9:07 am

Post by tbtcust »

chulett , the expected output is:
001 AAA
001 AAA
001 AAA
001 BBB
001 BBB
002 BBB
002 CCC
002 CCC
002 AAA

That is because there are 001s with AAA abd BBB. There are 002s with BBB, CCC, and AAA.
Kryt0n
Participant
Posts: 584
Joined: Wed Jun 22, 2005 7:28 pm

Post by Kryt0n »

tbtcust wrote:Sorry for the confusion guys.

001 and 002 is in the keyed field1. AAA, BBB, and CCC are in field2. When field1 is a duplicate the data in field2 should have the same value.

I need to capture all the records that have duplicates in field1 where field2 is different. In the following example input:

001 AAA
001 AAA
001 AAA
002 BBB
002 CCC
002 AAA

I'd like to capture the following in an file
002 BBB
002 CCC
002 AAA
The way that first springs to mind...

Sort data by field1,field2
In transform stage, set up sv's to compare
if rowX(field1,field2) <> row(x-1)(field1,field2) - send down stream 1
if rowX(field1,field2) = row(x-1)(field1,field2) - send down stream 2

Use stream2 as a lookup in to stream1, if found in stream2, drop the record. With more thought, there may be something better to do but it's a start...
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Two Sort stages, each generating "change" columns. The first sorts by field1, the second by field1 ("don't sort, already sorted") and field2. The first generates a Key Change column, the second generates a Cluster Key Change column. The combination of these can be used to filter your rows.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
manu_20lko
Participant
Posts: 1
Joined: Mon Mar 02, 2009 2:48 am

Post by manu_20lko »

If you are using a version 8.1 than
SCD stage can solve your problem.

in SCD stage define both field1 and field2 as business keys and rest of keys if any as Type1 fields ....

SCD stage will keep the unique combination of both fields
Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Post by Sainath.Srinivasan »

Try

Code: Select all

sort -u <YourFile> | cut -f1 | uniq -c | grep -v ' 1 '
and use the output as a lookup to identify duplicates.
tbtcust
Premium Member
Premium Member
Posts: 230
Joined: Tue Mar 04, 2008 9:07 am

Post by tbtcust »

Thank you all for your suggestions.

ray, I use the 2 sort stages and a transform to generate a lookup file. That lookup file is used to filter out the records that I need.
Post Reply