Page 1 of 1

How can identify invalid duplicates?

Posted: Wed Feb 25, 2009 2:50 pm
by tbtcust
I have a file that has a primary key that is duplicated. The file also has a field1 and field2. I need to capture the records that have the same primary key and field1 and/or field 2 is different. I tried searching for "capture duplicate" with no luck for this situation.

Example input
001 AAA
001 AAA
001 AAA
002 BBB
002 CCC
002 AAA

I'd like to have the following in an exception file.
002 BBB
002 CCC
002 AAA


Thanks in advance for any help

Posted: Wed Feb 25, 2009 3:44 pm
by ray.wurlod
Not sure what you mean by "invalid". What makes 002 BBB an exception, for example? Why 002 AAA rather than 001 AAA?

Does the Change Key column in the Sort stage suffice for a solution? If this is 0 then you have a duplicate.

Posted: Wed Feb 25, 2009 3:54 pm
by chulett
:? Confusing. Do you only want the records that are not duplicated within the file?

Posted: Wed Feb 25, 2009 4:19 pm
by tbtcust
Sorry for the confusion guys.

001 and 002 is in the keyed field1. AAA, BBB, and CCC are in field2. When field1 is a duplicate the data in field2 should have the same value.

I need to capture all the records that have duplicates in field1 where field2 is different. In the following example input:

001 AAA
001 AAA
001 AAA
002 BBB
002 CCC
002 AAA

I'd like to capture the following in an file
002 BBB
002 CCC
002 AAA

Posted: Wed Feb 25, 2009 4:27 pm
by chulett
"When field1 is a duplicate the data in field2 should have the same value" - as what, the first key value found? :?

What would be your expected output for this input:

001 AAA
001 AAA
001 AAA
001 BBB
001 BBB
002 BBB
002 CCC
002 CCC
002 AAA

Posted: Wed Feb 25, 2009 4:49 pm
by tbtcust
chulett , the expected output is:
001 AAA
001 AAA
001 AAA
001 BBB
001 BBB
002 BBB
002 CCC
002 CCC
002 AAA

That is because there are 001s with AAA abd BBB. There are 002s with BBB, CCC, and AAA.

Posted: Wed Feb 25, 2009 5:16 pm
by Kryt0n
tbtcust wrote:Sorry for the confusion guys.

001 and 002 is in the keyed field1. AAA, BBB, and CCC are in field2. When field1 is a duplicate the data in field2 should have the same value.

I need to capture all the records that have duplicates in field1 where field2 is different. In the following example input:

001 AAA
001 AAA
001 AAA
002 BBB
002 CCC
002 AAA

I'd like to capture the following in an file
002 BBB
002 CCC
002 AAA
The way that first springs to mind...

Sort data by field1,field2
In transform stage, set up sv's to compare
if rowX(field1,field2) <> row(x-1)(field1,field2) - send down stream 1
if rowX(field1,field2) = row(x-1)(field1,field2) - send down stream 2

Use stream2 as a lookup in to stream1, if found in stream2, drop the record. With more thought, there may be something better to do but it's a start...

Posted: Wed Feb 25, 2009 6:01 pm
by ray.wurlod
Two Sort stages, each generating "change" columns. The first sorts by field1, the second by field1 ("don't sort, already sorted") and field2. The first generates a Key Change column, the second generates a Cluster Key Change column. The combination of these can be used to filter your rows.

Posted: Mon Mar 02, 2009 5:09 am
by manu_20lko
If you are using a version 8.1 than
SCD stage can solve your problem.

in SCD stage define both field1 and field2 as business keys and rest of keys if any as Type1 fields ....

SCD stage will keep the unique combination of both fields

Posted: Mon Mar 02, 2009 6:56 am
by Sainath.Srinivasan
Try

Code: Select all

sort -u <YourFile> | cut -f1 | uniq -c | grep -v ' 1 '
and use the output as a lookup to identify duplicates.

Posted: Tue Mar 03, 2009 5:57 pm
by tbtcust
Thank you all for your suggestions.

ray, I use the 2 sort stages and a transform to generate a lookup file. That lookup file is used to filter out the records that I need.