How can identify invalid duplicates?

tbtcust · Post by **tbtcust** » Wed Feb 25, 2009 2:50 pm

I have a file that has a primary key that is duplicated. The file also has a field1 and field2. I need to capture the records that have the same primary key and field1 and/or field 2 is different. I tried searching for "capture duplicate" with no luck for this situation.

Example input
001 AAA
001 AAA
001 AAA
002 BBB
002 CCC
002 AAA

I'd like to have the following in an exception file.
002 BBB
002 CCC
002 AAA

Thanks in advance for any help

ray.wurlod · Post by **ray.wurlod** » Wed Feb 25, 2009 3:44 pm

Not sure what you mean by "invalid". What makes 002 BBB an exception, for example? Why 002 AAA rather than 001 AAA?

Does the Change Key column in the Sort stage suffice for a solution? If this is 0 then you have a duplicate.

chulett · Post by **chulett** » Wed Feb 25, 2009 3:54 pm

Confusing. Do you only want the records that are not duplicated within the file?

tbtcust · Post by **tbtcust** » Wed Feb 25, 2009 4:19 pm

Sorry for the confusion guys.

001 and 002 is in the keyed field1. AAA, BBB, and CCC are in field2. When field1 is a duplicate the data in field2 should have the same value.

I need to capture all the records that have duplicates in field1 where field2 is different. In the following example input:

001 AAA
001 AAA
001 AAA
002 BBB
002 CCC
002 AAA

I'd like to capture the following in an file
002 BBB
002 CCC
002 AAA

chulett · Post by **chulett** » Wed Feb 25, 2009 4:27 pm

"When field1 is a duplicate the data in field2 should have the same value" - as what, the first key value found?

What would be your expected output for this input:

001 AAA
001 AAA
001 AAA
001 BBB
001 BBB
002 BBB
002 CCC
002 CCC
002 AAA

tbtcust · Post by **tbtcust** » Wed Feb 25, 2009 4:49 pm

chulett , the expected output is:
001 AAA
001 AAA
001 AAA
001 BBB
001 BBB
002 BBB
002 CCC
002 CCC
002 AAA

That is because there are 001s with AAA abd BBB. There are 002s with BBB, CCC, and AAA.

Kryt0n · Post by **Kryt0n** » Wed Feb 25, 2009 5:16 pm

tbtcust wrote:Sorry for the confusion guys.

001 and 002 is in the keyed field1. AAA, BBB, and CCC are in field2. When field1 is a duplicate the data in field2 should have the same value.

I need to capture all the records that have duplicates in field1 where field2 is different. In the following example input:

001 AAA
001 AAA
001 AAA
002 BBB
002 CCC
002 AAA

I'd like to capture the following in an file
002 BBB
002 CCC
002 AAA

The way that first springs to mind...

Sort data by field1,field2
In transform stage, set up sv's to compare
if rowX(field1,field2) <> row(x-1)(field1,field2) - send down stream 1
if rowX(field1,field2) = row(x-1)(field1,field2) - send down stream 2

Use stream2 as a lookup in to stream1, if found in stream2, drop the record. With more thought, there may be something better to do but it's a start...

ray.wurlod · Post by **ray.wurlod** » Wed Feb 25, 2009 6:01 pm

Two Sort stages, each generating "change" columns. The first sorts by field1, the second by field1 ("don't sort, already sorted") and field2. The first generates a Key Change column, the second generates a Cluster Key Change column. The combination of these can be used to filter your rows.

manu_20lko · Post by **manu_20lko** » Mon Mar 02, 2009 5:09 am

If you are using a version 8.1 than
SCD stage can solve your problem.

in SCD stage define both field1 and field2 as business keys and rest of keys if any as Type1 fields ....

SCD stage will keep the unique combination of both fields

Sainath.Srinivasan · Post by **Sainath.Srinivasan** » Mon Mar 02, 2009 6:56 am

Try

Code: Select all

sort -u <YourFile> | cut -f1 | uniq -c | grep -v ' 1 '

and use the output as a lookup to identify duplicates.

tbtcust · Post by **tbtcust** » Tue Mar 03, 2009 5:57 pm

Thank you all for your suggestions.

ray, I use the 2 sort stages and a transform to generate a lookup file. That lookup file is used to filter out the records that I need.