How can I identify invalid duplicates?
I have a file whose primary key is duplicated. Each record also has a field1 and a field2. I need to capture the records that share the same primary key but where field1 and/or field2 differs. I tried searching for "capture duplicate" with no luck for this situation.
Example input
001 AAA
001 AAA
001 AAA
002 BBB
002 CCC
002 AAA
I'd like to have the following in an exception file.
002 BBB
002 CCC
002 AAA
Thanks in advance for any help
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Not sure what you mean by "invalid". What makes 002 BBB an exception, for example? Why 002 AAA rather than 001 AAA?
Does the Change Key column in the Sort stage suffice for a solution? If this is 0 then you have a duplicate.
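Not DataStage, but the Key Change idea can be sketched in plain Python against the sample data from this thread (a value of 0 marks an exact duplicate of the previous sorted row):

```python
# Emulate the Sort stage's Key Change column: after sorting, flag each
# row 1 if it differs from the previous row, 0 if it is an exact duplicate.
rows = [("001", "AAA"), ("001", "AAA"), ("001", "AAA"),
        ("002", "BBB"), ("002", "CCC"), ("002", "AAA")]

rows.sort()                      # sort by field1, then field2
key_change = []
prev = None
for row in rows:
    key_change.append(0 if row == prev else 1)   # 0 = duplicate of previous
    prev = row

print(list(zip(rows, key_change)))
```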
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Sorry for the confusion guys.
001 and 002 are in the keyed field1; AAA, BBB, and CCC are in field2. When field1 is duplicated, field2 should have the same value in every record.
I need to capture all the records that have duplicates in field1 where field2 differs. For the following example input:
001 AAA
001 AAA
001 AAA
002 BBB
002 CCC
002 AAA
I'd like to capture the following in a file:
002 BBB
002 CCC
002 AAA
"When field1 is a duplicate the data in field2 should have the same value" - as what, the first key value found?
What would be your expected output for this input:
001 AAA
001 AAA
001 AAA
001 BBB
001 BBB
002 BBB
002 CCC
002 CCC
002 AAA
-craig
"You can never have too many knives" -- Logan Nine Fingers
"You can never have too many knives" -- Logan Nine Fingers
The way that first springs to mind...
tbtcust wrote: Sorry for the confusion guys.
001 and 002 is in the keyed field1. AAA, BBB, and CCC are in field2. When field1 is a duplicate the data in field2 should have the same value.
I need to capture all the records that have duplicates in field1 where field2 is different. In the following example input:
001 AAA
001 AAA
001 AAA
002 BBB
002 CCC
002 AAA
I'd like to capture the following in an file
002 BBB
002 CCC
002 AAA
Sort the data by field1, field2.
In a Transformer stage, set up stage variables to compare each row with the previous one:
if row(x).(field1, field2) <> row(x-1).(field1, field2), send down stream 1
if row(x).(field1, field2) = row(x-1).(field1, field2), send down stream 2
Use stream2 as a lookup into stream1; if a record is found in stream2, drop it. With more thought there may be something better to do, but it's a start...
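A plain-Python sketch (not DataStage) of the split-and-lookup approach just described, using this thread's sample data:

```python
# Sort, then split: the first occurrence of each (field1, field2) pair goes
# to stream1, exact repeats go to stream2; finally drop from stream1 any
# pair that also appeared in stream2.
rows = [("001", "AAA"), ("001", "AAA"), ("001", "AAA"),
        ("002", "BBB"), ("002", "CCC"), ("002", "AAA")]
rows.sort()                                   # sort by field1, field2

stream1, stream2 = [], []
prev = None
for row in rows:
    (stream2 if row == prev else stream1).append(row)
    prev = row

dupes = set(stream2)                          # pairs seen more than once
exceptions = [row for row in stream1 if row not in dupes]
print(exceptions)
```

Note the caveat the post itself raises: a key that appears only once also survives the lookup, so an extra filter on keys with multiple rows may still be needed.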
Two Sort stages, each generating "change" columns. The first sorts by field1, the second by field1 ("don't sort, already sorted") and field2. The first generates a Key Change column, the second generates a Cluster Key Change column. The combination of these can be used to filter your rows.
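A rough Python equivalent of the two-sort idea (the flag names mirror the Sort stage's Key Change and Cluster Key Change columns; this is a sketch, not the DataStage implementation):

```python
# First pass flags a change in field1 (Key Change); the second flags a
# change in the (field1, field2) pair (Cluster Key Change). A row with
# key_change == 0 and cluster_change == 1 means: same key, new field2
# value, so every row for that key goes to the exception output.
rows = [("001", "AAA"), ("001", "AAA"), ("001", "AAA"),
        ("002", "BBB"), ("002", "CCC"), ("002", "AAA")]
rows.sort()                                   # by field1, then field2

prev_key, prev_pair = None, None
bad_keys = set()
for key, value in rows:
    key_change = 0 if key == prev_key else 1
    cluster_change = 0 if (key, value) == prev_pair else 1
    if key_change == 0 and cluster_change == 1:
        bad_keys.add(key)
    prev_key, prev_pair = key, (key, value)

exceptions = [row for row in rows if row[0] in bad_keys]
print(exceptions)
```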
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Participant
- Posts: 3337
- Joined: Mon Jan 17, 2005 4:49 am
- Location: United Kingdom
Try

Code: Select all
sort -u <YourFile> | cut -d' ' -f1 | uniq -c | grep -v ' 1 '

and use the output as a lookup to identify duplicates. Note that cut needs -d' ' here, since cut defaults to tab-delimited fields and the sample data is space-delimited.
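For illustration, here is the pipeline run over this thread's sample data (the file path /tmp/sample.txt is just for this demo):

```shell
# Build the sample file from this thread (demo path; use your own file).
printf '%s\n' '001 AAA' '001 AAA' '001 AAA' \
              '002 BBB' '002 CCC' '002 AAA' > /tmp/sample.txt

# sort -u collapses exact duplicate lines, so any key still occurring on
# more than one line must have differing field2 values; uniq -c counts
# lines per key and grep -v ' 1 ' drops the keys occurring exactly once.
# For this data it lists key 002 with a count of 3.
sort -u /tmp/sample.txt | cut -d' ' -f1 | uniq -c | grep -v ' 1 '
```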