Catching all duplicate and non-duplicate records separately

somu_june
Premium Member
Posts: 439
Joined: Wed Sep 14, 2005 9:28 am
Location: 36p,reading road

Catching all duplicate and non-duplicate records separately

Post by somu_june »

Hi,

I have to capture the records that have duplicates into one file and the non-duplicate records into another file. For example:


ProductID  Country  Currency  CH  Price
1          FR       EUR       00  50
1          FR       EUR       00  50
1          FR       EUR       00  50
1          UK       EUR       00  50
1          GR       EUR       00  100

Now I want all the FR country records in one file, I mean all 3 records into one file, and the UK and GR records into another file. Please tell me if there is a way to do that.


Thanks,
SomaRaju
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Run a fork-join design, using an Aggregator to count the occurrences of each grouping-key combination. Run them all through two streams: one with a Remove Duplicates stage, the other with a Filter stage on the count being >= 2.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
JoshGeorge
Participant
Posts: 612
Joined: Thu May 03, 2007 4:59 am
Location: Melbourne

Post by JoshGeorge »

Stream your output to two output files and, in the 'non-duplicate records' file stage, put the code below in the filter:

Code: Select all

sort | uniq -u
If the records are already sorted, then you can use just

Code: Select all

uniq -u 
Now, in the 'duplicate records' file stage, put the code below in the filter:

Code: Select all

sort | uniq -dc
If the records are already sorted, then you can use just

Code: Select all

uniq -dc
Note: the 'duplicate records' file will have, as its first column, the number of occurrences of that duplicate record.
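
For illustration, here is a minimal sketch of what those filters produce for the sample rows in the original post, assuming they sit one record per line in a file called sample.txt (a made-up name):

Code: Select all

 # non-duplicate records
 sort sample.txt | uniq -u
 #   1 GR EUR 00 100
 #   1 UK EUR 00 50

 # duplicate records, prefixed with their occurrence count
 sort sample.txt | uniq -dc
 #         3 1 FR EUR 00 50

Keep in mind that uniq -dc prints one line per duplicated group with its count rather than every duplicate row; on GNU systems uniq -D lists every duplicated row, but it cannot be combined with -c.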
Joshy George
bucks
Premium Member
Posts: 21
Joined: Tue Feb 20, 2007 1:31 pm

Post by bucks »

You can use the Sort stage to filter out the duplicates into two different files based on the change keys, but the solutions above seem to be good.
us1aslam1us
Charter Member
Posts: 822
Joined: Sat Sep 17, 2005 5:25 pm
Location: USA

Post by us1aslam1us »

bucks wrote: You can use the Sort stage to filter out the duplicates into two different files based on the change keys, but the solutions above seem to be good.
I don't think that will work, as the OP needs to separate the duplicate data from the unique records. Josh's explanation is simple and clear to implement, but it runs in sequential mode.
I haven't failed, I've found 10,000 ways that don't work.
Thomas Alva Edison (1847-1931)
sas
Charter Member
Posts: 13
Joined: Sun Mar 05, 2006 12:37 am

Post by sas »

Put a Copy stage after reading the data. One output of the Copy should carry only the key columns for the Aggregator stage. The Aggregator will give the count of records in each key group, along with the key columns. Use the second output of the Copy as an input (left link) to a Join, along with the Aggregator output. The Join output should be used as an input to a Transformer to determine the number of occurrences (coming from the Aggregator). Based on the occurrences, write the data to 2 outputs.
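
Outside DataStage, the same count-then-route idea can be sketched as a two-pass awk command. This is only an illustration under assumptions: input.txt, duplicates.txt and uniques.txt are made-up file names, the data is space-delimited with no header line, and ProductID plus Country stand in for whatever the real grouping key is.

Code: Select all

 # Pass 1 builds a count per key (ProductID + Country); pass 2 routes each
 # record on that count, which is what the Transformer does in the design above.
 awk 'NR==FNR { count[$1,$2]++; next }
      count[$1,$2] > 1 { print > "duplicates.txt"; next }
                       { print > "uniques.txt" }' input.txt input.txt

In the parallel job itself, the Transformer (or a Filter stage) applies the equivalent constraints, for example count = 1 on one output link and count > 1 on the other.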
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

My solution works in parallel mode.

The Aggregator counts the members of each group. The filter stage only passes rows for which the count is greater than 1.

Code: Select all

  ---->  Copy  ---------------------> Join ----->  Copy  ----->  RemDups ----->
           |                            ^           |
           +------->  Aggregator -------+           +----------> Filter  ----->
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.