
Catching all duplicate and non-duplicate records separately

Posted: Tue Jul 17, 2007 11:50 am
by somu_june
Hi,

I have to capture the records that have duplicates into one file and the non-duplicate records into another file. For example:


ProductID  Country  Currency  CH  Price
1          FR       EUR       00  50
1          FR       EUR       00  50
1          FR       EUR       00  50
1          UK       EUR       00  50
1          GR       EUR       00  100

Now I want all the FR country records in one file (that is, all 3 records in one file) and the UK and GR records in another file. Please tell me, is there a way to do that?


Thanks,
SomaRaju

Posted: Tue Jul 17, 2007 4:16 pm
by ray.wurlod
Use a fork-join design, with an Aggregator stage to count the occurrences of each grouping-key combination. Then run all the rows through two streams: one with a Remove Duplicates stage, the other with a Filter stage on the count being >= 2.
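
For illustration only, here is a rough shell analogue of that idea (the actual job uses DataStage stages, not shell commands). input.txt is an assumed sample file with one record per line, and the whole line stands in for the grouping key.

Code: Select all

# count each grouping-key combination, then keep only the
# groups that occur at least twice
sort input.txt | uniq -c | awk '$1 >= 2'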

Posted: Tue Jul 17, 2007 8:09 pm
by JoshGeorge
Stream your output to 2 output files, and in the 'non-duplicate records' file stage put the code below in the filter:

Code: Select all

sort | uniq -u
If the records are already sorted, then you can use just:

Code: Select all

uniq -u 
Now in the 'duplicate records' file stage, put the code below in the filter:

Code: Select all

sort | uniq -dc
If the records are already sorted, then you can use just:

Code: Select all

uniq -dc
Note: the 'duplicate records' file will have, as its first column, the number of occurrences of that duplicate record.
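
As an illustration, with the sample data from the original post saved as input.txt (an assumed file name), the two filters would produce output along these lines:

Code: Select all

$ sort input.txt | uniq -dc
      3 1 FR EUR 00 50
$ sort input.txt | uniq -u
1 GR EUR 00 100
1 UK EUR 00 50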

Posted: Wed Jul 18, 2007 1:22 pm
by bucks
You can use the Sort stage to filter the duplicates into two different files based on the change keys,
but the solutions above seem good.

Posted: Wed Jul 18, 2007 2:28 pm
by us1aslam1us
bucks wrote: You can use the Sort stage to filter the duplicates into two different files based on the change keys, but the solutions above seem good.
I don't think that will work, as the OP needs to separate the duplicate data from the unique records. Josh's approach is simple and clear to implement, but only in sequential mode.

Posted: Thu Jul 19, 2007 10:22 pm
by sas
Put a Copy stage after reading the data. One output of the Copy should carry only the key columns for the Aggregator stage; the Aggregator will give the count of each key combination along with the key columns. Use the second output of the Copy as an input (left link) to a Join, along with the Aggregator output. The Join output should then feed a Transformer, which examines the number of occurrences (coming from the Aggregator). Based on the occurrences, write the data to 2 outputs.
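
As a rough sketch of this flow in shell terms (the real job uses DataStage stages; input.txt is an assumed file name, and the whole line stands in for the key columns), the first read of the file counts each key like the Aggregator, and the second read routes each row by its count like the Transformer:

Code: Select all

awk 'NR==FNR      {cnt[$0]++; next}            # pass 1: count each key (Aggregator)
     cnt[$0] >= 2 {print > "duplicates.txt"}   # pass 2: route rows with duplicates
     cnt[$0] == 1 {print > "uniques.txt"}' input.txt input.txt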

Posted: Thu Jul 19, 2007 11:32 pm
by ray.wurlod
My solution works in parallel mode.

The Aggregator counts the members of each group. The Filter stage passes only those rows for which the count is greater than 1.

Code: Select all

  ---->  Copy  ---------------------> Join ----->  Copy  ----->  RemDups ----->
           |                            ^           |
           +------->  Aggregator -------+           +----------> Filter  ----->