Catching all duplicate and non-duplicate records separately

somu_june
Premium Member
Posts: 439
Joined: Wed Sep 14, 2005 9:28 am
Location: 36p,reading road

Catching all duplicate and non-duplicate records separately

Post by somu_june »

Hi,

I have to capture the records that have duplicates into one file and the non-duplicate records into another file. For example:


ProductID  Country  Currency  CH  Price
1          FR       EUR       00  50
1          FR       EUR       00  50
1          FR       EUR       00  50
1          UK       EUR       00  50
1          GR       EUR       00  100

Now I want all the FR country records in one file, I mean all 3 records into one file, and the UK and GR records into another file. Please tell me if there is a way to do that.


Thanks,
SomaRaju
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Run a fork-join design, using an Aggregator to count the occurrences of each grouping-key combination. Run them all through two streams: one with a Remove Duplicates stage, the other with a Filter stage on the count being >= 2.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
JoshGeorge
Participant
Posts: 612
Joined: Thu May 03, 2007 4:59 am
Location: Melbourne

Post by JoshGeorge »

Stream your output to two output files and, in the 'non-duplicate records' file stage, put the code below in the filter:

Code: Select all

sort | uniq -u
If the records are already sorted, then you can use just

Code: Select all

uniq -u 
Now, in the 'duplicate records' file stage, put the code below in the filter:

Code: Select all

sort | uniq -dc
If the records are already sorted, then you can use just

Code: Select all

uniq -dc
Note: the 'duplicate records' file will have, as its first column, the number of occurrences of that duplicate record.
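
For illustration, here is a minimal sketch of what those filters produce for the sample rows in the original post, assuming they sit one record per line in a file called sample.txt (a made-up name):

Code: Select all

 # non-duplicate records
 sort sample.txt | uniq -u
 #   1 GR EUR 00 100
 #   1 UK EUR 00 50

 # duplicate records, prefixed with their occurrence count
 sort sample.txt | uniq -dc
 #         3 1 FR EUR 00 50

Keep in mind that uniq -dc prints one line per duplicated group with its count rather than every duplicate row; on GNU systems uniq -D lists every duplicated row, but it cannot be combined with -c.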
Joshy George
bucks
Premium Member
Posts: 21
Joined: Tue Feb 20, 2007 1:31 pm

Post by bucks »

You can use the Sort stage to filter out the duplicates into two different files based on the change keys, but the solutions above seem to be good.
us1aslam1us
Charter Member
Posts: 822
Joined: Sat Sep 17, 2005 5:25 pm
Location: USA

Post by us1aslam1us »

bucks wrote: You can use the Sort stage to filter out the duplicates into two different files based on the change keys, but the solutions above seem to be good.
I don't think that will work, as the OP needs to separate the duplicate data from the unique records. Josh's explanation is simple and clear to implement, but it runs in sequential mode.
I haven't failed, I've found 10,000 ways that don't work.
Thomas Alva Edison (1847-1931)
sas
Charter Member
Posts: 13
Joined: Sun Mar 05, 2006 12:37 am

Post by sas »

Put a Copy stage after reading the data. One output of the Copy should carry only the key columns for the Aggregator stage. The Aggregator will give the count of records in each key group, along with the key columns. Use the second output of the Copy as an input (left link) to a Join, along with the Aggregator output. The Join output should be used as an input to a Transformer to determine the number of occurrences (coming from the Aggregator). Based on the occurrences, write the data to 2 outputs.
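
Outside DataStage, the same count-then-route idea can be sketched as a two-pass awk command. This is only an illustration under assumptions: input.txt, duplicates.txt and uniques.txt are made-up file names, the data is space-delimited with no header line, and ProductID plus Country stand in for whatever the real grouping key is.

Code: Select all

 # Pass 1 builds a count per key (ProductID + Country); pass 2 routes each
 # record on that count, which is what the Transformer does in the design above.
 awk 'NR==FNR { count[$1,$2]++; next }
      count[$1,$2] > 1 { print > "duplicates.txt"; next }
                       { print > "uniques.txt" }' input.txt input.txt

In the parallel job itself, the Transformer (or a Filter stage) applies the equivalent constraints, for example count = 1 on one output link and count > 1 on the other.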
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

My solution works in parallel mode.

The Aggregator counts the members of each group. The filter stage only passes rows for which the count is greater than 1.

Code: Select all

  ---->  Copy  ---------------------> Join ----->  Copy  ----->  RemDups ----->
           |                            ^           |
           +------->  Aggregator -------+           +----------> Filter  ----->
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.