Hi,
I have to capture the records that have duplicates into one file and the non-duplicate records into another file. For example:
ProductID  Country  Currency  CH  Price
1          FR       EUR       00  50
1          FR       EUR       00  50
1          FR       EUR       00  50
1          UK       EUR       00  50
1          GR       EUR       00  100
Now I want all the FR country records (all 3 of them) in one file, and the UK and GR records in another file. Please tell me, is there a way to do that?
Thanks,
SomaRaju
Catching all duplicate and non duplicate records separately
Run a fork-join design, using an Aggregator to count the occurrences of each grouping key combination. Run all the rows through two streams: one with a Remove Duplicates stage, the other with a Filter stage on the count being >= 2.
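Not DataStage, but a minimal Python sketch of the fork-join logic described above, using the sample data from the question (column names and values are taken from the original post; the `Counter` stands in for the Aggregator):

```python
from collections import Counter

# Sample rows from the original post: (ProductID, Country, Currency, CH, Price).
rows = [
    ("1", "FR", "EUR", "00", "50"),
    ("1", "FR", "EUR", "00", "50"),
    ("1", "FR", "EUR", "00", "50"),
    ("1", "UK", "EUR", "00", "50"),
    ("1", "GR", "EUR", "00", "100"),
]

def key(r):
    # Grouping key: every column (the whole record defines a duplicate here).
    return r

# "Aggregator": count occurrences of each grouping key combination.
counts = Counter(key(r) for r in rows)

# Fork into two streams: all copies of duplicated keys vs. singletons.
duplicates = [r for r in rows if counts[key(r)] >= 2]
non_duplicates = [r for r in rows if counts[key(r)] == 1]
```

With the sample data, `duplicates` holds the three FR rows and `non_duplicates` holds the UK and GR rows, matching the split the original poster asked for.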
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Stream your output to 2 output files. In the 'non duplicate records' file stage, put the below code in the filter:
Code: Select all
sort | uniq -u
If the records are already sorted, you can use just:
Code: Select all
uniq -u
Now in the 'duplicate records' file stage, put the below code in the filter:
Code: Select all
sort | uniq -dc
If the records are already sorted, you can use just:
Code: Select all
uniq -dc
Note: the 'duplicate records' file will have the number of occurrences of each duplicate record as its first column.
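To make the behaviour of those two filters concrete, here is a small Python sketch of what `uniq -u` and `uniq -dc` produce on the sample records (a simplification: real `uniq -dc` right-pads the count, and `uniq` only compares adjacent lines, which is why the `sort` is needed on unsorted input):

```python
from collections import Counter

# The sorted input lines, one whole record per line.
lines = [
    "1 FR EUR 00 50",
    "1 FR EUR 00 50",
    "1 FR EUR 00 50",
    "1 UK EUR 00 50",
    "1 GR EUR 00 100",
]

counts = Counter(lines)

# uniq -u: only lines that occur exactly once.
non_duplicates = [l for l in lines if counts[l] == 1]

# uniq -dc: one line per duplicated value, prefixed with its count.
duplicates_with_count = []
seen = set()
for l in lines:
    if counts[l] >= 2 and l not in seen:
        seen.add(l)
        duplicates_with_count.append(f"{counts[l]} {l}")
```

Note the trade-off: `uniq -dc` collapses the three FR records to a single counted line, whereas the fork-join design keeps every copy of the duplicated rows.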
Joshy George
bucks wrote: You can use the Sort stage to filter out the duplicates into two different files based on the change keys.
I don't think that will work, as the OP needs to separate the duplicate data from the unique records. Josh's explanation is simple and clear to implement, but it runs in sequential mode. Still, the solutions above seem good.
I haven't failed, I've found 10,000 ways that don't work.
Thomas Alva Edison(1847-1931)
Put a Copy stage after reading the data. One output of the Copy should carry only the key columns into an Aggregator stage; the Aggregator will give the count of records for each key combination. Use the second output of the Copy as the input (left link) to a Join, along with the Aggregator output. Feed the Join output into a Transformer to inspect the number of occurrences (coming from the Aggregator), and based on the occurrences write the data to 2 outputs.
My solution works in parallel mode.
The Aggregator counts the members of each group. The filter stage only passes rows for which the count is greater than 1.
Code: Select all
----> Copy ---------------------> Join -----> Copy -----> RemDups ----->
       |                           ^           |
       +-------> Aggregator -------+           +-------> Filter ------->
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.