
remove all duplicate records

Posted: Wed Jun 09, 2010 4:58 am
by ds_dwh
Hi,

The source will be like this:
col1,col2,col3
100,a,hyd
101,b,hyd
102,c,blore

If I use the Remove Duplicates stage, the output will be like this:
100,a,hyd
102,c,blore
but the required output is:
102,c,blore

Can anyone help?

Posted: Wed Jun 09, 2010 5:06 am
by antonyraj.deva
ds_dwh wrote:
col1,col2,col3
100,a,hyd
101,b,hyd
102,c,blore
The Remove Duplicates stage works on a key column. First, what is your key column? Also, your required output is unclear.

Thanks,
Tony

Posted: Wed Jun 09, 2010 5:46 am
by sureshreddy2009
If your requirement is to remove all records whose key occurs more than once, then this is the logic:

Step 1: Read all the records.
Step 2: Pass them to an Aggregator stage and count the rows for each value of the key column.
Step 3: Use a Filter stage to pass only the records where count = 1.
The Aggregator outputs only the grouping key and the calculated columns, not every input column, so use a Copy stage and a Join stage to bring the other columns back.
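
The steps above can be sketched outside DataStage in plain Python. This is only an illustration of the logic, not the job design itself: the sample rows come from the original post, while the helper name `rows_with_unique_key` and the choice of the third column as the key are assumptions.

```python
from collections import Counter

# Sample rows from the original post: (col1, col2, col3)
rows = [
    (100, "a", "hyd"),
    (101, "b", "hyd"),
    (102, "c", "blore"),
]

def rows_with_unique_key(rows, key_index):
    """Keep only rows whose key value occurs exactly once (hypothetical helper)."""
    # "Aggregator": count the rows for each value of the key column.
    counts = Counter(row[key_index] for row in rows)
    # "Join" and "Filter": look up each detail row's count, keep count == 1.
    return [row for row in rows if counts[row[key_index]] == 1]

print(rows_with_unique_key(rows, key_index=2))  # [(102, 'c', 'blore')]
```

Both hyd rows are dropped because their key occurs twice, which matches the required output in the first post.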

Posted: Wed Jun 09, 2010 4:28 pm
by ray.wurlod
Sort and partition on the third column, which is declared as the "key" for the purposes of the Remove Duplicates stage.
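
For comparison, here is a rough Python analogy of keeping one row per key once the data is sorted on the third column. It assumes the first row of each group is the one retained, and the name `keep_first_per_key` is hypothetical; it is not how the stage is implemented.

```python
from itertools import groupby
from operator import itemgetter

# Sample rows from the original post: (col1, col2, col3)
rows = [
    (100, "a", "hyd"),
    (101, "b", "hyd"),
    (102, "c", "blore"),
]

def keep_first_per_key(rows, key_index):
    """Sort on the key, then keep the first row of each key group."""
    key = itemgetter(key_index)
    # sorted() is stable, so rows with equal keys keep their input order.
    ordered = sorted(rows, key=key)
    return [next(group) for _, group in groupby(ordered, key=key)]

print(keep_first_per_key(rows, 2))  # [(102, 'c', 'blore'), (100, 'a', 'hyd')]
```

One row per key value survives, so 100,a,hyd is still present. That is the behaviour described in the first post.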

Re: remove all duplicate records

Posted: Thu Jun 10, 2010 12:13 am
by mayura
Use col3 as the key column (depending on your process requirement) and you will get the correct records. Also, if you are using the Remove Duplicates stage, select the sort and unique options inside it.
Hope this is helpful.

Re: remove all duplicate records

Posted: Thu Jun 10, 2010 12:39 am
by g_rkrish
Will the values in the third column always be the same size? If so, you can use it as the key. But if the column contains variants such as blore and banglore, they will not match as duplicates and you can't remove them.

Re: remove all duplicate records

Posted: Thu Jun 10, 2010 1:15 am
by ray.wurlod
Please create a written specification of how this output is to be produced. Is it that you only want rows for which no duplicate occurs? In that case, use a fork-join design: count the rows for each value of col3 using an Aggregator stage, join the counts back to the original detail data, then pass only those rows for which the count is 1.