remove all duplicate records

ds_dwh · Post by **ds_dwh** » Wed Jun 09, 2010 4:58 am

Hi,

The source will look like this:
col1,col2,col3
100,a,hyd
101,b,hyd
102,c,blore

If I use the Remove Duplicates stage, the output will be:
100,a,hyd
102,c,blore

But the required output is:
102,c,blore

Can anyone help?

antonyraj.deva · Post by **antonyraj.deva** » Wed Jun 09, 2010 5:06 am

ds_dwh wrote:
col1,col2,col3
100,a,hyd
101,b,hyd
102,c,blore

The Remove Duplicates stage works on a key column. First, what is your key column? Also, your required output is unclear.

Thanks,
Tony

sureshreddy2009 · Post by **sureshreddy2009** » Wed Jun 09, 2010 5:46 am

If your requirement is to remove every record whose key occurs more than once, then this is the logic:

Step 1: Read all the records.
Step 2: Pass them to an Aggregator stage and count the rows for the chosen key column.
Step 3: Use a Filter stage to pass only the records where count = 1.
The Aggregator outputs only the grouping key and the aggregated values, not all the columns, so use a Copy stage to fork the stream and a Join stage to bring the other columns back.
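
The steps above can be sketched outside DataStage in plain Python. This is only an illustration of the count-then-filter logic, not DataStage code; the function name `keep_unique_by_key` and the tuple layout of the rows are assumptions made for the example.

```python
from collections import Counter

# Sample rows from the original post: (col1, col2, col3)
rows = [
    ("100", "a", "hyd"),
    ("101", "b", "hyd"),
    ("102", "c", "blore"),
]

def keep_unique_by_key(rows, key_index):
    """Keep only rows whose key value occurs exactly once (hypothetical helper)."""
    # Aggregator equivalent: count rows per key value.
    counts = Counter(row[key_index] for row in rows)
    # Join + Filter equivalent: look up each row's count and keep count == 1.
    return [row for row in rows if counts[row[key_index]] == 1]

result = keep_unique_by_key(rows, 2)
print(result)  # [('102', 'c', 'blore')]
```

Both "hyd" rows are dropped because their key occurs twice, which leaves only 102,c,blore.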

ray.wurlod · Post by **ray.wurlod** » Wed Jun 09, 2010 4:28 pm

Sort and partition on the third column, which is declared as the "key" for the purposes of the Remove Duplicates stage.
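
As a rough analogue of a keyed duplicate removal that retains the first row per key, here is a small Python sketch. It is an illustration only, not DataStage code; `remove_duplicates_keep_first` is a hypothetical name, and retaining the first row is an assumption about how the stage is configured.

```python
from itertools import groupby
from operator import itemgetter

# Sample rows from the original post: (col1, col2, col3)
rows = [
    ("100", "a", "hyd"),
    ("101", "b", "hyd"),
    ("102", "c", "blore"),
]

def remove_duplicates_keep_first(rows, key_index):
    """Sort on the key, then keep the first row of each key group."""
    key = itemgetter(key_index)
    ordered = sorted(rows, key=key)  # stable sort, so input order is kept within a key
    return [next(group) for _, group in groupby(ordered, key=key)]

deduped = remove_duplicates_keep_first(rows, 2)
print(deduped)  # [('102', 'c', 'blore'), ('100', 'a', 'hyd')]
```

Note that this keeps one row for every key value, so one "hyd" row still comes through. It does not by itself produce an output containing only 102,c,blore.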

mayura · Post by **mayura** » Thu Jun 10, 2010 12:13 am

ds_dwh wrote:
Hi,

The source will look like this:
col1,col2,col3
100,a,hyd
101,b,hyd
102,c,blore

If I use the Remove Duplicates stage, the output will be:
100,a,hyd
102,c,blore

But the required output is:
102,c,blore

Can anyone help?

Use col3 as the key column (depending on your process requirement) and you will get the good records. Also, if you are using the Remove Duplicates stage, select the sort and unique options inside it.
Hope this helps.

g_rkrish · Post by **g_rkrish** » Thu Jun 10, 2010 12:39 am

ds_dwh wrote:
Hi,

100,a,hyd
101,b,hyd
102,c,blore

If I use the Remove Duplicates stage, the output will be:
100,a,hyd
102,c,blore

But the required output is:
102,c,blore

Can anyone help?

Will the values in the third column always be the same size? If so, you can use it as the key. But if the column contains values like blore and banglore, you can't remove those as duplicates.

ray.wurlod · Post by **ray.wurlod** » Thu Jun 10, 2010 1:15 am

ds_dwh wrote:
Hi,

The source will look like this:
col1,col2,col3
100,a,hyd
101,b,hyd
102,c,blore

If I use the Remove Duplicates stage, the output will be:
100,a,hyd
102,c,blore

But the required output is:
102,c,blore

Can anyone help?

Please create a written specification about how this output is to be produced. Is it that you only want rows for which no duplicate occurs? In that case, use a fork-join design, count the distinct values in col3 using an Aggregator stage, join to original detail data, then pass only those rows for which the count is 1.
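
To make the fork-join idea concrete, here is a minimal Python sketch that shows the intermediate joined rows with the count attached before filtering. It is not DataStage code; the function name `fork_join_unique`, the `key_count` column, and the `col1` header spelling are assumptions made for this example.

```python
import csv
import io
from collections import Counter

# Sample data from the original post (header spelling "col1" is assumed).
SAMPLE = """col1,col2,col3
100,a,hyd
101,b,hyd
102,c,blore
"""

def fork_join_unique(text, key):
    """Fork the data, count rows per key, join the count back, keep count == 1."""
    detail = list(csv.DictReader(io.StringIO(text)))   # detail branch of the fork
    counts = Counter(row[key] for row in detail)        # aggregate branch: rows per key
    joined = [dict(row, key_count=counts[row[key]]) for row in detail]  # join on key
    return [row for row in joined if row["key_count"] == 1]             # filter

unique_rows = fork_join_unique(SAMPLE, "col3")
for row in unique_rows:
    print(",".join(row[c] for c in ("col1", "col2", "col3")))  # 102,c,blore
```

Carrying the count as an extra column mirrors the join step in the design: every detail row knows how many rows share its key, and the final filter is a simple comparison.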

DSXchange
