Page 1 of 1

How can i remove duplicate rows

Posted: Thu Mar 02, 2006 3:54 am
by praveenala
Hi all,



I am new in DS. I have one doubt.My source is flat file,is having duplicates.I want to load the data to flat file(target).I dont want duplicates in my target.How can i do that?

please help.

Thanks in Advance


praveen

Posted: Thu Mar 02, 2006 3:59 am
by ds_user78
Use a hashed file selecting the keys as the columns based on which you want eliminate the dups.

Re: How can i remove duplicate rows

Posted: Thu Mar 02, 2006 4:01 am
by cchylik
Hi!

1. solution: While writing the data to the file, you write the key columns to a hashed file, which you lookup for knowing the duplicates. If this lookup is a success you discard this column. This gives you always the first row!

2.solution :Wrtie the data to a hashed file complete, read the hashed file and write your output file. This gives you always the last row!

Good Luck

Posted: Thu Mar 02, 2006 4:01 am
by elavenil
There are few ways to eliminate duplicates in a server job. Load data into hashed file (define the key) and load into target as hashed file will not allow duplicates. Stage variables/aggregation stage can be used to eliminate duplicates as well.

HTWH.

Regards
Elavenil

Posted: Thu Mar 02, 2006 4:27 am
by ArndW
Since your source is a flat file, you can pre-process this using "sort -u" on UNIX or if your server has a package like MKS Toolkit installed. If not, you can do a normal DOS sort (which won't remove duplicates) and then use a stage variable in a transform to store the previous row's value and then doing a compare with the current value to remove duplicates.

Both of these methods can be faster than using an interim hashed file, which is also a viable answer to your question.

Posted: Thu Mar 02, 2006 4:29 am
by balajisr
Hi

You can also use unix sort -u command to remove the duplicates.
--Balaji S.R

Posted: Thu Mar 02, 2006 4:33 am
by balajisr
Sorry did not notice the post by Arnd.

--Balaji S.R

Posted: Thu Mar 02, 2006 5:17 am
by praveenala
elavenil wrote:There are few ways to eliminate duplicates in a server job. Load data into hashed file (define the key) and load into target as hashed file will not allow duplicates. Stage variables/aggregation stage can be used to eliminate duplicates as well.

HTWH.

Regards
Elavenil

Thanks to all,

How can u eliminate duplicates using stage variables?


Regards
praveen

Posted: Thu Mar 02, 2006 5:36 am
by ArndW
Let's say you have two columns in your link, In.Key and In.Data and you want to remove duplicate keys and keep the first record.

Declare stage variables and derivations as:

Code: Select all

IsDuplicate     IF(In.Key = LastKey) THEN 1 ELSE 0
LastKey     In.Key
Then in your constraint put the value "IsDuplicate = 0" so that duplicates are not passed out of the transform stage. The order of "IsDuplicate" and "LastKey" is vital, if you change the order around the method will not work.

Posted: Thu Mar 02, 2006 6:40 am
by kumar_s
Dare to search :?: