How can I remove duplicate rows

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.


praveenala
Participant
Posts: 6
Joined: Fri Jul 08, 2005 7:36 am

How can I remove duplicate rows

Post by praveenala »

Hi all,



I am new to DataStage and have a question. My source is a flat file that contains duplicates. I want to load the data into a flat file (target) without any duplicates in the target. How can I do that?

please help.

Thanks in Advance


praveen
ds_user78
Participant
Posts: 23
Joined: Thu Nov 11, 2004 5:39 pm

Post by ds_user78 »

Use a hashed file, selecting as keys the columns on which you want to eliminate the duplicates.
cchylik
Participant
Posts: 4
Joined: Wed Mar 01, 2006 6:24 am

Re: How can I remove duplicate rows

Post by cchylik »

Hi!

Solution 1: While writing the data to the target file, also write the key columns to a hashed file, and use that hashed file as a lookup to detect duplicates. If the lookup succeeds, discard the row. This always gives you the first row!

Solution 2: Write all the data to a hashed file, then read the hashed file back and write your output file. This always gives you the last row!
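To illustrate the logic outside DataStage, here is a hedged Python sketch (function names are invented for illustration) of the two approaches above; a dict or set stands in for the hashed file:

```python
# Hypothetical sketch, not DataStage BASIC: a Python set/dict emulates the
# hashed file keyed on the duplicate-detection columns.

def dedup_keep_first(rows, key_index=0):
    """Solution 1: skip a row if its key was already seen (first row wins)."""
    seen = set()
    out = []
    for row in rows:
        if row[key_index] not in seen:   # lookup fails -> not a duplicate
            seen.add(row[key_index])
            out.append(row)
    return out

def dedup_keep_last(rows, key_index=0):
    """Solution 2: write every row keyed by its key; later rows overwrite
    earlier ones, as a hashed file keeps one record per key (last row wins)."""
    table = {}
    for row in rows:
        table[row[key_index]] = row
    return list(table.values())

rows = [("A", 1), ("B", 2), ("A", 3)]
print(dedup_keep_first(rows))  # [('A', 1), ('B', 2)]
print(dedup_keep_last(rows))   # [('A', 3), ('B', 2)]
```

Note the difference: the lookup approach keeps the first occurrence of each key, while the overwrite approach keeps the last.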

Good Luck
elavenil
Premium Member
Posts: 467
Joined: Thu Jan 31, 2002 10:20 pm
Location: Singapore

Post by elavenil »

There are a few ways to eliminate duplicates in a server job. Load the data into a hashed file (with the key defined) and then load it into the target; a hashed file will not allow duplicate keys. Stage variables or an Aggregator stage can be used to eliminate duplicates as well.

HTWH.

Regards
Elavenil
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Since your source is a flat file, you can pre-process it with "sort -u" on UNIX, or on Windows if your server has a package like MKS Toolkit installed. If not, you can do a normal DOS sort (which won't remove duplicates) and then use a stage variable in a Transformer to store the previous row's key value, comparing it with the current row's value to remove duplicates.

Both of these methods can be faster than using an interim hashed file, which is also a viable answer to your question.
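As an illustration only (a Python sketch, not DS BASIC), the sort-then-compare-previous technique described above amounts to a single pass over sorted input:

```python
# Hypothetical sketch: after an external sort on the key, one pass that
# compares each row's key with the previous row's key drops duplicates,
# keeping the first row for each key.

def dedup_sorted(sorted_rows, key_index=0):
    last_key = object()              # sentinel that matches no real key
    for row in sorted_rows:
        if row[key_index] != last_key:
            yield row
        last_key = row[key_index]

rows = sorted([("B", 2), ("A", 1), ("A", 3)])
print(list(dedup_sorted(rows)))  # [('A', 1), ('B', 2)]
```

This only works on sorted input, which is why the DOS sort (or UNIX sort) step comes first.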
balajisr
Charter Member
Posts: 785
Joined: Thu Jul 28, 2005 8:58 am

Post by balajisr »

Hi

You can also use the UNIX sort -u command to remove the duplicates.
--Balaji S.R
balajisr
Charter Member
Posts: 785
Joined: Thu Jul 28, 2005 8:58 am

Post by balajisr »

Sorry, I did not notice the post by Arnd.

--Balaji S.R
praveenala
Participant
Posts: 6
Joined: Fri Jul 08, 2005 7:36 am

Post by praveenala »

elavenil wrote:There are a few ways to eliminate duplicates in a server job. Load the data into a hashed file (with the key defined) and then load it into the target; a hashed file will not allow duplicate keys. Stage variables or an Aggregator stage can be used to eliminate duplicates as well.

HTWH.

Regards
Elavenil

Thanks to all,

How can you eliminate duplicates using stage variables?


Regards
praveen
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Let's say you have two columns in your link, In.Key and In.Data and you want to remove duplicate keys and keep the first record.

Declare stage variables and derivations as:

Code:

IsDuplicate     IF(In.Key = LastKey) THEN 1 ELSE 0
LastKey     In.Key
Then in your constraint put the expression "IsDuplicate = 0" so that duplicates are not passed out of the Transformer stage. The order of "IsDuplicate" and "LastKey" is vital; if you swap them around, the method will not work.
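To see why the evaluation order matters, here is a hypothetical Python analogue of the two stage variables; IsDuplicate is computed before LastKey is updated, exactly as in the derivations above:

```python
# Hypothetical sketch of the stage-variable pattern: is_duplicate must be
# evaluated against the PREVIOUS row's key before last_key is updated.

def transform(keys):
    last_key = None
    out = []
    for key in keys:
        is_duplicate = 1 if key == last_key else 0   # IsDuplicate first
        last_key = key                               # then LastKey
        if is_duplicate == 0:                        # the constraint
            out.append(key)
    return out

print(transform(["A", "A", "B", "B", "B", "C"]))  # ['A', 'B', 'C']
```

If the update of last_key ran first, every row would compare against its own key and be flagged as a duplicate, which is why the stage-variable order in the derivation grid is vital.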
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Dare to search :?:
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'