How can I remove duplicate rows

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.


praveenala
Participant
Posts: 6
Joined: Fri Jul 08, 2005 7:36 am

How can I remove duplicate rows

Post by praveenala »

Hi all,



I am new to DataStage and have a question. My source is a flat file that contains duplicates. I want to load the data into a flat file (target) without any duplicates in the target. How can I do that?

please help.

Thanks in Advance


praveen
ds_user78
Participant
Posts: 23
Joined: Thu Nov 11, 2004 5:39 pm

Post by ds_user78 »

Use a hashed file, selecting as keys the columns on which you want to eliminate the duplicates.
cchylik
Participant
Posts: 4
Joined: Wed Mar 01, 2006 6:24 am

Re: How can I remove duplicate rows

Post by cchylik »

Hi!

Solution 1: While writing the data to the target file, also write the key columns to a hashed file, and use that hashed file as a lookup to detect duplicates. If the lookup succeeds, discard the row. This always gives you the first row!

Solution 2: Write all the data to a hashed file, then read the hashed file back and write your output file. This always gives you the last row!
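To illustrate the logic outside DataStage, here is a hedged Python sketch (function names are invented for illustration) of the two approaches above; a dict or set stands in for the hashed file:

```python
# Hypothetical sketch, not DataStage BASIC: a Python set/dict emulates the
# hashed file keyed on the duplicate-detection columns.

def dedup_keep_first(rows, key_index=0):
    """Solution 1: skip a row if its key was already seen (first row wins)."""
    seen = set()
    out = []
    for row in rows:
        if row[key_index] not in seen:   # lookup fails -> not a duplicate
            seen.add(row[key_index])
            out.append(row)
    return out

def dedup_keep_last(rows, key_index=0):
    """Solution 2: write every row keyed by its key; later rows overwrite
    earlier ones, as a hashed file keeps one record per key (last row wins)."""
    table = {}
    for row in rows:
        table[row[key_index]] = row
    return list(table.values())

rows = [("A", 1), ("B", 2), ("A", 3)]
print(dedup_keep_first(rows))  # [('A', 1), ('B', 2)]
print(dedup_keep_last(rows))   # [('A', 3), ('B', 2)]
```

Note the difference: the lookup approach keeps the first occurrence of each key, while the overwrite approach keeps the last.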

Good Luck
elavenil
Premium Member
Posts: 467
Joined: Thu Jan 31, 2002 10:20 pm
Location: Singapore

Post by elavenil »

There are a few ways to eliminate duplicates in a server job. Load the data into a hashed file (with the key defined) and then load it into the target; a hashed file will not allow duplicate keys. Stage variables or an Aggregator stage can be used to eliminate duplicates as well.

HTWH.

Regards
Elavenil
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Since your source is a flat file, you can pre-process it with "sort -u" on UNIX, or on Windows if your server has a package like MKS Toolkit installed. If not, you can do a normal DOS sort (which won't remove duplicates) and then use a stage variable in a Transformer to store the previous row's key value, comparing it with the current row's value to remove duplicates.

Both of these methods can be faster than using an interim hashed file, which is also a viable answer to your question.
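As an illustration only (a Python sketch, not DS BASIC), the sort-then-compare-previous technique described above amounts to a single pass over sorted input:

```python
# Hypothetical sketch: after an external sort on the key, one pass that
# compares each row's key with the previous row's key drops duplicates,
# keeping the first row for each key.

def dedup_sorted(sorted_rows, key_index=0):
    last_key = object()              # sentinel that matches no real key
    for row in sorted_rows:
        if row[key_index] != last_key:
            yield row
        last_key = row[key_index]

rows = sorted([("B", 2), ("A", 1), ("A", 3)])
print(list(dedup_sorted(rows)))  # [('A', 1), ('B', 2)]
```

This only works on sorted input, which is why the DOS sort (or UNIX sort) step comes first.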
balajisr
Charter Member
Posts: 785
Joined: Thu Jul 28, 2005 8:58 am

Post by balajisr »

Hi

You can also use the UNIX sort -u command to remove the duplicates.
--Balaji S.R
balajisr
Charter Member
Posts: 785
Joined: Thu Jul 28, 2005 8:58 am

Post by balajisr »

Sorry, I did not notice the post by Arnd.

--Balaji S.R
praveenala
Participant
Posts: 6
Joined: Fri Jul 08, 2005 7:36 am

Post by praveenala »

elavenil wrote:There are a few ways to eliminate duplicates in a server job. Load the data into a hashed file (with the key defined) and then load it into the target; a hashed file will not allow duplicate keys. Stage variables or an Aggregator stage can be used to eliminate duplicates as well.

HTWH.

Regards
Elavenil

Thanks to all,

How can you eliminate duplicates using stage variables?


Regards
praveen
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Let's say you have two columns in your link, In.Key and In.Data and you want to remove duplicate keys and keep the first record.

Declare stage variables and derivations as:

Code:

IsDuplicate     IF(In.Key = LastKey) THEN 1 ELSE 0
LastKey     In.Key
Then in your constraint put the expression "IsDuplicate = 0" so that duplicates are not passed out of the Transformer stage. The order of "IsDuplicate" and "LastKey" is vital; if you swap them around, the method will not work.
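To see why the evaluation order matters, here is a hypothetical Python analogue of the two stage variables; IsDuplicate is computed before LastKey is updated, exactly as in the derivations above:

```python
# Hypothetical sketch of the stage-variable pattern: is_duplicate must be
# evaluated against the PREVIOUS row's key before last_key is updated.

def transform(keys):
    last_key = None
    out = []
    for key in keys:
        is_duplicate = 1 if key == last_key else 0   # IsDuplicate first
        last_key = key                               # then LastKey
        if is_duplicate == 0:                        # the constraint
            out.append(key)
    return out

print(transform(["A", "A", "B", "B", "B", "C"]))  # ['A', 'B', 'C']
```

If the update of last_key ran first, every row would compare against its own key and be flagged as a duplicate, which is why the stage-variable order in the derivation grid is vital.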
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Dare to search :?:
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'