How can i remove duplicate rows
Moderators: chulett, rschirm, roy
-
- Participant
- Posts: 6
- Joined: Fri Jul 08, 2005 7:36 am
How can i remove duplicate rows
Hi all,
I am new to DataStage and have a doubt. My source is a flat file that contains duplicates, and I want to load the data into a flat file (target) without any duplicates. How can I do that?
please help.
Thanks in Advance
praveen
Re: How can i remove duplicate rows
Hi!
1. Solution: While writing the data to the output file, also write the key columns to a hashed file, and look up that hashed file to detect duplicates. If the lookup succeeds, discard the row. This always gives you the first row!
2. Solution: Write the complete data to a hashed file, then read the hashed file back and write your output file. This always gives you the last row!
Good Luck
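For readers new to hashed files, the two behaviours above can be sketched in plain Python (illustrative only, not DataStage code; a dict/set stands in for the hashed file):

```python
# Illustrative sketch of the two hashed-file dedup behaviours.
# Each row is a (key, data) tuple.

def keep_first(rows):
    """Solution 1: look up the key before writing; if the lookup
    succeeds the row is a duplicate and is discarded."""
    seen = set()
    out = []
    for key, data in rows:
        if key in seen:          # lookup succeeds -> duplicate -> discard
            continue
        seen.add(key)
        out.append((key, data))
    return out

def keep_last(rows):
    """Solution 2: write every row into the hashed file; a later row
    with the same key overwrites the earlier one (destructive overwrite),
    so reading the file back yields the last row per key."""
    hashed = {}
    for key, data in rows:
        hashed[key] = data
    return list(hashed.items())

rows = [(1, "a"), (2, "b"), (1, "c")]
print(keep_first(rows))  # [(1, 'a'), (2, 'b')]
print(keep_last(rows))   # [(1, 'c'), (2, 'b')]
```

The only difference between the two is whether the duplicate check happens before the write (keep first) or the write itself overwrites (keep last).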
Since your source is a flat file, you can pre-process it with "sort -u" on UNIX, or on Windows if your server has a package like MKS Toolkit installed. If not, you can do a normal DOS sort (which won't remove duplicates) and then use a stage variable in a Transformer to store the previous row's value, comparing it with the current value to remove duplicates.
Both of these methods can be faster than using an interim hashed file, which is also a viable answer to your question.
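A minimal sketch of that pre-processing step (the file names are placeholders; `sort -u` removes rows that are duplicated in their entirety after sorting):

```shell
# Create a small sample source file with duplicate rows.
printf 'b\na\nb\nc\na\n' > source.txt

# Sort and drop fully duplicated rows before the DataStage job reads it.
sort -u source.txt > source_dedup.txt

cat source_dedup.txt
# a
# b
# c
```

If you only need one row per key column rather than fully unique rows, `sort` can compare just a key field, e.g. `sort -t, -u -k1,1 source.csv` for a comma-delimited file keyed on the first column (note this keeps one arbitrary row per key, not necessarily the first in input order).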
elavenil wrote:There are a few ways to eliminate duplicates in a server job. Load the data into a hashed file (define the key) and then load into the target, since a hashed file will not allow duplicate keys. Stage variables or an Aggregator stage can be used to eliminate duplicates as well.
HTWH.
Regards
Elavenil
Thanks to all,
How can you eliminate duplicates using stage variables?
Regards
praveen
Let's say you have two columns in your link, In.Key and In.Data, and you want to remove duplicate keys and keep the first record.
Declare stage variables and derivations as:
Code: Select all
IsDuplicate    IF (In.Key = LastKey) THEN 1 ELSE 0
LastKey        In.Key
Then in your constraint put "IsDuplicate = 0" so that duplicates are not passed out of the Transformer stage. The order of IsDuplicate and LastKey is vital: IsDuplicate must be evaluated before LastKey is updated, and if you swap them around the method will not work.
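The stage-variable logic above can be sketched in Python (illustrative only; like the Transformer method, it assumes the input is already sorted on the key):

```python
# Illustrative sketch of the stage-variable dedup on key-sorted input.
# Each row is a (key, data) tuple.

def dedup_sorted(rows):
    out = []
    last_key = None                        # plays the role of the LastKey stage variable
    for key, data in rows:
        is_duplicate = (key == last_key)   # the IsDuplicate derivation
        last_key = key                     # updated AFTER IsDuplicate, as in the stage
        if not is_duplicate:               # the "IsDuplicate = 0" constraint
            out.append((key, data))
    return out

rows = [(1, "a"), (1, "b"), (2, "c"), (2, "d")]
print(dedup_sorted(rows))  # [(1, 'a'), (2, 'c')]
```

Note that the duplicate test happens before last_key is overwritten; reversing those two steps would compare each row against its own key and let every row through, which is exactly why the stage-variable order matters.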