duplicate records in sequential file

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

oracle
Participant
Posts: 43
Joined: Sat Jun 25, 2005 11:52 pm

duplicate records in sequential file

Post by oracle »

Hi friends


I am new to DataStage and have a small problem, please help.

I have three sequential files: file 1, file 2, and file 3.
File 1 contains records A, A.
File 2 contains records B, B.
File 3 contains records C, C.

My source is a sequential file and my target is a sequential file, but the output in the target should be records A, B, C. How can I eliminate the duplicate records so the target matches the example above?
WoMaWil
Participant
Posts: 482
Joined: Thu Mar 13, 2003 7:17 am
Location: Amsterdam

Post by WoMaWil »

Use a Hashed File stage in between, and be careful with your key fields: all rows with duplicate keys are eliminated, and if you choose too few key fields you will lose rows you meant to keep.

Wolfgang
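Outside DataStage, the hashed-file behaviour Wolfgang describes (one surviving row per key, with a later write to the same key overwriting the earlier one) can be sketched in Python; the record layout and field names here are illustrative assumptions, not anything from the original job:

```python
# Sketch of hash-file-style de-duplication: one record survives per key,
# and a later record with the same key overwrites an earlier one,
# mirroring how writes to a hashed file keyed on those fields behave.
def dedupe_last_wins(records, key_fields):
    seen = {}
    for rec in records:
        key = tuple(rec[f] for f in key_fields)
        seen[key] = rec          # later duplicates replace earlier ones
    return list(seen.values())

# Hypothetical rows standing in for the A,A / B,B / C,C example.
rows = [
    {"id": 1, "val": "A"}, {"id": 1, "val": "A"},
    {"id": 2, "val": "B"}, {"id": 2, "val": "B"},
    {"id": 3, "val": "C"}, {"id": 3, "val": "C"},
]
print(dedupe_last_wins(rows, ["id"]))  # one row each for ids 1, 2, 3
```

Note the warning about key fields: if `key_fields` covered only part of the real key, rows that are genuinely distinct would collide and be silently dropped, which is the "too few key fields" trap.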
elavenil
Premium Member
Posts: 467
Joined: Thu Jan 31, 2002 10:20 pm
Location: Singapore

Post by elavenil »

You can use stage variables to eliminate duplicate records. This topic has been covered a lot; please search the forum.

Regards
Saravanan
WoMaWil
Participant
Posts: 482
Joined: Thu Mar 13, 2003 7:17 am
Location: Amsterdam

Post by WoMaWil »

Stage variables are only helpful if duplicate records follow one after the other, which is easy to guarantee when fetching a database table in a certain order. For flat files, unless you know that a duplicate is directly followed by its twin, they are not adequate.

Best is using hashed files, but with hashed files you have to know what you are doing. (READ THE MANUAL)


Wolfgang
Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Post by Sainath.Srinivasan »

You can sort the rows before using stage variables.
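The sort-then-stage-variables pattern suggested here can be sketched in Python: once the rows are sorted on the key, duplicates are adjacent, so comparing each row's key to the previous one (the value a stage variable would carry between rows) is enough. The field name is an assumption for illustration:

```python
# Sketch of sort + stage-variable de-duplication: sorting makes duplicates
# adjacent, so a single "previous key" variable decides whether the current
# row is the first of its key group.
def dedupe_sorted(records, key):
    out = []
    prev = object()          # sentinel: no previous key yet
    for rec in sorted(records, key=lambda r: r[key]):
        if rec[key] != prev:
            out.append(rec)  # first row of each key group passes through
        prev = rec[key]
    return out

rows = [{"k": "B"}, {"k": "A"}, {"k": "B"}, {"k": "A"}, {"k": "C"}]
print(dedupe_sorted(rows, "k"))  # [{'k': 'A'}, {'k': 'B'}, {'k': 'C'}]
```

This keeps the first row of each key group; without the sort, as Wolfgang notes above, non-adjacent duplicates would slip through.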
pnchowdary
Participant
Posts: 232
Joined: Sat May 07, 2005 2:49 pm
Location: USA

Post by pnchowdary »

Hi Sai,

Can't we use the count function in the Aggregator stage and, based on the criterion that count > 1, eliminate duplicates along with stage variables? This might not be the most efficient way to do it, but theoretically it should work, right?
Thanks,
Naveen
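Naveen's Aggregator idea can be sketched in Python as two passes: a counting pass (what the Aggregator stage would do) that flags keys with count > 1, and a second pass that lets only one row per key through. The field names are hypothetical:

```python
from collections import Counter

# Pass 1, the Aggregator analogue: count rows per key; duplicate keys
# are those seen more than once.
def find_duplicate_keys(records, key):
    counts = Counter(rec[key] for rec in records)
    return {k for k, n in counts.items() if n > 1}

# Pass 2, the stage-variable analogue: emit only the first row per key.
def keep_first_per_key(records, key):
    emitted = set()
    out = []
    for rec in records:
        if rec[key] not in emitted:
            out.append(rec)
            emitted.add(rec[key])
    return out

rows = [{"k": 1, "v": "A"}, {"k": 1, "v": "A2"}, {"k": 2, "v": "B"}]
print(find_duplicate_keys(rows, "k"))   # {1}
print(keep_first_per_key(rows, "k"))
```

It also makes Sainath's objection below concrete: the count alone identifies the duplicate keys but says nothing about which row's data values to keep; that choice is made by the second pass.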
Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Post by Sainath.Srinivasan »

You can do that to identify duplicate keys. But what about the data value associated with the key?
pnchowdary
Participant
Posts: 232
Joined: Sat May 07, 2005 2:49 pm
Location: USA

Post by pnchowdary »

Hi Sai,

We can store the list of keys identified as duplicates in a file, then use it as a reference to decide whether to pass the entire row (key + data) to the output, with stage variables ensuring that only one row passes when the key is a duplicate.
Thanks,
Naveen
Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Post by Sainath.Srinivasan »

How will you know which one of the duplicates is to be picked?

The solution would be to store the row number, but again that leads to more steps and to using a hash file for reference.
pnchowdary
Participant
Posts: 232
Joined: Sat May 07, 2005 2:49 pm
Location: USA

Post by pnchowdary »

Hi Sai,

Yes, I certainly agree that it is not the most efficient way to do it; I was just trying to see whether it was theoretically possible. From the discussion we had, I can conclude that it is indeed possible to do it that way, just not very efficiently. Thanks a lot for sharing your inputs on this issue.
Thanks,
Naveen
ranga1970
Participant
Posts: 141
Joined: Thu Nov 04, 2004 3:29 pm
Location: Hyderabad

Post by ranga1970 »

The simplest way is to pass the data through a hash file, defining the keys on which you want to check for duplicates; the last row for each key is picked automatically.
RRCHINTALA
WoMaWil
Participant
Posts: 482
Joined: Thu Mar 13, 2003 7:17 am
Location: Amsterdam

Post by WoMaWil »

As with many questions, there is not one "best" solution; it depends on the data you have.

The concatenate-and-sort approach on sequential files may work best as a one-time, low-space option.
The hash file is best for a permanent solution where all key fields make up the key and "last row wins" is the right behaviour.

Wolfgang