duplicate records in sequential file

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

oracle
Participant
Posts: 43
Joined: Sat Jun 25, 2005 11:52 pm

duplicate records in sequential file

Post by oracle »

Hi friends


I am new to DataStage and have a small problem, please help.

I have three sequential files: file 1, file 2, and file 3.
File 1 contains records A, A.
File 2 contains records B, B.
File 3 contains records C, C.

My source is a sequential file and my target is a sequential file, but the output in the target should be records A, B, C. How can I eliminate the duplicate records so the target matches the example above?
WoMaWil
Participant
Posts: 482
Joined: Thu Mar 13, 2003 7:17 am
Location: Amsterdam

Post by WoMaWil »

Use a Hashed File stage in between, and be careful with your key fields: all rows with duplicate keys are eliminated, and if you choose too few key fields you will lose rows you meant to keep.

Wolfgang
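Outside DataStage, the hashed-file behaviour Wolfgang describes (one surviving row per key, with a later write to the same key overwriting the earlier one) can be sketched in Python; the record layout and field names here are illustrative assumptions, not anything from the original job:

```python
# Sketch of hash-file-style de-duplication: one record survives per key,
# and a later record with the same key overwrites an earlier one,
# mirroring how writes to a hashed file keyed on those fields behave.
def dedupe_last_wins(records, key_fields):
    seen = {}
    for rec in records:
        key = tuple(rec[f] for f in key_fields)
        seen[key] = rec          # later duplicates replace earlier ones
    return list(seen.values())

# Hypothetical rows standing in for the A,A / B,B / C,C example.
rows = [
    {"id": 1, "val": "A"}, {"id": 1, "val": "A"},
    {"id": 2, "val": "B"}, {"id": 2, "val": "B"},
    {"id": 3, "val": "C"}, {"id": 3, "val": "C"},
]
print(dedupe_last_wins(rows, ["id"]))  # one row each for ids 1, 2, 3
```

Note the warning about key fields: if `key_fields` covered only part of the real key, rows that are genuinely distinct would collide and be silently dropped, which is the "too few key fields" trap.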
elavenil
Premium Member
Posts: 467
Joined: Thu Jan 31, 2002 10:20 pm
Location: Singapore

Post by elavenil »

You can use stage variables to eliminate duplicate records. This topic has been covered a lot; please search the forum.

Regards
Saravanan
WoMaWil
Participant
Posts: 482
Joined: Thu Mar 13, 2003 7:17 am
Location: Amsterdam

Post by WoMaWil »

Stage variables are only helpful if duplicate records follow one after the other, which is easy to guarantee when fetching a database table in a certain order. For flat files, unless you know that a duplicate is directly followed by its twin, they are not adequate.

Best is using hashed files, but with hashed files you have to know what you are doing. (READ THE MANUAL)


Wolfgang
Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Post by Sainath.Srinivasan »

You can sort the rows before using stage variables.
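The sort-then-stage-variables pattern suggested here can be sketched in Python: once the rows are sorted on the key, duplicates are adjacent, so comparing each row's key to the previous one (the value a stage variable would carry between rows) is enough. The field name is an assumption for illustration:

```python
# Sketch of sort + stage-variable de-duplication: sorting makes duplicates
# adjacent, so a single "previous key" variable decides whether the current
# row is the first of its key group.
def dedupe_sorted(records, key):
    out = []
    prev = object()          # sentinel: no previous key yet
    for rec in sorted(records, key=lambda r: r[key]):
        if rec[key] != prev:
            out.append(rec)  # first row of each key group passes through
        prev = rec[key]
    return out

rows = [{"k": "B"}, {"k": "A"}, {"k": "B"}, {"k": "A"}, {"k": "C"}]
print(dedupe_sorted(rows, "k"))  # [{'k': 'A'}, {'k': 'B'}, {'k': 'C'}]
```

This keeps the first row of each key group; without the sort, as Wolfgang notes above, non-adjacent duplicates would slip through.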
pnchowdary
Participant
Posts: 232
Joined: Sat May 07, 2005 2:49 pm
Location: USA

Post by pnchowdary »

Hi Sai,

Can't we use the count function in the Aggregator stage and, based on the criterion that count > 1, eliminate duplicates along with stage variables? This might not be the most efficient way to do it, but theoretically it should work, right?
Thanks,
Naveen
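Naveen's Aggregator idea can be sketched in Python as two passes: a counting pass (what the Aggregator stage would do) that flags keys with count > 1, and a second pass that lets only one row per key through. The field names are hypothetical:

```python
from collections import Counter

# Pass 1, the Aggregator analogue: count rows per key; duplicate keys
# are those seen more than once.
def find_duplicate_keys(records, key):
    counts = Counter(rec[key] for rec in records)
    return {k for k, n in counts.items() if n > 1}

# Pass 2, the stage-variable analogue: emit only the first row per key.
def keep_first_per_key(records, key):
    emitted = set()
    out = []
    for rec in records:
        if rec[key] not in emitted:
            out.append(rec)
            emitted.add(rec[key])
    return out

rows = [{"k": 1, "v": "A"}, {"k": 1, "v": "A2"}, {"k": 2, "v": "B"}]
print(find_duplicate_keys(rows, "k"))   # {1}
print(keep_first_per_key(rows, "k"))
```

It also makes Sainath's objection below concrete: the count alone identifies the duplicate keys but says nothing about which row's data values to keep; that choice is made by the second pass.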
Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Post by Sainath.Srinivasan »

You can do that to identify duplicate keys. But what about the data value associated with the key?
pnchowdary
Participant
Posts: 232
Joined: Sat May 07, 2005 2:49 pm
Location: USA

Post by pnchowdary »

Hi Sai,

We can store the list of keys identified as duplicates in a file, then use it as a reference to decide whether to pass the entire row (key + data) to the output, with stage variables ensuring that only one row passes when the key is a duplicate.
Thanks,
Naveen
Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Post by Sainath.Srinivasan »

How will you know which one of the duplicates is to be picked?

The solution would be to store the row number, but again that leads to more steps and to using a hash file for reference.
pnchowdary
Participant
Posts: 232
Joined: Sat May 07, 2005 2:49 pm
Location: USA

Post by pnchowdary »

Hi Sai,

Yes, I certainly agree that it is not the most efficient way to do it; I was just trying to see whether it was theoretically possible. From the discussion we had, I can conclude that it is indeed possible to do it that way, just not very efficiently. Thanks a lot for sharing your inputs on this issue.
Thanks,
Naveen
ranga1970
Participant
Posts: 141
Joined: Thu Nov 04, 2004 3:29 pm
Location: Hyderabad

Post by ranga1970 »

The simplest way is to pass the data through a hash file, defining the keys on which you want to check for duplicates; the last row for each key is picked automatically.
RRCHINTALA
WoMaWil
Participant
Posts: 482
Joined: Thu Mar 13, 2003 7:17 am
Location: Amsterdam

Post by WoMaWil »

As with many questions, there is not one "best" solution; it depends on the data you have.

The concatenate-and-sort approach on sequential files may work best as a one-time, low-space option.
The hash file is best for a permanent solution where all key fields make up the key and "last row wins" is the right behaviour.

Wolfgang