Uniqueness check in a sequential file?

Post questions here related to DataStage Server Edition: Server job design, DS Basic, Routines, Job Sequences, etc.

fahad
Participant
Posts: 15
Joined: Sat Aug 07, 2004 7:48 am

Uniqueness check in a sequential file?

Post by fahad »

Hi all,
I have a text file as my data source and two flat files as targets. I need to enforce a uniqueness check on two columns to avoid loading duplicates. How can I do that? Is there a routine for it? :roll:

Thank you all.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

No, sorry... no magic bullet. At least not on the Server side. :?

If by chance your data is sorted such that you can take advantage of suppressing repeating groups, you can do the fairly standard Stage Variable thing (sketched below) to check for duplicates: let the first value pair through, suppress the rest. I'm guessing that's not the case.
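
A minimal sketch of that Stage Variable approach, assuming the two columns arrive as In.COL1 and In.COL2 (link and column names here are made up). Stage variables evaluate top to bottom, so svIsDup compares against the previous row's values before they get refreshed:

    svIsDup    : In.COL1 = svPrevCol1 And In.COL2 = svPrevCol2
    svPrevCol1 : In.COL1
    svPrevCol2 : In.COL2

    Constraint on the output link: Not(svIsDup)

Initialize svPrevCol1 and svPrevCol2 to something that can never appear in the data so the first row always passes.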

Otherwise, use hash files keyed by the columns you need to check, a technique mentioned here before. You need to both write values to the hashes and do lookups against them (the exact same ones!) in the same Transformer. If you don't get a hit on the hash, write the record to both your target and the hash. If you do get a hit, you've seen that value before and the record can be 'skipped' or handled however your requirements dictate.

A couple of things to keep in mind: do NOT use write caching on the hash you are writing to, and on the lookup side make sure 'Lock for Updates' is selected. Do that and you'll be fine.

You'll need to adjust this depending on whether your two columns are checked separately or as a 'value pair' - i.e. one or two hash files - but this approach should work fine for you. And not just for checking sequential files. :wink:
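
Roughly, for the 'value pair' case the job looks like this (stage and link names are just examples):

    SeqFile_src ---in---> Transformer ---out---> SeqFile_tgt
                            ^     |
                        ref |     | wr
                            |     v
                      Hashed File (keyed on COL1, COL2)

    Constraint on 'out' and 'wr': ref.NOTFOUND

The reference link and the write link both point at the same hash file, which is exactly why write caching has to stay off.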

Sometimes you can also leverage your operating system. On UNIX (if that's your platform) you may be able to take advantage of something like 'sort -u' to preprocess the file and remove the duplicates for you. This could be done in the Filter property of the Sequential File stage or in a script run as a 'before job' process.
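
For example, if the file were pipe-delimited with the two columns in the first two fields (delimiter and positions are assumptions), the Filter command could be as simple as:

    sort -t'|' -u -k1,2

Just keep in mind that sort reorders the file, and -u keeps only one row per key value, whichever sorts first.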
-craig

"You can never have too many knives" -- Logan Nine Fingers
rasi
Participant
Posts: 464
Joined: Fri Oct 25, 2002 1:33 am
Location: Australia, Sydney

Post by rasi »

Hi

You can change your target to a hash file and make your unique columns the key. Duplicate rows will overwrite each other, so you will always end up with unique rows (see the example below).
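
For example, with the hash file keyed on COL1 and COL2 (column names assumed):

    Input rows (COL1, COL2, VAL)     Hash file after the load
    A, 1, first                      A, 1, third
    A, 1, second                     B, 2, only
    A, 1, third
    B, 2, only

Note the difference from the lookup approach above: with a plain overwrite the LAST duplicate wins, not the first.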

Thanks
Siva
fahad
Participant
Posts: 15
Joined: Sat Aug 07, 2004 7:48 am

It worked

Post by fahad »

Thank you very much, guys.
I used a hash file in the middle and it removed the duplicate rows.

Thanks again for the quick help. :D