Uniqueness check in a sequential file?

Post questions here related to DataStage Server Edition: Server job design, DS Basic, Routines, Job Sequences, etc.

fahad
Participant
Posts: 15
Joined: Sat Aug 07, 2004 7:48 am

Uniqueness check in a sequential file?

Post by fahad »

Hi all,
I have a text file as my data source and two flat files as targets. I need to enforce a uniqueness check on two columns to avoid loading duplicates. How can I do that? Is there a routine for it? :roll:

Thank you all.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

No, sorry... no magic bullet. At least not on the Server side. :?

If by chance your data is sorted such that you can take advantage of suppressing repeating groups, you can do the fairly standard Stage Variable thing (sketched below) to check for duplicates: let the first value pair through, suppress the rest. I'm guessing that's not the case.
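
A minimal sketch of that Stage Variable approach, assuming the two columns arrive as In.COL1 and In.COL2 (link and column names here are made up). Stage variables evaluate top to bottom, so svIsDup compares against the previous row's values before they get refreshed:

    svIsDup    : In.COL1 = svPrevCol1 And In.COL2 = svPrevCol2
    svPrevCol1 : In.COL1
    svPrevCol2 : In.COL2

    Constraint on the output link: Not(svIsDup)

Initialize svPrevCol1 and svPrevCol2 to something that can never appear in the data so the first row always passes.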

Otherwise, use hash files keyed by the columns you need to check, a technique mentioned here before. You need to both write values to the hashes and do lookups against them (the exact same ones!) in the same Transformer. If you don't get a hit on the hash, write the record to both your target and the hash. If you do get a hit, you've seen that value before and the record can be 'skipped' or handled however your requirements dictate.

A couple of things to keep in mind: do NOT use write caching on the hash you are writing to, and on the lookup side make sure 'Lock for Updates' is selected. Do that and you'll be fine.

You'll need to adjust this depending on whether your two columns are checked separately or as a 'value pair' - i.e. one or two hash files - but this approach should work fine for you. And not just for checking sequential files. :wink:
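
Roughly, for the 'value pair' case the job looks like this (stage and link names are just examples):

    SeqFile_src ---in---> Transformer ---out---> SeqFile_tgt
                            ^     |
                        ref |     | wr
                            |     v
                      Hashed File (keyed on COL1, COL2)

    Constraint on 'out' and 'wr': ref.NOTFOUND

The reference link and the write link both point at the same hash file, which is exactly why write caching has to stay off.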

Sometimes you can also leverage your operating system. On UNIX (if that's your platform) you may be able to take advantage of something like 'sort -u' to preprocess the file and remove the duplicates for you. This could be done in the Filter property of the Sequential File stage or in a script run as a 'before job' process.
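
For example, if the file were pipe-delimited with the two columns in the first two fields (delimiter and positions are assumptions), the Filter command could be as simple as:

    sort -t'|' -u -k1,2

Just keep in mind that sort reorders the file, and -u keeps only one row per key value, whichever sorts first.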
-craig

"You can never have too many knives" -- Logan Nine Fingers
rasi
Participant
Posts: 464
Joined: Fri Oct 25, 2002 1:33 am
Location: Australia, Sydney

Post by rasi »

Hi

You can change your target to a hash file and make your unique columns the key. Duplicate rows will overwrite each other, so you will always end up with unique rows (see the example below).
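
For example, with the hash file keyed on COL1 and COL2 (column names assumed):

    Input rows (COL1, COL2, VAL)     Hash file after the load
    A, 1, first                      A, 1, third
    A, 1, second                     B, 2, only
    A, 1, third
    B, 2, only

Note the difference from the lookup approach above: with a plain overwrite the LAST duplicate wins, not the first.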

Thanks
Siva
fahad
Participant
Posts: 15
Joined: Sat Aug 07, 2004 7:48 am

It worked

Post by fahad »

Thank you very much, guys.
I used a hash file in the middle and it removed the duplicate rows.

Thanks again for the quick help. :D