Hash File Lookup

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

ianm
Charter Member
Posts: 15
Joined: Thu Sep 16, 2004 6:13 am

Hash File Lookup

Post by ianm »

I am fairly new to DataStage and I'm trying to get the hang of hashed file lookups.
I have set up a couple of test hashed files:

TEST_hashfile
Field1 (Key)   Field2   Field3
row1           test1    test1
row2           test2    test2
row3           test3    test3

TEST_feed_hashfile
Field1 (Key)   Field2   Field3
row1           test1    test1
row2           change   test2
row3           test3    test3
row4           new      new

I'm trying to produce an output that writes out records that are either new or changed.
In this example row4 is a completely new record and Field2 in row2 has changed.
We need to record the fact that changes have occurred, not just do an update.
I have managed to get row4 writing out through the reference lookup (because of the key field Field1), but whatever I try I can't seem to isolate the change in row2.
The job I have set up is:

Hash_file ......(reference)..... Transformer1 ....... Output_1
Hash_file_feed.................... Transformer1 ....... Output_2

Constraint: lk_Hash_file.NOTFOUND
Output_2 is just set to receive any rejected rows.

Any suggestions?
Note that the actual files I will eventually apply this to contain several million rows.
Regards,
Ian
mleroux
Participant
Posts: 81
Joined: Wed Jul 14, 2004 3:18 am
Location: Johannesburg, South Africa

Post by mleroux »

Welcome to DSXchange! :D

You're on the right track. The constraint you currently have checks for new records. To catch updates as well, you could have two output links: one for new records and another just for updates. Something like this:

Code: Select all

                      Hash_file
                          .
                          .
                        (Ref)
                          .
                          .
                     =========== --- NewRecs ---> Insert new rows only
Hash_file_feed ----> Transformer
                     =========== --- UpdRecs ---> Update existing rows
Then you'll have this constraint for NewRecs:

Code: Select all

Hash_file.NOTFOUND
And this constraint for UpdRecs:

Code: Select all

not(Hash_file.NOTFOUND) and
(
  Hash_file_feed.Field2 <> Hash_file.Field2 or
  Hash_file_feed.Field3 <> Hash_file.Field3
)
However, checking values field by field is tedious and adds overhead. A better way is to use the checksum function and compare checksum values in UpdRecs instead:

Code: Select all

not(Hash_file.NOTFOUND) and
(
  checksum(Hash_file.Field2 : Hash_file.Field3) <>
  checksum(Hash_file_feed.Field2 : Hash_file_feed.Field3)
)
You could also write the checksum to your lookup (Hash_file) when you load it, so you don't need to compute it for the lookup side in the constraint:

Code: Select all

not(Hash_file.NOTFOUND) and
(
  Hash_file.RowChkSum <>
  checksum(Hash_file_feed.Field2 : Hash_file_feed.Field3)
)
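The RowChkSum column itself would be populated by whatever job loads Hash_file. As a rough sketch (untested, and SrcLink is just a placeholder for the name of the link feeding the hashed file), the derivation for that column would be along these lines:

Code: Select all

checksum(SrcLink.Field2 : SrcLink.Field3)
Just make sure the non-key columns are concatenated in the same order on both sides, otherwise the checksums will never match even when the data is identical.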
When you have a large volume of rows in your hashed files, be sure to turn on write caching on each file's Inputs tab. If you really want to crank up speed on your hashed files, read up on configuring static hashed files that are sized for the source's volume of data as well as the makeup of the keys.
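If you do go the static hashed file route, the sizing is done outside the job. As a very rough sketch (the numbers here are purely illustrative; the right type, modulo and separation depend on your record sizes and key distribution), you could pre-create the file from the UniVerse/TCL prompt in the project account, or use the equivalent Create File options on the Hashed File stage, and then let UniVerse tell you how well it is sized:

Code: Select all

CREATE.FILE TEST_hashfile 18 400009 4
ANALYZE.FILE TEST_hashfile
HASH.HELP TEST_hashfile
Load a representative chunk of data before running ANALYZE.FILE or HASH.HELP so the recommendations reflect your real volumes rather than an empty file.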

A hashed file is necessary for the lookup, but not for the stream feed (unless you wrote to it in a previous step to eliminate duplicates, for example).
Morney le Roux

There are only 10 kinds of people: Those who understand binary and those who don't.
ianm
Charter Member
Posts: 15
Joined: Thu Sep 16, 2004 6:13 am

Post by ianm »

Thank you, mleroux.
I tried your code and it worked perfectly.
I have yet to see how it copes with files of 7 million rows!

Regards,
Ianm
mhester
Participant
Posts: 622
Joined: Tue Mar 04, 2003 5:26 am
Location: Phoenix, AZ

Post by mhester »

I would caution you not to use checksum, but rather to use CRC32. The checksum implementation in UniVerse is 16-bit and the algorithm is additive, which can lead to some very undesirable results. The probability that two different rows (with the same key) produce the same checksum, or that the same checksum is generated for a row that has actually changed, is high - somewhere around 1 in 65,536, i.e. 2^16.

CRC32, on the other hand, is not additive and returns a 32-bit integer. Checksum has a difficult time detecting small changes in moderate to large fields, and this is what makes it undesirable as a change-detection mechanism. If you do not believe me, try performing a checksum on a data stream where the only thing that has changed is a comma, a period or a single alpha character: checksum will quite likely not detect the change, especially if the change is at the end of the string.

CRC32, by contrast, is very good at detecting changes of any kind in small, moderate or large data streams.

I would choose CRC32 over checksum, especially if you are planning on processing 7 million rows. There is a working example on the ADN in the download section that can easily be modified to suit your needs.
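Assuming a CRC32 function is available to the Transformer (the ADN example mentioned above compiled as a routine, or the built-in CRC32 function in later DataStage releases), the UpdRecs constraint from the earlier posts would be reworked along these lines - an untested sketch only:

Code: Select all

not(Hash_file.NOTFOUND) and
(
  CRC32(Hash_file.Field2 : Hash_file.Field3) <>
  CRC32(Hash_file_feed.Field2 : Hash_file_feed.Field3)
)
As with the checksum version, you would normally compute the CRC once when the hashed file is loaded and store it as a column, rather than recalculating it on the lookup side for every row.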

Regards,
mleroux
Participant
Posts: 81
Joined: Wed Jul 14, 2004 3:18 am
Location: Johannesburg, South Africa

Post by mleroux »

Thanks for the CRC32 info, Michael. :) Interesting...
Morney le Roux

There are only 10 kinds of people: Those who understand binary and those who don't.
ianm
Charter Member
Posts: 15
Joined: Thu Sep 16, 2004 6:13 am

Post by ianm »

Thanks Michael,

I have been able to get this approach working as well.

Regards,
Ian