Failing hash-file lookup

Post questions here related to DataStage Server Edition in areas such as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

michaelsarsnorum
Participant
Posts: 31
Joined: Wed Aug 31, 2005 1:55 am

Failing hash-file lookup

Post by michaelsarsnorum »

Hi

I have a problem with a server job after migrating from 5.2 to 7.5.1A. The job reads from a flat file and loads two tables in a database. One table contains the actual data; the other is used as a backup for the hash file.

When processing a new row, DataStage looks up the ID in a hash file. If the ID is found, the row is loaded into the data table. If no ID is found, the row is given a new ID from a sequence, loaded into the data table, written to the hash file (which may contain duplicate tuples) and written to the backup table.
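In outline, the logic is roughly this (a sketch in UniVerse BASIC, not the actual job; the file name LOOKUP.HASH and the variables InKey and NextSeqID are made up for illustration):

    * Sketch only - file, key and sequence names are hypothetical.
    OPEN "LOOKUP.HASH" TO LkpFile ELSE STOP "Cannot open hashed file"
    READ Rec FROM LkpFile, InKey THEN
       ID = Rec<1>                  ;* key already present, reuse its ID
    END ELSE
       ID = NextSeqID               ;* stands in for the next value from the sequence
       WRITE ID ON LkpFile, InKey   ;* later rows can now find this key
       * the same row also goes to the backup table at this point
    END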

The problem is that the primary key on the backup table is violated and the job aborts. I have checked that the ID is present in the hash file, so it should have been found and the row should never have been written to the backup table.

The field used for lookups in the hash file has a leading zero in the flat file, but this zero is not present in the hash file. I remove the leading zero with FMT(fieldname, "Z") before passing the field to the lookup. In addition, the lookup only fails on some of the rows (150 out of 3500).
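For illustration, the key derivation looks like this (the sample value is made up, and I am assuming, as described above, that the "Z" option suppresses the leading zero):

    KeyRaw = "0123456"            ;* key as it appears in the flat file
    KeyLkp = FMT(KeyRaw, "Z")     ;* leading zero suppressed, as passed to the lookup
    CRT KeyLkp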

Any help is welcome.
Michael
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

This type of lookup is common, so there is no technical reason why it shouldn't work the way you expect. Could your key contain leading or trailing spaces, or perhaps unprintable characters? You can confirm that it doesn't by adding a dummy output link with a constraint like FMT(fieldname, "Z") <> TRIM(OCONV(FMT(fieldname, "Z"), "MCP")). This trims the value and shows non-printable characters as "."; no rows should go down this link.
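A quick standalone BASIC sketch of the same check (the sample key value is made up; MCP converts non-printable characters to periods):

    Key   = " 0123" : CHAR(10)         ;* leading space plus a trailing line feed
    Clean = TRIM(OCONV(Key, "MCP"))    ;* gives "0123." - space trimmed, line feed shown as "."
    IF Key # Clean THEN CRT "Key contains spaces or unprintable characters"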
michaelsarsnorum
Participant
Posts: 31
Joined: Wed Aug 31, 2005 1:55 am

Post by michaelsarsnorum »

I did some further investigation. It seems the job somehow corrupts the hash file. I have another job that can regenerate the hash file from the backup table mentioned above. After running that job, the hash file is correct and contains all the rows that cause problems. However, after the problematic job runs, the hash file has changed and contains only a fraction of the rows it should.

The lookup was set to Preload to Memory = Disabled (the default). I changed this to "Disabled, Lock for updates", which apparently cured the failing lookups. The problem, however, is that this change has reduced the overall performance of the job by a factor of 5 (from 1000 rows/second to 200 rows/second).

Does anyone know if this reduction in performance is due to the "Lock for updates" setting? I have tried setting the array size of the main target table to different values, but have only managed to increase performance marginally (to 250 rows/second).

Any help is appreciated.
Michael
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Disabling memory lookups has a significant performance impact, as you have seen. "Lock for updates" together with disabled preloading ensures consistency when more than one job is modifying and reading the hash file concurrently; is that what is happening in your job? If only this job reads the hashed file and no other processes modify it at the same time, then you should stick with the preload-to-memory option for performance reasons.
michaelsarsnorum
Participant
Posts: 31
Joined: Wed Aug 31, 2005 1:55 am

Post by michaelsarsnorum »

The hash file is used for both lookups and updates in the same job. No other jobs use this hash file concurrently.

The job is being migrated from a (rather old) server running DS 5.2 to 7.5.1. In 5.2, preloading was disabled and the job ran at 1000 rows/second. In the 7.5.1 environment, as mentioned, the hash file was corrupted when preloading was simply set to Disabled. With either "Enabled, Lock for updates" or "Disabled, Lock for updates", the job works as it should.

I haven't tested this in the 7.5.1 environment, but in the 5.2 environment performance decreased by a factor of 100 when I used plain Enabled (apparently the hash file was reloaded for every update).

Michael
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

If you are reading from and writing to the same hashed file in the job, and you need a "written" record to be "read" later in the same stream, then you cannot preload the file to memory and you will have to live with the slower mechanism.

DS 7.x does have the capability of using file caching; this functionality needs to be enabled at the system level, then at the job level, and also at the file level. This can speed up hashed file use, but if this is the only job doing this type of read/write, the overhead is not worth the effort.