Large Hash Files

Post questions here relating to DataStage Server Edition, covering areas such as Server job design, DS Basic, Routines, Job Sequences, etc.

nkumar_home
Participant
Posts: 19
Joined: Fri Apr 02, 2004 10:13 am

Large Hash Files

Post by nkumar_home »

We are building some very large hash files which are then used for lookups in some jobs. The job reads around 4 million rows from an Oracle table and does a self-referencing lookup to concatenate strings for the same key (a rough sketch of the logic follows the analyze output below). With the current hash file settings of Type 30 Dynamic, Group size 2, Split load 80, Merge load 50, Large record 10000 and Hashing algorithm GENERAL, the job takes a very long time, several hours. I know the bottleneck is not reading from Oracle.
The problem is that as the hash file grows and overflows into the OVER.30 file, the job gets slower by the minute. I ran an analyze on the hash file and here is the output:

File name .................. ProductHash
Pathname ................... ProductHash
File type .................. DYNAMIC
Hashing Algorithm .......... GENERAL
No. of groups (modulus) .... 157041 current ( minimum 1, 25 empty,
45377 overflowed, 3204 badly )
Number of records .......... 1945217
Large record size .......... 4084 bytes
Number of large records .... 469
Group size ................. 4096 bytes
Load factors ............... 80% (split), 50% (merge) and 80% (actual)
Total size ................. 837636096 bytes
Total size of record data .. 507663322 bytes
Total size of record IDs ... 17892390 bytes
Unused space ............... 312072192 bytes
Total space for records .... 837627904 bytes
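
To make the lookup logic above a little more concrete, conceptually the job is doing something like the following toy Python sketch. This is only an illustration of the logic, not the actual DataStage job; the key, column names and delimiter are made up.

# Toy sketch of "self-referencing lookup to concatenate strings for the same key".
# The key/column names, delimiter and sample rows are made up purely for illustration.
rows_from_oracle = [
    ("P1", "red"), ("P2", "small"), ("P1", "blue"), ("P1", "green"),
]

concatenated = {}  # plays the role of the hash file: key -> running concatenated string

for product_id, description in rows_from_oracle:
    previous = concatenated.get(product_id)  # the lookup against the same file being built
    concatenated[product_id] = (previous + "|" + description) if previous else description

print(concatenated)  # {'P1': 'red|blue|green', 'P2': 'small'}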

I used the Hash file utility to come up with alternate algorithms, but I do not know enough to choose one method over the other.

The reason for building the hash file, rather than going with OCI lookups when such a large number of rows is involved, is that we are inserting into the same table and would run into a "Snapshot Too Old" error if we did that.

I can provide more information if required.

Any inputs / suggestions will be helpful.

Naren
cohesion
Participant
Posts: 8
Joined: Wed Feb 18, 2004 3:32 pm
Location: Canada

Large Hash files

Post by cohesion »

You could try setting the "minimum modulus" closer to the value shown in the analyzer output. Also, if there are relatively few records looked up each time (compared to the size of the table) why not use an OCI lookup but output the data to a seq. file, then insert or bulk insert in a subsequent job?
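
As a rough guide (my own back-of-the-envelope, so please check it against the documentation and the hash file utility you mentioned), you can estimate a starting minimum modulus from the analyzer figures posted above: divide the record data plus record ID bytes by the usable space per group, i.e. the group size times the split load. In Python:

# Back-of-the-envelope estimate of a starting minimum modulus,
# using the figures from the analyze output posted above.
record_data_bytes = 507663322   # "Total size of record data"
record_id_bytes = 17892390      # "Total size of record IDs"
group_size_bytes = 4096         # "Group size"
split_load = 0.80               # "Load factors ... 80% (split)"

usable_bytes_per_group = group_size_bytes * split_load
estimated_modulus = (record_data_bytes + record_id_bytes) / usable_bytes_per_group
print(round(estimated_modulus))  # about 160000, close to the 157041 groups reported

Creating the file with the minimum modulus set somewhere near that figure (plus headroom for growth) means the groups are allocated up front, rather than the file splitting its way up group by group while you load it.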
R. Michael Pickering
Senior Architect
Cohesion Systems Consulting Inc.
nkumar_home
Participant
Posts: 19
Joined: Fri Apr 02, 2004 10:13 am

Re: Large Hash files

Post by nkumar_home »

cohesion wrote: You could try setting the "minimum modulus" closer to the value shown in the analyzer output. Also, if there are relatively few records looked up each time (compared to the size of the table) why not use an OCI lookup but output the data to a seq. file, then insert or bulk insert in a subsequent job?
Hi
Thanks for the suggestion. I am rewriting the job in PX with the output going to a dataset that can later be bulk inserted into the target table.

Does anyone know the difference between the 'GENERAL' and 'SEQ.NUM' hashing algorithms?

thanks
Naren
cohesion
Participant
Posts: 8
Joined: Wed Feb 18, 2004 3:32 pm
Location: Canada

Post by cohesion »

Using SEQ.NUM should work better if the hash key is numeric and regularly increasing, as with a system-maintained sequence number.
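
To illustrate the idea with a toy model (this is only an illustration in Python, not the real UniVerse hashing code): when the keys are sequential numbers, placement based on the numeric value of the key spreads records almost perfectly evenly across the groups, which is the situation SEQ.NUM is intended for; GENERAL has to cope with arbitrary keys and is the safe default otherwise.

# Toy model only: why sequential numeric keys spread evenly when group
# placement is driven by the key's numeric value (the idea behind SEQ.NUM).
from collections import Counter

modulus = 157041                  # number of groups, taken from the analyze output above
keys = range(1, 2_000_001)        # roughly 2 million sequential numeric keys

groups = Counter(key % modulus for key in keys)
print(min(groups.values()), max(groups.values()))  # prints "12 13": a near-perfect spread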
R. Michael Pickering
Senior Architect
Cohesion Systems Consulting Inc.