Hash file limitation

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

yiminghu
Participant
Posts: 91
Joined: Fri Nov 14, 2003 10:09 am

Hash file limitation

Post by yiminghu »

Hi,

Is there a limitation on hash file size? I have a job which inserts new billing information into a billing table that has more than 30 million records. The input file provided by the source system may contain billing information that is already in the table, which means that before I do the insertion I have to check whether each record already exists. The best way would be to use a hash file, but the table is so big, and the number of key fields is 9. I don't know whether a hash file would be able to handle such a big volume?
If I cannot use a hash file, what is my alternative?

Carol
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Carol,

DataStage hash files are stored as either one or two file system files; thus if your file system limits files to 2GB then you might run into a problem there. In addition, you can specify a 64BIT file pointer size for very large hash files if need be. Your proposed file, while quite large, is within the capabilities of hash files. I am assuming that your record width is not particularly big - if you were storing 1MB of data per record, that would be a different matter.

You should pre-calculate your file modulus if you stay with a type 30 (dynamic) file, so that you don't suffer severe slowdowns from group splitting while writing.
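A minimal sketch of what that pre-sizing might look like, assuming a before-job subroutine and assuming your engine accepts MINIMUM.MODULUS and 64BIT on CREATE.FILE; the file name, modulus, and routine name here are only examples:

Code: Select all

* Sketch only: pre-create a dynamic (type 30) hashed file with a large
* MINIMUM.MODULUS and 64-bit addressing so the load does not spend its
* time splitting groups as the file grows.
Cmd = "CREATE.FILE BILLING_KEYS DYNAMIC MINIMUM.MODULUS 999983 64BIT"
Call DSExecute("UV", Cmd, CmdOutput, RtnCode)
If RtnCode <> 0 Then
   Call DSLogWarn("CREATE.FILE failed: " : CmdOutput, "PreCreateHashedFile")
End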
anupam
Participant
Posts: 172
Joined: Fri Apr 04, 2003 10:51 pm
Location: India

Post by anupam »

Yes, there is a size limitation on hash files: with the default 32-bit addressing a hash file cannot grow beyond 2 GB, and in fact a hash file should ideally be kept as small as possible.

There is a way around this if you still need the hash file to grow beyond 2 GB: you can create the hash file with 64-bit pointers, and it can then grow far beyond 2 GB.

But please analyse the requirement before creating such a big hash file. Think twice about whether you really need it.
----------------
Rgds,
Anupam
----------------
The future is not something we enter. The future is something we create.
alanwms
Charter Member
Posts: 28
Joined: Wed Feb 26, 2003 2:51 pm
Location: Atlanta/UK

Post by alanwms »

Carol,

How many new records are you inserting into your 30MM row table when you run the job? How many are updates? You may want to create a small table in your target of just the key fields of the records to be processed, truncate and load that table, then select from it with an inner join to your 30MM row table and store the results in a hash file. This hash file would then contain only the keys of records from your incoming dataset that already exist in the target, so you could do a lookup against it as you process each incoming row. If the lookup fails, the row goes to an output link for inserts; if the lookup is successful (the record exists in the 30MM table), the row goes to an update link.
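A rough sketch of the Transformer output-link constraints for that split, assuming the hashed file reference link is named lkpKeys (a name invented here purely for illustration):

Code: Select all

* Constraint on the insert output link: key not found in the hashed file.
lkpKeys.NOTFOUND
* Constraint on the update output link: key already exists in the 30MM table.
Not(lkpKeys.NOTFOUND)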

You could take another approach and introduce partitioned parallelism into your Server job. Using a partitioned approach, you could split your hash file into 5 or 10 or more separate hash files and use some part of the first key field to determine which partition each row belongs to. There needs to be an exact match between the algorithm used to partition the incoming data and the algorithm used to build the lookup hash files. For more on partitioning parallelism in server jobs, look at the documentation on link splitting/collecting and also job instances.
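A rough sketch of such a partitioning expression, assuming (purely for illustration) that the first key column is called ACCOUNT_NO, that it ends in digits, and that you use 5 partitions:

Code: Select all

* Derive a partition number 0-4 from the last two digits of the first key
* column. The SAME expression must be used when building the lookup hashed
* files and when splitting the incoming data, or lookups will miss.
NumPartitions = 5
PartitionNo = Mod(ACCOUNT_NO[Len(ACCOUNT_NO) - 1, 2], NumPartitions)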

Alan
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

A key-only hashed file would do the job for you. Assuming your operating system allows files over 2GB, you can create or resize a hashed file to use 64-bit addressing, allowing a theoretical upper limit of approximately 19 million TB.

In your case, you have 30 million keys, let's say 100 bytes per key. So that's approximately 3GB. No problem at all, but you WILL need to use 64-bit addressing (or separate hashed files, as Alan suggested).

Immediately after creating the hashed file (so that it's quick), execute the following commands from the Administrator client Command window:

Code: Select all

SETFILE hashedfilepath hashedfilename OVERWRITING
RESIZE hashedfilename * * * 64BIT
The SETFILE command establishes a pointer to the hashed file in the project's "vocabulary" (VOC file); if you created the hashed file in the project itself, this step can be omitted. The RESIZE command then switches the file to 64-bit addressing.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.