Hashfile sizing

dwscblr
Participant
Posts: 19
Joined: Tue May 18, 2004 12:39 am

Hashfile sizing

Post by dwscblr »

Hi,
I am new to DataStage and hence have been inundating the forum with a lot of beginner questions.

Here's one more.
I have to create a hashfile containing 30 million records.
Key column: customer id, varchar(20)
Non-key columns: customer name, zipcode; total size 60 bytes.
i.e. the size of each record is 80 bytes.

The hashfile will eventually grow to 50 million records by the end of this year.

I have chosen:
File type: dynamic
Minimum modulus: 1,000,000 (50M / 50 rows per group)
Group size: 2 (4K)
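
For reference, the arithmetic behind those numbers (a rough sketch that assumes 4096-byte groups and ignores any per-record overhead):

```python
# Rough sizing arithmetic for the dynamic hashed file (no overhead allowance).
rows         = 50_000_000   # expected row count by year end
record_bytes = 80           # 20-byte key + 60 bytes of non-key data
group_bytes  = 4096         # group size 2 = 4 KB groups

rows_per_group  = group_bytes // record_bytes   # ~51 rows fit per group
minimum_modulus = -(-rows // rows_per_group)    # ceiling division -> ~980,393

print(rows_per_group, minimum_modulus)          # 51  980393, i.e. roughly the 1,000,000 above
```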

Have I got it right? Or am I way off track? Should I leave it for DataStage to decide?
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX

Post by kduke »

That will definitely help. The best way to pick this is to run the job and then run ANALYZE.FILE MyHashFile; it will tell you the minimum modulus. If you really want to be precise, there is a post by Ray which discusses the per-record overhead for the header and trailer information Universe needs. A close number will definitely improve performance a lot; it does not need to be exact.
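
To put that overhead point into rough numbers, here is the same estimate with a per-record allowance added. The overhead figure below is a placeholder, not the actual value from Ray's post (a sketch only):

```python
# Same modulus estimate, this time allowing for per-record storage overhead.
# OVERHEAD_BYTES is a placeholder; substitute the figure from Ray's post.
rows           = 50_000_000
data_bytes     = 80            # key + non-key data per record
OVERHEAD_BYTES = 12            # placeholder header/trailer allowance per record
group_bytes    = 4096

stored_bytes    = data_bytes + OVERHEAD_BYTES
rows_per_group  = group_bytes // stored_bytes    # fewer rows fit per group
minimum_modulus = -(-rows // rows_per_group)     # a larger, safer minimum modulus

print(minimum_modulus)
```
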
Mamu Kim
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

Why is your hash file so big? Are you maintaining a copy of a target table (or subset of columns) in a persistent staging hash file? Why not simply populate the hash file at runtime with those rows necessary for reference?

Your hash file is too big for caching, requires a significant maintenance investment, can potentially become corrupted (leading to massive reload processing), and will potentially need rebuilding when new columns are required. While what you are attempting is feasible, there are only a limited number of reasons to build hash files this big. My guess is that, as a beginner to DataStage, you heard that hash files are great reference objects. Sometimes people build a copy of their target table within their ETL tool in an attempt to use them, but ultimately lose more than they gain.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
dwscblr
Participant
Posts: 19
Joined: Tue May 18, 2004 12:39 am

Post by dwscblr »

Actually I too am against the use of such a large hashfile for lookups.

What we are trying to do is this:
1. Table 1 contains customer details.
2. Tables 2, 3 and 4 also need last name, first name and zipcode among their columns.

Tables 1, 2, 3 and 4 are loaded sequentially, one after the other. Each table receives data from its own set of files. It is not possible for the source system to send the last name, first name and zipcode with the individual files.

So we load table 1, and then while loading tables 2, 3 and 4 we look up the hashfile for each table to get the last name, first name and zipcode. The other alternative would be to use the OCI9 stage for the lookup.
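
For comparison, the OCI9 alternative boils down to issuing a reference query like the one below for each incoming row. The connection string, table and column names here are made up for illustration (a sketch using cx_Oracle outside of DataStage, just to show the shape of the lookup):

```python
# Illustration of the per-row reference lookup the OCI9 stage would perform.
# Connection string, table and column names are hypothetical.
import cx_Oracle  # assumes an Oracle client and this driver are available

conn = cx_Oracle.connect("user/password@dsn")
cur = conn.cursor()

LOOKUP_SQL = """
    SELECT last_name, first_name, zipcode
      FROM customer_dim
     WHERE customer_id = :cust_id
"""

def lookup(cust_id):
    """Return (last_name, first_name, zipcode), or Nones when unmatched."""
    cur.execute(LOOKUP_SQL, cust_id=cust_id)
    row = cur.fetchone()
    return row if row else (None, None, None)
```
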
ogmios
Participant
Posts: 659
Joined: Tue Mar 11, 2003 3:40 pm

Post by ogmios »

I would be in favor of OCI. How many rows will there be in tables 2, 3 and 4 for an average run?

Ogmios
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

Okay, but are you processing ALL target rows every time? Probably not. You are probably processing only a small fraction of your data each load cycle. Therefore, you only NEED in your hash file the data pertinent to your current load cycle.

Consider this approach. Parse your source data files and collect a unique list of natural customer keys. That is the finite list of rows you need in your hash file. Take that list and put it into a work table of just the natural key. Inner join your customer table to that work table and spool the results into a hash file. You now have a hash lookup that works exactly as you need, but is rebuilt each run. It also is not 50 million rows in size, a size that would put it outside of all the performance characteristics that make hash files so fast. The point about corruption becomes irrelevant, as each run rebuilds it.
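
A rough sketch of that flow outside of DataStage, just to show its shape; the file locations, work table and column names are illustrative assumptions, and in a real server job the join and spool would be done with OCI and Hashed File stages:

```python
# Sketch of the "rebuild the lookup each run" approach described above.
# Paths, table names and column names are illustrative assumptions.
import csv
import glob
import cx_Oracle  # assumes an Oracle client and this driver are available

# 1. Parse this run's source files and collect the unique natural customer keys.
keys = set()
for path in glob.glob("/data/incoming/*.csv"):          # hypothetical location
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            keys.add(row["customer_id"])

conn = cx_Oracle.connect("user/password@dsn")
cur = conn.cursor()

# 2. Put the keys into a work table of just the natural key.
cur.execute("TRUNCATE TABLE wk_cust_keys")
cur.executemany("INSERT INTO wk_cust_keys (customer_id) VALUES (:1)",
                [(k,) for k in keys])

# 3. Inner join the customer table to the work table; only the customers
#    touched this load cycle come back, and they become this run's lookup.
cur.execute("""
    SELECT c.customer_id, c.last_name, c.first_name, c.zipcode
      FROM customer_dim c
      JOIN wk_cust_keys w ON w.customer_id = c.customer_id
""")

# In the real job this result is spooled into a hashed file; a dict stands in here.
reference = {cid: (last, first, zipc) for cid, last, first, zipc in cur}
```
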
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle