Hashfile sizing

dwscblr
Participant
Posts: 19
Joined: Tue May 18, 2004 12:39 am

Hashfile sizing

Post by dwscblr »

Hi,
I am new to DataStage and hence have been inundating the forum with a lot of beginner questions.

Here's one more.
I have to create a hashfile containing 30 million records.
Key column: customer id, varchar(20)
Non-key columns: customer name, zipcode; total size 60 bytes.
i.e. the size of each record is 80 bytes.

The hashfile will eventually grow to 50 million records by the end of this year.

I have chosen:
File type: dynamic
Minimum modulus: 1,000,000 (50M / 50 rows per group)
Group size: 2 (4K)
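
For reference, the arithmetic behind those numbers (a rough sketch that assumes 4096-byte groups and ignores any per-record overhead):

```python
# Rough sizing arithmetic for the dynamic hashed file (no overhead allowance).
rows         = 50_000_000   # expected row count by year end
record_bytes = 80           # 20-byte key + 60 bytes of non-key data
group_bytes  = 4096         # group size 2 = 4 KB groups

rows_per_group  = group_bytes // record_bytes   # ~51 rows fit per group
minimum_modulus = -(-rows // rows_per_group)    # ceiling division -> ~980,393

print(rows_per_group, minimum_modulus)          # 51  980393, i.e. roughly the 1,000,000 above
```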

Have I got it right? Or am I way off track? Should I leave it for DataStage to decide?
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX

Post by kduke »

That will definitely help. The best way to pick this is to run the job and then run ANALYZE.FILE MyHashFile; it will tell you the minimum modulus. If you really want to be precise, there is a post by Ray which discusses the per-record overhead for the header and trailer information Universe needs. A close number will definitely improve performance a lot; it does not need to be exact.
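
To put that overhead point into rough numbers, here is the same estimate with a per-record allowance added. The overhead figure below is a placeholder, not the actual value from Ray's post (a sketch only):

```python
# Same modulus estimate, this time allowing for per-record storage overhead.
# OVERHEAD_BYTES is a placeholder; substitute the figure from Ray's post.
rows           = 50_000_000
data_bytes     = 80            # key + non-key data per record
OVERHEAD_BYTES = 12            # placeholder header/trailer allowance per record
group_bytes    = 4096

stored_bytes    = data_bytes + OVERHEAD_BYTES
rows_per_group  = group_bytes // stored_bytes    # fewer rows fit per group
minimum_modulus = -(-rows // rows_per_group)     # a larger, safer minimum modulus

print(minimum_modulus)
```
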
Mamu Kim
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

Why is your hash file so big? Are you maintaining a copy of a target table (or subset of columns) in a persistent staging hash file? Why not simply populate the hash file at runtime with those rows necessary for reference?

Your hash file is too big for caching, requires a significant maintenance investment, can potentially become corrupted (leading to massive reload processing), and will potentially need rebuilding when new columns are required. While what you are attempting is feasible, there are only a limited number of reasons to build hash files this big. My guess is that, as a beginner to DataStage, you heard that hash files are great reference objects. Sometimes people build a copy of their target table within their ETL tool in an attempt to use them, but ultimately lose more than they gain.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
dwscblr
Participant
Posts: 19
Joined: Tue May 18, 2004 12:39 am

Post by dwscblr »

Actually I too am against the use of such a large hashfile for lookups.

What we are trying to do is this:
1. Table 1 contains customer details.
2. Tables 2, 3 and 4 also need last name, first name and zipcode among their columns.

Tables 1, 2, 3 and 4 are loaded sequentially, one after the other. Each table receives data from its own set of files. It is not possible for the source system to send the last name, first name and zipcode with the individual files.

So we load table 1, and then while loading tables 2, 3 and 4 we look up the hashfile for each table to get the last name, first name and zipcode. The other alternative would be to use the OCI9 stage for the lookup.
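
For comparison, the OCI9 alternative boils down to issuing a reference query like the one below for each incoming row. The connection string, table and column names here are made up for illustration (a sketch using cx_Oracle outside of DataStage, just to show the shape of the lookup):

```python
# Illustration of the per-row reference lookup the OCI9 stage would perform.
# Connection string, table and column names are hypothetical.
import cx_Oracle  # assumes an Oracle client and this driver are available

conn = cx_Oracle.connect("user/password@dsn")
cur = conn.cursor()

LOOKUP_SQL = """
    SELECT last_name, first_name, zipcode
      FROM customer_dim
     WHERE customer_id = :cust_id
"""

def lookup(cust_id):
    """Return (last_name, first_name, zipcode), or Nones when unmatched."""
    cur.execute(LOOKUP_SQL, cust_id=cust_id)
    row = cur.fetchone()
    return row if row else (None, None, None)
```
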
ogmios
Participant
Posts: 659
Joined: Tue Mar 11, 2003 3:40 pm

Post by ogmios »

I would be in favor of OCI. How many rows will there be in tables 2, 3 and 4 for an average run?

Ogmios
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

Okay, but are you processing ALL target rows every time? Probably not. You are probably processing only a small fraction of your data each load cycle. Therefore, you only NEED in your hash file the data pertinent to your current load cycle.

Consider this approach. Parse your source data files and collect a unique list of natural customer keys. That is the finite list of rows you need in your hash file. Take that list and put it into a work table of just the natural key. Inner join your customer table to that work table and spool the results into a hash file. You now have a hash lookup that works exactly as you need, but is rebuilt each run. It also is not 50 million rows in size, a size that would put it outside of all the performance characteristics that make hash files so fast. The point about corruption becomes irrelevant, as each run rebuilds it.
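
A rough sketch of that flow outside of DataStage, just to show its shape; the file locations, work table and column names are illustrative assumptions, and in a real server job the join and spool would be done with OCI and Hashed File stages:

```python
# Sketch of the "rebuild the lookup each run" approach described above.
# Paths, table names and column names are illustrative assumptions.
import csv
import glob
import cx_Oracle  # assumes an Oracle client and this driver are available

# 1. Parse this run's source files and collect the unique natural customer keys.
keys = set()
for path in glob.glob("/data/incoming/*.csv"):          # hypothetical location
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            keys.add(row["customer_id"])

conn = cx_Oracle.connect("user/password@dsn")
cur = conn.cursor()

# 2. Put the keys into a work table of just the natural key.
cur.execute("TRUNCATE TABLE wk_cust_keys")
cur.executemany("INSERT INTO wk_cust_keys (customer_id) VALUES (:1)",
                [(k,) for k in keys])

# 3. Inner join the customer table to the work table; only the customers
#    touched this load cycle come back, and they become this run's lookup.
cur.execute("""
    SELECT c.customer_id, c.last_name, c.first_name, c.zipcode
      FROM customer_dim c
      JOIN wk_cust_keys w ON w.customer_id = c.customer_id
""")

# In the real job this result is spooled into a hashed file; a dict stands in here.
reference = {cid: (last, first, zipc) for cid, last, first, zipc in cur}
```
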
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle