Question on Hash File Performance

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.


dsuser
Participant
Posts: 12
Joined: Mon Sep 15, 2003 9:11 pm

Question on Hash File Performance

Post by dsuser »

Hi,

I have a table with almost 20 million records. Several jobs need this table as a lookup, but except for one job, all of the others need only a subset of its records. Is it better to create one hash file with the full 20 million rows and use it in all jobs, or to create separate hash files containing only the subset each job needs, say 2, 5, 6, or 7 million rows, and use those in the respective jobs? Which option is best, and what would be the difference in performance?
Can someone help with this?

Thanks in advance.
dsuser
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

Typically, the data being processed never requires the whole reference table. Small reference tables and dimensions can easily be put into hash lookups in their entirety, but the large dimensions have issues.

For example, a customer dimension tends to be very large, while a product dimension is usually smaller. You could typically put the product table into a hash file in its entirety, but the customer table could be like your situation, 20-30 million rows. However, the fact data you are probably processing on an incremental basis almost never covers the complete customer base, usually only an insignificant portion of it.

The trick is to prescan your source data and collect the distinct list of natural keys for your large dimension. Put that distinct list into a work table in the target database. Then, in the job that pulls that table into a hash file, change the extract to inner join to your ETL work table of natural keys. This limits the rows put into the hash file to the minimum number required.
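
A minimal sketch of that idea in SQL, with made-up names throughout (SRC_FACT_STAGE for the staged incremental source, ETL_CUST_KEYS for the work table, CUSTOMER_DIM and CUST_NK for the dimension and its natural key); adapt them to your own schema:

-- 1. Prescan the incremental source and land the distinct natural keys
--    in a small work table in the target database.
INSERT INTO ETL_CUST_KEYS (CUST_NK)
SELECT DISTINCT CUST_NK
FROM SRC_FACT_STAGE;

-- 2. In the job that builds the customer hash file, swap the full-table
--    extract for an inner join to the work table, so only the customers
--    referenced by this load end up in the hash file.
SELECT d.*
FROM CUSTOMER_DIM d
INNER JOIN ETL_CUST_KEYS k ON d.CUST_NK = k.CUST_NK;

The second query would replace the full-table SELECT in whatever database stage feeds the hashed file.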

By reducing the number of rows going into your hash file, you get a better-optimized file that uses less space and takes less time to extract and load.

You can do this for all of your job streams; just have each of them use its own hash file.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
Peytot
Participant
Posts: 145
Joined: Wed Jun 04, 2003 7:56 am
Location: France

Post by Peytot »

It's better to have several hash files, each containing only the rows needed for its lookup (depending on your mapping rules). They run faster than one big hash file.

Pey
1stpoint
Participant
Posts: 165
Joined: Thu Nov 13, 2003 2:10 pm

Post by 1stpoint »

kcbland wrote: By reducing the number of rows going into your hash file, you get a better-optimized file that uses less space and takes less time to extract and load.
You can also, depending on the size of the more optimized hash files, load them into memory.