Question on Hash File Performance

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.


dsuser
Participant
Posts: 12
Joined: Mon Sep 15, 2003 9:11 pm

Question on Hash File Performance

Post by dsuser »

Hi,

I have a table with almost 20 million records. Several jobs need this table as a lookup, but except for one job, all of the others need only a subset of its records. Is it better to create one hash file with the full 20 million rows and use it in all jobs, or to create separate hash files containing only the subset each job needs, say 2, 5, 6, or 7 million rows, and use those in the respective jobs? Which option is best, and what would be the difference in performance?
Can someone help with this?

Thanks in advance.
dsuser
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

Typically, the data being processed never requires the whole reference table. Small reference tables and dimensions can easily be put into hash lookups in their entirety, but the large dimensions have issues.

For example, a customer dimension tends to be very large, while a product dimension is usually smaller. You could typically put the product table into a hash file in its entirety, but the customer table could be like your situation, 20-30 million rows. However, the fact data you are probably processing on an incremental basis almost never covers the complete customer base, usually only an insignificant portion of it.

The trick is to prescan your source data and collect the distinct list of natural keys for your large dimension. Put that distinct list into a work table in the target database. Then, in the job that pulls that table into a hash file, change the extract to inner join to your ETL work table of natural keys. This limits the rows put into the hash file to the minimum number required.
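
A minimal sketch of that idea in SQL, with made-up names throughout (SRC_FACT_STAGE for the staged incremental source, ETL_CUST_KEYS for the work table, CUSTOMER_DIM and CUST_NK for the dimension and its natural key); adapt them to your own schema:

-- 1. Prescan the incremental source and land the distinct natural keys
--    in a small work table in the target database.
INSERT INTO ETL_CUST_KEYS (CUST_NK)
SELECT DISTINCT CUST_NK
FROM SRC_FACT_STAGE;

-- 2. In the job that builds the customer hash file, swap the full-table
--    extract for an inner join to the work table, so only the customers
--    referenced by this load end up in the hash file.
SELECT d.*
FROM CUSTOMER_DIM d
INNER JOIN ETL_CUST_KEYS k ON d.CUST_NK = k.CUST_NK;

The second query would replace the full-table SELECT in whatever database stage feeds the hashed file.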

By reducing the number of rows going into your hash file, you get a better-optimized file that uses less space and takes less time to extract and load.

You can do this for all of your job streams; just have each of them use its own hash file.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
Peytot
Participant
Posts: 145
Joined: Wed Jun 04, 2003 7:56 am
Location: France

Post by Peytot »

It's better to have several hash files, each containing only the rows needed for its lookup (depending on your mapping rules). They run faster than one big hash file.

Pey
1stpoint
Participant
Posts: 165
Joined: Thu Nov 13, 2003 2:10 pm

Post by 1stpoint »

kcbland wrote: By reducing the number of rows going into your hash file, you get a better-optimized file that uses less space and takes less time to extract and load.
You can also, depending on the size of the more optimized hash files, load them into memory.