Optimal settings to create a large Hashed file

eldonp
Participant
Posts: 47
Joined: Thu Jun 19, 2003 3:49 am

Post by eldonp »

We are having problems creating hashed files.

We have to perform lookups against large data sets - either 3 million rows with 50 columns, or 20 million rows with just a few columns.

When creating such a file, at some point performance slows sharply and the file seems to stop growing. We've checked - the file is well below the 2GB limit.

We'd like to know the optimal settings for creating hashed files with more than 100,000 rows.
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

Periodic resizing will give the appearance that the file is no longer growing, because the Monitor shows no more rows flowing to the file; but if you look in the physical directory you will find that the actual DATA and OVER files are still changing. You also need to look at the performance of the process writing the file (prstat, topas, glance, top, etc.) and see if it is using a full CPU. If the job is SEQ --> XFM --> HASH, the job should either be fully using a CPU or waiting on the disks to catch up with growing/receiving the data.

The best thing you can do is pre-size the file with a minimum modulus setting based on the expected high-water mark.
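
As a sketch of what that looks like (assuming the UniVerse-style engine under DataStage Server; the file name MyLookup and the modulus value are only illustrative, so size yours from your own volumes), you can create the file from TCL already sized for the expected data:

    CREATE.FILE MyLookup DYNAMIC MINIMUM.MODULUS 400000

With MINIMUM.MODULUS set near the expected high-water mark, the dynamic file does not have to keep splitting groups (and rewriting the DATA and OVER portions) while the job is loading it. The same create-file options can also be set in the Hashed File stage if you let the job build the file.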
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
eldonp
Participant
Posts: 47
Joined: Thu Jun 19, 2003 3:49 am

Post by eldonp »

To add to that: what are the memory considerations from a hardware perspective?

Is much memory needed to create and read from hashed files?
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

Forget about memory - it's very light. Worry about what I posted: first verify the problem. If your issue is disk contention, talking about other things doesn't matter. Analyze your system and see what's going on.
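
As one concrete check (a sketch - MyLookup is a placeholder name, and the exact output varies by release), ANALYZE.FILE at TCL shows what the file itself is doing while the load runs:

    ANALYZE.FILE MyLookup

For a dynamic file this reports things like the current modulus, minimum modulus and group size, so you can see whether the slowdown coincides with the file continually splitting groups, while top/prstat on the loading process tells you whether you are CPU-bound or waiting on disk.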
Kenneth Bland

ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

DataStage reports rows flowing while they are flowing. If they are flowing into the cache, fine, you get good rates. But the clock keeps running while those rows are being flushed to disk, even though no more rows are flowing, so the rate appears to diminish.

The optimal settings depend primarily on the combination of row size, number of rows and internal storage overhead. That's why the Hashed File Calculator exists.
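
As a rough worked example (the figures are illustrative only, not a substitute for the calculator): 3 million rows averaging 200 bytes each is about 600 MB of data. A dynamic file with 2 KB groups run at about 80% loading gives roughly 1,638 usable bytes per group, so you would be looking at a minimum modulus somewhere around 600,000,000 / 1,638 ≈ 366,000 groups - and per-record storage overhead pushes the real number higher still.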
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.