Optimal settings to create a large Hashed file

eldonp
Participant
Posts: 47
Joined: Thu Jun 19, 2003 3:49 am

Post by eldonp »

We are having problems creating hashed files.

We have to perform lookups against large data sets - either 3 million rows with 50 columns, or 20 million rows with just a few columns.

When creating such a file, at some point performance slows sharply and the file seems to stop growing. We've checked - the file is well below the 2GB limit.

We'd like to know the optimal settings for creating hashed files with more than 100,000 rows.
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

Periodic resizing will give the appearance that the file is no longer growing, because the Monitor shows no more rows flowing to the file; but if you look in the physical directory you will find that the actual DATA and OVER files are still changing. You also need to look at the performance of the process writing the file (prstat, topas, glance, top, etc.) and see if it is using a full CPU. If the job is SEQ --> XFM --> HASH, the job should either be fully using a CPU or waiting on the disks to catch up with growing/receiving the data.

The best thing you can do is pre-size the file with a minimum modulus setting based on the expected high-water mark.
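
As a sketch of what that looks like (assuming the UniVerse-style engine under DataStage Server; the file name MyLookup and the modulus value are only illustrative, so size yours from your own volumes), you can create the file from TCL already sized for the expected data:

    CREATE.FILE MyLookup DYNAMIC MINIMUM.MODULUS 400000

With MINIMUM.MODULUS set near the expected high-water mark, the dynamic file does not have to keep splitting groups (and rewriting the DATA and OVER portions) while the job is loading it. The same create-file options can also be set in the Hashed File stage if you let the job build the file.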
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
eldonp
Participant
Posts: 47
Joined: Thu Jun 19, 2003 3:49 am

Post by eldonp »

To add to that: what are the memory considerations from a hardware perspective?

Is much memory needed to create and read from hashed files?
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

Forget about memory - it's very light. Worry about what I posted: first verify the problem. If your issue is disk contention, talking about other things doesn't matter. Analyze your system and see what's going on.
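
As one concrete check (a sketch - MyLookup is a placeholder name, and the exact output varies by release), ANALYZE.FILE at TCL shows what the file itself is doing while the load runs:

    ANALYZE.FILE MyLookup

For a dynamic file this reports things like the current modulus, minimum modulus and group size, so you can see whether the slowdown coincides with the file continually splitting groups, while top/prstat on the loading process tells you whether you are CPU-bound or waiting on disk.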
Kenneth Bland

ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

DataStage reports rows flowing while they are flowing. If they are flowing into the cache, fine, you get good rates. But the clock keeps running while those rows are being flushed to disk, even though no more rows are flowing, so the rate appears to diminish.

The optimal settings depend primarily on the combination of row size, number of rows and internal storage overhead. That's why the Hashed File Calculator exists.
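
As a rough worked example (the figures are illustrative only, not a substitute for the calculator): 3 million rows averaging 200 bytes each is about 600 MB of data. A dynamic file with 2 KB groups run at about 80% loading gives roughly 1,638 usable bytes per group, so you would be looking at a minimum modulus somewhere around 600,000,000 / 1,638 ≈ 366,000 groups - and per-record storage overhead pushes the real number higher still.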
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.