Performance Issue...Hashfiles and sequential File

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

branimir.kostic
Participant
Posts: 13
Joined: Thu Nov 04, 2004 4:30 am

Performance Issue...Hashfiles and sequential File

Post by branimir.kostic »

Hello everybody,

I am almost despairing over the performance of my DataStage jobs. In some jobs I have to use hash files because of duplicate keys, but writing to or reading from the hash files kills our performance. At the beginning it works very well (about 600 - 1000 rows/sec), but after some time the performance drops rapidly and we end up with only 40 rows/sec. A consultant from Ascential tried to optimize the hash files, but without any success.

I have tried several things. For example, I replaced the hash file with a sequential file. That gives slightly more stable performance, but after some time the rows/sec still drop.

What can I do?

Thanks!
roy
Participant
Posts: 2598
Joined: Wed Jul 30, 2003 2:05 am
Location: Israel

Post by roy »

Hi,
what kind of tuning did you try?
how many rows are involved?
what is the file's size / estimated final size?
are you using dynamic or static hash files?
do you preallocate physical size for the file?
how many columns compose the key?
did you try any method of splitting it into several hash files (vertical/horizontal)?

IHTH,
Roy R.
Time is money but when you don't have money time is all you can afford.

Search before posting:)

Join the DataStagers team effort at:
http://www.worldcommunitygrid.org
roy
Participant
Posts: 2598
Joined: Wed Jul 30, 2003 2:05 am
Location: Israel

Post by roy »

Since you said Windows, here are some more thoughts:
are you using a local disk?
if so, are you by any chance limited by disk I/O?
are you reading from and writing to the same disk?
Roy R.
Time is money but when you don't have money time is all you can afford.

Search before posting:)

Join the DataStagers team effort at:
http://www.worldcommunitygrid.org
branimir.kostic
Participant
Posts: 13
Joined: Thu Nov 04, 2004 4:30 am

Post by branimir.kostic »

Our Windows 2003 server has a RAID setup: RAID 1 for the system disk (where DataStage is installed) and RAID 5 for the hash files. The RAID 5 array consists of 5 disks of 180 GB each, presented as one logical disk.
It's a little bit weird: with 100,000 rows I get great performance, but with 200,000 it seems to me that the job will never finish.

I tried several things...
(1) set the hash file to dynamic and to static
(2) analyzed how many rows will be written to the hash file (about 400,000) and changed the 'minimum modulus', and set the 'group size' to 2
(3) set the large record size to 2102 (the summed length of the columns) - see the sketch below
(4) 6 columns compose the key
(5) I haven't tried splitting the files, because afterwards I would have to join them again
(6) enabled the caching attributes
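
For reference, a minimal sketch of how settings (2) and (3) map onto the underlying UniVerse dynamic file at creation time (the file name HF_EXAMPLE and the 20000 modulus are illustrative only, not values taken from this job):

Code:

CREATE.FILE HF_EXAMPLE DYNAMIC MINIMUM.MODULUS 20000 GROUP.SIZE 2 LARGE.RECORD 2102

Setting MINIMUM.MODULUS high enough up front preallocates the groups, so the dynamic file does not have to split its way up from the default modulus of 1 while the job is loading it.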

None of this helps. The consultant from Ascential couldn't believe it, but I have to find some way to optimize this in order to meet the targets of the test phase.

(?) How can I find out whether we are limited by disk I/O?
(?) How can I preallocate physical size for the file?
chucksmith
Premium Member
Posts: 385
Joined: Wed Jun 16, 2004 12:43 pm
Location: Virginia, USA

Post by chucksmith »

Just in case it is not a size issue, let's talk about another design possibility.

Is it possible to sort your data by your six key columns? This is best done outside of DataStage. Once sorted, the data can be processed sequentially, either (1) using stage variables to track the value of the previous key and drop subsequent rows, or (2) using an aggregator stage and the "Last" derivation to keep the last row with common keys.

Design 1 is basically:

Code:

SORT      Source ---> Xfr ---> Dest
Design 2 is basically:

Code:

SORT      Source ---> Xfr ---> Aggr ---> Dest
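
For Design 1, a minimal sketch of what the Transformer stage variables and output constraint could look like (the names svCurrKey, svIsDup, svPrevKey and the key columns are illustrative, not from the post; the stage variables must be defined in this order, and svPrevKey should be initialised to a value that can never match a real key):

Code:

* Stage variable derivations, evaluated top to bottom for every row
svCurrKey = in.Key1 : '|' : in.Key2 : '|' : in.Key3 : '|' : in.Key4 : '|' : in.Key5 : '|' : in.Key6
svIsDup   = (svCurrKey = svPrevKey)
svPrevKey = svCurrKey

* Output link constraint - pass only the first row of each key group
NOT(svIsDup)

Whether you keep the first or the last row of each key group is a business decision; the aggregator "Last" derivation in Design 2 keeps the last one.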
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Branimir,

If your hash file, at its maximum number of records, is usually about the same size (+/- 10%), then let your job fill the hash file and then, from the Administrator command window, do a
RESIZE <hashfilename> 2 49999 1 {this might take a while; it resizes the file to type 2, modulo 49999 and separation 1. HASH.HELP doesn't work on the default file type of 30, so this step needs to be done first.}
HASH.HELP <hashfilename> {will suggest a new file type, modulo and separation}
PRIME <suggested modulo + 10%>
RESIZE <hashfilename> <the type and separation that HASH.HELP suggested, with the result of PRIME as the new modulo>

This will give you good performance for your current file record count - it does not dynamically adjust for larger sizes (you would start getting overflows, which slow down read and write performance). Changing the group size doesn't do much in most cases, but you seem to have very long keys (a hash file has only one key, so what DataStage does is concatenate your multiple key columns into one real key column). You can also run ANALYZE.FILE and report the results here.
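
As a purely illustrative worked example of that sequence (the file name HF_EXAMPLE and every number below are made up; use the type, modulo and separation that HASH.HELP actually reports for your file):

Code:

RESIZE HF_EXAMPLE 2 49999 1
HASH.HELP HF_EXAMPLE
PRIME 55000
RESIZE HF_EXAMPLE 18 55001 4
ANALYZE.FILE HF_EXAMPLE

Here HASH.HELP is assumed to have suggested type 18 with separation 4 and a modulo of around 50000; adding roughly 10% gives 55000, PRIME reports the nearest prime (55001 in this sketch), and the second RESIZE applies it. ANALYZE.FILE afterwards reports the resulting file statistics, as suggested above.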
roy
Participant
Posts: 2598
Joined: Wed Jul 30, 2003 2:05 am
Location: Israel

Post by roy »

Hi,
what was said above covers it.

Beyond that, did you try getting some OS-level analysis done to see if there are any bottlenecks (get the sysadmins to monitor this and assist you)?

And there is also the simple question of what the disks' RPM is.

I guess if there is in fact nothing you can do to prevent the performance degradation, you'll need to try chucksmith's approach; if you want only one file, it should give reasonable performance.

IHTH,
Roy R.
Time is money but when you don't have money time is all you can afford.

Search before posting:)

Join the DataStagers team effort at:
http://www.worldcommunitygrid.org