insert into hash files

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

roy
Participant
Posts: 2598
Joined: Wed Jul 30, 2003 2:05 am
Location: Israel

Post by roy »

Hi,
this issue also depends on the disks you have (speed/size).

Anyway, do you need to unload it to a hash file?
If no lookups are made on the output file, use a sequential file instead.

If you do need lookups, is the "enable write cache" checkbox ticked on the hash file's input?

Bear in mind that hash files are also limited in size (2 GB), unless you create them manually as 64-bit files.
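For example, you can pre-create the file as 64-bit from the Administrator client's command window before the job runs. The file name below is only a placeholder and the exact keywords can vary between UniVerse releases, so treat this as a sketch:

   CREATE.FILE MyHash DYNAMIC MINIMUM.MODULUS 100000 64BIT

or, for a file that already exists:

   RESIZE MyHash * * * 64BIT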

Good Luck,



Roy R.
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX

Post by kduke »

Markos

Make sure you have a minimum modulo on this file. That will definitely increase your performance. A dynamic hash file will keep resizing itself as it grows. Dynamic files are the default type of hash file. ANALYZE.FILE will tell you how big to make the minimum modulo.
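For example, from the Administrator command window (MyHash is just a placeholder name and the modulo figure is purely illustrative; use whatever ANALYZE.FILE suggests):

   ANALYZE.FILE MyHash
   CONFIGURE.FILE MyHash MINIMUM.MODULUS 200000

If I have the syntax right, CONFIGURE.FILE lets you change the minimum modulus of an existing dynamic file in place.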

Also, I would turn off the cache. This file is probably too big to keep in memory, but I would try it and see.

If you are on UNIX, and probably Windows too, the closer a filesystem gets to 100% full, the slower it gets: it spends most of its time looking for free space. I have seen this reduce speeds by a fourth or more. You used to be able to control this by rebuilding a filesystem so that files were laid out contiguously on disk, much like defragmenting a disk; that improves performance a lot.

Kim.

Kim Duke
DwNav - ETL Navigator
www.Duke-Consulting.com
degraciavg
Premium Member
Posts: 39
Joined: Tue May 20, 2003 3:36 am
Location: Singapore

Post by degraciavg »

In addition to what Kim suggested, you might want to consider a Static hash file. It is known to increase performance by at least 30%.

You can use the Hash File Calculator (HFC) to work out the size to pre-allocate for your hash file. HFC is found on the DataStage installation CD under Unsupported Utilities.

The downside is that you'll lose all data that overflows your pre-defined file size. This is recommended if the volume of your data doesn't vary much (i.e. doesn't significantly increase) between runs.
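For example, to pre-create a static hashed file from the Administrator command window (the file name, file type, modulo and separation below are only placeholders; take the real figures from HFC):

   CREATE.FILE MyLookup 18 40009 4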


regards,
vladimir
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX

Post by kduke »

Markos

Roy is right: you can improve performance if you size it right. Size it wrong and you kill performance. You need the HASH.HELP command to analyze a hash file that is not a DYNAMIC hash file. Use type 18 no matter what it recommends for the type.

Here are some other tricks I used in benchmarks on UniVerse. Undersize a hash file if it is used as a source: there is less blank space to read. Oversize a hash file when it is used as a lookup: it finds the record faster, because there are fewer records per group.
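For example (MyHash is only a placeholder name):

   HASH.HELP MyHash

It reads through the file, so it can take a while on a big one. Whatever file type it suggests, create the file as type 18 as above.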

Kim.

Kim Duke
DwNav - ETL Navigator
www.Duke-Consulting.com
spracht
Participant
Posts: 105
Joined: Tue Apr 15, 2003 11:30 pm
Location: Germany

Post by spracht »

quote:Originally posted by degraciavg

The downside is that you'll lose all data that overflows your pre-defined file size. This is recommended if the volume of your data doesn't vary much (i.e. doesn't significantly increase) between runs.



Vladimir (and all),

Will there be a warning or a fatal error in Director if the file size is exceeded? What is the tolerance between the file size at creation time and the maximum? Would you go for a static file if you knew that it won't exceed its limit in the next six months, but probably will within two years?

Thanks in advance!

Stephan
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Assuming that the file is created with 64-bit pointers, so that the 2GB limit caused by 32-bit pointers does not come into play, there will be no warnings because there will be no integrity problems, only efficiency problems. If a static hashed file has more data in it than the allocated number of groups allows, then some of the groups will overflow. However, all this means is that additional buffers are daisy-chained to the end of the group; no data are lost. If a dynamic hashed file gets more data than its current number of groups allows (multiplied by its SPLIT.LOAD factor, default 80%), then the file automatically grows a new group.
Pre-allocating the correct number of groups is preferred, as other posters have suggested, but this requires a priori knowledge of the total amount of data to be stored.
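As a rough sizing sketch only (assuming, say, 2.6 million rows averaging 120 bytes each in a dynamic file with a 2 KB group size and the default 80% split load; all of these are illustrative numbers, so substitute your own):

   2,600,000 rows x 120 bytes   = roughly 312 MB of data
   usable space per group       = 2048 bytes x 0.80 (SPLIT.LOAD) = about 1638 bytes
   groups to pre-allocate       = 312,000,000 / 1638 = about 190,000

   CREATE.FILE MyHash DYNAMIC MINIMUM.MODULUS 190000 64BIT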

Ray Wurlod
Education and Consulting Services
ABN 57 092 448 518
degraciavg
Premium Member
Posts: 39
Joined: Tue May 20, 2003 3:36 am
Location: Singapore

Post by degraciavg »

quote:Originally posted by spracht
Vladimir (and all),

Would you go for a static file if you knew that it won't exceed its limit in the next six months, but probably will within two years?

Thanks in advance!

Stephan


Hi Stephan,

It all boils down to how much performance improvement you gain by using this solution. If it is significant, then I have no qualms about using it. You just have to monitor the hash file to see whether it has exceeded the maximum size.

One way to monitor it is to check the record count. If 2.6M distinct records are extracted from the source, then the hash file must contain 2.6M records too. If they're not the same (or if the count has breached your "threshold" value), then you need to increase the size of your hash file.
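For example, from the Administrator command window (MyHash being a placeholder name):

   COUNT MyHash

and compare the figure it reports with the row count extracted from the source.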


regards,
vladimir
roy
Participant
Posts: 2598
Joined: Wed Jul 30, 2003 2:05 am
Location: Israel

Post by roy »

Hi,
Another thing you can do, in case you are on a 32-bit OS, can't get hash files bigger than 2 GB and really need hash files, is:

split the output into several hash files via a transformer, using round robin
[i.e. MOD(@INROWNUM, n) = 1 goes to link 1, MOD(@INROWNUM, n) = 2 goes to link 2 and so on, where n is the number of hash files]
in the constraint section of a Transformer stage, as sketched below.

This will gain you speed, but it means you end up with n hash files and later n lookups instead of only one.
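As a sketch, splitting across three hash files (three is only an example count), the constraints on the three output links would look something like:

   Link1:  MOD(@INROWNUM, 3) = 1
   Link2:  MOD(@INROWNUM, 3) = 2
   Link3:  MOD(@INROWNUM, 3) = 0

Note that the last link has to catch the rows where MOD returns 0.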

Hope This Helps,


Roy R.