insert into hash files

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

roy
Participant
Posts: 2598
Joined: Wed Jul 30, 2003 2:05 am
Location: Israel

Post by roy »

Hi,
this issue also depends on the disks you have (speed/size).

Anyway, do you need to unload it to a hash file?
If no lookups are made on the output file, use a sequential file instead.

If you do need lookups, is the "enable write cache" checkbox ticked on the hash file's input?

Bear in mind that hash files are also limited in size (2 GB), unless you create them manually as 64-bit files.
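For example, you can pre-create the file as 64-bit from the Administrator client's command window before the job runs. The file name below is only a placeholder and the exact keywords can vary between UniVerse releases, so treat this as a sketch:

   CREATE.FILE MyHash DYNAMIC MINIMUM.MODULUS 100000 64BIT

or, for a file that already exists:

   RESIZE MyHash * * * 64BIT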

Good Luck,



Roy R.
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX

Post by kduke »

Markos

Make sure you have a minimum modulo on this file. That will definitely increase your performance. A dynamic hash file will keep resizing itself as it grows. Dynamic files are the default type of hash file. ANALYZE.FILE will tell you how big to make the minimum modulo.
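For example, from the Administrator command window (MyHash is just a placeholder name and the modulo figure is purely illustrative; use whatever ANALYZE.FILE suggests):

   ANALYZE.FILE MyHash
   CONFIGURE.FILE MyHash MINIMUM.MODULUS 200000

If I have the syntax right, CONFIGURE.FILE lets you change the minimum modulus of an existing dynamic file in place.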

Also, I would turn off the cache. This file is probably too big to keep in memory, but I would try it and see.

If you are on UNIX, and probably Windows too, the closer a filesystem gets to 100% full, the slower it gets: it spends most of its time looking for free space. I have seen this reduce speeds by a fourth or more. You used to be able to control this by rebuilding a filesystem so that files were laid out contiguously on disk, much like defragmenting a disk; that improves performance a lot.

Kim.

Kim Duke
DwNav - ETL Navigator
www.Duke-Consulting.com
degraciavg
Premium Member
Posts: 39
Joined: Tue May 20, 2003 3:36 am
Location: Singapore

Post by degraciavg »

In addition to what Kim suggested, you might want to consider a Static hash file. It is known to increase performance by at least 30%.

You can use the Hash File Calculator (HFC) to work out the size to pre-allocate for your hash file. HFC is found on the DataStage installation CD under Unsupported Utilities.

The downside is that you'll lose all data that overflows your pre-defined file size. This is recommended if the volume of your data doesn't vary much (i.e. doesn't significantly increase) between runs.
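For example, to pre-create a static hashed file from the Administrator command window (the file name, file type, modulo and separation below are only placeholders; take the real figures from HFC):

   CREATE.FILE MyLookup 18 40009 4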


regards,
vladimir
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX

Post by kduke »

Markos

Roy is right: you can improve performance if you size it right. Size it wrong and you kill performance. You need the HASH.HELP command to analyze a hash file that is not a DYNAMIC hash file. Use type 18 no matter what it recommends for the type.

Here are some other tricks I used in benchmarks on UniVerse. Undersize a hash file if it is used as a source: there is less blank space to read. Oversize a hash file when it is used as a lookup: it finds the record faster, because there are fewer records per group.
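For example (MyHash is only a placeholder name):

   HASH.HELP MyHash

It reads through the file, so it can take a while on a big one. Whatever file type it suggests, create the file as type 18 as above.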

Kim.

Kim Duke
DwNav - ETL Navigator
www.Duke-Consulting.com
spracht
Participant
Posts: 105
Joined: Tue Apr 15, 2003 11:30 pm
Location: Germany

Post by spracht »

quote:Originally posted by degraciavg

The downside is that you'll lose all data that overflows your pre-defined file size. This is recommended if the volume of your data doesn't vary much (i.e. doesn't significantly increase) between runs.



Vladimir (and all),

Will there be a warning or a fatal error in Director if the file size is exceeded? What is the tolerance between the file size at creation time and the maximum? Would you go for a static file if you knew that it won't exceed its limit in the next six months, but probably will within two years?

Thanks in advance!

Stephan
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Assuming that the file is created with 64-bit pointers, so that the 2GB limit caused by 32-bit pointers does not come into play, there will be no warnings because there will be no integrity problems, only efficiency problems. If a static hashed file has more data in it than the allocated number of groups allows, then some of the groups will overflow. However, all this means is that additional buffers are daisy-chained to the end of the group; no data are lost. If a dynamic hashed file gets more data than its current number of groups allows (multiplied by its SPLIT.LOAD factor, default 80%), then the file automatically grows a new group.
Pre-allocating the correct number of groups is preferred, as other posters have suggested, but this requires a priori knowledge of the total amount of data to be stored.
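As a rough sizing sketch only (assuming, say, 2.6 million rows averaging 120 bytes each in a dynamic file with a 2 KB group size and the default 80% split load; all of these are illustrative numbers, so substitute your own):

   2,600,000 rows x 120 bytes   = roughly 312 MB of data
   usable space per group       = 2048 bytes x 0.80 (SPLIT.LOAD) = about 1638 bytes
   groups to pre-allocate       = 312,000,000 / 1638 = about 190,000

   CREATE.FILE MyHash DYNAMIC MINIMUM.MODULUS 190000 64BIT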

Ray Wurlod
Education and Consulting Services
ABN 57 092 448 518
degraciavg
Premium Member
Posts: 39
Joined: Tue May 20, 2003 3:36 am
Location: Singapore

Post by degraciavg »

quote:Originally posted by spracht
Vladimir (and all),

Would you go for a static file if you knew that it won't exceed its limit in the next six months, but probably will within two years?

Thanks in advance!

Stephan


Hi Stephan,

It all boils down to how much performance improvement you gain by using this solution. If it is significant, then I have no qualms about using it. You just have to monitor the hash file to see whether it has exceeded the maximum size.

One way to monitor it is to check the record count. If 2.6M distinct records are extracted from the source, then the hash file must contain 2.6M records too. If they're not the same (or if the count has breached your "threshold" value), then you need to increase the size of your hash file.
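For example, from the Administrator command window (MyHash being a placeholder name):

   COUNT MyHash

and compare the figure it reports with the row count extracted from the source.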


regards,
vladimir
roy
Participant
Posts: 2598
Joined: Wed Jul 30, 2003 2:05 am
Location: Israel

Post by roy »

Hi,
Another thing you can do, in case you are on a 32-bit OS, can't get hash files bigger than 2 GB and really need hash files, is:

split the output into several hash files via a transformer, using round robin
[i.e. MOD(@INROWNUM, n) = 1 goes to link 1, MOD(@INROWNUM, n) = 2 goes to link 2 and so on, where n is the number of hash files]
in the constraint section of a Transformer stage, as sketched below.

This will gain you speed, but it means you end up with n hash files and later n lookups instead of only one.
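As a sketch, splitting across three hash files (three is only an example count), the constraints on the three output links would look something like:

   Link1:  MOD(@INROWNUM, 3) = 1
   Link2:  MOD(@INROWNUM, 3) = 2
   Link3:  MOD(@INROWNUM, 3) = 0

Note that the last link has to catch the rows where MOD returns 0.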

Hope This Helps,


Roy R.