Problems setting large record size in hash files

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

witold
Participant
Posts: 5
Joined: Tue Apr 22, 2003 5:42 pm
Location: Canada

Problems setting large record size in hash files

Post by witold »

Well, that first post had to come sooner or later...

I am trying to achieve a specific LARGE.RECORD setting for a hash file. However, the value of the LARGE.RECORD parameter used in CREATE.FILE is not reflected in the structure of the actual file. For example, the following command:
>CREATE.FILE LS_OI_CTD_INVOICE_COMPLETE_HASH 30 LARGE.RECORD 9999
Creating file "LS_OI_CTD_INVOICE_COMPLETE_HASH" as Type 30.
Creating file "D_LS_OI_CTD_INVOICE_COMPLETE_HASH" as Type 3, Modulo 1, Separation 2.
Added "@ID", the default record for RetrieVe, to "D_LS_OI_CTD_INVOICE_COMPLETE_HASH".

creates a file that produces ANALYZE.FILE results inconsistent with the CREATE.FILE command:
>ANALYZE.FILE LS_OI_CTD_INVOICE_COMPLETE_HASH
File name .................. LS_OI_CTD_INVOICE_COMPLETE_HASH
Pathname ................... LS_OI_CTD_INVOICE_COMPLETE_HASH
File type .................. DYNAMIC
Hashing Algorithm .......... GENERAL
No. of groups (modulus) .... 1 current ( minimum 1 )
Large record size .......... 2036 bytes
Group size ................. 2048 bytes
Load factors ............... 80% (split), 50% (merge) and 0% (actual)
Total size ................. 6144 bytes

Am I missing something here? Is LARGE.RECORD specified in some esoteric units other than bytes?
witold
Participant
Posts: 5
Joined: Tue Apr 22, 2003 5:42 pm
Location: Canada

Post by witold »

OK, I now know a bit more about the relationship between GROUP.SIZE and LARGE.RECORD. I also understand why with a Type 30 file I can only go up to LARGE.RECORD 4084.

What I don't understand is why the following configuration just made my hash file load perform much worse than before (see the CREATE.FILE sketch after the list):

Type 30
GROUP.SIZE 2
LARGE.RECORD 4084
MINIMUM.MODULUS 500000
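
Spelled out as a single command, that would be something like the following (the file name here is just a placeholder):

Code:

CREATE.FILE MY_HASH_FILE 30 GROUP.SIZE 2 LARGE.RECORD 4084 MINIMUM.MODULUS 500000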

Help?

Also, can anybody explain the record-to-group relationship in more detail? Specifically, what inefficiency is there in using GROUP.SIZE 2 to store records that could fit into GROUP.SIZE 1?
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

UNIX and DOS are set up to read 2K, 4K ... 16K chunks, based on how the filesystems are created, so a group size of 1 is wasted. The group size should match the filesystem to be most efficient. The group size on static hash files is really the size times 512; I think it is the same on dynamic files, but I'm not certain. So a group size of 4 is really 2K. If you have large records, then make your group size at least as big as the large record size. Large record size should be 1024 times 2, 4, 8 ...
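
Roughly, that arithmetic works out as follows (the dynamic-file figures assume the standard 2048-byte base group, which matches the ANALYZE.FILE output above):

Code:

Static hashed file : buffer = group size x 512 bytes
                     e.g.  4 x 512 = 2048 (2K),  8 x 512 = 4096 (4K)
Dynamic (Type 30)  : GROUP.SIZE 1 = 2048 bytes,  GROUP.SIZE 2 = 4096 bytes
Large record size  : a multiple of the block size, e.g. 2048, 4096, 8192 ...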
Mamu Kim
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

There is no relationship between LARGE.RECORD and GROUP.SIZE. It looks like you've encountered a bug in CREATE.FILE.

Here are some alternatives.

Having created the file, change LARGE.RECORD with RESIZE.

Code:

RESIZE filename * * * LARGE.RECORD 10000
(a multiple of 8 bytes is preferred)

Having created the file, change LARGE.RECORD with CONFIGURE.FILE.

Code:

CONFIGURE.FILE filename LARGE.RECORD 10000
This change does not take effect immediately; it can be forced to take effect by running RESIZE filename * * *
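
In other words, the deferred change followed by a forced rebuild would look something like this:

Code:

CONFIGURE.FILE filename LARGE.RECORD 10000
RESIZE filename * * *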

Try using a percentage when creating the file.

Code:

CREATE.FILE filename 30 LARGE.RECORD 500%
(Fact: The default value is calculated as 80%).

Use the MINIMIZE.SPACE keyword with any of the above verbs.
CREATE.FILE filename 30 MINIMIZE.SPACE
RESIZE filename * * * MINIMIZE.SPACE
CONFIGURE.FILE filename MINIMIZE.SPACE
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
cohesion
Participant
Posts: 8
Joined: Wed Feb 18, 2004 3:32 pm
Location: Canada

Post by cohesion »

witold wrote: OK, I now know a bit more about the relationship between GROUP.SIZE and LARGE.RECORD. I also understand why with a Type 30 file I can only go up to LARGE.RECORD 4084.

What I don't understand is why the following configuration just made my hash file load perform much worse than before:

Type 30
GROUP.SIZE 2
LARGE.RECORD 4084
MINIMUM.MODULUS 500000
I can only tell you that, in my experience, changing the minimum modulus is something you might try in order to improve the load performance of a hash file that will be loaded repeatedly. DataStage will normally adjust the modulus dynamically as required for large files, but this imposes additional overhead on the load process. If you check the final setting on the file once it is loaded, and set the minimum modulus to that value or some percentage higher to allow for growth, the load should be faster the next time. However, if you set the minimum modulus too high (I'm not sure whether that might be what you did here), the initial load performance could be worse.
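
As a sketch (the file name and figure are only examples, and this assumes RESIZE accepts MINIMUM.MODULUS the same way it accepts LARGE.RECORD above): run ANALYZE.FILE after a full load, note the "No. of groups (modulus)" figure, then set the minimum modulus to that value or a little higher:

Code:

ANALYZE.FILE MY_HASH_FILE
RESIZE MY_HASH_FILE * * * MINIMUM.MODULUS 600000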
R. Michael Pickering
Senior Architect
Cohesion Systems Consulting Inc.
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

This post may help clarify some things about hash files:
viewtopic.php?t=85364

The same hash file can have different performance results from day to day because of fluctuations in the volume put into it. This is down to the way a dynamic hash file handles overflow. If the hash file is pre-sized, that just means you're saving the overhead of constantly doubling the hash file, because you know it's going to reach a certain size.

The performance problem comes into play when you underestimate the size of the hash file. When that happens, data spills out of the hash and into the overflow file. See my referenced post for more details. What this means is that one day you might have enough data to trigger a dynamic resize, which puts all of the data back into a larger data file and leaves none in the overflow. On that day, the hash file is optimally tuned. But on a day where a few rows less go into the hash file, it may be in a heavily overflowed state. In that case, rows are first looked up in the data file using the optimized hashing algorithm and, if not found, are then scanned sequentially in the much slower overflow file.

The point is, you want to find the high-water mark of the hash file and set the minimum modulus to that value, plus some room for the unexpected. You should then see consistent reference performance. An oversized hash file is a waste of space, but an undersized hash file is a waste of time.
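
For example (the file name and modulus here are invented purely for illustration): if ANALYZE.FILE on your fullest day ever shows a modulus of around 500000, pre-size the file with some headroom when you create it:

Code:

CREATE.FILE MY_HASH_FILE 30 MINIMUM.MODULUS 550000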
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
gherbert
Participant
Posts: 9
Joined: Mon Mar 29, 2004 7:58 am
Location: Westboro, MA

Post by gherbert »

Actually, there IS a relationship between GROUP.SIZE and LARGE.RECORD during the CREATE.FILE process. LARGE.RECORD can be no larger than the group size minus the header size (12 bytes in 32-bit files, 20 bytes in 64-bit files), so when you specify 9999 you actually get 2036 bytes (assuming a group size of 1, i.e. 2048 bytes).
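
Putting numbers to that (assuming the 2048- and 4096-byte group sizes discussed above):

Code:

32-bit file, GROUP.SIZE 1 :  2048 - 12 = 2036   (the value ANALYZE.FILE reported earlier)
32-bit file, GROUP.SIZE 2 :  4096 - 12 = 4084   (the Type 30 ceiling mentioned earlier)
64-bit file, GROUP.SIZE 1 :  2048 - 20 = 2028
64-bit file, GROUP.SIZE 2 :  4096 - 20 = 4076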

The CONFIGURE.FILE command, however, does not adhere to this limitation and will allow you to set LARGE.RECORD to any value desired.

This is not considered a bug: we implemented it per spec (based on Prime INFORMATION documentation and in-field operation), and it has been in the engine since circa 1986.