Distributed HASH file
Moderators: chulett, rschirm, roy
Hi Everyone
We are creating huge hash files that are starting to touch on the 2GB limit. Typically, these hash files have around 40 million rows in them. I want to create a distributed hash file instead to make the maintenance/design of the DataStage jobs easier (lookups become a problem when you have to look up against more than one hash file for the same type of information).
My questions are thus:
What is the performance impact of doing this?
Can the hash files be cached if they are in a distributed hash file?
Many thanks
- Participant
- Posts: 3337
- Joined: Mon Jan 17, 2005 4:49 am
- Location: United Kingdom
Here is some info and the posts it comes from:
viewtopic.php?t=90554
viewtopic.php?t=90164
Part files accessed via the Distributed file name are not eligible to be cached.
The distributed hash file has a lot of maintenance overhead, in that you must define a partitioning algorithm. A distributed hash file is the same concept as a partitioned table in Oracle: you must include the partition key column in any lookup/join to benefit from partition pruning.
This gives you smaller, faster hash files to reference, keeps you in the 32-bit realm, distributes CPU load, and lets you balance processing better.
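For what it's worth, here is a rough sketch of what the definition can look like at TCL. This is from memory, so verify the DEFINE.DF syntax against the UniVerse documentation for your release; CUST.DIST, CUST.PART1 and CUST.PART2 are made-up names, and the part files must already exist as hashed files.
Code: Select all
DEFINE.DF CUST.DIST ADDING CUST.PART1 1
DEFINE.DF CUST.DIST ADDING CUST.PART2 2
With the default SYSTEM algorithm, record IDs take the form partnum-key (if memory serves), so the partitioning rule is carried in the key itself.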
Precious
Mosher's Law of Software Engineering: Don't worry if it doesn't work right. If everything did, you'd be out of a job.
ewartpm,
the distributed hash file mechanism works well, but as you have seen from the other posters, it does invalidate the caching mechanism.
That said, if you choose your partitioning algorithm carefully, you can read and write to the individual "partitioned" files directly and use them in normal DataStage jobs; so you can still get the benefits of a distributed partitioned file.
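As an illustration only (a minimal DataStage BASIC sketch; CustomerId and the four-way split are hypothetical, not anything from your jobs), a partitioning rule derived from the key lets the same expression route both writes and lookups to the right part file:
Code: Select all
* Hypothetical rule: spread rows across 4 part files by key modulus.
PartNum = MOD(CustomerId, 4) + 1
* Under the SYSTEM algorithm the distributed record ID carries the
* part number as a prefix, e.g. "3-12345".
DistKey = PartNum : "-" : CustomerId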
The 2Gb limit doesn't apply to Windows files (assuming NTFS), so if you created a large hash file this limitation ought not to apply to you. Why are you being limited to files of 2Gb?
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
You can stay with hashed files by resizing them to 64-bit addressing.
(This syntax assumes a VOC entry - it can also be done from the command line.)
Code: Select all
RESIZE hashedfile * * * 64BIT
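If you want to confirm the result afterwards, ANALYZE.FILE (again assuming a VOC entry) reports the file's structure, and the output should reflect the new addressing:
Code: Select all
ANALYZE.FILE hashedfile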
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ArndW wrote: The 2Gb limit doesn't apply to Windows files (assuming NTFS), so if you created a large hash file this limitation ought not to apply to you. Why are you being limited to files of 2Gb?
I was under the impression that this was the case on 32 bit OSs. Please clarify.
Precious
Mosher's Law of Software Engineering: Don't worry if it doesn't work right. If everything did, you'd be out of a job.
Precious,
if the OS or hardware word width is 32 bits you can still have a 64-bit pointer; it just spans the internal word boundary and is somewhat less efficient.
The file system pointers need to be bigger than 32 bits to get past that 2Gb limit; and if you use the NTFS file system on any of the newer Windoze platforms you get a very high theoretical file size. I didn't know what the limits were, but I just checked out Size Limitations in NTFS and FAT File systems and got some interesting data.
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact: