Distributed HASH file

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

ewartpm
Participant
Posts: 97
Joined: Wed Jun 25, 2003 2:15 am
Location: South Africa

Distributed HASH file

Post by ewartpm »

Hi Everyone

We are creating huge hash files that are starting to touch the 2GB limit. Typically, these hash files have around 40 million rows in them. I want to create a distributed hash file instead, to make the maintenance/design of the DataStage jobs easier (lookups become a problem when you have to look up more than one hash file for the same type of information).

My questions are thus:
What is the performance impact of doing this?
Can the hash files be cached if they are in a distributed hash file?

Many thanks
Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Post by Sainath.Srinivasan »

A distributed hash file is similar to a partitioned table. I do not think you will be able to cache it, as the main purpose of partitioning is to reduce the scanning and avoid loading the whole file into memory.
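
If you have not built one before, the setup happens at TCL. From memory the commands run roughly as below; the file names here are invented, so verify the exact syntax against the UniVerse documentation. The part files are created first as ordinary hashed files, then the distributed file is created and each part attached with its part number.

Code: Select all

CREATE.FILE CUST.PART1 DYNAMIC
CREATE.FILE CUST.PART2 DYNAMIC
CREATE.FILE CUST.DF DISTRIBUTED
DEFINE.DF CUST.DF ADDING CUST.PART1 1
DEFINE.DF CUST.DF ADDING CUST.PART2 2
Once the parts are attached, opening CUST.DF gives you the union of the part files.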
Precious
Charter Member
Posts: 53
Joined: Mon Aug 23, 2004 9:51 am
Location: South Africa

Post by Precious »

Here is some info and the posts it comes from:

viewtopic.php?t=90554
Part files accessed via the Distributed file name are not eligible to be cached.
viewtopic.php?t=90164
The distributed hash file has a lot of maintenance overhead, in that you must define partitioning algorithms. A distributed hash file is the same concept as a partitioned table in Oracle. You must include the partition key column when doing any lookup/join to benefit from partition pruning.

What this does is give you smaller, faster hash files to reference, keep you in the 32-bit realm, distribute the CPU load, and let you balance processing better.
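
To make the partitioning-algorithm point concrete: with the built-in SYSTEM algorithm, as I understand it, the part number is embedded in the record ID itself, so the jobs that load and reference the file must build their keys accordingly. A rough DataStage BASIC sketch follows; the key name and part count are invented for illustration.

Code: Select all

* Hypothetical: spread the rows evenly across 4 part files.
* The SYSTEM algorithm (if memory serves) expects record IDs
* of the form "partnum-id".
NumParts = 4
PartNum = MOD(CustomerID, NumParts) + 1
RecordID = PartNum : "-" : CustomerID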
Precious

Mosher's Law of Software Engineering: Don't worry if it doesn't work right. If everything did, you'd be out of a job.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

ewartpm,

the distributed hash file mechanism works well, but as you have seen from the other posters, it does invalidate the caching mechanism.

That said, if you choose your partitioning algorithm carefully, you can read and write to the individual "partitioned" files directly and use them in normal DataStage jobs; so you can still get the benefits of a distributed partitioned file.
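
For example, a routine that bypasses the distributed file and reads the correct part directly might look something like the sketch below. The file and key names are invented, and it assumes the same MOD-based split was used when the parts were loaded.

Code: Select all

* Hypothetical direct lookup against a single part file.
NumParts = 4
PartNum = MOD(CustKey, NumParts) + 1
PartFileName = "CUST.PART" : PartNum
OPEN PartFileName TO F.Part ELSE CALL DSLogFatal("Cannot open " : PartFileName, "PartLookup")
READ Rec FROM F.Part, CustKey ELSE Rec = ""
Accessed this way the parts are just ordinary hashed files, so the usual caching options apply to them again.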

The 2GB limit doesn't apply to Windows files (assuming NTFS), so if you created a large hash file this limitation ought not to apply to you. Why are you being limited to files of 2GB?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

You can stay with hashed files by resizing them to 64-bit addressing.

Code: Select all

RESIZE hashedfile * * * 64BIT
(This syntax assumes a VOC entry - it can also be done from the command line.)
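If you want to check what a hashed file is currently using, ANALYZE.FILE reports on its structure; I believe the addressing is flagged in the output.

Code: Select all

ANALYZE.FILE hashedfile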
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ewartpm
Participant
Posts: 97
Joined: Wed Jun 25, 2003 2:15 am
Location: South Africa

Post by ewartpm »

Thanks for the replies.

I thought the 2GB limit applied to all operating systems, hence my concern :oops:

I have heard that 64-bit hash files do not perform well, hence I have not pursued this option. Is this just a perception of mine, or should they be avoided?
Precious
Charter Member
Posts: 53
Joined: Mon Aug 23, 2004 9:51 am
Location: South Africa

Post by Precious »

ArndW wrote:The 2GB limit doesn't apply to Windows files (assuming NTFS), so if you created a large hash file this limitation ought not to apply to you. Why are you being limited to files of 2GB?
I was under the impression that this was the case on 32-bit OSs. :? Please clarify.
Precious

Mosher's Law of Software Engineering: Don't worry if it doesn't work right. If everything did, you'd be out of a job.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Precious,

if the OS or hardware word width is 32 bits you can still have a 64-bit pointer; it just spans the internal word boundary and is somewhat less efficient.

The file system pointers need to be bigger than 32 bits to get past that 2GB limit; and if you use the NTFS file system on any of the newer Windoze platforms you get a very high theoretical file size. I didn't know what the limits were, but I just checked out Size Limitations in NTFS and FAT File Systems and got some interesting data.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

In theory 19 million TB, but probably limited to 1TB on Windows platforms.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.