Distributed HASH file
Moderators: chulett, rschirm, roy
Hi Everyone
We are creating huge hash files that are starting to touch on the 2GB limit. Typically, these hash files have around 40 million rows in them. I want to create a distributed hash file instead to make the maintenance/design of the DataStage jobs easier (lookups become a problem when you have to look up against more than one hash file for the same type of information).
My questions are thus:
What is the performance impact of doing this?
Can the hash files be cached if they are in a distributed hash file?
Many thanks
- Participant
- Posts: 3337
- Joined: Mon Jan 17, 2005 4:49 am
- Location: United Kingdom
Here is some info and the posts it comes from:
viewtopic.php?t=90554
viewtopic.php?t=90164
Part files accessed via the Distributed file name are not eligible to be cached.
The distributed hash file has a lot of maintenance overhead, in that you must define a partitioning algorithm. A distributed hash file is the same concept as a partitioned table in Oracle: you must include the partition key column in any lookup/join to benefit from partition pruning.
This gives you smaller, faster hash files to reference, keeps you in the 32-bit realm, distributes CPU load, and lets you balance processing better.
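For what it's worth, here is a rough sketch of what the definition can look like at TCL. This is from memory, so verify the DEFINE.DF syntax against the UniVerse documentation for your release; CUST.DIST, CUST.PART1 and CUST.PART2 are made-up names, and the part files must already exist as hashed files.
Code: Select all
DEFINE.DF CUST.DIST ADDING CUST.PART1 1
DEFINE.DF CUST.DIST ADDING CUST.PART2 2
With the default SYSTEM algorithm, record IDs take the form partnum-key (if memory serves), so the partitioning rule is carried in the key itself.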
Precious
Mosher's Law of Software Engineering: Don't worry if it doesn't work right. If everything did, you'd be out of a job.
ewartpm,
the distributed hash file mechanism works well, but as you have seen from the other posters, it does invalidate the caching mechanism.
That said, if you choose your partitioning algorithm carefully, you can read and write to the individual "partitioned" files directly and use them in normal DataStage jobs; so you can still get the benefits of a distributed partitioned file.
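As an illustration only (a minimal DataStage BASIC sketch; CustomerId and the four-way split are hypothetical, not anything from your jobs), a partitioning rule derived from the key lets the same expression route both writes and lookups to the right part file:
Code: Select all
* Hypothetical rule: spread rows across 4 part files by key modulus.
PartNum = MOD(CustomerId, 4) + 1
* Under the SYSTEM algorithm the distributed record ID carries the
* part number as a prefix, e.g. "3-12345".
DistKey = PartNum : "-" : CustomerId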
The 2Gb limit doesn't apply to Windows files (assuming NTFS), so if you created a large hash file this limitation ought not to apply to you. Why are you being limited to files of 2Gb?
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
You can stay with hashed files by resizing them to 64-bit addressing.
(This syntax assumes a VOC entry - it can also be done from the command line.)
Code: Select all
RESIZE hashedfile * * * 64BIT
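If you want to confirm the result afterwards, ANALYZE.FILE (again assuming a VOC entry) reports the file's structure, and the output should reflect the new addressing:
Code: Select all
ANALYZE.FILE hashedfile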
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ArndW wrote: The 2Gb limit doesn't apply to Windows files (assuming NTFS), so if you created a large hash file this limitation ought not to apply to you. Why are you being limited to files of 2Gb?
I was under the impression that this was the case on 32 bit OSs. Please clarify.
Precious
Mosher's Law of Software Engineering: Don't worry if it doesn't work right. If everything did, you'd be out of a job.
Precious,
if the OS or hardware word width is 32 bits you can still have a 64-bit pointer; it just spans the internal word boundary and is somewhat less efficient.
The file system pointers need to be bigger than 32 bits to get past that 2Gb limit; and if you use the NTFS file system on any of the newer Windoze platforms you get a very high theoretical file size. I didn't know what the limits were, but I just checked out Size Limitations in NTFS and FAT File systems and got some interesting data.
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact: