Hash File as Driver

Archive of postings to DataStageUsers@Oliver.com. This forum is intended only as a reference and cannot be posted to.

Moderators: chulett, rschirm

admin
Posts: 8720
Joined: Sun Jan 12, 2003 11:26 pm


Post by admin »

I have a DataStage architecture question: is there any benefit to using a hash file as a DRIVER in a job design? For example:

Scenario 1: You have a sequential file driver with 6 million rows. This sequential file goes into a transform stage, and joins to 3 smaller hash files.

Scenario 2: Instead of a sequential file driver, you have a hash file, with the same key as the 3 hash files you are joining to.

Will scenario 2 be any faster? I didn't think hash files as drivers would make any difference in performance, since we would supposedly be taking each record sequentially and joining it to the hash files. However, I don't know DataStage architecture or hash file usage in detail. Also, a small experiment conducted here seemed to show results in favor of the hash scenario, which led me to ask this group. The results may be ambiguous, since our test server's performance changes drastically depending on who's running what at the time, but scenario 2 was tried out and seemed to run faster.
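To make the two scenarios concrete, here is a minimal Python sketch of the job shape being described. It is only an illustration, not DataStage code: each hash file is modeled as a plain dict, and the names `run_job`, `sequential_driver`, `hashed_driver`, and the sample columns are all hypothetical. It shows that the driver is consumed row by row in both cases and that the three lookups are keyed accesses that do not depend on the driver's format.

```python
import csv
import io

# Hypothetical stand-ins: each "hash file" is modeled as a dict keyed on the join key.
def run_job(driver_rows, lookups):
    """Stream driver rows in order and do one keyed lookup per reference file."""
    output = []
    for row in driver_rows:          # the driver is read sequentially in both scenarios
        key = row["key"]
        merged = dict(row)
        for name, table in lookups.items():
            merged[name] = table.get(key)   # keyed access; None when there is no match
        output.append(merged)
    return output

def sequential_driver(text):
    """Scenario 1: rows come from a delimited sequential file."""
    return csv.DictReader(io.StringIO(text))

def hashed_driver(table):
    """Scenario 2: rows come from a keyed store, but are still scanned one by one."""
    for key, amount in table.items():
        yield {"key": key, "amount": amount}

# Three small lookup tables, as in the scenarios above (sample data is made up).
lookups = {
    "region": {"1": "east", "2": "west"},
    "status": {"1": "open"},
    "owner": {"2": "ops"},
}

result_seq = run_job(sequential_driver("key,amount\n1,10\n2,20\n"), lookups)
result_hash = run_job(hashed_driver({"1": "10", "2": "20"}), lookups)
```

In this model the only difference between the scenarios is the cost of producing each driver row, which is the part the question is really about.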

Any ideas?


Thanks,

Nowell Henry
Data Specialist
New York Life
admin
Posts: 8720
Joined: Sun Jan 12, 2003 11:26 pm

Post by admin »

Nowell,
I am definitely not an authority, but I would expect the hash file to be somewhat slower. First, the hash file is keyed (obviously), and writing that key takes some processing time. This would not affect your job, but it would cause the previous job (i.e., the one that populates this file) to run slower. Second, reading from a sequential file should be quicker. I think this is true because large records in a hash file (depending on the definition of the hash file and the length of your record) get split between two different physical sections. If a good percentage of your records are large, the disk access would not be in consecutive sections, which may degrade performance. Hash files are definitely quick (and for lookups much quicker than anything else), but if I don't have to use the data for lookups, I tend to stick with sequential files.
---Tony
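Since the original test ran on a shared server with variable load, one way to check this expectation would be to run each scenario several times and compare medians rather than single runs. The Python sketch below is a generic timing harness, not a DataStage facility: `time_runs` and `summarize` are hypothetical helper names, and `scenario_1` and `scenario_2` are placeholders to be replaced with whatever launches each job design.

```python
import statistics
import time

def time_runs(job, repeats=5):
    """Run a zero-argument callable several times; return elapsed seconds per run."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        job()
        timings.append(time.perf_counter() - start)
    return timings

def summarize(timings):
    """The median is less sensitive to one-off slow runs caused by other users."""
    return {
        "median": statistics.median(timings),
        "min": min(timings),
        "max": max(timings),
    }

# Placeholder workloads (assumptions): replace with calls that run each job design.
def scenario_1():
    sum(range(10_000))

def scenario_2():
    sum(range(10_000))

for name, job in (("sequential driver", scenario_1), ("hash driver", scenario_2)):
    print(name, summarize(time_runs(job)))
```

Interleaving the runs of the two scenarios, rather than running all of one and then all of the other, would further reduce the chance that a busy period on the server favors one design.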


