Split Hash File vs Universe Table approach

sumitgulati
Participant
Posts: 197
Joined: Mon Feb 17, 2003 11:20 pm
Location: India

Split Hash File vs Universe Table approach

Post by sumitgulati »

Hi,

As per a business requirement we need to perform a lookup against a table based on one key column. The lookup table has 25 million records, and a direct database lookup would kill the job performance. A hashed file is probably a good idea, but we do not want to create one for two reasons:
1) We expect the size of the file to grow to around 1.5 GB.
2) Write performance into a hashed file generally degrades for big files. In that case, just creating the hashed file might take hours.

Below is the approach we tried:
We created a UniVerse table 'X' and loaded the required columns from the lookup table into it. The load was pretty fast. The join column for the lookup is defined as key column 'Y' in UniVerse table 'X'.
In the main job we use a UniVerse stage for the lookup, joining the source column to column 'Y' of UniVerse table 'X'.
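For illustration, the setup was roughly as follows (the table name, column names and types are placeholders here, not our real DDL):

   CREATE TABLE X (Y VARCHAR(20) NOT NULL PRIMARY KEY, VAL VARCHAR(50));

and the keyed lookup the UniVerse stage effectively issues for each incoming row:

   SELECT VAL FROM X WHERE Y = ?;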

The job performance is very good, and loading the UniVerse table is much faster than loading a hashed file with the same data. My question is that, searching the forum, I saw a lot of people suggesting splitting the hashed file into multiple files for this kind of situation.
I am looking for a better insight into
the UniVerse table approach vs the multiple hashed files approach.

Thanks and Regards,
-Sumit
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Sumit,

Loading a UniVerse table is never faster than loading a hashed file - a UniVerse table is nothing but a view on a hashed file. Loading a dynamic hashed file gets slower when writing large volumes of data because of misconfiguration, not because it is inherently slower.
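If writes to a dynamic file slow down dramatically, analyze the file and tune it rather than accept the defaults. A rough sketch at the UniVerse prompt (the file name and numbers are examples only, and the CONFIGURE.FILE options available may vary by release):

   ANALYZE.FILE MYLOOKUP
   CONFIGURE.FILE MYLOOKUP MINIMUM.MODULUS 500009 SPLIT.LOAD 80

ANALYZE.FILE reports how the file is sized and loaded; CONFIGURE.FILE adjusts the tuning parameters of an existing dynamic file.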

Splitting a hashed file into several files is only useful when you can split the keys into <n> files logically and then also split the associated reads and lookups across those <n> files. One lookup against the full data set is much faster than doing 10 lookups against files a tenth of the size when 9 of those lookups fail.
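If you do split, the writes and the lookups must be routed the same way - for example with Transformer constraint expressions like these (assuming a purely numeric key and 5 files; the link and column names are made up):

   Link to HF_PART0:  Mod(In.KEYCOL, 5) = 0
   Link to HF_PART1:  Mod(In.KEYCOL, 5) = 1

and so on, with the same Mod() expression choosing the hashed file on the reference side.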
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Preallocating the disk space when you create the empty hashed file will help. Use MINIMUM.MODULUS if creating a dynamic hashed file, or modulo if creating a static hashed file.
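For example (the numbers are illustrative only - derive them from your own row count and average record size):

   CREATE.FILE MYLOOKUP DYNAMIC MINIMUM.MODULUS 500009

for a dynamic file preallocated to roughly 500,000 groups, or

   CREATE.FILE MYLOOKUP 18 750019 4

for a static file (type 18, modulo 750019, separation 4).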
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
sumitgulati
Participant
Posts: 197
Joined: Mon Feb 17, 2003 11:20 pm
Location: India

Post by sumitgulati »

Thanks Ray and Arnd,

Splitting the key column logically is not possible, so if we create, say, 5 files there is no way to ensure that the data would be evenly distributed across them. Creating just one hashed file also seemed a bad option because its size would be around 1.5 GB. We can improve writes into the hashed file by using MINIMUM.MODULUS, but 1.5 GB is huge.

Currently we are using a UniVerse table for the lookup and it is working OK.
Our only concern is that when we searched previous posts in the forum about handling such situations, not many people suggested using UniVerse tables for large data sets, yet this seems to be working perfectly fine for us. Are we missing anything? Do you see any problem with this approach in the future?

Thanks and Regards,
-Sumit
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

1.5 GB is not huge. Hashed files (with 64-bit addressing) can theoretically be up to 19 million TB in size (if your OS permits).
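If your release and platform support it, the 64BIT keyword is all it takes (a sketch only - check your version first):

   CREATE.FILE BIGLOOKUP DYNAMIC MINIMUM.MODULUS 500009 64BIT

or, to convert an existing 32-bit hashed file in place:

   RESIZE BIGLOOKUP * * * 64BIT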
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
djdinjan
Participant
Posts: 36
Joined: Fri Jul 23, 2004 9:35 am

Post by djdinjan »

Ray, you are right: with 64-bit addressing we have the option of using the hashed file. I am still curious to know how good the UniVerse table approach is.

Could someone please share their opinion on that.

Thanks,
Dj
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Dj,

A UniVerse table is just another way to look at a UniVerse hashed file.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

... and a UniVerse table behaves much like any other database table - you throw SQL at it and get results back. Constrained queries are assisted by indexing on the constraining columns (except for leading wildcard constraints).
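For example, if queries constrained on a non-key column Z of table X (Z is hypothetical here), the index could be created through SQL or at TCL - both forms below are sketches:

   CREATE INDEX X_Z_IDX ON X (Z);

or

   CREATE.INDEX X Z
   BUILD.INDEX X Z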
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
sumitgulati
Participant
Posts: 197
Joined: Mon Feb 17, 2003 11:20 pm
Location: India

Post by sumitgulati »

Thank you all for your inputs.

Regards,
-Sumit