Split Hash File vs Universe Table approach

sumitgulati
Participant
Posts: 197
Joined: Mon Feb 17, 2003 11:20 pm
Location: India

Split Hash File vs Universe Table approach

Post by sumitgulati »

Hi,

As per a business requirement we need to perform a lookup against a table based on one key column. The lookup table has 25 million records, and a direct database lookup would kill the job performance. A hashed file is probably a good idea, but we do not want to create one for two reasons:
1) We expect the size of the file to grow to around 1.5 GB.
2) Write performance into a hashed file generally degrades for big files. In that case, just creating the hashed file might take hours.

Below is the approach we tried:
We created a UniVerse table 'X' and loaded the required columns from the lookup table into it. The load was pretty fast. The join column for the lookup is defined as key column 'Y' in UniVerse table 'X'.
In the main job we use a UniVerse stage for the lookup, joining the source column to column 'Y' of UniVerse table 'X'.
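For illustration, the setup was roughly as follows (the table name, column names and types are placeholders here, not our real DDL):

   CREATE TABLE X (Y VARCHAR(20) NOT NULL PRIMARY KEY, VAL VARCHAR(50));

and the keyed lookup the UniVerse stage effectively issues for each incoming row:

   SELECT VAL FROM X WHERE Y = ?;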

The job performance is very good, and loading the UniVerse table is much faster than loading a hashed file with the same data. My question is that, searching the forum, I saw a lot of people suggesting splitting the hashed file into multiple files for this kind of situation.
I am looking for a better insight into
the UniVerse table approach vs the multiple hashed files approach.

Thanks and Regards,
-Sumit
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Sumit,

Loading a UniVerse table is never faster than loading a hashed file - a UniVerse table is nothing but a view on a hashed file. Loading a dynamic hashed file gets slower when writing large volumes of data because of misconfiguration, not because it is inherently slower.
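If writes to a dynamic file slow down dramatically, analyze the file and tune it rather than accept the defaults. A rough sketch at the UniVerse prompt (the file name and numbers are examples only, and the CONFIGURE.FILE options available may vary by release):

   ANALYZE.FILE MYLOOKUP
   CONFIGURE.FILE MYLOOKUP MINIMUM.MODULUS 500009 SPLIT.LOAD 80

ANALYZE.FILE reports how the file is sized and loaded; CONFIGURE.FILE adjusts the tuning parameters of an existing dynamic file.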

Splitting a hashed file into several files is only useful when you can split the keys into <n> files logically and then also split the associated reads and lookups across those <n> files. One lookup against the full data set is much faster than doing 10 lookups against files a tenth of the size when 9 of those lookups fail.
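If you do split, the writes and the lookups must be routed the same way - for example with Transformer constraint expressions like these (assuming a purely numeric key and 5 files; the link and column names are made up):

   Link to HF_PART0:  Mod(In.KEYCOL, 5) = 0
   Link to HF_PART1:  Mod(In.KEYCOL, 5) = 1

and so on, with the same Mod() expression choosing the hashed file on the reference side.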
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Preallocating the disk space when you create the empty hashed file will help. Use MINIMUM.MODULUS if creating a dynamic hashed file, or modulo if creating a static hashed file.
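For example (the numbers are illustrative only - derive them from your own row count and average record size):

   CREATE.FILE MYLOOKUP DYNAMIC MINIMUM.MODULUS 500009

for a dynamic file preallocated to roughly 500,000 groups, or

   CREATE.FILE MYLOOKUP 18 750019 4

for a static file (type 18, modulo 750019, separation 4).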
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
sumitgulati
Participant
Posts: 197
Joined: Mon Feb 17, 2003 11:20 pm
Location: India

Post by sumitgulati »

Thanks Ray and Arnd,

Splitting the key column logically is not possible, so if we create, say, 5 files there is no way to ensure that the data would be evenly distributed across them. Creating just one hashed file also seemed a bad option because its size would be around 1.5 GB. We can improve writes into the hashed file by using MINIMUM.MODULUS, but 1.5 GB is huge.

Currently we are using a UniVerse table for the lookup and it is working OK.
Our only concern is that when we searched previous posts in the forum about handling such situations, not many people suggested using UniVerse tables for large data sets, yet this seems to be working perfectly fine for us. Are we missing anything? Do you see any problem with this approach in the future?

Thanks and Regards,
-Sumit
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

1.5 GB is not huge. Hashed files (with 64-bit addressing) can theoretically be up to 19 million TB in size (if your OS permits).
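If your release and platform support it, the 64BIT keyword is all it takes (a sketch only - check your version first):

   CREATE.FILE BIGLOOKUP DYNAMIC MINIMUM.MODULUS 500009 64BIT

or, to convert an existing 32-bit hashed file in place:

   RESIZE BIGLOOKUP * * * 64BIT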
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
djdinjan
Participant
Posts: 36
Joined: Fri Jul 23, 2004 9:35 am

Post by djdinjan »

Ray, you are right: with 64-bit addressing we have the option of using the hashed file. I am still curious to know how good the UniVerse table approach is.

Could someone please share their opinion on that.

Thanks,
Dj
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Dj,

A UniVerse table is just another way to look at a UniVerse hashed file.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

... and a UniVerse table behaves much like any other database table - you throw SQL at it and get results back. Constrained queries are assisted by indexing on the constraining columns (except for leading wildcard constraints).
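For example, if queries constrained on a non-key column Z of table X (Z is hypothetical here), the index could be created through SQL or at TCL - both forms below are sketches:

   CREATE INDEX X_Z_IDX ON X (Z);

or

   CREATE.INDEX X Z
   BUILD.INDEX X Z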
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
sumitgulati
Participant
Posts: 197
Joined: Mon Feb 17, 2003 11:20 pm
Location: India

Post by sumitgulati »

Thank you all for your inputs.

Regards,
-Sumit