Split Hash File vs Universe Table approach
Moderators: chulett, rschirm, roy
-
- Participant
- Posts: 197
- Joined: Mon Feb 17, 2003 11:20 pm
- Location: India
Hi,
As per a business requirement we need to do a lookup against a table based on one key column. The lookup table has 25 million records and a direct table lookup would kill the job performance. A hash file is probably a good idea, but we do not want to create one for two reasons:
1) We expect the size of the file to grow up to around 1.5 GB.
2) Write performance into a hash file generally degrades as the file gets big. In that case, just creating the hash file might take hours.
Below is the approach we tried:
We created a Universe table 'X' and loaded the required columns from the lookup table into it. The load was pretty fast. The join column for the lookup is defined as key column 'Y' in the Universe table 'X'.
In the main job, to do the lookup against the Universe table, we used the Universe stage and joined the source column with column 'Y' of Universe table 'X'.
The job performance is very good, and loading the Universe table is also much faster than loading a hash file with the same data. My question is that in the forum I saw a lot of people suggesting splitting the hash file into multiple files for this kind of situation.
I am looking for a better insight into
Universe Table approach vs Multiple Hash files approach.
Thanks and Regards,
-Sumit
Sumit,
loading a UniVerse table is never faster than loading a hashed file. A UniVerse table is nothing but a view on a hashed file. Loading to a dynamic hashed file gets slower when writing large volumes of data because of misconfiguration, not because it is inherently slower.
Splitting a hashed file into several files is only useful when you can split the keys into <n> files logically and then also direct the associated reads and lookups to those <n> files. One lookup against the full dataset is much faster than doing 10 lookups against files of 1/10 the size, when 9 of those lookups are guaranteed to fail.
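The point about directed lookups can be sketched in plain Python (this is an illustration, not DataStage or UniVerse code; the keys and the hash-based routing rule are hypothetical):

```python
# Sketch: splitting a lookup file into n partitions only pays off
# when both the load and the lookup can be routed to one partition.

def partition(key, n):
    # Hypothetical routing rule: derive the partition number from
    # the key. Any deterministic rule works, as long as writers
    # and readers use the same one.
    return hash(key) % n

n = 5
files = [dict() for _ in range(n)]

# Load: each record is written to exactly one partition.
for key in ["A100", "B200", "C300", "D400"]:
    files[partition(key, n)][key] = {"key": key}

# Directed lookup: one probe against 1/n of the data.
row = files[partition("B200", n)].get("B200")

# Undirected lookup: up to n probes, n-1 of which must fail --
# the case warned against above.
row2 = next((f["B200"] for f in files if "B200" in f), None)
```

If the key column cannot be partitioned deterministically, every lookup degenerates into the undirected case, and splitting buys nothing.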
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Preallocating the disk space when you create the empty hashed file will help. Use MINIMUM.MODULUS if creating a dynamic hashed file, or modulo if creating a static hashed file.
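As a back-of-envelope check on what MINIMUM.MODULUS might look like for the volumes in this thread (the 2 KB group size and 80% split load are assumptions matching common dynamic-file defaults, not values from the post):

```python
# Rough sizing: a dynamic hashed file needs roughly
#   data volume / (group size * target load)
# groups to hold the data without constant splitting.

data_bytes = 1.5 * 1024**3   # expected file size from the post
group_bytes = 2048           # assumed GROUP.SIZE 1 (2 KB groups)
split_load = 0.80            # assumed SPLIT.LOAD of 80%

min_modulus = int(data_bytes / (group_bytes * split_load)) + 1
print(min_modulus)           # on the order of a million groups
```

Preallocating on that order avoids paying for incremental group splits during the load, which is where most of the "hours to build" time goes.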
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Participant
- Posts: 197
- Joined: Mon Feb 17, 2003 11:20 pm
- Location: India
Thanks Ray and Arnd,
Splitting the key column logically is not possible, so if we create, say, 5 files there is no way we can ensure that the data would be evenly distributed across them. Creating only 1 hash file would certainly be a bad option because the size of that file would be around 1.5 GB. We can improve the write into the hash file by using MINIMUM.MODULUS, but 1.5 GB is huge.
Currently we are using a Universe table for the lookup and it's working OK.
Our only concern is that when we searched previous posts in the forum for handling such situations, not many people suggested using Universe tables for large data sets, but this seems to be working perfectly fine for us. Are we missing anything? Do you see any problem with this approach in the future?
Thanks and Regards,
-Sumit
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
... and a UniVerse table behaves much like any other database table - you throw SQL at it and get results back. Constrained queries are assisted by indexing on the constraining columns (except for leading wildcard constraints).
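The reason a leading wildcard defeats an index can be shown with a plain Python analogy (an index behaves like a sorted key list; the example keys are made up):

```python
# Illustration: why an index helps equality/prefix constraints
# but not leading-wildcard ones.
import bisect
from itertools import takewhile

keys = sorted(["ALPHA", "BETA", "BETAMAX", "GAMMA"])

def find_prefix(prefix):
    # Prefix constraint: binary search on the sorted keys finds the
    # first match, and all matches are contiguous -- O(log n) seek.
    i = bisect.bisect_left(keys, prefix)
    return list(takewhile(lambda k: k.startswith(prefix), keys[i:]))

def find_suffix(suffix):
    # Leading wildcard ('%MAX'): the sort order is useless, so the
    # whole key list must be scanned -- O(n).
    return [k for k in keys if k.endswith(suffix)]
```

The same logic applies whatever the database: an index orders keys from the left, so a constraint that starts with a wildcard cannot use that order.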
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.