Hash files and lookups
Could you give the use of hash files and lookups in a real-time scenario, with an example? How does it affect performance?
sudhi
Hi,
Welcome aboard.
Search the forum for your queries.
A sample from the forum:
A hash file can simply be described as a file that distributes data throughout a pre-sized and space-allocated file. Every row has to have a key to determine where the row will reside within the file. This key is run through an algorithm (technically a hashing algorithm, hence the name) to determine the location in the file for the row.
One and only one row can exist at a single location.
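To make this concrete, here is a rough Python sketch of the idea (illustrative only - the group count and the choice of zlib.crc32 are my own assumptions, not the actual hashed file internals):
Code:

import zlib

NUM_GROUPS = 97  # illustrative: a hashed file is pre-sized into a fixed number of groups

def group_for_key(key: str) -> int:
    # Run the key through a hashing algorithm to pick the group it lives in.
    return zlib.crc32(key.encode("utf-8")) % NUM_GROUPS

# The same key always lands in the same group, so a lookup goes straight to
# that group instead of scanning the whole file.
print(group_for_key("CUST00042"))   # some group number in 0..96
print(group_for_key("CUST00042"))   # same group every time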
Regards
kcs
Please be more specific about what you mean by "performance". Hashed files (note, not "hash" files) are the fastest possible mechanism for performing a key-based lookup, particularly if cached in memory.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Hashed files
Thank you all. I have some more doubts; please respond.
ray.wurlod wrote: Please be more specific about what you mean by "performance". Hashed files (note, not "hash" files) are the fastest possible mechanism for performing a key-based lookup, particularly if cached in memory.
1. In a real-time scenario, lookups can be very large (GBs or TBs).
In such a case, how can DataStage manage the storage in main memory?
2. If we update the source of the lookup (e.g. an Oracle table or a sequential file), how will it become visible to the lookup dynamically?
sudhi
1. In a real-time scenario, lookups can be very large (GBs or TBs).
In such a case, how can DataStage manage the storage in main memory?
No. The maximum size of a cached hashed file is 999MB.
If you were to use DataStage Enterprise Edition, then you could use a virtual Data Set (or Lookup File Set) that can be much larger. However, ask yourself whether you really need all those columns and all those rows - are all the rows current, and are all the columns needed in your job design?
2. If we update the source of the lookup (e.g. an Oracle table or a sequential file), how will it become visible to the lookup dynamically?
It won't happen automatically - you have to build real-time updating of the lookup table into your job design, which rules out a read-only memory cache.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Re: Hash files and lookups
Thanks a lot. I am a beginner in DS, and this is very helpful to me.
sudheepkv wrote: Could you give the use of hash files and lookups in a real-time scenario, with an example? How does it affect performance?
Could you explain the following?
1. What is a virtual Data Set?
2. Please explain the second answer; I was not able to understand it.
sudhi
Hi,
A Data Set that is not persistent (not stored on disk) is a virtual Data Set.
--Balaji S.R
Re: Hash files and lookups
Hi,
sudheepkv wrote: Thanks a lot. I am a beginner in DS, and this is very helpful to me. Could you explain the following?
1. What is a virtual Data Set?
2. Please explain the second answer; I was not able to understand it.
1. As explained, the data set passed through a link is also considered a virtual Data Set.
2. It actually meant that you need to update the hashed file manually once the source is updated.
-Kumar
2. If we update the source of the lookup (e.g. an Oracle table or a sequential file), how will it become visible to the lookup dynamically?
DataStage does not update your lookup table automatically. If you are using a hashed file you have pre-populated this with an image from the target table.
As the job runs, you need to keep the target data updated (this may involve new rows), whether you are writing directly to the target table or to a staging area.
Neither of these approaches updates the hashed file automatically. You have to include an extra stream in your DataStage job that updates the hashed file when required.
If you are doing this, you need it to happen immediately, so that if the same change appears in the next source row, it will be found in the hashed file. This means that you can not use a hashed file read-cache, which does not have the ability to be updated immediately; it is populated only when the job starts.
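A small Python sketch (my own illustration, not DataStage itself) of why the snapshot-style read cache falls short, and what the extra update stream buys you:
Code:

# target stands in for the target table / hashed file; lookup is a snapshot
# loaded when the job starts, like a hashed file read cache.
target = {"K1": "existing row"}
lookup = dict(target)

for key in ["K1", "K2", "K2"]:
    if key in lookup:
        print(key, "found - treat as existing")
    else:
        print(key, "not found - insert new row")
        target[key] = "new row"
        # The "extra stream": update the lookup as part of the data flow.
        # Without this line the second K2 would also look new and be inserted twice.
        lookup[key] = "new row"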
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Hashed file update
ray.wurlod wrote: 2. If we update the source of the lookup (e.g. an Oracle table or a sequential file), how will it become visible to the lookup dynamically?
DataStage does not update your lookup table automatically. If you are using a hashed file you have pre-populated this with an image from the target table.
As the job runs, you need to keep the target data updated (this may involve new rows), whether you are writing directly to the target table or to a staging area.
Neither of these approaches updates the hashed file automatically. You have to include an extra stream in your DataStage job that updates the hashed file when required.
If you are doing this, you need it to happen immediately, so that if the same change appears in the next source row, it will be found in the hashed file. This means that you can not use a hashed file read-cache, which does not have the ability to be updated immediately; it is populated only when the job starts.
Thanks a lot. Please tell me:
How can I include an extra stream in my DataStage job that updates the hashed file when required?
sudhi
Re: Hashed file update
Sudheepkv,
sudheepkv wrote: Thanks a lot. Please tell me:
How can I include an extra stream in my DataStage job that updates the hashed file when required?
Adding an update or write stream to a DataStage hashed file is part of the basic job design method. You drop the Hashed File stage on the Designer canvas, pull a link from a Transformer (or other) stage to it, fill in the hashed file location and other details, and get the columns from the source stage. I think the tutorial has some basic examples of this.
Code:
HFstage1 HFstage2
| ^
| |
V |
--------> TransformerStage -------->
HFstage1 performs lookups (cache disabled, lock for update). This sets a row-level lock in the hashed file.
HFstage2 writes to the hashed file only if the lookup failed. The write releases the record-level lock.
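As a rough analogue, here is how that lock-for-update handshake might look in Python (an assumption-laden sketch, not DataStage internals - the per-key threading.Lock simply plays the role of the record-level lock):
Code:

import threading

hashed_file = {}   # stands in for the shared hashed file
row_locks = {}     # per-key locks, standing in for record-level locks

def lookup_lock_for_update(key):
    # HFstage1: look up the key and take its lock, so no other stream can
    # decide the same key is "new" at the same time.
    lock = row_locks.setdefault(key, threading.Lock())
    lock.acquire()
    return hashed_file.get(key)

def write_if_missing(key, row):
    # HFstage2: write only if the lookup failed; releasing the lock here
    # corresponds to the write releasing the record-level lock.
    if key not in hashed_file:
        hashed_file[key] = row
    row_locks[key].release()

existing = lookup_lock_for_update("CUST00042")      # None: key not there yet
write_if_missing("CUST00042", {"name": "example"})  # inserts and releases the lock
print(hashed_file)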
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
In any scenario - real time or otherwise - it depends on what you're looking up and why.
For example, to detect existence, you need only the key values in the lookup table (or hashed file). Even in India this is unlikely to get anywhere near 2GB. But, if it does, all you have to do is change your hashed file to 64-bit internal addressing and you can have up to 19 million TB (provided that the operating system supports files of this size).
Of course, no affordable computer has this much RAM - it's nothing to do with DataStage. In DataStage you can limit the size of the memory cache.
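Presumably the 19 million TB figure comes from 64-bit byte addressing; a quick back-of-the-envelope check, under that assumption:
Code:

# 2**64 addressable bytes, expressed in decimal terabytes (10**12 bytes)
max_bytes = 2 ** 64
print(max_bytes / 10 ** 12)   # about 18.4 million TB, consistent with "up to 19 million TB"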
In terms of slowly changing dimensions (which is what you describe), you must design to update the hashed file whenever you update the target, and make sure that that happens during data flow. Many sites have implemented this; it's a fundamental DataStage technique.
HFstage1, having been pre-populated with current records from the target table, reads (lookup) from the hashed file, with caching set to "disabled, lock for update". HFstage2 writes to the very same hashed file, with write cache disabled.
You must use separate hashed file stages, even though they work with the same hashed file.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.