Hash files and lookups

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

sudheepkv
Participant
Posts: 12
Joined: Tue Jan 03, 2006 3:03 am

Hash files and lookups

Post by sudheepkv »

Could you explain the use of hash files and lookups in a real-time scenario, with an example? How do they affect performance?
sudhi
kcshankar
Charter Member
Posts: 91
Joined: Mon Jan 10, 2005 2:06 am

Post by kcshankar »

Hi,
Welcome aboard.
Search the forum for your queries.
A sample from the forum:
A hash file can simply be described as a file that distributes data throughout a pre-sized, space-allocated file. Every row has to have a key to determine where the row will reside within the file. This key is run through an algorithm (technically a hashing algorithm, hence the name) to determine the location in the file for the row.
One and only one row can exist at a single location.
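
To make the idea concrete, here is a minimal Python sketch (not DataStage code) of how a hashing algorithm can map a key to one of a fixed number of groups in a pre-sized file. The modulus and the hash function are illustrative assumptions only, not the algorithm DataStage actually uses.

Code:

# Conceptual sketch only - illustrates hashed placement, not DataStage's real algorithm.
MODULUS = 7  # number of pre-allocated groups ("buckets") in the file

def group_for_key(key: str) -> int:
    """Hash the key text and reduce it modulo the group count to pick a group."""
    h = 0
    for ch in key:
        h = (h * 31 + ord(ch)) & 0xFFFFFFFF   # simple polynomial hash
    return h % MODULUS

hashed_file = {g: {} for g in range(MODULUS)}  # group -> {key: row}

def write_row(key: str, row: dict) -> None:
    # One and only one row per key: writing the same key again overwrites it.
    hashed_file[group_for_key(key)][key] = row

def lookup(key: str):
    # The key alone tells us which group to read, so the whole file is never scanned.
    return hashed_file[group_for_key(key)].get(key)

write_row("CUST001", {"name": "Alice"})
print(lookup("CUST001"))   # {'name': 'Alice'}
print(lookup("CUST999"))   # None - lookup miss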


Regards
kcs
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Please be more specific about what you mean by "performance". Hashed files (note, not "hash" files) are the fastest possible mechanism for performing a key-based lookup, particularly if cached in memory.
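
As a rough illustration of that claim, here is a small Python sketch (purely conceptual, with made-up data) comparing a keyed lookup against an in-memory structure with scanning every row of a flat file:

Code:

import time

# Made-up reference data standing in for a lookup source.
rows = [{"key": f"K{i:06d}", "value": i} for i in range(200_000)]
cache = {r["key"]: r for r in rows}   # analogous to a memory-cached hashed file

def scan_lookup(key):
    # What a lookup against an unkeyed flat file amounts to: read rows until a match.
    return next((r for r in rows if r["key"] == key), None)

t0 = time.perf_counter(); scan_lookup("K199999"); t1 = time.perf_counter()
cache.get("K199999");                             t2 = time.perf_counter()
print(f"full scan: {t1 - t0:.6f}s   keyed lookup: {t2 - t1:.6f}s")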
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
sudheepkv
Participant
Posts: 12
Joined: Tue Jan 03, 2006 3:03 am

Hashed files

Post by sudheepkv »

ray.wurlod wrote:Please be more specific about what you mean by "performance". Hashed files (note, not "hash" files) are the fastest possible mechanism for performing a key-based lookup, particularly if cached in memory.
Thank you all. I have some more doubts; please respond.

1. In a real-time scenario, lookups can be very large (GBs or TBs). In such a case, how can DataStage manage the storage in main memory?

2. If we update the source file of the lookup (e.g. an Oracle table or a sequential file), how will the change become visible to the lookup dynamically?
sudhi
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

1. In a real-time scenario, lookups can be very large (GBs or TBs). In such a case, how can DataStage manage the storage in main memory?

No. The maximum size of a cached hashed file is 999MB.
If you were to use DataStage Enterprise Edition, you could use a virtual Data Set (or Lookup File Set), which can be much larger. However, ask yourself whether you really need all those columns and all those rows - are all the rows current rows, and are all the columns needed in your job design?
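
As a rough sketch of that advice (conceptual Python with hypothetical column names), trimming the reference data to current rows and to only the columns the job needs can shrink a lookup dramatically:

Code:

# Hypothetical rows extracted from the lookup source.
reference_rows = [
    {"cust_id": "C1", "status": "CURRENT", "name": "Alice", "address": "...", "notes": "..."},
    {"cust_id": "C2", "status": "EXPIRED", "name": "Bob",   "address": "...", "notes": "..."},
]

NEEDED = ("cust_id", "name")   # only the columns the job actually uses

lookup = {
    r["cust_id"]: {c: r[c] for c in NEEDED}
    for r in reference_rows
    if r["status"] == "CURRENT"   # only current rows
}
print(lookup)   # {'C1': {'cust_id': 'C1', 'name': 'Alice'}}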

2. If we update the source file of the lookup (e.g. an Oracle table or a sequential file), how will the change become visible to the lookup dynamically?
It won't happen automatically - you have to build updating of the lookup table in real time into your job design, which rules out a read-only memory cache.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
sudheepkv
Participant
Posts: 12
Joined: Tue Jan 03, 2006 3:03 am

Re: Hash files and lookups

Post by sudheepkv »

sudheepkv wrote:Could you explain the use of hash files and lookups in a real-time scenario, with an example? How do they affect performance?
Thanks a lot. I am a beginner in DS; this is very helpful to me.

Could you explain the following:

1. What is a virtual Data Set?
2. Please explain the second answer; I was not able to understand it.
sudhi
balajisr
Charter Member
Posts: 785
Joined: Thu Jul 28, 2005 8:58 am

Post by balajisr »

Hi

sudheepkv wrote:1. What is a virtual Data Set?
A Data Set which is not persistent (i.e. not stored on disk) is a virtual Data Set.

--Balaji S.R
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Re: Hash files and lookups

Post by kumar_s »

sudheepkv wrote:
sudheepkv wrote:Could you explain the use of hash files and lookups in a real-time scenario, with an example? How do they affect performance?
Thanks a lot. I am a beginner in DS; this is very helpful to me.

Could you explain the following:

1. What is a virtual Data Set?
2. Please explain the second answer; I was not able to understand it.
Hi,
1. As explained, the Data Set passed through a link is also considered a virtual Data Set.

2. It meant that you need to update the hashed file manually once the source is updated.

-Kumar
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Virtual Data Sets only exist in parallel jobs. They do not exist in server jobs. The original post marked the job as a server job.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

ray.wurlod wrote:Virtual Data Sets only exist in parallel jobs. They do not exist in server jobs. The original post marked the job as a server job.
Do you mean that server jobs pass the data as a stream, chunk by chunk, and do not have the option of memory-mapped I/O? :roll:

-kumar
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

2. If we update the source file of the lookup (e.g. an Oracle table or a sequential file), how will the change become visible to the lookup dynamically?

DataStage does not update your lookup table automatically. If you are using a hashed file you have pre-populated this with an image from the target table.

As the job runs, you need to keep the target data updated (this may involve new rows), whether you are writing directly to the target table or to a staging area.

Neither of these approaches updates the hashed file automatically. You have to include an extra stream in your DataStage job that updates the hashed file when required.

If you are doing this, you need it to happen immediately, so that if the same change appears in the next source row, it will be found in the hashed file. This means that you can not use a hashed file read-cache, which does not have the ability to be updated immediately; it is populated only when the job starts.
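
A minimal Python sketch of that flow (assumed logic for illustration, not DataStage itself): each source row is looked up against the live lookup structure, and a miss inserts the new key immediately, so a repeat of the same key later in the stream is found. A read cache snapshotted at job start could not behave this way.

Code:

lookup = {"K1": {"surrogate": 100}}   # pre-populated image of the target table
next_surrogate = 101

def process_row(key: str) -> dict:
    """Look the key up; on a miss, insert it at once so later rows find it."""
    global next_surrogate
    row = lookup.get(key)
    if row is None:                   # lookup failed -> new key
        row = {"surrogate": next_surrogate}
        next_surrogate += 1
        lookup[key] = row             # the update is visible to the very next row
    return row

for k in ["K1", "K2", "K2"]:          # the second "K2" must hit, not miss
    print(k, process_row(k))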
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
sudheepkv
Participant
Posts: 12
Joined: Tue Jan 03, 2006 3:03 am

Hashed file updates

Post by sudheepkv »

ray.wurlod wrote:2. If we update the source file of the lookup (e.g. an Oracle table or a sequential file), how will the change become visible to the lookup dynamically?

DataStage does not update your lookup table automatically. If you are using a hashed file you have pre-populated this with an image from the target table.

As the job runs, you need to keep the target data updated (this may involve new rows), whether you are writing directly to the target table or to a staging area.

Neither of these approaches updates the hashed file automatically. You have to include an extra stream in your DataStage job that updates the hashed file when required.

If you are doing this, you need it to happen immediately, so that if the same change appears in the next source row, it will be found in the hashed file. This means that you can not use a hashed file read-cache, which does not have the ability to be updated immediately; it is populated only when the job starts.

Thanks a lot. Please tell me:

How can I include an extra stream in my DataStage job that updates the hashed file when required?
sudhi
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Re: Hashed file updates

Post by ArndW »

sudheepkv wrote:Thanks a lot. Please tell me:
How can I include an extra stream in my DataStage job that updates the hashed file when required?
Sudheepkv,

Adding an update or write stream to a DataStage hashed file is part of the basic job design method. You drop a Hashed File stage onto the Designer canvas, pull a link from a Transformer (or other) stage to it, fill in the hashed file location and other details, and load the columns from the source stage. I think the tutorial has some basic examples of this.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Code:

        HFstage1       HFstage2
          |              ^
          |              |
          V              |
--------> TransformerStage -------->
Both hashed file stages refer to the same hashed file. Use an annotation to warn future developers/maintainers that this is the case.

HFstage1 performs lookups (cache disabled, lock for update). This sets a row-level lock in the hashed file.

HFstage2 writes to the hashed file only if the lookup failed. The write releases the record-level lock.
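
Here is a conceptual Python sketch (not DataStage) of the lock-for-update idea: the lookup acquires a per-key lock before reading, and the subsequent write, done only when the lookup missed, releases it, so nothing else can slip the same key in between. The dict of threading locks is an assumption made purely for illustration.

Code:

import threading
from collections import defaultdict

hashed_file = {}                           # key -> row
key_locks = defaultdict(threading.Lock)    # simulated record-level locks

def lookup_with_lock(key: str):
    """HFstage1: read the row, first taking a record-level lock on the key."""
    key_locks[key].acquire()               # lock for update
    return hashed_file.get(key)

def write_and_unlock(key: str, row: dict) -> None:
    """HFstage2: write the row for a missed key; the write releases the lock."""
    hashed_file[key] = row
    key_locks[key].release()

def release_without_write(key: str) -> None:
    """If the lookup succeeded, nothing is written, so just release the lock."""
    key_locks[key].release()

row = lookup_with_lock("K42")
if row is None:
    write_and_unlock("K42", {"surrogate": 1})   # lookup failed -> insert
else:
    release_without_write("K42")
print(hashed_file)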
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

In any scenario - real time or otherwise - it depends on what you're looking up and why.
For example, to detect existence, you need only the key values in the lookup table (or hashed file). Even in India this is unlikely to get anywhere near 2GB. But, if it does, all you have to do is change your hashed file to 64-bit internal addressing and you can have up to 19 million TB (provided that the operating system supports files of this size).

Of course, no affordable computer has this much RAM - it's nothing to do with DataStage. In DataStage you can limit the size of the memory cache.

In terms of slowly changing dimensions (which is what you describe), you must design to update the hashed file whenever you update the target, and make sure that that happens during data flow. Many sites have implemented this; it's a fundamental DataStage technique.

Code:

               HFstage1    HFstage2
                 |           ^
                 |           |
                 V           |
        ----->  TransformerStage ----->
HFstage1, having been pre-populated with current records from the target table, reads (lookup) from the hashed file, with caching set to "disabled, lock for update". HFstage2 writes to the very same hashed file, with write cache disabled.

You must use separate hashed file stages, even though they work with the same hashed file.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.