Hash files and lookups

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

sudheepkv
Participant
Posts: 12
Joined: Tue Jan 03, 2006 3:03 am

Hash files and lookups

Post by sudheepkv »

Could you explain the use of hash files and lookups in a real-time scenario, with an example? How do they affect performance?
sudhi
kcshankar
Charter Member
Posts: 91
Joined: Mon Jan 10, 2005 2:06 am

Post by kcshankar »

Hi,
Welcome aboard.
Search the forum for your queries.
A sample from the forum:
A hash file can simply be described as a file that distributes data throughout a pre-sized, space-allocated file. Every row has to have a key to determine where the row will reside within the file. This key is run through an algorithm (technically a hashing algorithm, hence the name) to determine the location in the file for the row.
One and only one row can exist at a single location.
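
To make the idea concrete, here is a minimal Python sketch (not DataStage code) of how a hashing algorithm can map a key to one of a fixed number of groups in a pre-sized file. The modulus and the hash function are illustrative assumptions only, not the algorithm DataStage actually uses.

Code:

# Conceptual sketch only - illustrates hashed placement, not DataStage's real algorithm.
MODULUS = 7  # number of pre-allocated groups ("buckets") in the file

def group_for_key(key: str) -> int:
    """Hash the key text and reduce it modulo the group count to pick a group."""
    h = 0
    for ch in key:
        h = (h * 31 + ord(ch)) & 0xFFFFFFFF   # simple polynomial hash
    return h % MODULUS

hashed_file = {g: {} for g in range(MODULUS)}  # group -> {key: row}

def write_row(key: str, row: dict) -> None:
    # One and only one row per key: writing the same key again overwrites it.
    hashed_file[group_for_key(key)][key] = row

def lookup(key: str):
    # The key alone tells us which group to read, so the whole file is never scanned.
    return hashed_file[group_for_key(key)].get(key)

write_row("CUST001", {"name": "Alice"})
print(lookup("CUST001"))   # {'name': 'Alice'}
print(lookup("CUST999"))   # None - lookup miss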


Regards
kcs
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Please be more specific about what you mean by "performance". Hashed files (note, not "hash" files) are the fastest possible mechanism for performing a key-based lookup, particularly if cached in memory.
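
As a rough illustration of that claim, here is a small Python sketch (purely conceptual, with made-up data) comparing a keyed lookup against an in-memory structure with scanning every row of a flat file:

Code:

import time

# Made-up reference data standing in for a lookup source.
rows = [{"key": f"K{i:06d}", "value": i} for i in range(200_000)]
cache = {r["key"]: r for r in rows}   # analogous to a memory-cached hashed file

def scan_lookup(key):
    # What a lookup against an unkeyed flat file amounts to: read rows until a match.
    return next((r for r in rows if r["key"] == key), None)

t0 = time.perf_counter(); scan_lookup("K199999"); t1 = time.perf_counter()
cache.get("K199999");                             t2 = time.perf_counter()
print(f"full scan: {t1 - t0:.6f}s   keyed lookup: {t2 - t1:.6f}s")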
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
sudheepkv
Participant
Posts: 12
Joined: Tue Jan 03, 2006 3:03 am

Hashed files

Post by sudheepkv »

ray.wurlod wrote:Please be more specific about what you mean by "performance". Hashed files (note, not "hash" files) are the fastest possible mechanism for performing a key-based lookup, particularly if cached in memory.
Thank you all. I have some more doubts; please respond.

1. In a real-time scenario, lookups can be very large (GBs or TBs). In such a case, how can DataStage manage the storage in main memory?

2. If we update the source file of the lookup (e.g. an Oracle table or a sequential file), how will the change become visible to the lookup dynamically?
sudhi
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

1. In a real-time scenario, lookups can be very large (GBs or TBs). In such a case, how can DataStage manage the storage in main memory?

No. The maximum size of a cached hashed file is 999MB.
If you were to use DataStage Enterprise Edition, you could use a virtual Data Set (or Lookup File Set), which can be much larger. However, ask yourself whether you really need all those columns and all those rows - are all the rows current rows, and are all the columns needed in your job design?
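
As a rough sketch of that advice (conceptual Python with hypothetical column names), trimming the reference data to current rows and to only the columns the job needs can shrink a lookup dramatically:

Code:

# Hypothetical rows extracted from the lookup source.
reference_rows = [
    {"cust_id": "C1", "status": "CURRENT", "name": "Alice", "address": "...", "notes": "..."},
    {"cust_id": "C2", "status": "EXPIRED", "name": "Bob",   "address": "...", "notes": "..."},
]

NEEDED = ("cust_id", "name")   # only the columns the job actually uses

lookup = {
    r["cust_id"]: {c: r[c] for c in NEEDED}
    for r in reference_rows
    if r["status"] == "CURRENT"   # only current rows
}
print(lookup)   # {'C1': {'cust_id': 'C1', 'name': 'Alice'}}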

2. If we update the source file of the lookup (e.g. an Oracle table or a sequential file), how will the change become visible to the lookup dynamically?
It won't happen automatically - you have to build updating of the lookup table in real time into your job design, which rules out a read-only memory cache.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
sudheepkv
Participant
Posts: 12
Joined: Tue Jan 03, 2006 3:03 am

Re: Hash files and lookups

Post by sudheepkv »

sudheepkv wrote:Could you explain the use of hash files and lookups in a real-time scenario, with an example? How do they affect performance?
Thanks a lot. I am a beginner in DS; this is very helpful to me.

Could you explain the following:

1. What is a virtual Data Set?
2. Please explain the second answer; I was not able to understand it.
sudhi
balajisr
Charter Member
Posts: 785
Joined: Thu Jul 28, 2005 8:58 am

Post by balajisr »

Hi

sudheepkv wrote:1. What is a virtual Data Set?
A Data Set which is not persistent (i.e. not stored on disk) is a virtual Data Set.

--Balaji S.R
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Re: Hash files and lookups

Post by kumar_s »

sudheepkv wrote:
sudheepkv wrote:Could you explain the use of hash files and lookups in a real-time scenario, with an example? How do they affect performance?
Thanks a lot. I am a beginner in DS; this is very helpful to me.

Could you explain the following:

1. What is a virtual Data Set?
2. Please explain the second answer; I was not able to understand it.
Hi,
1. As explained, the Data Set passed through a link is also considered a virtual Data Set.

2. It meant that you need to update the hashed file manually once the source is updated.

-Kumar
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Virtual Data Sets only exist in parallel jobs. They do not exist in server jobs. The original post marked the job as a server job.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

ray.wurlod wrote:Virtual Data Sets only exist in parallel jobs. They do not exist in server jobs. The original post marked the job as a server job.
Do you mean that server jobs pass the data as a stream, chunk by chunk, and do not have the option of memory-mapped I/O? :roll:

-kumar
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

2. If we update the source file of the lookup (e.g. an Oracle table or a sequential file), how will the change become visible to the lookup dynamically?

DataStage does not update your lookup table automatically. If you are using a hashed file you have pre-populated this with an image from the target table.

As the job runs, you need to keep the target data updated (this may involve new rows), whether you are writing directly to the target table or to a staging area.

Neither of these approaches updates the hashed file automatically. You have to include an extra stream in your DataStage job that updates the hashed file when required.

If you are doing this, you need it to happen immediately, so that if the same change appears in the next source row, it will be found in the hashed file. This means that you can not use a hashed file read-cache, which does not have the ability to be updated immediately; it is populated only when the job starts.
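
A minimal Python sketch of that flow (assumed logic for illustration, not DataStage itself): each source row is looked up against the live lookup structure, and a miss inserts the new key immediately, so a repeat of the same key later in the stream is found. A read cache snapshotted at job start could not behave this way.

Code:

lookup = {"K1": {"surrogate": 100}}   # pre-populated image of the target table
next_surrogate = 101

def process_row(key: str) -> dict:
    """Look the key up; on a miss, insert it at once so later rows find it."""
    global next_surrogate
    row = lookup.get(key)
    if row is None:                   # lookup failed -> new key
        row = {"surrogate": next_surrogate}
        next_surrogate += 1
        lookup[key] = row             # the update is visible to the very next row
    return row

for k in ["K1", "K2", "K2"]:          # the second "K2" must hit, not miss
    print(k, process_row(k))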
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
sudheepkv
Participant
Posts: 12
Joined: Tue Jan 03, 2006 3:03 am

Hashed file updates

Post by sudheepkv »

ray.wurlod wrote:2. If we update the source file of the lookup (e.g. an Oracle table or a sequential file), how will the change become visible to the lookup dynamically?

DataStage does not update your lookup table automatically. If you are using a hashed file you have pre-populated this with an image from the target table.

As the job runs, you need to keep the target data updated (this may involve new rows), whether you are writing directly to the target table or to a staging area.

Neither of these approaches updates the hashed file automatically. You have to include an extra stream in your DataStage job that updates the hashed file when required.

If you are doing this, you need it to happen immediately, so that if the same change appears in the next source row, it will be found in the hashed file. This means that you can not use a hashed file read-cache, which does not have the ability to be updated immediately; it is populated only when the job starts.

Thanks a lot. Please tell me:

How can I include an extra stream in my DataStage job that updates the hashed file when required?
sudhi
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Re: Hashed file updates

Post by ArndW »

sudheepkv wrote:Thanks a lot. Please tell me:
How can I include an extra stream in my DataStage job that updates the hashed file when required?
Sudheepkv,

Adding an update or write stream to a DataStage hashed file is part of the basic job design method. You drop a Hashed File stage onto the Designer canvas, pull a link from a Transformer (or other) stage to it, fill in the hashed file location and other details, and load the columns from the source stage. I think the tutorial has some basic examples of this.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Code:

        HFstage1       HFstage2
          |              ^
          |              |
          V              |
--------> TransformerStage -------->
Both hashed file stages refer to the same hashed file. Use an annotation to warn future developers/maintainers that this is the case.

HFstage1 performs lookups (cache disabled, lock for update). This sets a row-level lock in the hashed file.

HFstage2 writes to the hashed file only if the lookup failed. The write releases the record-level lock.
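
Here is a conceptual Python sketch (not DataStage) of the lock-for-update idea: the lookup acquires a per-key lock before reading, and the subsequent write, done only when the lookup missed, releases it, so nothing else can slip the same key in between. The dict of threading locks is an assumption made purely for illustration.

Code:

import threading
from collections import defaultdict

hashed_file = {}                           # key -> row
key_locks = defaultdict(threading.Lock)    # simulated record-level locks

def lookup_with_lock(key: str):
    """HFstage1: read the row, first taking a record-level lock on the key."""
    key_locks[key].acquire()               # lock for update
    return hashed_file.get(key)

def write_and_unlock(key: str, row: dict) -> None:
    """HFstage2: write the row for a missed key; the write releases the lock."""
    hashed_file[key] = row
    key_locks[key].release()

def release_without_write(key: str) -> None:
    """If the lookup succeeded, nothing is written, so just release the lock."""
    key_locks[key].release()

row = lookup_with_lock("K42")
if row is None:
    write_and_unlock("K42", {"surrogate": 1})   # lookup failed -> insert
else:
    release_without_write("K42")
print(hashed_file)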
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

In any scenario - real time or otherwise - it depends on what you're looking up and why.
For example, to detect existence, you need only the key values in the lookup table (or hashed file). Even in India this is unlikely to get anywhere near 2GB. But, if it does, all you have to do is change your hashed file to 64-bit internal addressing and you can have up to 19 million TB (provided that the operating system supports files of this size).

Of course, no affordable computer has this much RAM - it's nothing to do with DataStage. In DataStage you can limit the size of the memory cache.

In terms of slowly changing dimensions (which is what you describe), you must design to update the hashed file whenever you update the target, and make sure that that happens during data flow. Many sites have implemented this; it's a fundamental DataStage technique.

Code:

               HFstage1    HFstage2
                 |           ^
                 |           |
                 V           |
        ----->  TransformerStage ----->
HFstage1, having been pre-populated with current records from the target table, reads (lookup) from the hashed file, with caching set to "disabled, lock for update". HFstage2 writes to the very same hashed file, with write cache disabled.

You must use separate hashed file stages, even though they work with the same hashed file.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.