Hash lookup vs. table lookup for smaller volumes

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Xanadu
Participant
Posts: 61
Joined: Thu Jul 22, 2004 9:29 am

Hash lookup vs. table lookup for smaller volumes

Post by Xanadu »

Hello All,
As a rule of thumb I have always avoided reference table lookups. Most of the time I either do a hash lookup or, if I am in the mood for some SQL, just use joins (left outer or inner, as appropriate).
But now I have a question. The data model has a table that would not have more than 20 rows in the DWH, and that table is used for a lookup in another job.
The predicament is whether I should write that "small" table to a hash file when I load it with data and then use the hash file for the lookup. Would that make a considerable difference to the speed of the lookup, considering there are fewer than 50 rows?
Additional info: I am working on customising PeopleSoft EPM, so as a matter of maintaining "consistency" I am using the Dynamic RDBMS stage even for the reference table lookups. (I understand OLE/ODBC might make it faster, but again it's a smaller volume of data, so...?)
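
For what it's worth, the join alternative I mean looks roughly like this (the table and column names are just placeholders, not the actual EPM objects):

Code:

SELECT f.order_id,
       f.status_code,
       r.status_desc                    -- NULL when there is no match, like an unmatched lookup
  FROM order_fact f
  LEFT OUTER JOIN status_ref r          -- the small (~20 row) reference table
    ON r.status_code = f.status_code;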

Any input is appreciated.
Thanks
~Xanadu
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX

Post by kduke »

Small tables are the best candidates for hash files. You can load the hash file into memory, so it should be very fast both to load and to use as a lookup.
Mamu Kim
peternolan9
Participant
Posts: 214
Joined: Mon Feb 23, 2004 2:10 am
Location: Dublin, Ireland

Re: Hash lookup vs. table lookup for smaller volumes

Post by peternolan9 »

Xanadu,
I would always suggest you use hash files even for small lookup tables unless there is a reason not to.

Reasons not to....

1. The lookup is performed so few times that it just doesn't matter... this is true of practically all dimension table maintenance if you are updating incrementally. The difference between a few hundred thousand lookups per night against the database and against a hash table is not going to be noticeable.

2. You need to see the updated data in the next lookup... as we go more and more to 'real time', more jobs need to see the data in the table after each set of rows is processed, so you don't want to have to refresh or update your hash table as well...

3. The hash table is so huge that you cannot get it under the 999MB in-memory limit and you cannot figure out a way to split it so that it fits. Then you may be forced to work with database tables rather than hash tables... but you will have to have upwards of 10M rows in your customer/account tables to get into this situation, perhaps even more...

If that is the case, get someone who knows DS really well, because he/she will be worth it... we just had this problem at a customer with 18M accounts.


HTH
Best Regards
Peter Nolan
www.peternolan.com
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX

Post by kduke »

I say try it both ways and tell us which one was faster overall.
Mamu Kim
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

You can update a hashed file (whether or not it's cached) quite easily. If the update needs to occur to the hashed file (output constraint on the link writing to the hashed file), the design looks like this.

Code:

        HFStage_1       HFStage_2
           |                ^
           |                |
           v                |
  ----->  Transformer_Stage_Checking  ----->
On the link that performs the read from the hashed file, ensure that "lock for updating" is selected, whether or not load file to memory is selected.
On the link that writes to the hashed file, the cache setting must match that of the other hashed file stage (both must use cache, or neither).
And, of course, each hashed file stage refers to exactly the same hashed file with the same metadata.
The technique of using shared hashed files is covered in the Hashed Stage Disk Cache (?) manual (dsdskche.pdf).
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
peternolan9
Participant
Posts: 214
Joined: Mon Feb 23, 2004 2:10 am
Location: Dublin, Ireland

Post by peternolan9 »

ray.wurlod wrote: You can update a hashed file (whether or not it's cached) quite easily.
Hi Ray/All,
Of course, absolutely correct... my 2 cents is to always write the data back to the database so that I get all the backup/recovery facilities of the database and the data stays in sync...

This is one of the things that I didn't like about the get-next-key function in either DS or INFA... in both cases the 'next key' is stored in the database that is the ETL repository, which means that in the event of a failure and a restore of the database one must either live with a 'gap' in the keys or edit the keys in their source to get consecutive numbers again... or worse, if there was a failure in the repository database and it needed to be restored, then one needs to go into the repository database and set the keys back to the last ones used...

I have always preferred to keep the 'next key' in the same database as the dimension table the key is going onto, and to have the update to the table maintaining the key be in the same unit of work as the update to the dimension table itself... if at all possible...
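
Roughly what I have in mind, with made-up table and column names, is a single unit of work in the target database along these lines:

Code:

-- Reserve the next key and use it for the new dimension row,
-- committing both changes together so they stay in sync.
UPDATE etl_next_key
   SET next_key = next_key + 1
 WHERE table_name = 'CUSTOMER_DIM';

INSERT INTO customer_dim (customer_key, customer_id, customer_name)
SELECT next_key, :customer_id, :customer_name    -- binds supplied by the load job
  FROM etl_next_key
 WHERE table_name = 'CUSTOMER_DIM';

COMMIT;    -- the key counter and the dimension succeed or fail together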

We did this with DS when I was at Ardent about 4 years ago, only it was pretty hard to do....

Like I said, my 2cents worth...
Best Regards
Peter Nolan
www.peternolan.com
datastage
Participant
Posts: 229
Joined: Wed Oct 23, 2002 10:10 am
Location: Omaha

Post by datastage »

As Kim requested, I would be interested to see your timings both ways, but like the others I would assume that using the hashed file will perform best.

This may not fit your current case with the 20-row table, but one thing to consider when your reference table may be changing is that writing it to a hashed file has the advantage of preserving the state of the table as it was during the run (at least until you clear or update the hashed file again). For instance, a production load happens Monday night, Tuesday morning the original database table changes, and later a problem is found. You still have in the hashed file the data that was used during the load, even though the database table has changed.

Also, I agree with Peter Nolan on his comment about preferring to store the next key in the database (preferably with a database sequence rather than your own custom store) over using the get-next-key function of the ETL tool.
Byron Paul
WARNING: DO NOT OPERATE DATASTAGE WITHOUT ADULT SUPERVISION.

"Strange things are afoot in the reject links" - from Bill & Ted's DataStage Adventure
chucksmith
Premium Member
Posts: 385
Joined: Wed Jun 16, 2004 12:43 pm
Location: Virginia, USA

Post by chucksmith »

Regarding backup, a simple job can copy the SDKSequences hash file to a database table to provide a point in time recovery option.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

I never make use of the "Next Key" methodology, as I'd rather not have those hash files in my projects. I also minimize the use of Oracle Sequences, if possible.

We store surrogate key values in an Oracle table and have built Shared Containers to access and update them in a standard fashion. The key is pulled at the beginning of a job, managed internally in the job and the final result is written back to Oracle.
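
As a rough sketch of the idea only (the names here are invented, not our actual containers or tables):

Code:

-- One row per target table holds the last surrogate key used.
CREATE TABLE surrogate_keys (
    table_name  VARCHAR2(30) PRIMARY KEY,
    last_key    NUMBER       NOT NULL
);

-- Start of job: read (and lock) the current value.
SELECT last_key
  FROM surrogate_keys
 WHERE table_name = 'CUSTOMER_DIM'
   FOR UPDATE;

-- The job increments the key internally for each row it writes, then...

-- End of job: write the final value back and commit.
UPDATE surrogate_keys
   SET last_key = :final_key_used
 WHERE table_name = 'CUSTOMER_DIM';

COMMIT;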

My two cents. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers