Which is more efficient Hash Tables or OCI lookups

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

RamonaReed
Participant
Posts: 26
Joined: Fri Feb 27, 2004 7:23 am
Location: WI, USA

Which is more efficient Hash Tables or OCI lookups

Post by RamonaReed »

I am trying to find out which is more efficient: using hash tables as lookups or using OCI lookups. We have many dimension tables that have keys associated with them, and our fact tables need to retrieve these associated foreign keys. Some of our dimension tables have over 60 fields that are all part of the key. Is there some set rule as to when we should use hash tables and when to use an OCI lookup?

Thanks for any input. :idea:
Mona
raju_chvr
Premium Member
Posts: 165
Joined: Sat Sep 27, 2003 9:19 am
Location: USA

Re: Which is more efficient Hash Tables or OCI lookups

Post by raju_chvr »

This is a very wide-open question. It has been discussed many times on this forum; you can do a search on it.

In short, assuming you have Oracle and DS on different machines across the network, take it that hash-file lookups are much faster, and they are re-usable in different jobs. As long as the size of the hash file doesn't exceed 2GB you are fine with the number of keys and columns. I believe the maximum number of columns is 256 (from one of Ray's posts).

See for yourself: design two jobs, one with hashed files and the other without, and compare the run times. You will definitely see the difference between the two.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

256 is the maximum number of columns that can be returned in a result set from MS Access.
There is no limit in hashed files.
Obviously, though, if you push the boundaries, you're defeating the aim of efficiency; whether it's a hashed file or a database table, a SELECT that retrieves two columns is going to be more efficient than one that returns 200.
And that's not all you can do. The rule of thumb when using hashed files is to load only the columns that the job will use, and only the rows that the job will use. For example, when performing a reverse lookup of natural to surrogate keys in a slowly changing dimension, you would load only the "current" rows, not the expired ones, into the hashed file.
The hashed file itself can be tuned, though the defaults work well enough in most situations. It's when you need the final 5% that you start worrying about tuning the hashed file.
Finally, and most important for efficiency, it is possible to cache most hashed files in memory, so that I/O is performed at memory speed rather than at disk speed (at least 1000 times quicker).
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
peternolan9
Participant
Posts: 214
Joined: Mon Feb 23, 2004 2:10 am
Location: Dublin, Ireland

Re: Which is more efficient Hash Tables or OCI lookups

Post by peternolan9 »

Hi Mona,
why would you have 60 fields be part of a key on a dimension table? A dimension table should have one key and it should be an integer.

When creating the real key to integer key map one good way is to concatenate the real keys into a character string and put it onto the dimension table. Then, when you do lookups you merely need to string together the key in the job and lookup on one char string in the hash table. The char string usually gives you pretty good performance on the lookup. I have called the string field 'dim_char_ky_fld' (since 1994).
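
For illustration, here is a minimal DataStage BASIC sketch of that idea (the routine name, argument names and delimiter are invented for the example):

Code: Select all

* BuildDimCharKey -- sketch of a transform function with arguments NatKey1..NatKey3.
* Trim each natural-key column and join the parts with a delimiter that is
* known never to appear in the data, giving a single string to use as the
* hashed-file lookup key (the 'dim_char_ky_fld' described above).
Delim = Char(253)   ;* any character guaranteed absent from the data
Ans = Trim(NatKey1) : Delim : Trim(NatKey2) : Delim : Trim(NatKey3)

The fact-side job builds the lookup key with the same expression from its incoming columns, so both sides of the lookup form the string identically.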

I have published cobol and C versions of how to do this on my web site.....(without the ability to have a DS license 'at home' it makes no sense to try and publish DS jobs....I can't read them or maintain them except when on a client site.)

If you do it this way and can load your hash tables into memory, performance will be far better than OCI under any circumstances. If your hash files cannot fit into memory I'm not so sure...I am conducting some tests this week to answer that very question.

So, in any case, you should not try performing lookups on Hash files or Oracle with 60 fields......both will be slowed down quite a bit...

RamonaReed wrote: Some of our dimension tables have over 60 fields that are all part of the key.

Thanks for any input. :idea:
Mona
Best Regards
Peter Nolan
www.peternolan.com
peternolan9
Participant
Posts: 214
Joined: Mon Feb 23, 2004 2:10 am
Location: Dublin, Ireland

Re: Which is more efficient Hash Tables or OCI lookups

Post by peternolan9 »

Hi Mona/All,
this might spark some 'discussion/debate'.

I have just been doing some preliminary testing where I have a 21M row hash table that I load from Oracle to 'improve performance'.

Well, on an untuned system (both DS and Oracle), with just a simple multi-part index on the Oracle side versus a string key on the hash table, I am seeing 5x better performance from OCI than from the hash stage when the hash stage is not loaded into memory.

I am sure we can tune both DS/Oracle to make them go faster.

We are going to try out an Oracle hashed index to see if it goes any faster than a multi-part b-tree index.

RamonaReed wrote: I am trying to find out which is more efficient: using hash tables as lookups or using OCI lookups.

Is there some set rule as to when we should use hash tables and when to use an OCI lookup?

Thanks for any input. :idea:
Mona
Best Regards
Peter Nolan
www.peternolan.com
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Re: Which is more efficient Hash Tables or OCI lookups

Post by kcbland »

Not to be argumentative, but discretion has to be used when designing hash file usage for your ETL processes. I would argue that in most cases, an entire large reference table is not required to process the source data. For example, on a given day, only a small percentage of a source customer table will come into the data warehouse for processing into the customer dimension.

I have constantly tried to convince people not to load the entire customer dimension into a hash file just so that you can see whether the incremental change source row is an insert or an update. If your customer dimension has 50 million rows, and you're processing a 1 million row delta file, then why put 49 million extra rows into your hash file? You could simply extract the natural keys from the source data, blast them into a work table in the target db, do an inner-join extract, and then place the resulting rows, fewer than 1 million, into your hash file (some customers will be new and therefore not yet in the target). Now you have manageable pieces.
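
Sketched in job-flow terms (stage labels are only indicative), that looks something like:

Code: Select all

seq (source delta) --> xfm (natural keys only)  --> OCI (load work table)
OCI (work table inner join dimension)           --> xfm --> hash (lookup for this run)

The first flow lands just the delta's natural keys in the target database; the second pulls back only the dimension rows that join to them and builds the hash file from that result.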

So, beginners usually do not understand how to use hash files, and therefore do things that make hash files look poor.

I can say this with full authority and assurance with years of experience with this product in large data warehouse environments: DataStage Server Edition's single most powerful aspect is its hash file technology. When used within the constraints for which it is designed, it will BY FAR be the superior reference construct over a DB lookup. I posted a link earlier to a brief I put on the forum trying to highlight this. I suggest you read it.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
peternolan9
Participant
Posts: 214
Joined: Mon Feb 23, 2004 2:10 am
Location: Dublin, Ireland

Re: Which is more efficient Hash Tables or OCI lookups

Post by peternolan9 »

Hi Kenneth,
I don't feel you are argumentative, so no problems.... :D I'm happy to have your input.

Tom and I were fully aware of how to use hash files, and I've considered them THE big difference between INFA and DS. (As well as the very un-intuitive INFA front end.)

So, as we were taught early on at ASCL, we have made extensive use of hash files during our development. (Also we developed on a small Win2000 box with only a sample of 700K customers/accounts.)

However, this client has 18M accounts and 15M customers and 60M CDRs per day. What we were not aware of were the 'limitations' on hash files, since we had not hit them previously, especially because we were developing on version 6 but planning on running V7.

What we are doing at the moment is testing out our 'initial load' of the ODS and hence, in many cases, we do want full access to all accounts and all customers...

Also, a telco is not like a bank or an insurance company in that most days most customers make one or more phone calls.......so in many cases we do want access to the customer/account dimensions as we attribute the CDRs.......

We will play around with a few other options....for example, if we sort the CDRs by Customer/Account prior to attribution we should get very high hit rates for lookup rows being in memory....

I would be more than keen to hear from people building CDR based DWs with DS who have 10M+ accounts/customers.

Detailed information on 'limitations' or 'tricks/techniques' to get around the issues generated by such large volumes is hard to come by, even inside ASCL.

Also, I am very keen to hear about the performance of DS in this type of environment. We are on a 12 x 1.2GHz processor Sun 12K with an 8-processor DS license. Our development machine has 4 processors. We have done no database tuning yet on the dev machine.....we are still awaiting a 60-day license for performance stats software from Sun.

So, anyone out there who has a large scale environment on this who would like to share, please feel free....here or peter@peternolan.com.
Best Regards
Peter Nolan
www.peternolan.com
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Last year I worked with a telco in India. They were taking on one million new customers per week for eighteen months. (That's still only about 5% market share!) But the CDR numbers were huge, and the customer dimension was huge too. Hardware was a 32-CPU SunFire 15000 with (initially) 34TB of disk and lots of memory. They are also running SAP BW and Oracle on this machine, and are happy with performance so far.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

I haven't done a CDR (call detail record?) warehouse, but the principles still apply. If you have a VLD (Very Large Dimension), then you typically have a high volume of fact records processing daily; that only makes sense.

In a high volume fact processing solution, you MUST employ job instantiation to divide and conquer your source data. This means you design the job such that you can run multiple job instances, using a source data dividing technique to separate (partition) the source data so that each identical job clone (instance) operates on its share of data, producing the output for loading. This means that if you have 24 CPUs, you potentially can have 24 instances each operating on its own CPU. Without job instantiation, you inevitably will have a single-threaded process. For those using PX, this divide and conquer is done automatically for you because of the nature of the underlying technology. For Server jobs, we do things the old-fashioned way.

Now, with respect to hash files, your large dimension is too big to be used efficiently by DataStage. But let's pretend you choose to divide and conquer your fact data around your large dimension. Take your dimension and split (partition) the data into 24 portions, loading each portion into its own hash file. Let's say you use the MOD function on the customer_key and partition the data so that all rows resulting in 0 go to hash file stg_CUSTOMER_1, all rows resulting in 1 go to stg_CUSTOMER_2, and so on to stg_CUSTOMER_24. You have smaller, efficient hash files that are amenable to caching. All of your jobs will have two parameters: NumberOfPartitions and MyPartitionNumber. Your hash file job looks something like:

Code: Select all

seq --> xfm --> hash
your xfm stage has a constraint to throw away all rows not for this job:

Code: Select all

MOD(inlink.customer_key, NumberOfPartitions) = MyPartitionNumber -1
and your hash file name is:

Code: Select all

stg_CUSTOMER_#MyPartitionNumber#
Your fact transformation job looks something like this:

Code: Select all

seq --> xfm w/lookups --> xfm w/lookups --> xfm w/lookups --> seq (inserts file)
                                                          --> seq (updates file)
You will put the same constraint into the job to throw away all rows not for this job:

Code: Select all

MOD(inlink.customer_key, NumberOfPartitions) = MyPartitionNumber -1
Now, your job control has to feed each job instance (1 to 24) the appropriate MyPartitionNumber value (1 to 24) for that instance. It does this for the hash lookup build job, as well as the fact transformation job. The hash lookup build jobs execute in parallel, and the fact transformation jobs execute in parallel. You will crush your server, as well as get the most bang for the hash file buck.
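
As a rough sketch, a controlling batch job written in DataStage BASIC might drive the instances like this (the job name 'FactTransform' is made up, and real code would check every return code):

Code: Select all

* Start N instances of the fact transformation job, one per partition,
* passing each instance its partition number, then wait for them all.
NumberOfPartitions = 24
Handles = ''
For P = 1 To NumberOfPartitions
   hJob = DSAttachJob('FactTransform.' : P, DSJ.ERRFATAL)
   ErrCode = DSSetParam(hJob, 'NumberOfPartitions', NumberOfPartitions)
   ErrCode = DSSetParam(hJob, 'MyPartitionNumber', P)
   ErrCode = DSRunJob(hJob, DSJ.RUNNORMAL)
   Handles<P> = hJob
Next P
For P = 1 To NumberOfPartitions
   ErrCode = DSWaitForJob(Handles<P>)
Next P

The same loop, with a different job name, starts the hash lookup build instances before the fact instances are released.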

You must NOT bring database interaction into the mix, because instantiation as described here will saturate the database with lookup queries. If you had a database lookup and ran 24 instances, each one would be hitting the database and throughput would drop towards 0 rows/second, as the database gets too many requests and becomes a hindrance. This is where the hash file technology, with preloads, shared caching, and the cache daemon, KILLS Informatica, even using Server technology. Also, you cannot be loading into or sourcing directly from a database, as again this limits your ability to partition and instantiate jobs (query instantiation is difficult, and again you can saturate the database with too much work).

Good luck!
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

I forgot to mention that we were VERY selective about which columns to load into the hashed file stage, so that the average row size was small, and we were able to get very many rows into the hashed file.

Eventually, however, it did get too large to cache, but even disk based performance was far better than trying to overwhelm the database server (Oracle) with lookups.

It remains the case, even when not cached, that hashed file technology - determining the location of a row directly by computation on the key value, without recourse to a table scan or index scan - is "the best".

(And, yes, we instantiated multiple instances of jobs.)
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX

Post by kduke »

Peter
When creating the real key to integer key map one good way is to concatenate the real keys into a character string and put it onto the dimension table. Then, when you do lookups you merely need to string together the key in the job and lookup on one char string in the hash table. The char string usually gives you pretty good performance on the lookup. I have called the string field 'dim_char_ky_fld' (since 1994).
I was wondering why you do not use a compound key in the hash file. This works very well and usually eliminates duplicates. Let's say your key is country plus telephone number: you may never get a duplicate. But if it were store plus customer, then you could, with store 10 + customer 11 and store 101 + customer 1 concatenating to the same string. DataStage uses @TM, or Char(251), as the key field separator, which is usually never part of the data stream. You may be doing this, but I think a lot of users are unclear about how simple compound keys are in hash files and about the flaws of concatenating them into a single field.
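
To make the collision concrete (values invented for the example):

Code: Select all

* Plain concatenation: two different compound keys collapse to the same string.
*   '10'  : '11'  ->  '1011'
*   '101' : '1'   ->  '1011'
* Defined as separate key columns, DataStage joins the parts internally
* with @TM (Char(251)), so the two keys stay distinct:
KeyA = '10' : @TM : '11'
KeyB = '101' : @TM : '1'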

Key columns are positional, just like hashed file fields, so store needs to come first in the metadata. Metadata should always be copied from the job that builds the hash file to all the jobs which use the hash file so they are always in sync. MetaStage will produce hash file usage reports to help sync them up. Out-of-sync metadata is a very common mistake.

I am very impressed by all these numbers. I am still not sure who won the pissing contest. :)

I have seen hash files run 5 to 1000s of times faster than ODBC or native lookups. Ken's SCD rollup is a perfect example.
Mamu Kim
peternolan9
Participant
Posts: 214
Joined: Mon Feb 23, 2004 2:10 am
Location: Dublin, Ireland

Partitioning jobs

Post by peternolan9 »

Hi Kenneth,
I understand what you are saying with this append....I've been trying to figure out how to properly partition using server for a week or two now. Tom Nel has been doing some research on it for us.

In your suggestions you would have the input file read by a transformer and have 'mypartitionnumber' passed as a parameter, is that correct?

My question is this....we have a really neat way of invoking jobs in which a routine fetches the parameters from a file and passes them to the job. So we don't have the ability to pass a parameter like 'mypartitionnumber' to a job. Is there something allocated to the job that we could read, like some 'instance number', or some way other than a parameter to tell the job which number it is?

For example, flat files, hash files etc must refer to different files...


Also, my understanding of caching is that there is a 1GB limit for ALL hash files at any single point in time. So breaking hash files into smaller portions but trying to load all the small portions at once will not help....Am I mistaken on this? We will have a 40-48GB machine when we get to where we are going, so we literally have tons and tons of memory...and my initial plan was to have a small number of large hash files shared by all jobs....this didn't work out, so I have a large number of smaller hash files, but many of these hash files are in the 400-500MB range, so we will run over the 1GB limit right away.....or is there something we are missing?
Best Regards
Peter Nolan
www.peternolan.com
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Re: Partitioning jobs

Post by kcbland »

peternolan9 wrote: In your suggestions you would have the input file read by a transformer and have 'mypartitionnumber' passed as a parameter, is that correct?
Yes. One master job, with N instantiated clones. The job uses a partitioning algorithm to only get a portion of the source data.
peternolan9 wrote: My question is this....we have a really neat way of invoking jobs in which a routine fetches the parameters from a file and passes them to the job. So we don't have the ability to pass a parameter like 'mypartitionnumber' to a job. Is there something allocated to the job that we could read, like some 'instance number', or some way other than a parameter to tell the job which number it is?

For example, flat files, hash files etc must refer to different files...
Sure. No matter what, your job control has to know about N instances of a master job to run. Each instance will require an invocation id. Your routine, which fetches parameters from a file, could easily generate an extra parameter to add to its list of parameters and values. If you make your invocation ID contain the partition number, then you can derive the partition number from the invocation ID.
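
For example, here is a sketch of reading it from inside the running instance itself, assuming the invocation ID is simply set to the partition number when the instance is started:

Code: Select all

* In a routine called from the job (e.g. a before-job subroutine):
InvId = DSGetJobInfo(DSJ.ME, DSJ.JOBINVOCATIONID)   ;* e.g. '7' for instance 7
MyPartitionNumber = InvId + 0                       ;* treat it as a number

That value can then drive the partition constraint and the hashed file name without needing a separate parameter per instance.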
peternolan9 wrote: Also, my understanding of caching is that there is a 1GB limit for ALL hash files at any single point in time. So breaking hash files into smaller portions but trying to load all the small portions at once will not help....Am I mistaken on this? We will have a 40-48GB machine when we get to where we are going, so we literally have tons and tons of memory...and my initial plan was to have a small number of large hash files shared by all jobs....this didn't work out, so I have a large number of smaller hash files, but many of these hash files are in the 400-500MB range, so we will run over the 1GB limit right away.....or is there something we are missing?
Really? :oops: I must say I have never seen this. Once you're in the realm of really big hash files you have to weigh the merits of caching versus not caching. Unless every row is going to be referenced more than once, preloading the cache is a waste of time. I have built designs like the one described previously, where a single large hash file would have contained 30 million rows, but 10 hash files, partitioned by a column, easily fit into cache because each hash file only had 3 million rows. The transformation job which referenced them was instantiated x10, and I did not see any sign that the individual hash files failed to cache.

Either way, just remember that instantiation achieves multiples of improvement, and usually no amount of knob tweaking can do the same. One of my largest sources of frustration is trying to convince people that huge intergalactic jobs ARE THE WORST DESIGN. You must design small, modular Server jobs so as to make instantiation possible for each job. That is the only way to accelerate performance by multiples. If one job is too slow, instantiate it, run 10, and collate the results afterwards. Voila! 10X faster. Tuning a hash file usually won't make a single job 10X faster.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
peternolan9
Participant
Posts: 214
Joined: Mon Feb 23, 2004 2:10 am
Location: Dublin, Ireland

Re: Partitioning jobs

Post by peternolan9 »

Really? :oops: I must say I have never seen this.

Hi Kenneth,
I also said something slightly more than 'oops'....when I found that one out. I will dig up the documentation and send it to you....I am still surprised and I have asked the question plenty of different ways. I was in 'denial' for quite a few days thinking my advice must have been wrong...and I would be really, really happy to find out this is not so!!!!!

As I said, I will have a 40-48GB machine. I want to make use of that memory. Data in memory is always a 'good thing' (I used to be an MVS person...)

We are doing a telco with 60M CDRs per day to be inserted and 120M CDRs/day to be updated. So that's around 180M CDRs to have 'lookups' performed per day. Since the vast majority of people make more than one call per day, we will have a minimum of 3 hits against a single customer per day. The original design was to do all the dimension table processing, then load all dimension keys into memory and then start up CDR processing...the intention being to keep the hash tables in memory during the entire time the 180M CDRs were processed...(hence we bought lots of memory and processors).....we would get a minimum of 3 hits in cache for customer/account dimensions and very likely many, many more than that (how many calls do you make a day?).

So I am very interested to see you write that you have done this with 30M row hash tables and not seen this behavior.....It gives me some hope that our information is somehow incorrect and there is something we are missing... :D
Best Regards
Peter Nolan
www.peternolan.com
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

I think, if you read it a bit more carefully, that you will find that what Ken said was that the design called for 30M rows, but what he implemented was ten hashed files each with only 3M rows.

I suspect, without knowing the individual case, that data were partitioned based on the final digit of the customer number, one instance of the job processing each partition.

If this (or something like it) were the case, then the entire customer content could be viewed with a logical construct called a Distributed file, which conceptually is an editable view containing the UNION of all its parts. But for reference lookups each part is a separate hashed file, small enough to be cached.

Can you tell us where you found the reference to the 1GB limit on total cache for hashed files? We may be able to determine that that figure is configurable. (Not making any promises.)
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.