Which is more efficient: Hash Tables or OCI lookups?

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

Ray said it better than I have been able to. If your requirement is to put 30 million rows into a hash file for repetitive lookup, you will probably exceed the read caching capabilities. My technique is to partition the data so that you have a subset in each of N hash files, where the size of each hash file allows for caching. You also want as many transformation jobs as you have CPUs, so that each CPU is fully used. If you have 24 CPUs, then I suggest 24 hash lookup files for your large dimension and 24 transformation jobs for the fact data, each job handling a specific customer key set. Rather than using "ends with 0..9", which limits you to only 10 partitions, MOD allows you to have as many partitions as you want. MOD returns the integer remainder of a division operation, so MOD(somenumber, 24) always returns 0..23. All customer keys passed through MOD(customer_key, 24) therefore return 0..23, so you have 24 hash files, each containing its respective set of customer keys. Each fact transformation job limits itself to processing only the customers that pertain to it, again those whose keys map to 0..23.
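As a concrete sketch only (the link, parameter and file names below are my own invention, not anyone's actual job design): the customer load job writes each row to the hash file for its partition, and each instance of the fact job keeps only its own partition with a Transformer constraint along these lines:

    * Transformer constraint in each instance of the fact job (partitions 0..23);
    * PartitionNumber is a hypothetical job parameter set to 0, 1, ... 23 per instance.
    Mod(DSLink3.CUSTOMER_KEY, 24) = PartitionNumber

    * File name on the reference Hashed File stage, parameterised the same way,
    * so each instance opens only its own subset of the dimension.
    CUSTOMER_HASH_#PartitionNumber#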

This partitioning of the reference data, instantiating the transformation, and distributing the processing across multiple CPUs is EXACTLY what PX does for you automatically. Remember: divide the source data and conquer with multiple identical processes. You take your customer dimension table and push it into a PX reference data set. If you have 24 CPUs and set up a node pool of 24 nodes (1 node ~ 1 CPU if you configure it that way), PX disperses the appropriate subset to each node. Then, when you process the fact rows and merge the fact data with the customer dimension data set, it again sends each node only the rows for the customers that node holds. The partitioning keeps the right data together.
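Back on the server side, the "instantiating the transformation" part can be driven from job control. A rough sketch only, assuming a multi-instance fact job and a parameter name of my own choosing, using the standard DSAttachJob/DSSetParam/DSRunJob/DSWaitForJob calls:

    $INCLUDE DSINCLUDE JOBCONTROL.H    ;* needed in a routine; the job control tab includes it for you

    * Start one instance of the fact transformation job per partition.
    NumPartitions = 24
    FOR P = 1 TO NumPartitions
       Partition = P - 1
       JobName = "FactLoad.":Partition          ;* multi-instance job, invocation id = partition
       hJob = DSAttachJob(JobName, DSJ.ERRFATAL)
       ErrCode = DSSetParam(hJob, "PartitionNumber", Partition)
       ErrCode = DSRunJob(hJob, DSJ.RUNNORMAL)   ;* returns straight away, so the instances run in parallel
       Handles<P> = hJob
    NEXT P

    * Wait for every instance to finish before carrying on.
    FOR P = 1 TO NumPartitions
       ErrCode = DSWaitForJob(Handles<P>)
       ErrCode = DSDetachJob(Handles<P>)
    NEXT P

Inside each instance, a constraint like the one in my earlier post keeps it to its own customer key set.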

I'm trying to get someone to look at the 7.0 manuals and confirm the caching limitations. I have never seen this limit, as long as the DS user has no limits on memory usage. I had one client with a 30 million row hash lookup requirement, partitioned into 16 lookups. Each job achieved full caching; watching glance or top/prstat showed that each process was using a lot of memory. I'll try to dig out some screen snapshots of glance to prove it.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

If you read the disk caching guide under your Start menu, you'll see that the only size limitation is half a terabyte, and that the guide actually has a tuning table where hash files of up to 2 gigabytes are documented. In addition, it doesn't mention a per-user limitation anywhere; in fact, it describes in detail the newer features (in 7) that deal with public and private link caching (I've got to re-read that a few more times to understand it).

So, I couldn't find any evidence to support a hash file limitation, other than the usual user limits and the fact that hash files are paged in and out of memory up to your individual setting (maximum 512 MB). If data is constantly paging, you could be hurting yourself more than helping, and just doing direct reads is better. When you have more physical hash files, you actually use more cache/memory, because you have more processes, each with its own limit. Think about it!
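To put rough numbers on that last point (illustrative figures only, based on the 512 MB per-process ceiling above and the 16-way split from my earlier post): a single job against one big hash file can never cache more than 512 MB of it, however large the file is, while 16 partition jobs, each with its own read cache, can between them hold up to

    16 x 512 MB = 8 GB

of reference data in memory, assuming the machine really has that much RAM to spare.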
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
peternolan9
Participant
Posts: 214
Joined: Mon Feb 23, 2004 2:10 am
Location: Dublin, Ireland

Post by peternolan9 »

ray.wurlod wrote:
Can you tell us where you found the reference to the 1GB limit on total cache for hashed files? We may be able to determine that that figure is configurable. (Not making any promises.)
Hi Ray,
In Administrator -> Projects -> Tunables there are settings for cache size (read/write) that only go up to 999. Tom Nel posted some appends here and got 'mixed' answers as to whether this is the global/total read cache available for all cached hash files or just the maximum for a single hash file.

I would love to find out that we were just unclear and that this means the maximum size for an individual hash file.

If no one knows the definitive answer, we could test it by seeing whether DS complains when we try to load ten 500 MB hash files, or something like that.

Even after our splitting we still have two hash files of 1 GB+ that we need to do something about...
Best Regards
Peter Nolan
www.peternolan.com