Ambiguous Error Message!

Post questions here relating to DataStage Server Edition, covering such areas as Server job design, DS BASIC, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

mczern
Premium Member
Posts: 80
Joined: Thu Jan 20, 2005 8:38 am
Location: Atlanta, GA

Post by mczern »

kcbland wrote:Your reference lookups are against all 12 million rows? Are these dimensions? Why are you dumping the whole table to a hash file? Why not just get the rows you need? If you took all of the distinct natural keys for each dimension in a typical run, I bet a vastly smaller number of rows would need to be dumped into the hash files.

There are methods of prescanning your source data for its natural keys, putting them into a work table (we call it a bucket of keys), and inner-join extracting only the required rows into the hash file. When you couple a smaller hash file with the preload-to-memory option, multiple-job-instance capability, and the performance of the hash file, the OCI lookups simply don't scale. This technique holds for both Server and PX implementations. You cannot justify pulling 100% of the rows out of the database when you only need to reference 5%.

I've got white papers on these techniques if you're interested. Just grab them off my website.

Wow! Great reactions!
Yes, this is a true data warehouse. Fact table jobs have to retrieve the surrogate key from the dimension based on the natural key(s).

Smaller hash files are an option we are exploring: prescan the source data for the subset of natural keys, then select from the dimension to generate a hash file that is just the combination of natural key(s) and surrogate key. The idea behind using direct lookups against the Oracle table was that an Oracle index containing both the natural and surrogate keys would simulate a hash file without our actually having to build one.
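A minimal sketch of that index-only idea (table, column, and index names below are hypothetical): if the index carries both the natural key and the surrogate key, Oracle can answer the lookup from the index alone, without touching the table, which is what makes it behave like a pre-built hash file.

-- Hypothetical names, for illustration only: a composite index on
-- (natural key, surrogate key) makes the lookup index-only.
CREATE INDEX customer_dim_nk_sk_ix
    ON customer_dim (cust_natural_key, customer_sk);

-- The per-row OCI lookup a fact job would then issue:
SELECT customer_sk
FROM   customer_dim
WHERE  cust_natural_key = :nk;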

The problem we have with prescanning the data is logistical. Some of the source queries are time consuming. Essentially, what is being proposed is that the source cursor be run once to build the subset of records for each hash file, and then again to process the data and actually build and load the fact records into the data warehouse. Say populating the hash files takes only half as long as the incremental load itself; the wall-clock time for a data warehouse load still increases by 50%. Our current daily load is already slowly eating into our time window, so increasing that time is not an option.

While an incremental load might only need to reference slightly more than 5% of a particular dimension, we were hoping we could go directly against the Oracle dimension via the index and bypass creating the hash file at all.

mike
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

For each source dimension table, extract out of Oracle and land to two sequential files: one with the full column set, the other with just your natural keys. For your fact tables, create a sequential file of natural keys for each foreign-key dimension lookup required. Then execute a sort distinct on the concatenated set of files for each dimension (henceforth the bucket file) to combine and dedup your keys. You now have the absolute maximum number of keys to retrieve during processing.
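The sort distinct above is done at the file level; purely as one possible SQL rendering (directory, file, and column names are all hypothetical), the same combine-and-dedup step could be expressed by exposing the landed key files to Oracle as an external table:

-- Hypothetical names throughout; ETL_LAND is assumed to be an Oracle
-- directory object pointing at the landing area for the key files.
CREATE TABLE ext_customer_keys (
  cust_natural_key  VARCHAR2(30)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY etl_land
  ACCESS PARAMETERS (RECORDS DELIMITED BY NEWLINE FIELDS TERMINATED BY ',')
  LOCATION ('customer_dim_keys.dat', 'customer_fact_keys.dat')
);

-- The combined, deduped "bucket of keys" for this dimension:
SELECT DISTINCT cust_natural_key FROM ext_customer_keys;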

For each bucket-of-keys file, DIRECT path load it into a work table. Now inner join on the natural key and spool each dimension's matching rows to file. You still incur the extract overhead of pulling rows out of Oracle, plus a small amount of time to cat/dedup the key files and DIRECT path load them (minimal) into a work table. But you gain all that time back and then some, because the inner join can be done with a parallel query and the spooling covers a vastly smaller set of rows.
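Continuing the hypothetical names above, a sketch of those two steps in Oracle SQL: a direct-path insert stands in for the DIRECT path load of the bucket file (the post loads the flat file itself, e.g. via SQL*Loader), and a parallel hint drives the inner-join spool.

-- Work table holding this run's bucket of keys (hypothetical names).
CREATE TABLE wrk_customer_keys (cust_natural_key VARCHAR2(30));

-- Direct-path insert of the deduped keys into the work table.
INSERT /*+ APPEND */ INTO wrk_customer_keys (cust_natural_key)
SELECT DISTINCT cust_natural_key FROM ext_customer_keys;
COMMIT;

-- Parallel inner join: only the dimension rows this run actually references
-- come back, ready to be spooled to file and built into a small hash file.
SELECT /*+ PARALLEL(d, 4) */
       d.cust_natural_key,
       d.customer_sk
FROM   customer_dim       d
JOIN   wrk_customer_keys  k
  ON   k.cust_natural_key = d.cust_natural_key;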

Now load up your hash files with the results. During transformation, the hash file lookups perform exceptionally better because the files are smaller, and because you know every row will be referenced at least once, the likelihood of hash caching being a benefit goes way up.

This method works for both Server and PX. It also opens the door to tricks for faster SCD Type 1 and Type 2 processing. For fact processing, odds are greatly in your favor that the rows are mostly inserts; therefore, the cost of the inner join and spool is negligible.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle