Wow! Great reactions!

kcbland wrote:
Your reference lookups are against all 12 million rows? Are these dimensions? Why are you dumping the whole table to a hash file? Why not just get the rows you need? If you took all of the distinct natural keys for each dimension in a typical run, I bet you would need only a vastly smaller number of rows dumped into hash files.

There are methods of prescanning your source data for its natural keys, putting them into a work table (we call it a bucket of keys), and then doing an inner-join extract to pull just the required rows into the hash file. When you couple a smaller hash file with the preload-to-memory option, multiple-job-instance capability, and the raw performance of hashed files, OCI lookups never scale by comparison. This technique holds for both Server and PX implementations. You cannot justify pulling 100% of the rows out of the database when you only need to reference 5%.

I've got white papers on these techniques if you're interested. Just grab them from my website.
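If I'm reading the bucket-of-keys technique right, it would look something like the sketch below. All table and column names are made up for illustration (STG_SALES as the staged source, DIM_CUSTOMER as the dimension, KEY_BUCKET as the work table):

    -- 1. Prescan the staged source for the distinct natural keys in this run.
    CREATE TABLE key_bucket AS
      SELECT DISTINCT cust_natural_key
      FROM   stg_sales;

    -- 2. Inner-join extract: only the dimension rows whose natural key
    --    actually appears in this run get dumped into the hash file.
    SELECT d.cust_natural_key,
           d.cust_sk
    FROM   dim_customer d
           INNER JOIN key_bucket k
              ON k.cust_natural_key = d.cust_natural_key;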
Yes, this is a true data warehouse. Fact table jobs have to retrieve the surrogate key from the dimension based on the natural key(s).
Smaller hash files are an option we are exploring... prescan the source data for the subset of natural keys, then select from the dimension to generate a hash file that is just the combination of natural key(s) and surrogate key. The idea behind using direct lookups into the Oracle table was that an Oracle index containing both the natural and surrogate keys would simulate a hash file without our actually having to build one.
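In other words, something like the index below, using the same hypothetical names as above. With both keys in the index, Oracle should be able to answer each lookup from the index alone, never touching the table:

    -- Composite index covering the lookup: natural key in, surrogate key out.
    CREATE INDEX dim_customer_nk_sk
        ON dim_customer (cust_natural_key, cust_sk);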
The problem we have with prescanning the data is logistical. Some of the source queries are time-consuming. Essentially, what is being proposed is that the source cursor be run twice: once to build the subset of records for each hash file, and again to process the data and actually build and load the fact records into the data warehouse. Say populating the hash files takes only half as long as the incremental load itself; the wall-clock time of a data warehouse load still increases by 50% (e.g., a 6-hour load becomes a 9-hour load). The problem is that the current daily load time is already slowly eating away at our time window. Increasing it would not be an option.
While we might need to reference only slightly more than 5% of a particular dimension on an incremental load, we were hoping to go directly against the Oracle dimension via the index and bypass creating the hash file altogether.
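The lookup each fact row issues is trivial, and with a composite index like the one sketched above it should be satisfied by an index-only access path (again, hypothetical names):

    -- Per-row lookup issued by the fact job; :nk is the bind variable
    -- carrying the natural key from the source row.
    SELECT cust_sk
    FROM   dim_customer
    WHERE  cust_natural_key = :nk;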
mike