UV secondary indexes

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

datastagedummy is really doing a cartesian product. This is a relational activity, so it makes sense to do it in a relational database. What is happening now is that each row causes a SQL statement to be executed against the hash file, so if you have 100K rows you get 100K SQL statements, all issued one at a time from a single-threaded job.

Here's how I would approach this. I would take the source data file and bulk load it, with truncate, into a work table. Then I'd do the cartesian product join in the database and put the results back into a sequential text file. That way you can use N cpus via a parallel query (if you're using something like Oracle or Informix). You'll find the query runs multi-threaded and very fast.
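For illustration only, here's roughly what that database-side step might look like in Oracle; the table and column names (SRC_WORK, PART_LOOKUP, ITEM_ID, and so on) are just placeholders for whatever your own layout uses:

-- SRC_WORK is assumed to be bulk loaded (e.g. SQL*Loader with the TRUNCATE option)
-- from the source flat file; PART_LOOKUP holds the component rows it explodes against.
-- Each source row fans out to one output row per matching component, which is the
-- same multi-row result the per-row hash file lookup was producing one statement at a time.
SELECT /*+ PARALLEL(s, 4) PARALLEL(p, 4) */
       s.item_id,
       s.item_qty,
       p.component_id,
       p.component_qty
  FROM src_work s,
       part_lookup p
 WHERE p.item_id = s.item_id;

Spool or extract that result set back out to a sequential file and feed it into the rest of the job.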

It looks like you're doing something with product/inventory matrices, where you have to explode a top-level product into its parts. I showed a client this technique, where a matrix or portfolio item was exploded into its variant parts (a Retek system). Blasting the file into a table, doing the cartesian product, and yanking it back out into a file took less than 2 minutes for 300K rows. Then the file was transformed just as before, except that the cartesian product lookup was removed. The job was then instantiated for a linear improvement.

Good luck!

Kenneth Bland
datastagedummy
Participant
Posts: 56
Joined: Thu Feb 13, 2003 6:08 pm
Location: USA

Post by datastagedummy »

Thanks Ken and everybody for the help. I haven't tried it yet, but Ken's approach looks the most logical.

My source is a flat file and the target is Oracle9i, but the problem is I don't have DDL privileges on the Oracle DB, so for these kinds of situations I generally use UniVerse (I haven't used UV before and am a dummy at it).

What is your suggestion? Should I be using Oracle instead of UV, loading the source and lookup data into two tables, doing the join in the DB, and writing the result to a sequential file?

Thanks Ken, and you are absolutely right: I am trying to explode an item into its sub-components.
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

You should have a user schema and some work tablespace in the warehouse. The volume of these work tables should be low, and they are only for facilitating the ETL. Since they are work tables, the data content is disposable and these tables can be truncated with drop storage after processing to return used space to the DB.
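For instance, the cleanup on an Oracle work table (the name here is just an example) can be as simple as:

TRUNCATE TABLE etl_src_work DROP STORAGE;

which empties the table and hands its extents back to the tablespace, while leaving the table definition in place for the next run.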

There are also techniques where you scan your source data, collect a unique list of natural keys, write them to a work table in the warehouse, do an inner-join extract of the matching warehouse rows, and put them in a hash file. This gives you a hash file that contains only the rows already in your warehouse, so that during transformation any row not found in your hash lookup can be treated as an insert, and where you do find a row it is an update.
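A sketch of that extract, again with made-up names (NK_WORK holding the distinct natural keys pulled from the source file, DIM_CUSTOMER standing in for the warehouse table):

-- Only warehouse rows whose natural key appears in today's source come back;
-- this result set is what gets loaded into the hash file used by the lookup.
SELECT d.customer_sk,
       d.customer_nk,
       d.current_flag
  FROM dim_customer d
       INNER JOIN nk_work k
               ON k.customer_nk = d.customer_nk;

Anything the transformer then fails to find in that hash file is, by construction, a new row.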

Also see the other posting I did recently about multi-row lookups. You could use the technique just mentioned to build a slowly-changing dimension lookup against a very large dimension, without having to have all rows from that dimension in your hash files.

Good luck!

Kenneth Bland