I need to tap into the bounty of knowledge that is DSXchange :D yet again!
In my current project I am transferring data from a transactional system to a copy of the same system on a different server, iteratively throughout the day. Normally, I would rather do this with replication or MQ Series, but neither is available to me.
Here are the ground rules and system stuff:
1. Source and target are DB2 UDB running on AIX Unix
2. No federated servers allowed
3. DataStage 7.5.1.A Server Edition on AIX Unix is my flavor. One of my main design paradigms is to use as little database-specific stuff as possible (avoiding stored procedures, external code and the like). The reason for this is that there is some question as to whether the target will stay on DB2 UDB as planned (wasn't my choice) or go back to SQL Server 2000 (2005 eventually). Hence, I also chose to sacrifice a little speed and used nothing but ODBC instead of the DB2 CLI drivers/bulk loaders.
4. Using a separate schema in the same database to capture history/changes
5. Using triggers to populate the history/changes, adding TRIGGER_DT, TRIGGER_USERID, TRIGGER_ACTION (U-update lists all fields, I-insert lists all fields, D-delete lists only key fields), and TRIGGER_STATUS (default = 2 = in process; the target table also has 1 = current row, 0 = historical row). A sketch of such a trigger follows this list.
6. Target tables generally follow slowly changing dimension type 2 rules - therefore, in my case, everything is an INSERT.
7. There are 412 source tables to process, currently with five steps (jobs) per table, giving a total of 2,060 jobs, plus one job to get the batch info and one job to perform cleanup when done.
8. My goal is to optimize the cycle so it can run every 15 minutes. That should mean no more than 2,000-3,000 rows per table per batch. Over time, I will also be able to move tables that don't change often down to fewer scheduled runs.
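To make point 5 concrete, here is a minimal sketch of the update-case trigger I'm describing, in DB2 UDB syntax. The SRC source schema, the CHG change schema, and the business columns (SEQ_NUM, STATUS_FLAG) are placeholders; the real tables carry the full field list:

Code:
-- Hypothetical sketch of one change-capture trigger (update case).
-- SRC/CHG schema names and the business columns are placeholders.
CREATE TRIGGER CHG.ADM_ACCT_PERIOD_U
  AFTER UPDATE ON SRC.ADM_ACCT_PERIOD
  REFERENCING NEW AS N
  FOR EACH ROW MODE DB2SQL
  INSERT INTO CHG.ADM_ACCT_PERIOD
    (SEQ_NUM, STATUS_FLAG, TRIGGER_DT, TRIGGER_USERID, TRIGGER_ACTION, TRIGGER_STATUS)
  VALUES
    (N.SEQ_NUM, N.STATUS_FLAG, CURRENT TIMESTAMP, USER, 'U', '2')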
My jobs are structured as follows, landing all data for processing:
Step 1: src_<tablename>
Creates a seq file of all fields, a hashed file of natural key values plus TRIGGER_DT (keyhash), and a hashed file of distinct natural key values plus the maximum TRIGGER_DT per key (hash_max).
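However the job actually builds them, in SQL terms the extract and the hash_max content are equivalent to roughly this (CHG is a placeholder for the change schema; table and key names are reused from the example query further down):

Code:
-- Everything sitting in the change table feeds the seq file and keyhash:
SELECT * FROM CHG.ADM_ACCT_PERIOD

-- hash_max is effectively the distinct keys with their latest change:
SELECT SEQ_NUM, MAX(TRIGGER_DT) AS TRIGGER_DT
FROM CHG.ADM_ACCT_PERIOD
GROUP BY SEQ_NUM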
Step 2: delsrc_<tablename>
Uses keyhash to remove the extracted rows from the source-side change tables - this is necessary because the triggers may put rows in while DS is processing the existing ones. This effectively gets me row-level locking without preventing the triggers from adding more data during processing (at least, that's my thought).
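Roughly, the delete is keyed on the natural key plus TRIGGER_DT and bound once per keyhash row, so anything the triggers add after the extract carries a later TRIGGER_DT and survives for the next run:

Code:
-- Executed once per keyhash row (placeholder schema/table names);
-- rows inserted after the extract have a newer TRIGGER_DT and are kept.
DELETE FROM CHG.ADM_ACCT_PERIOD
WHERE SEQ_NUM = ? AND TRIGGER_DT = ?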
Step 3: tgt_<tablename>
This step inserts all records into the target table from the seq file. While inserting, the process uses hash_max to change TRIGGER_STATUS from 2 to 5 when the natural keys and the TRIGGER_DT match, identifying the most current row for later processing.
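The update side of this step, driven once per hash_max row, amounts to something like the following; only the row carrying the maximum TRIGGER_DT for each natural key gets flagged:

Code:
-- Flags the newest inserted row per natural key as pending-current (5).
-- Schema, table, and key names reused from the example query below.
UPDATE UCWS.ADM_ACCT_PERIOD
SET TRIGGER_STATUS = '5'
WHERE SEQ_NUM = ? AND TRIGGER_DT = ?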
Step 4: updtgt_<tablename>
This step uses the distinct natural keys in hash_max to change the TRIGGER_STATUS of any already-inserted rows from 1 or 2 to 0, marking them as historical (the example queries under question 1 below come from this step).
Step 5: upd_current_tgt_<tablename>
This step is meant to take any rows with TRIGGER_STATUS = 5 and update them to 1 as the most current row. Using the above methodology, there should be no way for it to be off (famous last words).
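Per table, this step is conceptually nothing more than:

Code:
-- Promote the rows flagged 5 in step 3 to current (1).
UPDATE UCWS.ADM_ACCT_PERIOD
SET TRIGGER_STATUS = '1'
WHERE TRIGGER_STATUS = '5'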
I will have follow-up processes nightly and on weekends to clean up, verify, and validate/balance, depending on the level of necessity.
Now, here are my problems and the sources of my questions to you all:
1. In steps 4 & 5, I cannot seem to use parameters in the tablename portion of the user-defined query. For example, the working hard-coded query for one of my tables is:
Code:
UPDATE UCWS.ADM_ACCT_PERIOD SET TRIGGER_STATUS = '0' WHERE UCWS.ADM_ACCT_PERIOD.SEQ_NUM = ? AND UCWS.ADM_ACCT_PERIOD.STATUS_FLAG IN (1,2)
What I want to run instead, with the schema and table names supplied as job parameters, is:
Code:
UPDATE #SchemaName#.#TableName# AS X SET X.TRIGGER_STATUS = '0' WHERE X.SEQ_NUM = ? AND X.STATUS_FLAG IN (1,2)
2. Using hard-coded fields that are not part of the stream coming in from hash_max also seems to cause problems (see the example above). Specifically, the TRIGGER_STATUS field is not part of the hash_max file, so it doesn't come down the link. Could I just use a stage variable to get around that, hard coding the value to (1,2)?
If I can figure out the answers to the two questions above, I hope to be able to pass the tablename into a single job that updates all TRIGGER_STATUS fields equal to 5 to 1, marking the most current rows (a sketch of the statement I'm after follows below). That would knock 411 jobs out of the cycle. I know I could do it with a stored procedure, but once again, since the final target db is in flux and could be anywhere (or even multiple platforms), I'm trying to do as much in DataStage as possible, sometimes beyond what makes sense to do.
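For reference, the generic statement I'm aiming for in that single job would be nothing more than this, with #SchemaName# and #TableName# as job parameters:

Code:
UPDATE #SchemaName#.#TableName#
SET TRIGGER_STATUS = '1'
WHERE TRIGGER_STATUS = '5'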
Am I just making this too hard?
As always, your input is greatly appreciated.
Bestest!