Changed Data Capture questions

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

danddmrs
Premium Member
Posts: 86
Joined: Fri Apr 20, 2007 12:55 pm

Changed Data Capture questions

Post by danddmrs »

Currently our company extracts data from the mainframe each night, FTPs the data to a server at a remote location, then transforms and loads the data.

Issues are the time it takes to run the cycle and the growing volume of data.

Changed Data Capture seemed a viable alternative so I suggested it. My thought was to determine the Inserts, Updates, and Deletes while the data is still on the mainframe and send only changed data to the LAN, then perform the changes via DStage.

Our DBAs recommended setting up a staging database on site, dumping all of the mainframe data to the LAN, loading the staging area with DataStage, and then using DataStage for CDC and making the changes at the remote location.

Their recommendation is based on their experience with data warehouses, but can someone explain what advantage is gained or if this will even work? What about deletes?

Server A would hold today's data; Server B, 500 miles down the road, holds yesterday's data and needs to be updated with the changes. 99% of the data is unchanged. Is DataStage the right tool for the job? If so, please advise on how to set this up.

Additional information.
The company has just installed DS EE. To this point all of our DataStage projects have been done with the Server Edition. In DS SE I was dealing with some large hashed files for CDC and experiencing performance problems; that's how I ended up determining changed data on the mainframe. Is EE better suited for the task?

This was previously posted in the wrong forum. Not sure how/why I did that...sorry.

Thanks in advance.
iDomz
Participant
Posts: 81
Joined: Wed Jul 25, 2007 5:25 am
Location: London

Post by iDomz »

As I understand it, CDC works by log scraping. If you refresh the staging DB daily, CDC will return the entire DB as output (theoretically; in practice it will exceed your log capacity).

The latest document I have from IBM (Mar 2008) says CDC is available on the following platforms after the DataMirror integration: DB2 UDB, SQL Server, Sybase, Oracle, Teradata, and message queues. I'm not sure it will work with mainframes if you have DB2 on big iron.

Were you planning to use change capture/apply?
danddmrs
Premium Member
Posts: 86
Joined: Fri Apr 20, 2007 12:55 pm

Post by danddmrs »

On the Mainframe we currently extract data from various systems. The data is FTP'd to the host server, then the tables are truncated and reloaded.

One suggestion is to compare the current mainframe extract with the previous extract to determine the changes from cycle to cycle. Using the CDC files, a DataStage process would transform the data and insert, update, or delete rows in the warehouse.

Another suggestion is dumping the extracted data to a local server, truncating and reloading a staging warehouse via DataStage, and then using DataStage to capture the changes between the staging WH and the production WH and apply them.
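To make that second approach concrete: the comparison DataStage EE's Change Capture stage performs is essentially a keyed compare of a "before" dataset (the production WH) against an "after" dataset (the reloaded staging WH). Here is a minimal Python sketch of that logic, with invented column names; the stage's default change codes are, roughly, 0 = copy, 1 = insert, 2 = delete, 3 = edit.

```python
# Illustration only: mimics the keyed compare the EE Change Capture stage does
# between a "before" dataset (yesterday's production WH rows) and an "after"
# dataset (today's staging rows). Key/column names are invented for the example.

CHANGE_COPY, CHANGE_INSERT, CHANGE_DELETE, CHANGE_EDIT = 0, 1, 2, 3

def change_capture(before, after, key):
    """before/after: lists of dicts; key: name of the key column.
    Yields (row, change_code) pairs; 'copy' rows are normally filtered out."""
    before_by_key = {row[key]: row for row in before}
    after_by_key = {row[key]: row for row in after}

    for k, new_row in after_by_key.items():
        old_row = before_by_key.get(k)
        if old_row is None:
            yield new_row, CHANGE_INSERT      # new key -> insert
        elif old_row != new_row:
            yield new_row, CHANGE_EDIT        # key exists, values differ -> update
        else:
            yield new_row, CHANGE_COPY        # unchanged -> usually dropped

    for k, old_row in before_by_key.items():
        if k not in after_by_key:             # key vanished -> delete
            yield old_row, CHANGE_DELETE

# Example:
# before = [{"id": 1, "amt": 10}, {"id": 2, "amt": 20}]
# after  = [{"id": 1, "amt": 10}, {"id": 2, "amt": 25}, {"id": 3, "amt": 5}]
# -> id 1 copy, id 2 edit, id 3 insert; nothing deleted.
```

Downstream, the change code would drive which DML the load job issues against the production WH, which also covers the earlier "what about deletes" question: a key present in the before dataset but missing from the after dataset comes out flagged as a delete.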

If we go the route of the staging warehouse, is DataStage the right tool for the job? I don't think so, but there is a strong push to head in that direction.

If we were going to capture changes via log scraping, I would think we would need to do it on the mainframe, since that is where the changes occur. So no, I don't think the plan is to use change capture/apply.

Thanks
ririr
Participant
Posts: 84
Joined: Sun Apr 04, 2004 8:59 pm

Post by ririr »

A couple of things come to mind..

Here is what we did when building the ETL architecture to load data from our mainframe system (ADABAS): we refreshed the ADABAS files from the mainframe into an Oracle database using a "Treehouse" tool and database triggers.

http://www.treehouse.com/

The full refresh was performed only ONCE; to capture data changes from then on, we used a tool from Treehouse to scan the log files (PLOGs) and build DML statements (inserts/updates/deletes). Every 4 hours these DML statements were applied to the Oracle instance, which had triggers firing on insert/update/delete for each row. The triggers wrote the data into CDC tables with BU (before-update), AU (after-update), AI (after-insert), and BD (before-delete) images, along with the date the row was created and a sequence number column to hold the DB sequence value. The reason we needed the sequence number was that, for some files, the requirement was NOT to capture all of the changes but only the beginning and end images for that day.
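To illustrate that "beginning and end images for the day" requirement, here is a small Python sketch, assuming each CDC row carries a business key, the image type, and the sequence number described above (the field names are invented for the example).

```python
# Sketch only: reduce one day's worth of CDC rows to the first and last
# image per business key, using the sequence number to order the changes.
# Field names (key, seq, image) are assumptions for the example.

from collections import defaultdict

def first_and_last_images(cdc_rows):
    """cdc_rows: iterable of dicts like
       {"key": 42, "seq": 1007, "image": "AU", ...columns...}
    Returns {key: (earliest_row, latest_row)} for the day."""
    by_key = defaultdict(list)
    for row in cdc_rows:
        by_key[row["key"]].append(row)

    result = {}
    for key, rows in by_key.items():
        rows.sort(key=lambda r: r["seq"])   # DB sequence gives the change order
        result[key] = (rows[0], rows[-1])   # begin-of-day and end-of-day images
    return result
```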

Another option is to upload the mainframe files to the OS, run "diff" on the current and previous files, create the "difference" file, and feed it to your ETL. With this approach you may not get all the changes being made to the files, and capturing "deletes" will be a little "tricky".
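On the "deletes are tricky" point: with full extracts a delete only shows up as a key that exists in the previous file but not in the current one, so a plain record-level diff won't hand it to you directly. A short sketch, assuming pipe-delimited extracts whose first field is the business key (the file names and layout are made up):

```python
# Sketch: find deleted keys between two full extracts of the same table.
# Assumes pipe-delimited files whose first field is the business key.

def keys_in(path, delimiter="|"):
    with open(path) as f:
        return {line.split(delimiter, 1)[0] for line in f if line.strip()}

deleted_keys = keys_in("extract_prev.txt") - keys_in("extract_curr.txt")
# Each of these keys needs a DELETE issued against the warehouse,
# on top of whatever inserts/updates the diff of the two files gives you.
```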

Hope this helps..
danddmrs
Premium Member
Posts: 86
Joined: Fri Apr 20, 2007 12:55 pm

Post by danddmrs »

Thanks for replies.

Does anyone have thoughts on dumping all the mainframe data to a staging Oracle DB and using DataStage to capture data changes against the production Oracle DB?

Pros or cons would be great.
sesvs43
Participant
Posts: 1
Joined: Sun Apr 27, 2008 7:50 am

Post by sesvs43 »

The best way to do what you want is to use IBM's Data Event Publisher product for DB2 on the z machine. It reads the DB2 log and passes updates, inserts, and deletes to a transmit queue (MQ) on the z machine, which can then be directed to a receive queue on your DataStage server. An "always on" DataStage job then processes the receive queue contents into your target database. You'll want to use the new DTS stage that came out in the 8.0.1 fix pack for unit-of-work processing.
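For what it's worth, stripped of the DataStage pieces that flow amounts to the loop below. This is illustration only, written in Python with the pymqi client; in the real design the MQ Connector / DTS stage reads the receive queue, and the queue manager, channel, and queue names here are placeholders.

```python
# Illustration only: what the "always on" consumer of the receive queue
# conceptually does. Connection details and queue names are placeholders.

import time
import pymqi

def apply_change_to_target(message):
    # Hypothetical stand-in for the real transform/load step (or the DTS stage).
    print("would apply change:", message[:80])

qmgr = pymqi.connect("QMGR1", "SVRCONN.CHANNEL", "dsserver.example.com(1414)")
queue = pymqi.Queue(qmgr, "CDC.RECEIVE.QUEUE")

try:
    while True:                                # "always on": never exits on its own
        try:
            message = queue.get()              # one published insert/update/delete per message
        except pymqi.MQMIError as err:
            if err.reason == pymqi.CMQC.MQRC_NO_MSG_AVAILABLE:
                time.sleep(1)                  # queue drained; poll again
                continue
            raise
        apply_change_to_target(message)
finally:
    queue.close()
    qmgr.disconnect()
```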
nsm
Premium Member
Posts: 139
Joined: Mon Feb 09, 2004 8:58 am

Post by nsm »

Can you explain what a "receive queue" is in DataStage?
Do you mean an RTI job when you say "always on" DataStage job?

Thanks
nsm.