CDC - Design approach

myukassign · Post by **myukassign** » Thu Mar 17, 2011 1:15 am

I am looking for a design help, to understand the best way to do a CDC.

Please help.

I have 4 tables in oracle that grows in millions of size. That has key columns but no timestamp / Date / Batch Id. The tables are not indexed as well. My table can have insert/update records.

What I need is, i need to do CDC to the final target everyday. I fear once the table grow to a big size, it will kill the CDC stage.

What all options I have? Can I still go ahed with CDC stage, or is there anyh better design approach something in database level.

My database is Oracle 11 G

asyed · Post by **asyed** » Thu Mar 17, 2011 2:13 am

You could stage the incoming data onto a database.

Write a query which joins this staging data to your DWH tables and unload only relevant data.

You could add partial CDC functionality as well in your SQL query.

How many millions of rows are you expecting. If your dimension tables are going to grow to such an extent that it is going to break down the CDC process [memory/disk space]. Then most likely your system configuration might not be suitable for overall DW requirements as well.

myukassign · Post by **myukassign** » Thu Mar 17, 2011 7:11 am

It's going to grow in big size...

myukassign · Post by **myukassign** » Mon Apr 04, 2011 10:46 pm

Any idea's anyone...Please

vmcburney · Post by **vmcburney** » Mon Apr 04, 2011 11:34 pm

There are two approaches - a DataStage one and an InfoSphere CDC one. In the DataStage approach you have no option but to pull off all data that you think may contain changes and then push that through a DataStage Change Data Capture stage. You may be able to speed this up a bit by using the CRC32 function to speed this up but that starts to get complex.

Under the InfoSphere CDC approach you would install a CDC agent on the source database, turn on full supplemental logging on those tables and that would detect all data changes for you and deliver them onto your target server where another agent either delivers the changes into DataStage or straight into a database table.

InfoSphere CDC would significantly reduce the amount of data you are taking from the source database and it would retrieve these changes from the database redo and archive logs and not from the database itself. You can run this in batch or real time and it is a safer way to move the data than DataStage since it guarantees data delivery by interrogating the table bookmarks and ensuring the source and target tables are synchronised.