CDC - Design approach

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
myukassign
Premium Member
Premium Member
Posts: 238
Joined: Fri Jul 25, 2008 8:55 am

CDC - Design approach

Post by myukassign »

I am looking for a design help, to understand the best way to do a CDC.

Please help.

I have 4 tables in oracle that grows in millions of size. That has key columns but no timestamp / Date / Batch Id. The tables are not indexed as well. My table can have insert/update records.

What I need is, i need to do CDC to the final target everyday. I fear once the table grow to a big size, it will kill the CDC stage.

What all options I have? Can I still go ahed with CDC stage, or is there anyh better design approach something in database level.

My database is Oracle 11 G
asyed
Participant
Posts: 16
Joined: Sun Dec 12, 2010 10:24 pm
Location: Hyderabad, India

Post by asyed »

You could stage the incoming data onto a database.

Write a query which joins this staging data to your DWH tables and unload only relevant data.

You could add partial CDC functionality as well in your SQL query.

How many millions of rows are you expecting. If your dimension tables are going to grow to such an extent that it is going to break down the CDC process [memory/disk space]. Then most likely your system configuration might not be suitable for overall DW requirements as well.
myukassign
Premium Member
Premium Member
Posts: 238
Joined: Fri Jul 25, 2008 8:55 am

Post by myukassign »

It's going to grow in big size...
myukassign
Premium Member
Premium Member
Posts: 238
Joined: Fri Jul 25, 2008 8:55 am

Post by myukassign »

Any idea's anyone...Please
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

There are two approaches - a DataStage one and an InfoSphere CDC one. In the DataStage approach you have no option but to pull off all data that you think may contain changes and then push that through a DataStage Change Data Capture stage. You may be able to speed this up a bit by using the CRC32 function to speed this up but that starts to get complex.

Under the InfoSphere CDC approach you would install a CDC agent on the source database, turn on full supplemental logging on those tables and that would detect all data changes for you and deliver them onto your target server where another agent either delivers the changes into DataStage or straight into a database table.

InfoSphere CDC would significantly reduce the amount of data you are taking from the source database and it would retrieve these changes from the database redo and archive logs and not from the database itself. You can run this in batch or real time and it is a safer way to move the data than DataStage since it guarantees data delivery by interrogating the table bookmarks and ensuring the source and target tables are synchronised.
Post Reply