InfoSphere CDC installation


neeraj
Participant
Posts: 107
Joined: Tue May 24, 2005 4:09 am

InfoSphere CDC installation

Post by neeraj »

Hello,

We are planning to use InfoSphere CDC for Oracle as source and target is also Oracle.

I have an idea of what configuration needs to be done once the tool is installed, but I could not find any material explaining what needs to be installed on the source database (agents, etc.) or what considerations should be taken into account so that there is no performance impact on the existing source database.

If anyone could share the details, it would be a great help.

Regards
Neeraj
qt_ky
Premium Member
Posts: 2895
Joined: Wed Aug 03, 2011 6:16 am
Location: USA

Post by qt_ky »

I believe there is more than one way to configure IIDR (CDC) with Oracle.

Did you search the Knowledge Center? Here is a link.

http://www-01.ibm.com/support/knowledge ... lcome.html
Choose a job you love, and you will never have to work a day in your life. - Confucius
cppwiz
Participant
Posts: 135
Joined: Tue Sep 04, 2007 11:27 am

Post by cppwiz »

Well, the only agent that needs to be installed is the IBM CDC agent. However, there is a whole list of tasks that must be completed on the source database before installing CDC:

http://pic.dhe.ibm.com/infocenter/iidr/ ... stall.html
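As a rough sketch only (the link above has the full checklist), here is a small Python check, assuming the cx_Oracle driver and hypothetical connection details, that verifies two of the usual source-side prerequisites (ARCHIVELOG mode and supplemental logging) before the agent goes in:

# Minimal sketch: check two common source-side prerequisites for CDC on Oracle.
# Assumes the cx_Oracle driver; the connection details below are hypothetical.
# The full prerequisite list in the link above goes well beyond these checks.
import cx_Oracle

def check_cdc_prereqs(user, password, dsn):
    conn = cx_Oracle.connect(user, password, dsn)
    try:
        cur = conn.cursor()
        cur.execute("SELECT log_mode, supplemental_log_data_min FROM v$database")
        log_mode, supp_min = cur.fetchone()
        print("Archive log mode........:", log_mode)   # expect ARCHIVELOG
        print("Minimal supplemental log:", supp_min)   # expect YES
        return log_mode == "ARCHIVELOG" and supp_min != "NO"
    finally:
        conn.close()

if __name__ == "__main__":
    ready = check_cdc_prereqs("cdc_check", "password", "source-host/ORCL")
    print("Source database ready for CDC install:", ready)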

The biggest hurdle in this list is the size of the archive logs. We had to allocate 4 TB of archive log space so that we could stop CDC if necessary and still catch up with the logs later when CDC was restarted. I think 4 TB gave us approximately 14 days of CDC downtime before we lost the log position and had to start from scratch again.
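To make that sizing concrete, here is a back-of-the-envelope sketch in Python. The 290 GB/day rate is only an assumed figure implied by "4 TB for roughly 14 days"; measure your own archive log generation before sizing:

# Back-of-the-envelope sizing for the archive log catch-up window:
# how many days CDC can stay down before the log position is lost.
# daily_gb is an assumed rate implied by "4 TB ~ 14 days"; measure your own.

def days_of_retention(archive_space_gb, daily_redo_gb):
    """Days of CDC downtime the archive log destination can absorb."""
    return archive_space_gb / daily_redo_gb

space_gb = 4 * 1024   # 4 TB of archive log space
daily_gb = 290        # assumed average daily archive log generation in GB
print("Catch-up window: about %.1f days" % days_of_retention(space_gb, daily_gb))
# -> roughly 14 days, in line with the figure above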

There will be a performance impact on the source database, but the exact impact depends on how many subscriptions and tables are configured for CDC. YMMV, but we saw an impact of 10-40% of the CPU on the source server.

There is also a Redbook on CDC, and Chapter 8 goes into good detail on the performance analysis for configuring your system:

http://www.redbooks.ibm.com/redbooks/pdfs/sg247941.pdf
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne

Post by vmcburney »

Both those measurements seem excessive. I know of a site that had CDC/DataMirror up and running for eight years, and the longest outage was 48 hours when we upgraded to a new version. 14 days seems way over the top. If you have an outage of more than a couple of days, you may as well just resync the tables rather than replay multiple days' worth of transactions.

10-40% CPU utilisation is also way over the top - I would be looking for 5% at worst or 0% if you put the logs on another machine. Is this the overhead of the CDC agent (which does very little) or the overhead of turning on full supplemental logging and having a much larger log file?
cppwiz
Participant
Posts: 135
Joined: Tue Sep 04, 2007 11:27 am

Post by cppwiz »

Yes, I agree 14 days is excessive, but we actually went past that window twice due to bugs in the CDC software that shut down all replication. After IBM delivered a patch, a full refresh was then necessary for all tables, which took over eight days. 48 hours seems too short unless you're only replicating a handful of tables that can be quickly refreshed from scratch.

The 10-40% CPU was for the agent during the early morning hours. The rest of the day it was nearly zero because no data was being updated. For the database that we were replicating, most updates were done during a nightly batch cycle and then replicated immediately across six subscriptions for 600+ tables. If the database doesn't have a bulk insert/update/delete cycle but instead only has incremental updates, the CPU utilization will obviously be much less.

There are many ways to throttle the CPU utilization of CDC, but if you want to achieve near-zero latency for a lot of large tables, you will have to pay the price in CPU utilization. It is a trade-off between CPU and latency, which was the original poster's question. If you want nearly zero impact on the source during business hours, you can achieve that by turning off replication and then catching up during non-business hours. If you want near-zero latency, you can achieve that by increasing the number of subscriptions and using more CPU on the source server.
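Purely as an illustration of that trade-off (all numbers below are hypothetical, not measurements from our system), a short Python sketch comparing how long a deferred catch-up takes as you add subscriptions; more subscriptions drain the backlog faster, at the cost of more concurrent load on the source:

# Hypothetical illustration of the CPU-vs-latency trade-off described above.
# batch_gb and per_sub are assumed numbers, not measurements from our system.

def catch_up_hours(batch_gb, gb_per_hour_per_sub, subs):
    """Hours to drain a change backlog with `subs` subscriptions in parallel."""
    return batch_gb / (gb_per_hour_per_sub * subs)

batch_gb = 300   # assumed nightly change volume in GB
per_sub = 25     # assumed GB/hour a single subscription can sustain
for subs in (1, 3, 6):
    hours = catch_up_hours(batch_gb, per_sub, subs)
    print("%d subscription(s): about %.1f hours to catch up" % (subs, hours))
# More subscriptions = lower latency but more CPU used on the source server.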