
How to implement Change Data Capture in PX job

Posted: Mon Jun 13, 2011 9:09 am
by Billyqing
Hi,

I have implemented Change Data Capture in a Server DataStage job, version 7.5.

I now need to do the same in a PX job, version 8.5.

Does anyone have experience with how to do this in PX? Any instructions would be appreciated.

Thanks in advance for your help.

Thanks,

Re: How to implement Change Data Capture in PX job

Posted: Mon Jun 13, 2011 9:24 am
by madhusds
You can use the CDC Transaction stage in an IBM InfoSphere DataStage job to read data that is captured by IBM InfoSphere Change Data Capture (InfoSphere CDC) and apply the change data to a target database.

The CDC Transaction stage integrates the replication capabilities that are provided by InfoSphere CDC with the ETL capabilities that are provided by InfoSphere DataStage. You can use these products together to perform real-time, continuous replication with guaranteed delivery and transactional integrity in the event of failure.
For more information, see http://publib.boulder.ibm.com/infocente ... ainer.html.

Posted: Mon Jun 13, 2011 9:38 am
by Billyqing
Thank you for your reply, Madhusds.

InfoSphere CDC needs to be purchased separately and is not a component of InfoSphere DataStage - is that right?

If we are not going to purchase InfoSphere CDC, is there any way to implement CDC in a PX job?


Thanks,

Posted: Mon Jun 13, 2011 2:27 pm
by asorrell
That is correct - the CDC stage is useless unless you buy either the CDD or CDC package from IBM.

You specifically asked about the CDC stage in your earlier question. Were you perhaps actually interested in learning how to use the Change Capture stage in PX, which is quite different from the CDC stage?

Posted: Mon Jun 13, 2011 4:48 pm
by vmcburney
There are a few options:
1) IBM has two Change Data Capture stages for parallel jobs; the premium one needs a separate InfoSphere CDC license. The InfoSphere CDC plugin version uses database log scraping to identify changed data from the source database logs.
2) There is a pair of stages called Change Capture and Change Apply. The first compares recordsets and identifies inserts, updates and deletes; the second can apply those changes to a dataset. The stages are easy to use - just have a look at the help file or manual, or at the change-code sketch after this list.
3) The Slowly Changing Dimension stage does its own change data capture for type 1, 2 and 3 slowly changing dimension data. It can be useful since it automates the setting of the surrogate key and the effective date, expiry date and current-flag columns.
4) There is a CRC32 function available in the parallel Transformer which can generate a code for a string - or for a packed set of fields. You can set up dynamic change data capture (CDC via a shared container) by packing the comparison fields of a record into a string, generating its CRC32 code and comparing that code against the one stored for the same primary key (see the CRC32 sketch after this list).
5) A Lookup stage followed by a Transformer can do a simpler change data capture where there are not many fields to compare. A Transformer is handy for change data capture because it can store values from the previous record in stage variables, which can be used to detect changes in specific fields between records. This tends to be more flexible than the Change Capture stage, but adding comparisons for many fields is a pain.
6) You can bulk load the new data into a database staging table and perform change data capture with a user-defined SQL statement that retrieves changed rows by comparing the new data to the old table with a SQL join (see the SQL sketch after this list). With a lot of fields to compare the SQL is a pain to write and maintain, and it may not scale well if a lot of unchanged rows flow through.
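
Here is a minimal Python sketch of the comparison the Change Capture stage in option 2 performs on its before and after links. The change codes follow the stage's defaults as I recall them (0 = copy, 1 = insert, 2 = delete, 3 = edit); the single-column key and the sample data are made up for illustration.

Code:

def change_capture(before, after, key, value_cols):
    """Compare two recordsets (lists of dicts) keyed on one column and
    emit each output row with a change_code, as the stage does."""
    before_by_key = {row[key]: row for row in before}
    after_keys = {row[key] for row in after}
    out = []
    for row in after:
        old = before_by_key.get(row[key])
        if old is None:
            out.append({**row, "change_code": 1})    # insert: new key
        elif any(old[c] != row[c] for c in value_cols):
            out.append({**row, "change_code": 3})    # edit: a value changed
        else:
            out.append({**row, "change_code": 0})    # copy: unchanged (often dropped)
    for k, old in before_by_key.items():
        if k not in after_keys:
            out.append({**old, "change_code": 2})    # delete: key vanished
    return out

before = [{"id": 1, "city": "Perth"}, {"id": 2, "city": "Sydney"}]
after = [{"id": 1, "city": "Hobart"}, {"id": 3, "city": "Darwin"}]
print(change_capture(before, after, "id", ["city"]))
# -> edit for id 1, insert for id 3, delete for id 2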
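
And a sketch of the pack-and-code idea in option 4, using Python's zlib.crc32 as a stand-in for the CRC32 function. The column names and the stored_crcs lookup (the CRCs saved from the previous load) are hypothetical.

Code:

import zlib

def row_crc(row, compare_cols, delim="|"):
    """Pack the comparison columns into one delimited string and code it."""
    packed = delim.join(str(row[c]) for c in compare_cols)
    return zlib.crc32(packed.encode("utf-8"))

def detect_changes(rows, key, compare_cols, stored_crcs):
    """stored_crcs: dict mapping key value -> CRC from the previous load."""
    for row in rows:
        crc = row_crc(row, compare_cols)
        old = stored_crcs.get(row[key])
        if old is None:
            yield ("insert", row)
        elif old != crc:
            yield ("update", row)
        # equal CRCs: row is unchanged, drop it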
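
Finally, a sketch of option 6 as a user-defined SQL statement run from Python. The staging and target tables (stg_customer, dim_customer) and their columns are made up, and every comparison field needs its own predicate - which is exactly why this gets painful with wide records.

Code:

CHANGED_ROWS_SQL = """
SELECT s.cust_id, s.name, s.city
FROM stg_customer s
LEFT JOIN dim_customer t ON t.cust_id = s.cust_id
WHERE t.cust_id IS NULL                              -- brand-new key
   OR COALESCE(t.name, '') <> COALESCE(s.name, '')   -- changed field (COALESCE
   OR COALESCE(t.city, '') <> COALESCE(s.city, '')   --  guards against NULLs)
"""

def fetch_changed_rows(conn):
    """conn: any DB-API connection to the staging database."""
    cur = conn.cursor()
    cur.execute(CHANGED_ROWS_SQL)
    return cur.fetchall()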

Posted: Tue Jun 14, 2011 6:47 am
by Billyqing
Hi All,

Thank you very much for your reply.
I will try the options vmcburney has mentioned here, which are exactly what I am looking for.

I will post any progress regarding this topic.

Thanks for all your input.

Posted: Tue Jun 14, 2011 7:04 am
by chulett
IMHO, technically everything in Vince's list after #1 is CDD rather than CDC. :wink:

Posted: Tue Jun 14, 2011 3:00 pm
by Billyqing
Hi,

Just to double-check Vince's #4 on the list:
CRC32 is not a parallel job function but a server job function.

I have used this function in server jobs, but I cannot use it in parallel jobs.

Thanks,

Posted: Tue Jun 14, 2011 3:59 pm
by jwiles
There is a Checksum stage, available beginning with Information Server 8.1, that can be used in much the same way Vince is suggesting CRC32 be used.
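
In Python terms this is roughly what the Checksum stage does: it appends a checksum column computed over the columns you select (the real stage generates an MD5 checksum; the column names here are hypothetical). Compare the new checksum against the one stored on the previous load to decide insert/update, exactly as with CRC32.

Code:

import hashlib

def add_checksum(row, cols, out_col="checksum"):
    """Append a checksum column computed over the selected input columns."""
    packed = "|".join(str(row[c]) for c in cols)
    row[out_col] = hashlib.md5(packed.encode("utf-8")).hexdigest()
    return row

print(add_checksum({"cust_id": 7, "name": "Ann", "city": "Perth"},
                   ["name", "city"]))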

Posted: Thu Jun 16, 2011 6:10 pm
by vmcburney
I used CRC32 years ago and recall now that it was in Server jobs. Quite a handy function. I agree this is all CDD - detecting changes by comparing incoming large or delta recordsets to a target recordset. This puts load on the ETL server and can potentially require large extracts from source systems. The very best CDC approach is InfoSphere CDC, as it is the only method that pulls true delta data off source systems without any reliance on triggers or source-system date fields, and it does not lose intra-day changes to the same record.

Posted: Fri Jun 17, 2011 7:07 am
by Billyqing
It looks like I will use InfoSphere CDC and CDD to handle all delta changes and put them into the target.

Does anyone have experience using the CDC and CDD components with DataStage?

If you have done this, please give me some suggestions on how to go about it.
I do not even have a manual for them.

There is not much on the internet either.

Thanks,

Posted: Fri Jun 17, 2011 7:56 am
by chulett
Billyqing wrote: It looks like I will use InfoSphere CDC and CDD to handle all delta changes and put them into the target.
Infosphere CDC? So your company has purchased that?

Note also that you typically use one or the other for a particular source - 'or', not 'and'.

Posted: Fri Jun 17, 2011 10:11 am
by chulett
Just wanted to make sure as it didn't seem like that was the case from your previous posts.

Posted: Fri Jun 17, 2011 7:37 pm
by vmcburney
Once you buy a CDD license you need to decide how you want to land the data. You can use CDD for InfoSphere Warehouse and land data in staging, ODS or atomic-level persistent data tables. You can use CDD for DataStage, where the data goes through a DataStage job before landing in a database. I would recommend keeping the DataStage job simple: use the CDC input stage to pass data into the job, use a Transformer to add tagging columns such as DW_LOAD_DATE or SOURCE_SYSTEM_ID, and write to a table with a database connector. You still get guaranteed data delivery under version 8.5, which means the CDC engine will bookmark the target database table and make sure rows are delivered even if the DataStage job fails halfway through - the CDC engine will restart the job and continue processing the rows that were not committed.

This guaranteed data delivery is why you want to keep the initial DataStage load job simple - you do not want the DataStage job to filter, reject or lose rows, as that could throw off the CDC table bookmarking between source and target that guarantees data delivery. A minimal sketch of the tagging step follows.
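
In the real job the tagging is just a couple of Transformer derivations, but in Python terms it amounts to no more than this (DW_LOAD_DATE and SOURCE_SYSTEM_ID follow the columns named above; the source_system_id value stands in for a job parameter):

Code:

from datetime import datetime, timezone

def tag_row(row, source_system_id):
    """Pass the row through unchanged, adding audit/tagging columns
    before the database connector writes it to the target table."""
    row["DW_LOAD_DATE"] = datetime.now(timezone.utc).isoformat()
    row["SOURCE_SYSTEM_ID"] = source_system_id
    return row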

Posted: Mon Jun 20, 2011 8:23 am
by Billyqing
Thank you for your input, Vince.

The components of CDC installed are:
InfoSphere CDC Management Console and InfoSphere CDC Access Server.
But I still do not know how CDC connects to DataStage.

Could you tell me more details about the DS <---> CDC setup?

I appreciate your help.