How to implement Change Data Capture in PX job

Post questions here relating to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Billyqing
Participant
Posts: 44
Joined: Thu May 13, 2004 12:00 pm
Location: Canada

How to implement Change Data Capture in PX job

Post by Billyqing »

Hi,

I have done Change Data Capture in a Server DS job, ver. 7.5.

I am going to do the same for Change Data Capture in PX ver. 8.5.

Does anyone have experience with how to do this in PX? Please give some instructions.

Thanks in advance for your help.

Thanks,
Bill
madhusds
Premium Member
Posts: 29
Joined: Tue Dec 28, 2010 8:19 pm
Location: NH
Contact:

Re: How to implement Change Data Capture in PX job

Post by madhusds »

You can use the CDC Transaction stage in an IBM InfoSphere DataStage job to read data that is captured by IBM InfoSphere Change Data Capture (InfoSphere CDC) and apply the change data to a target database.

The CDC Transaction stage integrates the replication capabilities that are provided by InfoSphere CDC with the ETL capabilities that are provided by InfoSphere DataStage. You can use these products together to perform real time, continuous replication with guaranteed delivery and transactional integrity in the event of failure.
For more information, see http://publib.boulder.ibm.com/infocente ... ainer.html.
Thanks
Billyqing
Participant
Posts: 44
Joined: Thu May 13, 2004 12:00 pm
Location: Canada

Post by Billyqing »

Thank you for your reply, Madhusds.

InfoSphere CDC needs to be purchased separately and is not a component of InfoSphere DataStage, is this right?

If we are not going to purchase InfoSphere CDC, is there any way to implement CDC in a PX job?


Thanks,
Bill
asorrell
Posts: 1707
Joined: Fri Apr 04, 2003 2:00 pm
Location: Colleyville, Texas

Post by asorrell »

That is correct - the CDC stage is useless unless you buy either the CDD or CDC package from IBM.

You specifically asked about the CDC stage in your earlier question. Were you perhaps actually interested in learning how to use the Change Capture stage in PX? (which is quite different from the CDC stage)
Andy Sorrell
Certified DataStage Consultant
IBM Analytics Champion 2009 - 2020
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

There are a few options:
1) IBM has a premium Change Data Capture stage for parallel jobs that needs a separate InfoSphere CDC license. The InfoSphere CDC plugin version uses database log scraping to identify changed data from the source database logs.
2) There is a pair of stages called Change Capture and Change Apply. The first compares recordsets and identifies inserts, updates and deletes; the second can apply those changes to a dataset. The stages are easy to use - just have a look at the help file or manual.
3) The Slowly Changing Dimension stage does its own change data capture for slowly changing dimension data of types 1, 2 and 3. Since it automates the setting of the surrogate key, effective date, expiry date and current flag, it can be useful.
4) There is a CRC32 function available in the parallel Transformer which can generate a code for a string - or for a packed set of fields. You can set up dynamic change data capture (CDC via a shared container) by packing and coding a record and using the CRC32 function for a primary key field and the comparison fields.
5) A Lookup stage followed by a Transformer can do simpler change data capture where there are not a lot of fields to compare. A Transformer is handy for change data capture because it can store values from the previous record in stage variables, which can be used to detect changes in specific fields between records. It tends to be more flexible than the Change Capture stage, but comparing many fields becomes a pain.
6) You can bulk load the new data into a database staging table and perform change data capture with a user-defined SQL statement that retrieves changed rows by comparing the new data to the old table with a SQL join. With a lot of fields to compare the SQL is a pain to write and maintain and might not be as scalable if there are a lot of unchanged rows flowing through.
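To illustrate the checksum-style approach in option 4, here is a rough sketch in Python (not DataStage code). The field names, the pipe delimiter, and the `detect_changes` helper are all made up for illustration; the idea is simply to pack the comparison fields, CRC32 them keyed by the primary key, and classify rows against the previously stored CRCs:

```python
import zlib

def crc_of(record, fields):
    # Pack the comparison fields into one string and CRC32 it.
    # The "|" delimiter is an arbitrary choice for this sketch.
    packed = "|".join(str(record[f]) for f in fields)
    return zlib.crc32(packed.encode("utf-8"))

def detect_changes(new_rows, old_crcs, key, fields):
    """Classify rows as inserts, updates, or deletes by comparing
    each row's CRC against the CRC stored for its primary key."""
    changes = {"insert": [], "update": [], "delete": []}
    seen = set()
    for row in new_rows:
        k = row[key]
        seen.add(k)
        if k not in old_crcs:
            changes["insert"].append(row)
        elif old_crcs[k] != crc_of(row, fields):
            changes["update"].append(row)
        # equal CRC -> unchanged, drop the row
    # keys present before but absent now are deletes
    changes["delete"] = [k for k in old_crcs if k not in seen]
    return changes
```

In a real job the stored CRCs would come from a reference dataset or lookup table rather than an in-memory dict, and note (per the later posts in this thread) that CRC32 itself is a Server function, with the Checksum stage playing this role in parallel jobs.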
Billyqing
Participant
Posts: 44
Joined: Thu May 13, 2004 12:00 pm
Location: Canada

Post by Billyqing »

Hi All,

Thank you very much for your reply.
I will try the options vmcburney has mentioned here, which are exactly what I am looking for.

I will post any progress regarding this topic.

I appreciate all your input.
Bill
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

IMHO, technically everything in Vince's list after #1 is CDD rather than CDC. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
Billyqing
Participant
Posts: 44
Joined: Thu May 13, 2004 12:00 pm
Location: Canada

Post by Billyqing »

Hi,

Just to double check with you for Vince's #4 of the list.
CRC32 is not a Parallel job function but a Server job function.


I have used this function in Server jobs, but I cannot use it in Parallel jobs.

Thanks,
Bill
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

There is a Checksum stage available beginning with IS 8.1 that can be used similarly to how Vince is suggesting CRC32.
- james wiles


All generalizations are false, including this one - Mark Twain.
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

I used CRC32 years ago and recall now that it was in Server jobs. Quite a handy function. I agree this is all CDD - detecting changes by comparing incoming large or delta recordsets to a target recordset. This puts load onto the ETL server and can potentially require large extracts from source systems. The very best CDC approach is InfoSphere CDC, as it is the only method that pulls true delta data off source systems without any reliance on triggers or source-system date fields, and it does not lose intra-day data changes to the same record.
Billyqing
Participant
Posts: 44
Joined: Thu May 13, 2004 12:00 pm
Location: Canada

Post by Billyqing »

It looks like I will use InfoSphere CDC and CDD to handle all delta changes and put them into the target.

Does anyone have experience using the CDC and CDD components with DataStage?

Please give me some suggestions on how to do so if you have done this.
I do not even have any manuals for them.

There is not much on the internet either.

Thanks,
Bill
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Billyqing wrote: It looks like I will use InfoSphere CDC and CDD to handle all delta changes and put them into the target.
Infosphere CDC? So your company has purchased that?

Note also you typically do one or the other for a particular source, 'or' not 'and'.
-craig

"You can never have too many knives" -- Logan Nine Fingers
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Just wanted to make sure as it didn't seem like that was the case from your previous posts.
-craig

"You can never have too many knives" -- Logan Nine Fingers
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

Once you buy a CDD license you need to decide how you want to land data. You can use CDD for InfoSphere Warehouse and land data to staging, ODS, or atomic-level persistent data tables. You can use CDD for DataStage where the data goes through a DataStage job before landing to a database. I would recommend keeping the DataStage job simple - use the CDC input stage to pass data into the job, use a Transformer to add tagging columns such as DW_LOAD_DATE or SOURCE_SYSTEM_ID, and write to a table with a database connector. You still get guaranteed data delivery under version 8.5, which means the CDC engine will bookmark the target database table and make sure rows are delivered even if the DataStage job fails halfway through - the CDC engine will restart the job and continue processing rows that were not committed.

This guaranteed data delivery is why you want to keep the initial DataStage load job simple - you do not want the DataStage job to filter, reject, or lose rows, as that could throw out the CDC table bookmarking between source and target that guarantees data delivery.
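The pass-through-with-tagging pattern described above can be sketched in Python for illustration (the real job would do this in a Transformer, not code). DW_LOAD_DATE and SOURCE_SYSTEM_ID come from the post; the `tag_row` helper and its parameters are made up:

```python
from datetime import datetime, timezone

def tag_row(row, source_system_id):
    """Add audit columns to a row without filtering or rejecting it,
    mirroring the simple pass-through Transformer described above.
    Every input row comes out, so CDC bookmarking stays intact."""
    tagged = dict(row)  # copy so the input row is untouched
    tagged["DW_LOAD_DATE"] = datetime.now(timezone.utc).isoformat()
    tagged["SOURCE_SYSTEM_ID"] = source_system_id
    return tagged
```

The key design point is that this transform is total: it never drops or rejects a row, which is what keeps the source-to-target bookmarking consistent.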
Billyqing
Participant
Posts: 44
Joined: Thu May 13, 2004 12:00 pm
Location: Canada

Post by Billyqing »

Thank you for your input, Vince.

The components of CDC installed are:
InfoSphere CDC Management Console and InfoSphere CDC Access Server.
But I still do not know how CDC connects to DataStage.

Could you tell me more details about the DS <---> CDC connection?

I appreciate your help.
Bill