Change Data Capture

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

nvalia
Premium Member
Posts: 180
Joined: Thu May 26, 2005 6:44 am

Change Data Capture

Post by nvalia »

Hi All,

We have a requirement to capture changed data from a Sybase server. It is a read-only server, and every day a SAN snapshot (backup) of the production data is dumped into it (around 22 million rows).
Not all tables have an As-of-Date or any similar date field that would let us track changes in the data.

We have a load window constraint of around 6-7 hrs for capturing the changed data and then loading it into the Sybase IQ Data Mart.
The deltas should be in the range of 100,000 rows (one hundred thousand).

So what is the best way to capture changes in this scenario?
Can this be achieved efficiently via ETL? If yes, then how?

Thanks,
NV
DSguru2B
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

You have two options. Do it in DataStage via CDC (the Change Capture stage) or the Join stage, or do it at the database level itself: run a SELECT * FROM table WHERE key NOT IN (SELECT key FROM yesterdays_table) to get the changes.
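The database-level option can be sketched outside SQL as well. Below is a hypothetical Python illustration of the same key-comparison idea (the row shapes and values are made up for the example), useful for seeing what the NOT IN query does and what it misses:

```python
# Sketch of the "NOT IN yesterday's table" delta, assuming each snapshot
# is available as an iterable of (key, row) pairs.
def new_rows(today, yesterday):
    """Rows whose key appears today but not in yesterday's snapshot."""
    seen = {key for key, _ in yesterday}          # build the key set once
    return [row for key, row in today if key not in seen]

before = [(1, "alice"), (2, "bob")]
after = [(1, "alice"), (2, "bobby"), (3, "carol")]
print(new_rows(after, before))  # ['carol'] -- only key 3 is new
```

Note that this catches inserts only: key 2 changed in place and is missed, which is why the Change Capture / Join approaches compare the non-key values too.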
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
nvalia
Premium Member
Posts: 180
Joined: Thu May 26, 2005 6:44 am

Post by nvalia »

Thanks DSguru2B for your prompt response.

Is a CDC plug-in stage available for the ASE server (Sybase), and how does it achieve this?
And with the Join stage, how would you achieve CDC?

Most importantly, what is the impact on performance, as we are talking high volumes here?

Doing it at the DB level is not an option for us; it has to be implemented via ETL.

Regards,
NV
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

What is your version of DataStage? The Change Capture stage should be able to find the differences between Input1 and Input2; a flag is used to identify each change.
A Full Outer Join coupled with a Transformer doing null detection can be used to construct similar functionality.
If you have any timestamp field that denotes when a row was updated, it will make your life much easier.
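The Full Outer Join plus null-detection idea can be sketched in Python. This is an illustration only, assuming each snapshot is a dict keyed by the business key; the change labels mimic the kind of flag the Change Capture stage emits:

```python
# Full outer join of before/after snapshots with "null" detection:
# a key missing on one side maps to None, playing the NULL role the
# Transformer would test for after a real Full Outer Join.
def change_capture(before, after):
    changes = {}
    for key in before.keys() | after.keys():     # union of keys = full outer join
        b, a = before.get(key), after.get(key)
        if b is None:
            changes[key] = ("insert", a)         # key only in the after set
        elif a is None:
            changes[key] = ("delete", b)         # key only in the before set
        elif b != a:
            changes[key] = ("edit", a)           # same key, values differ
        else:
            changes[key] = ("copy", a)           # unchanged row
    return changes

before = {1: "alice", 2: "bob"}
after = {2: "bobby", 3: "carol"}
print(change_capture(before, after))
```

Downstream, only the insert/delete/edit rows would be kept and loaded, which is exactly the delta the original poster needs.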
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
snt_ds
Premium Member
Posts: 280
Joined: Wed Oct 18, 2006 11:53 am
Location: Pune, India

Post by snt_ds »

Thanks Kumar..

We are using 7.5.1 (Enterprise Edition, Parallel).
As I mentioned in my first post, we do not have the luxury of date-time fields, so life is going to be tough!

We have implemented change data capture in another project, though the volumes were pretty low there.

My main area of concern is performance and the time it will take.
I would request you guys to touch on that.
I have provided all data row counts in my first post.

Regards,
Nirav
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Hi Nirav,

What is the logic on which the change is captured?
Is it compared with the Production database (the target database)?
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
snt_ds
Premium Member
Posts: 280
Joined: Wed Oct 18, 2006 11:53 am
Location: Pune, India

Post by snt_ds »

Kumar,

There are no predefined rules for capturing the changes; that is the challenge.

A backup of the entire Production DB is taken on the Sybase server (the source for our ETL process) every 12 hrs, so we have to capture only the changes between these two backups and then load them into the target Sybase IQ server. We do not touch the Prod server at all.

Refer to my first post.

Regards,
NV
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Okay... the key of each table (across the backups for two consecutive days) should be marked as Key in the Change Capture stage, and the preferred fields can be marked as Data, or optionally all fields can be marked as Data fields in Change Capture.
But capturing changes based on the database log should be much faster. I am not sure why you refrain from using the database for this.
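The Key/Data marking can be illustrated with a small sketch; the column names here are invented for the example. Only the columns marked as Data take part in the comparison, so columns you leave out (say, a load timestamp) do not produce false deltas:

```python
# Compare only the columns "marked as Data"; everything else is ignored.
# Snapshots are dicts of key -> row (a dict of column name -> value).
def changed_keys(before, after, data_cols):
    changed = []
    for key, row in after.items():
        old = before.get(key)
        if old is not None and any(old[c] != row[c] for c in data_cols):
            changed.append(key)
    return sorted(changed)

before = {1: {"name": "alice", "loaded": "Mon"}}
after = {1: {"name": "alicia", "loaded": "Tue"}}
print(changed_keys(before, after, ["name"]))  # [1] -- "loaded" alone would not count
```

Marking all non-key fields as Data is the safe default when, as here, there is no reliable update timestamp to narrow the comparison.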
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
DSguru2B
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

If time is of the essence, then why exactly are you refraining from it? Have "they" asked you not to?
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
snt_ds
Premium Member
Posts: 280
Joined: Wed Oct 18, 2006 11:53 am
Location: Pune, India

Post by snt_ds »

Well, we are exploring a solution through ETL;
it is preferred by them, the "client"!
DSguru2B
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

Well, you have the data with you. Try it with the CDC stage, with the Join stage, and at the database level. Do a benchmark, present the results to your team, and see which one is better.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
snt_ds
Premium Member
Posts: 280
Joined: Wed Oct 18, 2006 11:53 am
Location: Pune, India

Post by snt_ds »

Kumar,

How could we trap the changes via the database log?
As I mentioned earlier, it is a straight dump of the SAN snapshot over the existing data, so basically all the data in the Sybase server gets overwritten each time.
Would the database still maintain the logs?
snt_ds
Premium Member
Posts: 280
Joined: Wed Oct 18, 2006 11:53 am
Location: Pune, India

Post by snt_ds »

How much time would a PX job involving the Change Capture stage take where the before and after datasets each have around a million rows?
Should this process take more than 30 minutes on a 2-node machine?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Two characters per row or two million characters per row? How long is a piece of string?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
DSguru2B
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

Maybe less than 30 minutes. As mentioned, it totally depends on the record size.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
Post Reply