CDC Apply

malladi_555
Participant
Posts: 2
Joined: Mon Apr 25, 2011 6:23 am

CDC Apply

Post by malladi_555 »

Hi All,

What we are doing right now is essentially a data integration project.
We have 3 source tables in total, each of which has about 300 columns, and the source is DB2. We follow 5 steps for the integration:
1) Pull the data from the source and, after applying some functions, store it as CSV files.
2) Apply the date conversions in the second step and keep the result in the same CSV files.
3) We designed a SIF structure. Here we create a job that takes whichever of those 600 columns we need into one target CSV file.
4) Here we apply the business calculations.
5) In the final step we create different jobs, send the columns to the required target tables, and get everything ready for the data marts.

The overall 5-step process essentially moves the data from the core warehouse model to the enterprise data warehouse model.

Now the client wants to apply CDC (Change Data Capture) at the core warehouse model.

They will use IBM's CDC tool to capture the changes. For every changed record the tool captures the change and gives us two records, one as an update and one as an insert, and we finally receive these in a sequential file.

I need some help with this.
For some columns they want to simply overwrite the old value, and for some columns they also want to keep recent history.
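In rough pseudo-code terms, the first thing we would have to do with that sequential file is split the update and insert records, something like this (just a sketch; the delimiter, file name and change-type flag are made up, the real feed layout is different):

    # Sketch only: assumes a pipe-delimited CDC feed whose first field is a
    # change-type flag ("I" for the insert record, "U" for the update record).
    # The actual InfoSphere CDC output layout will differ.
    inserts, updates = [], []

    with open("cdc_feed.txt") as feed:          # hypothetical file name
        for line in feed:
            fields = line.rstrip("\n").split("|")
            change_type, record = fields[0], fields[1:]
            if change_type == "I":
                inserts.append(record)
            elif change_type == "U":
                updates.append(record)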

1) At which step is it best to apply this CDC - at the SIF, or before it?
2) What is the best way to apply it - I mean, what kind of stages and what kind of logic?
Any suggestions welcome.
KUMAR
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

DataStage has a CDC stage that can receive the output from the Change Data Capture product. DataStage also has a Slowly Changing Dimension stage in which you can treat updates as Type 1 or Type 2, or you can build your own logic using other stages.
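If you go down the "build your own logic" route, the apply rules you described come down to something like the sketch below (rough pseudo-code, not DataStage syntax; the column sets, key handling and timestamps are all invented for illustration):

    from datetime import datetime

    # Rough sketch of Type 1 vs Type 2 handling for one change record.
    # TYPE1_COLS / TYPE2_COLS and the row structure are invented; in DataStage
    # this would be stage logic (e.g. the SCD stage), not Python.
    TYPE1_COLS = {"customer_name"}          # overwrite in place, no history
    TYPE2_COLS = {"customer_address"}       # keep recent history

    def apply_change(target_rows, change):
        """target_rows: versions of one business key, current row last."""
        current = target_rows[-1]
        if any(change[c] != current[c] for c in TYPE2_COLS):
            # Type 2: close off the current row and insert a new current version
            current["effective_to"] = datetime.now()
            new_row = dict(current, **{c: change[c] for c in TYPE1_COLS | TYPE2_COLS})
            new_row["effective_from"] = datetime.now()
            new_row["effective_to"] = None
            target_rows.append(new_row)
        else:
            # Type 1 only: overwrite the changed columns on the current row
            for c in TYPE1_COLS:
                current[c] = change[c]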
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne

Post by vmcburney »

This sounds like a bloated and unmaintainable Data Warehouse load process, as though someone thought merging all data into a table of 600 columns would somehow magically make the data easier to manage. I'm afraid there is a reason why data modelling is an important discipline, and that there are benefits to relational modelling, enterprise modelling, data vault modelling and dimensional modelling. They all tend to be better than jamming all fields into one bloody big table.

This is a bloated beast of a Data Warehouse and I am guessing it cost a fortune to build and is hard to maintain.

Throw out all the CSV files - what decade is this, the 70s? Saving data five times in useless data stores is wrong. Saving data into text files is wrong: all metadata is lost, it leads to I/O bottlenecks, and CSV files cannot be queried to troubleshoot ETL errors. Using CSV as a staging layer between DataStage jobs is just wrong.

A single file of 600 columns? What a data quality nightmare. How many of those columns are filled with empty values due to referential integrity problems? How much redundancy? How do you manage changing reference or master data over time? How do you track change in master data? Why merge into 600 columns only to split into data marts? This isn't enterprise modelling, this is Bob the Builder data modelling.

In order to use CDC in a Data Warehouse context you need to take that bloated beast out into a pasture and shoot it. You need a Data Warehouse 2.0 data model - which is a data model that can handle real time data loads where data doesn't necessarily come in the same order but should be accepted 100% of the time. When you move from batch processing to CDC you now guarantee the delivery of every row of data so you cannot allow any referential integrity rejections. I would be looking for a raw data store based around the SIF master data entities that tracks change over time. It would either replace the existing SIF data store or be the main supplier. I would then go from this data store straight into the business layer which may be data marts or an enterprise layer before the data marts. Either way I would only store data in database layers that could be audited.
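As a rough sketch of what "accepted 100% of the time" means in practice (all table and column names below are invented, and this is plain pseudo-code rather than any particular product):

    from datetime import datetime, timezone

    # Sketch of an insert-only raw store that never rejects a CDC row.
    # Every change (insert, update or delete) is appended with its business
    # key, a load timestamp and the change type; referential integrity is
    # resolved later in the business layer, not enforced on the way in.
    raw_store = []   # stand-in for a raw history table

    def land_cdc_row(business_key, change_type, payload):
        raw_store.append({
            "business_key": business_key,
            "change_type": change_type,            # "I", "U" or "D"
            "load_ts": datetime.now(timezone.utc), # arrival time
            **payload,                             # source columns, as received
        })

    def latest_image(business_key):
        """Most recently landed image for a key, derived from the history."""
        history = [r for r in raw_store if r["business_key"] == business_key]
        if not history or history[-1]["change_type"] == "D":
            return None
        return history[-1]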

You may find your Bob the Builder Enterprise Data Model is woefully inadequate for accepting real-time data with master data that changes over time. Have a look at the articles on Data Warehouse 2.0 and the Data Vault modelling technique from Dan Linstedt, which accepts data in real time and is perfect for InfoSphere CDC load techniques.

Are you talking about DataStage CDC or InfoSphere real time CDC? Your new raw data model should handle all the change data over time including inserts, updates and deletes.