CDC Apply

malladi_555
Participant
Posts: 2
Joined: Mon Apr 25, 2011 6:23 am

CDC Apply

Post by malladi_555 »

Hi All,

What we are doing right now is essentially a data integration project.
We have 3 source tables in total, each of which has about 300 columns, and the source is DB2. We follow 5 steps for the integration:
1) Pull the data from the source and, after applying some functions, store it as CSV files.
2) Apply the date conversions in the second step and keep the result in the same CSV files.
3) We designed a SIF structure. Here we create a job that takes whichever of those 600 columns we need into one target CSV file.
4) Here we apply the business calculations.
5) In the final step we create different jobs, send the columns to the required target tables, and get everything ready for the data marts.

The overall 5-step process essentially moves the data from the core warehouse model to the enterprise data warehouse model.

Now the client wants to apply CDC (Change Data Capture) at the core warehouse model.

They will use IBM's CDC tool to capture the changes. For every changed record the tool captures the change and gives us two records, one as an update and one as an insert, and we finally receive these in a sequential file.

I need some help with this.
For some columns they want to simply overwrite the old value, and for some columns they also want to keep recent history.
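In rough pseudo-code terms, the first thing we would have to do with that sequential file is split the update and insert records, something like this (just a sketch; the delimiter, file name and change-type flag are made up, the real feed layout is different):

    # Sketch only: assumes a pipe-delimited CDC feed whose first field is a
    # change-type flag ("I" for the insert record, "U" for the update record).
    # The actual InfoSphere CDC output layout will differ.
    inserts, updates = [], []

    with open("cdc_feed.txt") as feed:          # hypothetical file name
        for line in feed:
            fields = line.rstrip("\n").split("|")
            change_type, record = fields[0], fields[1:]
            if change_type == "I":
                inserts.append(record)
            elif change_type == "U":
                updates.append(record)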

1) At which step is it best to apply this CDC - at the SIF, or before it?
2) What is the best way to apply it - I mean, what kind of stages and what kind of logic?
Any suggestions welcome.
KUMAR
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

DataStage has a CDC stage that can receive the output from the Change Data Capture product. DataStage also has a Slowly Changing Dimension stage in which you can treat updates as Type 1 or Type 2, or you can build your own logic using other stages.
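If you go down the "build your own logic" route, the apply rules you described come down to something like the sketch below (rough pseudo-code, not DataStage syntax; the column sets, key handling and timestamps are all invented for illustration):

    from datetime import datetime

    # Rough sketch of Type 1 vs Type 2 handling for one change record.
    # TYPE1_COLS / TYPE2_COLS and the row structure are invented; in DataStage
    # this would be stage logic (e.g. the SCD stage), not Python.
    TYPE1_COLS = {"customer_name"}          # overwrite in place, no history
    TYPE2_COLS = {"customer_address"}       # keep recent history

    def apply_change(target_rows, change):
        """target_rows: versions of one business key, current row last."""
        current = target_rows[-1]
        if any(change[c] != current[c] for c in TYPE2_COLS):
            # Type 2: close off the current row and insert a new current version
            current["effective_to"] = datetime.now()
            new_row = dict(current, **{c: change[c] for c in TYPE1_COLS | TYPE2_COLS})
            new_row["effective_from"] = datetime.now()
            new_row["effective_to"] = None
            target_rows.append(new_row)
        else:
            # Type 1 only: overwrite the changed columns on the current row
            for c in TYPE1_COLS:
                current[c] = change[c]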
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne

Post by vmcburney »

This sounds like a bloated and unmaintainable Data Warehouse load process, as though someone thought merging all data into a table of 600 columns would somehow magically make the data easier to manage. I'm afraid there is a reason why data modelling is an important discipline, and that there are benefits to relational modelling, enterprise modelling, data vault modelling and dimensional modelling. They all tend to be better than jamming all fields into one bloody big table.

This is a bloated beast of a Data Warehouse and I am guessing it cost a fortune to build and is hard to maintain.

Throw out all the CSV files - what decade is this, the 70s? Saving data five times in useless data stores is wrong. Saving data into text files is wrong: all metadata is lost, it leads to I/O bottlenecks, and CSV files cannot be queried to troubleshoot ETL errors. Using CSV as a staging layer between DataStage jobs is just wrong.

A single file of 600 columns? What a data quality nightmare. How many of those columns are filled with empty values due to referential integrity problems? How much redundancy? How do you manage changing reference or master data over time? How do you track change in master data? Why merge into 600 columns only to split into data marts? This isn't enterprise modelling, this is Bob the Builder data modelling.

In order to use CDC in a Data Warehouse context you need to take that bloated beast out into a pasture and shoot it. You need a Data Warehouse 2.0 data model - which is a data model that can handle real time data loads where data doesn't necessarily come in the same order but should be accepted 100% of the time. When you move from batch processing to CDC you now guarantee the delivery of every row of data so you cannot allow any referential integrity rejections. I would be looking for a raw data store based around the SIF master data entities that tracks change over time. It would either replace the existing SIF data store or be the main supplier. I would then go from this data store straight into the business layer which may be data marts or an enterprise layer before the data marts. Either way I would only store data in database layers that could be audited.
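As a rough sketch of what "accepted 100% of the time" means in practice (all table and column names below are invented, and this is plain pseudo-code rather than any particular product):

    from datetime import datetime, timezone

    # Sketch of an insert-only raw store that never rejects a CDC row.
    # Every change (insert, update or delete) is appended with its business
    # key, a load timestamp and the change type; referential integrity is
    # resolved later in the business layer, not enforced on the way in.
    raw_store = []   # stand-in for a raw history table

    def land_cdc_row(business_key, change_type, payload):
        raw_store.append({
            "business_key": business_key,
            "change_type": change_type,            # "I", "U" or "D"
            "load_ts": datetime.now(timezone.utc), # arrival time
            **payload,                             # source columns, as received
        })

    def latest_image(business_key):
        """Most recently landed image for a key, derived from the history."""
        history = [r for r in raw_store if r["business_key"] == business_key]
        if not history or history[-1]["change_type"] == "D":
            return None
        return history[-1]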

You may find your Bob the Builder Enterprise Data Model is woefully inadequate for accepting real-time data with master data that changes over time. Have a look at the articles on Data Warehouse 2.0 and the Data Vault modelling technique from Dan Linstedt, which accepts data in real time and is perfect for InfoSphere CDC load techniques.

Are you talking about DataStage CDC or InfoSphere real time CDC? Your new raw data model should handle all the change data over time including inserts, updates and deletes.