Best Practice

Post questions here relating to DataStage Server Edition, covering such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

palmeal
Participant
Posts: 122
Joined: Thu Oct 14, 2004 7:56 am
Location: Edinburgh, Scotland

Best Practice

Post by palmeal »

A bit of preamble before I get to the question.

My company are in the process of developing a d/w.
The sources for the data are a series of Sybase databases and one Access database. DataStage will be used to transform the data and MS SQL 2000 will host the D/W. These things are set in stone.

The question I have revolves around how to get the data into the ETL on DataStage:
- is it best to schedule a job on DataStage to poll each source database for new data?
- is it best for the source databases to push data out to DataStage (a lot more code involved)?
- is it possible to make DataStage aware that there are data changes available from the source databases?

It doesn't matter (at the moment) whether data updates on the source database are real time or time lagged.

What is the "industry standard" for getting the data into the ETL - push from source or pull from DataStage. There will probably be no definitive answer for this but any help or pointers would be appreciated.
ogmios
Participant
Posts: 659
Joined: Tue Mar 11, 2003 3:40 pm

Re: Best Practice

Post by ogmios »

Opinions ahead:
The question I have revolves around how to get the data into the ETL on DataStage:
1) is it best to schedule a job on DataStage to poll each source database for new data?
2) is it best for the source databases to push data out to DataStage (a lot more code involved)?
Both require extra code. Personally I will always advocate pushing data from the source systems to DataStage. On the DataStage server I would put a small framework of UNIX scripts to process files via DataStage (a rough sketch of the idea is at the bottom of this post).
Or, if no physical files are involved, I have the source system ftp a trigger file to DataStage to indicate that the ETL can start.
3) is it possible to make DataStage aware that there are data changes available from the source databases?
Only if you write it yourself :wink:... see above
4) It doesn't matter (at the moment) whether data updates on the source database are real time or time lagged.
Real-time gives some problems with locked rows, etc., and a problem of having a stable view of the data. What if a process is still changing data in one table while you are already processing another with ETL (missing detail rows, ...)?
5) What is the "industry standard" for getting the data into the ETL - push from source or pull from DataStage. There will probably be no definitive answer for this but any help or pointers would be appreciated.
There's no industry standard.
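
To give a concrete idea of the kind of framework I mean - only a rough sketch, not tested code: the trigger file path, project and job names, and the exact dsjob options are assumptions you would have to adapt to your own installation.

Code:

# Rough sketch of a "push" framework: the source system ftp's a trigger
# file when its extract is ready; this script polls for it and then kicks
# off the DataStage job.  Paths, project/job names and dsjob options are
# illustrative assumptions - check the dsjob syntax for your release.
import subprocess
import time
from pathlib import Path

TRIGGER = Path("/data/landing/source_a/extract_ready.trg")   # hypothetical
PROJECT = "DWH_PROJECT"                                       # hypothetical
JOB     = "LoadSourceA"                                       # hypothetical

def wait_for_trigger(poll_seconds=60):
    """Block until the source system has dropped its trigger file."""
    while not TRIGGER.exists():
        time.sleep(poll_seconds)

def run_etl():
    """Start the DataStage job and wait for its exit status via dsjob."""
    result = subprocess.run(
        ["dsjob", "-run", "-jobstatus", PROJECT, JOB],
        capture_output=True, text=True)
    return result.returncode

if __name__ == "__main__":
    wait_for_trigger()
    TRIGGER.unlink()          # consume the trigger so we don't re-run
    print("dsjob exit status:", run_etl())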

Ogmios
In theory there's no difference between theory and practice. In practice there is.
KeithM
Participant
Posts: 61
Joined: Thu Apr 22, 2004 11:34 am

Post by KeithM »

Traditionally, data warehouses are set up so that the warehouse controls pulling the data from the source systems. This is done so that the processes that build the warehouse are all controlled in a single place. In your example you have multiple source systems; you would not want to change each one of them to control when data is sent to the warehouse. That would require a more complex architecture, and one of the goals of the warehouse is to limit the impact on the resources of the operational systems.

That being said, there has been more development recently in real-time data warehousing. In this situation the data is fed to the warehouse as it is added to the operational system, usually through some sort of queue processing. Ascential has a product called RTI which is used to execute DataStage jobs to do updates in real time.

Since you are just starting with a warehouse, I would suggest keeping your architecture as simple as possible and controlling the pulling of data within DataStage ETL jobs. If you make the mistake that many organizations make starting out and try to bite off too much, you are going to have a lot of problems. I would also suggest reading The Data Warehouse Lifecycle Toolkit by Ralph Kimball if you have not already; many of the questions you ask are answered in more detail in his books.
Keith
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Is there an easy way to detect changes in the source databases? If so, then DataStage can run periodically to detect these and process those rows very easily. For example, the SELECT statement generated by DataStage could include a constraint on the LAST_UPDATED_DATE column, selecting only rows updated since the (stored value of the) most recent extraction.
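
For example (a minimal sketch only - the table name, state file and timestamp format are assumptions for illustration; only the LAST_UPDATED_DATE column comes from the suggestion above):

Code:

# Build an incremental extract using a stored high-water mark on
# LAST_UPDATED_DATE.  Table name, state-file location and timestamp format
# are assumptions; in practice the value would typically be fed to the
# DataStage source stage as a job parameter.
from pathlib import Path

STATE_FILE = Path("/data/etl_state/customer.last_extract")   # hypothetical
DEFAULT_START = "1900-01-01 00:00:00"

def last_extracted():
    """Return the timestamp of the most recent successful extraction."""
    if STATE_FILE.exists():
        return STATE_FILE.read_text().strip()
    return DEFAULT_START

def build_select(high_water_mark):
    """Constrain the extract to rows changed since the last run."""
    return ("SELECT * FROM customer "
            "WHERE LAST_UPDATED_DATE > '%s'" % high_water_mark)

def record_high_water_mark(new_mark):
    """Persist the new high-water mark once the load has committed."""
    STATE_FILE.write_text(new_mark)

if __name__ == "__main__":
    print(build_select(last_extracted()))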

Without such cues in the database, changed data detection is a little more complex; the usual mechanism is to have DataStage preserve an image of the tables in a staging area. Incoming rows can then be compared against the staging area to detect changes.
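
A bare-bones illustration of that comparison, assuming the staged image and the incoming extract are simple delimited files keyed on their first column (the file names and layout are purely illustrative):

Code:

# Detect changed data by comparing today's extract with a preserved image
# of the table in the staging area.  Rows are keyed on the first column and
# compared via a hash of the remaining columns; all file names are
# illustrative assumptions.
import csv
import hashlib

def load_image(path):
    """Return {key: row_hash} for a delimited file, key = first column."""
    image = {}
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if not row:
                continue
            key, rest = row[0], "|".join(row[1:])
            image[key] = hashlib.md5(rest.encode()).hexdigest()
    return image

def diff(staged_path, incoming_path):
    """Classify incoming rows as inserts, updates or deletes."""
    staged, incoming = load_image(staged_path), load_image(incoming_path)
    inserts = [k for k in incoming if k not in staged]
    updates = [k for k in incoming if k in staged and incoming[k] != staged[k]]
    deletes = [k for k in staged if k not in incoming]
    return inserts, updates, deletes

if __name__ == "__main__":
    ins, upd, dele = diff("staging/customer.image", "landing/customer.csv")
    print(len(ins), "inserts,", len(upd), "updates,", len(dele), "deletes")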

Other approaches include:
  • reading from transaction logs, which is fraught with its own difficulties, not least the predilection of database vendors to change the internal structure of transaction logs (and not possible at all for MS Access)
  • Change Data Capture, which is a hugely expensive variant on the same theme
  • comparison with backups rather than an image maintained in a staging area; this requires close synchronization of backups and ETL processing
The Real-Time Integration services allow you to set up your DataStage jobs as Web services, which could - for example - be triggered by changes to the source data.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
penguin
Participant
Posts: 7
Joined: Wed Mar 31, 2004 5:22 am
Location: UK

Post by penguin »

I'm a relative DW novice (having only worked on one significant project) but can also vouch for Kimball's Data Warehouse Lifecycle Toolkit book....

In our particular setup we are afforded the luxury of having a scheduled daily window when activity on the source databases is negligible - if your environment allows it, this would certainly provide you with the 'steadiest' data, as you are not trying to hit a moving target.

You don't mention the size of the project (i.e. the amount of data and frequency), but another thing you might want to contemplate is whether you need to perform incremental updates on all of your tables or whether complete table refreshes can sometimes be employed - our warehouses are (thankfully!) fairly small, so with some of our more manageable dimensions we don't bother trying to sort the wheat from the chaff, we just refresh the whole thing from scratch!

Where incremental updates are necessary we use triggers on the core systems (Informix) to track whether records have changed - I believe Sybase also supports such triggers - our batch jobs then check whether any trigger or other key files exist as the incremental refresh jobs commence.... This nicely avoids the need to scan too many large tables for changes but obviously introduces overhead on your core systems.
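
For what it's worth, the batch-side check is nothing sophisticated - roughly along these lines (the directory layout, file names and table list are purely illustrative, not our actual setup):

Code:

# Decide, per table, whether the incremental refresh needs to run: the core
# system's triggers drop a key file whenever records change; if no key file
# exists the table is skipped.  All paths and names are illustrative
# assumptions.
from pathlib import Path

KEY_DIR = Path("/data/keys")                       # hypothetical
TABLES = ["customer", "account", "transaction"]    # hypothetical

for table in TABLES:
    key_file = KEY_DIR / ("%s.changed" % table)
    if key_file.exists():
        print("incremental refresh needed for", table)
        key_file.unlink()                          # consume the key file
    else:
        print("no changes flagged for", table)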

Anyway the rebuild I was waiting for has completed - good luck - there are plenty of wise people on this forum (much wiser than me at least!) so if in doubt - ask!