Delta Extraction

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

girishoak
Participant
Posts: 65
Joined: Wed Oct 29, 2003 3:54 am

Delta Extraction

Post by girishoak »

Hi,

I want to do a delta extraction from the database. For this I am passing two dates as parameters: one as a From Date and the other as a To Date. The extraction works fine, but now I want to schedule this job.
I want to extract data from the last successful extraction date up to the current date. My problem is: how do I get the last successful extraction date for the job? :?:

Is there any workaround for this?
Please help. Thanks in advance.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Some things to think about...

You could have your job write out the last successful extraction date somewhere... database table, hash file, etc. Code a Batch Control job (not a Sequence, but Basic code) to read in that file/table and dynamically set the parameter(s) before running the job.
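
For instance, a minimal batch control sketch in DS BASIC might look like the one below. The hash file name ("LastRunDates"), job name ("DeltaExtract") and parameter names ("FromDate", "ToDate") are only placeholders for whatever you actually use:

      * Read the last successful extraction date, keyed by job name
      Open "LastRunDates" To F.DATES Then
         Read LastRunDate From F.DATES, "DeltaExtract" Else LastRunDate = "1900-01-01"
      End Else
         Call DSLogFatal("Cannot open hash file LastRunDates", "BatchDeltaExtract")
      End

      * Attach the extraction job and set its date parameters dynamically
      hJob = DSAttachJob("DeltaExtract", DSJ.ERRFATAL)
      ErrCode = DSSetParam(hJob, "FromDate", LastRunDate)
      Today = Oconv(Date(), "D-YMD[4,2,2]")
      ErrCode = DSSetParam(hJob, "ToDate", Today)

      * Run the job, wait for it, and only record the new date if it finished OK
      ErrCode = DSRunJob(hJob, DSJ.RUNNORMAL)
      ErrCode = DSWaitForJob(hJob)
      Status = DSGetJobInfo(hJob, DSJ.JOBSTATUS)
      If Status = DSJS.RUNOK Then
         Write Today To F.DATES, "DeltaExtract"
      End
      ErrCode = DSDetachJob(hJob)

The key point is that the batch job owns the "last successful run" record and only updates it when the run actually finishes OK.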

Probably 'safer' to store stuff like that in a database and then rebuild the hash file each time, if you decide to use one. A pre-processor job could source from the database and (re)build the hash, then your batch could read it and set your parameters accordingly.

If you wanted to stay inside a Sequence job, have the preprocessor job read the database table, find the appropriate row and pass the date out using DSSetUserStatus. The next job in the sequence can pick that up and use it as an input parameter.
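
A rough sketch of that approach, assuming a transform function with a single argument (Arg1) called once from a Transformer in the preprocessor job:

      * Routine body: push the incoming date into the job's user status
      * area so a Sequence can read it, then pass the date through unchanged.
      Call DSSetUserStatus(Arg1)
      Ans = Arg1

In the Sequence, the downstream Job Activity can then use the expression PreprocessorActivity.$UserStatus (substitute your own activity name) as the value of its date parameter.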

There are many ways to skin the DataStage cat, this certainly isn't an exhaustive list of Things You Can Do. Others will chime in I'm sure, but I hope this is enough to get you thinking about and perhaps even solving your problem! :)
-craig

"You can never have too many knives" -- Logan Nine Fingers
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

Girish Oak

I think you need to consider and handle two situations:

1. What happens when you select rows that have already been processed?
2. How do you know when you have missed rows?

You need some kind of audit check for both.

Kim.
Mamu Kim
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

Your question points to a missing holistic (word for the day) approach to ETL. You will not be able to rely on DataStage to track and permanently store things like when a job last ran, what data it processed, etc. Don't believe me? If you build something that uses DS logs as the permanent datastore for last successful runs, then you obviously have never:
1. migrated jobs to a new project (logs don't move with jobs)
2. re-imported a job (log is wiped out and created new on import)
3. corrupted a project/job log (blink :evil: )
4. filled up a job log and had to clear it

So, you're going to need to develop an audit repository in your warehouse and gather processing metadata. You're going to have to store job execution history there (how you get it out of DS has been covered ad nauseum).
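
As a rough illustration of the "getting it out of DS" part, an after-job subroutine along these lines can pull the execution metadata for a run; the flat file name is made up, and in practice you would land the record in a database audit table as discussed:

      Subroutine CaptureJobAudit(InputArg, ErrorCode)
      * After-job subroutine: pull execution metadata for the current job
      * and append it as one delimited record (file name is just an example).
      ErrorCode = 0

      JobName   = DSGetJobInfo(DSJ.ME, DSJ.JOBNAME)
      StartedAt = DSGetJobInfo(DSJ.ME, DSJ.JOBSTARTTIMESTAMP)
      EndedAt   = Oconv(Date(), "D-YMD[4,2,2]") : " " : Oconv(Time(), "MTS")
      RunStatus = DSGetJobInfo(DSJ.ME, DSJ.JOBINTERIMSTATUS)

      AuditRec = JobName : "|" : StartedAt : "|" : EndedAt : "|" : RunStatus

      OpenSeq "JobAudit.txt" To F.AUDIT Then
         Seek F.AUDIT, 0, 2 Else Null   ;* move to end of file to append
      End Else
         Create F.AUDIT Else
            Call DSLogWarn("Cannot create JobAudit.txt", "CaptureJobAudit")
            Return
         End
      End
      WriteSeq AuditRec To F.AUDIT Else
         Call DSLogWarn("Cannot write audit record", "CaptureJobAudit")
      End
      CloseSeq F.AUDIT
      Return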

Kim's points deal with induction processing: how do you deal with errant data that doesn't make it through? Anything selected today that doesn't make it in today may have to be retried tomorrow. This points to an architecture that has to handle rejected rows, suspended rows, etc. This is where your data warehouse architect has to be more than just a data modeler. These and many other considerations have to be put in place before ETL starts being written.

Personally, I build an audit repository to store all process metadata. I architect persistent staging databases to deal with induction processing. I build modular ETL that produces staging files throughout the processing. I also build pure insert and pure update files prior to loading. THERE'S A HUGE BENEFIT TO RELOADING OVER REPROCESSING in the event of a database failure and recovery.

Kim, Ray, Vincent, Craig, and myself are always available for architecture consulting. :D
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

Kenneth

Well said. This applies to parameter history as well. You should always have an audit trail of each job run: parameter values, start and end times, and row counts per link. This should be stored in a database and not a hash file. This audit trail applies to complex data extractions like change data capture, or what you call a delta extraction.
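
To flesh that out a little, the same kind of after-job routine can also capture the parameter values and a per-link row count; the stage and link names here ("xfmLoad", "lnkTarget") are invented, so substitute your own:

      * Fragment from an after-job routine: collect each parameter value
      * and the row count on the link that writes to the target table.
      ParamList = DSGetJobInfo(DSJ.ME, DSJ.PARAMLIST)
      ParamValues = ""
      NumParams = Dcount(ParamList, ",")
      For P = 1 To NumParams
         ParamName = Field(ParamList, ",", P)
         ParamValues<P> = ParamName : "=" : DSGetParamInfo(DSJ.ME, ParamName, DSJ.PARAMVALUE)
      Next P

      RowCount = DSGetLinkInfo(DSJ.ME, "xfmLoad", "lnkTarget", DSJ.LINKROWCOUNT)

These values, together with the job name and timestamps, are what you insert into the audit table for each run.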

I firmly believe every data warehouse should do integrity checks on all loaded or updated tables. You need to give the user confidence that the numbers generated from a warehouse report exactly match a legacy report. Whether that report is OLAP or on paper, the numbers should still match.

Kim.
Mamu Kim
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

I think my point is this: our job is not just to move the data around. That's what DataStage was designed to do. Our job is to organize the data. If your data is not organized properly then OLAP will not work well.

I worked on a design recently where they turned a star schema into a "constellation". It was so badly snowflaked it was almost an OLTP normalized design. It was a textbook example of a wrong design. The client did not know the difference. The in-house architects did nothing to fix it. It was a poor design. I did my job and told them about it. I also created a good set of documentation to go with the ETL we wrote. I did more than what was expected.

Unless you organize the metadata, you are not done. Unless you document the jobs, you are not done. If the jobs are not done in a consistent manner, then your warehouse is less than it should be. If you do not have integrity checks or audit trails, then your work is incomplete. If you want a job well done, then go the extra mile. Anyone can move data around with DataStage. It is not difficult. Data needs to become information which helps companies make good decisions. One good decision can pay for the whole data warehouse or help a company survive. Good solutions take professionals.

I think my 2 cents has exploded.

Kim.
Mamu Kim
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Today's word is "exposition", and you've just had an excellent exposition of the issues. It is definitely possible to have DataStage do all these things, but it must be done in such a way that the records are kept in persistent storage (text files or database tables, it doesn't matter).
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.