incremental loading

Post questions here relating to DataStage Server Edition, for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

vijaykumar
Participant
Posts: 228
Joined: Tue Oct 03, 2006 7:08 pm

Post by vijaykumar »

Hi Gurus,

In incremental loading, we load the records between the last successful load and the current run.

Actually, I have never done incremental loading. How do we know when the last successful load was?

Can the gurus please help me analyze this?

cheers;
Vijay
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

Create a table and manage "Process Metadata" with it. Store information about every run, and use that metadata to feed subsequent processes. That metadata could be configuration information, such as a "From Date" column.
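A minimal sketch of what such a table could look like (all names here are illustrative, not something prescribed above):

CREATE TABLE etl_process_metadata (
    job_name    VARCHAR(64) NOT NULL,
    run_start   TIMESTAMP   NOT NULL,
    run_end     TIMESTAMP,
    run_status  VARCHAR(16),          -- e.g. RUNNING / SUCCESS / FAILED
    rows_loaded INTEGER,
    from_date   TIMESTAMP             -- the "From Date" fed to the next run
);

-- The next run derives its extraction window from the last successful run:
SELECT MAX(run_end)
FROM   etl_process_metadata
WHERE  job_name   = 'LOAD_ORDERS'
AND    run_status = 'SUCCESS';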
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
DSguru2B
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

Usually two timestamps are stored in a control table. These timestamps are passed as job parameters to the SQL select (select * from table where date between #date1# AND #date2#). Once your cycle finishes, or before your next run, the control table is updated/inserted with new date values and selected again to be passed as parameters. This will work if the records you are selecting have a date/timestamp column.
The other way is to do a lookup against the target table; new records go in as incremental inserts.
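A rough sketch of that flow in SQL (etl_control and its column names are mine; #date1#/#date2# stand in for the DataStage job parameters as in the select above):

-- 1. Read this run's window from the control table:
SELECT last_load_ts, current_load_ts
FROM   etl_control
WHERE  job_name = 'LOAD_ORDERS';

-- 2. Plug those values into the extraction SQL as job parameters:
SELECT *
FROM   orders
WHERE  update_ts BETWEEN '#date1#' AND '#date2#';

-- 3. After a successful cycle, roll the window forward for the next run:
UPDATE etl_control
SET    last_load_ts    = current_load_ts,
       current_load_ts = CURRENT_TIMESTAMP
WHERE  job_name = 'LOAD_ORDERS';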
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

For DSguru2B's second option, you can use a staging table to retain the last run's data, rather than doing a lookup against the full target data, if your requirement only concerns the last and current loads.
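As a sketch (table names invented for illustration), the new extract is anti-joined against the previous run's staging copy instead of the whole target:

-- Rows not present in the previous run's staging table are the new inserts:
SELECT s.*
FROM   stage_current s
LEFT JOIN stage_previous p
       ON p.order_id = s.order_id
WHERE  p.order_id IS NULL;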
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
rleishman
Premium Member
Posts: 252
Joined: Mon Sep 19, 2005 10:28 pm
Location: Melbourne, Australia

Post by rleishman »

IMO this is better done at the extraction phase, not at load time.

For delta-extraction, you can use one of the methods described above, or if extracting from an Oracle database you can look into Change Data Capture - it's a non-invasive way to capture all the changes on a table without timestamp columns, full table scans, or - with 10g - even accessing the source table itself (deltas are scraped straight from Archive Logs).
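For anyone curious what the subscriber side of Oracle CDC looks like, here is a rough sketch of the 10g DBMS_CDC_SUBSCRIBE flow from memory (the change set, table, and column names are invented, and parameter names differ between 9i and 10g, so verify against the Oracle documentation before relying on this):

BEGIN
   -- Register a subscription against an existing change set:
   DBMS_CDC_SUBSCRIBE.CREATE_SUBSCRIPTION(
      change_set_name   => 'ORDERS_SET',
      description       => 'Orders delta feed for the warehouse',
      subscription_name => 'ORDERS_SUB');

   -- Subscribe to the columns of interest on the source table:
   DBMS_CDC_SUBSCRIBE.SUBSCRIBE(
      subscription_name => 'ORDERS_SUB',
      source_schema     => 'APP',
      source_table      => 'ORDERS',
      column_list       => 'ORDER_ID, AMOUNT, UPDATE_TS',
      subscriber_view   => 'ORDERS_DELTA_V');

   -- Activate, then extend the window to expose the latest changes;
   -- the ORDERS_DELTA_V view is then read like any other source table:
   DBMS_CDC_SUBSCRIBE.ACTIVATE_SUBSCRIPTION(subscription_name => 'ORDERS_SUB');
   DBMS_CDC_SUBSCRIBE.EXTEND_WINDOW(subscription_name => 'ORDERS_SUB');
END;
/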
Ross Leishman
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne

Post by vmcburney »

I did a blog post on this a long time ago. I think IBM's Oracle Change Data Capture and Replication Server products both add triggers to the Oracle tables, so you might have to go for an Oracle product to get log scraping. Even then you still need to do change data determination: at the end of transformation, you compare your delta data to your target table to see if there really are any changes, and split those into inserts and updates for optimised database loads.

So some type of delta off the database and change data capture in DataStage works well.

You can build a change data capture capability in DataStage using the CRC32 function; search the archives for examples. It is less messy than trying to compare dozens of columns in a Transformer stage. In parallel jobs you have the Change Capture and Change Apply stages to do this.
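The comparison itself is usually done in a Transformer or lookup, but the determination logic amounts to something like this sketch (names are illustrative: ref_crc holds each key's CRC32 from the last load, and the staging-side crc_value is the CRC32 computed over the concatenated columns in the job):

-- Unmatched keys become inserts; matched keys with a changed CRC become updates:
SELECT s.order_id,
       CASE
          WHEN r.order_id  IS NULL        THEN 'INSERT'
          WHEN r.crc_value <> s.crc_value THEN 'UPDATE'
          ELSE 'SKIP'
       END AS change_type
FROM   stage_orders s
LEFT JOIN ref_crc r
       ON r.order_id = s.order_id;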
rleishman
Premium Member
Posts: 252
Joined: Mon Sep 19, 2005 10:28 pm
Location: Melbourne, Australia

Post by rleishman »

I think the CDC triggers/packages are a 9i thing, and 10g does it straight from the logs. I've never used it, so I can't swear to it.
Ross Leishman