How to capture deletes incrementally

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Post Reply
casedwgroup
Premium Member
Premium Member
Posts: 24
Joined: Mon Aug 20, 2007 1:17 pm

How to capture deletes incrementally

Post by casedwgroup »

First of all, I apologize if this has been discussed here before. In looking through other topics regarding change data detection, I wasn't able to find an answer to what I'm looking for, but I may have missed it somewhere. If so, please point me in that direction. Most posts that I saw regarding deletes also mentioned Parallel version and/or a CDC stage that has to be purchased separately. Neither of those are currently an option.

I am wondering what the best way to go about capturing deletes from a source system is. Both source and target databases are Oracle 10g. Currently, we have a number of staging jobs that use CRC32 logic to load inserts and updates incrementally (Source stage -> Merge stage -> Transform to compare CRC rows -> Split stage -> Target stage). This takes care of capturing any new rows or any rows where a field has changed.

However, we currently have no process in place to handle deleted rows. There is no CDC on the source side. What are my options for taking care of this? I also need to take into account how to propagate the deleted rows into the fact and dimension tables.

Any information in regards to this would be helpful. Thanks!
mikegohl
Premium Member
Premium Member
Posts: 97
Joined: Fri Jun 13, 2003 12:50 pm
Location: Chicago
Contact:

Post by mikegohl »

Most data warehouses don't normally delete rows from a dimension table because fact will still exist. They normally deactivated by setting a flag or change the end date. Deleting fact rows seems like a strange requirement.
Michael Gohl
chucksmith
Premium Member
Premium Member
Posts: 385
Joined: Wed Jun 16, 2004 12:43 pm
Location: Virginia, USA
Contact:

Post by chucksmith »

You basically need to reverse your insert/update logic to detect deletes. You must answer the question, "is this row in the target, but not found in the source?" You could dump your current source to a hashed file, then read from the target and lookup the row in the hashed source. Your output link would use a NOTFOUND link variable for the constraint.

This is a lot of I/O, but it is local to the DataStage server.

Or, if you have a copy of the FULL source you loaded yesterday, you can compare it with today's source to identify the deletes.

I have also used the UNIX COMM command to help identify rows unique to today's and yesterday's source file, then used these smaller unique files to identify inserts, updates, and deletes.
JoshGeorge
Participant
Posts: 612
Joined: Thu May 03, 2007 4:59 am
Location: Melbourne

Post by JoshGeorge »

Table differencing is one way to capture delete delta on a non-audit (without change recording columns) table without Change Data Capture. Maintain a copy of the table (Staging area? Anywhere you can) and use SQL MINUS operator to find the delete delta:

Code: Select all

SELECT * FROM previous_version MINUS SELECT * FROM current_version;

Or just maintain the key columns of the table and do a LEFT OUTER JOIN or MINUS operation. If you are maintaining a Type 1 slowly changing dimension then you can consider dealing straight with the target.
Joshy George
<a href="http://www.linkedin.com/in/joshygeorge1" ><img src="http://www.linkedin.com/img/webpromo/bt ... _80x15.gif" width="80" height="15" border="0"></a>
casedwgroup
Premium Member
Premium Member
Posts: 24
Joined: Mon Aug 20, 2007 1:17 pm

Post by casedwgroup »

Thanks for the replies, everyone. I think we're probably going to end up flagging the deleted rows in fact tables. I just wanted to get some options from people that have actually dealt with this before, so I appreciate the suggestions.

If I have any issues with the implementation, I'll either open this topic again or start a new one. Thanks again.
Post Reply