Page 1 of 1

How to capture deletes incrementally

Posted: Wed Jun 11, 2008 1:44 pm
by casedwgroup
First of all, I apologize if this has been discussed here before. In looking through other topics regarding change data detection, I wasn't able to find an answer to what I'm looking for, but I may have missed it somewhere. If so, please point me in that direction. Most posts that I saw regarding deletes also mentioned Parallel version and/or a CDC stage that has to be purchased separately. Neither of those are currently an option.

I am wondering what the best way to go about capturing deletes from a source system is. Both source and target databases are Oracle 10g. Currently, we have a number of staging jobs that use CRC32 logic to load inserts and updates incrementally (Source stage -> Merge stage -> Transform to compare CRC rows -> Split stage -> Target stage). This takes care of capturing any new rows or any rows where a field has changed.

However, we currently have no process in place to handle deleted rows. There is no CDC on the source side. What are my options for taking care of this? I also need to take into account how to propagate the deleted rows into the fact and dimension tables.

Any information in regards to this would be helpful. Thanks!

Posted: Wed Jun 11, 2008 3:15 pm
by mikegohl
Most data warehouses don't normally delete rows from a dimension table because fact will still exist. They normally deactivated by setting a flag or change the end date. Deleting fact rows seems like a strange requirement.

Posted: Wed Jun 11, 2008 5:34 pm
by chucksmith
You basically need to reverse your insert/update logic to detect deletes. You must answer the question, "is this row in the target, but not found in the source?" You could dump your current source to a hashed file, then read from the target and lookup the row in the hashed source. Your output link would use a NOTFOUND link variable for the constraint.

This is a lot of I/O, but it is local to the DataStage server.

Or, if you have a copy of the FULL source you loaded yesterday, you can compare it with today's source to identify the deletes.

I have also used the UNIX COMM command to help identify rows unique to today's and yesterday's source file, then used these smaller unique files to identify inserts, updates, and deletes.

Posted: Wed Jun 11, 2008 7:03 pm
by JoshGeorge
Table differencing is one way to capture delete delta on a non-audit (without change recording columns) table without Change Data Capture. Maintain a copy of the table (Staging area? Anywhere you can) and use SQL MINUS operator to find the delete delta:

Code: Select all

SELECT * FROM previous_version MINUS SELECT * FROM current_version;

Or just maintain the key columns of the table and do a LEFT OUTER JOIN or MINUS operation. If you are maintaining a Type 1 slowly changing dimension then you can consider dealing straight with the target.

Posted: Thu Jun 12, 2008 7:15 am
by casedwgroup
Thanks for the replies, everyone. I think we're probably going to end up flagging the deleted rows in fact tables. I just wanted to get some options from people that have actually dealt with this before, so I appreciate the suggestions.

If I have any issues with the implementation, I'll either open this topic again or start a new one. Thanks again.