Page 1 of 1

Comparing a new db snapshot to the previous and get a delta

Posted: Wed Mar 30, 2011 10:10 pm
by dmotoc
Hello there

I'm trying to compare a new snapshot table (Oracle) with the previous, geta delta and then apply the delta to a destination table

The size of the snapshot(s) is quite big, 2-3 mil rows.

Which approach do you recomand? using datasets, Outer Joins?...

Thanks for your assistence
D

Posted: Wed Mar 30, 2011 10:39 pm
by ray.wurlod
Welcome aboard.

Many approaches are available. First question is how accessible the previous snapshot is. It's probable that Oracle has a comparison tool (I don't know).

If you want to do it within DataStage, probably the best approach is to use one of the change detection stage types, namely Difference, Compare or Change Capture. These are all similar, but subtly different. It depends what you want as the result which you choose.

Posted: Sat Apr 02, 2011 9:29 pm
by dmotoc
Thanks for your reply Ray

I want to do it in DataStage, as a proof of concept.

Both snapshots are in the same database. Their size is bigger than I thought at the beginning, they have arround 20 mil rows each.

By comparing the earlier snapshot to a latter snapshot we can get the static set and the changed set. The set of differences can be split into 'old rows no longer in the base' (delta1) and 'rows added to the base' (delta2)
Bases should be compared on the defined PK of the target table.

I tried the ChangeCapture stage...question is: can I have the two input links straight from the two transformers? Or should they go through an intermediary stage? Do they need to be sorted before?
I marked as keys the fields that are making the PK on the target table leaving ALL the other fields as values. Not sure if this is correct...

Thanks for your help

Posted: Sat Apr 02, 2011 11:27 pm
by dougcl
The inputs to the CDC stage are supposed to be sorted on the primary key. If you don't put one there, one will be added automatically, unless you have prevented sort insertion via environment variable.

It should be correct to compare all columns except the primary key(s).