Comparing a new db snapshot to the previous and get a delta

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
dmotoc
Participant
Posts: 6
Joined: Mon Oct 13, 2008 11:30 am
Location: Canada

Comparing a new db snapshot to the previous and get a delta

Post by dmotoc »

Hello there

I'm trying to compare a new snapshot table (Oracle) with the previous, geta delta and then apply the delta to a destination table

The size of the snapshot(s) is quite big, 2-3 mil rows.

Which approach do you recomand? using datasets, Outer Joins?...

Thanks for your assistence
D
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Welcome aboard.

Many approaches are available. First question is how accessible the previous snapshot is. It's probable that Oracle has a comparison tool (I don't know).

If you want to do it within DataStage, probably the best approach is to use one of the change detection stage types, namely Difference, Compare or Change Capture. These are all similar, but subtly different. It depends what you want as the result which you choose.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
dmotoc
Participant
Posts: 6
Joined: Mon Oct 13, 2008 11:30 am
Location: Canada

Post by dmotoc »

Thanks for your reply Ray

I want to do it in DataStage, as a proof of concept.

Both snapshots are in the same database. Their size is bigger than I thought at the beginning, they have arround 20 mil rows each.

By comparing the earlier snapshot to a latter snapshot we can get the static set and the changed set. The set of differences can be split into 'old rows no longer in the base' (delta1) and 'rows added to the base' (delta2)
Bases should be compared on the defined PK of the target table.

I tried the ChangeCapture stage...question is: can I have the two input links straight from the two transformers? Or should they go through an intermediary stage? Do they need to be sorted before?
I marked as keys the fields that are making the PK on the target table leaving ALL the other fields as values. Not sure if this is correct...

Thanks for your help
dougcl
Premium Member
Premium Member
Posts: 137
Joined: Thu Jun 24, 2010 4:28 pm

Post by dougcl »

The inputs to the CDC stage are supposed to be sorted on the primary key. If you don't put one there, one will be added automatically, unless you have prevented sort insertion via environment variable.

It should be correct to compare all columns except the primary key(s).
Post Reply