SCD Type 1 - Removing duplicate deliveries

bmsq
Premium Member
Posts: 32
Joined: Mon Oct 30, 2006 9:19 pm

SCD Type 1 - Removing duplicate deliveries

Post by bmsq »

Hi all,

I've got SCD Type 1 working correctly within DataStage, so I can now figure out which records I have previously processed. However, I'm now struggling to figure out which records I have previously delivered.

I don't know if that really makes sense, but here is the current solution:

1. I use a lookup table to check whether the current record has been previously processed.
2. If it hasn't been previously processed, generate a key and insert it. Otherwise, it has been previously processed; use a Change Capture stage to check whether the record has changed.
3. If the data has changed, update the database (a sketch of this flow follows below).
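
To make that concrete, here is a rough Python sketch of the three steps (the real job uses DataStage Lookup and Change Capture stages; every name below is a made-up stand-in, not a DataStage API):

Code:

def load_lookup_snapshot():
    # Stand-in for the lookup table as it is when the job starts.
    return {"K1": {"skey": 100, "attrs": {"qty": 5}}}

lookup = load_lookup_snapshot()   # step 1: loaded once, up front
target_db = {100: {"qty": 5}}     # stand-in for the target table
next_skey = 101

def process(record):
    global next_skey
    match = lookup.get(record["bkey"])
    if match is None:                           # not previously processed
        target_db[next_skey] = record["attrs"]  # step 2: generate key, insert
        next_skey += 1
    elif record["attrs"] != match["attrs"]:     # change capture: has it changed?
        target_db[match["skey"]] = record["attrs"]  # step 3: Type 1 overwrite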

This works fine as long as the target database and the lookup table are in sync. However, synchronisation is easily broken when I start to receive a new record multiple times. In this situation, the record gets inserted into the database the first time, but once it has been inserted, the lookup table is out of date. Any time I get that same record again, the lookup table will not recognise that it has already been inserted and therefore requires an update instead of another insert.
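
Continuing the sketch above, the failure is easy to reproduce: a brand-new key arriving twice in one run gets inserted twice, because the lookup snapshot never learns about the first insert:

Code:

process({"bkey": "K2", "attrs": {"qty": 1}})   # new key: inserted as skey 101
process({"bkey": "K2", "attrs": {"qty": 2}})   # same key again: the stale
                                               # snapshot still has no "K2",
                                               # so it is inserted again (102)
                                               # instead of updating row 101
print(target_db)   # {100: {'qty': 5}, 101: {'qty': 1}, 102: {'qty': 2}}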

Can I use the target database as the reference link for the lookup? When using a sparse lookup, will the lookup recognise the newly inserted records? Does this cause performance issues (reading from and writing to the same database table)?

Are there any alternative ways of looking at this? I would prefer to avoid staging followed by post-processing of the incremental updates before a final flush of the changes to production. A sort and remove-duplicates approach is also out of the question due to the nature of our data source.

Any input would be highly valued,

Barry
Raghavendra
Participant
Posts: 147
Joined: Sat Apr 30, 2005 1:23 am
Location: Bangalore,India

Post by Raghavendra »

Yes, you can use the target database as the reference lookup.
If you use a sparse lookup, then the lookup will definitely recognise the newly inserted records.
Reading and writing to the same database table won't be a problem.
But when you use a sparse lookup you get performance issues if your driving and reference data ratio is more than 1:100.
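
Roughly, the difference between the two lookup types looks like this (illustrative Python, not DataStage internals; load_reference and query_db are made-up names):

Code:

def normal_lookup(driving_rows, load_reference):
    ref = dict(load_reference())        # one bulk read at job start;
    for row in driving_rows:            # the in-memory copy can go stale
        yield row, ref.get(row["bkey"])

def sparse_lookup(driving_rows, query_db):
    for row in driving_rows:
        # one database round trip per driving row: always current, so it
        # sees rows inserted earlier in the same run, but the per-row
        # cost dominates when driving rows vastly outnumber reference rows
        yield row, query_db(row["bkey"])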
bmsq
Premium Member
Posts: 32
Joined: Mon Oct 30, 2006 9:19 pm

Post by bmsq »

Thanks for your reply, that helps a lot! :D
But when you use a sparse lookup you get performance issues if your driving and reference data ratio is more than 1:100
What do you mean by this ratio? Are you saying that if the size of the reference data is large compared to the driving data then there will be performance issues? Or are you saying the opposite? In my case, the driving data is in the millions, whereas the reference data is less than 2000 rows and rarely needs updating.
Nageshsunkoji
Participant
Posts: 222
Joined: Tue Aug 30, 2005 2:07 am
Location: pune
Contact:

Post by Nageshsunkoji »

Hi,

For scenarios where the number of input rows is significantly smaller (1:100 or more) than the number of reference rows in a DB2 or Oracle table, a Sparse Lookup may be appropriate.

I think your situation is the opposite: in your case the driving rows are huge and the reference rows are few. So, it will definitely hit performance.
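
As a back-of-envelope check with the volumes mentioned above (the driving-row count is an assumed stand-in for "millions"):

Code:

driving_rows   = 5_000_000   # assumed figure for "millions" of incoming records
reference_rows = 2_000       # lookup table size stated earlier in the thread

# Sparse lookup costs roughly one database round trip per driving row.
# The guideline favours the opposite shape, a driving-to-reference ratio
# of 1:100 or smaller; here it is about 2500:1.
print(driving_rows // reference_rows)   # 2500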
NageshSunkoji

If you know anything SHARE it.............
If you Don't know anything LEARN it...............
Raghavendra
Participant
Posts: 147
Joined: Sat Apr 30, 2005 1:23 am
Location: Bangalore,India

Post by Raghavendra »

Hi,

I believe a sparse lookup is not the right option for your scenario in terms of performance.