SCD Type 1 - Removing duplicate deliveries

bmsq
Premium Member
Posts: 32
Joined: Mon Oct 30, 2006 9:19 pm

SCD Type 1 - Removing duplicate deliveries

Post by bmsq »

Hi all,

I've got SCD Type 1 working correctly within DataStage, so I can now figure out which records I have previously processed. However, I'm now struggling to figure out which records I have previously delivered.

I don't know if that really makes sense, but here is the current solution:

1. I use a lookup table to check whether the current record has been previously processed.
2. If it hasn't been previously processed, generate a key and insert it. Otherwise, it has been previously processed; use a Change Capture stage to check whether the record has changed.
3. If the data has changed, update the database (a sketch of this flow follows below).
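
To make that concrete, here is a rough Python sketch of the three steps (the real job uses DataStage Lookup and Change Capture stages; every name below is a made-up stand-in, not a DataStage API):

Code:

def load_lookup_snapshot():
    # Stand-in for the lookup table as it is when the job starts.
    return {"K1": {"skey": 100, "attrs": {"qty": 5}}}

lookup = load_lookup_snapshot()   # step 1: loaded once, up front
target_db = {100: {"qty": 5}}     # stand-in for the target table
next_skey = 101

def process(record):
    global next_skey
    match = lookup.get(record["bkey"])
    if match is None:                           # not previously processed
        target_db[next_skey] = record["attrs"]  # step 2: generate key, insert
        next_skey += 1
    elif record["attrs"] != match["attrs"]:     # change capture: has it changed?
        target_db[match["skey"]] = record["attrs"]  # step 3: Type 1 overwrite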

This works fine as long as the target database and the lookup table are in sync. However, synchronisation is easily broken when I start to receive a new record multiple times. In this situation, the record gets inserted into the database the first time, but once it has been inserted, the lookup table is out of date. Any time I get that same record again, the lookup table will not recognise that it has already been inserted and therefore requires an update instead of another insert.
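
Continuing the sketch above, the failure is easy to reproduce: a brand-new key arriving twice in one run gets inserted twice, because the lookup snapshot never learns about the first insert:

Code:

process({"bkey": "K2", "attrs": {"qty": 1}})   # new key: inserted as skey 101
process({"bkey": "K2", "attrs": {"qty": 2}})   # same key again: the stale
                                               # snapshot still has no "K2",
                                               # so it is inserted again (102)
                                               # instead of updating row 101
print(target_db)   # {100: {'qty': 5}, 101: {'qty': 1}, 102: {'qty': 2}}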

Can I use the target database as the reference link for the lookup? When using a sparse lookup, will the lookup recognise the newly inserted records? Does this cause performance issues (reading from and writing to the same database table)?

Are there any alternative ways of looking at this? I would prefer to avoid staging followed by post-processing of the incremental updates before a final flush of the changes to production. A sort and remove-duplicates approach is also out of the question due to the nature of our data source.

Any input would be highly valued,

Barry
Raghavendra
Participant
Posts: 147
Joined: Sat Apr 30, 2005 1:23 am
Location: Bangalore,India

Post by Raghavendra »

Yes, you can use the target database as the reference lookup.
If you use a sparse lookup, then the lookup will definitely recognise the newly inserted records.
Reading and writing to the same database table won't be a problem.
But when you use a sparse lookup you get performance issues if your driving and reference data ratio is more than 1:100.
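
Roughly, the difference between the two lookup types looks like this (illustrative Python, not DataStage internals; load_reference and query_db are made-up names):

Code:

def normal_lookup(driving_rows, load_reference):
    ref = dict(load_reference())        # one bulk read at job start;
    for row in driving_rows:            # the in-memory copy can go stale
        yield row, ref.get(row["bkey"])

def sparse_lookup(driving_rows, query_db):
    for row in driving_rows:
        # one database round trip per driving row: always current, so it
        # sees rows inserted earlier in the same run, but the per-row
        # cost dominates when driving rows vastly outnumber reference rows
        yield row, query_db(row["bkey"])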
bmsq
Premium Member
Posts: 32
Joined: Mon Oct 30, 2006 9:19 pm

Post by bmsq »

Thanks for your reply, that helps a lot! :D
But when you use a sparse lookup you get performance issues if your driving and reference data ratio is more than 1:100
What do you mean by this ratio? Are you saying that if the size of the reference data is large compared to the driving data then there will be performance issues? Or are you saying the opposite? In my case, the driving data is in the millions, whereas the reference data is less than 2000 rows and rarely needs updating.
Nageshsunkoji
Participant
Posts: 222
Joined: Tue Aug 30, 2005 2:07 am
Location: pune
Contact:

Post by Nageshsunkoji »

Hi,

For scenarios where the number of input rows is significantly smaller (1:100 or more) than the number of reference rows in a DB2 or Oracle table, a Sparse Lookup may be appropriate.

I think your situation is the opposite: in your case the driving rows are huge and the reference rows are few. So, it will definitely hit performance.
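
As a back-of-envelope check with the volumes mentioned above (the driving-row count is an assumed stand-in for "millions"):

Code:

driving_rows   = 5_000_000   # assumed figure for "millions" of incoming records
reference_rows = 2_000       # lookup table size stated earlier in the thread

# Sparse lookup costs roughly one database round trip per driving row.
# The guideline favours the opposite shape, a driving-to-reference ratio
# of 1:100 or smaller; here it is about 2500:1.
print(driving_rows // reference_rows)   # 2500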
NageshSunkoji

If you know anything SHARE it.............
If you Don't know anything LEARN it...............
Raghavendra
Participant
Posts: 147
Joined: Sat Apr 30, 2005 1:23 am
Location: Bangalore,India

Post by Raghavendra »

Hi,

I believe a sparse lookup is not the right option for your scenario in terms of performance.