Issues with SCD & CDC?

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

etldwh_techie2050
Participant
Posts: 9
Joined: Mon Jun 29, 2009 4:27 pm

Issues with SCD & CDC?

Post by etldwh_techie2050 »

Experts,
This thread continues an earlier discussion. Please refer to the following topic:
viewtopic.php?p=334305#334305

Can you please address the problem Parag mentioned in his reply, quoted below?
But there is a problem with both these stages: the input must not contain multiple records with the same key. For example, if your key is Client_Id and the same client performs several different transactions, your system will hold multiple records with the same Client_Id. If you feed these records into the CDC or SCD stage, only the first record is processed correctly; all the rest are tagged as inserts (Change_Code = 1) and sent downstream even when they are exact copies. So you need to keep this data scenario in mind.
How do we deal with this case? If there are 3 records with the same Client_Id and all of them have changes in some columns, then per Parag only the first record will be processed properly and the other two will be treated as inserts. Will there be 5 records after the SCD or CDC stage in this case? How can we avoid this, Kunal or Parag? What do you experts recommend here?
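To make the quoted problem concrete, here is a minimal Python simulation (illustrative only, not the DataStage stage itself; the `Client_Id`/`balance` columns and the CDC-style change codes 0/1/3 are assumed for the example). Because the compare is keyed on Client_Id alone, there is only one target row per key, so once it is consumed every further source row with that key falls through as an insert:

```python
# Target ("before") side keyed on Client_Id only: one row per key.
before = {("C1",): {"balance": 100}}

# Source ("after") side with duplicate keys.
source = [
    {"client_id": "C1", "balance": 100},  # exact copy of the target row
    {"client_id": "C1", "balance": 200},  # changed
    {"client_id": "C1", "balance": 100},  # another exact copy
]

def change_capture(before, source):
    """Tag each source row CDC-style: 0 = copy, 1 = insert, 3 = edit."""
    seen = set()
    out = []
    for row in source:
        key = (row["client_id"],)
        if key in before and key not in seen:
            # The single target row for this key is consumed here.
            seen.add(key)
            code = 0 if before[key]["balance"] == row["balance"] else 3
        else:
            # No unmatched target row left for this key: tagged insert.
            code = 1
        out.append((row, code))
    return out
```

Running this tags the three rows 0, 1, 1: only the first duplicate is compared properly, and the later exact copy still comes out as an insert, which is exactly the behavior Parag describes.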
Thank you very much!
etldwh_techie2050
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

I would think that your business rules regarding duplicate records would drive the solution. Are all of them processed? Only the first one encountered? Only the last / most recent record? The answer affects the solution.
-craig

"You can never have too many knives" -- Logan Nine Fingers
priyadarshikunal
Premium Member
Posts: 1735
Joined: Thu Mar 01, 2007 5:44 am
Location: Troy, MI

Post by priyadarshikunal »

You probably want to process all the records. But sending every source transaction to be compared against every target transaction is not going to work, as there must be proper keys to uniquely identify each record.

Even if you are comparing 3 transactions in the source against 1 transaction in the target, you should not compare each source record with the target record in an SCD (in the general data warehouse sense, not the stage). You need to compare each record with its previous record.
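A rough sketch of the "compare each record with its previous record" idea (column names `client_id`, `txn_ts`, and `amount` are made up for illustration): sort by the key plus an ordering column, then diff each row against the adjacent row in the same key group instead of diffing every row against the target:

```python
def tag_rows(rows):
    """Sort by key, then by transaction order, and tag each row by
    comparing it with the previous row in the same key group."""
    rows = sorted(rows, key=lambda r: (r["client_id"], r["txn_ts"]))
    tags, prev = [], None
    for r in rows:
        if prev is None or prev["client_id"] != r["client_id"]:
            # First row for this key: this is the one that gets
            # compared against the target.
            tags.append("first-for-key")
        elif prev["amount"] != r["amount"]:
            # Delta within the same key, detected row-to-row.
            tags.append("changed-vs-previous")
        else:
            tags.append("duplicate")
        prev = r
    return tags

rows = [
    {"client_id": "C1", "txn_ts": 2, "amount": 50},
    {"client_id": "C1", "txn_ts": 1, "amount": 40},
    {"client_id": "C2", "txn_ts": 1, "amount": 10},
]
print(tag_rows(rows))
```

This is the delta-processing shape Kunal mentions: only the first row per key needs the target lookup, so duplicate keys no longer all collapse into inserts.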

It is possible to overcome this problem in delta processing, but it is near to impossible in a full comparison (mind the word "near").

But it all depends on your answers to the questions Craig asked.
Priyadarshi Kunal

Genius may have its limitations, but stupidity is not thus handicapped. :wink:
etldwh_techie2050
Participant
Posts: 9
Joined: Mon Jun 29, 2009 4:27 pm

Post by etldwh_techie2050 »

Chulett, all records are processed to keep history information, which is then displayed through web interfaces.
Thank you very much!
etldwh_techie2050
etldwh_techie2050
Participant
Posts: 9
Joined: Mon Jun 29, 2009 4:27 pm

Post by etldwh_techie2050 »

Kunal, thanks for your reply.
Thank you very much!
etldwh_techie2050
jcthornton
Premium Member
Posts: 79
Joined: Thu Mar 22, 2007 4:58 pm
Location: USA

Post by jcthornton »

The way to deal with this case is very carefully.

As indicated, only the first record for a given key will be matched. All others will come through as Inserts. You will need to identify the duplicate keys outside of the CDC/SCD stage.

One approach is to do this in a Transform stage.

1. Sort and partition the inputs by the keys. Add further sort (but not partition) criteria on the fields that determine the order in which records were created. These additional sort fields are usually in descending order (newest first).

2. On a change of key (use stage variables to detect this), apply your derivations as for a new insert / most recent Type 2 record.

3. On a repeated key, apply your derivations for superseded records.

Rinse and repeat until done.

I suggest sorting in descending order by date because usually the derivations are best determined that way. If your logic goes the other way, reverse the sort as needed.

If it should happen that both the prior record (if it exists) and the following record (if it exists) are needed for calculations, it will take two transformers in sequence with reversed sorts.
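The steps above can be sketched in Python (illustrative only: the column names are assumed, and in an actual parallel job the key-change check would live in transformer stage variables rather than a loop variable). Rows are sorted key-major with the newest first inside each key, so the first row seen per key becomes the current Type 2 record and every repeat is marked superseded:

```python
from operator import itemgetter

def mark_history(rows):
    """Sort key-major, newest first within key, and flag the first row
    per key as current ('Y') and all repeats as superseded ('N')."""
    # Two stable sorts: secondary (date, descending) first, then
    # primary (key), which preserves the date order within each key.
    rows = sorted(rows, key=itemgetter("created"), reverse=True)
    rows = sorted(rows, key=itemgetter("client_id"))
    prev_key = None  # plays the role of a transformer stage variable
    for r in rows:
        r["current_flag"] = "Y" if r["client_id"] != prev_key else "N"
        prev_key = r["client_id"]
    return rows

rows = [
    {"client_id": "C1", "created": "2009-06-01", "value": "a"},
    {"client_id": "C1", "created": "2009-06-15", "value": "b"},
    {"client_id": "C2", "created": "2009-06-10", "value": "x"},
]
for r in mark_history(rows):
    print(r["client_id"], r["created"], r["current_flag"])
```

With the sample rows, C1's 2009-06-15 record and C2's only record come out flagged current, while C1's older record is superseded, matching the "newest first" derivation order described above.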

As always, there are plenty of ways to complicate a design, especially when building off of partial explanations and hypothetical cases.
Jack Thornton
----------------
Spectacular achievement is always preceded by spectacular preparation - Robert H. Schuller