Implementing SCD in server job

allavivek
Premium Member
Posts: 211
Joined: Sat May 01, 2010 5:07 pm


Post by allavivek »

Hi All,

I am trying to implement SCD2 in a server job:

Code:

ODBC (Dim table)------->Transformer-------->SEQ (capturing updates)
                             |
                             |
                        hashed file
                             |
                             |
          ODBC (source / reference)
If any record gets updated from the reference, a new record with the same key is created with the updated values, which is basically SCD2.
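
For illustration, here is a minimal plain-Python sketch (not DataStage code, and with made-up column names such as cust_id and current_flag) of what a Type 2 change does to the dimension table: the active row for the key is expired and a new version of the same key is appended.

Code:

from datetime import date

# The dimension table, reduced to a list of dicts for the sketch.
dim_rows = [
    {"cust_id": 101, "city": "Denver", "eff_from": date(2010, 1, 1),
     "eff_to": None, "current_flag": "Y"},
]

def apply_scd2_change(rows, cust_id, new_city, change_date):
    """Expire the active row for cust_id and append a new version."""
    for row in rows:
        if row["cust_id"] == cust_id and row["current_flag"] == "Y":
            row["eff_to"] = change_date       # close the old version
            row["current_flag"] = "N"
    rows.append({"cust_id": cust_id, "city": new_city,
                 "eff_from": change_date, "eff_to": None,
                 "current_flag": "Y"})        # new version, same key

apply_scd2_change(dim_rows, 101, "Sydney", date(2010, 7, 13))
# dim_rows now holds two versions of key 101: the expired Denver row
# and the current Sydney row.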

My problem is with the hashed file: my duplicate-key records are getting overwritten, so during the lookup I am getting wrong results.

Q. Is there any option in the Hashed File stage to disable overwriting of records, or is there another way?

Q. Is there any other way to implement SCD2 in server jobs?

Thanks All...
Last edited by allavivek on Tue Jul 13, 2010 1:32 am, edited 3 times in total.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Wrap your design in Code tags so we can see what connects to what.

Without that there's not much sense in trying to essay an answer.

The usual solution for SCD2 in server jobs involves two Hashed File stages referring to the same hashed file.
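
As a rough illustration of that read/write-back pattern (plain Python, with a dict standing in for the hashed file; nothing here is DataStage syntax), the point is that whatever one stage writes for a key is exactly what the other stage returns for the next row with that key:

Code:

hashed_file = {}   # key -> latest known dimension values

def process_row(key, values):
    """Look up the current version for the key, then write the new one back."""
    current = hashed_file.get(key)        # reference lookup (reading stage)
    if current == values:
        return None                       # nothing changed: do nothing
    hashed_file[key] = values             # write back (writing stage)
    return ("insert" if current is None else "update", key, values)

# Two changes to the same key in one run: the second call sees the
# values written by the first, which is why caching must be disabled.
print(process_row("101", {"city": "Denver"}))   # ('insert', '101', ...)
print(process_row("101", {"city": "Sydney"}))   # ('update', '101', ...)
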
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
allavivek
Premium Member
Posts: 211
Joined: Sat May 01, 2010 5:07 pm

Post by allavivek »

Hi Ray,

Thanks for the reply.

I have edited my post to show the job design; please give me a suggestion for solving this.

thank you
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Make sure that caching is disabled in both stages, and that the writing Hashed File stage is set to "lock for update".
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
allavivek
Premium Member
Posts: 211
Joined: Sat May 01, 2010 5:07 pm

Post by allavivek »

Hi Ray,

I tried what you suggested, but I am not getting the expected result.

My understanding of "pre-load file to memory" is that, if it is set, the lookup is done from memory, and if the record is not found there, the lookup is done against the file.

With that setting, if I get a record that has already been updated twice before, I will not get the expected result; I will get the old values instead of the new ones.

Correct me if I am wrong.

thank you
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Do not pre-load to memory. You want to be able to write to the hashed file as the job runs. Likewise, do not enable write cache: you want to ensure that a record, once written, is immediately available when the next row is processed.
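
A tiny Python sketch (purely illustrative, nothing DataStage-specific) of why a pre-loaded or write-cached lookup goes stale: the cache is a snapshot taken at job start, so rows written back during the run are only visible if the lookup goes against the live file.

Code:

live = {"101": "v1"}     # hashed file contents at job start
cache = dict(live)       # "pre-load file to memory" takes a snapshot

live["101"] = "v2"       # a row is written back during the run

print(cache["101"])      # v1 -- stale: the old values, as described above
print(live["101"])       # v2 -- what an uncached lookup returns
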
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
vivekgadwal
Premium Member
Posts: 457
Joined: Tue Sep 25, 2007 4:05 pm

Post by vivekgadwal »

ray.wurlod wrote:Make sure that caching is disabled in both stages, and that the writing Hashed File stage is set to "lock for update".
That is a good suggestion. I also think the problem is with the data, or with how it is being fetched. He seems to have multiple rows with the same hashing key, which will get overwritten in the hashed file.

OP - Why can't you do a direct lookup against the database?
Vivek Gadwal

Experience is what you get when you didn't get what you wanted
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

They could, as long as they are willing to commit every record as it is written to the target database and take the performance hit on the lookup as well. Much better to stick with the hashed file lookup and set it up properly; it really is simple - especially after you've done it once. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
vivekgadwal
Premium Member
Premium Member
Posts: 457
Joined: Tue Sep 25, 2007 4:05 pm

Post by vivekgadwal »

chulett wrote: They could, as long as they are willing to commit every record as it is written to the target database and take the performance hit on the lookup as well. Much better to stick with the hashed file lookup and set it up properly; it really is simple - especially after you've done it once. :wink:
True. Hashed file lookups are the way to go.

However, the OP's job design shows that he is capturing the updates in a Sequential file. That means he is not too worried about the updates happening immediately. Hence my suggestion to go directly to the database source. :)

Also, another factor behind my suggestion is the problem he is having: because he is placing the rows in a hashed file, rows with the same key overwrite one another, and as far as I know there is no way to override that behaviour in a hashed file.
Vivek Gadwal

Experience is what you get when you didn't get what you wanted
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

It's hard to tell exactly what is going on with the OP's job, as what was posted doesn't make much sense, at least to me. However, in general it is quite straightforward to do Type 2 SCDs in a Server job, and the proper way to do it is what Ray posted: the detail records first get populated into a reference hashed file - not just the keys, but all appropriate elements. You either don't cache the lookup or, if you do, you make sure you have it 'Locked for updates'. If you get a hashed file hit and determine (in the Transformer via stage variables) that nothing has changed, you do nothing. Hashed file miss? Insert. Hashed file hit and a change? The same Insert plus an Update. In either case you write the entire set of new data back to the reference hashed file so it is available for the next input row.

Yes, there is no way to stop a hashed file write from doing a destructive overwrite for a given set of key fields; that's just how they work. However, in this case we want it to do that, so the hashed file always holds the most current active record in the target.
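
Putting the above into a short plain-Python sketch (illustrative only; in the real job this logic sits in Transformer stage variables and the links to the target database, and the column names here are invented):

Code:

reference = {}             # stands in for the reference hashed file
                           # (full rows, not just the keys)
inserts, updates = [], []  # rows bound for the target dimension table

def process_source_row(key, attrs, eff_date):
    existing = reference.get(key)        # uncached hashed file lookup
    if existing is not None and existing == attrs:
        return                           # hit and nothing changed: do nothing
    if existing is not None:
        # hit with a change: expire the current target row ...
        updates.append({"key": key, "eff_to": eff_date, "current_flag": "N"})
    # ... and on both a miss and a changed hit, insert a new version
    inserts.append({"key": key, **attrs, "eff_from": eff_date,
                    "current_flag": "Y"})
    # write the entire new row back so the next input row sees it
    reference[key] = attrs

process_source_row(101, {"city": "Denver"}, "2010-07-01")  # miss -> insert
process_source_row(101, {"city": "Denver"}, "2010-07-05")  # hit, no change
process_source_row(101, {"city": "Sydney"}, "2010-07-13")  # change -> update + insert
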
-craig

"You can never have too many knives" -- Logan Nine Fingers
vivekgadwal
Premium Member
Posts: 457
Joined: Tue Sep 25, 2007 4:05 pm

Post by vivekgadwal »

chulett wrote: Yes, there is no way to stop a hashed file write from doing a destructive overwrite for a given set of key fields; that's just how they work. However, in this case we want it to do that, so the hashed file always holds the most current active record in the target.
@OP - Craig has given you a very detailed explanation of how to do this within a Server job; it doesn't get any better than that. You could accomplish the whole thing within one job. It is not very clear why you are capturing only the updates in a Sequential File stage while saying that you are implementing SCD2 in this job. If it is something else you are trying to implement, post more detail so that we can help accordingly. :)
Vivek Gadwal

Experience is what you get when you didn't get what you wanted