Multiple Jobs updating the same Hash File

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

jkumar5444
Premium Member
Posts: 7
Joined: Wed Feb 14, 2007 12:57 pm

Multiple Jobs updating the same Hash File

Post by jkumar5444 »

I am trying to clean up some existing jobs. There are 2 feeds coming in from different systems. Each of them has CUSTID as the primary key. There is a hash file (CUST_ID.hsh) that contains the CUSTID along with a generated Surrogate Key (DWH_CUST_ID). The jobs are using SDKSequences to maintain the current surrogate key value.
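
Roughly, the key-assignment derivation in each stream follows this pattern (the reference link name lkpCust and the sequence name "DWH_CUST_ID" are placeholders, not the real names):

    If lkpCust.NOTFOUND Then KeyMgtGetNextValue("DWH_CUST_ID") Else lkpCust.DWH_CUST_ID

In other words, each stream does a reference lookup against CUST_ID.hsh on CUSTID and only draws a new value from SDKSequences when the lookup fails.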

Each job stream updates the CUST_ID.hsh hash file with new CUSTID and DWH_CUST_ID values. There is a situation where each feed may have a new CUSTID and may try to insert the same DWH_CUST_ID value for two different CUSTIDs. Obviously, the CUST_ID.hsh file will only take one of the rows and reject the other. I want to avoid that situation.

What is the most robust solution to this issue besides scheduling the jobs at different intervals? Thanks!

Jay.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Welcome aboard.

The rule for the Hashed File stage is "destructive overwrite" - last one in wins.

Therefore your "obviously" statement is incorrect.

But your job logic is dangerous - why not use two separate hashed files?
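
What the stage is doing amounts to a keyed write, which you can see with a few lines of BASIC (the path, keys and values below are invented for the illustration):

    * CUST_ID.hsh opened by path here purely for the example
    OPENPATH "/path/to/CUST_ID.hsh" TO CustFile ELSE STOP "Cannot open hashed file"

    WRITE "1001" ON CustFile, "CUST_A"   ;* CUST_A gets surrogate 1001
    WRITE "1001" ON CustFile, "CUST_B"   ;* different key, same surrogate - both records are kept
    WRITE "2002" ON CustFile, "CUST_A"   ;* same key again - destructive overwrite, 1001 is gone

    READ Rec FROM CustFile, "CUST_A" THEN CRT "CUST_A -> " : Rec
    READ Rec FROM CustFile, "CUST_B" THEN CRT "CUST_B -> " : Rec

Nothing gets rejected: a repeated key is silently replaced, and a repeated surrogate value under two different keys is silently kept.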
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
jkumar5444
Premium Member
Posts: 7
Joined: Wed Feb 14, 2007 12:57 pm

Post by jkumar5444 »

Hi Ray. Thanks for your response and for the information on how the Hashed File stage works. I agree that the logic is dangerous, which is why I was looking for a cleaner alternative. We can't use two separate hashed files because the hashed file is used as a reference in other job streams.

I thought about using database tables instead of hashed files for the lookups. I think that would maintain better integrity, but I'm not convinced it's 100% effective. As I mentioned before, the only alternative I can see is to stagger these job streams so that they don't run at the same time. Please let me know what you think. Thanks.

Jay.
mikegohl
Premium Member
Posts: 97
Joined: Fri Jun 13, 2003 12:50 pm
Location: Chicago
Contact:

Post by mikegohl »

What about letting both jobs use a different range of keys?
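For example (the sequence names here are invented), give each stream its own SDK sequence and interleave the results so they can never collide:

    Feed A derivation (odd surrogates):  KeyMgtGetNextValue("DWH_CUST_ID_A") * 2 - 1
    Feed B derivation (even surrogates): KeyMgtGetNextValue("DWH_CUST_ID_B") * 2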
Michael Gohl
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Re: Multiple Jobs updating the same Hash File

Post by chulett »

jkumar5444 wrote: Each job stream updates the CUST_ID.hsh hash file with new CUSTID and DWH_CUST_ID values. There is a situation where each feed may have a new CUSTID and may try to insert the same DWH_CUST_ID value for two different CUSTIDs.
How is this situation where 'the same DWH_CUST_ID could be assigned two different CUSTIDs' happening? Will staggering the job streams actually prevent this, or is it just that the destructive overwrite into the hashed file happens in the 'right order' when you do that? :?

Is there not a way to correct the surrogate handling so that this 'situation' does not occur?
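
One thing that might be worth testing (the link and sequence names below are just examples): the SDK also ships a KeyMgtGetNextValueConcurrent routine which, as I understand it, locks the SDKSequences record on each call rather than caching values per job, so two streams running in parallel shouldn't be handed the same next value:

    If lkpCust.NOTFOUND Then KeyMgtGetNextValueConcurrent("DWH_CUST_ID") Else lkpCust.DWH_CUST_ID

It won't stop the same brand-new CUSTID arriving in both feeds at the same moment, but it should close the duplicate-surrogate window you describe.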
-craig

"You can never have too many knives" -- Logan Nine Fingers
jkumar5444
Premium Member
Posts: 7
Joined: Wed Feb 14, 2007 12:57 pm

Post by jkumar5444 »

If the jobs are staggered, this situation will not arise because the logic looks up existing keys before assigning new ones. The first stream inserts new surrogate keys; the second stream then finds those existing keys and only assigns new ones for CUSTIDs the first stream didn't load. This works and I have verified the data.

Using a different range of keys will not help if both streams are running at the same time. It would actually increase the likelihood of multiple DWH_CUST_ID values for the same CUSTID.

I appreciate the comments and questions. Please keep them coming. I'm sure I'll find a decent solution from all the great feedback. Thanks.

Jay.
thurmy34
Premium Member
Posts: 198
Joined: Fri Mar 31, 2006 8:27 am
Location: Paris

Post by thurmy34 »

Hi
What do you do with the hashed file after the two jobs have executed?
What do you think about using a parameter for the name of the hashed file?
In that case you'd have to execute the job(s) that read it twice.
Hope This Helps
Regards
jkumar5444
Premium Member
Posts: 7
Joined: Wed Feb 14, 2007 12:57 pm

Post by jkumar5444 »

Thanks to everyone for their generous input. I have decided to stagger the job run times. I am marking this topic as resolved. Thank you once again!

Jay.