Multiple Jobs updating the same Hash File
Moderators: chulett, rschirm, roy
-
- Premium Member
- Posts: 7
- Joined: Wed Feb 14, 2007 12:57 pm
Multiple Jobs updating the same Hash File
I am trying to clean up some existing jobs. There are 2 feeds coming in from different systems. Each of them has CUSTID as the primary key. There is a hash file (CUST_ID.hsh) that contains the CUSTID along with a generated Surrogate Key (DWH_CUST_ID). The jobs are using SDKSequences to maintain the current surrogate key value.
Each job stream updates the CUST_ID.hsh hash file with new CUSTID and DWH_CUST_ID values. There is a situation where each feed may have a new CUSTID and may try to insert the same DWH_CUST_ID value for two different CUSTIDs. Obviously, the CUST_ID.hsh file will only take one of the rows and reject the other. I want to avoid that situation.
What is the most robust solution to this issue besides scheduling the jobs at different intervals? Thanks!
Jay.
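To make the race concrete, here is a minimal Python sketch of the situation described above. All names are illustrative stand-ins (the dict for CUST_ID.hsh, the counter for the SDKSequences value); the point is that two concurrent streams can each snapshot the same "next key" before either commits it, so two different CUSTIDs receive the same DWH_CUST_ID.

```python
# Hypothetical sketch of the race described above. The dict stands in for
# CUST_ID.hsh and next_key for the SDKSequences counter; names are invented.

next_key = 100                 # current surrogate-key counter
cust_to_surrogate = {}         # CUSTID -> DWH_CUST_ID

def assign_key_unsafely(custid, snapshot_of_next_key):
    """Assign a surrogate from a stale snapshot, as a concurrent job would."""
    cust_to_surrogate[custid] = snapshot_of_next_key
    return snapshot_of_next_key + 1

# Both feeds read the counter before either writes its increment back:
snap_a = next_key
snap_b = next_key
assign_key_unsafely("CUST-A1", snap_a)   # feed 1 assigns 100
assign_key_unsafely("CUST-B7", snap_b)   # feed 2 also assigns 100

duplicates = len(cust_to_surrogate) != len(set(cust_to_surrogate.values()))
print(duplicates)  # True: two CUSTIDs now share DWH_CUST_ID 100
```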
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Welcome aboard.
The rule for Hashed File stage is "destructive overwrite" - last one in wins.
Therefore your "obviously" statement is incorrect.
But your job logic is dangerous - why not use two separate hashed files?
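The "destructive overwrite" rule can be illustrated with a plain dict, which behaves like a hashed file keyed on CUSTID: a second write to an existing key silently replaces the first row rather than rejecting it. This is only an analogy, not DataStage code.

```python
# A hashed file keyed on CUSTID behaves like a dict: writing the same key
# twice is not rejected - the later value simply replaces the earlier one.
hashed_file = {}
hashed_file["CUST-42"] = 100   # first job writes DWH_CUST_ID 100
hashed_file["CUST-42"] = 205   # second job writes 205: last one in wins
print(hashed_file["CUST-42"])  # 205
```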
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Premium Member
- Posts: 7
- Joined: Wed Feb 14, 2007 12:57 pm
Hi Ray. Thanks for your response and the information on how the Hashed File stage works. I agree that the logic is dangerous, which is why I'm looking for a cleaner alternative. We can't use two separate hashed files because CUST_ID.hsh is used as a reference in other job streams.
I thought about using database tables instead of hashed files for the lookups. I think that would maintain better integrity, but I'm not convinced it's 100% effective. As I mentioned before, the only alternative I can see is to stagger these job streams so that they don't run at the same time. Please let me know what you think. Thanks.
Jay.
Re: Multiple Jobs updating the same Hash File
jkumar5444 wrote:
Each job stream updates the CUST_ID.hsh hash file with new CUSTID and DWH_CUST_ID values. There is a situation where each feed may have a new CUSTID and may try to insert the same DWH_CUST_ID value for two different CUSTIDs.
How is this situation where 'the same DWH_CUST_ID could be assigned to two different CUSTIDs' happening? Will staggering the job stream actually prevent this, or is it just that the destructive overwrite into the hashed file happens in the 'right order' when you do that?
Is there not a way to correct the surrogate handling so that this 'situation' does not occur?
-craig
"You can never have too many knives" -- Logan Nine Fingers
-
- Premium Member
- Posts: 7
- Joined: Wed Feb 14, 2007 12:57 pm
If the jobs are staggered, this situation will not arise because the logic looks up existing keys before assigning new ones. With staggered streams, the first stream inserts the new surrogate keys, and the second stream finds those existing keys and only assigns new keys for CUSTIDs the first stream didn't see. This works, and I have verified the data.
Using a different key range for each stream will not help if both streams run at the same time; it would actually increase the likelihood of multiple DWH_CUST_ID values for a single CUST_ID.
I appreciate the comments and questions. Please keep them coming. I'm sure I'll find a decent solution from all the great feedback. Thanks.
Jay.
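The staggering described above works because it serializes the lookup-then-assign step: each stream sees every key the other has already committed. A minimal sketch of that guarantee, expressed with an explicit lock rather than scheduling (all names hypothetical, not DataStage code):

```python
# Hedged sketch: staggering serializes lookup-then-assign. A lock gives the
# same guarantee - no stream can assign a key from a stale view of the file.
import threading

lock = threading.Lock()
cust_to_surrogate = {}   # stands in for CUST_ID.hsh
next_key = [100]         # boxed counter, stands in for SDKSequences

def get_or_assign(custid):
    """Return the existing surrogate, or atomically assign the next one."""
    with lock:
        if custid in cust_to_surrogate:
            return cust_to_surrogate[custid]
        key = next_key[0]
        next_key[0] += 1
        cust_to_surrogate[custid] = key
        return key

print(get_or_assign("CUST-A1"))  # 100 (new key assigned)
print(get_or_assign("CUST-B7"))  # 101 (new key assigned)
print(get_or_assign("CUST-A1"))  # 100 (existing key found, not reassigned)
```

In practice the equivalent of the lock is usually a database sequence or identity column, which is why moving the lookup into a table, as suggested above, is the more robust direction.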