Lookup problem with a multi-instance job - Strange!
I have a large file that will have multiple records for the same security. I am using unix 'split' command to split the file into 8 equal files and use them as an input to a multi-instance job.
In my job, I do a lookup against a hash file to determine whether the given security already exists; if it does, I process the record as an update, otherwise as an insert.
I am having a strange problem with one particular security. There are 8 records for this security in my main input file. After the split, all 8 records land in the same input file. The lookup does not find the record in the hash file and sends it as an insert instead of an update. However, when I trim my main input file to keep only this security, the 8 records are scattered across the 8 files, and the lookup finds the record in the hash file and processes it as an update.
This is a strange problem that I am facing. The lookup is cached. Is there any locking or something happening that is causing this behaviour?
Any clues would be appreciated...
-Thanks
Anjan
If you split your data into 8 parts, and run 8 instances to process the data, you need to make sure that rows that need to be handled together go to the same instance and you DON'T use any caching.
Think of this as a single job, whereby the first row has to be seen by the second row. You can only do this if you disable read and write caching to the hash file.
Now, if you involve multiple instances, and the first row is handled by one instance and the second handled by another instance, you will experience errant data.
Instead of using the unix split, use a single source file and put a constraint in a new first transformer. Use a MOD statement on the column that can keep like data together. If your security column is a number, then MOD(inlink.security, 8) = 0 will select all rows where the security number divided by 8 leaves a remainder of 0; each instance takes one remainder, 0 thru 7. You can set up your jobs this way so that you don't incur the initial cost of the "split" command, and you'll keep like security numbers together. A repeating security number will always be handled by the same job instance, and the lack of read/write caching will force prior rows to be visible to following rows.
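As a sketch of the routing idea (in Python rather than DataStage BASIC, with made-up sample rows and an assumed numeric security column), each instance keeps only the rows whose security number has its remainder, so every record for one security lands in the same instance:

```python
# Illustrative sketch of the MOD-based constraint, not DataStage code.
N_INSTANCES = 8

# Hypothetical sample rows standing in for the split input file.
rows = [
    {"security": 101, "type": "MATURITY"},
    {"security": 101, "type": "FACTOR"},
    {"security": 204, "type": "CALL"},
]

def instance_for(row):
    # Equivalent of the transformer constraint MOD(inlink.security, 8) = k:
    # instance k processes only rows whose remainder is k.
    return row["security"] % N_INSTANCES

partitions = {k: [] for k in range(N_INSTANCES)}
for row in rows:
    partitions[instance_for(row)].append(row)

# Both security 101 rows land in the same partition (101 % 8 == 5), so the
# instance that inserts the first occurrence also sees the later ones.
```

The point of the design is that rows which must see each other's effects on the hash file are serialized through one instance, instead of racing between instances.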
Kenneth Bland
Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
Ken -
Thanks for your response. Let me clarify a few things -
Here is the sample input record -
My problem is that when all the records belonging to security 101 are in one file, the lookup does not find the record, but when they are spread across different files, the lookup correctly finds the record. Is it because of caching? I disabled caching and still see the same behavior.

SECURITY  START DATE          TYPE      RATE
101       1/1/2001 00:00:00   MATURITY  100
101       1/1/2001 00:00:00   FACTOR    5
101       2/1/2001 00:00:00   FACTOR    5.5
101       3/1/2001 00:00:00   FACTOR    6
101       1/1/2001 00:00:00   CALL      0
101       1/1/2001 00:00:00   PUT       0
Now here is my lookup hash file -
SECURITY  START DATE          TYPE      SURROGATEKEY
101       1/1/2001 00:00:00   MATURITY  111
101       1/1/2001 00:00:00   FACTOR    112
101       2/1/2001 00:00:00   FACTOR    113
101       1/1/2001 00:00:00   CALL      115
101       1/1/2001 00:00:00   PUT       116
SECURITY + START DATE + TYPE are the key columns in the hash file.
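A small Python sketch (hypothetical names and values, not actual DataStage hash-file behavior) of a lookup keyed on the composite natural key SECURITY + START DATE + TYPE; a mismatch in any one component, such as a differently formatted date, turns what should be an update into an insert:

```python
# Toy model of the insert/update decision against the hash file.
# Keys and surrogate values are taken from the sample data above.
hash_file = {
    ("101", "1/1/2001 00:00:00", "MATURITY"): 111,
    ("101", "1/1/2001 00:00:00", "FACTOR"): 112,
}

def process(security, start_date, rec_type):
    key = (security, start_date, rec_type)
    if key in hash_file:
        return "update"      # composite natural key already present
    hash_file[key] = None    # surrogate key would be assigned elsewhere
    return "insert"

# Exact match on all three key parts -> update.
# Any variation in even one part (e.g. "01/01/2001" vs "1/1/2001") -> insert.
```

This is why consistent key metadata and a consistent date format matter: the lookup matches on the whole tuple, never on SECURITY alone.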
Any ideas?
-Thanks
Anjan
As you said in your last post, there are 3 keys for the hash file, but the rows have different dates. Can you check which properties you defined as keys when you first created this hash file? It may be wise to delete the hash file and recreate it fresh in order to confirm that the dictionary is correct.
Yes, I already did that, but I am still having the same problem.
Input data:
SECURITY  START DATE          TYPE      RATE
101       1/1/2001 00:00:00   MATURITY  100

Now here is my lookup hash file -
SECURITY  START DATE          TYPE      SURROGATEKEY
101       1/1/2001 00:00:00   MATURITY  111
If the lookup row is written to the hash file, then subsequently referenced by a repeating row, then you CANNOT be using ANY READ OR WRITE CACHING.
I really can't get a grasp of what your job looks like. PLEASE give us a word diagram, such as SEQ---> xfm/HASH lookup --> HASH (same as lookup).
Also, if I were you, I would always use a date format of "YYYY-MM-DD HH:MI:SS".
Kenneth Bland
Well, Kim and I verified your example data should work.
Now how about actual data? From your stated example, "security" is only 1 column of 3 on which you have to match a lookup. When you say "one security in the input file", are you saying one row, or one set of rows with the same "security" value, or one set of rows with the same 3 columns values (called the natural key so that we can use the correct terms).
So, are you saying the natural key repeats on multiple rows? Is your situation that the natural key repeats, and that under that situation, the first row should be written as an insert and the subsequent treated as updates? Is the problem that the subsequent rows are treated as inserts also?
Kenneth Bland
kcbland wrote: Well, Kim and I verified your example data should work.
Sorry if my earlier explanations were not clear... When I say one security I mean a set of rows with the same value under the column security.
Hope that explains...
I am sending you the word document as an attachment.
-Thanks
Anjan
Anjan Roy wrote: When I say one security I mean a set of rows with the same value under the column security.
Then those rows mean nothing to each other, because they are different natural key combinations. Therefore, your problem is most likely NOT related to this column, unless it just happens to be pointing to an issue with your hash file metadata on the reference.
Please tell me that your hash reference lookup has the primary key boxes checked for all three columns that make up the natural key, and that when you write to the hash file you have those same three columns checked. If there's anything wrong at this point, it's probably your hash file metadata.
Kenneth Bland
Yes, the metadata is exactly the same. All three primary key columns are checked.
My confusion is that if the metadata were wrong, then it should not work at all. Why does it work when I have only records related to one security?
The security column is part of 3 columns that make up a key. If you were to describe to me a situation where a row is found/notfound when one of those 3 columns has other rows like itself, my guess would be 99% right that you have a metadata issue.
What you write to a hash file has to be read from the hash file the same way. Since metadata is the way, it has to be consistent.
As for intermittent finding of a row, that is usually either a caching issue or, in the case of a multi-instance job, the hash target getting cleared: since all instances are clones, they all clear the hash file. In that case, one instance may already be writing data to the hash file when another comes through and wipes out the contents. So verify there is no write delay or read caching, as well as no clearing of the hash file.
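A toy Python model (not DataStage internals; names and values are made up) of why read caching defeats intra-run lookups: a cache snapshotted at job start never sees rows written during the run, so a repeated natural key looks like a miss and gets inserted again:

```python
# State of the hash file on disk when the job starts.
hash_file = {("101", "MATURITY"): 111}

# Read cache is a one-time snapshot, never refreshed during the run.
read_cache = dict(hash_file)

def lookup(key, cached):
    # Cached lookups hit the stale snapshot; uncached lookups hit the file.
    return (read_cache if cached else hash_file).get(key)

# A new row arrives and is written to the hash file mid-run.
hash_file[("101", "FACTOR")] = 112

# A later row repeats that natural key:
stale = lookup(("101", "FACTOR"), cached=True)   # None -> treated as insert
fresh = lookup(("101", "FACTOR"), cached=False)  # 112  -> treated as update
```

With caching disabled, every reference read goes to the live hash file, so a row written by an earlier row (in the same instance) is visible to later rows.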
Kenneth Bland
I already verified the metadata; it matches between the job that creates the hash file and the jobs that read it.
There is no write caching turned on. Also, the multi-instance job does not clear the hash file.