Hi, I know this issue has been talked about a lot. I haven't had to do this since a long time. The issue is range lookups and let me share what I have done so far.
MyRefJob
-----------
This hashed file job creates a hashed file. Before populating the hashed file, I created the indexes SEPARATELY on the 3 key columns using -
CREATE.INDEX MyHashedFile RefID Eff_Date_Start Eff_Date_End
The hashed file was created in Account so there was NO need for a VOC entry. Then, the hashed file was populated.
Next, I had to do a range lookup such that for if the Source.InputDate lay between the Ref.Eff_Date_Start and Ref.Eff_Date_End (and if the IDs matched), then pass all (may be multiple) rows to output.
MatchRowsJob
----------------
The hashed file, MyHashedFile, was accessed using a UV stage, with the same 3 columns as key. The SQL was user-defined -
SELECT RefID, Eff_Date_Start, Eff_Date_End FROM MyHashedFile
WHERE (RefID = ? AND Eff_Date_Start <= ? AND Eff_Date_End >= ?);
The transformer has the Reference Link with mutli row result set enabled.
My test data set only has about 4,000 rows and the reference link has about 150,000 rows. The MatchRowsJob is taking forever to finish (< 1 row per sec). What else do I need to do to get this job moving?
gateleys wrote:Hi, I know this issue has been talked about a lot. I haven't had to do this since a long time. The issue is range lookups and let me share what I have done so far.
MyRefJob
-----------
This hashed file job creates a hashed file. Before populating the hashed file, I created the indexes SEPARATELY on the 3 key columns using -
CREATE.INDEX MyHashedFile RefID Eff_Date_Start Eff_Date_End
The hashed file was created in Account so there was NO need for a VOC entry. Then, the hashed file was populated.
Next, I had to do a range lookup such that for if the Source.InputDate lay between the Ref.Eff_Date_Start and Ref.Eff_Date_End (and if the IDs matched), then pass all (may be multiple) rows to output.
MatchRowsJob
----------------
The hashed file, MyHashedFile, was accessed using a UV stage, with the same 3 columns as key. The SQL was user-defined -
SELECT RefID, Eff_Date_Start, Eff_Date_End FROM MyHashedFile
WHERE (RefID = ? AND Eff_Date_Start <= ? AND Eff_Date_End >= ?);
The transformer has the Reference Link with mutli row result set enabled.
My test data set only has about 4,000 rows and the reference link has about 150,000 rows. The MatchRowsJob is taking forever to finish (< 1 row per sec). What else do I need to do to get this job moving?
Thanks.
Hi,
Reference Link with mutli row result set -how can you set this in server jobs?
You'll need to break the job down into component pieces to isolate where any bottleneck occurs. First thought would be to, if you can, add an @FALSE constraint so that no output is written. That way just the input / lookup speed can be gauged.
-craig
"You can never have too many knives" -- Logan Nine Fingers
chulett wrote:add an @FALSE constraint so that no output is written. That way just the input / lookup speed can be gauged.
I am writing to a sequential file so the output link should not be a bottleneck. And it isn't... I have checked the reading and writing of the data without any problems. So, its gotta be the multi-row lookup that is causing the delay.
In past situations like this I've used left outer join to collect ALL the (sorted) dates related to the RefID and then used a transformer to constrain the data I dont require. Sweet.
3NF: Every non-key attribute must provide a fact about the key, the whole key, and nothing but the key. So help me Codd.
battaliou wrote:
In past situations like this I've used left outer join to collect ALL the (sorted) dates related to the RefID and then used a transformer to constrain the data I dont require. Sweet.
I forgot to mention that in the hashed file, the column RefID alone does NOT provide uniqueness to the row. RefID has to be accompanied by the Eff_Start_Date and Eff_End_Date for unique reference rows.
In short, how do you improve the performance of a sluggish process that does a range lookup (using UV stage) on two dates that make up a key along with a RefID column.
I have indexed all three key columns individually (and built them), since the match is on RefID and the dates appear in the where clause (Date1 >= ? AND Date2 <= ?).
Perhaps you could verify that, in fact, the indexes you believe are in place actually are. I don't recall the syntax you'd need off the top of my head, however.
I say that because the only time I see rows/sec run at '<1' in this situation is when there are no indexes in place. And yes Ray, we all know this is a useless metric but just go with the flow, k?
I'd also be curious what format your dates are in.
-craig
"You can never have too many knives" -- Logan Nine Fingers
gateleys, if at the end the performance is still not at an acceptable level, go for staging tables. Load your lookup and source into work tables and pass the same sql. Do that anyways while you wait for advice. It will give you a feel of how quick the process can be at the database level.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.