Where did I go wrong?

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

gateleys
Premium Member
Posts: 992
Joined: Mon Aug 08, 2005 5:08 pm
Location: USA

Where did I go wrong?

Post by gateleys »

Hi, I know this issue has been talked about a lot, but I haven't had to do this in a long time. The issue is range lookups; let me share what I have done so far.

Code: Select all

MyRefJob
-----------
This job creates a hashed file. Before populating it, I created the indexes SEPARATELY on the 3 key columns using -
CREATE.INDEX MyHashedFile RefID Eff_Date_Start Eff_Date_End
The hashed file was created in the Account, so there was NO need for a VOC entry. Then the hashed file was populated.
Next, I had to do a range lookup such that if Source.InputDate lay between Ref.Eff_Date_Start and Ref.Eff_Date_End (and the IDs matched), then all matching rows (possibly multiple) were passed to output.

Code: Select all

MatchRowsJob
----------------
The hashed file, MyHashedFile, was accessed using a UV stage, with the same 3 columns as key. The SQL was user-defined - 

SELECT RefID, Eff_Date_Start, Eff_Date_End FROM  MyHashedFile 
WHERE (RefID = ? AND Eff_Date_Start <= ? AND Eff_Date_End >= ?);
The transformer's Reference Link has the multi-row result set option enabled.

My test data set only has about 4,000 rows and the reference link has about 150,000 rows. The MatchRowsJob is taking forever to finish (< 1 row per sec). What else do I need to do to get this job moving?
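For anyone following along, the intended lookup semantics can be sketched outside DataStage. This is a minimal illustration in Python, not the job itself; it assumes ISO-format (YYYY-MM-DD) date strings so plain string comparison orders dates correctly, and the sample rows are made up:

```python
# Sketch of the multi-row range lookup the UV stage performs: for each
# source row, return every reference row whose RefID matches and whose
# [Eff_Date_Start, Eff_Date_End] interval contains InputDate.
ref_rows = [
    {"RefID": 1, "Eff_Date_Start": "2007-01-01", "Eff_Date_End": "2007-06-30"},
    {"RefID": 1, "Eff_Date_Start": "2007-07-01", "Eff_Date_End": "2007-12-31"},
    {"RefID": 2, "Eff_Date_Start": "2007-01-01", "Eff_Date_End": "2007-12-31"},
]

def range_lookup(ref_id, input_date, rows):
    """All reference rows matching ref_id whose date range covers input_date."""
    return [r for r in rows
            if r["RefID"] == ref_id
            and r["Eff_Date_Start"] <= input_date <= r["Eff_Date_End"]]

matches = range_lookup(1, "2007-03-15", ref_rows)
```

Note that one source row can legitimately return several reference rows, which is exactly why the multi-row result set option is needed.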

Thanks.
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

If you do a LIST.INDEX MyHashedFile ALL then it will tell you whether or not you need a BUILD.INDEX. You need exclusive access to build the index.
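At the TCL prompt, the check-and-build sequence described here would look something like this (standard UniVerse commands; remember you need exclusive access to the file for the build):

```
LIST.INDEX MyHashedFile ALL
BUILD.INDEX MyHashedFile ALL
```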
Mamu Kim
gateleys
Premium Member
Posts: 992
Joined: Mon Aug 08, 2005 5:08 pm
Location: USA

Post by gateleys »

Ah!!!! Darn that BUILD.

Thanks Kim.

gateleys
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

No problem.
Mamu Kim
georgesebastian
Participant
Posts: 83
Joined: Tue Dec 19, 2006 8:38 am

Re: Where did I go wrong?

Post by georgesebastian »

gateleys wrote: [...] The transformer has the Reference Link with multi-row result set enabled. [...]
Hi,

Reference Link with multi-row result set: how do you set this in server jobs?

Thanks
George
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

As noted, start a new thread and provide a link back to this one with the URL tags.
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Etiquette Note
Do not hijack threads. In particular do not hijack threads marked as Resolved. Start a new thread.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
gateleys
Premium Member
Posts: 992
Joined: Mon Aug 08, 2005 5:08 pm
Location: USA

Post by gateleys »

Hi,
Sorry I had to re-open an 'I thought it was resolved' case.

The job did finish and produce the right output. However, even with indexes built on each of the 3 key columns, it took an hour and a half to finish.

My test data set only has about 4,000 rows and the reference link has about 150,000 rows. The source input got 900 hits on the reference.

Isn't an hour and a half a loooonngg time?

gateleys
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

No problem, that's allowed. :wink:

You'll need to break the job down into component pieces to isolate where any bottleneck occurs. First thought would be to, if you can, add an @FALSE constraint so that no output is written. That way just the input / lookup speed can be gauged.
-craig

"You can never have too many knives" -- Logan Nine Fingers
gateleys
Premium Member
Posts: 992
Joined: Mon Aug 08, 2005 5:08 pm
Location: USA

Post by gateleys »

chulett wrote:add an @FALSE constraint so that no output is written. That way just the input / lookup speed can be gauged.
I am writing to a sequential file, so the output link should not be a bottleneck. And it isn't... I have checked the reading and writing of the data without any problems. So, it's got to be the multi-row lookup that is causing the delay.

gateleys
battaliou
Participant
Posts: 155
Joined: Mon Feb 24, 2003 7:28 am
Location: London
Contact:

Post by battaliou »

Yo Gateleys

In past situations like this I've used a left outer join to collect ALL the (sorted) dates related to the RefID and then used a transformer to constrain out the data I don't require. Sweet.
3NF: Every non-key attribute must provide a fact about the key, the whole key, and nothing but the key. So help me Codd.
gateleys
Premium Member
Posts: 992
Joined: Mon Aug 08, 2005 5:08 pm
Location: USA

Post by gateleys »

battaliou wrote:Yo Gateleys
Yo battaliou ....wassup!!
battaliou wrote: In past situations like this I've used a left outer join to collect ALL the (sorted) dates related to the RefID and then used a transformer to constrain the data I don't require. Sweet.

I forgot to mention that in the hashed file, the column RefID alone does NOT uniquely identify a row. RefID has to be accompanied by Eff_Date_Start and Eff_Date_End for unique reference rows.

gateleys
gateleys
Premium Member
Posts: 992
Joined: Mon Aug 08, 2005 5:08 pm
Location: USA

Post by gateleys »

Anyone?

In short, how do you improve the performance of a sluggish process that does a range lookup (using a UV stage) on two dates that, together with a RefID column, make up the key?

I have indexed all three key columns individually (and built the indexes), since the match is on RefID and the dates appear in the WHERE clause (Eff_Date_Start <= ? AND Eff_Date_End >= ?).

Any help will be appreciated.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Perhaps you could verify that, in fact, the indexes you believe are in place actually are. I don't recall the syntax you'd need off the top of my head, however.

I say that because the only time I see rows/sec run at '<1' in this situation is when there are no indexes in place. And yes Ray, we all know this is a useless metric but just go with the flow, k? :wink:

I'd also be curious what format your dates are in.
-craig

"You can never have too many knives" -- Logan Nine Fingers
DSguru2B
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

gateleys, if at the end the performance is still not at an acceptable level, go for staging tables. Load your lookup and source data into work tables and run the same SQL there. Do that anyway while you wait for advice; it will give you a feel for how quick the process can be at the database level.
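To get a feel for the database-level approach, here is a self-contained sketch using SQLite from Python. The table and column names mirror this thread but are otherwise illustrative; a real implementation would use the actual staging database and bulk loads:

```python
import sqlite3

# Stage source and reference rows into work tables, then let the
# database do the range join in a single SQL statement.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE src (RefID TEXT, InputDate TEXT)")
con.execute("CREATE TABLE ref (RefID TEXT, Eff_Date_Start TEXT, Eff_Date_End TEXT)")
con.executemany("INSERT INTO src VALUES (?, ?)",
                [("A", "2007-03-15"), ("B", "2008-01-01")])
con.executemany("INSERT INTO ref VALUES (?, ?, ?)",
                [("A", "2007-01-01", "2007-06-30"),
                 ("A", "2007-07-01", "2007-12-31"),
                 ("B", "2007-01-01", "2007-12-31")])

rows = con.execute(
    """SELECT s.RefID, s.InputDate, r.Eff_Date_Start, r.Eff_Date_End
       FROM src s
       JOIN ref r
         ON r.RefID = s.RefID
        AND s.InputDate BETWEEN r.Eff_Date_Start AND r.Eff_Date_End""").fetchall()
```

A composite index on (RefID, Eff_Date_Start, Eff_Date_End) in the work table would be the natural next step if volumes grow.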
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.