Where did I go wrong?

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

gateleys
Premium Member
Posts: 992
Joined: Mon Aug 08, 2005 5:08 pm
Location: USA

Where did I go wrong?

Post by gateleys »

Hi, I know this issue has been talked about a lot, but I haven't had to do this in a long time. The issue is range lookups; let me share what I have done so far.

Code: Select all

MyRefJob
-----------
This job creates a hashed file. Before populating it, I created the indexes SEPARATELY on the 3 key columns using -
CREATE.INDEX MyHashedFile RefID Eff_Date_Start Eff_Date_End
The hashed file was created in the Account, so there was NO need for a VOC entry. Then the hashed file was populated.
Next, I had to do a range lookup such that if Source.InputDate lay between Ref.Eff_Date_Start and Ref.Eff_Date_End (and the IDs matched), then all matching rows (possibly multiple) were passed to output.

Code: Select all

MatchRowsJob
----------------
The hashed file, MyHashedFile, was accessed using a UV stage, with the same 3 columns as key. The SQL was user-defined - 

SELECT RefID, Eff_Date_Start, Eff_Date_End FROM  MyHashedFile 
WHERE (RefID = ? AND Eff_Date_Start <= ? AND Eff_Date_End >= ?);
The transformer's Reference Link has the multi-row result set option enabled.

My test data set only has about 4,000 rows and the reference link has about 150,000 rows. The MatchRowsJob is taking forever to finish (< 1 row per sec). What else do I need to do to get this job moving?
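For anyone following along, the intended lookup semantics can be sketched outside DataStage. This is a minimal illustration in Python, not the job itself; it assumes ISO-format (YYYY-MM-DD) date strings so plain string comparison orders dates correctly, and the sample rows are made up:

```python
# Sketch of the multi-row range lookup the UV stage performs: for each
# source row, return every reference row whose RefID matches and whose
# [Eff_Date_Start, Eff_Date_End] interval contains InputDate.
ref_rows = [
    {"RefID": 1, "Eff_Date_Start": "2007-01-01", "Eff_Date_End": "2007-06-30"},
    {"RefID": 1, "Eff_Date_Start": "2007-07-01", "Eff_Date_End": "2007-12-31"},
    {"RefID": 2, "Eff_Date_Start": "2007-01-01", "Eff_Date_End": "2007-12-31"},
]

def range_lookup(ref_id, input_date, rows):
    """All reference rows matching ref_id whose date range covers input_date."""
    return [r for r in rows
            if r["RefID"] == ref_id
            and r["Eff_Date_Start"] <= input_date <= r["Eff_Date_End"]]

matches = range_lookup(1, "2007-03-15", ref_rows)
```

Note that one source row can legitimately return several reference rows, which is exactly why the multi-row result set option is needed.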

Thanks.
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

If you do a LIST.INDEX MyHashedFile ALL then it will tell you whether or not you need a BUILD.INDEX. You need exclusive access to build the index.
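At the TCL prompt, the check-and-build sequence described here would look something like this (standard UniVerse commands; remember you need exclusive access to the file for the build):

```
LIST.INDEX MyHashedFile ALL
BUILD.INDEX MyHashedFile ALL
```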
Mamu Kim
gateleys
Premium Member
Posts: 992
Joined: Mon Aug 08, 2005 5:08 pm
Location: USA

Post by gateleys »

Ah!!!! Darn that BUILD.

Thanks Kim.

gateleys
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

No problem.
Mamu Kim
georgesebastian
Participant
Posts: 83
Joined: Tue Dec 19, 2006 8:38 am

Re: Where did I go wrong?

Post by georgesebastian »

gateleys wrote: [...] The transformer has the Reference Link with multi-row result set enabled. [...]
Hi,

Reference Link with multi-row result set: how do you set this in server jobs?

Thanks
George
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

As noted, start a new thread and provide a link back to this one with the URL tags.
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Etiquette Note
Do not hijack threads. In particular do not hijack threads marked as Resolved. Start a new thread.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
gateleys
Premium Member
Posts: 992
Joined: Mon Aug 08, 2005 5:08 pm
Location: USA

Post by gateleys »

Hi,
Sorry I had to re-open an 'I thought it was resolved' case.

The job did finish and produce the right output. However, even with indexes built on each of the 3 key columns, it took an hour and a half to finish.

My test data set only has about 4,000 rows and the reference link has about 150,000 rows. The source input got 900 hits on the reference.

Isn't an hour and a half a loooonngg time?

gateleys
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

No problem, that's allowed. :wink:

You'll need to break the job down into component pieces to isolate where any bottleneck occurs. First thought would be to, if you can, add an @FALSE constraint so that no output is written. That way just the input / lookup speed can be gauged.
-craig

"You can never have too many knives" -- Logan Nine Fingers
gateleys
Premium Member
Posts: 992
Joined: Mon Aug 08, 2005 5:08 pm
Location: USA

Post by gateleys »

chulett wrote:add an @FALSE constraint so that no output is written. That way just the input / lookup speed can be gauged.
I am writing to a sequential file, so the output link should not be a bottleneck. And it isn't... I have checked the reading and writing of the data without any problems. So, it's got to be the multi-row lookup that is causing the delay.

gateleys
battaliou
Participant
Posts: 155
Joined: Mon Feb 24, 2003 7:28 am
Location: London
Contact:

Post by battaliou »

Yo Gateleys

In past situations like this I've used a left outer join to collect ALL the (sorted) dates related to the RefID and then used a transformer to constrain out the data I don't require. Sweet.
3NF: Every non-key attribute must provide a fact about the key, the whole key, and nothing but the key. So help me Codd.
gateleys
Premium Member
Posts: 992
Joined: Mon Aug 08, 2005 5:08 pm
Location: USA

Post by gateleys »

battaliou wrote:Yo Gateleys
Yo battaliou ....wassup!!
battaliou wrote: In past situations like this I've used a left outer join to collect ALL the (sorted) dates related to the RefID and then used a transformer to constrain the data I don't require. Sweet.

I forgot to mention that in the hashed file, the column RefID alone does NOT uniquely identify a row. RefID has to be accompanied by Eff_Date_Start and Eff_Date_End for unique reference rows.

gateleys
gateleys
Premium Member
Posts: 992
Joined: Mon Aug 08, 2005 5:08 pm
Location: USA

Post by gateleys »

Anyone?

In short, how do you improve the performance of a sluggish process that does a range lookup (using a UV stage) on two dates that, together with a RefID column, make up the key?

I have indexed all three key columns individually (and built the indexes), since the match is on RefID and the dates appear in the WHERE clause (Eff_Date_Start <= ? AND Eff_Date_End >= ?).

Any help will be appreciated.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Perhaps you could verify that, in fact, the indexes you believe are in place actually are. I don't recall the syntax you'd need off the top of my head, however.

I say that because the only time I see rows/sec run at '<1' in this situation is when there are no indexes in place. And yes Ray, we all know this is a useless metric but just go with the flow, k? :wink:

I'd also be curious what format your dates are in.
-craig

"You can never have too many knives" -- Logan Nine Fingers
DSguru2B
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

gateleys, if at the end the performance is still not at an acceptable level, go for staging tables. Load your lookup and source data into work tables and run the same SQL there. Do that anyway while you wait for advice; it will give you a feel for how quick the process can be at the database level.
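To get a feel for the database-level approach, here is a self-contained sketch using SQLite from Python. The table and column names mirror this thread but are otherwise illustrative; a real implementation would use the actual staging database and bulk loads:

```python
import sqlite3

# Stage source and reference rows into work tables, then let the
# database do the range join in a single SQL statement.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE src (RefID TEXT, InputDate TEXT)")
con.execute("CREATE TABLE ref (RefID TEXT, Eff_Date_Start TEXT, Eff_Date_End TEXT)")
con.executemany("INSERT INTO src VALUES (?, ?)",
                [("A", "2007-03-15"), ("B", "2008-01-01")])
con.executemany("INSERT INTO ref VALUES (?, ?, ?)",
                [("A", "2007-01-01", "2007-06-30"),
                 ("A", "2007-07-01", "2007-12-31"),
                 ("B", "2007-01-01", "2007-12-31")])

rows = con.execute(
    """SELECT s.RefID, s.InputDate, r.Eff_Date_Start, r.Eff_Date_End
       FROM src s
       JOIN ref r
         ON r.RefID = s.RefID
        AND s.InputDate BETWEEN r.Eff_Date_Start AND r.Eff_Date_End""").fetchall()
```

A composite index on (RefID, Eff_Date_Start, Eff_Date_End) in the work table would be the natural next step if volumes grow.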
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.