Hashed / Universe Performance

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

tracy
Participant
Posts: 47
Joined: Mon Aug 07, 2006 9:19 am

Hashed / Universe Performance

Post by tracy »

My data in a Sequential file looks something like this:

A 05/18/2009 100
A 06/03/2009 50
B 08/01/2009 40
C 07/01/2009 5

I've got a lookup table that looks something like this:

A 05/01/2009 06/30/2009
B 07/01/2009 12/31/2009
C 01/01/2010 12/31/2012

I've loaded this into a Hashed File where all three fields are keys.

In my job, I've got a Transformer stage with the Sequential File data coming in, as well as a Universe stage reference link that looks up the Code (A, B, or C) to make sure the code was valid on the date in question. For instance, the record with "C" would be invalid since, according to the lookup table, that code isn't valid until 2010.

In the Universe stage under "SQL for reference inputs", I have something that looks like this (the three ? markers get bound, in order, to the key columns on the reference link; the date from the incoming row is passed for both the START_DATE and END_DATE comparisons):

SELECT CODE, START_DATE, END_DATE
FROM HASHED_FILE
WHERE CODE = ?
AND START_DATE <= ?
AND END_DATE >= ?

This is working, but it's really slow: records pass through at 2 rows per second.

I took out the Universe stage and the job jumps up to 2000 rows per second. When I changed the "<=" and ">=" to simple "=" comparisons, it also ran fast. I also noticed that if my hashed file has 30,000 records in it, it's slow, but with only 100 records it's much faster.

I've had other situations with non-hashed files where changing one setting really sped things up. Is there any such setting here?
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

For the Universe stage to be 'functional' over a hashed file like that, you need to create a secondary index over those fields. A search for CREATE.INDEX and BUILD.INDEX should get you the syntax here.
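The general shape of those commands, run from the Administrator's Command window against the project (with fieldname as it appears in the hashed file's dictionary), is roughly:

CREATE.INDEX filename fieldname
BUILD.INDEX filename ALL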
-craig

"You can never have too many knives" -- Logan Nine Fingers
tracy
Participant
Posts: 47
Joined: Mon Aug 07, 2006 9:19 am

Post by tracy »

Thanks! There's lots of stuff out there for me to review. So far, I've done this in Administrator:

CREATE.INDEX HASHED_FILE CODE START_DATE END_DATE
BUILD.INDEX HASHED_FILE ALL

The commands seem to have run successfully, but now my job is running even slower. Before, I saw 2 rows per second; now it's just sitting at 0 rows per second.

Still reading through everything...
tracy
Participant
Posts: 47
Joined: Mon Aug 07, 2006 9:19 am

Post by tracy »

I haven't had any luck with these Indexes yet.

One of the things I saw while looking through other posts was using the Universe stage to join on some but not all of the keys. Could I use that technique as a workaround for my issue? For instance, if I joined on just the CODE and not the dates, would I possibly see 1 record going into the Transformer stage and multiples coming out? Then I could use constraints inside the Transformer stage to check the dates and only let the one valid record through, something like the sketch below.
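A minimal sketch of the output link constraint I have in mind (link and column names are placeholders, not my real job, and this assumes the dates are stored in a comparable format such as internal date numbers):

in.TRAN_DATE >= lkp.START_DATE And in.TRAN_DATE <= lkp.END_DATE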

But even when I tried joining on just the code, it only allowed 1 record through, not multiples like I'd hoped.

Does anybody have any tips or suggestions?
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

You could do that but it really shouldn't be necessary. There's a link property on the lookup itself you'd need to enable to allow a 'multi-row result set' - did you?
-craig

"You can never have too many knives" -- Logan Nine Fingers
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

The other issue I should have mentioned, though it would have been in the search results, is that listing multiple columns in the 'create' statement creates multiple individual indexes, not one composite index. So I would suggest starting over: create an index over just the CODE field and see how much (if any) that helps.
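If memory serves, the reset would look something like this, assuming the file and field names from earlier in the thread:

DELETE.INDEX HASHED_FILE ALL
CREATE.INDEX HASHED_FILE CODE
BUILD.INDEX HASHED_FILE CODE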
-craig

"You can never have too many knives" -- Logan Nine Fingers
tracy
Participant
Posts: 47
Joined: Mon Aug 07, 2006 9:19 am

Post by tracy »

Thanks again, chulett!

I didn't know about the "multi-row result set" link property. I toggled it, and it does give me multiple records on the output. But the more I work with it, the less comfortable I am with the output, so I'm thinking this isn't a great option for this particular problem (I can see it coming in handy someday in a cartesian-product sort of way, though). So I'm back to the indexes.

My example here is a little simpler than real life. In my real job, I have two codes along with the Start and End Dates.

I did read in the other posts about how the create statement makes multiple individual indexes, so I tried several things last week. I tried creating indexes on the codes as well as on the dates (4 indexes total), and on just the codes but none of the dates (2 indexes total). It was still just as slow.

When you posted this morning, I thought I'd try one more time before replying. I dropped my test job, made a fresh copy of my production job, and created an index on just one of the codes (the one with the most unique values). It now runs anywhere from 200 to 1000 rows per second, which I'm happy with for now.

My last step will be to have the job create and build the index automatically, because when I populate a Hashed File I have the following boxes checked and the indexes get wiped out each time:

Create File
Delete file before create
Clear file before writing

I'm assuming I have to rebuild the index regularly anyway. I'll post my solution here in the hope that it helps others in the future.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Delete file and Clear file are mutually exclusive, meaning you only need one or the other. Stick with 'clear file' so you don't have to rebuild the index every run.
-craig

"You can never have too many knives" -- Logan Nine Fingers
tracy
Participant
Posts: 47
Joined: Mon Aug 07, 2006 9:19 am

Post by tracy »

Thanks again for the help, chulett.

I didn't have much luck unless I at least rebuilt the index each time. So to play it safe, I delete, create, and rebuild the indexes on every run.

I think I'm all set now and am setting the status to resolved. I've applied the same technique to my other slow spots, and a job that was nowhere near completion after 8 hours now finishes in a few minutes.

The routine I created to handle the indexes is below, in case it helps somebody else in the future; I just call it after I create the Hashed Files. I'm sure it could be coded better, or that perhaps I'm missing something or have too much, but it's working.


$INCLUDE DSINCLUDE JOBCONTROL.H
Deffun DSRMessage(A1, A2, A3) Calling "*DataStage*DSR_MESSAGE"
Equate RoutineName To "HashedIndexes"

* Arguments: p_HashedName  - name of the hashed file
*            p_IndexName   - field(s) to index
*            p_DeleteIndex, p_CreateIndex, p_BuildIndex - "Y" to run that step

* Optionally drop the existing index first
If p_DeleteIndex = "Y" Then
   v_Command = "DELETE.INDEX " : p_HashedName : " " : p_IndexName
   Call DSLogInfo(v_Command, "Delete Index Command")
   Call DSExecute("UV", v_Command, Output, SystemReturnCode)
   If Output = '' Then Output = "{no output}"
   Call DSLogInfo(Output, "Delete Index Output")
End

* Create the (empty) index definition
If p_CreateIndex = "Y" Then
   v_Command = "CREATE.INDEX " : p_HashedName : " " : p_IndexName
   Call DSLogInfo(v_Command, "Create Index Command")
   Call DSExecute("UV", v_Command, Output, SystemReturnCode)
   If Output = '' Then Output = "{no output}"
   Call DSLogInfo(Output, "Create Index Output")
End

* Populate the index from the file's current contents
If p_BuildIndex = "Y" Then
   v_Command = "BUILD.INDEX " : p_HashedName : " " : p_IndexName
   Call DSLogInfo(v_Command, "Build Index Command")
   Call DSExecute("UV", v_Command, Output, SystemReturnCode)
   If Output = '' Then Output = "{no output}"
   Call DSLogInfo(Output, "Build Index Output")
End
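In case it helps, here's a sketch of how it can be called from job control, assuming the routine was saved as a Transform Function (DataStage catalogs those as DSU.<name>) with the five arguments in the order listed in the comments above:

Deffun HashedIndexes(HashedName, IndexName, DeleteIdx, CreateIdx, BuildIdx) Calling "DSU.HashedIndexes"
Dummy = HashedIndexes("HASHED_FILE", "CODE", "Y", "Y", "Y")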