Hashed / Universe Performance

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

tracy
Participant
Posts: 47
Joined: Mon Aug 07, 2006 9:19 am

Hashed / Universe Performance

Post by tracy »

My data in a Sequential file looks something like this:

A 05/18/2009 100
A 06/03/2009 50
B 08/01/2009 40
C 07/01/2009 5

I've got a lookup table that looks something like this:

A 05/01/2009 06/30/2009
B 07/01/2009 12/31/2009
C 01/01/2010 12/31/2012

I've loaded this into a Hashed File where all three fields are keys.

In my job, I've got a Transformer stage with the Sequential File data coming in, as well as a Universe stage reference link that looks up the Code (A, B, or C) to make sure the code was valid on the date in question. For instance, the record with "C" would be invalid since, according to the lookup table, that code isn't valid until 2010.

In the Universe stage under "SQL for reference inputs", I have something that looks like this (the three ? markers get bound, in order, to the key columns on the reference link; the date from the incoming row is passed for both the START_DATE and END_DATE comparisons):

SELECT CODE, START_DATE, END_DATE
FROM HASHED_FILE
WHERE CODE = ?
AND START_DATE <= ?
AND END_DATE >= ?

This is working, but it's really slow: records pass through at 2 rows per second.

I took out the Universe stage and the job jumps up to 2000 rows per second. When I changed the "<=" and ">=" to simple "=" comparisons, it also ran fast. I also noticed that if my hashed file has 30,000 records in it, it's slow, but with only 100 records it's much faster.

I've had other situations with non-hashed files where changing one setting really sped things up. Is there any such setting here?
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

For the Universe stage to be 'functional' over a hashed file like that, you need to create a secondary index over those fields. A search for CREATE.INDEX and BUILD.INDEX should get you the syntax here.
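The general shape of those commands, run from the Administrator's Command window against the project (with fieldname as it appears in the hashed file's dictionary), is roughly:

CREATE.INDEX filename fieldname
BUILD.INDEX filename ALL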
-craig

"You can never have too many knives" -- Logan Nine Fingers
tracy
Participant
Posts: 47
Joined: Mon Aug 07, 2006 9:19 am

Post by tracy »

Thanks! There's lots of stuff out there for me to review. So far, I've done this in Administrator:

CREATE.INDEX HASHED_FILE CODE START_DATE END_DATE
BUILD.INDEX HASHED_FILE ALL

The commands seem to have run successfully, but now my job is running even slower. Before, I saw 2 rows per second; now it's just sitting at 0 rows per second.

Still reading through everything...
tracy
Participant
Posts: 47
Joined: Mon Aug 07, 2006 9:19 am

Post by tracy »

I haven't had any luck with these Indexes yet.

One of the things I saw while looking through other posts was using the Universe stage to join on some but not all of the keys. Could I use that technique as a workaround for my issue? For instance, if I joined on just the CODE and not the dates, would I possibly see 1 record going into the Transformer stage and multiples coming out? Then I could use constraints inside the Transformer stage to check the dates and only let the one valid record through, something like the sketch below.
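A minimal sketch of the output link constraint I have in mind (link and column names are placeholders, not my real job, and this assumes the dates are stored in a comparable format such as internal date numbers):

in.TRAN_DATE >= lkp.START_DATE And in.TRAN_DATE <= lkp.END_DATE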

But even when I tried joining on just the code, it only allowed 1 record through, not multiples like I'd hoped.

Does anybody have any tips or suggestions?
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

You could do that but it really shouldn't be necessary. There's a link property on the lookup itself you'd need to enable to allow a 'multi-row result set' - did you?
-craig

"You can never have too many knives" -- Logan Nine Fingers
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

The other issue I should have mentioned, though it would have been in the search results, is that listing multiple columns in the 'create' statement creates multiple individual indexes, not one composite index. So I would suggest starting over: create an index over just the CODE field and see how much (if any) that helps.
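If memory serves, the reset would look something like this, assuming the file and field names from earlier in the thread:

DELETE.INDEX HASHED_FILE ALL
CREATE.INDEX HASHED_FILE CODE
BUILD.INDEX HASHED_FILE CODE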
-craig

"You can never have too many knives" -- Logan Nine Fingers
tracy
Participant
Posts: 47
Joined: Mon Aug 07, 2006 9:19 am

Post by tracy »

Thanks again, chulett!

I didn't know about the "multi-row result set" link property. I toggled it, and it does give me multiple records on the output. But the more I work with it, the less comfortable I am with the output, so I'm thinking this isn't a great option for this particular problem (I can see it coming in handy someday in a cartesian-product sort of way, though). So I'm back to the indexes.

My example here is a little simpler than real life. In my real job, I have two codes along with the Start and End Dates.

I did read in the other posts about how the create statement makes multiple individual indexes, so I tried several things last week. I tried creating indexes on the codes as well as on the dates (4 indexes total), and on just the codes but none of the dates (2 indexes total). It was still just as slow.

When you posted this morning, I thought I'd try one more time before replying. I dropped my test job, made a fresh copy of my production job, and created an index on just one of the codes (the one with the most unique values). It now runs anywhere from 200 to 1000 rows per second, which I'm happy with for now.

My last step will be to have the job create and build the index automatically, because when I populate a Hashed File I have the following boxes checked and the indexes get wiped out each time:

Create File
Delete file before create
Clear file before writing

I'm assuming I have to rebuild the index regularly anyway. I'll post my solution here in the hope that it helps others in the future.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Delete file and Clear file are mutually exclusive, meaning you only need one or the other. Stick with 'clear file' so you don't have to rebuild the index every run.
-craig

"You can never have too many knives" -- Logan Nine Fingers
tracy
Participant
Posts: 47
Joined: Mon Aug 07, 2006 9:19 am

Post by tracy »

Thanks again for the help, chulett.

I didn't have much luck unless I at least rebuilt the index each time. So to play it safe, I delete, create, and rebuild the indexes on every run.

I think I'm all set now and am setting the status to resolved. I've applied the same technique to my other slow spots, and a job that was nowhere near completion after 8 hours now finishes in a few minutes.

The routine I created to handle the indexes is below, in case it helps somebody else in the future; I just call it after I create the Hashed Files. I'm sure it could be coded better, or that perhaps I'm missing something or have too much, but it's working.


$INCLUDE DSINCLUDE JOBCONTROL.H
Deffun DSRMessage(A1, A2, A3) Calling "*DataStage*DSR_MESSAGE"
Equate RoutineName To "HashedIndexes"

* Arguments: p_HashedName  - name of the hashed file
*            p_IndexName   - field(s) to index
*            p_DeleteIndex, p_CreateIndex, p_BuildIndex - "Y" to run that step

* Optionally drop the existing index first
If p_DeleteIndex = "Y" Then
   v_Command = "DELETE.INDEX " : p_HashedName : " " : p_IndexName
   Call DSLogInfo(v_Command, "Delete Index Command")
   Call DSExecute("UV", v_Command, Output, SystemReturnCode)
   If Output = '' Then Output = "{no output}"
   Call DSLogInfo(Output, "Delete Index Output")
End

* Create the (empty) index definition
If p_CreateIndex = "Y" Then
   v_Command = "CREATE.INDEX " : p_HashedName : " " : p_IndexName
   Call DSLogInfo(v_Command, "Create Index Command")
   Call DSExecute("UV", v_Command, Output, SystemReturnCode)
   If Output = '' Then Output = "{no output}"
   Call DSLogInfo(Output, "Create Index Output")
End

* Populate the index from the file's current contents
If p_BuildIndex = "Y" Then
   v_Command = "BUILD.INDEX " : p_HashedName : " " : p_IndexName
   Call DSLogInfo(v_Command, "Build Index Command")
   Call DSExecute("UV", v_Command, Output, SystemReturnCode)
   If Output = '' Then Output = "{no output}"
   Call DSLogInfo(Output, "Build Index Output")
End
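In case it helps, here's a sketch of how it can be called from job control, assuming the routine was saved as a Transform Function (DataStage catalogs those as DSU.<name>) with the five arguments in the order listed in the comments above:

Deffun HashedIndexes(HashedName, IndexName, DeleteIdx, CreateIdx, BuildIdx) Calling "DSU.HashedIndexes"
Dummy = HashedIndexes("HASHED_FILE", "CODE", "Y", "Y", "Y")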