How to speed up multiple lookups in a Databasic routine

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

CharlesNagy
Premium Member
Posts: 66
Joined: Mon Feb 21, 2005 10:40 am
Location: Paris

How to speed up multiple lookups in a Databasic routine

Post by CharlesNagy »

I am trying to speed up a routine called from a transformer, which, based on the arguments fed to it, reads from various lookup hashed files to return the values required for each row.

Each time the routine is called it may read from several hashed files in order to gather the information.

COMMON blocks are used to ensure that a hashed file opened previously is not opened again, and a lookup is skipped when the required value has already been retrieved.
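
Roughly, each function follows this open-once pattern (the file, COMMON, and routine names here are made up for illustration - the real code differs):

    * Open-once pattern: the named COMMON persists across calls,
    * so the hashed file is opened on the first call only.
    COMMON /LookupCache/ FileIsOpen, F.LOOKUP

    If UnAssigned(FileIsOpen) Then FileIsOpen = @False
    If Not(FileIsOpen) Then
       Open 'CUSTOMER_LOOKUP' To F.LOOKUP Then
          FileIsOpen = @True
       End Else
          Call DSLogWarn('Cannot open CUSTOMER_LOOKUP', 'MyLookupRoutine')
       End
    End

    Ans = ''
    If FileIsOpen Then
       Read Rec From F.LOOKUP, Arg1 Then Ans = Rec<1>
    End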

The routine is still painfully slow, and I suspect the sheer amount of disk access is to blame. Is there a way to allocate a larger disk cache from within a DataBasic routine? I have tried playing around with the job's Performance tab, but it hasn't helped much.

Not much point in posting the routine as it calls up to 20 different functions, which would be unwieldy to post. Each function opens one to several hashed files (if they are not already open) and reads from them.

I thought about amalgamating them all into one routine, but I'm not sure it would help, since I suspect that it is the sheer amount of reads that is slowing it down.

Any advice gratefully accepted
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

It probably is the reads. I've sometimes included my own cache in programs of this type by creating a Key and a Data dynamic array for each file. The keys and data are appended as new values to these strings, and I check the Key array (using LOCATE) before attempting an actual read. In many cases this gives a marked improvement in throughput, but it does depend on your data - if the same record is never read twice, this method can produce a marked slowdown instead.
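
A rough sketch of what I mean (the COMMON names are illustrative, and I'm assuming the file variable has already been opened via an open-once pattern like the one above):

    * Per-file cache: keys and data are held in parallel dynamic
    * arrays in COMMON and checked before any disk read.
    COMMON /MyCache/ F.LOOKUP, CacheKeys, CacheData

    Locate Arg1 In CacheKeys<1> Setting Pos Then
       Ans = CacheData<Pos>             ;* cache hit - no disk access
    End Else
       Read Rec From F.LOOKUP, Arg1 Then Ans = Rec<1> Else Ans = ''
       CacheKeys<-1> = Arg1             ;* remember the key and result
       CacheData<-1> = Ans              ;* (even misses) for next time
    End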
CharlesNagy
Premium Member
Posts: 66
Joined: Mon Feb 21, 2005 10:40 am
Location: Paris

Post by CharlesNagy »

ArndW wrote:It probably is the reads. I've sometimes included my own cache in programs of this type by creating a Key and a Data dynamic array for each file. The keys and data are appended as new values to these strings, and I check the Key array (using LOCATE) before attempting an actual read. In many cases this gives a marked improvement in throughput, but it does depend on your data - if the same record is never read twice, this method can produce a marked slowdown instead.
Thanks for the idea. I created a dynamic array for one file, holding the values that had already been read in, but as the lookup file has 50,000 records the array grows quite large, and LOCATE scans it sequentially - so the job starts off quite fast but slows down as it progresses.

I don't think there is an easy way around it without analysing and redesigning the process...
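
One thing I may still experiment with - an untested variation on the same idea, not something anyone has confirmed here - is keeping the key list sorted, so that LOCATE with a BY clause can stop scanning as soon as it passes the insertion point (which it also returns on a miss):

    * Variation: CacheKeys is kept in ascending order, so LOCATE
    * can stop early; on a miss, Pos is the insertion point.
    Locate Arg1 In CacheKeys<1> By 'AL' Setting Pos Then
       Ans = CacheData<Pos>
    End Else
       Read Rec From F.LOOKUP, Arg1 Then Ans = Rec<1> Else Ans = ''
       Ins Arg1 Before CacheKeys<Pos>   ;* keep both arrays aligned
       Ins Ans Before CacheData<Pos>
    End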
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Of course the DataStage hashed file caching works quite well - but I'm sure you know that and have reasons for not being able to use this.

Depending upon how important this is to you, you can enable memory file caching for DataStage at the system level and use that; your hashed file reads would then never really go to disk. I have used this successfully before, and the performance improvements are very high.

There is a document regarding memory-cached hashed files; I can't recall the name right now, but you can find it in the documentation directory.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

dsdskche.pdf
-craig

"You can never have too many knives" -- Logan Nine Fingers
CharlesNagy
Premium Member
Posts: 66
Joined: Mon Feb 21, 2005 10:40 am
Location: Paris

Post by CharlesNagy »

ArndW wrote:Of course the DataStage hashed file caching works quite well - but I'm sure you know that and have reasons for not being able to use this.

Depending upon how important this is to you, you can enable memory file caching for DataStage at the system level and use that; your hashed file reads would then never really go to disk. I have used this successfully before, and the performance improvements are very high.

There is a document regarding memory-cached hashed files; I can't recall the name right now, but you can find it in the documentation directory.
Thanks, I am looking at the document 'Hash Stage Disk Caching' now, if that is the one you meant. I will check & see if this is a possibility....
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

That's the document. I can't recall the exact numbers, but the maximum amount of memory usable for the cache is initially small and can be expanded. There are a couple of "gotchas" with caching, and I've managed to get the DS instance so confused that not a single read/write operation worked until we rebooted - but that was due to me playing around with files I shouldn't have. (I think I was trying to get the Director to respond faster and shoved all of the DS_... files into the cache while 30 developers were trying to do their jobs in the background. I made no friends that day...)
CharlesNagy
Premium Member
Posts: 66
Joined: Mon Feb 21, 2005 10:40 am
Location: Paris

Post by CharlesNagy »

ArndW wrote:That's the document. I can't recall the exact numbers, but the maximum amount of memory usable for the cache is initially small and can be expanded. There are a couple of "gotchas" with caching, and I've managed to get the DS instance so confused that not a single read/write operation worked until we rebooted - but that was due to me playing around with files I shouldn't have. (I think I was trying to get the Director to respond faster and shoved all of the DS_... files into the cache while 30 developers were trying to do their jobs in the background. I made no friends that day...)
Thanks guys, much appreciated....
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Please note that the cache referred to in this document is not accessible from DataStage routines.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

D'oh! :cry:
-craig

"You can never have too many knives" -- Logan Nine Fingers
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Public disk caching is a DataStage engine-level setting; it can be enabled per hashed file and is independent of the method used to read from or write to the file. So it works from BASIC code just as it does in jobs.
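
In other words, nothing changes on the BASIC side; ordinary code like this (file and routine names invented for the example) is served from the public cache once caching has been enabled for that file:

    * Ordinary BASIC access - identical whether or not the hashed
    * file has public disk caching enabled at the engine level.
    Open 'LOOKUP_HASH' To F.LOOKUP Else
       Call DSLogFatal('Cannot open LOOKUP_HASH', 'MyRoutine')   ;* aborts
    End
    Read Rec From F.LOOKUP, Arg1 Then Ans = Rec<1> Else Ans = ''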

I just took a quick look and it seems that the "dsdskche.pdf" document is no longer delivered at version 8, but the UV commands are still present, so the facility can still be used.

Oh, I just searched the docs and found that the disk caching information is now in Chapter 25 of the Server Job Developer Guide.
CharlesNagy
Premium Member
Posts: 66
Joined: Mon Feb 21, 2005 10:40 am
Location: Paris

Post by CharlesNagy »

ArndW wrote:Public disk caching is a DataStage engine-level setting; it can be enabled per hashed file and is independent of the method used to read from or write to the file. So it works from BASIC code just as it does in jobs.

I just took a quick look and it seems that the "dsdskche.pdf" document is no longer delivered at version 8, but the UV commands are still present, so the facility can still be used.

Oh, I just searched the docs and found that the disk caching information is now in Chapter 25 of the Server Job Developer Guide.
Thanks guys, appreciate all the input, I am now marking this resolved...