
How to speed up multiple lookups in a DataBASIC routine

Posted: Fri Jun 13, 2008 3:38 am
by CharlesNagy
I am trying to speed up a routine called from a transformer, which, based on the arguments fed to it, reads from various lookup hashed files to return the values required for each row.

Each time the routine is called it may read from several hashed files in order to gather the information.

COMMON variables are used to ensure that a hashed file which has been opened previously is not opened again, and a lookup is skipped if the required value has already been retrieved.
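
Each lookup function follows roughly this open-once pattern (a simplified sketch - the COMMON block, hashed file, and variable names here are just placeholders, not the real ones):

Code:
      COMMON /LookupFiles/ FileIsOpen, F.LOOKUP

      * Open the hashed file only on the first call in a job run.
      IF UNASSIGNED(FileIsOpen) THEN FileIsOpen = @FALSE
      IF NOT(FileIsOpen) THEN
         OPEN 'CUSTOMER_LKP' TO F.LOOKUP ELSE
            CALL DSLogFatal('Cannot open hashed file CUSTOMER_LKP', 'MyLookupRoutine')
         END
         FileIsOpen = @TRUE
      END

      * Do the read only when this call actually needs the value.
      Ans = ''
      READ Rec FROM F.LOOKUP, Arg1 THEN Ans = Rec<1>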

The routine is still painfully slow, and I suspect it is due to the sheer volume of disk access. Is there a way to allocate a larger disk cache from within a DataBASIC routine? I have tried playing around with the Performance tab in the job properties, but it hasn't helped much.

Not much point in posting the routine as it calls up to 20 different functions, which would be unwieldy to post. Each function opens one to several hashed files (if they are not already open) and reads from them.

I thought about amalgamating them all into one routine, but I'm not sure it would help, since I suspect it is the sheer number of reads that is slowing things down.

Any advice gratefully accepted

Posted: Fri Jun 13, 2008 5:43 am
by ArndW
It probably is the reads. I've sometimes built my own cache in programs of this type by creating a Key and a Data dynamic array for each file. The keys and data are appended as new elements to these strings, and I check the Key array (using LOCATE) before doing an actual read. In many cases this gives a marked improvement in throughput, but it does depend on your data - if the same record is never read twice, you can actually end up with a marked slowdown using this method.
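
Something along these lines - an untested sketch, with the file and variable names invented, and assuming the key/data arrays and the file variable are initialised and held in COMMON between calls:

Code:
      COMMON /LookupCache/ F.LOOKUP, CacheKeys, CacheData

      * Check the in-memory cache before touching the hashed file.
      LOCATE TheKey IN CacheKeys SETTING Pos THEN
         Result = CacheData<Pos>
      END ELSE
         READ Rec FROM F.LOOKUP, TheKey THEN Result = Rec<1> ELSE Result = ''
         * Remember the answer (even a miss) so this key never goes to disk again.
         CacheKeys<-1> = TheKey
         CacheData<-1> = Result
      END

Bear in mind that LOCATE does a linear scan of the dynamic array, so the gain depends on the hit rate being high enough to pay for that scan.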

Posted: Fri Jun 13, 2008 7:42 am
by CharlesNagy
Thanks for the idea. I created a dynamic array for one file, holding the values that had already been read, but as the lookup file has 50,000 records the dynamic array gets quite large, so the job starts off quite fast and then slows down as it progresses.

I don't think there is an easy way around it without analysing and redesigning the process...

Posted: Fri Jun 13, 2008 8:45 am
by ArndW
Of course the DataStage hashed file caching works quite well - but I'm sure you know that and have reasons for not being able to use this.

Depending upon how important this is to you, you can enable memory file caching for DataStage at the system level and use that; it would mean that your hashed file reads never really go to disk. I have used this successfully before and the performance improvements were very significant.

There is a document on hashed file memory caching; I can't recall the name right now, but you can find it in the documentation directory.

Posted: Fri Jun 13, 2008 9:05 am
by chulett
dsdskche.pdf

Posted: Fri Jun 13, 2008 9:10 am
by CharlesNagy
Thanks, I am looking at the document 'Hash Stage Disk Caching' now, if that is the one you meant. I will check & see if this is a possibility....

Posted: Fri Jun 13, 2008 9:21 am
by ArndW
That's the document. I can't recall the exact numbers, but the maximum amount of memory usable for the cache is initially small and can be expanded. There are a couple of "gotchas" with caching: I once managed to get the DS instance so confused that not a single read/write operation worked until we rebooted, but that was down to me playing with files I shouldn't have (I think I was trying to get the Director to respond faster and shoved all of the DS_... files into the cache while 30 developers were trying to do their jobs in the background). I made no friends that day...

Posted: Fri Jun 13, 2008 9:26 am
by CharlesNagy
Thanks guys, much appreciated....

Posted: Fri Jun 13, 2008 3:36 pm
by ray.wurlod
Please note that the cache referred to in this document is not accessible from DataStage routines.

Posted: Fri Jun 13, 2008 4:34 pm
by chulett
D'oh! :cry:

Posted: Sat Jun 14, 2008 1:34 am
by ArndW
The public disk caching is a DataStage engine-level setting; it can be enabled per hashed file and is independent of which method is used to read or write the file, so it will work from BASIC code just as it does in jobs.

I took a quick look and it seems that the "dsdskche.pdf" document is no longer delivered with version 8, but the UV commands are still present, so the facility can still be used.

Oh, I just searched the docs and found that the disk caching information is now in Chapter 25 of the Server Job Developer Guide.

Posted: Mon Jun 16, 2008 1:32 am
by CharlesNagy
Thanks guys, appreciate all the input, I am now marking this resolved...