Problem in getting data from Oracle to DS : Too Slow

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Viswanath
Participant
Posts: 68
Joined: Tue Jul 08, 2003 10:46 pm

Problem in getting data from Oracle to DS : Too Slow

Post by Viswanath »

We are currently using various jobs which use the Oracle 8 stage. We pick up data from staging tables and then insert it into the data warehouse.

The Oracle server is on a UNIX box and an NT box serves as the DataStage server. When the queries that select the data from the Oracle tables run on the Unix box, they take around 1 to 2 seconds. But when they run from the DataStage jobs, the same queries take 1 to 1.5 hours. We have also tried running the same query from the NT box connecting to the Oracle server and found that it takes around 3 minutes, which rules out any network congestion.

Could anybody help me figure out why there is so much difference? The array size in these queries has been set to 1000. Any leads would be a great help.

Cheers
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

Just to be clear, what you are saying is that when you run your DataStage extract job, about one to one and a half hours pass before the first row is processed, is that right? And if you run the same query from a SQL*Plus session on that box it takes three minutes.

This is bad. Try reducing the array size down to something like 100 just to see if this has an impact. Get your Oracle DBA to run a database trace when the job is started to capture the exact SQL that is being executed and to analyse what is happening.
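
For example, the DBA could switch on a SQL trace for the DataStage session and format the trace file with tkprof; this is only a sketch, and the username, SID and serial# below are placeholders you would pick out of V$SESSION:

    -- Find the DataStage connection (username is a placeholder)
    SELECT sid, serial#, program FROM v$session WHERE username = 'DS_USER';
    -- Turn tracing on for that session (42 / 1234 are placeholder values)
    EXECUTE DBMS_SYSTEM.SET_SQL_TRACE_IN_SESSION(42, 1234, TRUE);
    -- ... let the DataStage job run, then switch tracing off again
    EXECUTE DBMS_SYSTEM.SET_SQL_TRACE_IN_SESSION(42, 1234, FALSE);
    -- Format the trace file from the udump directory, e.g.
    -- tkprof ora_1234.trc ds_trace.txt sort=fchela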

Check to make sure there are no before-job routines being run in that job that may be using up that hour before the SQL gets executed.
Viswanath
Participant
Posts: 68
Joined: Tue Jul 08, 2003 10:46 pm

Post by Viswanath »

Two hours is a long time, and we have just had another lead. These jobs actually write the result set into a hash file, as it needs to be used for lookups. The hash file index creation is probably the bottleneck here, because we tried the same job (snippet) writing to sequential files and it finished in under 2 minutes. The overhead is thus in the hash file creation.

Could anybody please tell me what the ideal parameters are that need to be set when creating or using a hash file? Obviously this depends on the number of records and other factors, but is there any baseline that could be adapted to the situation? It would be a great help.

To answer the questions asked: these are just SQL statements, with no before or after subroutines used. The trace is on, and it still gives us inconclusive results and points to the DS processing.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Dig out your Client CD-ROM and look for a directory called something like "Unsupported Utilities". In there, you'll find the HFC - the Hashed File Calculator tool. Use it to find the ideal creation statement for your hash file based on number of records, key types, etc.
-craig

"You can never have too many knives" -- Logan Nine Fingers
Viswanath
Participant
Posts: 68
Joined: Tue Jul 08, 2003 10:46 pm

Post by Viswanath »

Will do that. Thanks a lot.

Cheers
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

If a hash file is sized wrong then your performance can be terrible. Use DYNAMIC files with a minimum modulus unless you want to become a UniVerse DBA. ANALYZE.FILE will tell you the minimum modulus.
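
As a rough sketch, run from the TCL prompt of the project account; the file name and modulus below are just example values:

    CREATE.FILE MyLookupHash DYNAMIC MINIMUM.MODULUS 4093
    ANALYZE.FILE MyLookupHash
    RESIZE MyLookupHash * * * MINIMUM.MODULUS 4093

ANALYZE.FILE reports the current sizing, and RESIZE lets you adjust a dynamic file's minimum modulus after the fact.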
Mamu Kim
Viswanath
Participant
Posts: 68
Joined: Tue Jul 08, 2003 10:46 pm

Post by Viswanath »

Hi All,

Had an extremely tiresome Sunday trying to solve this one. Well, here are the stats. I have gone through the forum completely and got a lot of ideas. Kenneth, if you see this, I read the whole of your reply on hash files and all.

Well, the total number of rows that I am processing is roughly 1,000,000. Now the record length varies, but let's take one of them where the record length is 124. I tried using the 64-bit option and a group size of both 2 and 1. The HFC gave me some values for the modulus and the timing did not get better; it was the same as before. Now, as per the post, the OVER.30 file still remains at 2 KB. Kenneth, I did the whole exercise as you had mentioned, so I guess I have attained the right modulus, but it doesn't work :(

Is there any other factor which I am missing? Is there no way of improving the performance of the hash files? I do agree that an OCI lookup is a better option, but I was assured by the DS technical guys that the number of rows processed should be more than 5000 rows/sec come what may. The maximum I could achieve is 300 rows/second.

Any other stats you guys would need?

Cheers
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

Break your job into two separate jobs. Job #1 spools from the source to a sequential file. Measure your rows/second. Job #2 reads the sequential file and loads into the hash. Measure your rows/second. Job #2 should astound and amaze you. You should see the speed at which DS can take a simple sequential file and slam it into a hash. I've kept mum regarding your post, as I have been letting others help you.

My suspicion is that you haven't read my performance analysis document, in which I lay out the argument that it is better to have two jobs in your situation. You will find that a source database's attention wanders if it can't spool the data in a continuous feed, meaning that the target is busy doing something (like transforming or hashing data or loading another DBMS). If you follow my suggestion, you will probably see that your problem is that your query is spooling at its maximum rate, and that DS could be hashing the data at 10000's of rows a second.

You have not indicated the average bytes/row. The network traffic involved in spooling data is a factor of bytes/second, NEVER rows/second. In fact, all measurements should be judged in bytes/second. You're probably max'ing the single ODBC connection. I'd recommend, once you confirm that your ODBC spool connection is max'ed, that you partition your query with either a MOD or ranging query and run multiple instances of the spooling job #1 to open more connections to the source table, then collate the result files and run job #2 to load the set.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
Viswanath
Participant
Posts: 68
Joined: Tue Jul 08, 2003 10:46 pm

Post by Viswanath »

Hi Kenneth,

I agree with your comment. The way we tested was by replacing the hash files with sequential files, and the result was awesome. As far as I understand (please correct me if I am wrong), a DS job executes from a passive stage to a passive stage. Instead of having two jobs, can I write from the Oracle stage to a sequential file and then link it to the hash file? That way I don't have to change a lot of stuff around. Would the performance improve with this? If not, what is the difference between having two jobs and doing the same thing in a single job? Looks like I might have to get back to redesigning my jobs...

Cheers,
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

You could do OCI --> seq --> hash, but if you wanted to instantiate the OCI --> seq portion to achieve parallel sets of queries spooling results, then having this as an all-in-one job doesn't let you.

Let's say you have a column in the query that you could do a MOD(columnname, 5) on. If you have 5 jobs, each numbered 1 to 5, they could do a MOD(columnname,5) = #myjobnumber#-1 statement. Each job would pull 1/5th of the dataset. Then, concatenate the results into a single sequential file. Then, load that file into a hash.
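
As a sketch, instance 3 of 5 would run something along these lines; the table and column names are placeholders:

    -- Job instance 3 of 5, so #myjobnumber# - 1 = 2
    SELECT cust_key, cust_name, region_code
    FROM   stg_customer
    WHERE  MOD(cust_key, 5) = 2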

You never said how fast the seq --> hash runs. If you have a fully spooled file and run it in this job, is your performance AWESOME? If it is, then you have seen that the hash file was never your problem. Don't be fooled by write-delay cache either; ultimately, the data does have to make it to disk, so your rows/second will stall at the end of the data feed as the cache flushes to disk.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
Viswanath
Participant
Posts: 68
Joined: Tue Jul 08, 2003 10:46 pm

Post by Viswanath »

Hi Kenneth,

The seq --> hash still did not give me good performance, but that was when we thought there was some wrong setting in the hash file. However, now I am going to redesign the whole thing and see if this improves the performance. So I still guess the hash files were my problem.

I will try to change the design and see the result. Thanks, Kenneth, for all the help.

Cheers,
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

How wide is your average row? Did you begin with a minimum modulus? If so, do you realize it takes some time to initially create the hash file? Is the hash file really wide? Let's get some numbers here to help us: how many rows, how many characters per row, how many columns? What is the file system like? Are you spooling on a single physical disk, is it RAID 0/1 or 5, etc.?
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
Viswanath
Participant
Posts: 68
Joined: Tue Jul 08, 2003 10:46 pm

Post by Viswanath »

Hi Kenneth,

As you had asked.

1.) The average row is around 125 bytes long.
2.) I did begin with a modulus of 1, then started increasing it slowly with every run. Finally I got to the point where the OVER.30 file was 2048 bytes.
3.) The number of rows touches around 1,000,000. The number of columns varies from a minimum of 3 to a maximum of 8.
4.) I am indeed spooling to a single disk. No fragmentation present; I confirmed this with my System Admin.

Now I checked the "Allow write to cache" option and found that my job finishes really fast, within 2 minutes. That's the first change I made this morning and it worked. However, every post here is against such an option, especially if the file is read from only once. Also, the same file is being written to and read from. Could anyone tell me if I am doing something that I shouldn't be doing?
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

Now wait a second. If your job is seq --> xfm --> hash, then you should use a write-delay cache when loading this file. A write-delay cache means that while the data is loading into the hash file, it actually uses a memory-mapped file, and only once the file reaches the maximum cache size or the job closes does the file get written to disk. Read caching is only used when referencing the hash file, and that is in another job. How you write the file is irrelevant to how you will use it.

Now, if your job design is writing to and referencing from the hash file, something like seq --> xfm <--> hash, then you have not explained correctly what you're doing in the job. If you are "rolling up" data, then you can't use write-delay caching, as a delayed row is not immediately referenceable unless you use very non-performant write-delay plus read caching.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
eoyylo
Participant
Posts: 57
Joined: Mon Jun 30, 2003 6:56 am

Post by eoyylo »

Hi Cheers,
in the past I resolved performance problems with hash files and I can help you, but it is necessary that you answer 2 questions about the hash file:
How many fields make up the key of the hash file?
And what is the variance of the data in each field of the key? Is it high or low?
I'd like to know count(distinct_value_of_field)/count(records) for each field in the key.
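
A quick way to get those ratios, assuming the data is still sitting in an Oracle staging table; the table and column names are placeholders:

    SELECT COUNT(DISTINCT key_col1) / COUNT(*) AS key_col1_ratio,
           COUNT(DISTINCT key_col2) / COUNT(*) AS key_col2_ratio
    FROM   stg_lookup_source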

Mario