Speed up the job

Post questions here relating to DataStage Server Edition, for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

raju_chvr
Premium Member
Posts: 165
Joined: Sat Sep 27, 2003 9:19 am
Location: USA

Speed up the job

Post by raju_chvr »

Hi folks,

I want to speed up my job, but I can't figure out what is slowing it down. I have an OCI stage bringing in 98,000 rows and there are 6 lookup OCI stages, all of them very small, only a few hundred rows each (this is the reason I did not go with a Hashed File stage).

With the original OCI stage and one look-up OCI stage I get a speed of 500 rows/sec. When I add another look-up OCI stage the rate drops drastically to 62 rows/sec. I have indexes on the proper columns that I am using in the stages, and I have proper settings for Array Size & Transaction handling.

What is it that I am missing here? :?:

I know there are many topics in this forum on this issue and I did search, but I couldn't find one. Excuse me if this topic is a duplicate.

thanks,
Raju
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

You have traded performance for simplicity.

You have a design that does not easily support instantiation. As you run more parallel instances, you will saturate your reference lookup databases with queries.

Since each source row generates a SQL statement for each lookup, you are suffering from a lack of scalability. In addition, were you to try to partition the source data, you'd find that it's more difficult if your source data is selected with a complicated query.

You have no caching capability with database lookups compared to hash lookups. Hash lookups have all kinds of in-memory options that accelerate performance. Hash lookups completely lend themselves to many parallel instantiated jobs, as they incur no network overhead and benefit from local disk access, as well as memory.

You also have a job design that does not lend itself to the database doing parallel queries, because each reference lookup is looking for one and only one row: each lookup is basically "go find me this row where col a = this and col b = that". If you were able to collect the keys from your source data for all the rows you need and put them into a hash file, your lookup-build query would be "go find me all of the rows where col a = this and col b = that for this list of keys". The database can then process that with a parallel query.
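
To make that concrete, here is a rough sketch of the two query shapes (the table and column names are invented purely for illustration):

    -- today: one query per source row, repeated 98,000 times per lookup
    SELECT desc_col
    FROM   ref_table
    WHERE  col_a = :src_a
    AND    col_b = :src_b;

    -- hash file build: one set-based query per reference table,
    -- driven by the distinct keys staged from the source
    SELECT r.col_a, r.col_b, r.desc_col
    FROM   ref_table r
    WHERE  (r.col_a, r.col_b) IN (SELECT col_a, col_b FROM src_keys);

The first form pays the parse, network and fetch cost on every single row; the second pays it once per reference table, and the database is free to run it as a parallel query.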

So, in short, we use hash files to take advantage of a structured approach that attempts to isolate and maximize lookup processing. In addition, smaller and modular jobs are more readily able to take advantage of partitioning source data for instantiated jobs.
Last edited by kcbland on Mon Oct 13, 2003 10:48 am, edited 2 times in total.
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

Raju

This is a great topic. If you have just a few hundred rows in each lookup, then this is a perfect case for a hash lookup. You can build each hash file in a few seconds. If one of these OCI lookups does not have an index, you will be doing a full table scan. All of these hash files can be loaded into memory, which will also speed things up. Trust me, it is worth the effort to go to hash file lookups.

Kim.
Mamu Kim
raju_chvr
Premium Member
Posts: 165
Joined: Sat Sep 27, 2003 9:19 am
Location: USA

Post by raju_chvr »

I understand what you are saying. But one of the look-ups has just 11 rows and another one has just 45 rows. Do you think I should still go for hashed file lookups for these stages? I am not being lazy here :lol: , just wondering how much of a difference it would really make.


The two main stages in this job are the original OCI stage (98,000 rows) and one look-up OCI stage with another 98,000 rows; the rest of the look-ups are small, as mentioned above.

And Kim, I do have proper indexes on the columns being used.

- Raju
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

98,000 source rows, with 6 lookups =

98000 * 6 = 588000


You are doing 588,000 database queries in your job run. The lookups are done serially, one at a time, not simultaneously, so each lookup adds its own overhead to how long it takes to process a single row.

You didn't mention whether the lookups go to an instance on the same hardware as the DataStage Server. If you have a local instance, then you are fighting yourself for resources. If you had hash files, your performance would be independent of the database, because you would have segregated and isolated the transformation work.
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

Raju

Kenneth is correct. The smaller the hash file, the bigger the performance gain. Hash files take milliseconds to look up a key, even with a hundred thousand rows. You have to weigh the performance gain against the time it takes to build the hash file, and small hash files take seconds to load. The lookup time is always faster because there is no connection time to a foreign database. There is nothing faster than a hash file to find a key; it is a lot faster than any index. It is easy to prove if you want to go into the mathematics of hash versus index. A hash file in memory cannot be beat. Try it.
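
To put rough numbers on that (ballpark figures, not measurements): a B-tree index lookup on a 100,000-row table costs on the order of log2(100,000), roughly 17 key comparisons across a few index blocks, plus the SQL parse and the network round trip; a hashed file lookup hashes the key straight to one group and reads it, and with the file cached in memory there is no disk or network access at all.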

Kim.
Mamu Kim
tonystark622
Premium Member
Posts: 483
Joined: Thu Jun 12, 2003 4:47 pm
Location: St. Louis, Missouri USA

Post by tonystark622 »

Raju,

One more vote for the hash file approach from me. I have a job here that went from 10-15 rows/sec to hundreds of rows per second by converting all the DB lookups to hash file lookups. If your database instance is not on the same machine as your DataStage server, you also have the latency of the network connection to contend with. Multiply the small delay of each lookup by the half million lookups you're doing during your job and it adds up to a significant delay.
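
For a rough sense of scale (the 2 ms figure is just an assumption for illustration): 588,000 lookups x 2 ms per network round trip is about 1,176 seconds, roughly 20 minutes spent purely on the wire before the database does any work at all.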

Good Luck,
Tony
louise
Participant
Posts: 5
Joined: Mon Oct 13, 2003 11:20 pm

Post by louise »

Hi,

I have a similar job which has an OCI input with more than 8 lookups, which I have turned into hashed file lookups, but the performance is still bad. Less than 10 rows per sec!

What else can I do to improve the performance?
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

louise

Try taking the original source feed and dumping it straight to a sequential file. Make this sequential file the start of the next job, with all the lookups. This will show you where the bottleneck is.

Kim.
Mamu Kim
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

Louise, it will help if you isolate which part of your job is the bottleneck. One technique is to create 8 temporary copies of your job with a single different lookup in each and run them. Also create a copy with no lookups and make this your baseline speed. You can now build a list of the performance of each of your lookups to identify which are the slow ones. If your baseline speed is still slow then the problem is not with the lookups.

Once you have identified which lookup is slowest, you can try to speed it up. If the lookup comes from the same database as the source data, consider joining the source and the lookup in the database SELECT statement.
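
For example (table and column names invented for illustration), a small code/description lookup can be folded into the source query like this:

    SELECT s.order_id,
           s.status_code,
           r.status_desc
    FROM   source_orders s
    LEFT OUTER JOIN ref_status r
           ON r.status_code = s.status_code;

The database resolves the reference once, set-based, and that lookup stage disappears from the job altogether.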

If your slow lookup is large, either a large number of rows or a large number of columns, then use the hash file config tool on your DataStage client install CD in the unsupported utilities folder. This will give you some configuration settings for your hash file that should speed up performance.