Slow reading from ORACLE

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

rwolancz
Participant
Posts: 14
Joined: Fri May 31, 2002 1:30 am
Location: Japan

Slow reading from ORACLE

Post by rwolancz »

Hi,

We've got a table with ~200 columns that is read into a hash file, and we have not seen it run better than ~300 rows per second.

What performance should I expect reading ~200 columns?

We have tried different array sizes. I cannot think of anything else; there is not much else I can play with, is there?

Thanks,
Robert
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

I seriously doubt that you need 200 columns in a hashed file. The function of a hashed file is to service reference lookups fast. Typically a hashed file has one or two key columns, and a very few non-key columns that are returned for the particular row for which the lookup was successful.
Large rows (particularly rows exceeding 1628 bytes including storage overheads) are unwieldy for hashed files, because of the way data are stored. These will necessarily run slowly, because at least one new disk buffer must be allocated for each such oversized record - the system spends a lot of its time waiting for the request to extend the OVER.30 file to return.
Scrutinize your design with an intense scruit. How many of the columns in the hashed file are actually used by your design? Omit all the others.
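
For illustration only (the table and column names below are invented), the extract feeding the hashed file could be trimmed to the key plus the few non-key columns the lookup actually returns, rather than all ~200:

-- Hypothetical trimmed extract: key column plus only the non-key
-- columns the reference lookup actually needs to return.
SELECT customer_id,        -- hashed file key
       status,             -- non-key column used downstream
       credit_limit        -- non-key column used downstream
FROM   source_table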

Ray Wurlod
Education and Consulting Services
ABN 57 092 448 518
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

"Scutinize your design with an intense scruit". [:D] Me likie.
rwolancz
Participant
Posts: 14
Joined: Fri May 31, 2002 1:30 am
Location: Japan

Post by rwolancz »

Ray, before seeing your note I had done a few tests.

1. I tried running it with ~100 columns; performance went up to ~1500 rps.

2. Running the original job with 200 columns from Oracle to a DOS NUL file - performance was still slow, ~300 rps!

3. Moving the same 200 columns from HASH to HASH - ~300 rps.

It seems to me that DS simply slows down as the number of columns grows, regardless of where they go or come from. Can it be tuned (without changing data structures)?

Re your point.
Yes, it is a lot of columns, but in our case we do need all of them in the hash. That's how big our source is. Not all of them are used now, but it is part of a rather generic module for a large system. The system grows and evolves with each new implementation, and we simply do not know what will be used in the future. It seemed a rather big hassle to keep adding columns as we go, and to go back and modify data structures wherever they were used before. But nothing's set in concrete; suggestions are welcome ...

Thanks,
Robert
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

Your next and probably most productive solution is to use job instantiation. Find a way to partition your source data. My favorite trick is to add MOD(some_numeric_column, #NumberOfJobInstances#) = #ThisJobInstance# - 1 to the WHERE clause of the query and send the output to a sequential text file named something like FRED_TABLE_#ThisJobInstance#.txt, as sketched below.
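
As a sketch only (FRED_TABLE and some_numeric_column stand in for your real table and a reasonably evenly distributed numeric column, and #NumberOfJobInstances# / #ThisJobInstance# are assumed to be job parameters that DataStage substitutes into the user-defined SQL at run time), each instance's query would look something like:

-- Each job instance reads only its own slice of the source table.
SELECT *
FROM   FRED_TABLE
WHERE  MOD(some_numeric_column, #NumberOfJobInstances#) = #ThisJobInstance# - 1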

Your problem is that there is a finite number of bytes per second that Oracle can spew out of the single-threaded slave that is passing data to the DataStage job. The easiest way to improve performance by multiples is to divide the result set and have many DataStage jobs capturing the results. If you set up 10 instances of the same job, each with an assigned instance number 1 through 10, the above WHERE clause should give you an even distribution of rows to each job. You should theoretically scale to 10x faster, or in your case 3000 rows/s. You should achieve linear scaling until you reach the CPU maximum on the Oracle server; I doubt your transformation server will bottleneck, because all the DataStage jobs are doing is spooling the data to sequential files.

Since your original design has the query spooling to a hash file, you incur hashing overhead. You will find that there is a reduction in throughput once your DataStage jobs exceed available CPU power. In this case, I would never do what you are doing. I always spool to a file in one job, then read that file into the hash file in a second job. This allows me to set the number of job instances in the querying job to the value that the source database can handle. (On Oracle, working through a dblink, there is a finite number of bytes per second that a dblink can handle, so if you're doing this you will find that you top out at N job instances even though the source Oracle server has excess CPU power.) Then, picking up the file and loading it into the hash file, you can adjust the number of job instances to achieve the performance desired.

This all applies to a general concept of small modular jobs. The smaller and more focused each job is (limiting yourself to database to sequential file, sequential file to either sequential file or hash file, and sequential file to database), the more you will be able to use job instantiation. Job instantiation is the ONLY technique, in my experience, that will give performance improvements in MULTIPLES. Tweaking the BASIC syntax or noodling array sizes does not get you where you need to be performance-wise. If you're loading data into a target, all database platforms supply high-performance loaders. It's unfortunate there are few high-performance query unloaders.

Good luck!


Kenneth Bland
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

I also missed that you're on NT. You're going to have to watch Performance Monitor on your DataStage NT server. You have to be cognizant of the CPU utilization of your job. If the job is maxing out a CPU, then it's running as fast as it can. If the database is only able to spew at 300 rows/sec, then instantiation is your trick to multiply your net throughput. I hope you're not hitting an Oracle instance on your DataStage server; if so, you're fighting yourself for resources.

Good luck!

Kenneth Bland