I am running a multi-instance job (3 instances); each instance creates a 500,000-record sequential file from a Sybase OC stage. The data coming across has 12 columns of basic data like name, zip, phone, email, etc. No transformations are going on. The job looks like this: [Sybase OC stage → Sequential File stage]
The multi-instance job is running in a sequence loop and consumes 75% of CPU when all 3 instances are going. That's basically 25% CPU per instance.
Our DEV Server has 2 physical and 4 logical processors.
My question is: how can the performance be improved?
You've maximized the output of one job - it consumes an entire CPU. You've got one leftover CPU not doing much. How about running 4 instances instead of 3? If 3 gets you to 75%, I'd bet 4 gets you to 100%. Then add more CPUs and get to 8 cores, and you can run 8 instances as long as your source database doesn't run out of capacity.
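The advice above can be sketched in a few lines: size the number of running instances to the number of logical CPUs instead of hard-coding 3 or 4. This is a hypothetical illustration (the function names are made up, and a real Server job instance would be a separate process launched by the sequence, not a thread):

```python
import os
from concurrent.futures import ThreadPoolExecutor

def extract_partition(instance_id: int) -> int:
    # Stand-in for one multi-instance run (Sybase read -> sequential file).
    # It just returns the instance id to keep the sketch runnable.
    return instance_id

def run_all_instances() -> list:
    # One instance per logical processor, whatever the box has.
    n_instances = os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=n_instances) as pool:
        return list(pool.map(extract_partition, range(n_instances)))
```

On the 4-logical-processor DEV box described above, this would launch 4 instances; move to an 8-core box and the same code launches 8, without editing the job.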
Kenneth Bland
Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
First, utilize all of the CPU power available to you. There is no saving CPU power for later. Once your CPUs are fully utilized, your next task is to make the work being done more efficient. Optimize any hashed lookups so that the CPUs spend less time figuring out where the data is and more time transforming. Make sure any functions called for transformation are efficient and not wasteful. You will also need to ensure that the work being performed has few redundant operations: if the same value is derived multiple times for different output columns, derive it once into a stage variable and use the stage variable result in the column derivations.
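The stage-variable point can be shown outside DataStage BASIC with a small Python sketch (the row shape and derivation are invented for illustration): derive an expensive value once per row, then reuse it in every column derivation instead of recomputing it.

```python
def expensive_derivation(raw: str) -> str:
    # Stand-in for a costly transform (e.g. a lookup or string parse).
    return raw.strip().upper()

def transform_row_wasteful(row: dict) -> dict:
    # The same derivation runs three times per row - redundant work.
    return {
        "col_a": expensive_derivation(row["name"]) + "_A",
        "col_b": expensive_derivation(row["name"]) + "_B",
        "col_c": expensive_derivation(row["name"]) + "_C",
    }

def transform_row_efficient(row: dict) -> dict:
    # "Stage variable": derive once, reuse in each column derivation.
    stage_var = expensive_derivation(row["name"])
    return {
        "col_a": stage_var + "_A",
        "col_b": stage_var + "_B",
        "col_c": stage_var + "_C",
    }
```

Both produce identical output; the second simply does a third of the derivation work per row, which adds up over 500,000 rows.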
Kenneth Bland
There's nothing to elaborate. You are trying to kill the server - it's your goal as a data integration programmer. You don't care about reserving power for others. You are the center of the universe - program accordingly. Building inefficient processes to "play nice" is not the idea. You're supposed to use all available resources to get the job done.
If you need to scale back your processes, then build throttling mechanisms to limit processing. Otherwise, when you want your process to move data as quickly as possible your process MUST be able to use all resources.
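A throttling mechanism of the kind described can be as simple as a semaphore that caps how many workers run at once. This is a minimal sketch (the worker body and the limit of 2 are arbitrary), not a DataStage feature:

```python
import threading

MAX_CONCURRENT = 2                    # throttle: at most 2 workers at a time
throttle = threading.BoundedSemaphore(MAX_CONCURRENT)
results = []
results_lock = threading.Lock()

def worker(instance_id: int) -> None:
    with throttle:                    # blocks while MAX_CONCURRENT are busy
        # Stand-in for one unit of extract work.
        with results_lock:
            results.append(instance_id)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(6)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

All 6 workers still complete; the semaphore just stops more than 2 from holding resources simultaneously, which is the deliberate "scale back" as opposed to building slowness into the job itself.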
Partitioned parallelism is the only viable method for massively processing large volumes of data quickly. PX, Ab Initio, and even Server jobs run as multiple instances all use this concept.
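The shared concept looks roughly like this sketch: split rows by a hash of a key column so each partition can be processed independently, then combine the results. Threads stand in here for the separate processes a real engine would run on separate CPUs, and the key column and per-partition work are invented for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

N_PARTITIONS = 4

def partition_of(key: str) -> int:
    # Deterministic hash partitioning on the key column. (Python's built-in
    # hash() is randomized per run, so a stable byte-sum hash is used.)
    return sum(key.encode()) % N_PARTITIONS

def process_partition(rows: list) -> int:
    # Stand-in for per-partition work; here it just counts rows.
    return len(rows)

def run_partitioned(rows: list) -> int:
    partitions = [[] for _ in range(N_PARTITIONS)]
    for row in rows:
        partitions[partition_of(row["zip"])].append(row)
    with ThreadPoolExecutor(max_workers=N_PARTITIONS) as pool:
        return sum(pool.map(process_partition, partitions))
```

Because the same key always lands in the same partition, each worker can process its slice with no coordination - which is what lets the work scale out across instances or CPUs.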
Kenneth Bland
You get one CPU second per second per CPU. It can't be saved, so it may as well be used. %Idle should be as close to 0 as possible most of the time in an optimally tuned system; I usually strive for somewhere between 0.1% and 0.5%, unless planning for future (known) increased demand. System processes will pre-empt user processes if they need to, because they run with higher priority. DataStage jobs, being background processes, run with low priority.
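On POSIX systems the "background processes run with low priority" part corresponds to the nice value, which a process can raise for itself. A minimal sketch (assuming a POSIX platform; unprivileged processes can only lower their priority, not raise it back):

```python
import os

def start_low_priority() -> int:
    # Raise the nice value by 10 -> lower scheduling priority, so
    # higher-priority system work pre-empts this process first.
    os.nice(10)
    return os.nice(0)   # nice(0) is a no-op that reads back the current value
```

This changes only the calling process; a job scheduler would typically apply it in each child it spawns.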
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Relying on CPU as the metric of performance in an I/O-intensive job is probably missing the point.
The process spends time waiting for I/O operations to complete. While it is waiting it cannot consume CPU, so the CPU goes off and services other processes or, if none is ready to execute, chalks up some cycles against the "Idle" process. All those cycles have to be accounted for somewhere, if only so that they add up properly to 100%.
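One quick way to see this effect is to compare CPU time with wall-clock time for the same stretch of work: while a process waits on I/O it accrues wall time but almost no CPU time. A small sketch, using `sleep` as a stand-in for an I/O wait:

```python
import time

def measure(fn):
    # Return (wall seconds, CPU seconds) consumed by fn().
    t0_wall = time.perf_counter()
    t0_cpu = time.process_time()
    fn()
    return time.perf_counter() - t0_wall, time.process_time() - t0_cpu

wall, cpu = measure(lambda: time.sleep(0.2))
# wall is about 0.2 s, but cpu stays near zero - the "missing" cycles went
# to other processes or to the Idle process while this one was blocked.
```

For an I/O-bound extract job, a large wall/CPU gap like this is the tell that adding CPUs won't help; faster I/O or more concurrent readers will.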