Performance Issue

Post questions here related to DataStage Server Edition, covering areas such as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

rodre
Premium Member
Posts: 218
Joined: Wed Mar 01, 2006 1:28 pm
Location: Tennessee

Performance Issue

Post by rodre »

Performance Issue:

I am running a multi-instance job (3 instances); each instance extracts 500,000 records from a Sybase database through the Sybase OC stage and writes them to a sequential file. The data coming across has 12 columns of basic data like name, zip, phone, email, etc. No transformations are going on. The job looks like this:

Code:

Sybase OC----->Transf----->SequentialFile


The multi-instance job runs in a sequence loop and consumes 75% of CPU usage when all 3 instances are going. That is basically 25% CPU per instance.

Our DEV Server has 2 physical and 4 logical processors.

My question is: how can the performance be improved?

Thank you in advance for your help!! :)
DSguru2B
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

What "Issue" are you having with the performance?
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

A preparatory question - what is "performance" in an ETL context?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

You've maximized the output of one job - it consumes an entire CPU (each instance pegs one of your 4 logical processors, which is where the 25% per instance comes from). You've got one leftover CPU not doing much. How about running 4 instances instead of 3? If 3 gets you to 75%, I'd bet 4 gets you to 100%. Then add more CPUs to get to 8 cores, and you can run 8 instances as long as your source database doesn't run out of capacity.
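For a concrete picture, starting the extra instances from job control could look something like this minimal DS BASIC sketch (the job name "ExtractJob" and the ".inst" invocation naming are invented for illustration, not the poster's actual job):

Code:

* Job-control sketch: start N instances of a multi-instance job.
* "ExtractJob" and the ".inst" naming are assumptions for illustration.
NumInstances = 4
Dim hJob(NumInstances)
For i = 1 To NumInstances
   hJob(i) = DSAttachJob("ExtractJob.inst" : i, DSJ.ERRFATAL)
   ErrCode = DSRunJob(hJob(i), DSJ.RUNNORMAL)
Next i
* Block until every instance has finished
For i = 1 To NumInstances
   ErrCode = DSWaitForJob(hJob(i))
Next i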
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

First utilize all of your available CPU power. There is no saving CPU power for later. Once your CPUs are fully utilized, your next task will be to make the work being done more efficient. You will need to optimize any hashed lookups so that the CPUs spend less time figuring out where the data is and more time transforming. You want to make sure any functions called for transformation are efficient and not wasteful. You will also need to ensure that the work being performed has few redundant operations. If the same value is derived multiple times for different output columns, derive it once into a stage variable and use the stage variable result in the column derivations.
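As a sketch of that last point (the link, column, and stage variable names here are invented for illustration):

Code:

* Before: the same expression is evaluated twice per row.
*   OutLink.PHONE_CLEAN = Oconv(InLink.PHONE, "MCN")
*   OutLink.PHONE_AREA  = Oconv(InLink.PHONE, "MCN")[1,3]

* After: derive once into a stage variable, reuse it in both columns.
svPhoneDigits = Oconv(InLink.PHONE, "MCN")  ;* keep numeric characters only
*   OutLink.PHONE_CLEAN = svPhoneDigits
*   OutLink.PHONE_AREA  = svPhoneDigits[1,3]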
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

There's nothing to elaborate on. You are trying to kill the server - it's your goal as a data integration programmer. You don't care about reserving power for others. You are the center of the universe - program accordingly. Building inefficient processes to "play nice" is not the idea. You're supposed to use all available resources to get the job done.

If you need to scale back your processes, then build throttling mechanisms to limit processing. Otherwise, when you want your process to move data as quickly as possible, it MUST be able to use all resources.
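A throttle can be as simple as capping how many instances run at once. A hedged job-control sketch, with the limit and job name made up for illustration:

Code:

* Sketch: never let more than MaxRunning instances run at once.
MaxRunning = 2      ;* throttle limit, made up for illustration
Total = 4
Dim hJob(Total)
For i = 1 To Total
   hJob(i) = DSAttachJob("ExtractJob.inst" : i, DSJ.ERRFATAL)
Next i
Started = 0
Loop While Started < Total Do
   Running = 0
   For i = 1 To Started
      If DSGetJobInfo(hJob(i), DSJ.JOBSTATUS) = DSJS.RUNNING Then
         Running = Running + 1
      End
   Next i
   If Running < MaxRunning Then
      Started = Started + 1
      ErrCode = DSRunJob(hJob(Started), DSJ.RUNNORMAL)
   End Else
      Sleep 5   ;* wait before re-checking
   End
Repeat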

Partitioned parallelism is the only viable method for processing large volumes of data quickly. PX, Ab Initio, even Server with multi-instance jobs use this concept.
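In a Server multi-instance job that partitioning typically shows up as a Transformer constraint, so each instance keeps only its own slice of the rows. A sketch with hypothetical column and job parameter names (PartNum runs from 1 to NumParts):

Code:

* Constraint for instance PartNum of NumParts: keep only this slice.
Mod(InLink.CUST_ID, NumParts) = PartNum - 1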
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

You get one CPU second per second per CPU. It can't be saved, so it may as well be used. %Idle should be as close to 0 as possible most of the time in an optimally tuned system; I usually strive for somewhere between 0.1% and 0.5%, unless planning for future (known) increased demand. System processes will pre-empt user processes if they need to, because they run with higher priority. DataStage jobs, being background processes, run with low priority.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
John Smith
Charter Member
Posts: 193
Joined: Tue Sep 05, 2006 8:01 pm
Location: Australia

Post by John Smith »

rodre wrote:
My question is: how can the performance be improved?

hire a consultant to help you tune your jobs? 8)
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Even that may not help.

Relying on CPU as the metric of performance in an I/O intensive job is probably missing the point.

The process spends time waiting for I/O operations to complete. While it is waiting it cannot consume CPU, so the CPU goes off and services other processes or, if none is ready to execute, chalks up some cycles against the "Idle" process. All those cycles have to be accounted for somewhere, if only so that they add up properly to 100%.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.