IPC versus multi instance jobs

pajj · Post by **pajj** » Thu May 03, 2007 8:31 am

Is there a benefit to running multi-instance jobs that process dynamically partitioned data without IPC enabled, versus running a single job with IPC enabled and using link partitioning?

kris · Post by **kris** » Thu May 03, 2007 12:26 pm

pajj wrote:Is there a benefit to running multi-instance jobs that process dynamically partitioned data without IPC enabled, versus running a single job with IPC enabled and using link partitioning?

There is a significant difference between the two approaches.

1. Enabling IPC for a job (row buffering turned on, or an IPC stage in the job) lets the job run a separate process for each active stage, which gives you some performance boost (how much depends on whether you use in-process or inter-process buffering).

2. Running multiple instances of a job on partitioned input is more of a divide-and-conquer approach. The instances run in parallel, much like multiple threads of one process, so they use more system resources and finish the work a lot faster than a single job running as one single-threaded process. How many instances you can run depends on the configuration of your server.

Depending on your requirements, you choose one of these two approaches.

Kris~

JoshGeorge · Post by **JoshGeorge** » Sat May 05, 2007 12:18 am

When you do it in a single job with IPC, the advantage is in the connectivity you make (especially to a database, if you are doing bulk loading). It is also easier to investigate from a maintenance point of view.

eostic · Post by **eostic** » Tue May 08, 2007 9:29 pm

More on how they are vastly different... multi-instancing means launching a "whole new job", and you control the degree of "parallel activity" (words chosen carefully there) by your source definition. For instance, you could launch two instances with MQSeries as the source, each getting a different QueueName as a job parameter, or have Oracle as the source, with each instance getting a different value or range in a WHERE clause. Each "job instance" that you launch may itself be running in one or more processes depending on its topology. Certainly each instance could then have (for Server) its own settings for IPC.
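The WHERE-clause idea can be sketched roughly as follows. This is only an illustration in Python, assuming an integer key column; the function name `range_predicates` and the column `CUST_ID` are hypothetical and not part of DataStage. Each resulting string would be handed to one job instance as a job parameter value.

```python
def range_predicates(column, low, high, instances):
    """Split the inclusive integer range [low, high] into contiguous,
    non-overlapping predicates, one per job instance."""
    if instances < 1:
        raise ValueError("instances must be at least 1")
    if high < low:
        raise ValueError("high must not be smaller than low")

    total = high - low + 1
    base, extra = divmod(total, instances)

    predicates = []
    start = low
    for i in range(instances):
        # The first `extra` instances take one additional key each.
        size = base + (1 if i < extra else 0)
        if size == 0:
            break  # more instances requested than there are keys
        end = start + size - 1
        predicates.append(f"{column} BETWEEN {start} AND {end}")
        start = end + 1
    return predicates


if __name__ == "__main__":
    # Hypothetical column name and key range, three instances.
    for clause in range_predicates("CUST_ID", 1, 10, 3):
        print(clause)
```

Contiguous ranges keep every source row in exactly one instance, which is what makes the instances safe to run side by side.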

The features for IPC (intra and inter process, or using the IPC stage itself) result in separate processes WITHIN THE SAME JOB for each of the stages. There are some rules as to where the boundaries are placed, but basically IPC is giving you a certain level of "pipelining" (moving chunks of data thru the job, each stage working concurrently). That is fine, but don't ever try to just "turn it on" without thinking about it. It also alters the "row by row" behavior that you may be depending on in your job. Imagine a job that does a lookup near the source, and if that lookup fails, a flag is set, and then 10 stages later, towards the end of the job, a row is inserted into the original lookup table. If you want to ensure that the VERY NEXT row from the source FINDS the newly inserted row, then you CANNOT use IPC --- with IPC turned on, the second row will likely never find the lookup, because it will have been thru the lookup before the first row gets inserted (one way to think of it is that the buffers are following each other more closely with IPC). So...it could be a great performance boost --- or could kill your job logic. Use it wisely and carefully, and even in conjunction with multi-instancing, once you understand each of the concepts.
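The lookup-then-insert hazard can be modelled with a small deterministic simulation. This is a simplified sketch in Python, not how the DataStage engine actually schedules stages: `run_job` and `buffer_size` are hypothetical names, and a buffer of N rows stands in for the lookup stage running ahead of the insert stage. Real timing depends on buffer settings and process scheduling.

```python
def run_job(source_keys, lookup_table, buffer_size):
    """Simulate a lookup stage near the source and an insert stage at the end.

    buffer_size=1 models strict row-by-row processing. A larger value models
    row buffering, where the lookup stage handles a whole buffer of rows
    before the insert stage sees any of them.
    Returns a list of (key, found_in_lookup) pairs in source order.
    """
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")

    table = set(lookup_table)
    results = []
    for i in range(0, len(source_keys), buffer_size):
        chunk = source_keys[i:i + buffer_size]

        # Upstream stage: look up every row in the buffer first.
        flagged = [(key, key in table) for key in chunk]

        # Downstream stage: insert the keys whose lookup failed.
        for key, found in flagged:
            results.append((key, found))
            if not found:
                table.add(key)
    return results


if __name__ == "__main__":
    rows = ["A", "A", "B"]
    print(run_job(rows, [], buffer_size=1))  # second "A" finds the inserted row
    print(run_job(rows, [], buffer_size=2))  # second "A" misses it
```

With `buffer_size=1` the second "A" finds the row inserted for the first one; with `buffer_size=2` both "A" rows pass the lookup before either insert happens, so the second one misses.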

Ernie

pajj · Post by **pajj** » Wed May 09, 2007 5:32 am

eostic wrote: Imagine a job that does a lookup near the source, and if that lookup fails, a flag is set, and then 10 stages later, towards the end of the job, a row is inserted into the original lookup table.
Ernie

Using the job design you describe, would that not also occur in a PX job stream due to pipelining?

eostic · Post by **eostic** » Wed May 09, 2007 2:54 pm

Yes... a very interesting topic. A job that requires that exact type of functionality and flow requires Server. Of course, there may be better ways in EE to "skin the cat" (i.e., an alternative technique overall).

Ernie

DSXchange
