Possibility of running many Multiple Instances concurrently

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Post Reply
vnspn
Participant
Posts: 165
Joined: Mon Feb 12, 2007 11:42 am

Possibility of running many Multiple Instances concurrently

Post by vnspn »

Hi,

I had posted here before about processing a huge source file containing 40 million records. It takes a very long time to process.

Now it looks like we will be getting the source file broken down into 20 files of approximately equal size. I would like to know if we can use the "Multiple Instance" job property to process all 20 files concurrently.

As each Job instance is going to process its own chunk of data, could DataStage's memory support this kind of multiple processing? Could it support 20 Jobs each processing its own set of records simultaneously?

Would there be degradation in performance? Would DataStage have enough memory to support 20 Jobs in parallel?

Our version is 7.1 and the server is on Windows.

Thanks.
DSguru2B
Charter Member
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

The best way to find out is to try it. Run, say, 10 instances at the same time. If you are satisfied, start increasing the number of simultaneous instances. Stop at the point where you find that the process is struggling. Experiment. No one can tell you for certain, as only you know your system best.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

That is a very dangerous generalization to make. If you include complex transformation rules in the data flow, CPU consumption will increase. If you include sorting, both CPU and memory consumption will increase. If you use many cached hashed files, CPU and memory consumption will increase. If you use inter-process row buffering (or even in-process row buffering), memory consumption will increase. Demanding too much memory results in paging, with consequent (relative) slowness.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
kduke
Charter Member
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

I have seen more than 10, but I put sleeps in between, because you can overwhelm the CPU with all 20 starting one after another. Use 10 to 30 seconds for the sleeps. I would bet 20 jobs is easily possible.
Mamu Kim
vnspn
Participant
Posts: 165
Joined: Mon Feb 12, 2007 11:42 am

Post by vnspn »

Hmm... this seems like a good idea: starting each of the instances with a gap of 10 to 30 seconds.

How can I make it sleep for 30 seconds before the next instance starts? Is there a routine to make the Job Sequence sleep?
DSguru2B
Charter Member
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

You can write a short before-job subroutine that checks which instance it is and pauses accordingly. The SLEEP command can be used.
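An alternative to a before-job subroutine is to stagger the launches from a controlling script. The sketch below assumes the dsjob command-line interface; the project, job, parameter, and invocation-id names are hypothetical, and the exact dsjob syntax may differ by release, so check the documentation for your version.

```python
import subprocess
import time

def staggered_delays(n_instances, gap_seconds=30):
    """Return the start delay (in seconds) for each instance,
    spacing launches gap_seconds apart as suggested above."""
    return [i * gap_seconds for i in range(n_instances)]

def launch_instances(project, job, n_instances, gap_seconds=30):
    """Launch one invocation of a multi-instance job per input file,
    sleeping between starts so the CPU is not overwhelmed."""
    for i in range(n_instances):
        if i > 0:
            time.sleep(gap_seconds)  # pause before the next instance starts
        invocation_id = "part%02d" % (i + 1)
        # hypothetical parameter name; dsjob syntax may vary by release
        subprocess.run(["dsjob", "-run",
                        "-param", "SourceFile=source_%02d.txt" % (i + 1),
                        project, "%s.%s" % (job, invocation_id)])
```

With a 30-second gap, instance starts fall at 0, 30, 60, ... seconds, which spreads the startup load in the way kduke describes.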
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
byk
Participant
Posts: 8
Joined: Wed Mar 07, 2007 8:43 am

Re: Possibility of running many Multiple Instances concurrent

Post by byk »

vnspn wrote: I had posted here before on processing a huge source file containing 40 million records. ... Our version is 7.1 and the server is on Windows.
The individual instances may run a bit slower, but overall there will be a performance benefit. The optimal number of parallel instances has to be found by trial and error and will primarily depend on:
1. Number of CPUs
2. Amount of memory (RAM + swap space) available to DataStage

I am assuming that you have ensured the independence of the instances that you plan to run in parallel.

I have successfully tried this strategy running 6 parallel threads (different independent jobs) and got a time gain of 60%!
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

A job of SEQ --> XFM --> SEQ should use one CPU completely. If you have 8 CPUs, then you should not exceed the number of CPUs by much. You ALWAYS need to measure the requirements/impact of a single instance, then use that to determine the maximum number of instances your server will sustain.
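Kenneth's rule of thumb can be turned into a quick back-of-the-envelope estimate. This is only a sketch with illustrative numbers, not a substitute for measuring your own jobs; the headroom factor is an assumption.

```python
def max_instances(cpu_count, cpu_per_instance, headroom=0.8):
    """Estimate how many instances the server can sustain, leaving
    some headroom for the OS and other processes.

    cpu_per_instance is the measured CPU fraction one instance uses:
    1.0 means it fully occupies one CPU, as a simple
    SEQ --> XFM --> SEQ job tends to do."""
    return int((cpu_count * headroom) // cpu_per_instance)

# e.g. 8 CPUs, each instance saturating one CPU:
# max_instances(8, 1.0) -> 6
```

If measurement shows each instance uses only half a CPU (waiting on I/O, say), the same formula suggests roughly twice as many instances will fit.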
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
vnspn
Participant
Posts: 165
Joined: Mon Feb 12, 2007 11:42 am

Post by vnspn »

byk,

I have a couple of questions based on your reply.

What do you mean by "swap space"? Would we need to specify the amount of memory that can be consumed by DataStage during the installation of the Server?

What do you mean by "ensure independence of the instances"? A single Job is going to run in multiple instances for different data. Are you talking about the way it would be called in the Job Sequence?

Thanks.
byk
Participant
Posts: 8
Joined: Wed Mar 07, 2007 8:43 am

Post by byk »

vnspn,

I am not sure whether you can specify/restrict the memory usage during installation, but it is good to provision a generous amount of memory (3-4 GB) and page file space.
Ensuring independence (which may be implicit in your case) basically means ensuring that the jobs that are to run in parallel are in no way dependent on each other.
eostic
Premium Member
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

...(on ensuring independence)... like being certain that two instances don't try to write to the same sequential file, or do conflicting things against the same target RDBMS table, etc. You may already be familiar with the built-in Job Parm #DSJobInvocationId#, but others reading this thread may find it useful, as it's not that well documented. FYI --- for those of you with RTI, these multi-instance job issues are critical, as RTI uses this feature internally to spawn multiple "always on" instances of a job that are waiting for input.
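One common way to keep instances independent is to fold the invocation id into every file path the job touches, so no two instances ever share a sequential file. A minimal sketch of that naming scheme (the directory, stem, and extensions here are hypothetical; in the job itself you would build the same paths from #DSJobInvocationId#):

```python
def instance_paths(base_dir, stem, invocation_id):
    """Derive per-instance file names from the invocation id so that
    two instances never write to the same sequential file."""
    return {
        "input":  "%s/%s_%s.in"  % (base_dir, stem, invocation_id),
        "output": "%s/%s_%s.out" % (base_dir, stem, invocation_id),
        "reject": "%s/%s_%s.rej" % (base_dir, stem, invocation_id),
    }
```

Database targets need the same care: either partition the rows so instances never touch the same keys, or land per-instance files and load them in one controlled step afterwards.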

Ernie
pajj
Participant
Posts: 11
Joined: Fri Jun 16, 2006 12:27 pm

Post by pajj »

This scenario is similar to a current project I am working on.

Will using link partitioning obtain the same, worse or better results than a separate job instance?

What is the benefit of one method over the other?

tks
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Worse. The Link Partitioner uses inter-process communication (IPC), which introduces an additional possible point of failure.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
katz
Charter Member
Charter Member
Posts: 52
Joined: Thu Jan 20, 2005 8:13 am

Post by katz »

An alternative to running multi-instance jobs in parallel on separate input files could be to start multiple readers of the same source file in the same job (the source is a sequential file?) and use awk as a filter to read a subset of rows from the single input file. Awk is available for Windows.

For example, the expression awk 'NR%2 == 0' would read the even-numbered rows from the file and awk 'NR%2 == 1' would read the odd-numbered rows. Both file reads can be performed in parallel, in the same job. By extending this approach you can set up any number of concurrent readers, each reading a unique set of input rows from the same file.
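For readers who want to see the modulo split spelled out, here is an equivalent sketch in Python (not the awk filter itself, just the same row-selection logic):

```python
def modulo_split(lines, n_readers, reader_index):
    """Yield the subset of rows this reader should process,
    mirroring awk 'NR % n == k'.  awk numbers rows from 1,
    so the row counter starts at 1 here too."""
    for row_number, line in enumerate(lines, start=1):
        if row_number % n_readers == reader_index:
            yield line

rows = ["r1", "r2", "r3", "r4", "r5"]
even = list(modulo_split(rows, 2, 0))  # rows 2 and 4, like awk 'NR%2 == 0'
odd  = list(modulo_split(rows, 2, 1))  # rows 1, 3, 5, like awk 'NR%2 == 1'
```

With n_readers set to 20 and reader_index 0..19, every row lands with exactly one reader and the union of all readers covers the whole file.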

After this point you still need to determine if all the downstream processes must/can be handled in parallel (the simple case) or if it is possible that some of the parallel streams can be merged for processing to preserve server resources.

The optimal number of readers and number of subsequent parallel processes to create will depend on the capacity of your server and the type/complexity of the activities that you need to perform (you have not given much information in this regard). When in doubt, test it.

The method described here does not obviate any of the previously posted comments, but is offered as an alternative for creating a parallel read of the data without the complexities of having 20 separate input files.

Good luck with whatever approach you take.
katz
Post Reply