Possibility of running many Multiple Instances concurrently

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Post Reply
vnspn
Participant
Posts: 165
Joined: Mon Feb 12, 2007 11:42 am

Possibility of running many Multiple Instances concurrently

Post by vnspn »

Hi,

I had posted here before about processing a huge source file containing 40 million records. It takes a very long time to process.

Now it looks like we will be getting the source file broken down into 20 files of approximately equal size. I would like to know if we can use the "Multiple Instance" job property to process all 20 files concurrently.

As each Job instance is going to process its own chunk of data, could DataStage's memory support this kind of multiple processing? Could it support 20 Jobs each processing its own set of records simultaneously?

Would there be degradation in performance? Would DataStage have enough memory to support 20 Jobs in parallel?

Our version is 7.1 and the server is on Windows.

Thanks.
DSguru2B
Charter Member
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

The best way to find out is to try it. Run, say, 10 instances at the same time. If you are satisfied, start increasing the number of simultaneous instances. Stop at the point where you find that the process is struggling. Experiment. No one can tell you for certain, as only you know your system best.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

That is a very dangerous generalization to make. If you include complex transformation rules in the data flow, CPU consumption will increase. If you include sorting, both CPU and memory consumption will increase. If you use many cached hashed files, CPU and memory consumption will increase. If you use inter-process row buffering (or even in-process row buffering), memory consumption will increase. Demanding too much memory results in paging, with consequent (relative) slowness.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
kduke
Charter Member
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

I have seen more than 10, but I put sleeps in between, because you can overwhelm the CPU with all 20 starting one after another. Use 10 to 30 seconds for the sleeps. I would bet 20 jobs is easily possible.
Mamu Kim
vnspn
Participant
Posts: 165
Joined: Mon Feb 12, 2007 11:42 am

Post by vnspn »

Hmm... this seems like a good idea: starting each of the instances with a gap of 10 to 30 seconds.

How can I make it sleep for 30 seconds before the next instance starts? Is there a routine to make the Job Sequence sleep?
DSguru2B
Charter Member
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

You can write a short before-job subroutine that checks which instance it is and pauses accordingly. The SLEEP command can be used.
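An alternative to a before-job subroutine is to stagger the launches from a controlling script. The sketch below assumes the dsjob command-line interface; the project, job, parameter, and invocation-id names are hypothetical, and the exact dsjob syntax may differ by release, so check the documentation for your version.

```python
import subprocess
import time

def staggered_delays(n_instances, gap_seconds=30):
    """Return the start delay (in seconds) for each instance,
    spacing launches gap_seconds apart as suggested above."""
    return [i * gap_seconds for i in range(n_instances)]

def launch_instances(project, job, n_instances, gap_seconds=30):
    """Launch one invocation of a multi-instance job per input file,
    sleeping between starts so the CPU is not overwhelmed."""
    for i in range(n_instances):
        if i > 0:
            time.sleep(gap_seconds)  # pause before the next instance starts
        invocation_id = "part%02d" % (i + 1)
        # hypothetical parameter name; dsjob syntax may vary by release
        subprocess.run(["dsjob", "-run",
                        "-param", "SourceFile=source_%02d.txt" % (i + 1),
                        project, "%s.%s" % (job, invocation_id)])
```

With a 30-second gap, instance starts fall at 0, 30, 60, ... seconds, which spreads the startup load in the way kduke describes.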
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
byk
Participant
Posts: 8
Joined: Wed Mar 07, 2007 8:43 am

Re: Possibility of running many Multiple Instances concurrent

Post by byk »

vnspn wrote: I had posted here before on processing a huge source file containing 40 million records. ... Our version is 7.1 and the server is on Windows.
The individual instances may run a bit slower, but overall there will be a performance benefit. The optimal number of parallel instances has to be found by trial and error and will primarily depend on:
1. Number of CPUs
2. Amount of memory (RAM + swap space) available to DataStage

I am assuming that you have ensured the independence of the instances that you plan to run in parallel.

I have successfully tried this strategy running 6 parallel threads (different independent jobs) and got a time gain of 60%!
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

A job of SEQ --> XFM --> SEQ should use one CPU completely. If you have 8 CPUs, then you should not exceed the number of CPUs by much. You ALWAYS need to measure the requirements/impact of a single instance, then use that to determine the maximum number of instances your server will sustain.
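Kenneth's rule of thumb can be turned into a quick back-of-the-envelope estimate. This is only a sketch with illustrative numbers, not a substitute for measuring your own jobs; the headroom factor is an assumption.

```python
def max_instances(cpu_count, cpu_per_instance, headroom=0.8):
    """Estimate how many instances the server can sustain, leaving
    some headroom for the OS and other processes.

    cpu_per_instance is the measured CPU fraction one instance uses:
    1.0 means it fully occupies one CPU, as a simple
    SEQ --> XFM --> SEQ job tends to do."""
    return int((cpu_count * headroom) // cpu_per_instance)

# e.g. 8 CPUs, each instance saturating one CPU:
# max_instances(8, 1.0) -> 6
```

If measurement shows each instance uses only half a CPU (waiting on I/O, say), the same formula suggests roughly twice as many instances will fit.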
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
vnspn
Participant
Posts: 165
Joined: Mon Feb 12, 2007 11:42 am

Post by vnspn »

byk,

I have a couple of questions based on your reply.

What do you mean by "swap space"? Would we need to specify the amount of memory that can be consumed by DataStage during the installation of the Server?

What do you mean by "ensure independence of the instances"? A single Job is going to run in multiple instances for different data. Are you talking about the way it would be called in the Job Sequence?

Thanks.
byk
Participant
Posts: 8
Joined: Wed Mar 07, 2007 8:43 am

Post by byk »

vnspn,

I am not sure whether you can specify/restrict the memory usage during installation, but it is good to provision a generous amount of memory (3-4 GB) and page file space.
Ensuring independence (which may be implicit in your case) basically means ensuring that the jobs that are to run in parallel are in no way dependent on each other.
eostic
Premium Member
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

...(on ensuring independence)... like being certain that two instances don't try to write to the same sequential file, or do conflicting things against the same target RDBMS table, etc. You may already be familiar with the built-in Job Parm #DSJobInvocationId#, but others reading this thread may find it useful, as it's not that well documented. FYI --- for those of you with RTI, these multi-instance job issues are critical, as RTI uses this feature internally to spawn multiple "always on" instances of a job that are waiting for input.
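One common way to keep instances independent is to fold the invocation id into every file path the job touches, so no two instances ever share a sequential file. A minimal sketch of that naming scheme (the directory, stem, and extensions here are hypothetical; in the job itself you would build the same paths from #DSJobInvocationId#):

```python
def instance_paths(base_dir, stem, invocation_id):
    """Derive per-instance file names from the invocation id so that
    two instances never write to the same sequential file."""
    return {
        "input":  "%s/%s_%s.in"  % (base_dir, stem, invocation_id),
        "output": "%s/%s_%s.out" % (base_dir, stem, invocation_id),
        "reject": "%s/%s_%s.rej" % (base_dir, stem, invocation_id),
    }
```

Database targets need the same care: either partition the rows so instances never touch the same keys, or land per-instance files and load them in one controlled step afterwards.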

Ernie
pajj
Participant
Posts: 11
Joined: Fri Jun 16, 2006 12:27 pm

Post by pajj »

This scenario is similar to a current project I am working on.

Will using link partitioning obtain the same, worse or better results than a separate job instance?

What is the benefit of one method over the other?

tks
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Worse. The Link Partitioner uses inter-process communication (IPC), which introduces an additional possible point of failure.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
katz
Charter Member
Charter Member
Posts: 52
Joined: Thu Jan 20, 2005 8:13 am

Post by katz »

An alternative to running multi-instance jobs in parallel on separate input files could be to start multiple readers of the same source file in the same job (the source is a sequential file?) and use awk as a filter to read a subset of rows from the single input file. Awk is available for Windows.

For example, the expression awk 'NR%2 == 0' would read the even-numbered rows from the file and awk 'NR%2 == 1' would read the odd-numbered rows. Both file reads can be performed in parallel, in the same job. By extending this approach you can set up any number of concurrent readers, each reading a unique set of input rows from the same file.
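For readers who want to see the modulo split spelled out, here is an equivalent sketch in Python (not the awk filter itself, just the same row-selection logic):

```python
def modulo_split(lines, n_readers, reader_index):
    """Yield the subset of rows this reader should process,
    mirroring awk 'NR % n == k'.  awk numbers rows from 1,
    so the row counter starts at 1 here too."""
    for row_number, line in enumerate(lines, start=1):
        if row_number % n_readers == reader_index:
            yield line

rows = ["r1", "r2", "r3", "r4", "r5"]
even = list(modulo_split(rows, 2, 0))  # rows 2 and 4, like awk 'NR%2 == 0'
odd  = list(modulo_split(rows, 2, 1))  # rows 1, 3, 5, like awk 'NR%2 == 1'
```

With n_readers set to 20 and reader_index 0..19, every row lands with exactly one reader and the union of all readers covers the whole file.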

After this point you still need to determine if all the downstream processes must/can be handled in parallel (the simple case) or if it is possible that some of the parallel streams can be merged for processing to preserve server resources.

The optimal number of readers and number of subsequent parallel processes to create will depend on the capacity of your server and the type/complexity of the activities that you need to perform (you have not given much information in this regard). When in doubt, test it.

The method described here does not obviate any of the previously posted comments, but is offered as an alternative for creating a parallel read of the data without the complexities of having 20 separate input files.

Good luck with whatever approach you take.
katz
Post Reply