DataStage V11.3 Grid - Restart

oacvb · Post by **oacvb** » Thu Jun 25, 2015 11:46 am

We have created a grid environment (1 primary and 6 conductor) and installed DataStage V11.3.1.2, then migrated our jobs from 8.5 to 11.3.1.2. We have also installed Platform LSF 9.3. When we submit a sequence that runs 40 jobs in parallel, all the jobs are submitted to the same host. Can you please suggest how to distribute the load across multiple hosts?

oacvb · Post by **oacvb** » Thu Jun 25, 2015 11:47 am

The jobs also fail because resources or memory are unavailable.

PaulVL · Post by **PaulVL** » Thu Jun 25, 2015 2:09 pm

Insert a 1 second delay into your job submission.

In lsb.queues, add this:

JOB_ACCEPT_INTERVAL = 1
(read about it)

Add this to lsb.params:

MBD_SLEEP_TIME = 1 #Amount of time in seconds used for calculating parameter values

That should simulate a round robin while still maintaining load balancing.

You might also want to stop submitting jobs to a compute node if it is over a certain CPU threshold:

ut = 0.85
(in lsb.queues)

oacvb · Post by **oacvb** » Thu Jun 25, 2015 2:41 pm

Thanks PaulVL. I will make the necessary changes and let you know the details. Will there be any delay in job submission? If yes, will it apply only to the node (host) a job was submitted to earlier, or overall? I am trying to understand the impact, if any.

oacvb · Post by **oacvb** » Thu Jun 25, 2015 2:52 pm

Will all the jobs in a sequence be submitted to the same compute node or to different ones? I believe it depends on the resource engine. Please correct me if my understanding is wrong.

PaulVL · Post by **PaulVL** » Fri Jun 26, 2015 8:17 am

Test it.

What you are running into is that the host you are submitting to is a valid candidate for a job because you told it that it has a maximum number of job slots (maybe 64), and you are submitting 40 jobs. At the time of submission, your grid does not know how much CPU each job will use. So 40 goes into 64 just fine. Boom, the first server offered up to you is fair game.

Even if you put in the "ut=0.85", that is not enough to stop the flooding of one server. It will just stop considering that host as a candidate if its CPU is above 85%. At the start of your sequencer, that host would not be at 85% or more, so all 40 jobs get sent to it because it's still fair game.
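To make the flooding concrete, here is a toy model in Python. It is only a sketch and not how LSF actually schedules: the `Host` class, the host names, the 64 slot limit and the 5% starting CPU figure are all made up for illustration. The one thing it tries to capture is that the CPU reading does not change during the burst, so the first host passes both the slot check and the ut check every time.

```python
from dataclasses import dataclass


@dataclass
class Host:
    # All values here are hypothetical, chosen only for illustration.
    name: str
    max_slots: int = 64      # assumed job slot limit per host
    used_slots: int = 0
    cpu_util: float = 0.05   # CPU load as last reported to the scheduler


def is_candidate(host, ut_threshold=0.85):
    """A host qualifies if it has a free slot and is under the ut threshold."""
    return host.used_slots < host.max_slots and host.cpu_util < ut_threshold


def dispatch_burst(hosts, n_jobs):
    """Place n_jobs at once, always taking the first host that qualifies.

    cpu_util is deliberately never updated inside the loop: the jobs have
    not started consuming CPU yet when the placement decision is made.
    """
    placement = {h.name: 0 for h in hosts}
    for _ in range(n_jobs):
        for h in hosts:
            if is_candidate(h):
                h.used_slots += 1
                placement[h.name] += 1
                break
    return placement


hosts = [Host(f"compute{i}") for i in range(1, 7)]
placement = dispatch_burst(hosts, 40)
print(placement)
```

In this model all 40 jobs land on `compute1` and the other five hosts stay empty, which matches the behaviour described above.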

By introducing a 1 second delay before a host can accept another job, you will basically be able to submit X jobs per second, where X is the number of compute nodes in your pool: 1 per compute node, with the other jobs held in the queue. In second #2 you would submit another X jobs. This continues until the backlog is cleared.
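As a rough illustration of that arithmetic, the sketch below counts how the backlog drains when each host may accept one job per interval. The function name and the figure of 6 compute nodes are assumptions for the example; real dispatch timing in LSF also depends on scheduling cycles and current load.

```python
def simulate_accept_interval(n_jobs, n_hosts, interval_s=1):
    """Return (second, host_index) dispatch events, one job per host per interval."""
    schedule = []
    pending = n_jobs
    second = 0
    while pending > 0:
        batch = min(pending, n_hosts)   # at most one job per host this interval
        for host in range(batch):
            schedule.append((second, host))
        pending -= batch                # the rest stay held in the queue
        second += interval_s
    return schedule


# Hypothetical pool of 6 compute nodes receiving the 40 job sequence.
schedule = simulate_accept_interval(40, 6)
per_host = [sum(1 for _, h in schedule if h == i) for i in range(6)]
last_dispatch = schedule[-1][0]
print(per_host, last_dispatch)
```

With these numbers the last job is dispatched in the seventh one-second round, and each host ends up with 6 or 7 jobs instead of one host taking all 40.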

So yes, there would be a delay in your job submissions because of the wait time in the queue.

That's load balancing for ya. Test the load on the box... then deploy.

Of course, the above technique is totally thrown out the window if you are using the sequencer.sh method of pre-generating your APT file in the sequencer and then passing it to the jobs. At that point the jobs are not grid jobs that are individually load balanced.

oacvb · Post by **oacvb** » Fri Jun 26, 2015 10:56 am

Thanks PaulVL. It really helped.