DataStage V11.3 Grid - Restart


oacvb
Participant
Posts: 128
Joined: Wed Feb 18, 2004 5:33 am

Post by oacvb »

We have created a grid environment (1 primary and 6 conductor) and installed DataStage V11.3.1.2 with Platform LSF 9.3. We migrated our jobs from 8.5 to the new 11.3.1.2 installation. We submitted a sequence that runs 40 jobs in parallel, and all of the jobs are submitted to the same host. Can you please suggest how to distribute the load across multiple hosts?
oacvb
Participant
Posts: 128
Joined: Wed Feb 18, 2004 5:33 am

Post by oacvb »

Also, the jobs fail because resources or memory are unavailable.
PaulVL
Premium Member
Posts: 1315
Joined: Fri Dec 17, 2010 4:36 pm

Post by PaulVL »

Insert a 1 second delay into your job submission.

in lsb.queues, add this

JOB_ACCEPT_INTERVAL = 1
(read about it)

add this to lsb.params

MBD_SLEEP_TIME = 1 #Amount of time in seconds used for calculating parameter values


That should simulate a round robin yet still maintain load balancing.

You might also want to stop submitting jobs to a compute node if it is over a certain CPU threshold.


ut = 0.85
(in lsb.queues)
oacvb
Participant
Posts: 128
Joined: Wed Feb 18, 2004 5:33 am

Post by oacvb »

Thanks PaulVL. I will make the necessary changes and let you know the details. Will there be any delay in job submission? If yes, will it apply only to the node (host) a job was previously submitted to, or overall? I am trying to understand the impact, if any.
oacvb
Participant
Posts: 128
Joined: Wed Feb 18, 2004 5:33 am

Post by oacvb »

Will all the jobs in a sequence be submitted to the same compute node or to different compute nodes? I believe it depends on the resource engine. Please correct me if my understanding is wrong.
PaulVL
Premium Member
Posts: 1315
Joined: Fri Dec 17, 2010 4:36 pm

Post by PaulVL »

Test it.

What you are running into is the fact that the host you are submitting to is a valid candidate for a job because you told it that it had a maximum number of job slots (maybe 64), and you are submitting 40 jobs. At the time of submission, your grid does not know how much CPU each job will use. So 40 goes into 64 just fine. Boom, the first server offered up to you is fair game.

Even if you put in the "ut=0.85", that is not enough to stop the flooding of one server. It will just stop considering that host as a candidate if its CPU utilization is above 85%. At the start of your sequencer, that host would not be at 85% or more. So all 40 jobs get sent to it because it's still fair game.
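
To make the flooding concrete, here is a toy model in Python. It is not how LSF is implemented; it only assumes what is described above: every job in the sequence is placed at the same instant, a host is a candidate while it has free job slots and its measured CPU utilization is under the threshold, and the measured utilization has not caught up yet. The host names, the 64 slot limit and the function name are made up for illustration.

```python
def place_jobs(num_jobs, hosts, max_slots, ut_threshold):
    """Toy placement: all jobs arrive at once, before any CPU load is measured."""
    running = {h: 0 for h in hosts}
    # Utilization as the scheduler sees it at submission time (assumed idle).
    measured_ut = {h: 0.0 for h in hosts}
    for _ in range(num_jobs):
        for h in hosts:
            # The first host with a free slot and ut below the threshold wins.
            if running[h] < max_slots and measured_ut[h] < ut_threshold:
                running[h] += 1
                break
    return running


hosts = [f"node{i}" for i in range(1, 7)]   # hypothetical compute nodes
placement = place_jobs(40, hosts, max_slots=64, ut_threshold=0.85)
print(placement)
```

In this model the first host takes all 40 jobs and the other five stay empty, because neither the slot limit nor the ut check ever rejects it.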

By introducing a 1 second delay before a host can accept another job, you will basically be able to submit X jobs per second, where X is the number of compute nodes in your pool: 1 job per compute node, with the other jobs held in the queue. On second #2 you would submit another X jobs. This continues until the backlog is cleared.


So yes, there would be a delay in your job submissions because of the wait time in the queue.
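
A rough sketch of that arithmetic, again as a toy model in Python rather than real LSF behaviour. It assumes 6 compute nodes, 40 jobs, and that each host accepts at most one job per second; slot limits and load thresholds are ignored, and the function and host names are hypothetical. As far as I understand the LSF documentation, the real wait is JOB_ACCEPT_INTERVAL multiplied by MBD_SLEEP_TIME, which is why both values are set to 1.

```python
from collections import deque


def dispatch_with_accept_interval(num_jobs, hosts, accept_interval=1):
    """Return {second: [(job, host), ...]} when each host may accept
    at most one job every accept_interval seconds."""
    pending = deque(range(1, num_jobs + 1))
    next_free = {h: 0 for h in hosts}   # earliest second each host accepts again
    schedule = {}
    second = 0
    while pending:
        for h in hosts:
            if pending and next_free[h] <= second:
                job = pending.popleft()
                schedule.setdefault(second, []).append((job, h))
                next_free[h] = second + accept_interval
        second += 1
    return schedule


hosts = [f"node{i}" for i in range(1, 7)]   # hypothetical compute nodes
schedule = dispatch_with_accept_interval(40, hosts)
for second, batch in schedule.items():
    print(f"second {second}: {len(batch)} jobs dispatched")
```

With these numbers the backlog drains in 7 ticks: 6 jobs in each of the first 6 seconds and the remaining 4 in the last one, so the last jobs in the sequence sit in the queue for about 6 seconds.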

That's load balancing for ya. Test the load on the box... then deploy.

Of course, the above technique is totally thrown out the window if you are using the sequencer.sh method of pre-generating your APT file in the sequencer and then passing it to the jobs. At that point the jobs are not grid jobs that are individually load balanced.
oacvb
Participant
Posts: 128
Joined: Wed Feb 18, 2004 5:33 am

Post by oacvb »

Thanks PaulVL. It really helped.