Sequencer job aborted in DS 7.5.1

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Post Reply
dhwankim
Premium Member
Posts: 45
Joined: Mon Apr 07, 2003 2:18 am
Location: Korea
Contact:

Sequencer job aborted in DS 7.5.1

Post by dhwankim »

Hi All,

I am building sequencer jobs for the initial load of a data warehouse, but I have a problem with one of them.

I have an entry-point sequencer job. That sequencer runs child sequencer jobs, and each child sequencer runs server jobs or parallel jobs.

When I run the entry-point sequencer, it aborts after the first child sequencer step, and the jobs in the next steps never start.

At that point, each child sequencer wrote the message below:
BatchIDi20..JobControl (@SDIEWFA02301): Controller problem: Error calling DSRunJob(SDIEWFA02301), code=-14
[Timed out while waiting for an event]

Each child sequencer writes this message just a few seconds after being started by the parent sequencer.

I have already increased the MFILES and T30FILE parameters in uvconfig and applied the change by restarting the DataStage daemon, but I still have no idea how to resolve this symptom.

I need your help.

Thanks in advance.
Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Post by Sainath.Srinivasan »

Do you run multiple instances?
dhwankim
Premium Member
Posts: 45
Joined: Mon Apr 07, 2003 2:18 am
Location: Korea
Contact:

Post by dhwankim »

Sainath.Srinivasan wrote:Do you run multiple instances?
No, I do not use multiple instances, but this entry-point sequencer job runs more than 30 server jobs concurrently.

So I wonder whether the DataStage engine could not start some of the child jobs.

I run DataStage on Unix (16 CPUs, 60 GB memory).

So I wonder why DataStage did not fork the child jobs.

DS just gave the message: Error calling DSRunJob(SDIEWFA02301), code=-14
[Timed out while waiting for an event]

What does this message mean?

Thanks in advance.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Does the same server job timeout each time? I suspect that it does not; could you stagger your 30 concurrent calls by making some of them depend upon others finishing? Also, monitor your cpu usage while these are running, vmstat should be detailed enough.
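A minimal sketch of the monitoring ArndW suggests, assuming a vmstat layout where CPU %idle is the last column of each sample line (this varies by platform, so adjust the awk field for your system):

```shell
# Watch CPU %idle while the sequence starts: 12 samples, 5 seconds apart.
# NR > 2 skips the two vmstat header lines; $NF assumes %idle is the last
# field, which is true on some Unixes but not all -- check your vmstat(1).
# Sustained idle values near 0 point at CPU saturation during job startup.
vmstat 5 12 | awk 'NR > 2 { printf "sample %d: idle %s%%\n", NR-2, $NF }'
```

Run this in a second terminal while the entry-point sequencer starts its children, and compare the timestamps of the low-idle samples against the job log's abort times.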
dhwankim
Premium Member
Posts: 45
Joined: Mon Apr 07, 2003 2:18 am
Location: Korea
Contact:

Post by dhwankim »

ArndW wrote:Does the same server job timeout each time? I suspect that it does not; could you stagger your 30 concurrent calls by making some of them depend upon others finishing? Also, monitor your cpu usage while these are running, vmstat should be detailed enough.
Each job is a different one, and this machine has 30 CPUs, so I think it has enough hardware resources. Anyway, I wonder why the DataStage job aborted just after starting.

Which DataStage parameter is related to this symptom, or how can I prevent this error?

I have about 2000 initial-load jobs, and I have already built sequencer jobs to handle the dependencies between them.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

The error message means that DataStage waited longer than expected for a job to start; this is most likely due to the system's resources being bottlenecked during the initial startup phase.

Please monitor your CPU usage when the job starts; if it is over 95% for periods of 10-15 seconds then that is your most likely cause. The request to change your sequence is not a final solution, just a way to narrow down the cause: if the error goes away then you can see the relationship and work from there.

The number of CPUs might not be what is limiting you. It could be virtual memory space, disk I/O (on the partition holding DataStage) or even your DataStage configuration (T30FILE is not the culprit here, but did you change any other configuration parameters?).
dhwankim
Premium Member
Posts: 45
Joined: Mon Apr 07, 2003 2:18 am
Location: Korea
Contact:

Post by dhwankim »

You are right, but I want a way to hold these processes until hardware resources become available. The current issue is that DS jobs abort when resources run short.

Right now the system's available resource drops to 2 ~ 0 while the sequencer jobs are running.

Thanks for your help in advance.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

dhwankim,

could you explain the
JdDSSJOBUpdate_T1_JC_JOB_PARAMETERS_Hf
part - I'm afraid I don't know what you mean.
Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Post by Sainath.Srinivasan »

You can wait for some jobs to finish before starting the rest; that way you avoid the contention.

For now, you can break the jobs into multiple sequencers so that they run in a sequential mode.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Hello DaeHwan,

Your monitoring will show that each job uses more than 50% of a CPU when run separately. This tells you, by simple arithmetic, that 30 jobs on 16 CPUs overload the machine. That is why you must run fewer jobs at a time to overcome this problem.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
dhwankim
Premium Member
Posts: 45
Joined: Mon Apr 07, 2003 2:18 am
Location: Korea
Contact:

Post by dhwankim »

JdDSSJOBUpdate_T1_JC_JOB_PARAMETERS_Hf is just one of the server jobs. It reads a sequential file, transforms the rows, looks up some hashed files, and writes to a sequential file. It's a plain job.

I now see that my problem is hardware resource usage, but I wonder how to protect a job from aborting when hardware resources run short.

That is, when the server does not have enough free resources, how can I prevent jobs from being aborted?

Does UniVerse (the DS engine) have any parameter related to this symptom, and which parameter in the Unix kernel, or anywhere else, could control it?

Thanks in advance
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Unfortunately the only detection in DataStage is the timeout when a job fails to start within a hard-coded interval; that is, we can't tune the timeout. And, as you noted, the job that cannot start aborts.

You probably could do something with UNIX, but there's nothing supplied "out of the box" as far as I am aware. I am thinking of a shell script that takes one or two measures of %Idle, and only proceeds if these are non-zero, indicating that the machine has spare capacity.
Of course it may also be some other resource, such as memory (set a threshold on PF/sec) or I/O capacity. These would have to be done on a per-machine basis, since every machine is different.
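The %Idle gate Ray describes could be sketched roughly as below. This is a hedged sketch, not a supported DataStage feature: the assumption that %idle is the last field of the final vmstat sample, the 10% threshold, and the sampling interval are all illustrative and must be adapted per machine.

```shell
#!/bin/sh
# wait_for_idle: block until the machine reports at least $1 percent CPU idle.
# Takes the last field of the final vmstat sample as %idle -- adjust the awk
# field on platforms where idle is not the last column of vmstat output.
wait_for_idle() {
    min_idle=$1
    while : ; do
        # Two samples, 5 seconds apart; the first vmstat line is a since-boot
        # average, so only the last sample reflects current load.
        idle=$(vmstat 5 2 | tail -1 | awk '{ print $NF }')
        [ "$idle" -ge "$min_idle" ] && return 0
        echo "idle ${idle}% < ${min_idle}%, waiting..." >&2
    done
}

# Illustrative usage (project and job names are placeholders): only launch
# the next job once the box shows 10% spare CPU capacity.
# wait_for_idle 10 && dsjob -run MyProject MyJob
```

Called from an Execute Command stage or a before-job subroutine, this would delay each wave of jobs until the machine has spare capacity, instead of letting DSRunJob hit its fixed timeout.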
dhwankim
Premium Member
Posts: 45
Joined: Mon Apr 07, 2003 2:18 am
Location: Korea
Contact:

Post by dhwankim »

ray.wurlod wrote:Unfortunately the only detection in DataStage is the timeout when a job fails to start within a hard-coded interval; that is, we can't tune the timeout. And, as you noted, the job that cannot start aborts.

You probably could do something with UNIX, but there's nothing supplied "out of the box" as far as I am aware. I am thinking of a shell script that takes one or two measures of %Idle, and only proceeds if these are non-zero, indicating that the machine has spare capacity.
Of course it may also be some other resource, such as memory (set a threshold on PF/sec) or I/O capacity. These would have to be done on a per-machine basis, since every machine is different.
Thanks, Ray, for your advice and tips.
Post Reply