Sequencer job aborted in DS 7.5.1

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Post Reply
dhwankim
Premium Member
Posts: 45
Joined: Mon Apr 07, 2003 2:18 am
Location: Korea
Contact:

Sequencer job aborted in DS 7.5.1

Post by dhwankim »

Hi All,

I am building sequencer jobs for the initial load of a data warehouse, but I have a problem with one of them.

I have an entry-point sequencer job. That sequencer runs child sequencer jobs, and each child sequencer runs server jobs or parallel jobs.

When I run the entry-point sequencer, it aborts after the first child sequencer step, and the jobs in the next steps never start.

At that point, each child sequencer wrote the message below:
BatchIDi20..JobControl (@SDIEWFA02301): Controller problem: Error calling DSRunJob(SDIEWFA02301), code=-14
[Timed out while waiting for an event]

Each child sequencer writes this message just a few seconds after being started by the parent sequencer.

I have already increased the MFILES and T30FILE parameters in uvconfig and applied the change by restarting the DataStage daemon, but I still have no idea how to resolve this symptom.

I need your help.

Thanks in advance.
Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Post by Sainath.Srinivasan »

Do you run multiple instances?
dhwankim
Premium Member
Posts: 45
Joined: Mon Apr 07, 2003 2:18 am
Location: Korea
Contact:

Post by dhwankim »

Sainath.Srinivasan wrote:Do you run multiple instances?
No, I do not use multiple instances, but this entry-point sequencer job runs more than 30 server jobs concurrently.

So I wonder whether the DataStage engine could not start some of the child jobs.

I run DataStage on Unix (16 CPUs, 60 GB memory).

So I wonder why DataStage did not fork the child jobs.

DS just gave the message: Error calling DSRunJob(SDIEWFA02301), code=-14
[Timed out while waiting for an event]

What does this message mean?

Thanks in advance.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Does the same server job timeout each time? I suspect that it does not; could you stagger your 30 concurrent calls by making some of them depend upon others finishing? Also, monitor your cpu usage while these are running, vmstat should be detailed enough.
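A minimal sketch of the monitoring ArndW suggests, assuming a vmstat layout where CPU %idle is the last column of each sample line (this varies by platform, so adjust the awk field for your system):

```shell
# Watch CPU %idle while the sequence starts: 12 samples, 5 seconds apart.
# NR > 2 skips the two vmstat header lines; $NF assumes %idle is the last
# field, which is true on some Unixes but not all -- check your vmstat(1).
# Sustained idle values near 0 point at CPU saturation during job startup.
vmstat 5 12 | awk 'NR > 2 { printf "sample %d: idle %s%%\n", NR-2, $NF }'
```

Run this in a second terminal while the entry-point sequencer starts its children, and compare the timestamps of the low-idle samples against the job log's abort times.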
dhwankim
Premium Member
Posts: 45
Joined: Mon Apr 07, 2003 2:18 am
Location: Korea
Contact:

Post by dhwankim »

ArndW wrote:Does the same server job timeout each time? I suspect that it does not; could you stagger your 30 concurrent calls by making some of them depend upon others finishing? Also, monitor your cpu usage while these are running, vmstat should be detailed enough.
Each job is a different one, and this machine has 30 CPUs, so I think it has enough hardware resources. Anyway, I wonder why the DataStage job aborted just after starting.

Which DataStage parameter is related to this symptom, or how can I prevent this error?

I have about 2000 initial-load jobs, and I have already built sequencer jobs to handle the dependencies between them.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

The error message means that DataStage waited longer than expected for a job to start; this is most likely due to the system's resources being bottlenecked during the initial startup phase.

Please monitor your CPU usage when the job starts; if it is over 95% for periods of 10-15 seconds then that is your most likely cause. The request to change your sequence is not a final solution, just a way to narrow down the cause: if the error goes away then you can see the relationship and work from there.

The number of CPUs might not be what is limiting you. It could be virtual memory space, disk I/O (on the partition holding DataStage) or even your DataStage configuration (T30FILE is not the culprit here, but did you change any other configuration parameters?).
dhwankim
Premium Member
Posts: 45
Joined: Mon Apr 07, 2003 2:18 am
Location: Korea
Contact:

Post by dhwankim »

You are right, but I want a way to hold these processes until hardware resources become available. The current issue is that DS jobs abort when resources run short.

Right now the system's available resource drops to 2 ~ 0 while the sequencer jobs are running.

Thanks for your help in advance.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

dhwankim,

could you explain the
JdDSSJOBUpdate_T1_JC_JOB_PARAMETERS_Hf
part - I'm afraid I don't know what you mean.
Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Post by Sainath.Srinivasan »

You can wait for some jobs to finish before starting the rest; that way you avoid the contention.

For now, you can break the jobs into multiple sequencers so that they run in a sequential mode.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Hello DaeHwan,

Your monitoring will show that each job uses more than 50% of a CPU when run separately. This tells you, by simple arithmetic, that 30 jobs on 16 CPUs overload the machine. That is why you must run fewer jobs at a time to overcome this problem.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
dhwankim
Premium Member
Posts: 45
Joined: Mon Apr 07, 2003 2:18 am
Location: Korea
Contact:

Post by dhwankim »

JdDSSJOBUpdate_T1_JC_JOB_PARAMETERS_Hf is just one of the server jobs. It reads a sequential file, transforms the rows, looks up some hashed files, and writes to a sequential file. It's a plain job.

I now see that my problem is hardware resource usage, but I wonder how to protect a job from aborting when hardware resources run short.

That is, when the server does not have enough free resources, how can I prevent jobs from being aborted?

Does UniVerse (the DS engine) have any parameter related to this symptom, and which parameter in the Unix kernel, or anywhere else, could control it?

Thanks in advance
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Unfortunately the only detection in DataStage is the timeout when a job fails to start within a hard-coded interval; that is, we can't tune the timeout. And, as you noted, the job that cannot start aborts.

You probably could do something with UNIX, but there's nothing supplied "out of the box" as far as I am aware. I am thinking of a shell script that takes one or two measures of %Idle, and only proceeds if these are non-zero, indicating that the machine has spare capacity.
Of course it may also be some other resource, such as memory (set a threshold on PF/sec) or I/O capacity. These would have to be done on a per-machine basis, since every machine is different.
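The %Idle gate Ray describes could be sketched roughly as below. This is a hedged sketch, not a supported DataStage feature: the assumption that %idle is the last field of the final vmstat sample, the 10% threshold, and the sampling interval are all illustrative and must be adapted per machine.

```shell
#!/bin/sh
# wait_for_idle: block until the machine reports at least $1 percent CPU idle.
# Takes the last field of the final vmstat sample as %idle -- adjust the awk
# field on platforms where idle is not the last column of vmstat output.
wait_for_idle() {
    min_idle=$1
    while : ; do
        # Two samples, 5 seconds apart; the first vmstat line is a since-boot
        # average, so only the last sample reflects current load.
        idle=$(vmstat 5 2 | tail -1 | awk '{ print $NF }')
        [ "$idle" -ge "$min_idle" ] && return 0
        echo "idle ${idle}% < ${min_idle}%, waiting..." >&2
    done
}

# Illustrative usage (project and job names are placeholders): only launch
# the next job once the box shows 10% spare CPU capacity.
# wait_for_idle 10 && dsjob -run MyProject MyJob
```

Called from an Execute Command stage or a before-job subroutine, this would delay each wave of jobs until the machine has spare capacity, instead of letting DSRunJob hit its fixed timeout.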
dhwankim
Premium Member
Posts: 45
Joined: Mon Apr 07, 2003 2:18 am
Location: Korea
Contact:

Post by dhwankim »

ray.wurlod wrote:Unfortunately the only detection in DataStage is the timeout when a job fails to start within a hard-coded interval; that is, we can't tune the timeout. And, as you noted, the job that cannot start aborts.

You probably could do something with UNIX, but there's nothing supplied "out of the box" as far as I am aware. I am thinking of a shell script that takes one or two measures of %Idle, and only proceeds if these are non-zero, indicating that the machine has spare capacity.
Of course it may also be some other resource, such as memory (set a threshold on PF/sec) or I/O capacity. These would have to be done on a per-machine basis, since every machine is different.
Thanks, Ray, for your advice and tips.
Post Reply