Sequence job aborted after 'Waiting for job to start'

Posted: Sun Jul 02, 2017 3:45 pm
by Palermo
Hi all,

I've run into a serious problem. Sequence jobs have been aborting all week, and it hits different sequence jobs seemingly at random. Before that, the jobs had run fine for the last 6 months.

Here is an example from the log. As you can see, TRIGGERS_JobSeq (1) ran PRCSSD_TRGGR_IND_SET_Y_PJob (2), but (1) aborted while (2) finished. Starting time=120 seconds.

Image

Image

After rerunning (1) they both finished successfully:

Image

What was done?
1) DS server was restarted
2) DSWaitStartup and DSWaitStartup were changed to 120 (although the log doesn't show any timeout-related errors; why not, if that is the problem?)

Please advise how to fix this. Many thanks in advance for your help.

Posted: Mon Jul 03, 2017 6:46 am
by UCDI
Did anything else change? Running more jobs, a server OS update, anything like that? How many jobs are running, and how many are allowed (Operations Console)?

Posted: Mon Jul 03, 2017 7:00 am
by chulett
Ah yes, the proverbial question - what changed? Obviously something did.

Typically, when I see someone post about seemingly random issues with jobs not starting within the timeout limit but then run fine later, it is almost always an indication of a resource issue on the server. So I have the same questions as UCDI posted...
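To make that pattern concrete, here is a toy Python model of it. This is only a sketch of the general "parent waits a limited time for the child to start" idea, not how DataStage implements it. The names `run_child`, `run_sequence`, `startup_delay` and `start_timeout` are all made up for illustration, and the delays are fractions of a second rather than 120 seconds.

```python
import threading
import time


def run_child(started, finished, startup_delay):
    """Toy child job: needs startup_delay seconds to start, then completes."""
    time.sleep(startup_delay)  # stands in for a slow start on a busy server
    started.set()
    finished.set()  # the actual work is instantaneous in this toy model


def run_sequence(startup_delay, start_timeout):
    """Toy parent: gives up if the child has not started within start_timeout."""
    started = threading.Event()
    finished = threading.Event()
    child = threading.Thread(
        target=run_child, args=(started, finished, startup_delay)
    )
    child.start()
    # Event.wait returns False when the timeout expires before the event is set.
    parent_status = "finished" if started.wait(timeout=start_timeout) else "aborted"
    child.join()  # the child keeps going whether or not the parent gave up
    child_status = "finished" if finished.is_set() else "not finished"
    return parent_status, child_status


if __name__ == "__main__":
    print(run_sequence(startup_delay=0.3, start_timeout=0.1))  # slow start
    print(run_sequence(startup_delay=0.0, start_timeout=0.5))  # normal start
```

With a slow start the parent reports "aborted" while the child still finishes, which matches the symptom in the first post. With a normal start both finish. Raising the timeout only hides the slow start, so it is still worth finding out what is making the server slow.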

Posted: Mon Jul 03, 2017 10:06 am
by Palermo
UCDI,

At the moment 44 jobs are running. The OS was not updated. Workload Management was disabled 1.5 years ago, and I am not sure whether the following parameters limit the number of running jobs: T30FILE=4096, RLTABSZ=480 (Maximum running jobs=900)

chulett, I agree with you. The support team opened a PMR ticket to monitor and assess server resources.

Thanks.

Posted: Wed Jul 05, 2017 7:10 am
by chulett
Those settings tend to limit the number of running jobs by causing any over the limit to blow up... and throw very specific errors pointing to them as the culprit, from what I recall. They're not the issue.

And realize that the "resource issue" isn't confined to just what DataStage things are running on the server...

Posted: Thu Jul 06, 2017 2:00 am
by Palermo
The support team reported that the cause was GSKit. The problem is now solved.

Posted: Thu Jul 06, 2017 6:49 am
by chulett
GSKit? Not something that's ever been posted here before, can you (or anyone else) elaborate a bit? Thanks.

Posted: Thu Jul 06, 2017 6:05 pm
by JRodriguez
The latest versions of IIS use the Global Security Kit (GSKit) for both encryption and SSL communication... by default.

It would be nice to learn more about the root cause and the solution. I could foresee issues if two versions of GSKit were installed on the server (two DS versions side by side with itag??, or other IBM products using a different version of GSKit)...

Posted: Mon Jul 10, 2017 12:57 pm
by Palermo
I don't know the details because I am a developer and I was not involved in solving the problem.