Controller problem: Error calling DSRunJob

rachit82 · Post by **rachit82** » Mon Sep 28, 2009 10:04 am

This has been happening more often than we would like. This was the 2nd time in 2 weeks we got this issue. The reason why my superiors have a problem understanding the solutions you have provided is that Universe has been removed from the 8.1 framework and DB2 has been provided for managing the metadata.

We have a lot of Server jobs that were migrated from 7.5.2 to 8.1, and all of them seem to stop working at the same time. The Parallel jobs are still running in Production at that time on the same server.

We checked and according to our Unix Admin, one of the Universe databases was down and the jobs were being held up due to this. I don't believe it, as the jobs with java processes are also affected even though they had no dependency on that particular database.

Please advise.

chulett · Post by **chulett** » Mon Sep 28, 2009 10:17 am

Since you've started a new topic and have not posted an actual error, we have no idea which "solutions you have provided" you mean, nor what precise issue you are fighting other than things apparently "stalling". And the assertion that "Universe has been removed from the 8.1 framework" is simply untrue.

rachit82 · Post by **rachit82** » Mon Sep 28, 2009 10:21 am

The following is the error message:

seq_ODS_UV_DISTRO..JobControl (@RPO_DBMAXDT): Controller problem: Error calling DSRunJob(ODS_DB_AUDITTIME.Retail_PO), code=-14
[Timed out while waiting for an event]

seq_ODS_UV_DISTRO..JobControl (@Coordinator): Summary of sequence run
09:39:35: Sequence started (checkpointing on)
09:39:35: RPO_DBMAXDT (JOB ODS_DB_AUDITTIME.Retail_PO) started
09:40:36: Exception raised: @RPO_DBMAXDT, Error calling DSRunJob(ODS_DB_AUDITTIME.Retail_PO), code=-14 [Timed out while waiting for an event]
09:40:36: Sequence failed (restartable)

seq_ODS_UV_DISTRO..JobControl (fatal error from @Coordinator): Sequence job (restartable) will abort due to previous unrecoverable errors
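As a quick sanity check on the summary above, the gap between the job-start line and the exception line can be computed from the timestamps. This is only a minimal sketch: the function name `seconds_until_timeout` and the regular expression are illustrative assumptions, and it handles nothing beyond the `HH:MM:SS: message` layout shown in this log.

```python
import re
from datetime import datetime

# Summary lines copied from the sequence log above.
SUMMARY = """\
09:39:35: Sequence started (checkpointing on)
09:39:35: RPO_DBMAXDT (JOB ODS_DB_AUDITTIME.Retail_PO) started
09:40:36: Exception raised: @RPO_DBMAXDT, Error calling DSRunJob(ODS_DB_AUDITTIME.Retail_PO), code=-14 [Timed out while waiting for an event]
09:40:36: Sequence failed (restartable)
"""

# Assumed layout: "HH:MM:SS: message"
LINE = re.compile(r"^(\d{2}:\d{2}:\d{2}): (.*)$")


def seconds_until_timeout(summary):
    """Return seconds between the first job start and the first -14 exception.

    Returns None if either line is missing from the summary.
    """
    started = None
    failed = None
    for raw in summary.splitlines():
        match = LINE.match(raw.strip())
        if not match:
            continue
        stamp = datetime.strptime(match.group(1), "%H:%M:%S")
        text = match.group(2)
        # Job activity lines end in " started"; the sequence line does not.
        if started is None and text.endswith(" started"):
            started = stamp
        elif failed is None and "code=-14" in text:
            failed = stamp
    if started is None or failed is None:
        return None
    return (failed - started).total_seconds()


if __name__ == "__main__":
    print(seconds_until_timeout(SUMMARY))
```

For the summary quoted here it reports 61 seconds between the start of `RPO_DBMAXDT` and the exception.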

This is the common error across all the Sequences.

The error across all the jobs is:

"Abnormal termination of stage ODS_DB_AUDITTIME.Retail_PO.Xfm_Maxdate detected"

Please advise.

chulett · Post by **chulett** » Mon Sep 28, 2009 10:28 am

There are a ton of conversations here on the topic of that -14 or "Timed out while waiting for an event" error. Have you been through them? What solution was offered that there's a problem understanding? I'm assuming it was something along the lines of "run fewer jobs at the same time".

rachit82 · Post by **rachit82** » Mon Sep 28, 2009 10:35 am

chulett wrote: There are a ton of conversations here on the topic of that -14 or "Timed out while waiting for an event" error. Have you been through them? What solution was offered that there's a problem understanding? I'm assuming it was something along the lines of "run fewer jobs at the same time".

I have been through them all and I have presented all the solutions, but I have been rebuffed because all of them date back to 2007 and this is 2009, with DS 8.1 on DB2, not Universe.

Our scheduling tool limits the DS job count to 10 at any given time of the day, and our java applications have a limit of 10. When these jobs started aborting, the Server CPU idle was at 98% and Memory usage was at 25%.

Now you tell me, how much less should I be running on the server? These same jobs were running great less than 3 weeks back on the 7.5.2 server, for more than a year, with the same level of usage. We were running more than 25 jobs on it at a given time.

We got bigger and more powerful boxes for 8.1 on IBM's recommendation and our WAS and DB2 are on a different box and DS is on another one.

chulett · Post by **chulett** » Mon Sep 28, 2009 12:56 pm

No, I understand just fine.

Again, 2007 or 2009 doesn't really matter: that same "Universe" engine is still there, and in addition to that there's now the overhead of the DB2 repository and the IIS and WAS servers on top of that. It's good that you got "bigger and more powerful boxes" for 8.1, as it's important to realize the amount of resources all of those extra bits add, but it does mean that you really don't have an "apples to apples" comparison between your old server running 7.5 and the new ones running 8.1.

Here about all we can say is the official line for that -14 error: the engine is still hard-coded to throw that error if it tries to start a job and it takes more than 60 seconds for the response to come back that it was able to start it. Nothing more and nothing less to the error than that. And the typical reason for that would be a 'resource issue', hence the first suggestion would always be to do less at once. Now, that doesn't necessarily mean running fewer jobs at the same time, it could also mean launching fewer jobs at the same time - sometimes all it takes is a slight delay between job starts to alleviate the problem.
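The "slight delay between job starts" idea can be sketched outside of DataStage as a small launcher loop. This is a hedged illustration, not anything from the product: `launch_staggered`, its parameters, and the stand-in `fake_launch` are all hypothetical names. In practice the `launch` callable would wrap whatever actually starts a job at your site (for example a scheduler call or a command-line invocation), which is an assumption left to the reader.

```python
import time


def launch_staggered(jobs, launch, delay_seconds=5.0, sleep=time.sleep):
    """Start each job in order, pausing between consecutive starts.

    jobs          -- iterable of job names (hypothetical identifiers)
    launch        -- callable that starts one job and returns its result
    delay_seconds -- pause inserted before every start except the first
    sleep         -- injectable sleep function, so the delay can be faked in tests
    """
    results = []
    for index, job in enumerate(jobs):
        if index > 0:
            # Give the engine time to finish starting the previous job.
            sleep(delay_seconds)
        results.append((job, launch(job)))
    return results


if __name__ == "__main__":
    # Stand-in launcher: only prints; replace with a real job start.
    def fake_launch(job_name):
        print("starting", job_name)
        return 0

    launch_staggered(["job_a", "job_b", "job_c"], fake_launch, delay_seconds=1.0)
```

The delay value is a tuning knob rather than a recommendation; the point is only that starts are spread out instead of all arriving at the engine at once.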

Now, there's probably more to it than that nowadays with the new architecture. We here don't necessarily know what all needs to be monitored or what other things could affect this, resource or configuration wise. So any other "solution" to this would need to come from IBM or whoever your official support provider is. Have you approached them? What advice did they offer?

rachit82 · Post by **rachit82** » Mon Sep 28, 2009 1:35 pm

Apparently it's not only a resource issue, but also related to NLS.

Due to the way NLS settings are done during installation, especially with Universe connectivity, if the Universe database is down, all the jobs will abort with this error.

Once the Universe database instance was brought back up, the jobs were working as usual.

Thanks for the insight though.

chulett · Post by **chulett** » Mon Sep 28, 2009 1:42 pm

Wow... glad it was as simple as that.

Thanks for posting that. We'll all have to keep in mind the fact that the -14 error could also mean an issue with the Universe repository / engine nowadays. I would have expected a more... catastrophic... error than that.

chulett · Post by **chulett** » Mon Sep 28, 2009 1:58 pm

In thinking about this, it seems to me in hindsight that we could have moved beyond all of the 'standard' answers that we got bogged down in if it had been more clear that you had this issue with all jobs in all situations. Meaning you couldn't even run a single job manually with no other jobs running without it throwing this error. If that was the part you meant when you said you're "not sure you still understand" then you were right. I didn't.

Looking back at what you posted I see where you did say that but I missed it in all of the other text. "All jobs in All Projects Stalled and would not run". "all of them seem to stop working at the same time". But then I thought you'd resolved that part when you said "We checked and according to our Unix Admin, one of the Universe databases was down and the jobs were being held up due to this" and you were still having the issue after that, which is why you were here.

Ah well, good to know it's all sorted out.
