multiple invocations of a multi instance job failing

Posted: Thu Apr 16, 2009 3:12 pm
by panchusrao2656
We have a multi-instance AUDIT job that runs after each and every job, collects the job stats, and loads them into the AUDIT tables. Sometimes the job fails when a couple of invocations of it are running at the same time. We got the following error:

Error calling DSRunJob(SEQX_ROUTINE_SAVE_JOB_INFO.J040_CIMS_MPI_SDS_PHYSICIANS_ODS_O_CUST_ID_test.CCC_sAUDIT_SK), code=-2 [Job is not in the right state (compiled and not running)]


To test parallel invocation of a multi-instance job, I created a test job sequence that calls a multi-instance job with five different invocation IDs. Sometimes the sequence completes successfully, whereas other times it fails because one of the invocation calls returns job status 99.

Do we need to set something at the project level to support multi-instance jobs? Please share if you have faced this issue earlier.

SEQX_PARALLEL_AUDIT_test321..JobControl (@Coordinator): Summary of sequence run
12:06:43: Sequence started (checkpointing on)
12:06:43: EEE (JOB J100_ODS_CONFORMANCE_SMS_CUST_ADDR_TLPHN_4654_SP_test123.EEE) started
12:06:45: DDD (JOB J100_ODS_CONFORMANCE_SMS_CUST_ADDR_TLPHN_4654_SP_test123.DDD) started
12:06:48: CCC (JOB J100_ODS_CONFORMANCE_SMS_CUST_ADDR_TLPHN_4654_SP_test123.CCC) started
12:06:50: BBB (JOB J100_ODS_CONFORMANCE_SMS_CUST_ADDR_TLPHN_4654_SP_test123.BBB) started
12:06:52: AAA (JOB J100_ODS_CONFORMANCE_SMS_CUST_ADDR_TLPHN_4654_SP_test123.AAA) started
12:10:17: CCC (JOB J100_ODS_CONFORMANCE_SMS_CUST_ADDR_TLPHN_4654_SP_test123.CCC) finished, status=2 [Finished with warnings]
12:10:18: AAA (JOB J100_ODS_CONFORMANCE_SMS_CUST_ADDR_TLPHN_4654_SP_test123.AAA) finished, status=99 [Not running]
12:10:19: Exception raised: @AAA, Unhandled abort encountered in job J100_ODS_CONFORMANCE_SMS_CUST_ADDR_TLPHN_4654_SP_test123.AAA
12:10:22: EEE (JOB J100_ODS_CONFORMANCE_SMS_CUST_ADDR_TLPHN_4654_SP_test123.EEE) finished, status=2 [Finished with warnings]
12:10:23: DDD (JOB J100_ODS_CONFORMANCE_SMS_CUST_ADDR_TLPHN_4654_SP_test123.DDD) finished, status=2 [Finished with warnings]
12:10:24: BBB (JOB J100_ODS_CONFORMANCE_SMS_CUST_ADDR_TLPHN_4654_SP_test123.BBB) finished, status=2 [Finished with warnings]
12:10:24: Sequence failed (restartable)
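
Editor's sketch: when a sequence starts many invocations, a small script can pick the odd one out of a summary like the one above. The example below is a minimal Python sketch, not a DataStage tool. The regular expression is inferred only from the "finished" lines shown in this post, the names `parse_finishes` and `suspicious` are hypothetical, and treating status 1 as acceptable alongside the status 2 seen in the log is an assumption.

```python
import re
from typing import List, NamedTuple, Sequence


class Finish(NamedTuple):
    time: str
    activity: str
    job: str
    status: int
    text: str


# Inferred from the "finished" lines in the sequence summary above;
# other DataStage versions may format these lines differently.
FINISHED = re.compile(
    r"^(\d{2}:\d{2}:\d{2}): (\S+) \(JOB (\S+)\) finished, "
    r"status=(\d+) \[([^\]]*)\]"
)


def parse_finishes(summary: str) -> List[Finish]:
    """Return one record per 'finished' line found in the summary text."""
    results = []
    for line in summary.splitlines():
        match = FINISHED.match(line.strip())
        if match:
            time, activity, job, status, text = match.groups()
            results.append(Finish(time, activity, job, int(status), text))
    return results


def suspicious(finishes: Sequence[Finish],
               ok_statuses: Sequence[int] = (1, 2)) -> List[Finish]:
    """Keep invocations whose status is not in ok_statuses (assumed: 1 and 2)."""
    return [f for f in finishes if f.status not in ok_statuses]


SAMPLE = """\
12:06:52: AAA (JOB J100_ODS_CONFORMANCE_SMS_CUST_ADDR_TLPHN_4654_SP_test123.AAA) started
12:10:17: CCC (JOB J100_ODS_CONFORMANCE_SMS_CUST_ADDR_TLPHN_4654_SP_test123.CCC) finished, status=2 [Finished with warnings]
12:10:18: AAA (JOB J100_ODS_CONFORMANCE_SMS_CUST_ADDR_TLPHN_4654_SP_test123.AAA) finished, status=99 [Not running]
"""

if __name__ == "__main__":
    for item in suspicious(parse_finishes(SAMPLE)):
        print(item.time, item.activity, item.status, item.text)
```

Run against the full summary, this would single out invocation AAA, which is the one that came back with status 99.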

Thank You

Posted: Thu Apr 16, 2009 4:52 pm
by ray.wurlod
What does the job log of job J100_ODS_CONFORMANCE_SMS_CUST_ADDR_TLPHN_4654_SP_test123.AAA reveal?

Posted: Thu Apr 16, 2009 4:54 pm
by panchusrao2656
The job invocation itself is failing, and I cannot see its log in the Director.

Posted: Thu Apr 16, 2009 5:03 pm
by ray.wurlod
That's odd, because there's a "started" message in the sequence log and there are more than three minutes between that and the warning event. Could anyone perhaps have tried to recompile the job in this time? Can you please post detail of the "job run requested" event for this invocation?

Posted: Thu Apr 16, 2009 5:18 pm
by panchusrao2656
As I am calling the same job with different invocation IDs, there is no change to the job.

We have Auto Purge of the job log set to 6 runs at the project level in Administrator. I tried changing this option to 21 and still have the same issue.

The other strange thing I have noticed is that when the jobs are invoked from the job sequence, I see five instances running, and at the end one invocation sometimes disappears. For that invocation, the sequence reports status 99. This does not happen every time.

I tried running the job with five invocations without using the job sequence, and all of them finish properly (each instance takes roughly 3 minutes to complete).

I have no idea what is happening.
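
Editor's sketch: the manual test described above (five invocations started without the job sequence) could be scripted so it can be repeated many times while hunting an intermittent failure. The Python below is a minimal sketch under stated assumptions. It assumes the `dsjob` command-line client is on the PATH, that an invocation is addressed as `job.invocation`, and that `-run -jobstatus` waits for the job and returns an exit code derived from the job status. The project name `MYPROJECT` and the function names are hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence


def build_run_command(project: str, job: str, invocation: str) -> List[str]:
    """Build a dsjob command line for one invocation of a multi-instance job."""
    # Assumption: "<job>.<invocation id>" addresses a single invocation.
    return ["dsjob", "-run", "-jobstatus", project, f"{job}.{invocation}"]


def run_with_dsjob(cmd: List[str]) -> int:
    """Run the command and return its exit code (requires dsjob on the PATH)."""
    return subprocess.run(cmd).returncode


def run_invocations(project: str, job: str, invocations: Sequence[str],
                    runner: Callable[[List[str]], int] = run_with_dsjob
                    ) -> Dict[str, int]:
    """Start all invocations concurrently and collect one exit code each."""
    with ThreadPoolExecutor(max_workers=max(1, len(invocations))) as pool:
        futures = {
            inv: pool.submit(runner, build_run_command(project, job, inv))
            for inv in invocations
        }
        return {inv: future.result() for inv, future in futures.items()}


if __name__ == "__main__":
    # MYPROJECT is a placeholder; the job name is taken from the log above.
    codes = run_invocations(
        "MYPROJECT",
        "J100_ODS_CONFORMANCE_SMS_CUST_ADDR_TLPHN_4654_SP_test123",
        ["AAA", "BBB", "CCC", "DDD", "EEE"],
    )
    for invocation, code in sorted(codes.items()):
        print(invocation, code)
```

The `runner` parameter exists so the launcher can be exercised with a stand-in function on a machine without DataStage installed.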

Posted: Thu Apr 16, 2009 5:38 pm
by chulett
Isn't there some sort of odd multi-instance bug, triggered by enabling auto-purge of the logs, that people have reported? What happens if you disable auto-purge for this job?

Posted: Thu Apr 16, 2009 6:03 pm
by panchusrao2656
I tried the option of purging logs until yesterday for this job, but I am getting the same thing. I still see the weird scenario that I mentioned earlier. I have captured screenshots of the Director: the first shows all five instances running initially, the second shows the parent job reporting that one of the instances returned status 99, and the last shows the log for only four instances of the job.

I do not know how to attach them to this topic.

Posted: Thu Apr 16, 2009 10:31 pm
by chulett
Images cannot be "attached" here. Rather you need to upload them somewhere else and then link them to a post here using the [img] or "image tags". Lots of sites available to do free file sharing / hosting, if you feel the need to show us your screenshots.

Can't tell from what you posted, did you try turning off auto-purge for this job to see if it makes any difference?

Posted: Mon Apr 20, 2009 7:00 am
by Mike
There are some problems with multi-instance jobs in 8.0.1. I currently have a PMR open with IBM.

It is not entirely related to auto purging.

Since multi-instance jobs all share the same RT_LOG and RT_STATUS files, it seems to be some kind of timing issue when multiple instances are hitting these tables concurrently.

Sometimes a job sequence will abort with a status=99 error. Sometimes everything will finish with status OK and no active stages actually executing.

I applied one patch from IBM that was supposed to fix the timing issues related to the status=99 problem, but it has been ineffective.

Mike

Posted: Mon Apr 20, 2009 7:27 am
by chulett
Since this seems to be a known issue, best to contact your official support provider and see about getting the patch(es) Mike is talking about.

Posted: Mon Apr 20, 2009 7:31 am
by priyadarshikunal
If you look into the 8.0.1 Fix Pack 2 release notes, there are a lot of fixes (more than 300) developed by IBM to resolve problems in the earlier release.

The same issue is mentioned in those release notes.

The problem is not only with purging the log entries; it is an issue with auto-purge itself. When jobs are running concurrently and auto-purge is active, it intermittently returns the value 99.

@Mike

You should look at the release notes to verify, by eCase number and description, that your patch is the same one.

Posted: Mon Apr 20, 2009 7:46 am
by chulett
From what we've seen here, best to get off of 8.0.1, fix packs or no, and on to 8.1 at your earliest convenience.

Posted: Mon Apr 20, 2009 4:27 pm
by panchusrao2656
Thank you all for sharing your ideas and info. I will ask our admin to raise a ticket with IBM to get the patch.

Posted: Wed Apr 22, 2009 12:25 am
by telenet_bi
I think there are two parts to this issue. We had both of these issues separately:

- We had issues with jobs that run fine but still return status 99 to the flow or to the dsjob command that started them. There is a patch for this, but it didn't solve our problem completely (it helped, though).

- Multi-instance jobs: change auto-purge to work on number of days instead of number of runs. This worked for us.

Posted: Wed Oct 28, 2009 12:42 pm
by Rahul.r.s
Could you please let us know which patch you applied?
Was it one of those listed below?
patch_JR30015v4_server_aix_8011.tar
patch_JR30015v3_client_windows_8011.zip

We have an issue similar to yours!