Error calling subroutine: DSR_EXECJOB (Action=5)

A forum for discussing DataStage® basics. If you're not sure where your question goes, start here.

Moderators: chulett, rschirm, roy

hsahay
Premium Member
Posts: 175
Joined: Wed Mar 21, 2007 9:35 am

Error calling subroutine: DSR_EXECJOB (Action=5)

Post by hsahay »

Hi

I am facing a strange problem while running a job.

Let me explain the process a little bit.

We have data in SQL Server that needs to be loaded into Oracle. We load the following three tables:

Header (primary key H1; also has a column X1)
Detail_1 (primary key D1, foreign key H1; also has columns X1 and X2)
Detail_2 (primary key D2, foreign key D1; also has a column X2)

In terms of cardinality, there is a one-to-one correspondence between the tables. The total is roughly 7 million rows each: 7 million headers, 7 million Detail_1 rows and 7 million Detail_2 rows.

There are about 8 jobs in the process. The first 4 jobs check whether the batch in SQL Server is ready to be posted to Oracle, dump the SQL Server data out to datasets and run some validations.

Once the validations are okay, the 5th job loads the data into the Header table.

The next job loads the data into the DETAIL_1 table.

This job has the following design:

dataset >> transformer (removes leading/trailing spaces from the key columns)
oracle stage (reads rows from the Header table)

The dataset retrieves 7 million detail rows and the Oracle stage retrieves 7 million headers.

The output of these two stages is fed to a join stage, where they are inner joined on column X1 to pick up the corresponding value of H1. The output of the join goes to a transformer, where some date-related values are transformed for Oracle, and finally the transformer output is fed to an Oracle stage which uses upsert to load the rows.

It is this job that, after running for a while, fails in a rather strange manner. It returns error 30107 to the calling script. When it first starts executing I can see it as "Running" in Director, but once it errors out and I go back into Director it only shows "Compiled"; the "Running" instance can no longer be seen. When I click on the job (which is showing as "Compiled"), Director throws an error saying:

Error calling subroutine: DSR_EXECJOB (Action=5); check DataStage is set up correctly in project PrepBatch
(Subroutine failed to complete successfully (30107))


followed by:

Unable to return an attached job

But in the meantime, when I check the DETAIL_1 table in Oracle, it has loaded all 7 million records successfully.
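
For context, the calling script invokes the job with dsjob, roughly like this (a sketch, not our exact script; the job name here is a placeholder):

    . $DSHOME/dsenv                                # set up the engine environment
    dsjob -run -jobstatus PrepBatch LoadDetail1    # -jobstatus waits and returns the job's status
    rc=$?
    echo "dsjob returned $rc"                      # this is where we see 30107 instead of a normal status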

So far I have tried the following:
1. Rebooted the machine.
2. Deleted the entire folder that contains these jobs, then reimported and recompiled.
3. Reindexed the project, making sure that nobody else was logged in (roughly as in the sketch below).
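
For step 3, I ran the reindex from the engine shell, along these lines (the project path shown is illustrative; adjust for your install):

    . $DSHOME/dsenv
    cd /opt/IBM/InformationServer/Server/Projects/PrepBatch   # project directory (path assumed)
    $DSHOME/bin/dssh
    >DS.REINDEX ALL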

We have 8GB RAM and 16GB virtual memory. Before I start the process, the vmstat command on AIX shows:

avm = 636105 and fre = 1291657

During the Header and DETAIL_1 load jobs the vmstat output looks like:

avm = 752916 and fre = 6122

topas shows paging space as 100% available. We use a 4-node configuration and the scratch space for all 4 nodes is fully available (60GB each). The mount point where the script creates the datasets has 40GB available.
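
For scale, avm and fre are counts of 4 KB pages (assuming the default AIX page size), so those numbers work out roughly as follows:

    # before the run: avm 636105 x 4 KB ~ 2.4 GB active virtual memory
    #                 fre 1291657 x 4 KB ~ 4.9 GB free
    # during the run: avm 752916 x 4 KB ~ 2.9 GB active
    #                 fre 6122 x 4 KB   ~ 24 MB free
    vmstat 5 3    # three samples, five seconds apart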

I don't know why this is happening or what I can do to stop it. We handle 10 times this volume in the rest of our jobs and they run fine.
vishal
asorrell
Posts: 1707
Joined: Fri Apr 04, 2003 2:00 pm
Location: Colleyville, Texas

Post by asorrell »

Usually that problem is caused by a bad index, but since you've already done a DS.REINDEX ALL, the next step would be to check whether the project is corrupt.

Run ASBNode/bin/SyncProject.sh on the project with the -Report option to see if it reports any corruption. If it does, back up the offending jobs to a DSX so you have a copy, then run it with the -Fix option. The backup matters because SyncProject.sh sometimes fixes bad jobs by deleting them when they can't be recovered cleanly.

Use the -help option for more details on the tool. It is also documented in the 8.7 InfoCenter.
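
Roughly like this (the install path is an assumption, and depending on your version you may also need authentication options; -help lists them):

    cd /opt/IBM/InformationServer/ASBNode/bin      # assumed install path
    ./SyncProject.sh -help                         # full option list for your version
    ./SyncProject.sh -project PrepBatch -Report    # report-only pass: look for corruption
    # export the flagged jobs to a .dsx backup first, then:
    ./SyncProject.sh -project PrepBatch -Fix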

Note: if you are on version 8.1 or 8.5 you'll have to call customer service to get assistance with SyncProject; they only started shipping it by default in 8.7.
Andy Sorrell
Certified DataStage Consultant
IBM Analytics Champion 2009 - 2020
hsahay
Premium Member
Posts: 175
Joined: Wed Mar 21, 2007 9:35 am

Post by hsahay »

Hi
Thanks, I will try it.

Just a question: this job generates an enormous number of warnings. Is it possible that so many warnings somehow corrupt the files associated with this job, causing this weird error?
vishal
hsahay
Premium Member
Posts: 175
Joined: Wed Mar 21, 2007 9:35 am

Post by hsahay »

Just ran it again after a delete-all/reimport/recompile cycle,
and this time I am getting different errors:

Error calling subroutine: DSR_EXECJOB (Action=5); check DataStage is set up correctly in project PrepBatch
(The connection is broken (81002))

and

Error calling subroutine: *DataStage*DSR_PROJECT (Action=7); check DataStage is set up correctly in project PrepBatch
(The connection is broken (81002))

and

Error calling subroutine: DSR_LOG (Action=3); check DataStage is set up correctly in project PrepBatch
(The connection is broken (81002))


While I am getting all these pop-up error messages in Director, back in Oracle the number of rows is changing constantly, which means the job is still running and loading data into Oracle... sigh.
vishal
asorrell
Posts: 1707
Joined: Fri Apr 04, 2003 2:00 pm
Location: Colleyville, Texas

Post by asorrell »

Yes - your client timed out (that's the 81002 error), but the job is still running in the background.

Get completely out of your clients and log back in.
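
If you want to confirm from the server that the job really is still alive, something like this works (a sketch; the job name is a placeholder):

    . $DSHOME/dsenv
    dsjob -jobinfo PrepBatch LoadDetail1    # reports the current job status and start time
    ps -ef | grep DSD.RUN                   # running jobs show up as DSD.RUN phantom processes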
Andy Sorrell
Certified DataStage Consultant
IBM Analytics Champion 2009 - 2020
hsahay
Premium Member
Posts: 175
Joined: Wed Mar 21, 2007 9:35 am

Post by hsahay »

Thanks asorrell

So does that mean that these errors are only a Director client issue and the job is executing fine?

Let me see what happens when all the rows are loaded... about 4 million so far.
vishal
asorrell
Posts: 1707
Joined: Fri Apr 04, 2003 2:00 pm
Location: Colleyville, Texas

Post by asorrell »

hsahay wrote: So does that mean that these errors are only a Director client issue and the job is executing fine?
...
Well, no and yes. The first error may indicate other problems, which is why you might still want to run SyncProject when the jobs are all stopped. The 81002 is just a client timeout. Once you start a job, it runs independently in the background; you can log out of the client (or have it disconnected) without affecting the job. You can check the job status (and row counts) in the Director client during and after a run.
Andy Sorrell
Certified DataStage Consultant
IBM Analytics Champion 2009 - 2020
hsahay
Premium Member
Posts: 175
Joined: Wed Mar 21, 2007 9:35 am

Post by hsahay »

Hi

Just wanted to let you know that I used DS.Checker and SyncProject and nothing bad was found.

But working on my suspicion that it had something to do with the large number of warnings being generated, I changed the code to remove the warnings, and that fixed it.

The job now runs without a problem.

I am using RTLogging=1, which puts all logging in the RT_LOG hashed files. When millions of warnings are generated, this seems to corrupt the file to such an extent that DataStage loses all track of the running job, even though the job carries on in the background. The running instance disappears from the Director view, and when you click on the "Compiled" instance of the job you start getting the pop-up error messages mentioned above. The calling script never gets a notification when the job ends and stays in a running state.
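
One quick way to spot an oversized job log from the engine side (a sketch; the project path is assumed, and each RT_LOGnnn is the log hashed file for one job):

    cd /opt/IBM/InformationServer/Server/Projects/PrepBatch   # project directory (path assumed)
    du -sm RT_LOG* | sort -n | tail    # the largest job logs in the project, in MB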

Is this expected, documented behavior, or do I need a patch or something?
For now I am marking this as resolved, but if anybody has more info on it please do post it here.
vishal
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

There is a limit of 2GiB on the size of the RT_LOGnnn hashed files. Any attempt to exceed that will corrupt the hashed file (generating a backward link (BLINK) error). If you believe that you need more than 2GiB of log events, you can use RESIZE to change that particular RT_LOGnnn to 64-bit addressing, which increases the upper limit to something larger than your disk can hold. But your prevention approach (eliminate the cause of the warnings) is always to be preferred.
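
A sketch of that resize, run from the engine shell in the project directory (123 is a placeholder for the affected job's number; the asterisks keep the file's current type, modulus and separation):

    . $DSHOME/dsenv
    cd /opt/IBM/InformationServer/Server/Projects/PrepBatch   # project directory (path assumed)
    $DSHOME/bin/dssh
    >RESIZE RT_LOG123 * * * 64BIT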
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
hsahay
Premium Member
Posts: 175
Joined: Wed Mar 21, 2007 9:35 am

Post by hsahay »

Thanks for that bit of info Ray. Much appreciated.
vishal