Job Aborted, but still showing as running in Director.

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

SURA
Premium Member
Posts: 1229
Joined: Sat Jul 14, 2007 5:16 am
Location: Sydney


Post by SURA »

Hi there

I had a problem in a job that was running fine yesterday. It aborted with a fatal error.

Code:

LKUP006,2: Fatal Error: Unable to initialize communication channel on RISSDB01. This is typically caused by a configuration problem. Examples of typical problems include:

1) The temporary directory, identified by $TMPDIR and/or the scratch disks in your ORCHESTRATE configuration, is located on a non-local file system (e. g. mounted over NFS).

2) The temporary directory is located on a file system with insufficient space.
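
Cause 2 in the error message is easy to rule out from the engine host. A minimal sketch (illustration only, not a DataStage utility; the 5 GB threshold is an arbitrary assumption) of checking the temporary directory's free space:

```python
# Quick check for cause 2 above: does the temp directory named by $TMPDIR
# (or the platform default) have enough free space? Illustration only.
import os
import shutil
import tempfile

def check_tmpdir(min_free_gb=5):
    """Return the temp directory, its free space in GB, and a pass/fail flag."""
    tmpdir = os.environ.get("TMPDIR", tempfile.gettempdir())
    usage = shutil.disk_usage(tmpdir)
    free_gb = usage.free / 1024**3
    ok = free_gb >= min_free_gb
    return tmpdir, free_gb, ok

tmpdir, free_gb, ok = check_tmpdir()
print(f"{tmpdir}: {free_gb:.1f} GB free, {'OK' if ok else 'LOW'}")
```

Cause 1 (a temp directory or scratch disk on NFS) still has to be checked by hand against the scratchdisk entries in the APT configuration file.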
Then I tried to reset / recompile the job, but it was not allowed; the message said the job was being accessed by myself.

Then I tried to release the lock using the Web Console, but I couldn't find any job there in running status, whereas in Director --> Monitor I can see the job's status is running.

Then I used ps -ef to find the processes, and found the processes below, which relate to this job.

Code:

info_adm  12328   1980  0 10:59:55 con  0:01 D:\IBM\InformationServer\Server\DSEngine/bin/uvsh DSD.RUN STG_TX_XXXXX  0/0/1/0/0/0/0
info_adm   6636  12328  0 10:59:57 con  0:00 D:\IBM\InformationServer\Server\DSEngine/bin/uvsh SH -c 'D:/IBM/Information
Server/Server/DSEngine/bin/NT_OshWrapper.exe //./pipe/ETL_DEV-RT_SC255-STG_TX_XXXXX RT_SC255/OshExecut
er.sh R DUMMY  -f RT_SC255/OshScript.osh -monitorport 13400 -pf RT_SC255/jpfile -impexp_charset ASCL_MS1252 -string_char
set ASCL_MS1252 -input_charset UTF-8 -output_charset UTF-8 -collation_sequence OFF'
info_adm  10128   6636  0 10:59:57 con  0:00 sh
info_adm   9516  10128  0 10:59:57 con  0:00 D:\IBM\InformationServer\Server\DSEngine\bin\NT_OshWrapper.exe //./pipe/ETL_DEV-RT_SC255-STG_TX_XXXXX RT_SC255/OshExecuter.sh R DUMMY -f RT_SC255/OshScript.osh -monitorport 1340
0 -pf RT_SC255/jpfile -impexp_charset ASCL_MS1252 -string_charset ASCL_MS1252 -input_charset UTF-8 -output_charset UTF-8
 -collation_sequence OFF
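
When a tree like the one above is left behind, the leftover PIDs can be collected by walking the parent/child links in the `ps -ef` output. A sketch (not a DataStage tool; the sample lines are shortened versions of the ones above, and in real use you would feed it the live `ps -ef` output):

```python
# Parse ps -ef style output into a parent -> children map, then collect
# every descendant of the DSD.RUN supervisor PID (12328 in the post).
SAMPLE_PS = """\
info_adm  12328   1980  0 10:59:55 con  0:01 uvsh DSD.RUN STG_TX_XXXXX 0/0/1/0/0/0/0
info_adm   6636  12328  0 10:59:57 con  0:00 uvsh SH -c NT_OshWrapper.exe ...
info_adm  10128   6636  0 10:59:57 con  0:00 sh
info_adm   9516  10128  0 10:59:57 con  0:00 NT_OshWrapper.exe ...
"""

def descendants(ps_text, root_pid):
    """Return every PID descended from root_pid, in discovery order."""
    children = {}
    for line in ps_text.splitlines():
        fields = line.split()
        pid, ppid = int(fields[1]), int(fields[2])
        children.setdefault(ppid, []).append(pid)
    found, stack = [], [root_pid]
    while stack:
        for child in children.get(stack.pop(), []):
            found.append(child)
            stack.append(child)
    return found

print(descendants(SAMPLE_PS, 12328))  # → [6636, 10128, 9516]
```

Those are the PIDs that would need to be killed along with 12328 itself to fully clean up the orphaned job.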
Now the questions are:

1) If a job fails, DataStage should abort it and kill every process related to that job. Why is it failing to do that?

2) Is this a sign of something wrong (in the configuration, or some other issue)?


If anyone knows about this, please share the info with me.

Thanks
DS User
lstsaur
Participant
Posts: 1139
Joined: Thu Oct 21, 2004 9:59 pm

Post by lstsaur »

Just delete that locked job's entry from the XMETALOCKINFO table. Then you will be able to compile the job.
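
For illustration of the shape of that cleanup only: the real XMETA repository lives in DB2/Oracle/SQL Server, so this sketch uses an in-memory SQLite stand-in, and the column name (LOCKNAME) is hypothetical, not the real XMETALOCKINFO schema.

```python
# In-memory stand-in for the repository database; illustration only.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE XMETALOCKINFO (LOCKNAME TEXT)")
con.execute("INSERT INTO XMETALOCKINFO VALUES ('STG_TX_XXXXX')")

# The suggested fix: remove the stale lock row for the stuck job.
con.execute("DELETE FROM XMETALOCKINFO WHERE LOCKNAME = 'STG_TX_XXXXX'")
remaining = con.execute("SELECT COUNT(*) FROM XMETALOCKINFO").fetchone()[0]
print(remaining)  # → 0
```

As with any direct edit of the repository, take a backup first and confirm the job really is dead at the OS level before deleting its lock row.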
ulab
Participant
Posts: 56
Joined: Mon Mar 16, 2009 4:58 am
Location: bangalore

Alternate solution: just rename the job, you'll be out of this problem

Post by ulab »

The suggestion above is the proper solution, but if you are not an administrator, you can solve this problem instantly with the alternate solution: just rename the job, and you'll be out of this problem.
Ulab----------------------------------------------------
help, it helps you today or Tomorrow
SURA
Premium Member
Posts: 1229
Joined: Sat Jul 14, 2007 5:16 am
Location: Sydney

Post by SURA »

Thanks for the reply. The way I expressed it may have led you to comment on how to release the job. I know how to take back control of the job, and I did. But the question is: why is it happening?

Anyhow thanks for the comments.

Thanks
DS User
SURA
Premium Member
Posts: 1229
Joined: Sat Jul 14, 2007 5:16 am
Location: Sydney

Post by SURA »

Hello All

Just an update: just now one of the jobs had the same issue, but this time it is even stranger.

In the Director and Designer it shows as completed successfully. But when the user tried to run the job again, they got the message "This job is already running!!"

There is no entry in XMETALOCKINFO / the Web Console etc., whereas I can find a process related to this job still running at the OS level.

Then I tried Director --> Job --> Cleanup Resources, found the same PID, and saw it referring to D:\IBM\InformationServer\Server\Projects\ETL_DEV\RT_CONFIG68.

When I tried to run the same job from another system, it was allowed to run.

I am totally lost. Since this is client-server technology, the job should be locked or released irrespective of the client machine. But that is not the case here, and I am not sure what is happening!

Any suggestions!!

DS User
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

What do you mean by "other system" in this context - another client?

It can be the case, when a player process on one node fails but all player processes on other nodes finish successfully before the error reporting from the failed node arrives and is processed by the conductor, that the parallel job can finish with a status of "success" even though there are Fatal errors in its log. It's usually a timing issue, as noted.

Following along the same vein, the "resource" entries in the RT_STATUSnnn table may not be updated by the failed process, which can leave that part of the job with an apparent status of "running". Clear Status File should remove this symptom. (With sufficient knowledge you could also review the contents of all the entries in RT_STATUSnnn; however this information is not documented anywhere.)
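
The timing explanation above can be reduced to a toy model (none of this is DataStage code; the message format and node numbering are invented for illustration): a conductor that finalizes the job status as soon as every node has reported will record "success" even though a node's fatal report is still in flight.

```python
# Toy model of the race: the conductor processes messages in arrival
# order and finalizes once every expected node has reported something.
from collections import deque

def conductor_status(messages, expected_nodes):
    """Return the final job status given messages in arrival order."""
    status, reported = "success", set()
    queue = deque(messages)
    while queue and reported != expected_nodes:
        node, msg = queue.popleft()
        reported.add(node)
        if msg == "fatal":
            status = "aborted"
    return status  # anything still queued after finalizing is ignored

# Node 2's player died, but its fatal report arrives after the conductor
# has already seen a completion from every node:
late = [(1, "done"), (2, "done"), (3, "done"), (2, "fatal")]
print(conductor_status(late, {1, 2, 3}))  # → success

# If the fatal report arrives in time, the job is marked aborted:
early = [(2, "fatal"), (1, "done"), (3, "done")]
print(conductor_status(early, {1, 2, 3}))  # → aborted
```

The same late-message effect is what can leave a stale "running" entry in RT_STATUSnnn, which is why Clear Status File removes the symptom.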
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
SURA
Premium Member
Posts: 1229
Joined: Sat Jul 14, 2007 5:16 am
Location: Sydney

Post by SURA »

Yes Ray

You are right. I meant the client system.

Is there any specific reason why this happens?

Is it an issue, a bug, or some other specific cause?

Thanks
DS User
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

No, it's just "how it works", and it's different from how server jobs work (there the job executes in a single process).
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.