Job Aborted, but still showing as running in Director.

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

SURA
Premium Member
Posts: 1229
Joined: Sat Jul 14, 2007 5:16 am
Location: Sydney


Post by SURA »

Hi there

I had a problem in a job that was running fine yesterday. It aborted with a fatal error.

Code:

LKUP006,2: Fatal Error: Unable to initialize communication channel on RISSDB01. This is typically caused by a configuration problem. Examples of typical problems include:

1) The temporary directory, identified by $TMPDIR and/or the scratch disks in your ORCHESTRATE configuration, is located on a non-local file system (e. g. mounted over NFS).

2) The temporary directory is located on a file system with insufficient space.
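
Cause 2 in the error message is easy to rule out from the engine host. A minimal sketch (illustration only, not a DataStage utility; the 5 GB threshold is an arbitrary assumption) of checking the temporary directory's free space:

```python
# Quick check for cause 2 above: does the temp directory named by $TMPDIR
# (or the platform default) have enough free space? Illustration only.
import os
import shutil
import tempfile

def check_tmpdir(min_free_gb=5):
    """Return the temp directory, its free space in GB, and a pass/fail flag."""
    tmpdir = os.environ.get("TMPDIR", tempfile.gettempdir())
    usage = shutil.disk_usage(tmpdir)
    free_gb = usage.free / 1024**3
    ok = free_gb >= min_free_gb
    return tmpdir, free_gb, ok

tmpdir, free_gb, ok = check_tmpdir()
print(f"{tmpdir}: {free_gb:.1f} GB free, {'OK' if ok else 'LOW'}")
```

Cause 1 (a temp directory or scratch disk on NFS) still has to be checked by hand against the scratchdisk entries in the APT configuration file.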
Then I tried to reset / recompile the job, but it was not allowed; the message said the job was being accessed by myself.

Then I tried to release the lock using the Web Console, but I couldn't find any job there in running status, whereas in Director --> Monitor I can see the job's status is running.

Then I used ps -ef to find the processes, and found the processes below, which relate to this job.

Code:

info_adm  12328   1980  0 10:59:55 con  0:01 D:\IBM\InformationServer\Server\DSEngine/bin/uvsh DSD.RUN STG_TX_XXXXX  0/0/1/0/0/0/0
info_adm   6636  12328  0 10:59:57 con  0:00 D:\IBM\InformationServer\Server\DSEngine/bin/uvsh SH -c 'D:/IBM/Information
Server/Server/DSEngine/bin/NT_OshWrapper.exe //./pipe/ETL_DEV-RT_SC255-STG_TX_XXXXX RT_SC255/OshExecut
er.sh R DUMMY  -f RT_SC255/OshScript.osh -monitorport 13400 -pf RT_SC255/jpfile -impexp_charset ASCL_MS1252 -string_char
set ASCL_MS1252 -input_charset UTF-8 -output_charset UTF-8 -collation_sequence OFF'
info_adm  10128   6636  0 10:59:57 con  0:00 sh
info_adm   9516  10128  0 10:59:57 con  0:00 D:\IBM\InformationServer\Server\DSEngine\bin\NT_OshWrapper.exe //./pipe/ETL_DEV-RT_SC255-STG_TX_XXXXX RT_SC255/OshExecuter.sh R DUMMY -f RT_SC255/OshScript.osh -monitorport 1340
0 -pf RT_SC255/jpfile -impexp_charset ASCL_MS1252 -string_charset ASCL_MS1252 -input_charset UTF-8 -output_charset UTF-8
 -collation_sequence OFF
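
When a tree like the one above is left behind, the leftover PIDs can be collected by walking the parent/child links in the `ps -ef` output. A sketch (not a DataStage tool; the sample lines are shortened versions of the ones above, and in real use you would feed it the live `ps -ef` output):

```python
# Parse ps -ef style output into a parent -> children map, then collect
# every descendant of the DSD.RUN supervisor PID (12328 in the post).
SAMPLE_PS = """\
info_adm  12328   1980  0 10:59:55 con  0:01 uvsh DSD.RUN STG_TX_XXXXX 0/0/1/0/0/0/0
info_adm   6636  12328  0 10:59:57 con  0:00 uvsh SH -c NT_OshWrapper.exe ...
info_adm  10128   6636  0 10:59:57 con  0:00 sh
info_adm   9516  10128  0 10:59:57 con  0:00 NT_OshWrapper.exe ...
"""

def descendants(ps_text, root_pid):
    """Return every PID descended from root_pid, in discovery order."""
    children = {}
    for line in ps_text.splitlines():
        fields = line.split()
        pid, ppid = int(fields[1]), int(fields[2])
        children.setdefault(ppid, []).append(pid)
    found, stack = [], [root_pid]
    while stack:
        for child in children.get(stack.pop(), []):
            found.append(child)
            stack.append(child)
    return found

print(descendants(SAMPLE_PS, 12328))  # → [6636, 10128, 9516]
```

Those are the PIDs that would need to be killed along with 12328 itself to fully clean up the orphaned job.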
Now the questions are:

1) If a job fails, DataStage should abort it and kill every process related to that job. Why is it failing to do that?

2) Is this a sign of something wrong (in the configuration, or some other issue)?


If anyone knows about this, please share the info with me.

Thanks
DS User
lstsaur
Participant
Posts: 1139
Joined: Thu Oct 21, 2004 9:59 pm

Post by lstsaur »

Just delete that locked job's entry from the XMETALOCKINFO table. Then you will be able to compile the job.
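
For illustration of the shape of that cleanup only: the real XMETA repository lives in DB2/Oracle/SQL Server, so this sketch uses an in-memory SQLite stand-in, and the column name (LOCKNAME) is hypothetical, not the real XMETALOCKINFO schema.

```python
# In-memory stand-in for the repository database; illustration only.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE XMETALOCKINFO (LOCKNAME TEXT)")
con.execute("INSERT INTO XMETALOCKINFO VALUES ('STG_TX_XXXXX')")

# The suggested fix: remove the stale lock row for the stuck job.
con.execute("DELETE FROM XMETALOCKINFO WHERE LOCKNAME = 'STG_TX_XXXXX'")
remaining = con.execute("SELECT COUNT(*) FROM XMETALOCKINFO").fetchone()[0]
print(remaining)  # → 0
```

As with any direct edit of the repository, take a backup first and confirm the job really is dead at the OS level before deleting its lock row.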
ulab
Participant
Posts: 56
Joined: Mon Mar 16, 2009 4:58 am
Location: bangalore

Alternate solution: just rename the job, you'll be out of this problem

Post by ulab »

The suggestion above is the proper solution, but if you are not an administrator, you can solve this problem instantly with the alternate solution: just rename the job, and you'll be out of this problem.
Ulab----------------------------------------------------
help, it helps you today or Tomorrow
SURA
Premium Member
Posts: 1229
Joined: Sat Jul 14, 2007 5:16 am
Location: Sydney

Post by SURA »

Thanks for the reply. The way I expressed it may have led you to comment on how to release the job. I know how to take back control of the job, and I did. But the question is: why is it happening?

Anyhow thanks for the comments.

Thanks
DS User
SURA
Premium Member
Posts: 1229
Joined: Sat Jul 14, 2007 5:16 am
Location: Sydney

Post by SURA »

Hello All

Just an update: just now one of the jobs had the same issue, but this time it is even stranger.

In the Director and Designer it shows as completed successfully. But when the user tried to run the job again, they got the message "This job is already running!!"

There is no entry in XMETALOCKINFO / the Web Console etc., whereas I can find a process related to this job still running at the OS level.

Then I tried Director --> Job --> Cleanup Resources, found the same PID, and saw it referring to D:\IBM\InformationServer\Server\Projects\ETL_DEV\RT_CONFIG68.

When I tried to run the same job from another system, it was allowed to run.

I am totally lost. Since this is client-server technology, the job should be locked or released irrespective of the client machine. But that is not the case here, and I am not sure what is happening!

Any suggestions!!

DS User
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

What do you mean by "other system" in this context - another client?

It can be the case, when a player process on one node fails but all player processes on other nodes finish successfully before the error reporting from the failed node arrives and is processed by the conductor, that the parallel job can finish with a status of "success" even though there are Fatal errors in its log. It's usually a timing issue, as noted.

Following along the same vein, the "resource" entries in the RT_STATUSnnn table may not be updated by the failed process, which can leave that part of the job with an apparent status of "running". Clear Status File should remove this symptom. (With sufficient knowledge you could also review the contents of all the entries in RT_STATUSnnn; however this information is not documented anywhere.)
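
The timing explanation above can be reduced to a toy model (none of this is DataStage code; the message format and node numbering are invented for illustration): a conductor that finalizes the job status as soon as every node has reported will record "success" even though a node's fatal report is still in flight.

```python
# Toy model of the race: the conductor processes messages in arrival
# order and finalizes once every expected node has reported something.
from collections import deque

def conductor_status(messages, expected_nodes):
    """Return the final job status given messages in arrival order."""
    status, reported = "success", set()
    queue = deque(messages)
    while queue and reported != expected_nodes:
        node, msg = queue.popleft()
        reported.add(node)
        if msg == "fatal":
            status = "aborted"
    return status  # anything still queued after finalizing is ignored

# Node 2's player died, but its fatal report arrives after the conductor
# has already seen a completion from every node:
late = [(1, "done"), (2, "done"), (3, "done"), (2, "fatal")]
print(conductor_status(late, {1, 2, 3}))  # → success

# If the fatal report arrives in time, the job is marked aborted:
early = [(2, "fatal"), (1, "done"), (3, "done")]
print(conductor_status(early, {1, 2, 3}))  # → aborted
```

The same late-message effect is what can leave a stale "running" entry in RT_STATUSnnn, which is why Clear Status File removes the symptom.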
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
SURA
Premium Member
Posts: 1229
Joined: Sat Jul 14, 2007 5:16 am
Location: Sydney

Post by SURA »

Yes Ray

You are right. I meant the client system.

Is there any specific reason why this happens?

Is it an issue, a bug, or some other specific cause?

Thanks
DS User
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

No, it's just "how it works", and it's different from how server jobs work (there the job executes in a single process).
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.