ExeCmd always execute SERIALLY within a single Sequence Job
-
- Premium Member
- Posts: 39
- Joined: Tue Apr 15, 2014 9:14 am
Hi
I have a single sequence job. Within it are five independent Execute Command stages, each calling a shell script that runs a Hive (Hadoop) command. The shell script logs the system time immediately at the start and again at the end, so I know when each script starts and stops, and therefore when each Execute Command stage starts and stops.
The Execute Command stages are independent, with no trigger links connecting them. After the job runs I look at the log, and I notice that these Execute Commands always execute SERIALLY, not in parallel as I expected.
I found a workaround: if I create five sequence jobs and place one Execute Command in each, then the sequence jobs execute in PARALLEL, and hence the Execute Commands execute in PARALLEL.
Is this normal behavior? Even though the workaround above works (embedding each Execute Command in its own sequence job), it is hardly satisfactory. I have 50+ Execute Commands to run; does that mean I need 50+ sequence jobs? Some jobs have 200+ tables, each requiring a Hive connection, ...
I really want just one (or a few) sequence jobs to run all these Execute Commands.
Is this possible? Is there an option somewhere that I missed?
(DataStage Designer V 9.1.2.0)
Thanks
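The timing wrapper described above might look like the sketch below; the log path and the Hive invocation are illustrative stand-ins, not the poster's actual script:

```shell
# Log the system time at the start and end of the script so the
# DataStage log can be compared against actual start/stop times.
LOG=/tmp/hive_load_timing.log
echo "start: $(date '+%Y-%m-%d %H:%M:%S')" >> "$LOG"
# hive -e "LOAD DATA ..."   # the real Hive command would run here
sleep 1                     # runnable stand-in for the Hive command
echo "stop:  $(date '+%Y-%m-%d %H:%M:%S')" >> "$LOG"
```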
Re: ExeCmd always execute SERIALLY within a single Sequence
eli.nawas_AUS wrote: Is this normal behavior?
Yes.
Perhaps a scripted approach is in order for these 50+ commands?
-craig
"You can never have too many knives" -- Logan Nine Fingers
Activities in a job sequence are executed sequentially.
You will see that if you investigate the generated BASIC code.
A job activity starts a job without waiting for it to finish. That is why you obtain the parallel execution of your commands by placing them inside a job sequence and executing them via a job activity. They run concurrently because your command execution time is longer than the small amount of time that is required to start each job sequence that contains a command.
In contrast, an execute command activity waits for the command to finish before it returns control to its controlling job sequence. That's why they do not start at the same time even though you have no trigger links between them.
Run your command in the background and the Execute Command activity will return control to the sequence immediately.
Mike
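Mike's point about backgrounding can be seen in miniature with a plain shell; the sleep stands in for the Hive command, and timings are illustrative:

```shell
# A foreground command holds the Execute Command activity until it
# finishes; backgrounding it with nohup ... & returns control at once.
start=$(date +%s)
nohup sleep 2 >/dev/null 2>&1 &    # returns immediately
end=$(date +%s)
echo "control returned after $((end - start)) second(s)"
```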
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
To run a command in the background, add an ampersand after the command. To make sure it doesn't die when it loses contact with its parent process (a hangup), prefix it with the nohup command. For example:
nohup echo "Dummy Heading" > #jpFilePath# &
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Premium Member
- Posts: 39
- Joined: Tue Apr 15, 2014 9:14 am
Hi
I added the "nohup ... &" and the stage now returns immediately.
I have a new issue, though: I need to wait for all these processes to complete before the next step.
Solution 1: I added sequencer2 after these ExeCmds, like this:

                               +--> ExeCmd1 --+
                               |              |
uservariable -> sequencer1 ----+--> ExeCmd2 --+--> sequencer2(all)
                               |              |
                               +--> ExeCmd3 --+

But this does not work: every ExeCmd returns TRUE immediately, so sequencer2 continues on. It has no way to know it should wait for those background processes.
Solution 2: I added another ExeCmd containing the wait command, like this:

                               +--> ExeCmd1 --+
                               |              |
uservariable -> sequencer1 ----+--> ExeCmd2 --+--> sequencer2 --> ExeCmd3
                               |              |
                               +--> ExeCmd3 --+

The final ExeCmd3 (after sequencer2) just has one "wait" command inside.
I was thinking the wait command should work because, without any PID argument, "wait" waits for all child processes.
But this does not work either: that wait runs in a brand-new shell process, which has no child processes of its own, so it finishes immediately instead of waiting for the background processes from ExeCmd1..3.
What can I use to force sequencer2 or the final ExeCmd3 to wait for all subprocesses to complete?
Thanks
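The failure mode described here can be demonstrated directly in a plain shell: wait only blocks on children of the shell executing it, and each Execute Command stage gets a fresh shell. A minimal sketch, with sleep standing in for a Hive command:

```shell
# `wait` only blocks on children of the current shell. A fresh shell
# (as each Execute Command stage gets) has no children, so its wait
# returns immediately; only the shell that spawned the background
# process actually blocks.
sleep 3 >/dev/null 2>&1 &          # background child of THIS shell
start=$(date +%s)
sh -c 'wait'                       # new shell, no children: returns at once
mid=$(date +%s)
wait                               # same shell: blocks until sleep exits
end=$(date +%s)
echo "inner wait: $((mid - start))s, outer wait: $((end - mid))s"
```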
-
- Premium Member
- Posts: 39
- Joined: Tue Apr 15, 2014 9:14 am
Hi
This is my understanding; please correct as needed:
- The "jobs" command must be run within an ExeCmd (because it is a Unix command).
- Since all the background processes are child processes, we must find the PID of the parent process (the first sequence job).
- So how do you get the PID of the parent job? Does DataStage provide such a function?
Thanks
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
You can get the PID in a number of ways, for example enabling APT_PM_SHOW_PIDS. But this gives the PIDs of the player processes; their parents are the section leader processes, and their parent is either the conductor process or its rsh agent. And only the parent process of that will be the PID of the controlling sequence.
It's probably easier to use the UNIX command ps -ef with an appropriate grep filter piped into a cut command to retrieve the PPID.
Sorry to be so generic, but I don't really have the time to devote to solving your particular problem right now.
I would probably prefer to use a DataStage routine here, in which information about the controlling sequence is readily obtained using DataStage API functions.
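A minimal sketch of that ps-based lookup. It uses awk on the PID field rather than a fixed-width cut, since ps -ef column widths vary by platform; as a self-contained demonstration it looks up the current shell's own parent, where in the sequence you would filter for the process of interest instead:

```shell
# Retrieve the parent PID (PPID) from ps -ef output.
# ps -ef columns begin: UID PID PPID ... ; matching on the PID column
# with awk is more portable than cutting fixed character positions.
ppid=$(ps -ef | awk -v pid="$$" '$2 == pid {print $3; exit}')
echo "PPID of $$ is $ppid"
```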
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Another thought. When you start the command in the background with &, the shell records its PID in $!; echo $! and that PID could be captured via the $CommandOutput activity variable of the Execute Command activity.
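A sketch of that capture-and-poll idea follows. The echo of $! is the part $CommandOutput could pick up, and the kill -0 loop is one way a downstream stage could wait on the captured PID; both stages are illustrative sketches, not confirmed DataStage usage, and sleep stands in for the Hive command:

```shell
# Stage 1 (the Execute Command activity): start the long command in
# the background and print its PID, which the sequence can capture
# from $CommandOutput.
nohup sleep 2 >/dev/null 2>&1 &
pid=$!
echo "$pid"

# Stage 2 (a later Execute Command, given that PID): poll until the
# process exits. kill -0 sends no signal; it only tests existence.
while kill -0 "$pid" 2>/dev/null; do
    sleep 1
done
echo "process $pid finished"
```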