ExeCmd always execute SERIALLY within a single Sequence Job
-
- Premium Member
- Posts: 39
- Joined: Tue Apr 15, 2014 9:14 am
Hi
I have a single sequence job. Within it are five independent Execute Command stages, each calling a shell script that runs a Hive (Hadoop) command. The shell script logs the system time immediately at the start and again at the end, so I know when each script starts and stops, and therefore when each Execute Command stage starts and stops.
The Execute Command stages are independent, with no trigger links connecting them. After the job runs I look at the log, and I notice that these Execute Commands always execute SERIALLY, not in parallel as I expected.
I found a workaround: if I create five sequence jobs and place one Execute Command in each, then the sequence jobs execute in PARALLEL, and hence the Execute Commands execute in PARALLEL.
Is this normal behavior? Even though the workaround above works (embedding each Execute Command in its own sequence job), it is hardly satisfactory. I have 50+ Execute Commands to run; does that mean I need 50+ sequence jobs? Some jobs have 200+ tables, each requiring a Hive connection, ...
I really want just one (or a few) sequence jobs to run all these Execute Commands.
Is this possible? Is there an option somewhere that I missed?
(DataStage Designer V 9.1.2.0)
Thanks
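The timing wrapper described above might look like the sketch below; the log path and the Hive invocation are illustrative stand-ins, not the poster's actual script:

```shell
# Log the system time at the start and end of the script so the
# DataStage log can be compared against actual start/stop times.
LOG=/tmp/hive_load_timing.log
echo "start: $(date '+%Y-%m-%d %H:%M:%S')" >> "$LOG"
# hive -e "LOAD DATA ..."   # the real Hive command would run here
sleep 1                     # runnable stand-in for the Hive command
echo "stop:  $(date '+%Y-%m-%d %H:%M:%S')" >> "$LOG"
```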
Re: ExeCmd always execute SERIALLY within a single Sequence
eli.nawas_AUS wrote: Is this normal behavior?
Yes.
Perhaps a scripted approach is in order for these 50+ commands?
-craig
"You can never have too many knives" -- Logan Nine Fingers
Activities in a job sequence are executed sequentially.
You will see that if you investigate the generated BASIC code.
A job activity starts a job without waiting for it to finish. That is why you obtain the parallel execution of your commands by placing them inside a job sequence and executing them via a job activity. They run concurrently because your command execution time is longer than the small amount of time that is required to start each job sequence that contains a command.
In contrast, an execute command activity waits for the command to finish before it returns control to its controlling job sequence. That's why they do not start at the same time even though you have no trigger links between them.
Run your command in the background and the Execute Command activity will return control to the sequence immediately.
Mike
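Mike's point about backgrounding can be seen in miniature with a plain shell; the sleep stands in for the Hive command, and timings are illustrative:

```shell
# A foreground command holds the Execute Command activity until it
# finishes; backgrounding it with nohup ... & returns control at once.
start=$(date +%s)
nohup sleep 2 >/dev/null 2>&1 &    # returns immediately
end=$(date +%s)
echo "control returned after $((end - start)) second(s)"
```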
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
To run a command in the background, add an ampersand after the command. To make sure it doesn't die when it loses contact with its parent process (a hangup), prefix it with the nohup command. For example:
nohup echo "Dummy Heading" > #jpFilePath# &
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Premium Member
- Posts: 39
- Joined: Tue Apr 15, 2014 9:14 am
Hi
I added the "nohup ... &" and the stage now returns immediately.
I have a new issue, though: I need to wait for all these processes to complete before the next step.
Solution 1: I added sequencer2 after these ExeCmds, like this:

                               +--> ExeCmd1 --+
                               |              |
uservariable -> sequencer1 ----+--> ExeCmd2 --+--> sequencer2(all)
                               |              |
                               +--> ExeCmd3 --+

But this does not work: every ExeCmd returns TRUE immediately, so sequencer2 continues on. It has no way to know it should wait for those background processes.
Solution 2: I added another ExeCmd containing the wait command, like this:

                               +--> ExeCmd1 --+
                               |              |
uservariable -> sequencer1 ----+--> ExeCmd2 --+--> sequencer2 --> ExeCmd3
                               |              |
                               +--> ExeCmd3 --+

The final ExeCmd3 (after sequencer2) just has one "wait" command inside.
I was thinking the wait command should work because, without any PID argument, "wait" waits for all child processes.
But this does not work either: that wait runs in a brand-new shell process, which has no child processes of its own, so it finishes immediately instead of waiting for the background processes from ExeCmd1..3.
What can I use to force sequencer2 or the final ExeCmd3 to wait for all subprocesses to complete?
Thanks
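The failure mode described here can be demonstrated directly in a plain shell: wait only blocks on children of the shell executing it, and each Execute Command stage gets a fresh shell. A minimal sketch, with sleep standing in for a Hive command:

```shell
# `wait` only blocks on children of the current shell. A fresh shell
# (as each Execute Command stage gets) has no children, so its wait
# returns immediately; only the shell that spawned the background
# process actually blocks.
sleep 3 >/dev/null 2>&1 &          # background child of THIS shell
start=$(date +%s)
sh -c 'wait'                       # new shell, no children: returns at once
mid=$(date +%s)
wait                               # same shell: blocks until sleep exits
end=$(date +%s)
echo "inner wait: $((mid - start))s, outer wait: $((end - mid))s"
```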
-
- Premium Member
- Posts: 39
- Joined: Tue Apr 15, 2014 9:14 am
Hi
This is my understanding; please correct as needed:
- The "jobs" command must be run within an ExeCmd (because it is a Unix command).
- Since all the background processes are child processes, we must find the PID of the parent process (the first sequence job).
- So how do you get the PID of the parent job? Does DataStage provide such a function?
Thanks
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
You can get the PID in a number of ways, for example enabling APT_PM_SHOW_PIDS. But this gives the PIDs of the player processes; their parents are the section leader processes, and their parent is either the conductor process or its rsh agent. And only the parent process of that will be the PID of the controlling sequence.
It's probably easier to use the UNIX command ps -ef with an appropriate grep filter piped into a cut command to retrieve the PPID.
Sorry to be so generic, but I don't really have the time to devote to solving your particular problem right now.
I would probably prefer to use a DataStage routine here, in which information about the controlling sequence is readily obtained using DataStage API functions.
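A minimal sketch of that ps-based lookup. It uses awk on the PID field rather than a fixed-width cut, since ps -ef column widths vary by platform; as a self-contained demonstration it looks up the current shell's own parent, where in the sequence you would filter for the process of interest instead:

```shell
# Retrieve the parent PID (PPID) from ps -ef output.
# ps -ef columns begin: UID PID PPID ... ; matching on the PID column
# with awk is more portable than cutting fixed character positions.
ppid=$(ps -ef | awk -v pid="$$" '$2 == pid {print $3; exit}')
echo "PPID of $$ is $ppid"
```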
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Another thought. When you start the command in the background with &, the shell records its PID in $!; echo $! and that PID could be captured via the $CommandOutput activity variable of the Execute Command activity.
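A sketch of that capture-and-poll idea follows. The echo of $! is the part $CommandOutput could pick up, and the kill -0 loop is one way a downstream stage could wait on the captured PID; both stages are illustrative sketches, not confirmed DataStage usage, and sleep stands in for the Hive command:

```shell
# Stage 1 (the Execute Command activity): start the long command in
# the background and print its PID, which the sequence can capture
# from $CommandOutput.
nohup sleep 2 >/dev/null 2>&1 &
pid=$!
echo "$pid"

# Stage 2 (a later Execute Command, given that PID): poll until the
# process exits. kill -0 sends no signal; it only tests existence.
while kill -0 "$pid" 2>/dev/null; do
    sleep 1
done
echo "process $pid finished"
```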