TWO DIFFERENT DATA FLOWS IN A SINGLE JOB

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

kumarjit
Participant
Posts: 99
Joined: Fri Oct 12, 2012 7:47 am
Location: Kolkata

TWO DIFFERENT DATA FLOWS IN A SINGLE JOB

Post by kumarjit »

I have been meaning for quite some time to ask this question here, but was too tied up in other issues lately.

In our DataStage Repository there are many jobs which, when I open them in the Designer canvas, show TWO OR MORE INDEPENDENT data flows within the same job.

Something like this:

Code: Select all

Oci ----> Transformer ----> Dataset1

Oci ----> Transformer ----> Dataset2

My question is, how does DataStage handle two different data flows within the same job?
1. Taking into account that the job is parallel, how are the stages executed over the available nodes of the system?
2. Is one data flow's logic executed at a time, followed by the other?


Thanks
Kumarjit.
Pain is the best teacher, but very few attend his class..
jerome_rajan
Premium Member
Posts: 376
Joined: Sat Jan 07, 2012 12:25 pm
Location: Piscataway

Post by jerome_rajan »

This might help you.
Jerome
Data Integration Consultant at AWS
Connect With Me On LinkedIn

Life is really simple, but we insist on making it complicated.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

That design is no different from creating two separate jobs that you want to run at the same time.
-craig

"You can never have too many knives" -- Logan Nine Fingers
kumarjit
Participant
Posts: 99
Joined: Fri Oct 12, 2012 7:47 am
Location: Kolkata

Post by kumarjit »

@craig: As you say, THE JOB DESIGN IS THE SAME AS TWO SEPARATE JOBS RUNNING AT THE SAME TIME. Let's say the job is running over a 4-node MPP system; how will the nodes execute each stage of the job? Will each flow be catered for by 2 nodes? I'm confused ...... :roll:

Thanks
Kumarjit.
Pain is the best teacher, but very few attend his class..
jerome_rajan
Premium Member
Posts: 376
Joined: Sat Jan 07, 2012 12:25 pm
Location: Piscataway

Post by jerome_rajan »

If you're using the 4-node config file for the job, both flows will run on all 4 nodes, unless of course you are applying node pool constraints.
Just think of 2 different jobs with the same configuration running at the same time!
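
To make the node pool idea concrete, here is a rough sketch of what a 4-node parallel configuration file with two named node pools might look like. The host name, resource paths, and pool names below are placeholders, not anything from this thread; a stage or flow constrained to the "flow1" pool would then execute on only two of the four nodes, while unconstrained stages use all four.

Code: Select all

{
    node "node0"
    {
        fastname "etl_host"
        pools "" "flow1"
        resource disk "/ds/data" {pools ""}
        resource scratchdisk "/ds/scratch" {pools ""}
    }
    node "node1"
    {
        fastname "etl_host"
        pools "" "flow1"
        resource disk "/ds/data" {pools ""}
        resource scratchdisk "/ds/scratch" {pools ""}
    }
    node "node2"
    {
        fastname "etl_host"
        pools "" "flow2"
        resource disk "/ds/data" {pools ""}
        resource scratchdisk "/ds/scratch" {pools ""}
    }
    node "node3"
    {
        fastname "etl_host"
        pools "" "flow2"
        resource disk "/ds/data" {pools ""}
        resource scratchdisk "/ds/scratch" {pools ""}
    }
}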
Jerome
Data Integration Consultant at AWS
Connect With Me On LinkedIn

Life is really simple, but we insist on making it complicated.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Exactly... nothing really to be confused about as they run just like any other PX job would. And no need for all that YELLING. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
FranklinE
Premium Member
Posts: 739
Joined: Tue Nov 25, 2008 2:19 pm
Location: Malvern, PA

Post by FranklinE »

But! :wink:

Maybe I don't understand the infrastructure well enough, but it seems to me that there is a significant difference: two separate jobs have two separate sets of parent and child processes. Two "flows" in one job share one parent with its children.

My jobs run in a very "crowded" environment. This is something about which I need to be aware.
Franklin Evans
"Shared pain is lessened, shared joy increased. Thus do we refute entropy." -- Spider Robinson

Using mainframe data FAQ: viewtopic.php?t=143596 Using CFF FAQ: viewtopic.php?t=157872
Mike
Premium Member
Posts: 1021
Joined: Sun Mar 03, 2002 6:01 pm
Location: Tampa, FL

Post by Mike »

I tend not to put multiple independent data flows in a single job simply because of unit of work (UOW) and restart/recovery considerations. If one fails, they all fail; and you can't run one flow without running all flows.

Resource usage isn't much of a concern. As independent jobs, I would have them running concurrently from a job sequence.

Mike
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Each operator in a parallel job runs in a separate process anyway. So it could be argued that two flows in one job would use fewer processes than two jobs, because only one set of conductor and section leader processes would be required.
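
As a rough sketch of that arithmetic (assuming a 4-node configuration file, three operators per flow, and ignoring operator combination, which can merge operators into fewer processes):

Code: Select all

Two flows in one job : 1 conductor + 4 section leaders + (2 flows x 3 operators x 4 nodes) = 29 processes
Two separate jobs    : 2 conductors + 8 section leaders + (2 flows x 3 operators x 4 nodes) = 34 processes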
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Which was FranklinE's point, I do believe. However, much like Mike, I never considered lumping multiple independent flows into a single job any kind of a good idea, for the reasons he stated.
-craig

"You can never have too many knives" -- Logan Nine Fingers
kumarjit
Participant
Posts: 99
Joined: Fri Oct 12, 2012 7:47 am
Location: Kolkata

Post by kumarjit »

Let's consider the following job scenario:

Code: Select all

Oci_Src--> Tfm1--> Dataset1

Dataset2--> Tfm2--> Oci_Tgt

In the above situation:
> The first flow uses Dataset1 to write to a file f1.ds
> Dataset2 is used to read the data written to the file f1.ds and perform any
subsequent functions as part of the second flow
> Oci_Src reads data in volumes of millions of rows
> The host system has a 4-node configuration file


The question is:
If the first node reads, say, 20,000 rows and writes them to f1.ds via Dataset1, the second flow automatically starts to execute, but by the time the second flow completes, the file f1.ds will have been updated by the next node's buffer read from Oci_Src. So the second flow will always execute against a source whose data volume keeps changing throughout the duration of the process. In such a case, will the job not abort because the second flow will be trying to read from a file with a dynamic data volume?

Thanks
Kumarjit.
Pain is the best teacher, but very few attend his class..
eostic
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

Does it matter? The key point throughout this thread is that "while it can be done", the management complexity of having two flows in the same Job vs. two separate Jobs typically outweighs any benefit of, or reason for, multiples in a single Job. DataStage has allowed this since release 1, but best practices prevail, and you will find that this is rarely done. They will both abort if one aborts, as noted above, and I am certain that there are creative solutions out there that take advantage of that. However, in most cases, individual control is more valuable, individual bits of logic are easier to test, and, perhaps most important, the flows are easier to re-use among different applications.

Ernie
Ernie Ostic

blogit!
Open IGC is Here! (https://dsrealtime.wordpress.com/2015/0 ... ere/)
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm

Post by jwiles »

From a technical point of view: the scenario you proposed is invalid anyway... DataStage will not allow you to write to and read from the same dataset within the same job. The job would abort with an error such as:

Code: Select all

Data_Set_2: Operator initialization: The data "/home/dsadm/MyTestDS.ds" may not have more than one file/ds override

It would not work reliably as two separate jobs either, resulting in either a job failure or incorrect results.

Ernie's points pretty much sum it up. There is merit to multiple streams in some situations (I'll occasionally use them for comparative logic testing with peeks/rowgens/etc.), but it's not a recommended best practice for any production work.

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
FranklinE
Premium Member
Posts: 739
Joined: Tue Nov 25, 2008 2:19 pm
Location: Malvern, PA

Post by FranklinE »

I'm in the office today for exactly that: a job (another project's choice, not mine) has multiple threads at the job sequence level. My "day" started at Oh-dark-thirty. :x
Franklin Evans
"Shared pain is lessened, shared joy increased. Thus do we refute entropy." -- Spider Robinson

Using mainframe data FAQ: viewtopic.php?t=143596 Using CFF FAQ: viewtopic.php?t=157872