TWO DIFFERENT DATA FLOWS IN A SINGLE JOB

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

kumarjit
Participant
Posts: 99
Joined: Fri Oct 12, 2012 7:47 am
Location: Kolkata

TWO DIFFERENT DATA FLOWS IN A SINGLE JOB

Post by kumarjit »

I have been meaning for quite some time to ask this question here, but was too tied up in other issues lately.

In our DataStage Repository there are many jobs which, when I open them in the Designer canvas, show TWO OR MORE INDEPENDENT data flows within the same job.

Something like this:

Code: Select all

Oci ----> Transformer ----> Dataset1

Oci ----> Transformer ----> Dataset2

My question is, how does DataStage handle two different data flows within the same job?
1. Taking into account that the job is parallel, how are the stages executed over the available nodes of the system?
2. Is one data flow's logic executed at a time, followed by the other?


Thanks
Kumarjit.
Pain is the best teacher, but very few attend his class..
jerome_rajan
Premium Member
Posts: 376
Joined: Sat Jan 07, 2012 12:25 pm
Location: Piscataway

Post by jerome_rajan »

This might help you.
Jerome
Data Integration Consultant at AWS
Connect With Me On LinkedIn

Life is really simple, but we insist on making it complicated.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

That design is no different from creating two separate jobs that you want to run at the same time.
-craig

"You can never have too many knives" -- Logan Nine Fingers
kumarjit
Participant
Posts: 99
Joined: Fri Oct 12, 2012 7:47 am
Location: Kolkata

Post by kumarjit »

@craig: As you say, THE JOB DESIGN IS THE SAME AS TWO SEPARATE JOBS RUNNING AT THE SAME TIME. Let's say the job is running over a 4-node MPP system; how will the nodes execute each stage of the job? Will each flow be catered for by 2 nodes? I'm confused ...... :roll:

Thanks
Kumarjit.
Pain is the best teacher, but very few attend his class..
jerome_rajan
Premium Member
Posts: 376
Joined: Sat Jan 07, 2012 12:25 pm
Location: Piscataway

Post by jerome_rajan »

If you're using the 4-node config file for the job, both flows will run on all 4 nodes, unless of course you are applying node pool constraints.
Just think of 2 different jobs with the same configuration running at the same time!
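
To make the node pool idea concrete, here is a rough sketch of what a 4-node parallel configuration file with two named node pools might look like. The host name, resource paths, and pool names below are placeholders, not anything from this thread; a stage or flow constrained to the "flow1" pool would then execute on only two of the four nodes, while unconstrained stages use all four.

Code: Select all

{
    node "node0"
    {
        fastname "etl_host"
        pools "" "flow1"
        resource disk "/ds/data" {pools ""}
        resource scratchdisk "/ds/scratch" {pools ""}
    }
    node "node1"
    {
        fastname "etl_host"
        pools "" "flow1"
        resource disk "/ds/data" {pools ""}
        resource scratchdisk "/ds/scratch" {pools ""}
    }
    node "node2"
    {
        fastname "etl_host"
        pools "" "flow2"
        resource disk "/ds/data" {pools ""}
        resource scratchdisk "/ds/scratch" {pools ""}
    }
    node "node3"
    {
        fastname "etl_host"
        pools "" "flow2"
        resource disk "/ds/data" {pools ""}
        resource scratchdisk "/ds/scratch" {pools ""}
    }
}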
Jerome
Data Integration Consultant at AWS
Connect With Me On LinkedIn

Life is really simple, but we insist on making it complicated.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Exactly... nothing really to be confused about as they run just like any other PX job would. And no need for all that YELLING. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
FranklinE
Premium Member
Posts: 739
Joined: Tue Nov 25, 2008 2:19 pm
Location: Malvern, PA

Post by FranklinE »

But! :wink:

Maybe I don't understand the infrastructure well enough, but it seems to me that there is a significant difference: two separate jobs have two separate sets of parent and child processes. Two "flows" in one job share one parent with its children.

My jobs run in a very "crowded" environment. This is something about which I need to be aware.
Franklin Evans
"Shared pain is lessened, shared joy increased. Thus do we refute entropy." -- Spider Robinson

Using mainframe data FAQ: viewtopic.php?t=143596 Using CFF FAQ: viewtopic.php?t=157872
Mike
Premium Member
Posts: 1021
Joined: Sun Mar 03, 2002 6:01 pm
Location: Tampa, FL

Post by Mike »

I tend not to put multiple independent data flows in a single job simply because of unit of work (UOW) and restart/recovery considerations. If one fails, they all fail; and you can't run one flow without running all flows.

Resource usage isn't much of a concern. As independent jobs, I would have them running concurrently from a job sequence.

Mike
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Each operator in a parallel job runs in a separate process anyway. So it could be argued that two flows in one job would use fewer processes than two jobs, because only one set of conductor and section leader processes would be required.
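
As a rough sketch of that arithmetic (assuming a 4-node configuration file, three operators per flow, and ignoring operator combination, which can merge operators into fewer processes):

Code: Select all

Two flows in one job : 1 conductor + 4 section leaders + (2 flows x 3 operators x 4 nodes) = 29 processes
Two separate jobs    : 2 conductors + 8 section leaders + (2 flows x 3 operators x 4 nodes) = 34 processes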
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Which was FranklinE's point, I do believe. However, much like Mike, I never considered lumping multiple independent flows into a single job any kind of a good idea, for the reasons he stated.
-craig

"You can never have too many knives" -- Logan Nine Fingers
kumarjit
Participant
Posts: 99
Joined: Fri Oct 12, 2012 7:47 am
Location: Kolkata

Post by kumarjit »

Let's consider the following job scenario:

Code: Select all

Oci_Src--> Tfm1--> Dataset1

Dataset2--> Tfm2--> Oci_Tgt

In the above situation:
> The first flow uses Dataset1 to write to a file f1.ds
> Dataset2 is used to read the data written to the file f1.ds and perform any
subsequent functions as part of the second flow
> Oci_Src reads data in volumes of millions of rows
> The host system has a 4-node configuration file


The question is:
If the first node reads, say, 20,000 rows and writes them to f1.ds via Dataset1, the second flow automatically starts to execute, but by the time the second flow completes, the file f1.ds will have been updated by the next node's buffer read from Oci_Src. So the second flow will always execute against a source whose data volume keeps changing throughout the duration of the process. In such a case, will the job not abort because the second flow will be trying to read from a file with a dynamic data volume?

Thanks
Kumarjit.
Pain is the best teacher, but very few attend his class..
eostic
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

Does it matter? The key point throughout this thread is that "while it can be done", the management complexity of having two flows in the same Job vs. two separate Jobs typically outweighs any benefit of, or reason for, multiples in a single Job. DataStage has allowed this since release 1, but best practices prevail, and you will find that this is rarely done. They will both abort if one aborts, as noted above, and I am certain that there are creative solutions out there that take advantage of that. However, in most cases, individual control is more valuable, individual bits of logic are easier to test, and, perhaps most important, the flows are easier to re-use among different applications.

Ernie
Ernie Ostic

blogit!
Open IGC is Here! (https://dsrealtime.wordpress.com/2015/0 ... ere/)
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm

Post by jwiles »

From a technical point of view: the scenario you proposed is invalid anyway... DataStage will not allow you to write to and read from the same dataset within the same job. The job would abort with an error such as:

Code: Select all

Data_Set_2: Operator initialization: The data "/home/dsadm/MyTestDS.ds" may not have more than one file/ds override

It would not work reliably as two separate jobs either, resulting in either a job failure or incorrect results.

Ernie's points pretty much sum it up. There is merit to multiple streams in some situations (I'll occasionally use them for comparative logic testing with peeks/rowgens/etc.), but it's not a recommended best practice for any production work.

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
FranklinE
Premium Member
Posts: 739
Joined: Tue Nov 25, 2008 2:19 pm
Location: Malvern, PA

Post by FranklinE »

I'm in the office today for exactly that: a job (another project's choice, not mine) has multiple threads at the job sequence level. My "day" started at Oh-dark-thirty. :x
Franklin Evans
"Shared pain is lessened, shared joy increased. Thus do we refute entropy." -- Spider Robinson

Using mainframe data FAQ: viewtopic.php?t=143596 Using CFF FAQ: viewtopic.php?t=157872