
Maximum stages in a job

Posted: Fri Dec 09, 2005 3:39 am
by rkdatastage
Hi

Can anyone clear up my doubt: is there any limit on the number of stages that can be used in a job? I want to design a job that is planned to use a large number of stages.
Is it correct practice to design the most complex job as a single job, or should it be divided into multiple jobs?
Is there any limit on the number of stages in a DataStage job, i.e. a maximum number of stages that can be used in one job?

The earliest response will be appreciated.
Thanks in advance

RK

Posted: Fri Dec 09, 2005 4:14 am
by ArndW
There is always going to be some limit on the number of stages a job can support - but it is far, far higher than the stage count of any acceptable job you will write.

Put yourself in the shoes of someone opening up your job and trying to understand it in order to make some changes. If you have 150 stages and lines going all over the place it is going to be quite difficult to understand. If the job has only 20 stages it is much easier to understand, plus it fits on one or two pages on the display canvas in the designer.

Many developers have a personal maximum number of stages; I prefer to have a limit on job complexity. If I can't grasp an overview from the designer canvas the job is too complex. Some jobs cannot be easily split across several jobs without paying a performance price, so those can (and should) remain as they are, but most jobs can be split - especially if the output of one is a named pipe that is used as the input to another.
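The named-pipe approach mentioned above can be sketched in plain shell. This is a hedged illustration only: the pipe path and row data are invented, and in a real split a DataStage job would read and write the pipe through a Sequential File stage rather than shell redirection. The point is that the "writer" and "reader" jobs run concurrently without landing intermediate data on disk:

```shell
# Hypothetical sketch: job A feeds job B through a named pipe.
PIPE=/tmp/stage_link.fifo
rm -f "$PIPE"
mkfifo "$PIPE"

# "Job A": produce rows, writing to the pipe in the background.
# Opening the pipe for writing blocks until a reader opens it.
( printf 'row1\nrow2\nrow3\n' > "$PIPE" ) &

# "Job B": consume rows, reading the pipe as if it were an ordinary file.
ROWS=$(wc -l < "$PIPE" | tr -d ' ')
wait

echo "loaded $ROWS rows"
rm -f "$PIPE"
```

Because the pipe holds no data itself, both halves must run at the same time - which is exactly why this split costs little performance compared with staging to a real file.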

Posted: Fri Dec 09, 2005 4:23 am
by loveojha2
Beyond that, you can use Local Containers, which will make the job more understandable (visually).

Posted: Fri Dec 09, 2005 4:44 am
by RayNother
I always keep jobs as small as possible and then put them all into a bigger sequence (or sequences).
I find it easier to fault-find when supporting jobs in testing/production this way...

IMHO you should always keep things simple.

Ray

Posted: Fri Dec 09, 2005 6:26 am
by WoMaWil
Perhaps we can offer a bottle of champagne as a prize for whoever provides a production job running with the maximum number of stages.

Wolfgang

Posted: Fri Dec 09, 2005 6:30 am
by ArndW
Wolfgang,

I know I've seen jobs with over 100 stages. Don't know if they ever ran, though...

Posted: Fri Dec 09, 2005 7:05 am
by WoMaWil
Arnd,

I had a job at e-plus in Dusseldorf with 115 stages to populate a dimension branch; it worked very well in production and took 5 minutes to finish.

Who can top that number?

Wolfgang

PS: To be sure, that was in the last century; now I am a bit more experienced, and my aim these days is to use a minimum of stages for each task.

Re: Maximum stages in a job

Posted: Fri Dec 09, 2005 7:12 am
by ravij
Hi

There is no limit on the number of stages you can use in a single job, but if you use many stages in a single job it will become complex and confusing. If you split the whole job into small jobs, it will be easier for you to handle exceptions and errors.

bye
JRK

Posted: Fri Dec 09, 2005 7:23 am
by koolnitz
Guys,

Recently I discussed this topic with one of the DS consultants. He advised the same thing that all of you are recommending.

In a nutshell, whenever it's possible to break up a complex job, go for it. At the same time, I fully agree with Arnd that if "divide and rule" is hampering performance, then it's worth having all the fruit on one tree.

Well, I personally prefer to have at most 20-22 stages in a job.

Cheers!!

Posted: Fri Dec 09, 2005 8:40 am
by kcbland
Forget about using DataStage for a moment.

In the world of writing computer programs, is it "best" to have a single 5000 line top-down program, or a collection of small modular routines, methods, and procedures that may reach 8000 lines of code?

Best - define it. Best design to maintain? Best design for performance? Best design for the next guy? Best design for time-to-develop? Best design for trouble-shooting?

My opinion: it's not a competition over who can architect an ETL application in the fewest jobs. It's a competition over whose architecture lasts for years without every job being constantly re-written on every enhancement or change.

Posted: Fri Dec 09, 2005 6:23 pm
by ray.wurlod
There is an excellent book out there called The Elements of Programming Style - although it concentrates on language-based coding there is a lot of good advice, most of which can be generalized to graphical programming. Modularity is one of the main principles espoused, primarily for ease of understanding, re-use and maintenance.

Posted: Sat Dec 10, 2005 2:37 pm
by clshore
I was on an EE assignment where the number of stages exceeded 1,000 in several jobs.
My firm was called in midway through the project, after the jobs were written, to help resolve 'some issues'.
After some modifications, and much tweaking of kernel, memory, and disk resources, the jobs did actually run and satisfy requirements.
Working with the jobs in Designer was challenging. They took a long time to load. Viewing the whole job, most of the stage icons were so small that they could not be discerned on the palette as anything but blobs. When panning or zooming, the refresh took so long that it was painful.
It's not the way I would do it, but it's what the client created, and wanted, and it meets their needs.

Carter

Posted: Sat Dec 10, 2005 5:35 pm
by chulett
O. M. G. :shock:

Posted: Sat Dec 10, 2005 9:40 pm
by aartlett
On my current assignment I'm here, on behalf of the client, to oversee the vendor that is/was doing the actual design/build work on a 7.1 SE system. I came up with some naming standards and some design standards before they started work.

I then told them that IMHO maintainability is more important than performance, unless performance causes a major blockage, and then we'll take that on a case-by-case basis.

The result: fairly efficient jobs that are easy to read (left to right, inputs on the top and left, outputs to the right and down), stage names that have a meaning, and thanks to the DS DOCO maker we have automatic documentation.

90% of the jobs passed QA the first time, 100% the second time, and most of the problems were a hash not being cached (part of the standards).

My answer to the OP: maintainability first. If you can't read it, the next guy in 6 months can't fix it. This goes for all languages, not just DS.

Posted: Sat Dec 10, 2005 10:22 pm
by kcbland
Some things to consider when opting for the "all-in-one" jobs for both PX and Server:

1. Large design size means importing/exporting the job takes longer and produces a larger dsx file for that single job.
2. Large designs mean that more logic is within a single job, and only one developer has write access to a job (even Hawk limits changes to one user, even though others have read-only access).
3. All-in-one designs are non-recoverable from waypoints in the logic. There are no points for resumption of processing in the event of failure.
4. All-in-one designs are often nearly incomprehensible, given that the graphical metaphor is supposed to mean someone knows at a glance what the job is doing.
5. All-in-one designs usually mean that, to troubleshoot, the job has to be "exploded" into smaller constituent jobs just to figure out where the data is going "bad" during processing.
6. Sometimes the job has to be "exploded" just so that a surgical enhancement can be made, and then reconstituted into the all-in-one form.
7. The all-in-one design sometimes prevents another job from running because a lookup (either hash or dataset) being built needs to be reused by the other job, so a dependency is imposed. The alternative is that the same logic exists in two all-in-one jobs, doubling resource consumption for that portion; even worse, two unrelated jobs become coupled because of that one common lookup.

The alternative architecture has its issues as well:

1. Small and modular jobs mean that processing activities are separate jobs, requiring more sophisticated usage of Sequencers or custom job control to manage executing jobs in a dependent and, hopefully, concurrent fashion.
2. More jobs means that a method to communicate the data between jobs has to be established: files or pipes.
3. Smaller jobs mean that the design library has a significant increase in objects, so naming conventions and foldering become more important.
4. More jobs means that careful documentation is required to piece together the now broken-apart flow.
5. Data lineage becomes more difficult, as tracing a target column's resultant value back to its origination point requires traversing stages and jobs, not just stages.
6. More jobs means managing versions is more complicated, as the correct version of every job in a transformation jobstream (batch?) has to be in place.
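The "custom job control" in point 1 of the second list can be sketched, outside DataStage, as a small dependency-ordered runner. This is a hypothetical illustration only: the job names and the run_job stub are invented, and a real implementation would invoke each job through the engine's command-line interface rather than a print statement. The point is simply that once jobs are modular, something has to know the order:

```python
# Hypothetical sketch of custom job control: run modular jobs in
# dependency order, the work a Sequencer would otherwise do.
from graphlib import TopologicalSorter

# Each (invented) job lists the jobs whose output it consumes.
dependencies = {
    "extract_orders": [],
    "extract_customers": [],
    "transform_orders": ["extract_orders", "extract_customers"],
    "load_warehouse": ["transform_orders"],
}

def run_job(name):
    # Stand-in for launching a real job; returns True on success.
    print(f"running {name}")
    return True

# static_order() yields each job only after all of its predecessors.
order = list(TopologicalSorter(dependencies).static_order())
for job in order:
    if not run_job(job):
        raise RuntimeError(f"{job} failed; downstream jobs not started")
```

A fuller version would use TopologicalSorter's prepare()/get_ready() protocol to launch independent jobs concurrently, which is exactly the "dependent and, hopefully, concurrent fashion" the list describes.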