General design principle for DataStage
Moderators: chulett, rschirm, roy
In general (don't you hate it when someone begins like that?) ... is it better to include as much logic as possible into one big job ... or ... to break up the logic into as many simple jobs as possible?
We have an ETL application with over 20 simple jobs, each not much more than one extract stage, one or two data manipulation stages and one load stage. These are sequenced together, and the sequence is run by dsjob, which in turn is run by AutoSys: a very modular design borrowed from other software development paradigms.
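For readers unfamiliar with that scheduling arrangement, here is a minimal sketch of how an external scheduler such as AutoSys might invoke a DataStage sequence through the dsjob CLI. The project and sequence names and the wrapper functions are hypothetical; `-run`, `-jobstatus` and `-param` are standard dsjob options.

```python
# Sketch: how a scheduler job could shell out to dsjob to run a sequence.
# Project/sequence names below are illustrative, not from the original post.
import subprocess

def build_dsjob_command(project, job, params=None):
    """Build the dsjob command line that runs a job and waits for its status."""
    cmd = ["dsjob", "-run", "-jobstatus"]
    for name, value in (params or {}).items():
        cmd += ["-param", f"{name}={value}"]
    cmd += [project, job]
    return cmd

def run_sequence(project, sequence, params=None):
    """Run the sequence and surface a non-zero exit status to the scheduler."""
    result = subprocess.run(build_dsjob_command(project, sequence, params))
    return result.returncode

# Example of the command AutoSys would effectively execute:
# build_dsjob_command("DWH_PROJ", "seqLoadWarehouse", {"RunDate": "2012-01-31"})
```

The point of the wrapper is that the scheduler only sees one exit code per sequence, while the per-job modularity stays inside DataStage.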
The reason I ask is that a veteran DSer took a look at this and asked "why so many jobs?" He said "make as few connections to databases as possible and do as much as you can on the DS server."
OK ... what do you think? And "It depends" is not the right answer ... just joking
Everybody says IT people give that answer all the time!
"The price of freedom is eternal vigilance."
-- Thomas Jefferson
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
It's always a compromise. The advice to minimize connections to databases is sound, but some of the work may involve no database connections and so could be a candidate for a modular design. Environmental factors, such as limited time windows for accessing source and target systems, may also contribute to the design decisions. You can still do the "T" part of the ETL even though you have to do the "E" and "L" parts at different times - here again a modular design is indicated. As noted, it depends.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Premium Member
- Posts: 120
- Joined: Thu Oct 28, 2004 4:24 pm
Well, my 2 cents would go like this.
I have been building DataStage jobs since version 3; I am currently using 8.5 32-bit and 8.5 64-bit.
That said, simpler is better. When it comes time to determine why a job takes so long and it has 50 stages in it, you will understand why.
Second, jobs are not only parallel; pipelining is also used. This means that when a job starts, all stages start so that a continuous flow is available, with no waiting on stages.
Depending on the configuration you are using (1 node or n nodes), OSH processes are established (a lot of them) for a job that contains many stages. Memory and CPU come into play.
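The pipeline behaviour described above can be illustrated with a toy example (plain Python threads and queues, not DataStage itself): every "stage" starts at once and rows flow downstream as they are produced, so no stage waits for an upstream stage to finish its whole input first.

```python
# Toy pipeline parallelism: one thread per stage, queues between stages.
# This only illustrates the concept; DataStage implements it with OSH processes.
import threading, queue

SENTINEL = None  # marks end of the row stream

def stage(fn, inq, outq):
    """Apply fn to each row as it arrives; forward the end-of-stream marker."""
    while True:
        row = inq.get()
        if row is SENTINEL:
            outq.put(SENTINEL)
            break
        outq.put(fn(row))

def run_pipeline(rows, fns):
    qs = [queue.Queue() for _ in range(len(fns) + 1)]
    threads = [threading.Thread(target=stage, args=(fn, qs[i], qs[i + 1]))
               for i, fn in enumerate(fns)]
    for t in threads:        # all stages start together
        t.start()
    for row in rows:         # rows stream in while stages are already running
        qs[0].put(row)
    qs[0].put(SENTINEL)
    out = []
    while (row := qs[-1].get()) is not SENTINEL:
        out.append(row)
    for t in threads:
        t.join()
    return out

# run_pipeline([1, 2, 3], [lambda x: x * 10, lambda x: x + 1])
```

Note how the memory/CPU observation follows directly: every stage in the graph holds a live worker for the duration of the run, so a 50-stage job multiplies that footprint.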
I guess what I am saying is you need to look at the big picture. Today it is 8 jobs and a sequence; tomorrow it is 8,000 jobs. It's too late then.
I just started migrating a data warehouse that was coded on an AS/400 to DataStage. I have 50 dimensions and 6 facts so far, and 2,000 jobs; 60 of those jobs run 17 instances apiece at the same time. And we have just started; we will have a lot more facts before it's over. That's not including all the other jobs and projects that I will be migrating to 8.5. Like I said, you need to look at the big picture and decide from there.
"Don't let the bull between you and the fence"
Thanks
Gregg J Knight
"Never Never Never Quit"
Winston Churchill
In my current engagement, I am using fewer than 12 jobs, but they are generic, dynamic and multi-instance. (This was the customer's requirement, as well as an interesting technical challenge.) For example, SQL statements are generated "on the fly" from information in the system tables. Many things are parameterised.
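The "generic job" idea can be sketched very simply: instead of hard-coding one extract job per table, generate the extract SQL at run time from column metadata. The catalog rows below are hypothetical stand-ins; in practice the same information would come from the database's system tables and be passed to the job as parameters.

```python
# Sketch: build an extract statement from (table, column) catalog metadata.
# The catalog content here is invented for illustration.
def build_select(table, catalog_rows):
    """catalog_rows: list of (table_name, column_name) tuples from a system catalog."""
    cols = [col for tab, col in catalog_rows if tab == table]
    if not cols:
        raise ValueError(f"no columns found for table {table}")
    return f"SELECT {', '.join(cols)} FROM {table}"

catalog = [("CUSTOMER", "CUST_ID"), ("CUSTOMER", "NAME"), ("ORDERS", "ORDER_ID")]
sql = build_select("CUSTOMER", catalog)
# sql == "SELECT CUST_ID, NAME FROM CUSTOMER"
```

One generic, parameterised job plus metadata like this replaces a family of near-identical jobs, which is how a dozen jobs can cover an entire warehouse load.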
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
I think there is a trade-off between performance and keeping it simple. The fewer stages in a job, the easier it is to follow and therefore modify; the more times you land the data, the longer it takes to run from start to finish. There are always exceptions to the rule, and business requirements that force you to do things you might prefer not to. A lot of source systems are stretched to the limit, so you have to extract the data as quickly as possible, land it, and then offload as much processing as possible. This changes your design.
I have seen that having a lot of jobs can be just as hard to follow as one big job, so 8,000 jobs with 3 stages each may be worse than 800 jobs with 30 stages on average. I find that consistency is more important than either of these. I have seen 8,000 jobs work just as smoothly as 800 jobs, especially if the jobs are very similar and the naming conventions are good.
Quality comes in many shapes and sizes. Try to be open to new ideas. Sometimes you might get surprised.
Mamu Kim
-
- Participant
- Posts: 135
- Joined: Tue Aug 14, 2007 4:27 am
- Location: Mumbai
You can decide how to break up your jobs based on the factors below:
Complexity - at present you have simpler jobs, which is good. Too simple, though, is not a good choice.
Restartability - in case of an issue or error, from which point can you resume processing? (Can you restart the job without manual changes?)
Processing time - if, after combining a few jobs, you are not getting enough of a performance benefit, is there a point in combining them? If you have a few complex jobs, this will be a tricky thing to achieve.
So decide based on the nature of your jobs.
Hope it helps.
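The restartability factor above is worth a concrete sketch: record each completed job in a checkpoint file so a rerun resumes after the last success, with no manual changes. The job names and runner below are hypothetical; DataStage sequences offer built-in checkpointing that serves the same purpose.

```python
# Sketch: resume a job list from the last success using a checkpoint file.
# Job names and the run_job callable are illustrative, not DataStage APIs.
import os

def run_with_checkpoint(jobs, run_job, checkpoint=".checkpoint"):
    done = set()
    if os.path.exists(checkpoint):
        with open(checkpoint) as f:
            done = set(f.read().split())
    for job in jobs:
        if job in done:
            continue           # already succeeded on a previous run
        run_job(job)           # raises on failure, leaving the checkpoint intact
        with open(checkpoint, "a") as f:
            f.write(job + "\n")
    os.remove(checkpoint)      # full success: the next run starts fresh

# run_with_checkpoint(["extract", "transform", "load"], my_runner)
```

This is also an argument for the modular design: the finer-grained the jobs, the less work a restart has to repeat.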
Thanks
Swapnil
"Whenever you find whole world against you just turn around and Lead the world"
The one concept not mentioned yet is reusability. I have a small app (that will quadruple in size by the time we're done) that has the identical initial input stage for every job. It's parameterized according to the source data configuration (z/OS mainframe datasets generated by COBOL modules). After working out some design issues with the first few jobs, I've been reusing that basic stage and will continue to do so for the rest of the project.
That's a small-scale example. I would assume - with some confidence - that a large application expected to have hundreds of jobs will have several opportunities along those lines.
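The reusable-input-stage idea can be sketched as one parameterised reader definition configured per source. The field names and fixed-width layout below are hypothetical stand-ins for the COBOL dataset layouts mentioned above.

```python
# Sketch: one reader factory, reused across jobs with a different layout each time.
# Layouts and field names are invented for illustration.
def make_reader(layout):
    """layout: list of (field_name, width) describing a fixed-width record."""
    def read_record(line):
        record, pos = {}, 0
        for name, width in layout:
            record[name] = line[pos:pos + width].strip()
            pos += width
        return record
    return read_record

# The same factory, parameterised differently per source:
cust_reader = make_reader([("cust_id", 6), ("name", 20)])
row = cust_reader("000042" + "Jane Doe".ljust(20))
# row == {"cust_id": "000042", "name": "Jane Doe"}
```

The design choice mirrors the post: get the shared stage right once, then vary only its parameters per job.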
Franklin Evans
"Shared pain is lessened, shared joy increased. Thus do we refute entropy." -- Spider Robinson
Using mainframe data FAQ: viewtopic.php?t=143596 Using CFF FAQ: viewtopic.php?t=157872
ray.wurlod wrote: In my current engagement, I am using fewer than 12 jobs, but they are generic, dynamic and multi-instance.
I have just started a project (early design yet) that has the same goal in mind. It is a re-write of the existing DWH using DataStage instead of the bunch of varying technologies currently used for the ETL.
Ray,
I was wondering whether you would be willing to share more thoughts on the approach you have implemented. I'm particularly interested in the issues that posed the major challenges for you (if there were any).
I was also wondering about the nature and size of this project, although I understand if that information is too sensitive to share.
In addition, was any estimation done of how much development effort is saved using such an approach as opposed to the conventional (thousands-of-jobs) one?