General design principle for DataStage
Moderators: chulett, rschirm, roy
In general (don't you hate it when someone begins like that?) ... is it better to include as much logic as possible into one big job ... or ... to break up the logic into as many simple jobs as possible?
We have an ETL application with over 20 simple jobs, each not much more than one extract stage, one or two data manipulation stages and one load stage. These are sequenced together, and the sequence is run by dsjob, which in turn is run by AutoSys: a very modular design borrowed from other software development paradigms.
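For readers unfamiliar with that scheduling arrangement, here is a minimal sketch of how an external scheduler such as AutoSys might invoke a DataStage sequence through the dsjob CLI. The project and sequence names and the wrapper functions are hypothetical; `-run`, `-jobstatus` and `-param` are standard dsjob options.

```python
# Sketch: how a scheduler job could shell out to dsjob to run a sequence.
# Project/sequence names below are illustrative, not from the original post.
import subprocess

def build_dsjob_command(project, job, params=None):
    """Build the dsjob command line that runs a job and waits for its status."""
    cmd = ["dsjob", "-run", "-jobstatus"]
    for name, value in (params or {}).items():
        cmd += ["-param", f"{name}={value}"]
    cmd += [project, job]
    return cmd

def run_sequence(project, sequence, params=None):
    """Run the sequence and surface a non-zero exit status to the scheduler."""
    result = subprocess.run(build_dsjob_command(project, sequence, params))
    return result.returncode

# Example of the command AutoSys would effectively execute:
# build_dsjob_command("DWH_PROJ", "seqLoadWarehouse", {"RunDate": "2012-01-31"})
```

The point of the wrapper is that the scheduler only sees one exit code per sequence, while the per-job modularity stays inside DataStage.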
The reason I ask is that a veteran DSer took a look at this and asked "why so many jobs?" He said "make as few connections to databases as possible and do as much as you can on the DS server."
OK ... what do you think? And "It depends" is not the right answer ... just joking
Everybody says IT people give that answer all the time!
"The price of freedom is eternal vigilance."
-- Thomas Jefferson
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
It's always a compromise. The advice to minimize connections to databases is sound, but some of the work may involve no database connections and so could be a candidate for a modular design. Environmental factors, such as limited time windows for accessing source and target systems, may also contribute to the design decisions. You can still do the "T" part of the ETL even though you have to do the "E" and "L" parts at different times - here again a modular design is indicated. As noted, it depends.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Premium Member
- Posts: 120
- Joined: Thu Oct 28, 2004 4:24 pm
Well, my 2 cents would go like this.
I have been building DataStage jobs since version 3; I am currently using 8.5 32-bit and 8.5 64-bit.
That said, simpler is better. When it comes time to determine why a job takes so long and it has 50 stages in it, you will understand why.
Second, jobs are not only parallel; pipelining is also used. This means that when a job starts, all stages start so that a continuous flow is available, with no waiting on stages.
Depending on the configuration you are using (1 node or n nodes), OSH processes are established (a lot of them) for a job that contains many stages. Memory and CPU come into play.
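The pipeline behaviour described above can be illustrated with a toy example (plain Python threads and queues, not DataStage itself): every "stage" starts at once and rows flow downstream as they are produced, so no stage waits for an upstream stage to finish its whole input first.

```python
# Toy pipeline parallelism: one thread per stage, queues between stages.
# This only illustrates the concept; DataStage implements it with OSH processes.
import threading, queue

SENTINEL = None  # marks end of the row stream

def stage(fn, inq, outq):
    """Apply fn to each row as it arrives; forward the end-of-stream marker."""
    while True:
        row = inq.get()
        if row is SENTINEL:
            outq.put(SENTINEL)
            break
        outq.put(fn(row))

def run_pipeline(rows, fns):
    qs = [queue.Queue() for _ in range(len(fns) + 1)]
    threads = [threading.Thread(target=stage, args=(fn, qs[i], qs[i + 1]))
               for i, fn in enumerate(fns)]
    for t in threads:        # all stages start together
        t.start()
    for row in rows:         # rows stream in while stages are already running
        qs[0].put(row)
    qs[0].put(SENTINEL)
    out = []
    while (row := qs[-1].get()) is not SENTINEL:
        out.append(row)
    for t in threads:
        t.join()
    return out

# run_pipeline([1, 2, 3], [lambda x: x * 10, lambda x: x + 1])
```

Note how the memory/CPU observation follows directly: every stage in the graph holds a live worker for the duration of the run, so a 50-stage job multiplies that footprint.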
I guess what I am saying is you need to look at the big picture. Today it is 8 jobs and a sequence; tomorrow it is 8,000 jobs. It's too late then.
I just started migrating a data warehouse that was coded on an AS/400 to DataStage. I have 50 dimensions and 6 facts so far, and 2,000 jobs; 60 of those jobs run 17 instances apiece at the same time. And we have just started; we will have a lot more facts before it's over. That's not including all the other jobs and projects that I will be migrating to 8.5. Like I said, you need to look at the big picture and decide from there.
"Don't let the bull between you and the fence"
Thanks
Gregg J Knight
"Never Never Never Quit"
Winston Churchill
In my current engagement, I am using fewer than 12 jobs, but they are generic, dynamic and multi-instance. (This was the customer's requirement, as well as an interesting technical challenge.) For example, SQL statements are generated "on the fly" from information in the system tables. Many things are parameterised.
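The "generic job" idea can be sketched very simply: instead of hard-coding one extract job per table, generate the extract SQL at run time from column metadata. The catalog rows below are hypothetical stand-ins; in practice the same information would come from the database's system tables and be passed to the job as parameters.

```python
# Sketch: build an extract statement from (table, column) catalog metadata.
# The catalog content here is invented for illustration.
def build_select(table, catalog_rows):
    """catalog_rows: list of (table_name, column_name) tuples from a system catalog."""
    cols = [col for tab, col in catalog_rows if tab == table]
    if not cols:
        raise ValueError(f"no columns found for table {table}")
    return f"SELECT {', '.join(cols)} FROM {table}"

catalog = [("CUSTOMER", "CUST_ID"), ("CUSTOMER", "NAME"), ("ORDERS", "ORDER_ID")]
sql = build_select("CUSTOMER", catalog)
# sql == "SELECT CUST_ID, NAME FROM CUSTOMER"
```

One generic, parameterised job plus metadata like this replaces a family of near-identical jobs, which is how a dozen jobs can cover an entire warehouse load.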
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
I think there is a trade-off between performance and keeping it simple. The fewer stages in a job, the easier it is to follow and therefore modify; the more times you land the data, the longer it takes to run from start to finish. There are always exceptions to the rule, and business requirements that force you to do things you might prefer not to. A lot of source systems are stretched to the limit, so you have to extract the data as quickly as possible, land it, and then offload as much processing as possible. This changes your design.
I have seen that having a lot of jobs can be just as hard to follow as one big job, so 8,000 jobs with 3 stages each may be worse than 800 jobs with 30 stages on average. I find that consistency is more important than either of these. I have seen 8,000 jobs work just as smoothly as 800 jobs, especially if the jobs are very similar and the naming conventions are good.
Quality comes in many shapes and sizes. Try to be open to new ideas. Sometimes you might get surprised.
Mamu Kim
-
- Participant
- Posts: 135
- Joined: Tue Aug 14, 2007 4:27 am
- Location: Mumbai
You can decide how to break up your jobs based on the factors below:
Complexity - at present you have simpler jobs, which is good. Too simple, though, is not a good choice.
Restartability - in case of an issue or error, from which point can you resume processing? (Can you restart the job without manual changes?)
Processing time - if, after combining a few jobs, you are not getting enough of a performance benefit, is there a point in combining them? If you have a few complex jobs, this will be a tricky thing to achieve.
So decide based on the nature of your jobs.
Hope it helps.
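The restartability factor above is worth a concrete sketch: record each completed job in a checkpoint file so a rerun resumes after the last success, with no manual changes. The job names and runner below are hypothetical; DataStage sequences offer built-in checkpointing that serves the same purpose.

```python
# Sketch: resume a job list from the last success using a checkpoint file.
# Job names and the run_job callable are illustrative, not DataStage APIs.
import os

def run_with_checkpoint(jobs, run_job, checkpoint=".checkpoint"):
    done = set()
    if os.path.exists(checkpoint):
        with open(checkpoint) as f:
            done = set(f.read().split())
    for job in jobs:
        if job in done:
            continue           # already succeeded on a previous run
        run_job(job)           # raises on failure, leaving the checkpoint intact
        with open(checkpoint, "a") as f:
            f.write(job + "\n")
    os.remove(checkpoint)      # full success: the next run starts fresh

# run_with_checkpoint(["extract", "transform", "load"], my_runner)
```

This is also an argument for the modular design: the finer-grained the jobs, the less work a restart has to repeat.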
Thanks
Swapnil
"Whenever you find whole world against you just turn around and Lead the world"
The one concept not mentioned yet is reusability. I have a small app (that will quadruple in size by the time we're done) that has the identical initial input stage for every job. It's parameterized according to the source data configuration (z/OS mainframe datasets generated by COBOL modules). After working out some design issues with the first few jobs, I've been reusing that basic stage and will continue to do so for the rest of the project.
That's a small-scale example. I would assume - with some confidence - that a large application expected to have hundreds of jobs will have several opportunities along those lines.
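The reusable-input-stage idea can be sketched as one parameterised reader definition configured per source. The field names and fixed-width layout below are hypothetical stand-ins for the COBOL dataset layouts mentioned above.

```python
# Sketch: one reader factory, reused across jobs with a different layout each time.
# Layouts and field names are invented for illustration.
def make_reader(layout):
    """layout: list of (field_name, width) describing a fixed-width record."""
    def read_record(line):
        record, pos = {}, 0
        for name, width in layout:
            record[name] = line[pos:pos + width].strip()
            pos += width
        return record
    return read_record

# The same factory, parameterised differently per source:
cust_reader = make_reader([("cust_id", 6), ("name", 20)])
row = cust_reader("000042" + "Jane Doe".ljust(20))
# row == {"cust_id": "000042", "name": "Jane Doe"}
```

The design choice mirrors the post: get the shared stage right once, then vary only its parameters per job.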
Franklin Evans
"Shared pain is lessened, shared joy increased. Thus do we refute entropy." -- Spider Robinson
Using mainframe data FAQ: viewtopic.php?t=143596 Using CFF FAQ: viewtopic.php?t=157872
ray.wurlod wrote: In my current engagement, I am using fewer than 12 jobs, but they are generic, dynamic and multi-instance.
I have just started a project (early design yet) that has the same goal in mind. It is a re-write of the existing DWH using DataStage instead of the bunch of varying technologies currently used for the ETL.
Ray,
I was wondering whether you would be willing to share more thoughts on the approach you have implemented. I'm particularly interested in the issues that posed the major challenges for you (if there were any).
I was also wondering about the nature and size of this project, although I understand if that information is too sensitive to share.
In addition, was any estimation done of how much development effort is saved using such an approach as opposed to the conventional (thousands-of-jobs) one?