optimum stages per job?

evans036
Premium Member
Posts: 72
Joined: Tue Jan 31, 2006 11:13 pm

optimum stages per job?

Post by evans036 »

Being new to DS, we had a bunch of very expensive IBM consultants recommend how to structure our jobs. In general we found them very helpful.

One area in which they were quite adamant was that we should NOT land the data (notwithstanding restart considerations), for performance reasons.

This dictated very large jobs (north of 100 stages), which they (the consultants) felt was appropriate.

I have since decided large jobs are unwieldy, take forever to compile (3 minutes) and are impossible to debug. I have changed our best practices (going forward) to limit the number of stages per job (i.e. generally a job should have no more than 20 stages).

Do you folks generally code up such large jobs?

Any pros/cons that I have not mentioned above?

thanks in advance,
steve
Kirtikumar
Participant
Posts: 437
Joined: Fri Oct 15, 2004 6:13 am
Location: Pune, India

Post by Kirtikumar »

A few of the problems with large jobs:
1. Difficult to debug.
2. When a job with, say, 50 stages is run, resource constraints may occur. In PX each link is a virtual dataset whose buffer takes some MB of memory; if that memory is not available, the job may fail.
3. If you use, say, a 2-node configuration file and have 10 CPUs, a 50-stage job can create a very large number of processes (see the rough arithmetic below). Beyond some threshold, most of the time is spent doing inter-process communication (IPC) and managing the processes rather than doing the actual work.

These are some of the reasons to avoid jobs with a large number of stages.
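As a rough illustration only (the numbers are indicative, not measured): each operator normally runs one player process per partition (operator combination can reduce this), so a 50-stage job on a 2-node configuration can spawn on the order of 50 x 2 = 100 player processes, plus a section leader per node and the conductor. On a 10-CPU box that is roughly 10 processes per CPU, and a growing share of machine time goes into scheduling those processes and moving rows between them instead of transforming data.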
Regards,
S. Kirtikumar.
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

If the whole point of not splitting the job is to avoid landing the data in between, then you can very well use one of the highlights of PX, the persistent dataset, which avoids I/O to a very great extent when compared with a Sequential File.
You cannot have hard-and-fast rules for the number of stages, but it is always better to have many well-designed jobs than a single bulkier one.
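To make that concrete (the file names here are made up): the first job writes its output link to a Data Set stage pointing at, say, /etl/land/cust_extract.ds, and the next job reads the same .ds as its source. The .ds file itself is only a small descriptor; the actual data files sit in the resource disk directories named in your configuration file, still partitioned and in the engine's internal format, so the downstream job avoids the export to text, re-parsing and repartitioning you pay for with a Sequential File stage.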
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Please go back to your very expensive consultants and ask about restart recoverability.

Landing data can be part of a good design, for example if you have different time windows available for extraction and for loading. Sure you sacrifice some throughput time when compared to not landing data, but you gain the ability to meet your time windows.

And, if the target server barfs part-way through, you can restart the load from a known point. Without a staging area you have to restart from scratch.

Ask them.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
evans036
Premium Member
Posts: 72
Joined: Tue Jan 31, 2006 11:13 pm

Post by evans036 »

We always make sure we land the data before loading, and do the loading in separate jobs.

BTW, those consultant types jumped ship a month or two ago.

thanks,

steve
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

There's a difference between theory and practice. When attempting to set a land-speed record, do you see them driving the car in any manner other than a straight line? Do you see that the engine is tuned for a high-performance, full-power run with no clutch or automatic transmission? Do you see that the car is designed without comfort in mind: no airbags, no collision-resistance devices like bumpers, no paint, leather, or even a stereo? Do you notice the engine isn't designed for changing the oil or occasionally replacing parts, and there's no color-coded fuse box? But it sure does go fast in a straight line.

You can't take that design, those driving practices, and those engineering standards and apply them to driving on the highway. In the sterile environment of a lab, a straight traversal between two database systems across a network will only achieve maximum speed if you don't land the data (which includes any temp files, so you have to have obscene amounts of memory to avoid any swapping or temp files).

But in the real world, we deal with restartability, audit trail, debug-ability, modularity, practicality, version control, etc. So we build small modular jobs that fit into general categories (source data acquisition, lookup building, workfile building, transformation, extraction from staging, and ultimately loading). Those drive us towards templates, repeatable designs, and smaller jobs. Landing between milestone activities allows restart and audit points.

I've never seen an environment with so much memory that swapping never occurs. Ask yourself a question: why does the config file require scratch pools and disk nodes if there's no landing to disk? Could it be that it really is occurring behind the scenes, out of your control? Wouldn't you rather control the "scratching" so that there's some re-usability to that data anyway?
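For reference, this is roughly what a minimal two-node configuration file looks like (the host name and paths are just placeholders). The "resource scratchdisk" entries are exactly where the engine spills buffer and sort data when memory runs out, whether or not your job design ever lands anything on purpose:

{
    node "node1"
    {
        fastname "etl_host"
        pools ""
        resource disk "/data/ds/datasets" {pools ""}
        resource scratchdisk "/data/ds/scratch" {pools ""}
    }
    node "node2"
    {
        fastname "etl_host"
        pools ""
        resource disk "/data/ds/datasets" {pools ""}
        resource scratchdisk "/data/ds/scratch" {pools ""}
    }
}

So the landing happens either way; the only question is whether you control it.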
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
vijayrc
Participant
Posts: 197
Joined: Sun Apr 02, 2006 10:31 am
Location: NJ

Re: optimum stages per job?

Post by vijayrc »

evans036 wrote: Being new to DS, we had a bunch of very expensive IBM consultants recommend how to structure our jobs... Do you folks generally code up such large jobs? Any pros/cons that I have not mentioned above?
Same out here... DS is new in our company, and we had 'very expensive' IBM consultants in for a week or two. They recommended landing data where it wasn't really wanted and, at the same time, said to treat 50+ stages as the guideline and not to go beyond it. Smaller jobs are best for debugging and keeping functionality isolated, but at the cost of start-up time for each job. Larger jobs are best for performance, since data isn't landed as often, but at the cost of debuggability and clarity of function. So I would suggest moderately sized jobs, each encapsulating a particular piece of functionality. Just my 2 cents. =Vijay
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Incremental design.
Thorough testing.

Buy LOTS of hardware from your friendly IBM rep.

One of the above is not as serious a comment as the other two.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.