optimum stages per job?

evans036
Premium Member
Posts: 72
Joined: Tue Jan 31, 2006 11:13 pm

optimum stages per job?

Post by evans036 »

Being new to DS, we had a bunch of very expensive IBM consultants recommend how to structure our jobs. In general we found them very helpful.

One area in which they were quite adamant was that we should NOT land the data (notwithstanding restart considerations), for performance reasons.

This dictated very large jobs (north of 100 stages), which they (the consultants) felt was appropriate.

I have since decided large jobs are unwieldy, take forever to compile (3 minutes) and are impossible to debug. I have changed our best practices (going forward) to limit the number of stages per job (i.e. generally a job should have no more than 20 stages).

Do you folks generally code up such large jobs?

Any pros/cons that I have not mentioned above?

thanks in advance,
steve
Kirtikumar
Participant
Posts: 437
Joined: Fri Oct 15, 2004 6:13 am
Location: Pune, India

Post by Kirtikumar »

A few of the problems with large jobs:
1. Difficult to debug.
2. When a job with, say, 50 stages is run, resource constraints may occur. In PX each link is a virtual dataset whose buffer takes some MB of memory; if that memory is not available, the job may fail.
3. If you use, say, a 2-node configuration file and have 10 CPUs, a 50-stage job can create a very large number of processes (see the rough arithmetic below). Beyond some threshold, most of the time is spent doing inter-process communication (IPC) and managing the processes rather than doing the actual work.

These are some of the reasons to avoid jobs with a large number of stages.
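As a rough illustration only (the numbers are indicative, not measured): each operator normally runs one player process per partition (operator combination can reduce this), so a 50-stage job on a 2-node configuration can spawn on the order of 50 x 2 = 100 player processes, plus a section leader per node and the conductor. On a 10-CPU box that is roughly 10 processes per CPU, and a growing share of machine time goes into scheduling those processes and moving rows between them instead of transforming data.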
Regards,
S. Kirtikumar.
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

If the whole point of not splitting the job is to avoid landing the data in between, then you can very well use one of the highlights of PX, the persistent dataset, which avoids I/O to a very great extent when compared with a Sequential File.
You cannot have hard-and-fast rules for the number of stages, but it is always better to have many well-designed jobs than a single bulkier one.
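To make that concrete (the file names here are made up): the first job writes its output link to a Data Set stage pointing at, say, /etl/land/cust_extract.ds, and the next job reads the same .ds as its source. The .ds file itself is only a small descriptor; the actual data files sit in the resource disk directories named in your configuration file, still partitioned and in the engine's internal format, so the downstream job avoids the export to text, re-parsing and repartitioning you pay for with a Sequential File stage.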
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Please go back to your very expensive consultants and ask about restart recoverability.

Landing data can be part of a good design, for example if you have different time windows available for extraction and for loading. Sure you sacrifice some throughput time when compared to not landing data, but you gain the ability to meet your time windows.

And, if the target server barfs part-way through, you can restart the load from a known point. Without a staging area you have to restart from scratch.

Ask them.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
evans036
Premium Member
Posts: 72
Joined: Tue Jan 31, 2006 11:13 pm

Post by evans036 »

We always make sure we land the data before loading, and do the loading in separate jobs.

BTW, those consultant types jumped ship a month or two ago.

thanks,

steve
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

There's a difference between theory and practice. When attempting to set a land-speed record, do you see them driving the car in any manner other than a straight line? Do you see that the engine is tuned for a high-performance, full-power run with no clutch or automatic transmission? Do you see that the car is designed without comfort in mind: no airbags, no collision-resistance devices like bumpers, no paint, leather, or even a stereo? Do you notice the engine isn't designed for changing the oil or occasionally replacing parts, and there's no color-coded fuse box? But it sure does go fast in a straight line.

You can't take that design, those driving practices, and those engineering standards and apply them to driving on the highway. In the sterile environment of a lab, a straight traversal between two database systems across a network will only achieve maximum speed if you don't land the data (which includes any temp files, so you have to have obscene amounts of memory to avoid any swapping or temp files).

But in the real world, we deal with restartability, audit trail, debug-ability, modularity, practicality, version control, etc. So we build small modular jobs that fit into general categories (source data acquisition, lookup building, workfile building, transformation, extraction from staging, and ultimately loading). Those drive us towards templates, repeatable designs, and smaller jobs. Landing between milestone activities allows restart and audit points.

I've never seen an environment with so much memory that swapping never occurs. Ask yourself a question: why does the config file require scratch pools and disk nodes if there's no landing to disk? Could it be that it really is occurring behind the scenes, out of your control? Wouldn't you rather control the "scratching" so that there's some re-usability to that data anyway?
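For reference, this is roughly what a minimal two-node configuration file looks like (the host name and paths are just placeholders). The "resource scratchdisk" entries are exactly where the engine spills buffer and sort data when memory runs out, whether or not your job design ever lands anything on purpose:

{
    node "node1"
    {
        fastname "etl_host"
        pools ""
        resource disk "/data/ds/datasets" {pools ""}
        resource scratchdisk "/data/ds/scratch" {pools ""}
    }
    node "node2"
    {
        fastname "etl_host"
        pools ""
        resource disk "/data/ds/datasets" {pools ""}
        resource scratchdisk "/data/ds/scratch" {pools ""}
    }
}

So the landing happens either way; the only question is whether you control it.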
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
vijayrc
Participant
Posts: 197
Joined: Sun Apr 02, 2006 10:31 am
Location: NJ

Re: optimum stages per job?

Post by vijayrc »

evans036 wrote: Being new to DS, we had a bunch of very expensive IBM consultants recommend how to structure our jobs... Do you folks generally code up such large jobs? Any pros/cons that I have not mentioned above?
Same out here... DS is new in our company, and we had 'very expensive' IBM consultants in for a week or two. They recommended landing data where it wasn't really wanted and, at the same time, said to treat 50+ stages as the guideline and not to go beyond it. Smaller jobs are best for debugging and keeping functionality isolated, but at the cost of start-up time for each job. Larger jobs are best for performance, since data isn't landed as often, but at the cost of debuggability and clarity of function. So I would suggest moderately sized jobs, each encapsulating a particular piece of functionality. Just my 2 cents. =Vijay
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Incremental design.
Thorough testing.

Buy LOTS of hardware from your friendly IBM rep.

One of the above is not as serious a comment as the other two.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.