Processor Sizing, Version Control & MS Explorer

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

dickfong
Participant
Posts: 68
Joined: Tue Apr 15, 2003 9:20 am

Processor Sizing, Version Control & MS Explorer

Post by dickfong »

We are in the preparation stage of a new project and have three questions about DataStage.
1. How should the number of processors be sized? Is there any formula for the calculation?
2. Can anyone share good or bad experiences with the version control that comes with DataStage, versus using other version control software such as ClearCase?
3. One observation of ours is that MetaStage Explorer consumed all the memory we have and still was not able to run smoothly, even for the import. We have increased memory from 256MB to 512MB but the improvement isn't significant. Is there any formula for estimating the memory MetaStage Explorer needs?

Any sharing of experience and advice would be appreciated.

Thanks in advance

Dick
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

MetaStage is, by virtue of what it does, extremely memory hungry.
Even when the company was Ardent, field technicians were advocating not less than 512MB for DataStage alone, and not less than 1GB for DataStage and MetaStage. It's still the case with MetaStage that more is better.
The new version, MetaStage 6.0, seems to be slightly less memory hungry (but maybe I maxed out on CPU first).
Ascential is putting a lot of work - so they tell me - into reducing the footprint of MetaStage, while at the same time enhancing it, particularly its reporting capabilities.
Watch this space.


Ray Wurlod
Education and Consulting Services
ABN 57 092 448 518
ariear
Participant
Posts: 237
Joined: Thu Dec 26, 2002 2:19 pm

Post by ariear »

Hi,

If your DataStage server is W2K, a good idea is to install the MetaStage client there as well (usually it will be a stronger machine) and do all your imports over there using remote desktop software such as Terminal Services - that is, if you do the imports manually. If they are automated, just schedule them over there.

Ariear
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

On the issue of Version Control, this has been discussed several times so you should be able to search the forum and get more information.

In a nutshell, I think it's a great addition to the toolset and makes managing versions of jobs/routines/scripts/etc. very easy. Plus, since it is an integrated product, it knows how to do things like compile jobs and set read-only attributes when promoting objects. And it beats the heck out of the Packaging Wizard!

-craig
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne

Post by vmcburney »

I agree that version control is very good for moving components from development into other environments such as testing and production. They have improved the handling of scripts and text files so you can deliver these along with the jobs and routines.

It is not a tool for managing multiple developers within a development environment. It lacks the check in / check out and version comparison features you would expect in a source control tool. This isn't a big deal if your team understands which parts of the project they are working on.

Vincent McBurney
Data Integration Services
www.intramatix.com
srinivasb
Participant
Posts: 42
Joined: Mon Mar 10, 2003 10:52 pm
Location: UK

Post by srinivasb »

Hi,

On the version control utility of DataStage, the concept is simple: only the design-time jobs are editable, while testing and production jobs are read-only.

Thus there will be three projects - design, test and production.
Once you pass the jobs through Version Control, they become non-editable. Version Control also assigns a specific, self-generated version number.

As Vincent puts it, while managing a 14-member development team we have found that a tool like Visual SourceSafe (VSS) is much better suited.

Regards
Srinivas.B

Srinivas.B
India
Phone:0091-44-28585690
Xtn 5067
dickfong
Participant
Posts: 68
Joined: Tue Apr 15, 2003 9:20 am

Post by dickfong »

Thanks to all of you - your input is very useful to me. Thank you [:)]

For Version Control, I had better check the archives of this forum. Just one more question: is there any integration, or are there any suggestions, for making DataStage/Version Control work with ClearCase or VSS?

For MetaStage, is it possible to 'calculate' the memory/processors needed to make the import and analysis run smoothly? Or is 1GB just a 'minimum requirement'?

For DataStage sizing, we are still agonizing over how many CPUs we should put in a box for optimal performance of our ETL process. Any suggestions? By the way, is it normal practice to put DataStage on a separate machine dedicated to ETL? Is it recommended to have DataStage share a box with other software such as the DBMS or the OLAP server?

Best Regards,
Dick Fong
kjanes
Participant
Posts: 144
Joined: Wed Nov 06, 2002 2:16 pm

Post by kjanes »

There are a lot of factors that may shape your final decisions.

For MetaStage, a primary concern is the size of the DataStage project and the complexity of the jobs, which dictate the import and analysis time. Yes, you can throw 1GB of memory and a 2GHz processor at the client, but then you have database considerations as well. How busy is your database server? There is no simple answer to this because there is a point of diminishing returns from a hardware perspective, due to the design of the product. For a MetaStage administrator I would recommend at least 1GB of memory, and for users at least 512MB RAM. The CPU should probably be at least 1GHz, which is already behind the times.

DataStage sizing has similar criteria. First and foremost is the window you have to complete your processes in, and how much hardware it is going to take to meet that window. We are running an instance of Sybase on the same server as DataStage (AIX) that provides the bulk of the data for our ETL. We are also going to run some OLAP operations during off-hours on the same box. It is an 8-CPU AIX box with 6GB RAM. We run a weekly batch process of about 400 jobs in less than 24 hours. Our weekly process keeps growing and is fairly complex, but DataStage continues to handle the growth and keep us in our window. DataStage can consume any and all resources (all 8 CPUs and all the memory) unless you load balance. We kick off probably 40 jobs at once and DataStage consumes all resources to get its work done.

If we had other applications in conflict on this box, we would not be able to finish within our window. We have limited the number of applications that can interfere with our ETL. Sybase works as part of the ETL so it is not an issue.

Kevin Janes
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne

Post by vmcburney »

I've often seen a staging database running on the same server as DataStage. This has the advantage of reducing network overheads when accessing that database. I would advise against putting any production databases on your DataStage server, as this would impact user response times. If you put OLAP services on your database server, will this impact OLAP users? If you only intend to run DataStage at night, what happens when it falls over one night and you need to run it all day?

The number of CPUs depends on how much data you are moving and how small the processing window is. Since you are running so much on that server, you should probably push for 8 processors to make sure you don't run into any problems. With parallel jobs you should find a way to use all of those processors during a load.

For a development environment you can probably throw everything onto the same box: DataStage, databases and OLAP. Ideally you want test and development environments that match the specs of your production environment, but not everyone has that much money.

Vincent McBurney
Data Integration Services
www.intramatix.com
dickfong
Participant
Posts: 68
Joined: Tue Apr 15, 2003 9:20 am

Post by dickfong »

Thanks for the valuable information.

Currently we are processing 8GB of source data with about 1,000 jobs, and the load window is around 5-6 hours. Our DataStage sits on an AIX box.

Given those figures, or the figures that you've mentioned, how do you calculate that 8 CPUs or so are needed? Or do you judge it by experience?

Regards,
Dick Fong
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Another possibility, being implemented on my current project, is to have Parallel Extender control the CPUs that DataStage can use, and to have a separate set of CPUs allocated to Oracle tasks. (Oracle is a staging area from which other DataStage jobs load SAP BW. However, as is the nature of such things, some users want to query the Oracle tables.)
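For what it's worth, the mechanism is the parallel configuration file pointed to by the APT_CONFIG_FILE environment variable: the number of logical nodes you define there effectively caps how many CPUs Parallel Extender jobs will drive. Below is a minimal two-node sketch; the host name and resource paths are invented for illustration, so substitute your own, and note that keeping Oracle on its own set of CPUs is done at the operating system level (AIX workload management, for example), not in this file.

[code]
{
  node "node1"
  {
    fastname "etl_host"
    pools ""
    resource disk "/ds/data/node1" {pools ""}
    resource scratchdisk "/ds/scratch/node1" {pools ""}
  }
  node "node2"
  {
    fastname "etl_host"
    pools ""
    resource disk "/ds/data/node2" {pools ""}
    resource scratchdisk "/ds/scratch/node2" {pools ""}
  }
}
[/code]

Adding or removing node entries (or maintaining several configuration files and switching APT_CONFIG_FILE per job) is how you grow or shrink the CPU share given to the parallel jobs.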


Ray Wurlod
Education and Consulting Services
ABN 57 092 448 518
kjanes
Participant
Posts: 144
Joined: Wed Nov 06, 2002 2:16 pm

Post by kjanes »

We are in fact using Sybase as a staging area on the same box as DataStage. It is definitely not advisable to run a production database on the same box as DataStage.

Another note: DataStage will always have priority on this box, so the OLAP processes will only be run during off-hours, whenever that may be.

The question remains: how does one determine that 8 CPUs is enough? If your process is completing close to the end of your window, and you cannot afford for a job to go down and make you miss your window, then you may need to look at ways to tune performance, improve the process (use PE?) or throw more hardware at it. As time goes on, you should also consider how much the data source is going to grow and whether or not new jobs/processes will impact your window.

Since hardware is not cheap, we have looked to DataStage tuning and job redesign for long-running jobs. Job complexity, network bandwidth and database load methods all need to be considered as well. Some real-world benchmarking using "live" data will hopefully give you an idea of the timing. Keep in mind that "production" tends to run a certain degree slower because of network traffic, database contention, etc.
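To put some arithmetic behind that, here is a rough back-of-envelope sketch (in Python, just to show the calculation). Every number in it is a placeholder, not a measurement or a vendor formula - swap in the per-CPU throughput you observe in your own benchmark runs.

[code]
# Back-of-envelope CPU sizing. All figures below are placeholders -
# replace them with numbers taken from your own benchmark runs.
import math

source_volume_gb   = 8.0   # data processed per load (Dick's figure)
window_hours       = 5.0   # batch window the load must fit inside
gb_per_cpu_hour    = 0.5   # measured throughput of ONE CPU on YOUR jobs (assumed)
growth_factor      = 1.5   # allowance for data and job growth over the box's life
target_utilisation = 0.7   # headroom so one slow job doesn't blow the window

required_gb_per_hour = (source_volume_gb * growth_factor) / window_hours
cpus_needed = math.ceil(required_gb_per_hour / (gb_per_cpu_hour * target_utilisation))

print("Required throughput: %.1f GB/hour" % required_gb_per_hour)
print("Estimated CPUs needed: %d" % cpus_needed)
[/code]

The only input that really matters is the per-CPU throughput, and that only comes from running a representative slice of your own jobs against live data - which is why the benchmarking above is worth more than any vendor figure.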

Is the load window (5-6 hours) just the time to load the DB or does that include the extract/transform as well?


Kevin Janes
dickfong
Participant
Posts: 68
Joined: Tue Apr 15, 2003 9:20 am

Post by dickfong »

In our case, the 5-6 hours includes everything from processing through to loading the transformed data into the target database.

Then how does one decide how many processors a project should start with? Is it purely experience, or can it be calculated somehow, for example using benchmarks from the vendors?

Regards,
Dick Fong
kjanes
Participant
Posts: 144
Joined: Wed Nov 06, 2002 2:16 pm

Post by kjanes »

A few words from experience... vendor benchmarks are usually done in ideal environments, not real-world ones, so they are usually not entirely indicative of what you might experience. I am sure Ascential can offer you some guidance on hardware based on their experience and the calculations they have developed. The one thing I have encountered is that it is better to have more hardware (to allow for future growth) than just enough (which may top out quickly). Oversizing at a reasonable level allows for future growth without having to go back and justify more expenditure.

I do not have such formulas at my disposal. Throughput and ETL performance are probably best learned over time through experience, given the many considerations that can impact run time. I think we go through 100GB or so during our weekly process.

Kevin Janes