DataStage Server Sizing

Archive of postings to DataStageUsers@Oliver.com. This forum is intended only as a reference and cannot be posted to.

Moderators: chulett, rschirm

admin
Posts: 8720
Joined: Sun Jan 12, 2003 11:26 pm

DataStage Server Sizing

Post by admin »

I'll try again...
Please take 30 seconds to return this information;
it'll be very helpful to many people, I'm sure.

I'm trying to compile some information to
help in DataStage server sizing...

The information I'm looking for is what people are doing
with what kind of hardware. I'm looking for something
like:

4 x 400 MHz CPU NT, 512 MB memory, 300 rows/second
or
2 x 750 MHz CPU Sun, 1 GB memory, 100 gigabytes in 14 minutes

and, optionally, what the source & target are
(which database, flat file, bulk load, etc.)
and how you would characterize the job complexity
(low, medium, high)

and any other comments you think are appropriate.

If you forward me this information about how
you are using DataStage, I'd greatly appreciate
it, and I'll post a summary in a few days for
those who are interested.

thanks....




admin
Posts: 8720
Joined: Sun Jan 12, 2003 11:26 pm

Post by admin »

Okay bozofoot, the reason nobody responded is that
your question is like asking "In 30 seconds or less,
type the meaning of life."

Also, are you an Informatica plant trying to get a
cheap survey that INFA can run up the flagpole saying
"See, DataStage requires million dollar servers to get
the same throughput INFA gets off a PDA running
Oracle!"

So, to stretch my 30 seconds a little longer...

Sizing is a function of:
1. Data sources (sequential extracts, transaction
logs, OLTP databases)
2. Data source sizing
3. Transformation complexity
4. Data targets (sequential, OLAP databases, OLTP
databases, ODSs)
5. Funding
6. Technology standards

So, if a client is running AS/400 sources and
SQL Server 7.0 targets, then the preferred solution is
mostly NT based, because those tend to be the smaller
environments. More is better: high-speed quad-processor
NT servers with a gigabyte of memory are necessary for
effective parallel throughput, and a RAID array helps
you stay clear of disk bottlenecks. NT servers have
issues with disks, network cards, parallelism,
blue-screens-of-death, etc.

UNIX platforms allow larger data volumes and higher
throughput. For example, my current pharmaceutical
client is pushing 250 million rows of data through a
medium-complexity ETL application every five days. The
sources are sequential file feeds with an Oracle 8i
target. The hardware is a Sun E10K production domain
with 16 CPUs, 20 GB of RAM, 32 GB of swap, and ~5
terabytes of physical disk. Our throughput could be
higher, but I'm an advocate of modular and maintainable
applications.
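For a rough sense of scale, here is a back-of-envelope calculation of what that volume implies as a row rate. It assumes, purely for illustration, that the load runs around the clock; a shorter real batch window means a proportionally higher rate.

    # Rough sustained rate for 250 million rows every five days.
    rows = 250_000_000
    five_days = 5 * 24 * 60 * 60          # seconds in five days
    print(rows / five_days)               # ~579 rows/second sustained
    # If the work actually fits into, say, an 8-hour nightly window each day,
    # the rate during the window is correspondingly higher:
    nightly_windows = 5 * 8 * 60 * 60
    print(rows / nightly_windows)         # ~1,736 rows/second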

There is no simple model for sizing a dedicated ETL
server. Anyone can write brain-dead ETL jobs that do
not maximize the resources at hand. The single
largest determining factor is PARALLELISM: how many
concurrent processes are needed to achieve the required
throughput. If you are not using bulk loaders then you
are wasting a lot of time on loading. The secret to
throughput is bulk loading and PARALLELISM; sizing is
driven by PARALLELISM.
More CPUs = more parallel jobs = more throughput.
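A minimal sketch of that sizing arithmetic, using made-up per-stream figures (the real per-job rate depends entirely on the job design, the sources and targets, and whether you bulk load):

    import math

    def cpus_needed(total_rows, window_seconds, rows_per_sec_per_job):
        # Assumes one well-tuned parallel job stream saturates one CPU,
        # so CPUs needed is roughly the number of concurrent streams.
        required_rate = total_rows / window_seconds
        return math.ceil(required_rate / rows_per_sec_per_job)

    # Hypothetical figures: 100 million rows in an 8-hour window,
    # ~1,000 rows/second per stream against a bulk-loaded target.
    print(cpus_needed(100_000_000, 8 * 3600, 1_000))   # -> 4 streams / CPUs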

Best recommendation:
Get as many CPUs as you can afford. If sizing a UNIX
box, trade off as much other hardware as you can for
CPUs. DataStage is NOT a memory hog, but enough memory
and fast disks make life easier. Memory is needed to
boost read and write caching. But a properly tuned
job will consume an entire CPU!

We look forward to your posting of results on the INFA
website.

-Ken



--- ". ." wrote:
> Ill try again...
> Please take 30 seconds to return this information;
> Itll be very helpful to many people, Im sure.
>
> Im trying to compile some information to
> help in datastage server sizing...
>
> Information Im looking for is what people ar doing
> with what kind of hardware. Im looking for
> omething
> like:
>
> 4 400Mhz CPU NT, 512MB memory, 300 Rows/Second
> or
> 2 750 Mhz CPU Sun, 1G memory, 100 Gigabytes in 14
> minutes
>
> and, optionally, what the source & target are
> (which database, flat file, bulk load, etc.)
> and how you would characterize the job complexity
> (low, medium, high)
>
> and any other comments you think are appropriate.
>
> if you forward me this information about how
> you are using Datastage, Id greatly appreciate
> it, and Ill post a summation in a few days for
> those who are interested.
>
> thanks....
>
>
>
>
> __________________________________________________
> Do You Yahoo!?
> Listen to your Yahoo! Mail messages from any phone.
> http://phone.yahoo.com


__________________________________________________
Do You Yahoo!?
Listen to your Yahoo! Mail messages from any phone. http://phone.yahoo.com
admin
Posts: 8720
Joined: Sun Jan 12, 2003 11:26 pm

Post by admin »

I'd just like to add to Ken's answer that the network can be a (the?) bottleneck. Dedicated network segments with large bandwidth and isolated subnet addresses can make a world of difference (compared, for example, to participating on the office LAN).
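To put rough numbers on that point (idealized link speeds only; real throughput will be lower once protocol overhead and other traffic on the segment are counted):

    # Time to move 100 GB at different (idealized) link speeds.
    gigabytes = 100
    bits = gigabytes * 8 * 10**9                     # 1 GB taken as 10^9 bytes
    for name, mbps in [("shared 10 Mbit office LAN", 10),
                       ("100 Mbit segment", 100),
                       ("dedicated gigabit segment", 1000)]:
        hours = bits / (mbps * 10**6) / 3600
        print(f"{name}: ~{hours:.1f} hours")
    # -> roughly 22.2, 2.2 and 0.2 hours respectively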