Infra Sizing for a new Datastage initiative

A forum for discussing DataStage® basics. If you're not sure where your question goes, start here.

Moderators: chulett, rschirm, roy

blitz76
Participant
Posts: 5
Joined: Fri Jul 11, 2008 12:31 pm

Infra Sizing for a new Datastage initiative

Post by blitz76 »

Hi Experts

Would appreciate your thoughts on this.

Background:
We are doing infrastructure sizing; this happens before the requirements phase.
The mandate is to find out whether the current instance of DataStage can handle the data and whether any changes are required to the hardware.
We use DataStage 8.5.
There are other DataStage projects already up and running.



Details known:
The daily load needs to process 200 million rows per day. The time window available is 6 hours.
Input format: .txt/.csv files.
The input files would come from 3 source systems. Each source system would split its data into multiple files.
Output format: .txt files.
Transformations: simple to medium, consisting of generating new columns and a couple of lookups.
It is estimated that, on average, the source files will have around 120 columns.

In effect,
There are no extracts from, or loads to, databases.
Simple-to-medium transformations only.
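A quick back-of-envelope sketch of the numbers above (the ~10 bytes per column is my own assumption, not from the spec):

```python
# Rough throughput/volume estimate for the stated load.
# Assumption: ~10 bytes per column value on average (not given in the spec).
rows_per_day = 200_000_000
window_hours = 6
avg_columns = 120
bytes_per_column = 10  # assumed average field width

rows_per_second = rows_per_day / (window_hours * 3600)
daily_gb = rows_per_day * avg_columns * bytes_per_column / 1e9

print(f"Required throughput: {rows_per_second:,.0f} rows/s")  # ~9,259 rows/s
print(f"Approx. daily volume: {daily_gb:,.0f} GB")            # ~240 GB
```

Crude as it is, this is the kind of number any sizing exercise will ask for first.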

I have never had the need to do infra sizing before.

I would appreciate any pointers on how to work out the sizing for this project.
rkashyap
Premium Member
Posts: 532
Joined: Fri Dec 02, 2011 12:02 pm
Location: Richmond VA

Post by rkashyap »

Contact IBM and request IBM Techline Information Server Sizing Estimate.
PaulVL
Premium Member
Posts: 1315
Joined: Fri Dec 17, 2010 4:36 pm

Post by PaulVL »

The mandate is to find out if the current instance of DataStage can handle the data and if any changes are required in the hardware.
Look at the existing workload on the box. Ask the application teams how much data they are pumping into it. Look at your CPU consumption over a month. Ask the application teams if they are seeing any delays in their job submissions (if they are using an external job scheduler). If there are delays, determine why.
Count how many cores and how much memory you currently have on your setup.
Ask the hardware purchasing team for a realistic minimum purchase, in terms of core count, for a non-virtualized environment. (I hate VMs for DataStage setups.)
Ask for a Linux number and an AIX number. AIX isn't so bad, since you can slice it up into a smaller LPAR (ya ya, an LPAR is a type of virtualization, blah blah).
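The core/memory inventory step can be sketched like this (assuming a Linux engine host; on AIX you would use `prtconf`/`lsattr` instead of `/proc`):

```python
# Quick hardware inventory sketch for the current engine host.
# Assumption: Linux host; /proc/meminfo does not exist on AIX.
import os

logical_cpus = os.cpu_count()
with open("/proc/meminfo") as f:
    mem_kb = int(f.readline().split()[1])  # first line: "MemTotal: N kB"

print(f"{logical_cpus} logical CPUs, {mem_kb / 1024 ** 2:.1f} GB RAM")
```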

Look at how existing jobs are created. Application teams tend to stick with a trend ("small jobs" or "big jobs"); expect the same behavior from the same folks.

Ask if the new workload will be run at the same time as the existing workload, or if it can be shifted to a non-peak time of day and still meet its SLA. Application teams will say to process the data as soon as it is ready, in order to allow for a re-run if needed. For the most part that is true, but there is always wiggle room in that answer, and an hour here or there can make a huge difference to your resource consumption (scratch disk).
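The scratch-disk point can be made concrete with a rough sketch (both multipliers below are my assumptions, not measurements):

```python
# Rough scratch-disk sizing sketch.
# Assumptions (mine, not measured): ~240 GB/day of in-flight data
# (200M rows x 120 cols x ~10 bytes), 2x overlap with the existing
# peak workload, and 3x sort/buffer overhead on scratch.
daily_gb = 200_000_000 * 120 * 10 / 1e9
overlap_factor = 2       # assumed: new and existing workloads overlap
scratch_multiplier = 3   # assumed: sort/buffer overhead per run
scratch_gb = daily_gb * overlap_factor * scratch_multiplier
print(f"Plan for roughly {scratch_gb:,.0f} GB of scratch")  # ~1,440 GB
```

Shifting the new load to an off-peak window drops the overlap factor back toward 1, which is exactly the "wiggle room" mentioned above.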