Whole Machine Grinds to a Halt - On Disk Contention.

Post questions here relating to DataStage Server Edition, covering areas such as Server job design, DS BASIC, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Jluchjen
Participant
Posts: 5
Joined: Wed Jul 28, 2004 2:29 am
Location: Darwin Australia

Whole Machine Grinds to a Halt - On Disk Contention.

Post by Jluchjen »

We are running:
DataStage Server version 7.1r1
AIX version 5.2
Disk is a Hitachi 9500 disk array
connected over dual-channel fibre links.

Hello All,

When I run a couple of my jobs that generate a large amount of disk IO, this AIX box performs like a dog: telnet in and issue an ls and you may wait up to 2 minutes for a response; try to exit a shell and you get the same sort of wait.

My system administrator and I feel we are getting nowhere, just covering the same ground over and over.

When we use nmon to watch what is happening, the following things are observed:

Once the hard disk reaches 100% busy (showing anything from 3000 KB to 4000 KB per second throughput), the CPUs one by one show 100% W (wait). The IO continues to be written, and when it clears the CPUs free up again for a while, only to repeat the cycle.
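
For reference, the same picture can be confirmed outside nmon with the stock AIX tools (the 5-second interval is an arbitrary choice):

    # CPU side: watch the "wa" (wait) column climb toward 100 while
    # "us" and "sy" stay low -- CPUs stalled on IO, not doing work.
    vmstat 5

    # Disk side: %tm_act pinned at 100 on one hdisk while the others
    # sit idle is the signature of a single saturated LUN.
    iostat -d 5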

Has anybody seen this before (and hopefully fixed it)????

Thanks Jack
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

Yep, happens all the time. One simple job SEQ --> XFM --> HASH can hog a poor disk/controller design. Try running two jobs like that at the same time, forget about it. Put a database on top of that layout, go brew a pot of coffee and send out for donuts.
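
To make that concrete, here's a minimal sketch of the kind of separation that helps; the mount points are hypothetical, and the assumption is that each sits on a different RAID group/controller:

    # Hypothetical mount points, each backed by its own RAID group.
    mkdir -p /fs_seq/staging    # sequential files the SEQ stage reads
    mkdir -p /fs_hash/lookups   # pathed hashed files for the HASH stage

In the Hashed File stage, using a directory path instead of the project account lets you point the hashed file at /fs_hash/lookups, so the job's read stream and write stream stop queuing on the same spindles.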

We see this on Linux and Windows boxes a lot; the disk layout is not organized to handle the high disk I/O requirements. For example, one of my customers' development servers uses CLARiiONs but the production server uses Symmetrix. The jobs on the development server can't execute a full-volume data run because of disk contention, but the production server handles it without breaking a sweat.

Your disk array, can you describe its nature? Is it a heck of a lot of disks and controllers like the high-end 9500, or are we talking 5 disks striped on a single controller with 1 GB of cache? What kind of volume are you throwing at it?
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
Jluchjen
Participant
Posts: 5
Joined: Wed Jul 28, 2004 2:29 am
Location: Darwin Australia

Post by Jluchjen »

Thanks Ken,

That was quick; we are truly grateful.

My system admin says:
Our 9500 has 4 GB of cache,
4 RAID controllers, and
13 RAID 5 groups of 4 data + 1 parity.

We are defining 150 GB disks within these RAID groups.
AIX is connected over a 2 Gb fibre link.
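
For completeness, that layout can be confirmed from the AIX side with the standard commands (a minimal sketch; device and volume group names will differ):

    # Each 150 GB LUN presented by the 9500 should appear as an hdisk.
    lsdev -Cc disk
    # Physical volumes and the volume groups they belong to.
    lspv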

Ta Jack
gpatton
Premium Member
Posts: 47
Joined: Mon Jan 05, 2004 8:21 am

Post by gpatton »

I had a very similar problem before.

What you need to do is see how the disk drives / LUNs are actually mapped to the RAID groups.

You may be trying to separate IO and in reality be exacerbating the contention.

If Disk 1, Disk 2, Disk 3 and Disk 4 are used in LUN1, LUN2 and LUN3, then you should make all of these LUNs part of one file system; they share the same spindles, so splitting them across file systems only pretends to separate the IO.

Make sure that the database log files are on a separate disk from the database files.

Also check the RAID striping on the SAN and in the database to make sure that you are not cross-striping the IO.
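
A minimal sketch of that check from the AIX side (the volume group and logical volume names below are placeholders). If two "separated" logical volumes trace back to hdisks carved from the same RAID group, the separation is only logical:

    # Which hdisks (LUNs) make up the volume group.
    lsvg -p datavg
    # Logical-to-physical mapping: which hdisks this LV actually sits on.
    lslv -m db_log_lv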
Jluchjen
Participant
Posts: 5
Joined: Wed Jul 28, 2004 2:29 am
Location: Darwin Australia

Post by Jluchjen »

Hi :D ,

Thank you very much. I have passed your comments on to our system administrator and look forward to improved performance.

I intend to post again once the changes are implemented so I can measure the performance gain.

Ta Jack