Linux 8-way maxes all procs - box becomes unresponsive

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Bill_G
Premium Member
Posts: 74
Joined: Thu Oct 20, 2005 9:34 am

Linux 8-way maxes all procs - box becomes unresponsive

Post by Bill_G »

We failed our 4-way dev box (Linux RH 3) over to an 8-way yesterday. Without modifying the config file (4 nodes), several PX jobs ran to completion, albeit without vast improvement in performance. I figured I needed to add some scratch space and modify the config file. Still just testing the waters.

However, testing today - again without modification to the config file - I had a PX job max out all 8 procs and render the server useless. We had to execute a hard reboot to recover. 4 GB of RAM (out of 32 GB) was in use and swap usage was 0. The job had previously run successfully on the 4-way.

The job reads in 4 inputs from Oracle, splits the data into pieces, executes approximately 15 or so simultaneous lookups (both sparse and normal), then merges the data back together prior to writing to the target - a Data Set.

I have opened a ticket with IBM, but was hoping someone else had run into a similar problem. Is it a config file issue or some OS-level setting that needs to be adjusted?

TIA


The config file used:

Code: Select all

{
	node "node0"
	{
		fastname "etl00.xxxxx.org"
		pools ""
		resource disk "/usr/ETL/Flat-Files" {pools ""}
		resource scratchdisk "/usr/ETL/Temp1" {pools ""}
	}
	node "node1"
	{
		fastname "etl00.xxxxx.org"
		pools ""
		resource disk "/usr/ETL/Flat-Files" {pools ""}
		resource scratchdisk "/usr/ETL/Temp2" {pools ""}
		resource ORACLE "node1" {pools ""}
	}
	node "node2"
	{
		fastname "etl00.xxxxx.org"
		pools ""
		resource disk "/usr/ETL/Flat-Files" {pools ""}
		resource scratchdisk "/usr/ETL/Temp1" {pools ""}
		resource ORACLE "node2" {pools ""}
	}
	node "node3"
	{
		fastname "etl00.xxxxx.org"
		pools ""
		resource disk "/usr/ETL/Flat-Files" {pools ""}
		resource scratchdisk "/usr/ETL/Temp2" {pools ""}
	}
}
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Larger data volumes? That it "ran OK on the four-CPU box" suggests that, if nothing else has changed, either a larger data volume is the cause or other things were occurring at the same time on the eight-CPU box. You need to check both. Unfortunately, by rebooting you have destroyed all the evidence (unless you were monitoring system performance).
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Bill_G
Premium Member
Posts: 74
Joined: Thu Oct 20, 2005 9:34 am

Post by Bill_G »

We are using the exact same source data set that we used on the 4-way configuration. The only thing that has changed is the hardware - we failed over from the 4-way.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

How closely have you verified that "the only thing changed is the hardware"? Please monitor both systems, running the same jobs, to determine what DataStage processes are running and what resources these are consuming. Also report all other processes, particularly any associated with database servers and with data transfers.
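
A quick way to capture that on both boxes is a plain `sh` snapshot like the one below. This is only a sketch: the process names `osh` (the PX engine binary), `phantom`, and `dsapi` are typical of a DataStage install, but adjust the grep pattern to match yours.

```shell
#!/bin/sh
# Snapshot the top CPU consumers, then list DataStage / Oracle processes.
# Run it on a loop (or from cron) on both the 4-way and the 8-way while
# the same job executes, and diff the results.
TOP=$(ps -eo pid,ppid,pcpu,pmem,vsz,comm --sort=-pcpu | head -20)
echo "$TOP"
# 'osh', 'phantom', 'dsapi' are typical DataStage process names; 'ora_'
# matches Oracle background processes. Adjust for your install.
ps -eo pid,pcpu,pmem,args | grep -E 'osh|phantom|dsapi|ora_' | grep -v grep || true
```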
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
jgreve
Premium Member
Posts: 107
Joined: Mon Sep 25, 2006 4:25 pm

Detective time - Re: Linux 8-way maxes all procs

Post by jgreve »

Linux RedHat 3, eh?
By "failed over" do you mean... converted? Ported? Migrated?
Is any of the hardware the same? Or did you plug a new 8-way box
into your network, leaving the old 4-way one hooked up as well?

As for your final question, "Is it a config file issue or some OS-level
setting that needs to be adjusted?" I would say, "It depends."
You've got a fair amount of troubleshooting ahead of you (some might
call it detective work).

I haven't gone all that deep into parallel stuff (yet), so I'll throw
out some naive questions (not because I'm experienced or anything,
just because I think it is an interesting topic area).
At any rate, I'd welcome criticism, constructive or otherwise,
to see if these are useful things to try.

What happens if you change your fastnames to "localhost"?
(Now, now - I did say they were naive questions.)

What happens if you force everything to run on a single node, e.g. use a config
file with only "node0" - does it hang, or just run really slowly?
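
A single-node config for that test could be as small as this - a sketch reusing the fastname and paths from the config posted above:

```
{
	node "node0"
	{
		fastname "etl00.xxxxx.org"
		pools ""
		resource disk "/usr/ETL/Flat-Files" {pools ""}
		resource scratchdisk "/usr/ETL/Temp1" {pools ""}
	}
}
```

Point $APT_CONFIG_FILE at it for the test run, then switch back.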
What happens if you run an easy do-nothing job - perhaps a Row Generator
feeding Peek stages? Do the log entries show activity on node0 through node3
as set up in your config file?

At the point when you notice your job has hung, what is going
on with the rest of the machine?
What is your CPU utilization?
Disk I/O?
Network traffic?
It just clicked in my mind that your box freezes up, so it is hard
to get stats. Perhaps launch a 1-second logger in another window
and flush your performance metrics to disk, so you can get a clue
about which aspect is running away on you?
(I'd be tempted to hack this up with a Perl loop that just cats stuff
from the /proc filesystem, but there must be more sophisticated ways, yes?)
After things seem to hang, wait an extra minute before rebooting.
You should have some interesting numbers up to the point it went
over the cliff...
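
Such a logger doesn't even need Perl - here's a rough sketch in plain `sh`. It caps itself at a few samples so it terminates when run as-is; for a real run, raise SAMPLES way up (or loop forever). Each sample is synced to disk so some evidence survives a hard reboot.

```shell
#!/bin/sh
# Naive once-per-second system logger: appends load average and memory
# stats from /proc to a log file, flushing after each sample.
LOG="${LOGFILE:-/tmp/sysstats.$$.log}"   # hypothetical path; use any persistent filesystem
SAMPLES=3                                # bump this way up (or loop forever) for a real run
i=0
while [ "$i" -lt "$SAMPLES" ]
do
	{
		date '+=== %Y-%m-%d %H:%M:%S ==='
		cat /proc/loadavg                # load averages, running/total processes
		grep -E '^(MemTotal|MemFree|SwapFree)' /proc/meminfo
	} >> "$LOG"
	sync                                 # flush now, in case the box locks up
	i=$((i + 1))
	sleep 1
done
echo "wrote $SAMPLES samples to $LOG"
```

Tail the log after the freeze (or after the reboot) to see what was climbing just before the box went over the cliff.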

Is Oracle running on your 8-way box as well, or on a remote system?
If remote, what are your ping times to the Oracle box like - is your internal
network fairly clear, or are the admins deathmatching again?

What happens when you check your Oracle inputs - do they actually have data?
Fire up Toad or something and see if you can get result sets on your inputs, and
ensure that your DataStage players aren't waiting on some kind of database lock.

Good luck with this.
I'm sure you'll figure something out.
John G.

Bill_G wrote: We failed our 4-way dev box (Linux RH 3) over to
an 8-way yesterday. Without modifying the config file (4 nodes),
several PX jobs ran to completion, albeit without vast improvement in performance.

I figured I needed to add some scratch space and modify the config file.
Still just testing the waters. However, testing today - again without modification
to the config file - I had a PX job max out all 8 procs and render the server useless.
We had to execute a hard reboot to recover. 4 GB of RAM (out of 32 GB) was in use and
swap usage was 0. The job had previously run successfully on the 4-way.

The job reads in 4 inputs from Oracle, splits the data into pieces,
executes approximately 15 or so simultaneous lookups (both sparse and normal),
then merges the data back together prior to writing to the target - a Data Set.

I have opened a ticket with IBM, but was hoping someone else had run into a similar problem.
Is it a config file issue or some OS-level setting that needs to be adjusted?

TIA