Failover Handling in MPP setups

riptydeva
Participant
Posts: 5
Joined: Fri Jun 03, 2005 7:11 am

Failover Handling in MPP setups

Post by riptydeva »

My team is looking to move a couple of our mission-critical jobs from a single-server SMP environment to a multi-server MPP environment, because we need failover support in case a machine goes down. My question is this:

How does Enterprise Edition handle the failure of one of its servers? Will it simply shunt the work to the remaining servers with nodes defined in the configuration file, or will the system stop working?

In addition, if we have chosen hash partitioning leading into a lookup (to split the lookup between the two servers instead of sending both machines the entire data set), will this complicate the issue, or will EE still handle it properly (assuming the answer to the above question is that it will)?
lshort
Premium Member
Posts: 139
Joined: Tue Oct 29, 2002 11:40 am
Location: Toronto

Post by lshort »

That's a very good question. Is there any way you could try it out and let us know?
8)
Lance Short
"infinite diversity in infinite combinations"
***
"The absence of evidence is not evidence of absence."
riptydeva
Participant
Posts: 5
Joined: Fri Jun 03, 2005 7:11 am

Post by riptydeva »

Hehe, I wish. Unfortunately, we need to know the answer before we buy those pricey licenses. :D
alanwms
Charter Member
Posts: 28
Joined: Wed Feb 26, 2003 2:51 pm
Location: Atlanta/UK

Post by alanwms »

Are you planning on setting up failover for each of the servers in the MPP cluster, or just the main server where DS jobs are initiated? Does the proposed failover seamlessly migrate IP addresses/host names?

The configuration file identifies the nodes, hosts, disk pools, etc. I'd suggest you look at your clustered environment after a failover and then match up all the system resources against your current configuration files to determine what the configuration files need to look like in case of a failover. Typically, you'll have root disks local to a server and all the other disks on a SAN/NAS. Would DataStage (the engine part, not the projects) be located on the root disk? If so, you'll need to configure/license DS on the failover server separately from the primary server.
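
For reference, a minimal two-server configuration file might look something like this (the hostnames and paths here are just placeholders for your own environment):

{
	node "node1"
	{
		fastname "serverA"
		pools ""
		resource disk "/ds/data/node1" {pools ""}
		resource scratchdisk "/ds/scratch/node1" {pools ""}
	}
	node "node2"
	{
		fastname "serverB"
		pools ""
		resource disk "/ds/data/node2" {pools ""}
		resource scratchdisk "/ds/scratch/node2" {pools ""}
	}
}

After a failover, every fastname still has to resolve to a live host, and every resource disk/scratchdisk path still has to be mounted, which is why it's worth comparing the post-failover system against your configuration files.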

I currently work with DS 7.1 Server on an SMP system with failover, but I don't know exactly what the DS licensing arrangements are.

Alan
riptydeva
Participant
Posts: 5
Joined: Fri Jun 03, 2005 7:11 am

Re: Failover Handling in MPP setups

Post by riptydeva »

I just got in touch with an Ascential guru, and here is what he suggested:

EE, in an MPP environment, will shut down if any node fails. It's all or nothing. The simple trick to support failover is to have multiple config files. Say you have 2 servers. Have them run in MPP with a config file that points to both. If server A fails, replace the config file with one that just defines nodes on server B, and then restart the job. You can even have your server monitoring software do this for you.
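
In other words (as I understand it; the hostnames and paths below are just placeholders), you keep a second, cut-down config file that only defines nodes on the surviving server, something like:

{
	node "node1"
	{
		fastname "serverB"
		pools ""
		resource disk "/ds/data/node1" {pools ""}
		resource scratchdisk "/ds/scratch/node1" {pools ""}
	}
}

The monitoring software would then point $APT_CONFIG_FILE (or however the job picks up its config file) at the cut-down file and re-run the job, for example with dsjob -run.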

So I think that is what we are going to try. I suggested just having a second SMP box that can be brought online when the primary box fails, but the team really wants to go with MPP to reduce any potential downtime.
lshort
Premium Member
Posts: 139
Joined: Tue Oct 29, 2002 11:40 am
Location: Toronto

Post by lshort »

This is a good topic to stay on top of. Please keep us informed as to how it works out.
Lance Short
"infinite diversity in infinite combinations"
***
"The absence of evidence is not evidence of absence."
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

It's not that simple. (It never is.)

Your job designs need to be constructed so that they preserve enough knowledge to be able to restart, maybe even to restart part way through. This necessarily involves landing that information to disk, and it needs to be available everywhere (perhaps a File Set with Entire partitioning).

The designs also have to incorporate the logic for restart, unless you'd prefer to do it all again. This gets messy if, for example, rows per commit is not 0 or 1. You have to know which rows were committed, so that you don't process them again. And, if you're aggregating, this is not a pretty task.

These are just a couple of reasons why DataStage will never have true restart capability "out of the box". There are too many "it depends" factors.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.