Failover Handling in MPP setups

riptydeva
Participant
Posts: 5
Joined: Fri Jun 03, 2005 7:11 am

Failover Handling in MPP setups

Post by riptydeva »

My team is looking to move a couple of our mission-critical jobs from a single-server SMP environment to a multi-server MPP environment, because we need failover support in case a machine goes down. My question is this:

How does Enterprise Edition handle the failure of one of its servers? Will it simply shunt the work to the remaining servers with nodes defined in the configuration file, or will the system stop working?

In addition, if we have chosen hash partitioning leading into a lookup (to split the lookup between the two servers instead of sending both machines the entire data set), will this complicate the issue, or will EE still handle it properly (assuming the answer to the above question is that it will)?
lshort
Premium Member
Posts: 139
Joined: Tue Oct 29, 2002 11:40 am
Location: Toronto

Post by lshort »

That's a very good question. Is there any way you could try it out and let us know?
8)
Lance Short
"infinite diversity in infinite combinations"
***
"The absence of evidence is not evidence of absence."
riptydeva
Participant
Posts: 5
Joined: Fri Jun 03, 2005 7:11 am

Post by riptydeva »

Hehe, I wish. Unfortunately, we need to know the answer before we buy those pricey licenses. :D
alanwms
Charter Member
Posts: 28
Joined: Wed Feb 26, 2003 2:51 pm
Location: Atlanta/UK

Post by alanwms »

Are you planning on setting up failover for each of the servers in the MPP cluster, or just the main server where DS jobs are initiated? Does the proposed failover seamlessly migrate IP addresses/host names?

The configuration file identifies the nodes, hosts, disk pools, etc. I'd suggest you look at your clustered environment after a failover and then match up all the system resources against your current configuration files to determine what the configuration files need to look like in case of a failover. Typically, you'll have root disks local to a server and all the other disks on a SAN/NAS. Would DataStage (the engine part, not the projects) be located on the root disk? If so, you'll need to configure/license DS on the failover server separately from the primary server.
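
For reference, a minimal two-server configuration file might look something like this (the hostnames and paths here are just placeholders for your own environment):

{
	node "node1"
	{
		fastname "serverA"
		pools ""
		resource disk "/ds/data/node1" {pools ""}
		resource scratchdisk "/ds/scratch/node1" {pools ""}
	}
	node "node2"
	{
		fastname "serverB"
		pools ""
		resource disk "/ds/data/node2" {pools ""}
		resource scratchdisk "/ds/scratch/node2" {pools ""}
	}
}

After a failover, every fastname still has to resolve to a live host, and every resource disk/scratchdisk path still has to be mounted, which is why it's worth comparing the post-failover system against your configuration files.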

I currently work with DS 7.1 Server on an SMP system with failover, but I don't know exactly what the DS licensing arrangements are.

Alan
riptydeva
Participant
Posts: 5
Joined: Fri Jun 03, 2005 7:11 am

Re: Failover Handling in MPP setups

Post by riptydeva »

I just got in touch with an Ascential guru, and here is what he suggested:

EE, in an MPP environment, will shut down if any node fails. It's all or nothing. The simple trick to support failover is to have multiple config files. Say you have 2 servers. Have them run in MPP with a config file that points to both. If server A fails, replace the config file with one that just defines nodes on server B, and then restart the job. You can even have your server monitoring software do this for you.
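
In other words (as I understand it; the hostnames and paths below are just placeholders), you keep a second, cut-down config file that only defines nodes on the surviving server, something like:

{
	node "node1"
	{
		fastname "serverB"
		pools ""
		resource disk "/ds/data/node1" {pools ""}
		resource scratchdisk "/ds/scratch/node1" {pools ""}
	}
}

The monitoring software would then point $APT_CONFIG_FILE (or however the job picks up its config file) at the cut-down file and re-run the job, for example with dsjob -run.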

So I think that is what we are going to try. I suggested just having a second SMP box that can be brought online when the primary box fails, but the team really wants to go with MPP to reduce any potential downtime.
lshort
Premium Member
Posts: 139
Joined: Tue Oct 29, 2002 11:40 am
Location: Toronto

Post by lshort »

This is a good topic to stay on top of. Please keep us informed as to how it works out.
Lance Short
"infinite diversity in infinite combinations"
***
"The absence of evidence is not evidence of absence."
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

It's not that simple. (It never is.)

Your job designs need to be constructed so that they preserve enough knowledge to be able to restart, maybe even to restart part way through. This necessarily involves landing that information to disk, and it needs to be available everywhere (perhaps a File Set with Entire partitioning).

The designs also have to incorporate the logic for restart, unless you'd prefer to do it all again. This gets messy if, for example, rows per commit is not 0 or 1. You have to know which rows were committed, so that you don't process them again. And, if you're aggregating, this is not a pretty task.

These are just a couple of reasons why DataStage will never have true restart capability "out of the box". There are too many "it depends" factors.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.