High Availability and DataStage 8.1

A forum for discussing DataStage® basics. If you're not sure where your question goes, start here.

Moderators: chulett, rschirm, roy

robrob
Participant
Posts: 2
Joined: Fri Dec 05, 2008 9:36 am

High Availability and DataStage 8.1

Post by robrob »

Hello,
I would like to install InfoSphere DS 8.1 on a dual-node system running AIX 5.3, but I haven't found any guidelines or documentation. In IBM's "Planning, Installation, and Configuration Guide" the only reference to clustering says: "Share the installation file systems across all servers at the same mount point."
What about the DS_ASB instance name? It is tied to the host name, so how should I configure the hosts file to allow switching between the nodes?
The metadata repository will be on a clustered Oracle.
Any ideas, please?
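For instance, I was imagining something along these lines in /etc/hosts on both nodes, with a virtual name that the cluster moves on failover (the addresses and the name "dsvip" are only placeholders I made up):

    127.0.0.1   localhost
    10.0.0.11   node1
    10.0.0.12   node2
    10.0.0.10   dsvip     # virtual/service IP that follows the active node

Would the DS_ASB instance then be registered against "dsvip", or must it resolve to the physical hostname?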
Thanks a lot,
robrob.
asaf_arbely
Premium Member
Posts: 87
Joined: Sat Jul 14, 2007 2:24 pm

Post by asaf_arbely »

Hi RobRob

We implemented a DRP architecture with the Server edition (8.0.1).

It is NOT a high availability solution, but an AIX cluster that activates the secondary node when the primary one crashes.

Under this solution, recovering from a HW crash is the same as recovering from a sudden reboot - jobs should be designed to be restartable.

In a HW crash scenario, the operator should be able to "Reset" the job that was running and re-run it within a couple of minutes of the crash.
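For example, the reset-and-rerun can be scripted with the dsjob command line (the project and job names below are made up, and the install path may differ on your system):

    # source the DataStage environment (path depends on the install)
    . /opt/IBM/InformationServer/Server/DSEngine/dsenv

    # reset the job that was running when the node went down,
    # then run it again and wait for it to finish
    $DSHOME/bin/dsjob -run -mode RESET MyProject MyJob
    $DSHOME/bin/dsjob -run -mode NORMAL -wait MyProject MyJob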

I'll be happy to provide more technical details if it suits your needs.

Asaf.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

I helped build a clustered installation of the Server product on Compaq Tru64 Alpha boxes many moons ago. However, as noted, I wouldn't consider it true "high availability" as there was no notion of what was called TAF at the time: Transparent Application Fail-over. Running jobs would all die but the system would fail everything (including, for example, cron entries) over to another node on the cluster and we'd be back in business in a moment or two. Of course, you still needed to clean up the mess it made and (possibly) restart things but at least you weren't dead in the water.

The question about HA gets asked once in a great while here and as far as I know there still is no true High Availability version or ability in DataStage. A search of the forums should turn up the handful of previous conversations on the topic. And you could always check with IBM or your official support provider, see if anything new is here or around the corner.
-craig

"You can never have too many knives" -- Logan Nine Fingers
asaf_arbely
Premium Member
Posts: 87
Joined: Sat Jul 14, 2007 2:24 pm

Post by asaf_arbely »

Hi chulett,

I'd like to add an idea I heard from an IBM rep:

When implementing the Enterprise Edition over a multi-node architecture, HA is sort of "built-in": one node may die during the run and yet the entire process will finish, with the "crashed" portion migrating to other nodes.

It sounds feasible and worth a try - if only I had the time and money... :wink:
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

It's the "sort of" part that concerns me. Curious what, if anything, others have heard or done recently - especially in the 8.x world.
-craig

"You can never have too many knives" -- Logan Nine Fingers
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Recently, with 8.x, we built an HA set of systems for DS in a banking environment. All of the concepts worked on the assumption that jobs would abort and, after failover, would get restarted. The jobs in this case used MQ for messages and thus could guarantee that each message would be processed once and once only. The HA itself was based on both the hardware and the database-level software/hardware, making a system crash an almost transparent event. Tests of everything from "kill -9" through literally pulling network cables, SAN data cables and various and sundry power cables showed that it does work. Recovery time after failure detection was anywhere from 2 minutes to 10 minutes.
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

The message-based approach would work well since you can turn all the outputs into a single database transaction, so if the job aborts everything rolls back. Once you get up to higher data volumes you cannot use a single rollback transaction and need to think about rollback and recovery for the jobs that are supposed to restart after an abort.
asaf_arbely
Premium Member
Posts: 87
Joined: Sat Jul 14, 2007 2:24 pm

Post by asaf_arbely »

I agree.

If the single logical job can recover and restart, cleaning up the remains of its last unsuccessful run, the above HA solution will work.

We implemented this solution at a bank with DS 8.0.1, on top of a cluster of two AIX machines that point to the same storage (EMC) and the same repository (Oracle). Obviously, the DS installation took place only once. The storage has its own failover mechanism, as does Oracle.

Indeed, recovery from a HW failure takes about 5 minutes.

Regarding the cleanup required - there is no difference between the disaster scenario and a "simple" abort of a job. In both cases one may need to clean something up in order to restart or continue from the last successful checkpoint (supported by DS!).

If the job was designed to clean up after itself, then we are talking about an almost automatic solution - only a restart is required.

Asaf.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

So then... what is it called when a node failure does not kill running jobs but rather transparently moves the engine to another node? Really High Availability? :?

Serious question, btw: is that not a consideration? It was back in the day when I was goofing with it. I guess that's the ultimate goal, but what's being discussed here is close enough. Ah... I remember fondly working at the Golden Nugget on their Tandem "NonStop" systems... everything comp'd... [sigh]
-craig

"You can never have too many knives" -- Logan Nine Fingers
asorrell
Posts: 1707
Joined: Fri Apr 04, 2003 2:00 pm
Location: Colleyville, Texas

Post by asorrell »

I'm at a site where they had IBM Global Services implement HA for their Linux-based HP systems. There are two production boxes, which we run as a single MPP environment with 8 nodes. If the system detects a hardware failure OR a software failure (i.e. WAS / Node Agent / DataStage is down), it automatically switches to the "working" box, restarts the processes, and notifies us so we can restart any jobs that failed.

We've tried it several times (once accidentally) and it works very well. Jobs are designed to be restartable, so downtime is just a few minutes.
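We don't maintain those cluster scripts ourselves, but conceptually the health check is nothing exotic. A rough sketch of the idea (the process names, log path and everything else here are illustrative, not the actual IBM Global Services code):

    #!/bin/sh
    # Rough health-check sketch - a non-zero exit tells the cluster manager to fail over.
    check() {
        ps -ef | grep "$1" | grep -v grep > /dev/null || {
            echo "`date`: $1 not running" >> /var/log/ds_hacheck.log
            exit 1
        }
    }

    check dsrpcd        # DataStage engine listener
    check ASBAgent      # node agent
    check WebSphere     # application server (WAS)
    exit 0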
Andy Sorrell
Certified DataStage Consultant
IBM Analytics Champion 2009 - 2020
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

If you had two failover boxes would that be a HAHA configuration?
:lol:
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
robrob
Participant
Posts: 2
Joined: Fri Dec 05, 2008 9:36 am

Post by robrob »

asaf_arbely wrote:Hi RobRob

We implemented a DRP architecture with the Server edition (8.0.1).

It is NOT a high availability solution, but an AIX cluster that activates the secondary node when the primary one crashes.

Under this solution, recovering from a HW crash is the same as recovering from a sudden reboot - jobs should be designed to be restartable.

In a HW crash scenario, the operator should be able to "Reset" the job that was running and re-run it within a couple of minutes of the crash.

I'll be happy to provide more technical details if it suits your needs.

Asaf.
Hi Asaf,
that's what I'm looking for, but I don't know what instance name to give during the DataStage installation.
If, from node1, I fill in the virtual instance name, it doesn't work because in the /etc/hosts file it is not bound to localhost. But if I fill in the node1 hostname, I can't activate the instance from node2.
I've found this on an IBM forum, but it doesn't seem too clear, and I think it contains a typo.
We did the following to install 8.0.1:

1. Log in to the master node (which should be pointing to node1) and change the hostname to nodeMaster using smit.
2. Do a standard install of DataStage as per the documented instructions, then change the hostname back to node1.
3. Do a global replace in the serverindex.xml file from host="node1" to host="nodeMaster" (about 5 entries).
4. On node2, include the rpc services entry in /etc/services, create the subdirectory /tmp/rt, and set up the UNIX dsadm user.
5. Configure startup scripts to change default.apt (or any required apt files) to use the correct node, i.e. fastname="node1" or fastname="node2".

Swapping nodes involves sourcing the dsenv environment, stopping, then starting datastage services.
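If I read that right, the node swap itself would boil down to something like the sketch below, run on whichever node takes over (the install paths and the per-node default.apt copies are only my guess at how the "change default.apt" step could be done):

    #!/bin/sh
    # Node-swap sketch as I understand the steps above - untested.
    THISNODE=`hostname`                                       # node1 or node2
    DSHOME=/opt/IBM/InformationServer/Server/DSEngine         # engine install path
    APTDIR=/opt/IBM/InformationServer/Server/Configurations   # APT config files

    # use a pre-built config per node so fastname points at the active machine
    cp $APTDIR/default.apt.$THISNODE $APTDIR/default.apt

    # source dsenv, then stop and restart the DataStage engine services
    cd $DSHOME
    . ./dsenv
    bin/uv -admin -stop
    bin/uv -admin -start

Does that match what you did, or is there more to it (the node agents, for example)?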


Thanks a lot,
robrob.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

ray.wurlod wrote:If you had two failover boxes would that be a HAHA configuration?
Ah, there you are Mr Wurlod. You know you're not supposed to be out wandering the halls on your own, now don't you? Let's just get you back to your room, shall we... that's a good boy... this way...
-craig

"You can never have too many knives" -- Logan Nine Fingers