High Availability and DataStage 8.1

A forum for discussing DataStage® basics. If you're not sure where your question goes, start here.

Moderators: chulett, rschirm, roy

robrob
Participant
Posts: 2
Joined: Fri Dec 05, 2008 9:36 am

High Availability and DataStage 8.1

Post by robrob »

Hello,
I would like to install InfoSphere DS 8.1 on a dual-node system running AIX 5.3, but I haven't found any guidelines or documentation. In IBM's "Planning, Installation, and Configuration Guide" the only reference to clustering says: "Share the installation file systems across all servers at the same mount point."
What about the DS_ASB instance name? It is tied to the host name, so how should I configure the hosts file to allow switching between the nodes?
The metadata repository will be on a clustered Oracle.
Any ideas, please?
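For instance, I was imagining something along these lines in /etc/hosts on both nodes, with a virtual name that the cluster moves on failover (the addresses and the name "dsvip" are only placeholders I made up):

    127.0.0.1   localhost
    10.0.0.11   node1
    10.0.0.12   node2
    10.0.0.10   dsvip     # virtual/service IP that follows the active node

Would the DS_ASB instance then be registered against "dsvip", or must it resolve to the physical hostname?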
Thanks a lot,
robrob.
asaf_arbely
Premium Member
Posts: 87
Joined: Sat Jul 14, 2007 2:24 pm

Post by asaf_arbely »

Hi RobRob

We implemented a DRP architecture with the Server edition (8.0.1).

It is NOT a high availability solution, but an AIX cluster that activates the secondary node when the primary one crashes.

Under this solution, recovering from a HW crash is the same as recovering from a sudden reboot - jobs should be designed to be restartable.

In a HW crash scenario, the operator should be able to "Reset" the job that was running and re-run it within a couple of minutes of the crash.
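For example, the reset-and-rerun can be scripted with the dsjob command line (the project and job names below are made up, and the install path may differ on your system):

    # source the DataStage environment (path depends on the install)
    . /opt/IBM/InformationServer/Server/DSEngine/dsenv

    # reset the job that was running when the node went down,
    # then run it again and wait for it to finish
    $DSHOME/bin/dsjob -run -mode RESET MyProject MyJob
    $DSHOME/bin/dsjob -run -mode NORMAL -wait MyProject MyJob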

I'll be happy to provide more technical details if it suits your needs.

Asaf.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

I helped build a clustered installation of the Server product on Compaq Tru64 Alpha boxes many moons ago. However, as noted, I wouldn't consider it true "high availability" as there was no notion of what was called TAF at the time: Transparent Application Fail-over. Running jobs would all die but the system would fail everything (including, for example, cron entries) over to another node on the cluster and we'd be back in business in a moment or two. Of course, you still needed to clean up the mess it made and (possibly) restart things but at least you weren't dead in the water.

The question about HA gets asked once in a great while here and as far as I know there still is no true High Availability version or ability in DataStage. A search of the forums should turn up the handful of previous conversations on the topic. And you could always check with IBM or your official support provider, see if anything new is here or around the corner.
-craig

"You can never have too many knives" -- Logan Nine Fingers
asaf_arbely
Premium Member
Posts: 87
Joined: Sat Jul 14, 2007 2:24 pm

Post by asaf_arbely »

Hi chulett,

I'd like to add an idea I heard from an IBM rep:

When implementing the Enterprise Edition over a multi-node architecture, HA is sort of "built-in": one node may die during the run and yet the entire process will finish, with the "crashed" portion migrating to other nodes.

It sounds feasible and worth a try - if only I had the time and money... :wink:
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

It's the "sort of" part that concerns me. Curious what, if anything, others have heard or done recently - especially in the 8.x world.
-craig

"You can never have too many knives" -- Logan Nine Fingers
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Recently, with 8.x, we built an HA set of systems for DS in a banking environment. All of the concepts worked on the assumption that jobs would abort and, after failover, would get restarted. The jobs in this case used MQ for messages and thus could guarantee that each message would be processed once and once only. The HA itself was based on both the hardware and the database-level software/hardware, making a system crash an almost transparent event. Tests of everything from "kill -9" through literally pulling network cables, SAN data cables and various and sundry power cables showed that it does work. Recovery time after failure detection was anywhere from 2 minutes to 10 minutes.
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

The message-based approach would work well since you can turn all the outputs into a single database transaction, so if the job aborts everything rolls back. Once you get up to higher data volumes you cannot use a single rollback transaction and need to think about rollback and recovery for the jobs that are supposed to restart after an abort.
asaf_arbely
Premium Member
Posts: 87
Joined: Sat Jul 14, 2007 2:24 pm

Post by asaf_arbely »

I agree.

If the single logical job can recover and restart, cleaning up the remains of its last unsuccessful run, the above HA solution will work.

We implemented this solution at a bank with DS 8.0.1, on top of a cluster of two AIX machines that point to the same storage (EMC) and the same repository (Oracle). Obviously, the DS installation took place only once. The storage has its own failover mechanism, as does Oracle.

Indeed, recovery from a HW failure takes about 5 minutes.

Regarding the cleanup required - there is no difference between the disaster scenario and a "simple" abort of a job. In both cases one may need to clean something up in order to restart or continue from the last successful checkpoint (supported by DS!).

If the job was designed to clean up after itself, then we are talking about an almost automatic solution - only a restart is required.

Asaf.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

So then... what is it called when a node failure does not kill running jobs but rather transparently moves the engine to another node? Really High Availability? :?

Serious question, btw: is that not a consideration? It was back in the day when I was goofing with it. I guess that's the ultimate goal, but what's being discussed here is close enough. Ah... I remember fondly working at the Golden Nugget on their Tandem "NonStop" systems... everything comp'd... [sigh]
-craig

"You can never have too many knives" -- Logan Nine Fingers
asorrell
Posts: 1707
Joined: Fri Apr 04, 2003 2:00 pm
Location: Colleyville, Texas

Post by asorrell »

I'm at a site where they had IBM Global Services implement HA for their Linux-based HP systems. There are two production boxes, which we run as a single MPP environment with 8 nodes. If the system detects a hardware failure OR a software failure (i.e. WAS / Node Agent / DataStage is down), it automatically switches to the "working" box, restarts the processes, and notifies us so we can restart any jobs that failed.

We've tried it several times (once accidentally) and it works very well. Jobs are designed to be restartable, so downtime is just a few minutes.
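We don't maintain those cluster scripts ourselves, but conceptually the health check is nothing exotic. A rough sketch of the idea (the process names, log path and everything else here are illustrative, not the actual IBM Global Services code):

    #!/bin/sh
    # Rough health-check sketch - a non-zero exit tells the cluster manager to fail over.
    check() {
        ps -ef | grep "$1" | grep -v grep > /dev/null || {
            echo "`date`: $1 not running" >> /var/log/ds_hacheck.log
            exit 1
        }
    }

    check dsrpcd        # DataStage engine listener
    check ASBAgent      # node agent
    check WebSphere     # application server (WAS)
    exit 0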
Andy Sorrell
Certified DataStage Consultant
IBM Analytics Champion 2009 - 2020
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

If you had two failover boxes would that be a HAHA configuration?
:lol:
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
robrob
Participant
Posts: 2
Joined: Fri Dec 05, 2008 9:36 am

Post by robrob »

asaf_arbely wrote:Hi RobRob

We implemented a DRP architecture with the Server edition (8.0.1).

It is NOT a high availability solution, but an AIX cluster that activates the secondary node when the primary one crashes.

Under this solution, recovering from a HW crash is the same as recovering from a sudden reboot - jobs should be designed to be restartable.

In a HW crash scenario, the operator should be able to "Reset" the job that was running and re-run it within a couple of minutes of the crash.

I'll be happy to provide more technical details if it suits your needs.

Asaf.
Hi Asaf,
that's what I'm looking for, but I don't know what instance name to give during the DataStage installation.
If, from node1, I fill in the virtual instance name, it doesn't work because in the /etc/hosts file it is not bound to localhost. But if I fill in the node1 hostname, I can't activate the instance from node2.
I've found this on an IBM forum, but it doesn't seem too clear, and I think it contains a typo.
We did the following to install 8.0.1:

1. Log in to the master node (which should be pointing to node1) and change the hostname to nodeMaster using smit.
2. Do a standard install of DataStage as per the documented instructions, then change the hostname back to node1.
3. Do a global replace in the serverindex.xml file from host="node1" to host="nodeMaster" (about 5 entries).
4. On node2, include the rpc services entry in /etc/services, create the subdirectory /tmp/rt, and set up the UNIX dsadm user.
5. Configure startup scripts to change default.apt (or any required apt files) to use the correct node, i.e. fastname="node1" or fastname="node2".

Swapping nodes involves sourcing the dsenv environment, stopping, then starting datastage services.
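If I read that right, the node swap itself would boil down to something like the sketch below, run on whichever node takes over (the install paths and the per-node default.apt copies are only my guess at how the "change default.apt" step could be done):

    #!/bin/sh
    # Node-swap sketch as I understand the steps above - untested.
    THISNODE=`hostname`                                       # node1 or node2
    DSHOME=/opt/IBM/InformationServer/Server/DSEngine         # engine install path
    APTDIR=/opt/IBM/InformationServer/Server/Configurations   # APT config files

    # use a pre-built config per node so fastname points at the active machine
    cp $APTDIR/default.apt.$THISNODE $APTDIR/default.apt

    # source dsenv, then stop and restart the DataStage engine services
    cd $DSHOME
    . ./dsenv
    bin/uv -admin -stop
    bin/uv -admin -start

Does that match what you did, or is there more to it (the node agents, for example)?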


Thanks a lot,
robrob.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

ray.wurlod wrote:If you had two failover boxes would that be a HAHA configuration?
Ah, there you are Mr Wurlod. You know you're not supposed to be out wandering the halls on your own, now don't you? Let's just get you back to your room, shall we... that's a good boy... this way...
-craig

"You can never have too many knives" -- Logan Nine Fingers