Experiences with DataStage High Availability?

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Post Reply
ogmios
Participant
Posts: 659
Joined: Tue Mar 11, 2003 3:40 pm

Experiences with DataStage High Availability?

Post by ogmios »

Does anyone have experiences with DataStage High Availability on Solaris UNIX Machines?

Has anyone already implemented the failover scripts that are shown in the "DataStage Operator Tips Tricks Final Version Customers.ppt" in the files section of http://developernet.ascential.com ? And do they work with the Standard version of DataStage or only with the PE edition?

Ogmios
roy
Participant
Posts: 2598
Joined: Wed Jul 30, 2003 2:05 am
Location: Israel

Post by roy »

Hi,
do you have a specific need?
Roy R.
Time is money but when you don't have money time is all you can afford.

Search before posting:)

Join the DataStagers team effort at:
http://www.worldcommunitygrid.org
ogmios
Participant
Posts: 659
Joined: Tue Mar 11, 2003 3:40 pm

Post by ogmios »

As highly available as possible. We already have "high availability" at the hardware level; every component is duplicated: when a disk crashes, its mirror is automatically used, and when a backplane breaks, the machine reboots and automatically switches to a backup backplane.

Now customers want application-level high availability: when, for example, a backplane breaks, they want DataStage to be restarted automatically upon reboot and to somehow pick up the ETL cycle where it left off. Yeah, right.

Ogmios
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

What hardware are you running on? The same basic tasks as listed in the Tips & Tricks presentation should work for Server as well as PX, I would think; they just differ in some of the gory details. In a former life we ran DataStage on a multi-node Compaq Tru64 cluster with failover scripted in. It wasn't that hard to set up from what I remember; the biggest issue was failing over the crontab with the server for any affected users.

As to restarting the jobs where they left off, in my mind that wouldn't be any different in that particular environment than in a crashed stand-alone server implementation. I'd be curious what their script looks like and how it decides to restart job streams 'appropriately'. :?
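For what it's worth, one way such a script could decide where to restart: have the controlling script record each job's completion in a small state file as the stream runs, then on startup after a crash rerun everything from the first job that never finished. A minimal sketch of that decision logic (the job names and the state-file format are my own illustration, not from the presentation):

```python
# Sketch: decide where a linear jobstream should resume after a crash.
# Assumption: a controlling script appends "jobname OK" to a state file
# as each job finishes; after a reboot we rerun from the first job that
# has no OK record, in the original order.

def jobs_to_restart(stream, state_lines):
    """stream: ordered job names; state_lines: lines like 'load_stage OK'."""
    finished = {line.split()[0] for line in state_lines
                if line.strip().endswith("OK")}
    for i, job in enumerate(stream):
        if job not in finished:
            return stream[i:]          # resume here, in order
    return []                          # whole stream completed

stream = ["extract_src", "transform_dim", "transform_fact", "bulk_load"]
state = ["extract_src OK", "transform_dim OK"]   # crashed mid fact transform
print(jobs_to_restart(stream, state))
# -> ['transform_fact', 'bulk_load']
```

The real script would of course also have to fail over the state file itself with the server, or the decision is made against stale data.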
-craig

"You can never have too many knives" -- Logan Nine Fingers
ogmios
Participant
Posts: 659
Joined: Tue Mar 11, 2003 3:40 pm

Post by ogmios »

Solaris 8/Sunfire (with a big number :) ). The pictures in the presentation don't look like a representation of the standard DataStage edition: e.g. I haven't seen a "player" mentioned anywhere else in the documentation I have (I will check with Ascential).

Right now I'm thinking of rolling our own, and we're evaluating a third-party "smart" scheduler to use instead of crontab, one that starts the jobs and can be programmed to restart any jobs that hadn't yet finished after a crash.

Ogmios
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

I think you'll find that 'Player' is specific to the Orchestrate underpinnings that power PX.
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

chulett wrote:I think you'll find that 'Player' is specific to the Orchestrate underpinnings that power PX.
True. Orchestrate gives you a Conductor and a number of Players. Wonder where they got their terminology? :lol:
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

ogmios wrote:Now customers want to implement application high availability, when e.g. a backplane is broken they want DataStage to be restarted automatically upon reboot and somehow pick up the ETL cycle where it left off.
This is possible only if you write an ETL application built to restart. There are frameworks in which your ETL application has staging milestones (see all of my history on this forum supporting this architecture) where data is landed in a physical form. These milestones mark logical and convenient points within the processing stream that provide a fallback in the event of catastrophic failure. Since a particular segment of processing may be manipulating staged datasets, and a failure during that segment will leave those datasets in an unrecoverable state, you have to be able to fall back to a recent milestone and pick up from there with a minimal amount of reprocessing.

I can expound on this framework further, but I want to say that this works. The critical elements are that your staging databases and work dataset file systems have to be mirrored for failover. Furthermore, you need an enterprise scheduler (or at least intelligent job control) smart enough either to resurrect a jobstream exactly where it last stopped, or to restart the jobstream from the last completed milestone if it failed in an unrecoverable segment of processing. The basic milestones are: sourcing from source systems, lookup preparation, dimensional transformation, fact transformation, aggregate transformation, target database load preparation, bulk inserting, and bulk updating. This is part of a Kimball'ian data warehouse/bus architecture, which is highly successful.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
Post Reply