A bit of a philosophical topic today: nothing to do with the technology, and more to do with decision making.
During one of my VMworld sessions I made a point that a few people later told me was very important and something they had not considered, so I thought I would share it here as well.
When people start negotiating SLAs around recovery time objectives (RTOs), they often get very tied to the technical components. They ask about things like the boot speed of VMs, resource pool sizing at the recovery site, and other factors that affect the recovery time for an application or service.
But something people often overlook is how long it takes to make the decision to fail over an environment and run through the recovery process. This matters because the decision is often the longest part of the outage!
Having a process for deciding when to hit the 'panic button' is handy. It helps you decide whether a recovery is necessary or whether the outage is temporary. I have seen an environment that *could* have been failed over with SRM to a remote site, but the team thought they 'almost had it' and sat with non-functional email for 8 hours rather than run the failover. Needless to say, their RTO was irrelevant because nobody could make the decision.
So it's just something to keep in mind: if your RTO is 3 hours, but you spend 2.5 hours of human time trying to decide whether a failover is necessary, you have only 30 minutes left for the recovery itself, and it is going to be very difficult to meet your RTO.
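To put rough numbers on that, here is a minimal Python sketch of the arithmetic. The function names and the one-hour technical recovery figure are my own assumptions for illustration; they are not something SRM or any other tool reports.

```python
from datetime import timedelta


def technical_budget(rto: timedelta, decision_time: timedelta) -> timedelta:
    """Time left for the actual failover once the decision has been made."""
    return rto - decision_time


def meets_rto(rto: timedelta, decision_time: timedelta,
              technical_recovery_time: timedelta) -> bool:
    """True if human decision time plus technical recovery fits inside the RTO."""
    return decision_time + technical_recovery_time <= rto


rto = timedelta(hours=3)
decision_time = timedelta(hours=2.5)

# Hypothetical figure: assume the failover itself takes one hour.
technical_recovery_time = timedelta(hours=1)

print(technical_budget(rto, decision_time))                    # 0:30:00
print(meets_rto(rto, decision_time, technical_recovery_time))  # False
```

With the same assumed one-hour failover, cutting the decision time to 30 minutes would leave plenty of room inside the 3 hour RTO, which is the whole argument for agreeing on the decision process in advance.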
In sum, RTO is not purely a technical calculation: you must also include in your RTO the time it takes for humans to decide that a failover is needed!