More on Fault Tolerance from the Blogs
[Updated with new entries below! -jmt]
As this round of the contest goes forward, here are some of the posts we've seen on FT. Drop us a line at vmtn@vmware.com and we'll post a link to your blog. There's some great info in here!
Eric Sloof - Fault Tolerance at your home lab
After publishing an article about the CPU compatibility with VMware Fault Tolerance, my search for a white CPU began. The vLockstep technology used by FT requires the physical processor extensions added to the latest processors from Intel and AMD. In order to run FT, a host must have an FT-capable processor, and both hosts running an FT VM pair must be in the same processor family.
Richard Garsthagen’s “CPU-Host-Info” shows all the available options on both the Intel Q9400 and Q9550 marked true. I’ve used the Intel Q8200 in another white box and it didn’t work, so in order to use FT, you need FT and both the VT options. The next step is to run through the Fault Tolerance Checklist.
Jason Boche - After enabling FT on a VM – subtleties to expect
In this particular instance, the underlying cause for this condition is that VMware Fault Tolerance (FT) has been enabled on the FT “primary” VM. The fact that the memory resource settings cannot be modified is by design and is used as a means to help guarantee the FT “secondary” VM stays in vLockstep with the primary. What has actually happened is that when FT was enabled on the VM, a memory reservation was set equal to the amount of memory configured for the VM. A full reservation eliminates the VMkernel swap file that the host manages for the VM; this holds in all cases, not just for FT-enabled VMs.
What other subtle changes can you expect when you enable VMware Fault Tolerance (FT) on a VM?
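To make the reservation point in Jason's excerpt concrete, here is a minimal sketch of the arithmetic. It assumes the per-VM VMkernel swap file is sized as configured memory minus the memory reservation; the function name and the 4096 MB VM are made up for illustration and come from none of the linked posts.

```python
def vswp_size_mb(configured_mb: int, reservation_mb: int) -> int:
    """Return the assumed size of a VM's VMkernel swap file in MB.

    Assumption: swap file size = configured memory - memory reservation.
    """
    if not 0 <= reservation_mb <= configured_mb:
        raise ValueError("reservation must be between 0 and configured memory")
    return configured_mb - reservation_mb


# Hypothetical 4096 MB VM with no reservation: swap covers all of its memory.
no_reservation = vswp_size_mb(configured_mb=4096, reservation_mb=0)

# Same VM once the reservation equals configured memory, as happens with FT.
full_reservation = vswp_size_mb(configured_mb=4096, reservation_mb=4096)

print(no_reservation, full_reservation)
```

With the reservation equal to the configured memory, the computed swap size drops to zero, which is the effect the excerpt describes.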
Roger Lund - VMware Fault Tolerance: What is it? What does it do? (Roger created a clear video of FT in action for this post.)
With the advent of vSphere, VMware has released a host of new features. Today I am going to talk about VMware Fault Tolerance. I’ll give you an overview, and talk to you about the requirements. Next I’ll walk you through the setup and configuration, and finally, we will discuss both the benefits and pitfalls of Fault Tolerance. Oh, and I will provide you with some links to documentation both through the blog, and again at the end. Just a little light reading for a rainy day, in case you get bored. I almost forgot! I will also show you a demo of Fault Tolerance, as I test failover. * Note: to see the video, please open this in a full window.
Rynardt Spies - VMware FT...Can you afford a SAN failure?
In today’s world where mission critical applications need to be available 24x7 with 99.99% availability, companies are throwing millions of dollars or pounds at implementing redundant and fault tolerant infrastructures. We all know that the money we spend today will save us much more in the future. Some companies make two to three million profit each and every day. In order to be competitive in the current climate, they need business applications such as messaging and collaboration to be available at all times. Imagine if a business with hundreds of employees one day suddenly lost the ability to send and receive email.
This may sound unheard of, but just this very week I’ve dealt with such a case where a company employing almost 10,000 people had no email, collaboration, database systems and even a corporate website for more than 24 hours, just because a critical component failed on their main SAN.
Cody Bunch - Scheduling VMware’s FT (Fault Tolerance)
One of the other use cases for FT that I find especially interesting came from episode 53 of the VMTN podcast. Using the ability to selectively turn FT on and off again for a specific VM, you can provide protection to long-running reports/jobs within your infrastructure. Say that accounting report that takes 3 days every month to run: now with FT, if the host dies 2.5 days in, the VM will still be processing, uninterrupted, on the other node.
On hearing that, I decided that, with a predictable workload like this, there is no reason it shouldn’t be scheduled. After all, clicking enable FT once a month is only cute the first time. How do we go about scheduling it? First you figure out how to enable FT using the PowerCLI (or have someone on Twitter point out the communities post it’s in)
Joep Piscaer - What I’ve learned from BC2961, “VMware Fault Tolerance Architecture and Performance”
VMware FT works by recording non-deterministic events or inputs to the VM (disk reads, network receives (or rx), keystrokes, etc) and certain CPU events like RDTSC and interrupts. Recording these requires way less logging than recording every single CPU instruction.
Because vLockStep does not record and replay complete CPU instructions, but only certain events, CPU usage isn’t identical on either host. This could lead to a difference in CPU usage, and can cause the ‘vLockStep interval’ or execution lag to increase. Whenever the secondary host is busy, the primary VM will have to wait for the secondary VM. If the secondary host catches up with the primary (because CPU utilization goes down), the interval decreases.
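The record-and-replay idea in this excerpt can be illustrated with a toy model. This is only a sketch of the general technique, not VMware's implementation: the "primary" logs each non-deterministic input it consumes, and the "secondary" feeds the same log into the same deterministic computation. All names here are invented for illustration.

```python
import random
from typing import Callable, List, Tuple


def run_workload(next_input: Callable[[], int], steps: int) -> int:
    """Deterministic computation; its only non-determinism is next_input()."""
    state = 0
    for _ in range(steps):
        value = next_input()
        # Many deterministic "instructions" follow each input.
        # None of them need to be logged.
        for _ in range(1000):
            state = (state * 31 + value) % 1_000_003
    return state


def run_primary(steps: int) -> Tuple[int, List[int]]:
    """Run live and record every non-deterministic input in order."""
    log: List[int] = []

    def live_input() -> int:
        # Stands in for a network receive, keystroke, disk read, and so on.
        value = random.randrange(256)
        log.append(value)
        return value

    return run_workload(live_input, steps), log


def run_secondary(log: List[int], steps: int) -> int:
    """Replay the recorded inputs instead of reading live ones."""
    replay = iter(log)
    return run_workload(lambda: next(replay), steps)


primary_state, event_log = run_primary(steps=5)
secondary_state = run_secondary(event_log, steps=5)
print(primary_state == secondary_state, len(event_log))
```

The log holds five entries while each run executes thousands of loop iterations, which mirrors the point that recording events needs far less logging than recording every instruction.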
Hany Michael - vSphere 4.0 Fault Tolerance (Architecture Diagram, Video and Use Cases) comes in with an architecture diagram, a video, and 3 really interesting use cases:
I’m now taking off my “VMware Evangelist” hat, and putting on the “VMware Customer” hat. What you’ll read here is my real-life use cases for FT, no marketing talk, no political debates. This “is” the real deal:
1 – Blackberry Enterprise Server & RoveIT Mobile Admin:
BES is one of our most business critical applications because it’s being used by our higher management in their day-to-day communications. Initially we were depending on HA since we didn’t think that our luck would be that bad to have an ESX host failure while one of the executives was sending an email. This continued to be the case until we deployed the RoveIT Mobile Admin & vCenter Mobile Access (with BES/MDS in the backend). We basically wanted to have 24/7 access for our SysAdmins to our entire IT environment (including the VMware VI 3.5) while they are on the go, using their Blackberry smart-phones (given by the corp for this specific purpose). This was mainly to improve our response time for emergency situations, and of course this service makes no sense unless it can tolerate the most severe situations of hardware failures. Enabling FT on both the BES and the Mobile Admin VMs allows us, on the one hand, to ensure that our executives will never complain that they can’t use their Blackberry whenever they need, and that “IT Suck”. On the other hand, we, the IT suckers..er..I mean SysAdmins & consultants, can have peace of mind that we will always be able to get to our backend systems whenever there is a problem that requires immediate attention.