
More on Fault Tolerance from the Blogs

[Updated with new entries below! -jmt]

As this round of the contest goes forward, here are some of the posts we've seen on FT. Drop us a line at [email protected] and we'll post a link to your blog. There's some great info in here!

Eric Sloof – Fault Tolerance at your home lab

After I published an article about CPU compatibility with VMware Fault Tolerance,
my search for a white box CPU began. The vLockstep technology used by FT
requires the physical processor extensions added to the latest
processors from Intel and AMD. In order to run FT, a host must have an
FT-capable processor, and both hosts running an FT VM pair must be in
the same processor family.

Richard Garsthagen’s “CPU-Host-Info”
shows all the available options on both the Intel Q9400 and Q9550
marked true. I’ve used the Intel Q8200 in another white box and it
didn’t work, so in order to use FT, you need FT and both the VT
options. The next step is to run through the Fault Tolerance Checklist.
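
To make the pairing rule from Eric's post concrete, here is a minimal Python sketch. It is not part of his post or of any VMware tool: the `Host` record, the `ft_capable` flag, and the family labels are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    name: str
    cpu_family: str   # hypothetical label for the processor family
    ft_capable: bool  # hypothetical flag: CPU has the extensions FT needs


def can_pair_for_ft(a: Host, b: Host) -> bool:
    """Both hosts must be FT-capable and in the same processor family."""
    return a.ft_capable and b.ft_capable and a.cpu_family == b.cpu_family


# Made-up lab inventory; the family labels are placeholders, not real CPU data.
box1 = Host("whitebox-1", "family-x", True)
box2 = Host("whitebox-2", "family-x", True)
box3 = Host("whitebox-3", "family-y", True)
box4 = Host("whitebox-4", "family-x", False)

for other in (box2, box3, box4):
    print(box1.name, "+", other.name, "->", can_pair_for_ft(box1, other))
```

Only the first pair passes: the second fails on processor family and the third on FT capability.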

Jason Boche – After enabling FT on a VM – subtleties to expect

In this particular instance, the underlying cause for this condition
is that VMware Fault Tolerance (FT) has been enabled on the FT “primary” VM.
The fact that the memory resource settings cannot be modified is by
design and is used as a means to help guarantee the FT “secondary” VM
stays in vLockstep with the primary. What has actually happened is that
when FT was enabled on the VM, a memory reservation was set equal to
the amount of memory configured for the VM. This eliminates the VMkernel
swap file that the host manages for the VM, and it does so in all cases,
not just for FT-enabled VMs.
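
The arithmetic behind that last point can be sketched in a few lines of Python. This assumes the VMkernel swap file is sized as configured memory minus the memory reservation; the function name is made up for illustration and is not a VMware API.

```python
def vmkernel_swap_mb(configured_mb: int, reservation_mb: int) -> int:
    """Swap file size in MB, assuming size = configured memory - reservation."""
    if not 0 <= reservation_mb <= configured_mb:
        raise ValueError("reservation must be between 0 and configured memory")
    return configured_mb - reservation_mb


# A 4 GB VM with a 1 GB reservation still needs swap space under this model.
print(vmkernel_swap_mb(4096, 1024))

# With FT enabled the reservation equals the configured memory, so no swap is left.
print(vmkernel_swap_mb(4096, 4096))
```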

What other subtle changes can you expect when you enable VMware Fault Tolerance (FT) on a VM?

Roger Lund – VMware Fault Tolerance: What is it? What does it do? (Roger created a clear video of FT in action for this post.)

With the advent of vSphere, VMware has released a host of new features. Today I am going to talk about VMware Fault Tolerance.
I’ll give you an overview and talk to you about the requirements. Next
I’ll walk you through the setup and configuration, and finally, we
will discuss both the benefits and pitfalls of Fault Tolerance.
Oh, and I will provide you with some links to documentation both
through the blog, and again at the end. Just a little light reading for
a rainy day, in case you get bored. I almost forgot! I will also show
you a demo of Fault Tolerance as I test failover. (Note: to see the
video, please open this in a full window.)

Rynardt Spies – VMware FT…Can you afford a SAN failure?

In today’s world where mission critical applications
need to be available 24×7 with 99.99% availability, companies are
throwing millions of dollars or pounds at implementing redundant and
fault tolerant infrastructures. We all know that the money we spend
today will save us much more in the future. Some companies make two to
three million profit each and every day. In order to be competitive in
the current climate, they need business applications such as messaging
and collaboration to be available at all times. Imagine if a business
with hundreds of employees one day suddenly lost the ability to send
and receive email.

This may sound unheard of, but
just this very week I’ve dealt with such a case where a company
employing almost 10,000 people had no email, collaboration, database
systems and even a corporate website for more than 24 hours, just
because a critical component failed on their main SAN.

Cody Bunch – Scheduling VMware’s FT (Fault Tolerance)

One of the other use cases for FT that I find especially interesting came from episode 53 of the VMTN podcast.
Using the ability to selectively turn FT on and off again for a
specific VM, you can provide protection to long-running reports/jobs
within your infrastructure. Take that accounting report that takes 3
days every month to run: now with FT, if a host dies 2.5 days in, the VM
will still be processing, uninterrupted, on the other node.

On hearing that, I decided that, with a predictable workload like
this, there is no reason it shouldn’t be scheduled. After all, clicking
enable FT once a month is only cute the first time. How do we go about
scheduling it? First you figure out how to enable FT using the PowerCLI
(or have someone on Twitter point out the communities post it’s in :-)

Joep Piscaer – What I’ve learned from BC2961, “VMware Fault Tolerance Architecture and Performance”

VMware FT works by recording non-deterministic events or inputs to the VM
(disk reads, network receives (or rx), keystrokes, etc.) and certain CPU
events like RDTSC and interrupts. Recording these requires way less
logging than recording every single CPU instruction.
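
The record/replay idea can be illustrated with a toy model in Python. This is only a sketch of the general technique, not VMware's implementation: `ToyVM`, the state update rule, and the 10% event rate are all invented for the example.

```python
import random


class ToyVM:
    """A machine that is deterministic except for external inputs."""

    def __init__(self):
        self.state = 0

    def step(self, external_input=None):
        # Deterministic part: needs no logging, both sides compute it alike.
        self.state = (self.state * 31 + 7) % 1_000_003
        # Non-deterministic part: an input that arrives from outside.
        if external_input is not None:
            self.state ^= external_input


def run_primary(n_steps, seed):
    """Run the VM and log only the external inputs, tagged with the step index."""
    rng = random.Random(seed)
    vm, log = ToyVM(), []
    for i in range(n_steps):
        event = rng.randrange(256) if rng.random() < 0.1 else None
        if event is not None:
            log.append((i, event))
        vm.step(event)
    return vm, log


def run_secondary(n_steps, log):
    """Replay the logged inputs at exactly the same step indexes."""
    vm, pending = ToyVM(), dict(log)
    for i in range(n_steps):
        vm.step(pending.get(i))
    return vm


primary, log = run_primary(1000, seed=42)
secondary = run_secondary(1000, log)
print(primary.state == secondary.state, len(log), "log entries for 1000 steps")
```

The secondary reaches the same state even though the log holds far fewer entries than there were steps, which is the point Joep makes about logging volume.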

Because vLockStep does not record and replay complete CPU
instructions, but only certain events, CPU usage isn’t identical on
the two hosts. This difference can cause the ‘vLockStep interval’,
or execution lag, to increase. Whenever
the secondary host is busy, the primary VM will have to wait for the
secondary VM. Once the secondary catches up with the primary
(because CPU utilization goes down), the interval decreases.
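
A rough way to picture the interval growing and shrinking is a simple producer/consumer model, sketched here in Python. The rates, the lag cap, and the function name are assumptions made for the example; real FT behaviour is more involved.

```python
def simulate_lag(primary_rate, secondary_rates, max_lag):
    """Return a (produced, lag) pair for each tick.

    The primary produces up to primary_rate log entries per tick but has to
    wait once the backlog would exceed max_lag. The secondary replays up to
    its own per-tick rate, which drops while its host is busy.
    """
    lag, history = 0, []
    for secondary_rate in secondary_rates:
        produced = min(primary_rate, max_lag - lag)  # primary waits at the cap
        lag += produced
        lag -= min(lag, secondary_rate)              # secondary replays entries
        history.append((produced, lag))
    return history


# The secondary host is busy for three ticks, then catches up.
for produced, lag in simulate_lag(10, [10, 4, 4, 4, 12, 12], max_lag=15):
    print(f"primary produced {produced:2d}, lag now {lag:2d}")
```

In this run the lag climbs while the secondary is slow, the primary's output is throttled once the cap is reached, and the lag falls again when the secondary has spare capacity.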

Hany Michael – vSphere 4.0 Fault Tolerance (Architecture Diagram, Video and Use Cases) comes in with an architecture diagram, a video, and 3 really interesting use cases:

I’m now taking off my “VMware Evangelist” hat and putting on the
“VMware Customer” hat. What you’ll read here are my real-life use cases
for FT: no marketing talk, no political debates. This “is” the real
deal:

1 – Blackberry Enterprise Server & RoveIT Mobile Admin:
BES is one of our most business-critical applications because it’s being
used by our higher management in their day-to-day communications.
Initially we were depending on HA since we didn’t think that our luck
would be that bad to have an ESX host failure while one of the
executives was sending an email.

This continued to be the case until we deployed the RoveIT Mobile Admin & vCenter Mobile Access
(with BES/MDS in the backend). We basically wanted to have 24/7
access for our SysAdmins to our entire IT environment (including the
VMware VI 3.5) while they are on the go, using their Blackberry
smartphones (given by the corp for this specific purpose). This was
mainly to improve our response time for emergency situations, and of
course this service makes no sense unless it can tolerate the most
severe situations of hardware failures. Enabling FT on both the BES and
the Mobile Admin VMs allows us, on the one hand, to ensure that our
executives will never complain that they can’t use their Blackberry
whenever they need it, and that “IT Suck”. On the other hand, we, the IT
suckers..er..I mean SysAdmins & consultants, can have peace of
mind that we will always be able to get to our backend systems whenever
there is a problem that requires immediate attention.