
Monthly Archives: September 2009

VMware Fault Tolerance: winning blog post

Thanks to everyone who participated in the first cycle of our vSphere blogging contest. A great number of people decided to participate, and the amount of really useful content that was produced was incredible. The more useful information about virtualization that gets published on the Internet, the more all of us in the virtualization community benefit from that shared knowledge: a better toolkit for our virtualization projects. So thanks again to everyone who took the time and effort to participate! Blog entries spring forth only with a great deal of blood, sweat and time, and your efforts are appreciated. Here is the list of entries in this cycle. They are all well worth reading if you're interested in implementing Fault Tolerance:

See the list of blog entries about VMware Fault Tolerance: part 1 and part 2

Before we announce the winner, we'd also like an honorable mention to go to Eric Siebert and his entry: Master's guide to VMware Fault Tolerance. Although we didn't award him the prize this week, the judges thought that this was a great reference and one that's well worth bookmarking.

But overall, the judges awarded the prize for the best blog post in this first round of blogging about VMware Fault Tolerance to Hany Michael, for his entry, vSphere 4.0 Fault Tolerance (Architecture Diagram, Video and Use Cases).

The panel thought that Hany had a great way of explaining FT, a very nice diagram that was indeed worth a thousand words, a well-produced video, and some real-world use cases. A new enabling feature like FT brings 24 x 7 availability to workloads that previously would have been impractical to protect because of cost and complexity. A few good examples go a long way to explaining the new uses that VMware Fault Tolerance now makes practical. (Barry Coombs also mentioned a real-world use case.)

Thanks Hany and all the people who entered, and let's join the next cycle of the vSphere blogging contest, already in progress. In this cycle we're talking about the vNetwork Distributed Switch.

Even more on Fault Tolerance from the blogs

And the first cycle of our contest closes with these blog posts:

Eric Siebert – Master's Guide to VMware Fault Tolerance. Eric provides a comprehensive look at FT, tools to check the compatibility of your hardware, tips & best practices, and a list of links to further resources and his previous articles on FT. This is an article you may want to bookmark for future reference.

VII. So should you actually use FT? Enter SiteSurvey

Now that you’ve read all this, you might be wondering if you meet
the many requirements to use FT in your own environment. VMware
provides a utility called SiteSurvey
that will look at your infrastructure and see if it is capable of
running FT. It is available as either a Windows or Linux download and
once you install and run it, you will be prompted to connect to a
vCenter Server. Once it connects to the vCenter Server you can choose
from your available clusters to generate a SiteSurvey report that
shows whether or not your hosts support FT and if the hosts and VMs
meet the individual prerequisites to use the feature.

You can also click on links in the report that will give you
detailed information about all the prerequisites along with compatible
CPU charts. These links go to VMware’s website and display the help document for the SiteSurvey utility, which is full of great information, including some of the following prerequisites for FT.

Brian Atkinson – VMware Fault Tolerance Requirements and Limitations. Brian gives us a list of tips & considerations for FT brought together from a number of places.

This blog entry continues to get a lot of hits, so I thought I would
keep it updated and reformat it a bit. VMware's Fault Tolerance is a
great feature that has generated a lot of interest, and it is also a
new feature of vSphere that will only continue to improve. With that
being said, the list below is the current state of requirements and
limitations for enabling FT virtual machines in vSphere. The majority
of this information came from the vSphere Pre-requisites Checklist, the VMware Fault Tolerance Datasheet and the Availability Guide. Other items were picked up in the forums or in the VMware knowledge base. KB article 1010601, "Understanding VMware Fault Tolerance", is a great KB resource to start with if you are new to this feature.

David Strebel – VMware Fault Tolerance Setup and Best Practices hits the FT highlights for a quick overview.

Here are a few of the best practices VMware recommends when using Fault Tolerance:

  • Use multiple NICs, HBAs, etc. for redundancy
  • Isolate vMotion and FT Logging traffic
  • Use consistent power management settings
  • Limit the number of FT VMs on a host to four
  • Use a 10Gb NIC for FT logging
  • Synchronize guest OS time
  • Use NTP on ESX servers

Barry Coombs – VMware FT gives his perspectives, some use cases, and a very nice screenshot tutorial for how to use FT. 

There are already numerous white papers and technical documents
surrounding VMware FT, so I didn’t want this to become just a rewrite of
one of them. Instead, I thought I would share some use cases that I have
experienced over my past few months installing vSphere, some findings
to help you get started, and a step-by-step overview of enabling FT
on your VMs.  The work I undertake is mainly in the SMB environment, so
this may give those who work mainly with Enterprise
environments another view, and will hopefully also be useful to those at
all levels who are considering or want to know more about VMware FT. …

I have recently completed a project for a legal firm. Their Exchange
server was at the heart of their business, and any downtime during the
limited amount of time their barristers have to access email could be
extremely costly. Although VMware HA would cover their risk of server
failure, they couldn’t afford to trust an unclean power down, and the
number of calls to deal with whilst the servers were booted on another
host would be huge. With only 2 IT staff managing their IT
infrastructure, learning, monitoring and maintaining a complicated
replication or clustering technology would not be possible. So after
viewing a demo of VMware FT it was clear this was a must-have feature
for them. After initial analysis and ensuring their Exchange server
could work within the limitations of FT, their Exchange server was
virtualised and is now protected with VMware FT.

Didier Pironet – To FT Or Not To FT? asks himself the following questions with respect to VMware HA, VMware FT, and Microsoft clusters

You need to ask yourself the right questions:

  1. How long a downtime can I afford?
  2. How complex can my high availability setup be?

VMware Fault Tolerance, single vCPU workloads, and performance on modern hardware

Two recent posts on VMware's VROOM! Blog, written by Todd Muirhead (@virtualTodd) of our performance team, look at FT performance. The upshot? FT can currently be used only on VMs with one vCPU. With current hardware, however, this may not be as much of a limitation as you think! In the most recent post, Todd walks us through a case where a single-vCPU VM on Nehalem does just as much work as a two-vCPU VM running on slightly older parts.

Comparing Performance of 1vCPU Nehalem VM with 2vCPU Harpertown VM

There are a couple of interesting things to note about the results. 

The first is that the sendmail average latency with FT enabled on a 1vCPU Xeon 5570 based VM with 1500 users was within 5ms of the 2vCPU Xeon 5460 VM with 2000 users.  This means that the Nehalem based 1vCPU VM was supporting 50% more users per vCPU than the 2vCPU Harpertown based VM.

The second is that average CPU utilization on the 1vCPU VM with 2000 users and FT enabled was only 45%, which leaves headroom for spikes in usage.  This means that 2000 heavy online LoadGen users ran comfortably in a 1vCPU VM.

Conclusion

A 1vCPU Xeon X5500 series based Exchange Server VM can support 50% more users per core than a 2vCPU VM based on previous generation processors while maintaining the same level of performance in terms of Sendmail latency.  This is accomplished while the VM’s CPU utilization remains below 50%, allowing plenty of capacity for peaks in workload and making an FT VM practical for use with Exchange Server 2007.
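
The 50% figure follows directly from the user counts quoted above: 1500 users on one vCPU versus 2000 users spread over two. A minimal Python sketch of that arithmetic (the function name is ours, purely for illustration):

```python
def users_per_vcpu(users: int, vcpus: int) -> float:
    """Users supported per virtual CPU at comparable Sendmail latency."""
    return users / vcpus

# Figures taken from the excerpt above.
nehalem = users_per_vcpu(1500, 1)      # 1vCPU Xeon 5570 VM, FT enabled
harpertown = users_per_vcpu(2000, 2)   # 2vCPU Xeon 5460 VM

# Relative gain of the Nehalem VM over the Harpertown VM, in percent.
gain_percent = (nehalem / harpertown - 1) * 100

print(f"Nehalem:    {nehalem:.0f} users per vCPU")
print(f"Harpertown: {harpertown:.0f} users per vCPU")
print(f"Gain:       {gain_percent:.0f}%")
```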

In his previous post, Todd looks at the performance impact of FT on the Microsoft Exchange workloads.

Performance of Exchange Server 2007 in a Fault Tolerant Virtual Machine

The testing showed that the performance of the Exchange VM was affected only slightly when FT was used. Sendmail average latency increased by 10 to 13 milliseconds, and 95th percentile average latency increased by 33 to 45 milliseconds.  All test results were under the 1000ms threshold at which user experience starts to degrade.  These results indicate that, even at 2000 users, the performance of Exchange on a 1 vCPU VM was acceptable with or without FT.

[Figure: Sendmail latency graphs, with FT]

The CPU utilization results for the overall system show a low impact of using FT.  Because the Exchange VM was the only one on the ESX server, overall system utilization was very low with a peak of just over 7% in the most stressful test.  Enabling FT only caused an additional 1 to 1.5% of system CPU to be used.  The utilization of the ESX host with the secondary VM was slightly lower than the primary.  When examining the CPU utilization of the 1 vCPU VM, the utilization average reaches just under 45%.  This is a comfortable level that still leaves room for the bursty nature of Exchange. 

More on Fault Tolerance from the Blogs

[Updated with new entries below! -jmt]

As this round of the contest goes forward, here are some of the posts we've seen on FT. Drop us a line at vmtn@vmware.com and we'll post a link to your blog. There's some great info in here!

Eric Sloof – Fault Tolerance at your home lab

After publishing an article about the CPU compatibility with VMware Fault Tolerance,
my search for a white CPU began. The vLockstep technology used by FT
requires the physical processor extensions added to the latest
processors from Intel and AMD. In order to run FT, a host must have an
FT-capable processor, and both hosts running an FT VM pair must be in
the same processor family.

Richard Garsthagen’s “CPU-Host-Info”
shows all the available options on both the Intel Q9400 and Q9550
marked true. I’ve used the Intel Q8200 in another white box and it
didn’t work, so in order to use FT, you need FT and both the VT
options. The next step is to run through the Fault Tolerance Checklist.

Jason Boche – After enabling FT on a VM – subtleties to expect

In this particular instance, the underlying cause for this condition
is that VMware Fault Tolerance (FT) has been enabled on the FT “primary” VM.
The fact that the memory resource settings cannot be modified is by
design and is used as a means to help guarantee the FT “secondary” VM
stays in vLockstep with the primary. What has actually happened is that
when FT was enabled on the VM, a memory reservation was set equal to
the amount of memory configured for the VM. This eliminates VMkernel
swap file for the VM managed by the host in all cases, not just for FT
enabled VMs.
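
To see why a full reservation removes the swap file, recall that the per-VM VMkernel swap file is sized as configured memory minus the memory reservation. The Python sketch below only illustrates that subtraction; the function name and the 4096 MB figure are made up for the example and are not taken from Jason's post:

```python
def vmkernel_swap_mb(configured_mb: int, reservation_mb: int) -> int:
    """Size of the per-VM VMkernel swap file, in MB.

    Assumes the usual sizing rule: configured memory minus reservation.
    """
    if not 0 <= reservation_mb <= configured_mb:
        raise ValueError("reservation must be between 0 and configured memory")
    return configured_mb - reservation_mb

# Hypothetical 4 GB VM with no reservation: the swap file covers all of it.
before_ft = vmkernel_swap_mb(4096, 0)

# After enabling FT the reservation equals configured memory, so nothing is left to swap.
after_ft = vmkernel_swap_mb(4096, 4096)

print(before_ft, after_ft)
```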

What other subtle changes can you expect when you enable VMware Fault Tolerance (FT) on a VM?

Roger Lund – VMware Fault Tolerance: What is it? What does it do? (Roger created a clear video of FT in action for this post.)

With the advent of vSphere, VMware has released a host of new features. Today I am going to talk about VMware Fault Tolerance.
I’ll give you an overview, and talk to you about the requirements. Next
I’ll walk you through the setup and configuration, and finally, we
will discuss both the benefits and pitfalls of Fault Tolerance.
Oh, and I will provide you with some links to documentation both
through the blog, and again at the end. Just a little light reading for
a rainy day, in case you get bored. I almost forgot! I will also show
you a demo of Fault Tolerance, as I test failover. (Note: to see the
video, please open this in a full window.)

Rynardt Spies – VMware FT…Can you afford a SAN failure?

In today’s world where mission critical applications
need to be available 24×7 with 99.99% availability, companies are
throwing millions of dollars or pounds at implementing redundant and
fault tolerant infrastructures. We all know that the money we spend
today will save us much more in the future. Some companies make two to
three million profit each and every day. In order to be competitive in
the current climate, they need business applications such as messaging
and collaboration to be available at all times. Imagine if a business
with hundreds of employees one day suddenly lost the ability to send
and receive email.

This may sound unheard of, but
just this very week I’ve dealt with such a case where a company
employing almost 10,000 people had no email, collaboration, database
systems and even a corporate website for more than 24 hours, just
because a critical component failed on their main SAN.
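
To put the 99.99% target next to that 24-hour outage, here is a rough Python sketch of the downtime budget it implies. The helper name is ours, and the calculation assumes a 365-day year:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years

def downtime_budget_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    return MINUTES_PER_YEAR * (100.0 - availability_percent) / 100.0

budget = downtime_budget_minutes(99.99)   # yearly allowance at "four nines"
outage = 24 * 60                          # the 24-hour outage, in minutes
years_of_budget = outage / budget         # how many years of allowance it consumed

print(f"Yearly budget at 99.99%: {budget:.2f} minutes")
print(f"A 24-hour outage uses about {years_of_budget:.1f} years of that budget")
```

In other words, four nines allows less than an hour of downtime per year, so a single day-long SAN failure exhausts decades of allowance.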

Cody Bunch – Scheduling VMware’s FT (Fault Tolerance)

One of the other use cases for FT that I find especially interesting came from episode 53 of the VMTN podcast.
Using the ability to selectively turn FT on and off again for a
specific VM, you can provide protection to long running reports/jobs
within your infrastructure. Say that accounting report that takes 3
days every month to run: now with FT, if the host dies 2.5 days in, the VM
will still be processing, uninterrupted, on the other node.

On hearing that, I decided that, with a predictable workload like
this, there is no reason it shouldn’t be scheduled. After all, clicking
enable FT once a month is only cute the first time. How do we go about
scheduling it? First you figure out how to enable FT using the PowerCLI
(or have someone on Twitter point out the communities post it’s in :-)

Joep Piscaer – What I’ve learned from BC2961, “VMware Fault Tolerance Architecture and Performance”

VMware FT works by recording non-deterministic events or inputs to the VM
(disk reads, network receives (or rx), keystrokes, etc) and certain CPU
events like RDTSC and interrupts. Recording these requires way less
logging than recording every single CPU instruction.

Because vLockStep does not record and replay complete CPU
instructions, but only certain events, CPU usage isn’t identical on
either host. This could lead to a difference in CPU usage, and can
cause the ‘vLockStep interval’ or execution lag to increase. Whenever
the secondary host is busy, the primary VM will have to wait for the
secondary VM. If the secondary host catches up with the primary
(because CPU utilization goes down), the interval decreases.
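
The record-and-replay idea can be shown with a toy model. The Python sketch below is not how vLockstep is implemented; `ToyVM`, `run_primary` and `run_secondary` are hypothetical names, and the only point is that a deterministic machine fed the same logged inputs ends up in the same state:

```python
import random
from typing import Callable, List, Tuple

class ToyVM:
    """Deterministic state machine: its state depends only on the events it consumes."""

    def __init__(self) -> None:
        self.state = 0

    def step(self, event: int) -> None:
        # Any deterministic function of (state, event) would do here.
        self.state = (self.state * 31 + event) % 1_000_003

def run_primary(steps: int, source: Callable[[], int]) -> Tuple[ToyVM, List[int]]:
    """Run the primary, logging every non-deterministic input it receives."""
    vm = ToyVM()
    log: List[int] = []
    for _ in range(steps):
        event = source()     # e.g. a network receive or a keystroke
        log.append(event)    # record only the input, not the instructions executed
        vm.step(event)
    return vm, log

def run_secondary(log: List[int]) -> ToyVM:
    """Replay the logged inputs on a fresh VM."""
    vm = ToyVM()
    for event in log:
        vm.step(event)
    return vm

primary, log = run_primary(1000, lambda: random.randint(0, 255))
secondary = run_secondary(log)
print(primary.state == secondary.state)
```

The log holds one entry per input rather than one per instruction, which is why recording events needs far less logging than recording every CPU instruction.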

Hany Michael – vSphere 4.0 Fault Tolerance (Architecture Diagram, Video and Use Cases) comes in with an architecture diagram, a video, and 3 really interesting use cases:

I’m now taking off my “VMware Evangelist” hat, and putting on the
“VMware Customer” hat. What you’ll read here are my real-life use cases
for FT, no marketing talk, no political debates. This “is” the real
deal:

1 – Blackberry Enterprise Server & RoveIT Mobile Admin:
BES is one of our most business critical applications because it’s being
used by our higher management in their day-to-day communications.
Initially we were depending on HA since we didn’t think that our luck
would be that bad to have an ESX host failure while one of the
executives was sending an email.

This continued to be the case until we deployed the RoveIT Mobile Admin & vCenter Mobile Access
(with BES/MDS in the backend). We basically wanted to have a 24/7
access for our SysAdmins to our entire IT environment (including the
VMware VI 3.5) while they are on the go, using their Blackberry
smart-phones (given by the corp for this specific purpose). This was
mainly to improve our response time for emergency situations, and of
course this service makes no sense unless it can tolerate the most
severe situations of hardware failures. Enabling FT on both the BES and
the Mobile Admin VMs allows us, on the one hand, to ensure that our
executives will never complain that they can’t use their Blackberry
whenever they need, and that “IT Suck”. On the other hand, we, the IT
suckers..er..i mean SysAdmins & consultants, can have peace of
mind that we will always be able to get to our backend systems whenever
there is a problem that requires immediate attention.

On the Size of ESXi

In an earlier post, we explained how ESXi is just as fully-featured as classic ESX, and what it can do just depends upon what features are licensed in a given deployment.  Some people are surprised that this much functionality can be provided in such a small package.  Our colleague Eric Gray has posted a two-part series on his vCritical blog that goes into great detail about this.  The first part dissects an ESXi image and shows how, even though an ESXi deployment requires about 1 GB of disk or flash memory space, the actual ESXi image is only around 70 MB (the rest being used for other purposes).  The second part shows how to get a full-blown ESXi system up and running on a mere 64MB USB stick.

Of course, we don't recommend squeezing ESXi into a smaller space than what we preconfigure, since that extra content (such as vSphere Client and VMware Tools) is all useful stuff.  But if you want to satisfy your curiosity, then check out those blog posts.

Kicking it off: VMware vSphere Blogging Contest

Welcome to the new VMware vSphere Blog! This will be the central place to check for news, commentary, links to new resources, and other information about VMware vSphere. Feedback about what you'd like to see here is always welcome. Your host and editor will be Mike Adams.

To kick off this shiny new blog, we wanted to start a discussion, and we thought it would be fun to combine it with a contest, to add a bit of bragging rights and put some skin in the game. And since vSphere is the most comprehensive business virtualization platform on the planet, we thought one contest wasn't enough, so we decided to have seven. The contest will be held in two-week cycles, with each cycle focusing on a different feature of vSphere. During each cycle, we will discuss the feature here on this blog and point to other blogs in the discussion. At the end of each cycle, we'll announce a winner and feature them here.

Contest Overview - In the past, a key benefit of virtualization was the ROI from server consolidation. With vSphere 4, you can now use features like Fault Tolerance, Distributed Resource Scheduler, vShield Zones, the vNetwork Distributed Switch, and Storage vMotion to abstract your physical infrastructure and create a dynamic, flexible infrastructure, with a return on investment that goes far beyond simple server consolidation. The ease of use, power, and flexibility of these features are found only with VMware vSphere.

For this contest, we're looking for bloggers to discuss novel technical or business ROI topics, not just repetitions of a marketing message. Examples of good topics are hands-on stories about your data center or lab experiences, technical evaluations, tutorials, use case stories, or other insights into the feature you've gained while working with it. Our criteria include technical correctness, value to the community, creativity, and "interestingness": is this something you would link to from your blog?

A short post in many cases can be more interesting than a longer
post, and they do say a picture is worth a thousand words, so we hope
that everybody will have time to contribute, even if you're heads down
in your data center that week.

Virtual Lab Time. Since we know that not everyone has had a chance to check out all the new features of vSphere, we are also offering free lab access to bloggers who are participating in the contest. Contact Mike Adams (madams@vmware.com) to schedule some time to remotely connect to our virtual lab.

Schedule. Here is the planned schedule for the rest of the year. (We reserve the right to switch things around if necessary, and we're open to other topics after that, so let us know if you have other ideas.)

  • Fault Tolerance (9/14-9/25)
  • vNetwork Distributed Switch (9/28-10/09)
  • Data Recovery (10/12-10/23)
  • Thin Provisioning (10/26-11/06)
  • vShield Zones (11/09-11/20)
  • ESXi (11/30-12/11)
  • Tier 1 Apps (12/14-12/28)

Rules

  1. During the two-week period, post a blog entry about the feature we're focusing on according to the criteria outlined above.
  2. You may add a disclaimer that you are blogging for the contest.
  3. You may link to or republish older, already-published blog entries for the contest, but some additional new content must be added.
  4. You may blog multiple times in the two-week period, and each entry will be considered for the contest. If you create a multi-part entry, consider having one overview post that links to all the other pieces.
  5. Blogging is often about dialog. You may respond to or build upon another blogger's post in your entry, but only your original contribution will be considered for the contest. You may revise or correct entries if needed throughout the cycle, and only the final form of the entry will be considered.
  6. The contest is open to all countries, but VMware employees are not eligible.
  7. If you blog anonymously for work reasons, you must let us know who you
    are so that we know where to send the prize, but we are happy to award the
    prize publicly to your pseudonym.
  8. Entry eligibility ends at the close of business (Pacific time) on Friday of the second week of each cycle.
  9. Send a note to vmtn@vmware.com to submit your entry for consideration. Your email will not be used for marketing purposes.
  10. A panel of 5 at VMware (John Troyer – Head of VMware Communities, Mike Adams – Blogger editor for vSphere and Product Marketing Manager for vSphere, John Gilmartin – Director of Product Marketing for vSphere, Tim Stephan – Senior Director of Competitive Marketing, and the technical marketing lead for the specific product) will select the winning entry in each two-week cycle. The winning blog entry wins a $100 AMEX gift card and will be featured on the vSphere Blog. All decisions are final.
  11. Official contest rules can be found here

First Contest Cycle: Fault Tolerance

We're kicking things off with Fault Tolerance. Over the next two weeks, we'll be blogging, tweeting, and linking to various Fault Tolerance resources. We think FT is a ground-breaking capability that lets you add continuous availability to a new class of business-critical workloads. 

What do you think? Let the games begin!

- Mike Adams, John Troyer, and the vSphere team