VMware Infrastructure 3

Is Quick Migration “good enough”?

Is Microsoft’s Quick Migration in Hyper-V good enough? Well, it certainly won’t keep your services running without interruption. From the VMware: Virtual Reality blog, “Reviving the Dormant Grand Architectures of IT with VMotion”:

It’s important to know that Microsoft dropped plans for live migration in Hyper-V and is relying on a “not quite live” migration method it calls “Quick Migration.” Microsoft Quick Migration works very differently from the iterative live memory transfer method used by VMware VMotion. Quick Migration fully suspends a VM, copies its memory image to disk, and then reloads and resumes the VM on a new host. That suspend/resume migration technique is far from live. In fact, Microsoft has documented (slide 47) that, even in ideal conditions, Quick Migration interrupts VMs for between eight seconds and two minutes when using Gigabit-speed networked storage, depending on VM memory size. …
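To make that difference concrete, here is a toy, back-of-the-napkin simulation (not VMware’s or Microsoft’s actual code) of the two approaches. The page counts and copy rates are made-up numbers chosen only to show the shape of the math:

```python
PAGE_COUNT = 1024     # toy VM with 1024 memory pages
COPY_RATE = 100       # pages copied per "tick" over the storage/network link
DIRTY_RATE = 10       # pages a running guest re-dirties per tick

def quick_migration_downtime():
    """Suspend/resume: the guest is down for the entire memory copy."""
    ticks, remaining = 0, PAGE_COUNT
    while remaining > 0:          # the VM is suspended for every one of these
        remaining -= COPY_RATE
        ticks += 1
    return ticks

def live_migration_downtime():
    """Iterative pre-copy: the guest keeps running while memory converges."""
    dirty = PAGE_COUNT
    while dirty > COPY_RATE:      # copy passes happen while the guest runs
        dirty = max(0, dirty - COPY_RATE) + DIRTY_RATE
    return 1                      # only the final stop-and-copy pass is downtime

print("quick migration:", quick_migration_downtime(), "ticks of downtime")
print("live migration: ", live_migration_downtime(), "tick of downtime")
```

The suspend/resume downtime grows linearly with VM memory size, which is exactly why Microsoft’s own numbers range from seconds to minutes.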

Unfortunately, that kind of downtime is more than most networked applications can tolerate. Just a few seconds of unresponsiveness will trigger TCP timeouts and application errors. We tried Quick Migration with the Hyper-V beta using Gigabit iSCSI storage connections, and the results weren’t pretty, as you can see in this screen capture video.

The Quick Migration downtime caused file copies to fail, VM console connections were severed, and database clients had to be restarted. Scheduling planned maintenance downtime and telling users their apps will be down does not fit anyone’s definition of “Dynamic IT.” In contrast, migrating the same VM with VMotion on a VMware Infrastructure platform didn’t cause even a blip in the network sessions, as this video shows.
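For anyone who wants to see why those failures happen at the socket level, here is a minimal sketch, assuming a hypothetical service address inside the VM (HOST and PORT are placeholders, not anything from the original post):

```python
import socket

# Most applications wrap their sockets in timeouts far shorter than Quick
# Migration's documented 8-120 second window, so a mid-copy suspend looks
# like a dead server to them.
HOST, PORT = "vm.example.com", 5432   # hypothetical database address

def query_with_timeout(payload: bytes, timeout_s: float = 5.0) -> bytes:
    with socket.create_connection((HOST, PORT), timeout=timeout_s) as sock:
        sock.settimeout(timeout_s)
        sock.sendall(payload)
        # If the VM is suspended mid-migration, no TCP stack ACKs or answers;
        # recv() blocks until the timeout fires and the client sees an error.
        return sock.recv(4096)

try:
    query_with_timeout(b"SELECT 1")
except socket.timeout:
    # This is the failure mode described above: the client gives up and the
    # application layer has to handle (or surface) the error.
    print("server unresponsive: client-side timeout fired")
```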

You should read the whole thing to get the big picture.

For some interesting background on why these failures occur and why you don’t see them with MSCS, along with some stories from customers who depend on VMotion and DRS every day in their businesses, check out these two blog posts from Mike D.

Part I: Quick Migration vs VMware VMotion and Live Migration – Why Things Fail with Quick Migration

Quick Migration uses MSCS to coordinate the failover of a VM from one node to another. However, there is a BIG difference between the network failover that occurs with a normal cluster failover inside the OS and the cluster failover that occurs during a Quick Migration. With a Quick Migration setup there is no virtual host IP serviced by MSCS. The IP address and all communication to the Guest OS in the VM are controlled by the network protocol stack running inside the VM. If the VM is not running, then there is nothing responding to the VM’s IP address. During a Quick Migration you actually suspend the VM to disk, fail over the disk resource, and then unsuspend the VM on the second host. During this transition there is no network stack up and running and ready to reply to requests for the VM’s IP address.
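A rough probe like the following makes that point measurable: while the VM is suspended there is simply nothing at its IP address to complete a TCP handshake, so every connection attempt goes unanswered. VM_IP and PORT here are hypothetical placeholders:

```python
import socket
import time

# Repeatedly try to open a TCP connection to a service in the VM and report
# how long the VM stays dark during a migration.
VM_IP, PORT = "192.0.2.10", 80    # placeholder address and port

def vm_answers(timeout_s: float = 1.0) -> bool:
    try:
        with socket.create_connection((VM_IP, PORT), timeout=timeout_s):
            return True
    except OSError:               # timeout, refused, unreachable: nobody home
        return False

outage_started = None
while True:
    now = time.monotonic()
    if vm_answers():
        if outage_started is not None:
            print(f"VM dark for {now - outage_started:.1f}s")
            outage_started = None
    elif outage_started is None:
        outage_started = now
    time.sleep(0.5)
```

With MSCS inside the OS, by contrast, the cluster service moves a virtual IP to a surviving node, so a probe like this sees only a brief blip rather than a multi-second black hole.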

Part II: Quick Migration vs VMware VMotion and Live Migration – The Financial Impact

The first customer in question is in the financial market. … For this particular application the customer says that every 20 seconds of downtime equates to $800,000 in lost revenue for the institution. The application will sit at very low utilization for most of the day even though it’s still doing trades. Every once in a while the application spikes and gets really busy. … VMware’s Distributed Resource Scheduler (DRS) can see these application spikes and automatically initiate a VMware VMotion to another host in the environment that has enough resources. … For this particular client, the money lost to Quick Migration downtime will pay for all of their VMware licenses with just one migration. … That’s a truly dynamic datacenter. …
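The arithmetic behind that claim is easy to check against Microsoft’s own 8-second-to-2-minute figure quoted earlier:

```python
# Back-of-the-envelope math for the figure quoted above: $800,000 per
# 20 seconds of downtime, applied to Quick Migration's documented window.
loss_per_second = 800_000 / 20            # $40,000 per second
for outage_s in (8, 60, 120):
    print(f"{outage_s:>3}s outage ≈ ${outage_s * loss_per_second:,.0f}")
# ≈ $320,000 at the 8-second best case and $4,800,000 at the 2-minute worst
# case; either end easily covers a VMware license bill, as the customer notes.
```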

This customer is in the trucking industry. … One of the applications in their environment is the timecard system. … The timecard system sits idle most of the day and runs on a server with a dozen other workloads. At shift change (which happens three times a day) the timecard VM gets very busy. … What’s worse, the popular time-keeping system they’re using will disconnect the clients (the card readers) during a TCP timeout, and the only way to bring the clients back online is to manually go into a configuration screen and “disconnect” them and “reconnect” them – again, more lost productivity.
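That manual disconnect/reconnect dance is exactly what automatic reconnection logic avoids. Here is a minimal sketch of what such a client could do; the host and port are hypothetical placeholders, and this is not the actual time-keeping product’s behavior or API:

```python
import socket
import time

# Reconnect-with-backoff logic of the kind the quoted card readers lack:
# instead of staying disconnected until an operator intervenes, the client
# retries on its own until the server answers again.
READER_HOST, PORT = "timecard.example.com", 9100   # placeholder address

def connect_with_backoff(max_wait_s: float = 30.0) -> socket.socket:
    wait = 1.0
    while True:
        try:
            return socket.create_connection((READER_HOST, PORT), timeout=5.0)
        except OSError:
            # Server unreachable (e.g. mid-migration): wait and retry rather
            # than forcing someone to reconnect each card reader by hand.
            time.sleep(wait)
            wait = min(wait * 2, max_wait_s)
```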