Hidden benefits of virtualisation – reboot time and the impact on server availability and regular operations

By guest blogger, Christian Wickham, Technical Account Manager, South Australia and Northern Territory, and Local Government and Councils in Western Australia, Victoria and New South Wales at VMware Australia and New Zealand

Within VMware we often focus on the latest and greatest features and capabilities offered by our newest software. Of course, we are always driving forward, and the next version’s enhancements and benefits are at the forefront of our minds – but there are still people out there who are just starting their virtualisation journey, or who have taught themselves how to use VMware products and are missing out on some of the many benefits. Our premium editions of vSphere, such as Enterprise Plus and the vCloud Suite, offer exceptional advances for businesses and enterprises, but some smaller businesses cannot afford these editions – particularly at the start.

Some benefits of virtualisation, particularly with vSphere, are inherent and included in all versions – and they deliver significant savings in both money and time. In this series, I will outline some of the simple benefits that are often not highlighted to new users of virtualisation, but are well known to (most) existing users.

How long does it take you to boot up a server? I don’t just mean the time it takes Windows or Linux to start; I mean the time it takes the hardware even to begin loading the operating system. Next time you reboot a physical server, go and time it – you may not realise how long it really takes.

You need to consider that server manufacturers rarely produce every single component within their chassis. The big OEM hardware vendors such as HP, IBM, Dell, Cisco and Fujitsu all purchase components from ODMs (original design manufacturers) like Broadcom, LSI, Emulex, Intel and many others. The OEMs may re-badge or rename the devices, but they are still independent hardware underneath.

When a physical server boots up from cold (that is, not a reboot), it will perform various system checks such as scanning RAM for faults and scanning the PCI bus for devices, and then load its BIOS (Basic Input/Output System). Depending on the hardware manufacturer, it might then progress to initialising on-motherboard sensors such as temperature sensors, fan speed sensors, and embedded and out-of-band management (for example HP’s iLO, Dell’s iDRAC, IBM’s RSA). After this, devices connected to the PCI bus will initialise: not just add-in cards but also on-motherboard components. As I mentioned above, these are often manufactured by independent vendors, so each frequently displays its own ‘advertising’ screen declaring its product name, version and copyright details, and, importantly, offers a chance for the user to press a key combination to access an embedded management interface or configuration menu. To give the person at the console time to press the key combination, a delay is built into the startup sequence – frustratingly quick when you want to use it, and frustratingly slow the other 99% of the time when you don’t! If you have a server with multiple add-in cards or embedded devices, 5-10 seconds per device can really add up.

Only after all of this does the server start to boot Windows (or Linux). From a cold boot, I have seen servers take 45 minutes. From a warm boot (that is, a reset or reboot of a server that was already running), it can be shorter, although I have seen this take nearly 20 minutes. All this before it even starts to do anything “useful”. And don’t be tricked by being able to ping a physical server – a network card armed for WOL (Wake On LAN), or an out-of-band management interface, can hold an IP address and respond even when the server is “off”.

In steps virtualisation. With a virtual machine there are no ODMs and no hardware devices to initialise, no copyright announcements and no delays to press a key combination. In fact, vSphere has an option on each virtual machine to delay the boot sequence (and another to force entry into the BIOS setup screen) before starting Windows/Linux. By default, of course, this delay is set to zero milliseconds. Think of this every time you apply a Windows update or make a change to Windows settings that requires a reboot. If you have ten servers to reboot, this could save you 200 minutes a month of just sitting there watching servers begin to boot – it may not sound like much, but when your staff are doing this out of hours (after all, a reboot takes a server offline), it all adds up.
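If you want to check or change that boot delay programmatically rather than through the vSphere Client, a minimal pyVmomi sketch along the following lines can do it. The vCenter address, credentials and VM name below are placeholders, and the snippet assumes the pyVmomi SDK is installed – treat it as an illustration rather than production code.

    # Minimal pyVmomi sketch: inspect and set a VM's BIOS boot delay.
    # vCenter address, credentials and VM name are placeholders.
    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    si = SmartConnect(host="vcenter.example.com",
                      user="administrator@vsphere.local",
                      pwd="password",
                      sslContext=ssl._create_unverified_context())  # lab use only
    try:
        content = si.RetrieveContent()
        view = content.viewManager.CreateContainerView(
            content.rootFolder, [vim.VirtualMachine], True)
        vm = next(v for v in view.view if v.name == "test-vm-01")

        print("Current boot delay (ms):", vm.config.bootOptions.bootDelay)

        # Keep the delay at the vSphere default of zero milliseconds so no
        # time is wasted at power-on (raise it only when you need the BIOS).
        spec = vim.vm.ConfigSpec(bootOptions=vim.vm.BootOptions(bootDelay=0))
        vm.ReconfigVM_Task(spec=spec)
    finally:
        Disconnect(si)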

But wait, there’s more! Not just some steak knives, but other virtualisation benefits that speed up boot time. Remember all those independent hardware components in most servers? Well, each one needs its own drivers so the operating system can use it, and its own management software so you can configure or monitor it – and as the devices are often manufactured by different vendors, these tools are usually independent and need their own resources, even if the devices are re-badged to match the label on the front of the server. So, whilst your operating system is booting, part of the process is loading all of this software into memory; some of it may be unloaded again, but it all further delays boot time.

Unfortunately, this software is often forgotten when people perform a physical to virtual conversion (P2V) with software such as VMware Converter. Companies can end up with a virtual server that is slower to boot than similarly configured VMs because the physical hardware drivers are still installed and loading, then failing (as the devices are no longer present) and unloading again. I have seen that as soon as these leftover devices, drivers and management tools are uninstalled, the VM starts (and runs) faster.

So – how long does it take a virtual machine to boot in vSphere? In my lab, a bare Windows Server 2008 R2 virtual machine boots (from powered off to the Ctrl-Alt-Del prompt) in 28 seconds, using SATA 7200 RPM disks on an NFS datastore. With SAS or a Solid State Disk, this would be even faster.

So, consider the savings in time for your server administrators on Windows Patch Tuesday. Consider the savings in productivity when a Windows server has a problem and needs to be rebooted during the working day. Consider the savings in overtime and out-of-hours work needed to perform maintenance or other tasks that need a reboot. Although I am not recommending this approach, some of my customers are happy to reboot less critical servers during the day because it happens so fast that users don’t notice. Of course, when servers can reboot and no-one notices, you need to ensure that systems are monitored and administrators are alerted to outages.

If you have a very small number of servers at your site and have thought that this rules out virtualisation, consider the savings in reboot time as a factor. I have customers running an ESXi host in a branch office with only one VM on it. The reboot time for that VM is fast – faster than if the operating system were installed directly onto the server. There are many other benefits (such as hardware abstraction, portability, snapshots and so on), but that is for another time…

For clusters enabled with vSphere HA (which is available with all licences above the basic ‘Essentials’ and the free edition), during a host outage the VMs that were on the unavailable host will automatically be started on surviving hosts just as quickly – although, depending upon the installed application(s) and the configuration of the server, the guest may attempt disk checks, application recovery or consistency checks, which add to the time.

Factor this into your considerations when comparing against other technologies such as Microsoft Failover Clustering and calculating uptime capabilities:

  • Do a test on a VM whose uptime you need to maximise – I will use the example of Microsoft SQL Server on Windows.
  • Power on the VM from cold, whilst at the same time attempting to connect to the SQL service with a client – this gives you the clean boot time.
  • Now take a SQL cluster and perform a ‘move resource’ action on the resource group – time how long it takes between losing the connection from a SQL client and the service returning (the polling sketch after this list can do the timing for you).
  • At this stage you can evaluate whether it is quicker to boot up a SQL server or to stop and then restart the SQL services on another node within a failover SQL cluster.
  • You can also time how long it takes to simply restart a service on a running server – if you choose one with some dependencies, this can sometimes be slower than a reboot (which, ironically, includes starting the same service!).
  • Go a bit further and do an unexpected reboot of a running SQL server – in vSphere this is done with the “Reset” option, which performs a power cycle. Time how long it takes to boot and recover so that a SQL client can connect. This test can put data at risk, so don’t do it on a production system, and ensure you have recoverable backups of any data!
  • Then perform the same with a SQL failover cluster node – power cycle the node that is running a SQL instance and time how long it takes before the SQL service responds on another node. Don’t do this through the Cluster Manager; force a failure in another way, such as a power cycle. Microsoft Failover Clustering performs “LooksAlive” checks every 5 seconds and “IsAlive” checks every 30 seconds, so it takes around 10 seconds before the cluster will start its failover actions.
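To take your own reaction time out of these measurements, a small polling script can watch the SQL port and record the outage window for you. The sketch below is one possible approach; it assumes that plain TCP reachability on port 1433 is a good enough proxy for “the service is back” (a stricter test would log in and run a query), and the host name is a placeholder for your own test server.

    # Minimal sketch: measure how long a TCP service (e.g. SQL Server on 1433)
    # is unreachable during a reboot or cluster failover.
    import socket
    import time

    HOST, PORT = "sql-test-01.example.com", 1433   # hypothetical test server
    POLL_INTERVAL = 0.5                            # seconds between checks

    def port_open(host, port, timeout=1.0):
        """Return True if a TCP connection to host:port succeeds."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    outage_start = None
    print("Polling {}:{} ... trigger the reboot or failover now".format(HOST, PORT))
    while True:
        reachable = port_open(HOST, PORT)
        if not reachable and outage_start is None:
            outage_start = time.time()
            print("Connection lost at", time.strftime("%H:%M:%S"))
        elif reachable and outage_start is not None:
            print("Service back after {:.1f} seconds".format(time.time() - outage_start))
            break
        time.sleep(POLL_INTERVAL)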

It goes without saying that this is a test, and just one way to evaluate, in your own environment, the relative benefits of vSphere HA against other products such as Microsoft Failover Clustering. Your own experience will vary depending on your application and its configuration, particularly when databases are large or have outstanding transactions; that is why the figures here should not be taken as definitive.

In my testing, a clustered SQL node took 37 seconds to return to service after a failure. In a similar test of a non-clustered SQL server protected only by vSphere HA, it was 43 seconds before it returned to service. I must stress that the SQL servers were not 100% identical and not under load, the test database was basic and benign, and the timing was subject to my reaction time in starting and stopping the clock – but it is an indication for you to consider. My one-off test showed only 6 seconds of improvement in recovery time by using MSCS over vSphere HA – and although I am experienced with setting up MSCS and SQL clusters, I left all settings at their defaults and did not perform any tuning.

Your business may demand “zero downtime” for its applications, and software vendors may recommend products such as Microsoft Failover Clustering, but when you are armed with facts from your own environment about what vSphere HA can deliver compared with more complicated (and more fragile) alternatives, you can save a lot of money and heartache by simply using the built-in capabilities – vSphere HA is, after all, enabled with a single check box.
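For the curious, that check box has an API equivalent. A minimal pyVmomi sketch along these lines can turn HA on for an existing cluster; the vCenter address, credentials and cluster name are placeholders, and in practice most people will simply tick the box in the vSphere Client.

    # Minimal pyVmomi sketch: enable vSphere HA on an existing cluster.
    # vCenter address, credentials and cluster name are placeholders.
    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    si = SmartConnect(host="vcenter.example.com",
                      user="administrator@vsphere.local",
                      pwd="password",
                      sslContext=ssl._create_unverified_context())  # lab use only
    try:
        content = si.RetrieveContent()
        view = content.viewManager.CreateContainerView(
            content.rootFolder, [vim.ClusterComputeResource], True)
        cluster = next(c for c in view.view if c.name == "Branch-Cluster-01")

        # The API equivalent of ticking "Turn ON vSphere HA" for the cluster.
        spec = vim.cluster.ConfigSpecEx(
            dasConfig=vim.cluster.DasConfigInfo(enabled=True))
        cluster.ReconfigureComputeResource_Task(spec, modify=True)
    finally:
        Disconnect(si)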
