
Monthly Archives: December 2008

Site Recovery Manager 1.0 Update 1 Available

Site Recovery Manager 1.0 Update 1 was recently released (just in time for December holiday shopping!).  The full details of what’s new can be found in the release notes at http://www.vmware.com/support/pubs/srm_pubs.html.  Here are some highlights:

  • Expanded platform support for VMware Infrastructure.  Site Recovery Manager 1.0 U1 adds support for ESX 3.5 Update 2 and Update 3, as well as vCenter Server 2.5 Update 2 and Update 3.  It also adds support for ESXi with Fibre Channel storage.  See the Site Recovery Manager Compatibility Matrix for full details of supported versions and required patches.
  • Full support for RDM devices.  This release provides full support for virtual machines that use raw disk mapping (RDM) devices.  This enables support for adding virtual machines that are used with third-party clustering software, such as Microsoft Cluster Service, to a Site Recovery Manager recovery plan.
  • Bulk entry for IP address reconfiguration.  This release includes a tool that allows you to specify IP properties (network settings) for any or all of the virtual machines in a recovery plan to make it easier to enter a large number of IP address changes.
  • More granular permissions.   SRM now has separate permissions for testing a recovery plan and for running a recovery plan so that you can control access to both of those functions independently.
  • Single-click reconfiguration of protection for multiple virtual machines.  This release introduces a “Configure All” button that applies existing inventory mappings to all virtual machines that have a status of “Not Configured”.
  • Improved internationalization.  Non-ASCII characters are now allowed in many fields during installation and operation.
  • Simplified Log Collection.  This release introduces new utilities that retrieve log and configuration files from the server and collect them in a compressed (zipped) folder on your desktop.

We’re excited about the continued momentum around Site Recovery Manager and looking forward to seeing people start to take advantage of Update 1.  If you are testing or deploying this update, let us know your feedback!

Jon

Site Recovery Manager Demonstration Video now online!

After many takes and re-takes we have finished the SRM demonstration video and posted this for all to see. The purpose of the video is to provide a basic understanding of the SRM solution along with the disaster recovery challenges it solves.

We hope you find this a useful resource to share with your colleagues or customers.

VMware Site Recovery Manager – “From general release to Update1, what have we learnt and what’s new?”

In this post we will focus on the activity that has surrounded VMware Site Recovery Manager (SRM) since its launch. According to our own download page, SRM build 97878 has been available since 2008/06/19, so what has been happening with SRM in the field since that date? The short answer is: a lot! SRM has been written about in many places, VMware evangelists (yes, that's you Mike!) now have books specifically dedicated to SRM administration, and SRM has been a big draw at the VMware stands at the various shows we have put on or attended during 2008.


As one of VMware's technical folks, my main interest is what has been good, bad, or plain ugly from an implementation point of view. I spend a lot of my time assisting customers and partners with their SRM deployments, configuration woes, and the like, so I wanted to give you all a quick run-down of some of the key gotchas, as well as some pointers to useful references for SRM help. Going forward I hope to post further blog updates focusing on specific topics of SRM implementation, including closer looks at networking, storage replication integration, sample architectures, and customizing recovery plans, to name a few. With that said, let's get back to some of the common questions that come up.

 

Q.    We have installed SRM but cannot see any SRM screens inside the vCenter Client?

To make the SRM icon and screens available you must download and install the SRM plugin via the vCenter Client “Manage Plugins” menu. Your vCenter user ID will also need the appropriate privileges to be able to work with SRM.



Q.    Where should we install SRM (and the SRA)?

o    In a VM?

o    On the same VM as vCenter?

o    Should the SRM database reside alongside the vCenter db?

o    Can the SRM database be of a different type, i.e. Oracle?

It can depend on a lot of factors, some of which we have listed below. For POC/eval and test environments, most customers will deploy the two SRM servers (and their databases) alongside their vCenter servers, within the same virtual machines. For production environments, the reality of day-to-day operational processes will probably mean the SRM server and SRA (storage replication adapter) will be installed alongside each other in their own virtual machine, to make tasks such as raising change requests for maintenance/patching more straightforward. Other factors will include:

o    Size of the VI environment (number of ESX hosts/number of VMs).

o    A small number of hosts & VMs can mean customers deploy SRM in the same VM as vCenter, as typically in this configuration the vCenter server is lightly loaded.

o    With a larger number of hosts & VMs, customers install the SRM components in a separate VM.

o    The type of SRA being used can be a factor, i.e. does your SRA need access to “admin” LUN(s) to communicate with storage?

Q.    We download the SRA for our storage platform from vmware.com and just go ahead and install it, with no other checks needed. Is that correct?

Not quite. Each vendor provides a readme; you must ensure you review this first. Second, each storage vendor also generally supplies a whitepaper/technote covering best-practice implementation for setting up their adapter (SRA), so ensure you seek these out! Links to documentation from storage vendors can be found on the SRM resources page: http://www.vmware.com/products/srm/resource.html

 

Q.     Do all of the SRA adapters communicate with their respective storage arrays in the same way?

No. Each vendor’s architecture for connectivity is different: some require the installation of a client-side remote command suite (provided by the storage vendor) and some don’t. Again, review your storage vendor’s readme and implementation guide, and if you have one, speak to your storage team. Don’t forget the SRAs are supported by the storage vendors, so if you do have issues with the adapters you can raise support requests with your storage vendor, assuming you have a valid support contract.


Q.    Our Storage Replication Adapter (SRA) is installed correctly and all seems OK, however no datastores appear in the datastore groups screen?

If the replicated VMFS datastores are empty, i.e. contain no VMs, then the datastore will not appear. Add VMs into the datastore(s) and use the rescan arrays button to update the view.



Q.    When creating a protection group, SRM prompts for a datastore location to house “Placeholder VMs”. What are these used for?

Placeholder virtual machines are used to identify the location of the recovered VM within the recovery site vCenter inventory. SRM will replace the placeholder VM with the VM registered from the replicated storage during testing/failover.

Q.    During the install process, port 80 is defined as the communication port for vCenter. Can this be changed?

Even though SRM uses SSL when it communicates with vCenter, it does not use port 443. SRM establishes a TCP connection to port 80, then uses an HTTP CONNECT request to establish a tunnel to the vCenter server, then does an SSL handshake with vCenter over that tunnelled connection. The SRM installation enforces these semantics.
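The connection sequence above can be sketched in a few lines of Python. The host name is a placeholder, and this is purely an illustration of the tunnelling pattern, not SRM's actual implementation:

```python
import socket
import ssl

def build_connect_request(host, port=80):
    # HTTP CONNECT request asking the far end to tunnel subsequent bytes.
    return f"CONNECT {host}:{port} HTTP/1.1\r\nHost: {host}\r\n\r\n"

def open_tunneled_ssl(host, port=80):
    # Step 1: plain TCP connection to port 80.
    sock = socket.create_connection((host, port))
    # Step 2: HTTP CONNECT to establish the tunnel.
    sock.sendall(build_connect_request(host, port).encode("ascii"))
    status_line = sock.recv(4096).split(b"\r\n", 1)[0]
    if b"200" not in status_line:
        sock.close()
        raise ConnectionError("tunnel refused: " + status_line.decode(errors="replace"))
    # Step 3: SSL handshake over the tunnelled connection.
    ctx = ssl.create_default_context()
    return ctx.wrap_socket(sock, server_hostname=host)
```

The key point is that the SSL handshake happens inside an already-established TCP/HTTP tunnel on port 80, which is why simply changing the port to 443 is not an option.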

 

Q.    Which datastore should be selected to hold the placeholder VMs? What should we consider?

The first recommendation would be to locate all of the placeholder virtual machines in the same datastore at the recovery site. If all the placeholder virtual machines are in one datastore they will be easier to find should you need them, and simpler to locate should you need to perform any troubleshooting.

Having a small datastore set aside for use only as the SRM placeholder virtual machine datastore also means you are not placing them in datastores at the recovery site that contain actual virtual machines residing at that site permanently. vCenter users not authorized to use SRM, or unfamiliar with it, may find it confusing should they stumble across a placeholder virtual machine folder lying within a datastore normally used for other virtual machines. Other factors to consider:

o    The datastore needs to reside at the recovery site.

o    The datastore does not need to be replicated.

o    Sizing: the datastore will only contain VM config files (*.vmx, *.vmxf, *.vmsd), typically three files of less than 1 KB each.

Q.   Which vCenter object is SRM enabled on: Host, Cluster, or Resource Pool?

In SRM the basic unit of replication is the datastore. Recovered VMs can be placed on arbitrary hosts/clusters, as long as the hosts can access the replicated datastores.

Q.     Do all the VMs we are protecting need to be in a cluster?

No. SRM only requires separate vCenter instances, one managing the protected site and the other managing the recovery site.

Q.     For failover, how does SRM guarantee resources at the recovery site?

SRM can suspend local VMs at the recovery site as part of a recovery plan. Best practice is to also use resource pools at the protected site and map these to resource pools at the recovery site using SRM “Inventory Mapping”.


Q.     We have seen the “Recompute Datastore Group” tasks run periodically within vCenter since we installed SRM. What triggers these tasks?


Datastore Group computation is triggered by the following events:

o    An existing VM is deleted or unregistered.

o    A VM is Storage VMotioned to a different datastore.

o    A new disk is attached to a VM on a datastore previously not used by the VM.

o    A new datastore is created.

o    An existing datastore is expanded.


Q.   Occasionally when we log in to the SRM screens we see the site pairing status displayed as “Low Resources on Paired Site”. What causes this?


The “Low Resources…” message can be generated if any of the following conditions are true on the server (VM) where SRM is installed:

o    Remote site free disk space drops below 100 MB (default).

o    Remote site CPU usage goes above 70% (default).

o    Remote site available memory drops below 32 MB (default).

 

These are default values, which can be configured by modifying the vmware-dr.xml file located in C:\Program Files\VMware\VMware Site Recovery Manager\config. The fields to modify are minDiskSpace, maxCpuUsage, and minMemory.
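As a hypothetical illustration, the edited entries in vmware-dr.xml might look something like the fragment below. The exact element nesting in your installed file may differ, so always check the file on your SRM server before editing:

```xml
<!-- Hypothetical excerpt of vmware-dr.xml showing the three thresholds;
     the surrounding structure here is illustrative only. -->
<config>
  <minDiskSpace>100</minDiskSpace>   <!-- free disk space on the remote site, in MB -->
  <maxCpuUsage>70</maxCpuUsage>      <!-- CPU usage on the remote site, in percent -->
  <minMemory>32</minMemory>          <!-- available memory on the remote site, in MB -->
</config>
```

Remember to restart the SRM service after changing the file so the new thresholds take effect.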


Q.    What are the SRM failback options? We see no button for failback, which is confusing us.

SRM absolutely supports failback, and each storage vendor documents the failback process for their specific replicated storage configuration. What you have to consider is that without SRM in your virtual environment you are back to manual and/or home-grown scripts for DR: you will no longer have automated Recovery Plans, no offline DR testing capabilities, and no DR audit trail. You can still fail back manually without using SRM; the high-level steps would be:

o    Delete the protection groups in the Protected Site vCenter.

o    Unregister the protected virtual machines in the Protected Site vCenter.

o    Work with your storage team to reverse data replication.

o    Re-inventory the VMs in the Protected Site vCenter, then restart and re-IP them (manual or scripted).

 

With SRM in place you will have Recovery Plan(s), the ability to test failover before recovery, and a built-in audit trail. SRM can also be used to help you fail back once your primary site has been restored. The high-level steps would be:

o    Delete the protection groups in the Protected Site vCenter.

o    Unregister the protected virtual machines in the Protected Site vCenter.

o    Work with your storage team to reverse data replication.

o    Leverage SRM: complete the SRM workflows in the reverse direction, from the Recovery Site back to the Protected Site.

 

Repeat the above steps from the Protected Site back to the Recovery Site to complete the re-protection of the virtual machines in the Protected Site.

 

I hope that has answered a few FAQs. I am sure there will be more to come, but for now, thanks for stopping by!

Lee Dilworth

   


Welcome to the Uptime Blog

We continue to see a lot of interest and questions about how to protect VMware environments as well as a lot of excitement about the new and future technologies that VMware has developed and talked about, so we wanted to create a place where we can give you some additional insight into what we're seeing and working on here at VMware.  This blog will focus on products and solutions for business continuity in virtualized environments.  We’ll talk about data protection, high availability, and disaster recovery solutions that include VMware Infrastructure and products like VMware Consolidated Backup, High Availability, and Site Recovery Manager.

To get started, here are just a few of the
resources available on vmware.com:

·       Data Protection:

·       High Availability:

·       Disaster Recovery (including Site Recovery Manager):

 Stay tuned for more information, and let us know what you want to hear more about.

Beaconing Demystified: Using Beaconing to Detect Link Failures

Beaconing is one of those features that often confuses even
the most experienced networking admin.

 

Shudong Zhou, one of our senior engineers, recently posted
an entry on the internal blog explaining how it works and how you might use it.
He gave me permission to cut and paste his entry. Here it is …

 

Beaconing is a software solution for detecting link failures downstream from the physical switch. ESX provides a simple and elegant teaming solution. All uplinks connected to a vswitch are assumed to connect to the same physical network (same broadcast domain), so they are all equivalent. Users can configure a list of active and standby uplinks for traffic to go out of the ESX host. If a link fails, the adapter driver detects it, marks the uplink as failed, and stops using this uplink. Existing traffic will fail over to a standby uplink or be redistributed to the remaining team members.

If a downstream link beyond the immediate physical port fails, the adapter driver obviously cannot detect it. This causes existing VMs using the uplink to lose network connectivity. The proper way to solve this problem is to enable Link State Tracking on the physical switch so that the adapter driver can see the failure. If the physical switch does not support Link State Tracking, beaconing provides a software alternative. Beaconing works as follows:

ESX periodically broadcasts beacon packets out of all uplinks in a team. The physical switch is expected to forward all packets to other ports on the same broadcast domain. Hence, a team member is expected to see beacon packets from other team members. If an uplink fails to receive any beacon packets (actually, misses 3 consecutive packets), we mark it bad. The failure can be due to the immediate link or a downstream link. With 3 or more uplinks in a team, we can pinpoint the failure of a single uplink. With 2 uplinks in a team, we can detect a downstream link failure, but we don't know which uplink is good and which is bad.

ESX behavior when a beaconing failure is detected is as follows:

  1. If two or more uplinks receive beacons from each other, those uplinks are considered good. We stop using uplinks which do not receive any beacon packets.
  2. On ESX 3.5, if no uplink receives beacon packets, traffic is sent to all uplinks (shotgun mode). If a team has two uplinks, any link failure will result in all packets being sent to both uplinks.
  3. In a future edition of ESX, we intend to make an additional improvement: if no uplink receives beacon packets, traffic is only sent to uplinks whose link status is “up”. If a team has two uplinks and one uplink experiences a failure in its immediate link, traffic will be sent out to the other uplink. This saves some CPU cycles.
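As a rough illustration, the beacon-miss bookkeeping described above can be modelled with a toy Python sketch. The class, names, and threshold constant are my own for illustration, not ESX internals:

```python
# Toy model of beacon-based failure detection: each uplink counts how many
# consecutive beacon intervals passed without hearing any teammate's beacon;
# after 3 misses it is marked bad.
MISS_THRESHOLD = 3

class Uplink:
    def __init__(self, name):
        self.name = name
        self.misses = 0

    def interval(self, heard_beacon):
        """Record one beacon interval; heard_beacon is True if a beacon
        from any other team member arrived on this uplink."""
        self.misses = 0 if heard_beacon else self.misses + 1

    @property
    def bad(self):
        return self.misses >= MISS_THRESHOLD

def usable_uplinks(team):
    good = [u for u in team if not u.bad]
    # If no uplink is hearing beacons, ESX 3.5 falls back to "shotgun mode"
    # (see #2 above): send traffic out of every uplink rather than none.
    return good if good else list(team)
```

With three uplinks, one uplink that misses three intervals in a row is excluded while the other two keep receiving traffic; when every uplink goes quiet, the sketch returns the whole team, mirroring the shotgun-mode fallback.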

When should one enable beaconing? When you are concerned that downstream link failures may impact availability and there is no Link State Tracking on the physical switch. Ideally, you should have 3 or more uplinks in the team (active + standby), but you can enable beaconing with 2 uplinks. Some customers don't like the shotgun mode on failure (see #2 above); that's a trade-off you have to weigh against some VMs losing connectivity right away.