VMware

11/02/2011

The Uptime Blog is Moving!

We've decided at VMware that we shouldn't make you scramble around quite as much for different material found in different blogs, so we're going to start consolidating our blogs under the vSphere blog.

This has a few advantages, primarily the fact that you won't need to go to different locations to retrieve information about various VMware-related material.  You can also now perhaps be exposed to a wider variety of information than you would have seen in the past, as you'll have more authors piping data into one location.

So from now on, keep on top of a lot more data by updating your link to the uptime blog to the Main vSphere Blog.

Now, for those of you who would rather still see just uptime-related material, that's not a problem.  We are maintaining some semblance of structure, so you can go directly to the Uptime Section and not have to see the rest of the postings.

So you have options!  The main blog will include all sorts of material about the Platform, Uptime, Networking, Storage, vCenter, and vCloud, but as I mentioned you also have the option of limiting your viewing to just the subjects of interest.  If at any time you want to see subject material specific to one of those categories, you can just click on the appropriate category on the right of the screen.

So, I'll see you soon at the new location at http://blogs.vmware.com/vsphere

-Ken

 


10/17/2011

SRM Compatibility Updates

Just a quick update for those of you who've had questions about supported platforms and databases.

We've updated the VMware Product Interoperability Matrices to be more authoritative.  Some platform support has changed (e.g. we only support Update 3 for ESX/ESXi 4.0, and no prior update versions), and supported databases should be a lot broader than was initially indicated in the pdf included with the release.

So make sure you check against the online interoperability matrix for definitive information if you want to check what components will work with others in your environment.

-Ken


10/12/2011

Upgrading to SRM 4.1 before upgrading to 5.0

Please note that an in-place upgrade to SRM 5.0 is only supported from version 4.1!

If you're running an older version such as 1.x, you will need to follow a few steps to get to the current version, including an upgrade first to 4.1.

Michael White did a fantastic job of documenting the 4.1 upgrade process in July 2010 and if you're looking at moving to 5.0 from an older version you should consider his material mandatory reading before upgrading to 4.1. :)

Take a look at Michael's upgrade material here:

http://blogs.vmware.com/uptime/2010/07/upgrading-to-srm-41-including-upgrading-to-vsphere-virtualcenter-41.html

 


10/07/2011

VMworld 2011 Copenhagen

Hope you're all having a great Friday and get a chance to relax over the weekend!  

Thank you for your attention and participation this week in the VMware Experts' Forum, specifically focussing on Business Continuity and Disaster Recovery with Site Recovery Manager 5.0.  Hopefully you've found some of these postings useful in understanding some of the technical aspects of protecting your business with SRM.

While this concludes the three week Experts' Forum, I'll be continuing to post material of this nature here at the Uptime Blog, and you can follow me on twitter @vmKen.

Make sure you keep in touch with us on our Facebook page as well: IT Management

Lastly, for those of you who are going to attend VMworld 2011 in Copenhagen, catch some of my BC/DR related sessions:

BCO1269: SRM 5.0 - What's New and Recommendations for Success #BCO1269

BCO2527: How to Be Successful with Site Recovery Manager Implementations #BCO2527

My group discussions:

GD19: Site Recovery Manager with vSphere Replication #GD19

And my knowledge expert one-on-one:

EXPERTS-07: Knowledge Experts One-on-One #EXPERTS-07

 

Hope to see you there!

-Ken


SRM 5 - Using Dependencies

With previous SRM versions your options for controlling start sequence of individual VMs boiled down to choosing whether to put the individual VMs into High, Medium or Low priority groups.  High priority VMs started sequentially, Medium and Low priority VMs started in parallel.

This meant that in order to stage startup of VMs what most of our customers would do is more or less leave everything in "Medium" and then control sequencing by interleaving multiple recovery plans.

This meant you had to carefully manage your recovery plans and the sequence in which they were run.  You might have had an RP for 'base infrastructure services', another RP for 'base databases', another for each tier of a 3-tier architecture, and so forth, and then had to manage running them in the appropriate sequence.

Now we have a slightly different way of manipulating startups.  We have 6 priority groups within a recovery plan.  The VMs in each priority group will all start in parallel with each other, meaning we've in essence doubled the number of priorities that can be set for VMs within a recovery plan.

Figure 45

But beyond that we want to be able to control the sequence for starting VMs *within* a priority group as well, and that's where the new feature of "Dependencies" comes into play.  With SRM 5 you can set as a property of the VM itself another VM or a number of VMs that must be started and running before the VM on which you're working will start.

In essence we can now have a single recovery plan, with 6 groups of VMs within it, with individual dependencies for each VM set within the priority group to control start sequence.  

Figure 48
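To make the idea concrete, here's a minimal Python sketch of how dependencies inside one priority group translate into a start order.  This is just an illustration of the concept, not how SRM implements it.  The VM names are made up, and it uses the standard library's graphlib module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter


def start_waves(dependencies):
    """Group the VMs of one priority group into start 'waves'.

    dependencies maps each VM name to the list of VMs that must be
    running before it may start. Every VM in a wave has all of its
    dependencies in earlier waves, so a wave can start in parallel.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical priority group: a database, an app server, two web servers.
group = {
    "db01": [],
    "app01": ["db01"],
    "web01": ["app01"],
    "web02": ["app01"],
}

for number, wave in enumerate(start_waves(group), start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

The two web servers land in the same wave because neither depends on the other, so they can still start in parallel once app01 is up.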

So what, Ken?  Well the great part about this is that we can now reduce the number of recovery plans needed to control start sequence.  We can take those old RPs and in essence turn them into priority groups within a single recovery plan, and then from there set mandatory start ordering within each priority group!

So from an ease of management perspective, this is brilliant, and everyone should be quite pleased by this.  But what's the catch?

Well there's no real catch per se, but one thing you *will* need to be aware of is that the RTO, Recovery Time Objective, may be impacted if instead of doing things in parallel you start setting dependencies and start ordering for all your VMs.  

If your recovery time with 10 VMs starting in parallel is 5 minutes, adding a single dependency into that may add another, say, 2 minutes while one VM waits for its 'parent' to complete booting.  If every VM has a dependency set on one or two mandatory VMs it's probably not a big deal.  If every VM is dependent on another VM in essence you've set them back to sequential boot sequence, and your 5 minute RTO may have to be adjusted to 25 minutes or so!  
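If you want to play with those numbers, here's a back-of-the-envelope Python sketch.  It assumes the illustrative figures above (roughly 5 minutes for a fully parallel start and roughly 2 extra minutes for every dependency 'hop'), and the VM names are made up.  It is a rough model for reasoning about RTO, not something SRM calculates for you.

```python
from functools import lru_cache


def longest_chain(dependencies):
    """Return the number of hops in the longest dependency chain.

    dependencies maps each VM name to the list of VMs it waits for.
    Assumes there are no circular dependencies.
    """
    @lru_cache(maxsize=None)
    def depth(vm):
        parents = dependencies.get(vm, [])
        if not parents:
            return 0
        return 1 + max(depth(parent) for parent in parents)

    return max((depth(vm) for vm in dependencies), default=0)


def estimated_rto_minutes(dependencies, parallel_minutes=5, wait_minutes=2):
    """Rough RTO: the parallel baseline plus a wait per dependency hop."""
    return parallel_minutes + wait_minutes * longest_chain(dependencies)


vms = [f"vm{i}" for i in range(10)]

no_deps = {vm: [] for vm in vms}
one_parent = {vm: ([] if vm == "vm0" else ["vm0"]) for vm in vms}
full_chain = {vm: ([] if i == 0 else [vms[i - 1]]) for i, vm in enumerate(vms)}

print(estimated_rto_minutes(no_deps))     # 5
print(estimated_rto_minutes(one_parent))  # 7
print(estimated_rto_minutes(full_chain))  # 23
```

What matters is the length of the longest chain, not the number of dependencies: nine VMs all waiting on one parent costs one hop, while nine VMs waiting on each other costs nine.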

The point is that dependencies are amazing and give you a lot more control over how and when VMs start, and allow you to redesign your recovery plans, but be cautious with them.  If a dramatically quick RTO is your chief objective you may be better served by either putting non-dependent groups of VMs into more priority groups or even by returning to the old model of interleaving sets of recovery plans!

As always... test often!

-Ken


10/06/2011

vSphere Replication Bandwidth pt 2

Continuing from the earlier post about vSphere Replication bandwidth...

 

RPO.

The RPO is another key to figuring out your traffic patterns for replication.  We look at how many blocks have changed during the RPO window and transfer those blocks that have changed.  The data change rate within that RPO gives us the total number of blocks that need to be shipped in each window.  This may vary widely throughout the day, however, which will again alter the traffic generated by replication.  If we have systems that are busy during business hours but idle at night our overall figure above of 100GB daily may hold true, but the 4GB per window may drift wildly over the course of the 24 hour period.

From an averages perspective, however, the 4GB number may be good.  

Here's another complication:  Because we strictly look at changed blocks, not how many times those blocks have changed, we may again end up with widely differing bundles for replication from one replication window to the next.  If traffic on a VM is 'bursty' meaning for example very busy in one hour and idle during the next, we may have to ship a lot of blocks in one hour and none in the next!  Moreover, if we are using VSS to quiesce the VM our replication traffic can not be spread out in small trickled sets of bundles throughout the RPO window, and instead we will need to transfer all the changed blocks as one set as determined when the VM was quiescent.  Without VSS, VR can ship more smaller bundles of changed blocks on an ongoing basis as blocks change, smearing out the traffic throughout the RPO.  So the traffic shape will change based on your choice to use VSS, and VR will handle the replication schedule differently based on these factors, leading to varying traffic patterns.

Lastly, as you change your RPO we will obviously have to ship more or less data per transfer to meet that RPO, so that is why figuring out replication bandwidth is so dependent on both the change rate and the RPO configured.

Link Speed.  

Obviously the link speed is the other defining factor in our calculation.  If we now know that we have an average replication bundle of 4GB to transfer in a 1 hour period, we need to look at our link speed to determine if the RPO can be met.  If we have a 10Mb pipe, we can do the math: 4GB is roughly 32,000 megabits, which at 10Mbps would take about an hour (a little over 53 minutes) to complete on a completely dedicated pipe with little overhead under ideal conditions.

So just to meet our RPO we would be completely saturating a 10Mb WAN connection, under ideal conditions, with no overhead or limiting factors such as retransmits, shared traffic, excessive bursts of data change rates, etc. and presuming a very standard and unchanging data change rate!

Realistically we need to expect say 70% of a link will be available for actual traffic replication which means on 10Mb links we can get around 3GB/h, on a 100Mb link we can get around 30GB/h, etc.

The real question becomes one of looking at the data churn rate within an RPO and determining if the link speed and conditions will allow for that to be replicated.

Another way to look at it would be to again track what your data churn rate is within your RPO (or average it over a longer period and then divide down to your RPO) and then determine what traffic that will generate and measure that against your link speed.

For example, given a churn rate of 100 GB, you will need approximately 200 hours to replicate that on a T1, 30 hours to replicate that on 10Mbps, 3 hours on 100Mbps, and so forth.
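If you want to run the link math for your own numbers, here's a small Python sketch.  It assumes the 70% usable-link figure from above, decimal units (1GB = 8,000 megabits), and a T1 speed of 1.544Mbps.  Treat the results as rough planning numbers, not guarantees.

```python
def usable_gb_per_hour(link_mbps, efficiency=0.7):
    """GB per hour a link can carry if only part of it is usable."""
    usable_mbps = link_mbps * efficiency
    return usable_mbps * 3600 / 8 / 1000  # megabits/s -> gigabytes/hour


def hours_to_replicate(data_gb, link_mbps, efficiency=0.7):
    """Hours needed to ship data_gb over the link at the given efficiency."""
    return data_gb / usable_gb_per_hour(link_mbps, efficiency)


links = {"T1": 1.544, "10Mbps": 10, "100Mbps": 100}

for name, mbps in links.items():
    rate = usable_gb_per_hour(mbps)
    hours = hours_to_replicate(100, mbps)
    print(f"{name}: about {rate:.2f} GB/h, about {hours:.0f} hours for 100GB")
```

Setting efficiency to 1.0 gives the 'ideal conditions' case: a 4GB bundle on a dedicated 10Mbps pipe works out to a little under an hour.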

The last major consideration here is that we are assuming a very standard RPO for all our VMs and a churn rate that smears out nicely over the 24 hours.  Both of these factors will likely not hold true in your environment.

If we have many tiers of RPOs for our VMs, say groups that have RPOs of 15 minutes, 1 hour, 4 hours and 24 hours, we need to calculate this for each individual tier, and then look at how it is being met by our factors listed above, and then look at the aggregate of each tier!
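Here's one way to sketch that per-tier calculation in Python.  The tier figures are entirely hypothetical, and the model only looks at the average rate needed to ship each tier's changed data within its RPO.  It says nothing about bursts, or about VSS concentrating traffic into one transfer, so treat the answer as a floor rather than a sizing.

```python
def required_mbps(changed_gb_per_window, rpo_hours):
    """Average rate needed to ship one window's changes within the RPO."""
    megabits = changed_gb_per_window * 8 * 1000
    return megabits / (rpo_hours * 3600)


def aggregate_required_mbps(tiers):
    """Sum the average rate over all tiers of (changed GB, RPO hours)."""
    return sum(required_mbps(gb, rpo) for gb, rpo in tiers)


def link_is_sufficient(tiers, link_mbps, efficiency=0.7):
    """Compare the aggregate need with the usable share of the link."""
    return aggregate_required_mbps(tiers) <= link_mbps * efficiency


# Hypothetical tiers: (changed GB per RPO window, RPO in hours)
tiers = [
    (0.5, 0.25),  # 15 minute RPO
    (4, 1),       # 1 hour RPO
    (10, 4),      # 4 hour RPO
    (40, 24),     # 24 hour RPO
]

print(f"aggregate need: {aggregate_required_mbps(tiers):.1f} Mbps")
print("10Mb link sufficient: ", link_is_sufficient(tiers, 10))
print("100Mb link sufficient:", link_is_sufficient(tiers, 100))
```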

So this is the complexity of VR bandwidth calculations… We need to factor in all the different RPOs in the environment, the subset of the environment that is protected, the change rate of the data within that subset, how much of that data changes within each configured RPO, and the links in use.

Ultimately it's just math, but getting to an answer of what bandwidth is required for VR unfortunately always lands us back at the statement "It depends".  The biggest dependency is your data change rate, and that's probably the best place to start when looking at figuring this out.

Hope this helps understand VR and bandwidth usage! 

-Ken

 


vSphere Replication Bandwidth

One of the most common questions about SRM 5 is with regards to vSphere Replication and how much bandwidth is used to replicate VMs.

The answer is of course that "it depends" and one of the goals of this post is to help you understand what factors are important in determining what to expect in your environment from a bandwidth perspective.

There are a number of factors involved in replication with VR:

1) Size of dataset

2) Data change rate ("Churn" for lack of better terminology)

3) Recovery Point Objective ("RPO")

4) Link speed

Fundamentally, at a high level, the calculation for your required bandwidth to use VR comes down to calculating your average churn rate within an RPO divided by your link speed.  Let's take a look at why these factors are important.

Size of dataset. 

We can not assume that every VM in your environment will be protected by VR, let alone every VMDK in your VMs.  We need to look at our data stores and figure out what percentage of our total storage is for virtual machines that we are protecting with VR, and what number of VMDKs within that subset are actually configured for replication.  

Let's take an example wherein we have 2TB of VMs on our data stores, and presume that we are protecting half of them with VR.  That gives us a maximum amount of data for replication of 1TB.  It may be that we are even then only protecting a subset of the VMDKs represented by that 1TB, but for our example we can now assume that the size of replication we are dealing with is at maximum 1TB.  For our 'high water' calculations we will use this number.

Data change rate.

This is the absolute key to all calculations with VR.  We do not replicate every block of the dataset.  The change rate is very tightly coupled with the RPO when we calculate what needs to be replicated, as the figure we need to worry about is how many blocks have changed within a given RPO.  We need to know how many blocks have changed within a given RPO for a VM in order to estimate the transfer size for each window's replication.  This is not always easy, nor is it easy to calculate for each VMDK and then calculate a sum, so instead let's step up a level and look at overall averages.  We may estimate for example that we have a churn rate of 10% daily for the dataset being used with VR.  This will differ radically for each of you, but it gives nice round numbers for these sample calculations.  Given the 1TB dataset, if we know the churn rate of 10% daily, that means we are working with a daily set of blocks that need to be shipped of about 100GB.

Now this is where the complications begin.  Even here we can not assume that all 100GB needs to be shipped.  

With VR we ship blocks based on the RPO.  If we have a 1 hour RPO, that means that within each hour any block that has changed will need to be shipped in order to meet that RPO.  This does NOT mean, however, that every time a block changes we need to ship it!  If a given block changes 100 times within that hour, we don't need to ship it 100 times, we only transfer it once at its current state when the block bundle is created for transfer.  All we care about is that it *has* changed within the RPO, not how *many* times it has changed.
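A tiny Python sketch shows why the number of writes doesn't matter.  This is purely conceptual and is not VR's actual implementation; the class and method names are made up for illustration.

```python
class ChangedBlockTracker:
    """Remembers which blocks changed, not how often they changed."""

    def __init__(self):
        self._dirty = set()

    def record_write(self, block_id):
        # A set ignores repeats, so 100 writes to one block count once.
        self._dirty.add(block_id)

    def build_bundle(self):
        """Return the blocks to ship for this window and start a new one."""
        bundle = sorted(self._dirty)
        self._dirty.clear()
        return bundle


tracker = ChangedBlockTracker()
for _ in range(100):
    tracker.record_write(7)  # the same block, written 100 times
tracker.record_write(3)

print(tracker.build_bundle())  # [3, 7]
print(tracker.build_bundle())  # []
```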

So this means that again our changed block rate gives us the high water mark, not a realistic estimation of what needs to be shipped, nor how often.  But again to continue with the calculation we can assume that we have at *most* 100GB to transfer, and with an example RPO of 1 hour that means we need to do a replication 24 times throughout the day.  

100GB of data divided by 24 "windows" for shipping gives us an average bundle size of a little over 4GB per window.
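Here is the same arithmetic as a small Python sketch so you can plug in your own figures.  The inputs are the example numbers from this post (2TB of VMs, half protected, 10% daily churn, 1 hour RPO), using decimal units where 1TB = 1000GB.  Remember this is the 'high water' average, not a prediction of any single window.

```python
def average_bundle_gb(total_vm_storage_gb, protected_fraction,
                      daily_churn_fraction, rpo_hours):
    """Average GB to ship per RPO window, as a high-water estimate."""
    protected_gb = total_vm_storage_gb * protected_fraction
    daily_changed_gb = protected_gb * daily_churn_fraction
    windows_per_day = 24 / rpo_hours
    return daily_changed_gb / windows_per_day


bundle = average_bundle_gb(
    total_vm_storage_gb=2000,
    protected_fraction=0.5,
    daily_churn_fraction=0.10,
    rpo_hours=1,
)
print(f"{bundle:.2f} GB per window")  # 4.17 GB per window
```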

... This will be continued in a few hours!


10/05/2011

SRM can protect anything - but should it?

There are several categories of systems that SRM should not protect.  Broadly speaking, they may be grouped into two groups.  One group contains items that should not be protected because they may have issues like corruption during replication or recovery, and the other group is for systems that don't need to be protected. 

SRM is so easy to use once it is working, that sometimes people forget that they should not protect everything.

For example, Microsoft Domain Controllers should NOT be protected.  This is due to the potential for corruption, clock drift, isolation and more in Active Directory.  It is, moreover, easy and simple to have a DC on the recovery side so there is no need to replicate it and recover one.  Sometimes people replicate DCs so that they can have a DC inside of a test recovery, but this is not a good thing.  There is a far better way to do this:  I suggest that you have a script at the beginning of a test recovery that only executes during recovery and does the following actions:

a) Turns off a virtual DC at the recovery site
b) Makes a cold clone of the DC
c) Changes the network of the clone to the test network
d) Turns on the clone
e) Turns on the DC within the test environment

This will get a DC working inside the recovery bubble, and ensures that there is no way for any info that is changed in it to get outside.  BTW this is called enhanced testing and more info can be found in http://www.vmworld.com/docs/DOC-4202 or http://blogs.vmware.com/uptime/2009/01/how-to-exploit-the-test-bubble-for-all-its-worth.html.

There are other things that should not be replicated, like:

a) Antivirus servers - they should live at the site where their managed objects are.  If they run at a remote site they use up a lot of network bandwidth.
b) Print servers - like AV servers they should be where their managed objects are.  If not they use up a lot of network bandwidth.
c) CCTV or security servers - they really are only necessary on the side where they are accumulating information.  Many other systems that can be categorized as "real-time" will fall into this category.


As well, you should really know what you need to protect, and what is required to make it work, and then protect only what is necessary, and remember things like AV, print, or DC's should already be running at the recovery side and so don't need to be protected. 

The more things that are running at the recovery side the fewer things you have to fail over, and the quicker the failover will occur!

What do you think?  Are there systems you can not protect that you would like to?  Are there other systems you think should not be protected?  Let me know!


10/04/2011

Working at the edge of SRM 5 limits

If you are working at the edge of our SRM 5 scalability limits and you are finding that you get timeout errors, then we can help with that.  SRM tries to start all of the VMs it is recovering as fast as it can.  Sometimes near our scalability limits that may be too aggressive.  If you are in that position you may experience timeout errors.  The actual errors may be with regards to scripts, VMware Tools, or uploads, but in all cases there will be a mention of timeouts in the error message.

There are a number of things to look at to tweak things so that you can increase your success during recoveries.  Please be aware these are advanced changes that should not be required for most people, and that you should be able to undo these changes if necessary, and test them carefully.  The numbers below are suggestions and you may increase them or decrease them as you need - and remember, as always, to test extensively!

You can throttle the startup of recovered VMs in two ways.  You can add a global configuration option to SRM that will throttle the VM operations in every cluster, or you can make a change to each cluster.  This means you can make one change and all clusters will start fewer VMs, but if you need to, for example if you have a very big cluster with lots of resources, you can add a configuration option to that cluster alone and start more VMs in it than in any other cluster.

The cluster change is done in DRS Advanced Options and the parameter is called srmMaxBootShutdownOps with a value of 32.  Remember that the cluster change has priority over the change in the SRM configuration file.

To change the value for all clusters you need to change the vmware-dr.xml file.  You need to add the section below to the vmware-dr.xml file.  After you make a change to the vmware-dr.xml file you must restart the SRM service to make that change active.  You do not need to use both PerCluster and PerHost below and generally just the cluster option will be required.

<Config>

<defaultMaxBootAndShutdownOpsPerCluster>32</defaultMaxBootAndShutdownOpsPerCluster>
<defaultMaxBootAndShutdownOpsPerHost>4</defaultMaxBootAndShutdownOpsPerHost>

</Config>
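If you would rather script the edit than make it by hand, here's a minimal Python sketch.  It assumes the options sit directly under the Config element as shown above, and it works on the XML as text, so reading and writing the actual vmware-dr.xml file (and backing it up first) is left to you.  Note that Python's ElementTree drops XML comments and may change formatting, and you still need to restart the SRM service afterwards.

```python
import xml.etree.ElementTree as ET


def set_config_option(xml_text, tag, value):
    """Add or update a direct child of <Config> and return the new XML."""
    root = ET.fromstring(xml_text)
    if root.tag != "Config":
        raise ValueError(f"expected <Config> as the root, found <{root.tag}>")
    element = root.find(tag)
    if element is None:
        # The option is not present yet, so append it under <Config>.
        element = ET.SubElement(root, tag)
    element.text = str(value)
    return ET.tostring(root, encoding="unicode")


# Stand-in for the file contents; a real file will contain much more.
original = "<Config></Config>"

updated = set_config_option(
    original, "defaultMaxBootAndShutdownOpsPerCluster", 32
)
print(updated)
```

Calling the function again with the same tag changes the existing value rather than adding a duplicate element.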

As well, sometimes you may see synchronization timeouts in vSphere Replication, and this can happen as a result of slow networks, or very large VMs, or large numbers of VMs.  You can change the amount of time available for synchronization using the info below so you can avoid this issue.

To change the synchronization timeout for vSphere Replication you can make the changes below in the vmware-dr.xml file.  Remember that you need to restart SRM to make this change active.

<Config>

     <hbrProvider>
         <synchronizationTimeout>3600</synchronizationTimeout>
     </hbrProvider>

</Config>

As well, sometimes when you have a large number of LUNs you will have errors when preparing storage during recovery operations, and in that case you can adjust the time for SRA storage operations.  The error for this issue is very clear and will mention the commandTimeout parameter.

You can change this in the Storage section of the Advanced Settings, and it is called commandTimeout.  The default value is 5 minutes - but it is recorded in seconds so it is 300 seconds.  Sometimes storage vendors will suggest you change this value and I think the largest I have seen is 1500 seconds.  You access Advanced Settings by right clicking on the Site in the UI.

In summary, with this information, if you are operating at or near the scalability of SRM you can adjust some configuration options so that you can decrease the number of errors during recovery so you can have a more successful recovery.

Good luck!


10/03/2011

App Aware APIs for Linux

As some of you have noticed, the APIs for Application Awareness with vSphere 5.0 were missing from the list of available downloads at VMware.com.  This issue has been fixed and the APIs for both Windows and Linux are posted.

These APIs are the same APIs you will find in NeverFail's vAppHA product and Symantec's ApplicationHA product.  With these APIs publicly available now, you can easily craft your own solution to monitor an application within a guest OS and have vSphere HA take action in the event of an application failure.  You can also add the application status to the vCenter UI.


About This Blog



This blog has moved. For the latest posts please visit: blogs.vmware.com/vsphere/uptime/
