By Jonathan McDonald
Jonathan McDonald is a Technical Solutions Architect for the Professional Services Engineering team. He currently specializes in developing architecture designs for core Virtualization and Software-Defined Storage, as well as providing best practices for upgrading and health checks for vSphere environments.
This is the second part of a two-part blog series. Click here to read Part One.
As I mentioned in the first blog of this series, I want to start off by saying that all of the details here are based on my own personal experiences. This is not meant to be a comprehensive guide to setting up stretch clustering for Virtual SAN, but rather a set of pointers to show the type of detail most commonly asked for. Hopefully it will help you prepare for any projects of this type.
Continuing on with the configuration, the next set of questions concerned networking!
Networking, Networking, Networking
With sizing and configuration behind us, the next step was to enable Virtual SAN and set up the stretch clustering. As soon as we turned it on, however, we got the infamous “Misconfiguration Detected” message for the networking.
In almost all engagements I have been a part of, this has been a problem, even though the networking team said everything was already set up and configured. This always becomes a fight, but it gets easier with the new Health UI and its multicast checks. Generally, when multicast is not configured properly, you will see something similar to the screenshot shown below.
It definitely makes the process of going to the networking team easier. The added bonus is there are no messy command line syntaxes needed to validate the configuration. I can honestly say the health interface is one of the best features introduced for Virtual SAN!
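That said, if you do want to cross-check from the command line, each host can report the Virtual SAN network configuration directly; this is a quick way to hand the networking team the exact multicast addresses in use (field names vary slightly by release):

```shell
# Show which vmknics Virtual SAN is using on this host, along with the
# agent/master multicast group addresses the hosts must be able to reach.
# Run on an ESXi host; output can be compared against the switch IGMP config.
esxcli vsan network list
```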
Once we had this configured properly, the cluster came online and we were able to complete the configuration, including stretch clustering, the proper vSphere high availability settings, and the affinity rules.
The final question that came up on the networking side was about the recommendation that L3 is the preferred communication mechanism to the witness host. The big issue with L2 is the potential that, in the event of a failure, traffic could be redirected through the witness site, which typically has a substantially lower-bandwidth link. A great description of this concern is in the networking section of the Stretched Cluster Deployment Guide.
In any case, the networking configuration is definitely more complex in stretched clustering because the networking spans multiple sites. Therefore, it is imperative that it is configured correctly, not only to ensure that performance is at peak levels, but to ensure there is no unexpected behavior in the event of a failure.
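As a concrete sketch of the L3 approach, witness traffic is typically steered with a static route on each data-site host so it does not try to use the default gateway. The subnet and gateway below are hypothetical placeholders, not values from this engagement:

```shell
# Hypothetical example: route the witness network (192.168.100.0/24 here)
# via the site's L3 gateway (10.0.0.1 here) on each data-site host.
esxcli network ip route ipv4 add -n 192.168.100.0/24 -g 10.0.0.1

# Confirm the route is in place:
esxcli network ip route ipv4 list
```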
High Availability and Provisioning
All of this talk finally led to the conversation about availability. The beautiful thing about Virtual SAN is that with the “failures to tolerate” setting, you can ensure there are between one and three copies of the data available, depending on what is configured in the policy. Gone are the long conversations of trying to design this into a solution with proprietary hardware or software.
One difference with stretch clustering is that the maximum "failures to tolerate" is one. This is because we have three fault domains: the two sites and the witness. Logically, when you look at it, it makes sense: more than that is not possible with only three fault domains. The idea here is that there is a full copy of the virtual machine data at each site. This allows for failover in case an entire site fails, as components are stored according to site boundaries.
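For reference, the default policy a host applies when no storage policy is specified can be inspected from the command line. Normally you would manage this through Storage Policy-Based Management in the Web Client rather than per host, so treat this as a read-only sanity check more than a configuration method:

```shell
# Show the default Virtual SAN policy for each object class on this host:
esxcli vsan policy getdefault

# Example: set the default for virtual disks to tolerate one failure --
# the maximum possible with the three fault domains of a stretched cluster.
esxcli vsan policy setdefault -c vdisk -p '(("hostFailuresToTolerate" i1))'
```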
Of course, high availability (HA) needs to be aware of this. The way this is configured from a vSphere HA perspective is to use the admission control policy that reserves a percentage of cluster resources, and set both CPU and memory to 50 percent:
This may seem like a LOT of resources, but when you think of it from a site perspective, it makes sense; if an entire site fails, the virtual machines from that site will be able to restart in the surviving site without issues.
The question came up as to whether we allow more than 50 percent to be assigned. You can configure more than half of the resources to be consumed, but if a failure occurs, not all virtual machines may be able to restart. This is why reserving 50 percent of resources is recommended. If you do want to run virtual machines on more than 50 percent of the resources, it is still possible, but not recommended; this configuration generally consists of setting restart priorities so HA starts as many virtual machines as possible, beginning with the most critical ones. Personally, I recommend not going above 50 percent for a stretch cluster.
An additional question came up about using host and virtual machine affinity rules to control the placement of virtual machines. Unfortunately, assignment to these groups is not easy during the provisioning process and did not fit cleanly into the virtual machine provisioning practices used in the environment. vSphere Distributed Resource Scheduler (DRS) does a good job of ensuring balance, but more control was needed than simply relying on DRS to balance the load. The end goal was that, during provisioning, staff could have virtual machines placed in the appropriate site automatically.
This discussion boiled down to the need for a change to provisioning practices. Currently, it is a manual configuration change, but it is possible to use automation such as vRealize Orchestrator to automate deployment appropriately. This is something to keep in mind when working with customers to design a stretch cluster, as changes to provisioning practices may be needed.
Finally, after days of configuration and design decisions, we were ready to test failures. This is always interesting because the conversation always varies between customers. Some require very strict testing and want to test every scenario possible, while others are OK doing less. After talking it over we decided on the following plan:
- Host failure in the secondary site
- Host failure in the primary site
- Witness failure (both network and host)
- Full site failure
- Network failures
- Witness to site
- Site to site
- Disk failure simulation
- Maintenance mode testing
This was a good balance of tests to show exactly what the different failures look like. Prior to starting, I always go over the health status windows for Virtual SAN, as they update very quickly to show exactly what is happening in the cluster.
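Alongside the health status windows, a quick sanity check from any host before and after each test confirms cluster membership from the command line:

```shell
# Report this host's view of the Virtual SAN cluster: its role
# (master/backup/agent) and the list of sub-cluster member UUIDs.
# After a site or host failure, the membership list should shrink
# accordingly, and recover once the failure is resolved.
esxcli vsan cluster get
```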
The customer was really excited about how seamlessly Virtual SAN handles errors. The key is to operationally prepare and ensure the comfort level is high with handling the worst-case scenario. At first glance, host and network failures look very similar, so I suggested running through several similar tests just to make sure the results were being interpreted accurately.
As an example, one of the most common failure tests requested (which many organizations don’t test properly) is simulating what happens if a disk fails in a disk group. Simply pulling a disk out of the server does not replicate what would happen if a disk actually fails, as a completely different mechanism is used to detect this. You can use the following commands to properly simulate a disk actually failing by injecting an error. Follow these steps:
- Identify the disk device in which you want to inject the error. You can do this by using a combination of the Virtual SAN Health User Interface, and running the following command from an ESXi host and noting down the naa.<ID> (where <ID> is a string of characters) for the disk:
esxcli vsan storage list
- Navigate to /usr/lib/vmware/vsan/bin/ on the ESXi host.
- Inject a permanent device error to the chosen device by running:
python vsanDiskFaultInjection.pyc -p -d <naa.id>
- Check the Virtual SAN Health User Interface. The disk will show as failed, and the components will be relocated to other locations.
- Once the re-sync operations are complete, remove the permanent device error by running:
python vsanDiskFaultInjection.pyc -c -d <naa.id>
- Once completed, remove the disk from the disk group and uncheck the option to migrate data. (This is not strictly required, because the data was already migrated when the disk failed.)
- Add the disk back to the disk group.
- Once this is complete, all warnings should be gone from the health status of Virtual SAN.
Note: Be sure to acknowledge and reset any alarms to green.
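Under the assumptions above (run directly on an ESXi host in a test environment, with naa.XXXX replaced by the real device ID), the whole procedure can be sketched as a single script:

```shell
#!/bin/sh
# Simulate a permanent disk failure in a Virtual SAN disk group.
# Test environments only -- this injects a real device error.
DISK="naa.XXXX"   # replace with the ID noted from "esxcli vsan storage list"

# Confirm the device is claimed by Virtual SAN.
esxcli vsan storage list | grep -A 2 "$DISK"

# Inject a permanent device error; the Health UI will show the disk as
# failed, and components will relocate to other locations.
cd /usr/lib/vmware/vsan/bin/ || exit 1
python vsanDiskFaultInjection.pyc -p -d "$DISK"

# After the re-sync operations complete, clear the injected error.
python vsanDiskFaultInjection.pyc -c -d "$DISK"
```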
After performing all the tests in the above list, the customer had a very good feeling about the Virtual SAN implementation and their ability to operationally handle a failure should one occur.
Last, but not least, was performance testing. Unfortunately, while I was onsite for this one, the 10G networking was not available. I would not recommend a gigabit network for most configurations, but since we were not yet in full production mode, we went through many of the performance tests anyway to establish a baseline of what performance would look like on the gigabit network.
Briefly, because I could write an entire book on performance testing, the quickest and easiest way to test performance is with the Proactive Tests menu which is included in Virtual SAN 6.1:
It provides a really good mechanism to test different types of workloads that are most common – all the way from a basic test, to a stress test. In addition, using IOmeter for testing (based on environmental characteristics) can be very useful.
In this case, to give you an idea of performance test results, we were pretty consistently getting a peak of around 30,000 IOPS with the gigabit network with 10 hosts in the cluster. Subsequently, I have been told that once the 10G network was in place, this actually jumped up to a peak of 160,000 IOPS for the same 10 hosts. Pretty amazing to be honest.
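Those numbers line up with simple back-of-the-envelope arithmetic. Assuming an average I/O size of 4 KiB (an assumption on my part, not a measured value from this engagement), 30,000 IOPS is essentially the wire ceiling of a single gigabit link:

```shell
# Rough arithmetic only: 30,000 IOPS at an assumed 4 KiB per I/O,
# compared against the theoretical ceiling of a 1 Gbit/s link.
awk 'BEGIN {
  iops    = 30000
  io_size = 4096                      # bytes per I/O (assumed)
  mib_s   = iops * io_size / 1048576  # achieved throughput in MiB/s
  gige    = 1e9 / 8 / 1048576         # gigabit wire ceiling in MiB/s
  printf "throughput: %.0f MiB/s (gigabit ceiling: %.0f MiB/s)\n", mib_s, gige
}'
# prints: throughput: 117 MiB/s (gigabit ceiling: 119 MiB/s)
```

So the gigabit baseline was network-bound, which is consistent with the jump to 160,000 IOPS once 10G was in place.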
I will not get into the ins and outs of testing, as it very much depends on the area you are testing. I did want to show, however, that it is much easier to perform a lot of the testing this way than it was using the previous command line method.
One final note I want to add in the performance testing area is that one of the key things (other than pure “my VM goes THISSSS fast” type tests), is to test the performance of rebalancing in the case of maintenance mode, or failure scenarios. This can be done from the Resyncing Components Menu:
Boring by default perhaps, but when you either migrate data in maintenance mode or change a storage policy, this view shows what the impact to resyncing components will be. Resync activity appears either when an additional disk stripe is being created for a disk, or when data is being fully migrated off a host entering maintenance mode. The compliance screen will look like this:
Resyncing can represent a significant amount of time and bandwidth, so this view is incredibly useful when testing normal workloads such as data migration during the enter maintenance mode workflow. Full migrations of data can be incredibly expensive, especially if the disks are large, or if you are using gigabit rather than 10G networks. This allows customers to plan for the amount of data to be moved while in maintenance mode, or in the case of a failure.
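If you prefer the command line, the Ruby vSphere Console (RVC) on the vCenter Server exposes a similar resync view; the inventory path below is an illustrative placeholder, not the customer's actual cluster path:

```shell
# From an RVC session on the vCenter Server, show the bytes left to
# resync per object (cluster inventory path is hypothetical):
vsan.resync_dashboard /localhost/Datacenter/computers/StretchedCluster
```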
Well, that is what I have for this blog post. Again, this is obviously not a conclusive list of all decision points or anything like that; it’s just where we had the most discussions that I wanted to share. I hope this gives you an idea of the challenges we faced, and can help you prepare for the decisions you may face when implementing stretch clustering for Virtual SAN. This is truly a pretty cool feature and will provide an excellent addition to the ways business continuity and disaster recovery plans can be designed for an environment.