Introduction
A Windows Server Failover Cluster (WSFC) should never be deployed on top of a highly available and resilient virtual infrastructure without a purpose, and that purpose is almost always tied to Mission Critical or Business Critical Applications. Such applications require not only highly available hardware, but also protection from software failures (a service that cannot start, or other guest OS corruption) and minimal downtime during maintenance (such as patching the operating system or rolling out a new version of the application). While planning, deploying, and operating a WSFC, it’s vital to ensure that the WSFC configuration is error free and has full vendor support. Many resources describe how to build a WSFC on VMware vSphere, vSAN, and particularly on VMware Cloud on AWS; however, the importance of the WSFC Validate a Configuration Wizard (Validation Wizard) in creating a supported WSFC is often underestimated.
In this blog we will discuss how to prepare VMware vSphere Virtual Machines (VMs), execute the tests, interpret the results, and arrive at a supported configuration on vSAN. We will use VMware Cloud on AWS as the example infrastructure, with vSAN as its default storage type. While all the categories of the wizard are equally important, we will pay most attention to the storage category, as we have seen a lot of confusion and questions from our customers around shared disks for WSFC.
Preparing VMs to host a WSFC
Before you can start the Validation Wizard, you should invest some time in preparing the environment. The steps below cover only the basics; please refer to the documentation for more details.
- Minimum of two VMs with exactly the same virtual hardware configuration.
- Windows OS deployed and updated.
- Failover Cluster role deployed.
- Shared disks presented to all VMs participating in the cluster. To present a shared disk:
  - On the first VM:
    - Create a new virtual SCSI controller of the type VMware Paravirtual (PVSCSI).
      Note: Never attach a shared disk to SCSI controller 0 or to any other virtual controller hosting the boot disk of your VM!
    - Set SCSI bus sharing for this new SCSI controller to physical.
    - Add a new VMDK disk from the Workload datastore.
    - Attach the VMDK to the newly created controller. DO NOT use the “multi-writer” flag.
    - Set the Disk Mode to Independent-persistent.
  - On all subsequent VMs:
    - Create a new virtual SCSI controller of the type VMware Paravirtual (PVSCSI).
    - Set SCSI bus sharing for this new SCSI controller to physical.
    - Add an existing hard disk by choosing the VMDK created on the first VM.
    - Set the Disk Mode to Independent-persistent.
Note: Deploying a WSFC on a stretched cluster SDDC on VMware Cloud on AWS is a supported configuration. For more details you can consult this post; the information is applicable to both vSAN and VMware Cloud on AWS.
To validate your cluster, it is sufficient to provision a single, small VMDK (for example, 1 GB). After the cluster is created, you can use this disk as the quorum disk. You are not required to have all the disks presented at validation time if they all reside on the same datastore.
Note: If you encounter the error “File System specific implementation of OpenFile[file] failed …” while powering on the second VM, check that the shared disk is attached to the newly created SCSI controller and that SCSI bus sharing for this controller is set to physical.
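The attachment rules above can be summarized as a small checklist. The following Python sketch is only an illustration: the `SharedDiskAttachment` record and its field names are hypothetical and do not correspond to any VMware API. It assumes you have already collected the controller and disk settings of each VM by some other means.

```python
from dataclasses import dataclass


@dataclass
class SharedDiskAttachment:
    """Hypothetical description of how one VM attaches a shared VMDK."""
    vm_name: str
    controller_number: int   # index of the virtual SCSI controller
    controller_type: str     # e.g. "pvscsi"
    bus_sharing: str         # "none", "virtual", or "physical"
    disk_mode: str           # e.g. "independent_persistent"
    multi_writer: bool       # True if the multi-writer flag is set
    hosts_boot_disk: bool    # True if the controller also hosts the boot disk


def check_attachment(att: SharedDiskAttachment) -> list[str]:
    """Return a list of rule violations; an empty list means no issue was found."""
    problems = []
    if att.controller_number == 0 or att.hosts_boot_disk:
        problems.append("shared disk is on SCSI controller 0 or on the boot disk controller")
    if att.controller_type != "pvscsi":
        problems.append("controller type is not VMware Paravirtual (PVSCSI)")
    if att.bus_sharing != "physical":
        problems.append("SCSI bus sharing is not set to physical")
    if att.multi_writer:
        problems.append("multi-writer flag is set")
    if att.disk_mode != "independent_persistent":
        problems.append("disk mode is not Independent-persistent")
    return problems


# Made-up sample data: node1 follows the rules, node2 does not.
node1 = SharedDiskAttachment("node1", 1, "pvscsi", "physical", "independent_persistent", False, False)
node2 = SharedDiskAttachment("node2", 0, "pvscsi", "none", "independent_persistent", False, True)

for att in (node1, node2):
    print(att.vm_name, check_attachment(att) or "OK")
```

A check like this is most useful before powering on the second VM, since the two misconfigurations on node2 are the ones the note above names as likely causes of the OpenFile error.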
After the shared disks are attached to all VMs, ensure that they are present in the Disk Management mmc. Initially, a newly added shared disk is recognized as offline and non-initialized.
If you run the Validation Wizard at this stage, and no disks other than the boot disk are available, the Validation Wizard skips the storage category, marking it as “Non applicable”.
Drilling down to the Storage Category reveals the reason: WSFC is not able to work with RAW disks.
You should Initialize the disk and bring it online.
This step should be performed on only one node of the cluster (remember, at this point WSFC does NOT control access to the disk, and you can corrupt the disk by accessing it from other nodes). On all other nodes the shared disk should be visible as Initialized and Offline.
It’s not required to create a file system and assign a drive letter for the disk selected for the validation. However, doing so does not break the wizard, and you would still be able to validate the disk.
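The expected disk state before validation can be expressed as a simple rule: every node sees the disk as initialized, and no more than one node has it online. The sketch below assumes you have noted the state shown in Disk Management on each node by hand; the `states` dictionary and the node names are made up for illustration.

```python
def check_disk_states(states):
    """states maps node name -> (initialized, online) as read from Disk Management."""
    problems = []
    for node, (initialized, _online) in states.items():
        if not initialized:
            problems.append(f"{node}: disk is not initialized")
    # More than one node with the disk online risks corruption,
    # because WSFC does not yet arbitrate access to it.
    online_nodes = [node for node, (_init, online) in states.items() if online]
    if len(online_nodes) > 1:
        problems.append("disk is online on more than one node: " + ", ".join(sorted(online_nodes)))
    return problems


# Made-up sample: node1 initialized the disk and holds it online, node2 sees it offline.
print(check_disk_states({"node1": (True, True), "node2": (True, False)}))
```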
Running the Validation Wizard
Now you are ready to run the Validation Wizard. To do so, open the Failover Cluster Manager, select Validate Configuration under Management, and follow the instructions.
Upon completion of the wizard, you are presented with the results, displayed as basic HTML in the browser of your choice. The report is stored by default in C:\Users\<%UserName%>\AppData\Local\Temp. It’s recommended to save the *.htm file: you might need this report if you have to contact vendor support later.
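Because the temp folder is cleaned up from time to time, it can help to copy the report somewhere permanent right away. The sketch below is one possible way to do that in Python; it assumes the newest *.htm file in the folder is the validation report, which you should confirm by opening it. The function name and the archive folder are hypothetical.

```python
import shutil
from pathlib import Path


def archive_latest_report(temp_dir, archive_dir):
    """Copy the most recently modified *.htm file from temp_dir into archive_dir.

    Returns the path of the copy, or None if temp_dir contains no *.htm file.
    """
    temp_dir = Path(temp_dir)
    archive_dir = Path(archive_dir)
    reports = sorted(temp_dir.glob("*.htm"), key=lambda p: p.stat().st_mtime)
    if not reports:
        return None
    archive_dir.mkdir(parents=True, exist_ok=True)
    target = archive_dir / reports[-1].name
    shutil.copy2(reports[-1], target)  # copy2 also preserves the modification time
    return target
```

On a cluster node you might call it with `Path.home() / "AppData" / "Local" / "Temp"` as the source folder and a folder of your choice as the destination.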
Analyzing the results
Let us dig into the results. If any category shows Error in the Description column, WSFC will not allow you to create the cluster: you must fix all the errors highlighted in red before moving forward. Warnings are different: some can be tolerated, while others must be fixed before you proceed with the cluster creation.
While you may encounter many different warnings, let us discuss two that are expected and do not affect the supportability of the configuration.
Storage – Validate Storage Spaces Persistent Reservation
With the introduction of Storage Spaces, Microsoft added new checks to validate Storage Spaces requirements. These tests are only valid for Storage Spaces and have no impact on shared disks. VMware Cloud on AWS and VMware vSAN do not support the PERSISTENT RESERVE OUT Register (00h) persistent reservation command, and Storage Spaces is not a valid configuration for WSFC on VMware Cloud on AWS or on any vSAN datastore. This warning is not applicable to a WSFC deployed on VMware Cloud on AWS and can be safely ignored. Check this article on Microsoft Tech Community for more details.
Network – Validate Network Communication
There is a lot of confusion around this subcategory when deploying a WSFC in virtual environments. While absolutely valid if the nodes of a WSFC are physical, this check is not applicable to VMs. In virtualized environments, the vNICs of a VM use the physical NICs of your ESXi host to communicate with clients and with other nodes. Adding more vNICs or moving WSFC heartbeats to a separate vNIC would not improve network availability and might just complicate the configuration. You can safely ignore this warning.
Any other warnings not described above should be fixed before deploying a production instance of WSFC.
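The triage logic described in this section can be sketched in a few lines of Python. This is an illustration only: it assumes you have transcribed the test names and their outcomes from the report into a list by hand, and the status strings "Success", "Warning", and "Error" are placeholders rather than values read from the report itself.

```python
# The two warnings discussed above, which are expected on vSAN and VMware Cloud on AWS.
EXPECTED_WARNINGS = {
    "Validate Storage Spaces Persistent Reservation",
    "Validate Network Communication",
}


def triage(results):
    """Sort (test_name, status) pairs into blocking, must-fix, and expected findings."""
    buckets = {"blocking": [], "must_fix": [], "expected": []}
    for test_name, status in results:
        if status == "Error":
            # Errors prevent cluster creation altogether.
            buckets["blocking"].append(test_name)
        elif status == "Warning":
            key = "expected" if test_name in EXPECTED_WARNINGS else "must_fix"
            buckets[key].append(test_name)
    return buckets


# Made-up sample results for illustration.
sample = [
    ("List Disks", "Success"),
    ("Validate Storage Spaces Persistent Reservation", "Warning"),
    ("Validate Network Communication", "Warning"),
    ("Validate Disk Failover", "Warning"),
]
print(triage(sample))
```

With this sample, only the disk failover warning lands in the must-fix bucket, which matches the rule that any warning not described above should be resolved before going to production.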
Additional Considerations
While troubleshooting possible issues with storage, pay close attention to the subcategories List Disks and List Disks To Be Validated in the Storage category. The way the Validation Wizard numbers disks differs from both the Windows OS and VMware vSphere. The List Disks subcategory reflects the order and numbering of disks as seen in the Disk Management mmc.
However, the List Disks To Be Validated subcategory assigns “Test Disk 0” to the first shared disk. In our example, Test Disk 0 corresponds to Disk Number 1 from the List Disks table. All further references in the storage category use the disk numbers shown under List Disks To Be Validated. Do not confuse Test Disk 0 with disk number 0!
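To make the two numbering schemes concrete, the sketch below rebuilds the mapping from a hand-transcribed List Disks table. It assumes that test disk indices are handed out in List Disks order to the shared disks only, which is consistent with the example above but is not something this sketch can verify; the function name and the input format are hypothetical.

```python
def map_test_disks(disks):
    """Map Test Disk index -> Disk Management disk number.

    disks is a list of (disk_number, is_shared) pairs in List Disks order.
    """
    shared = [number for number, is_shared in disks if is_shared]
    return dict(enumerate(shared))


# Disk 0 is the boot disk; disks 1 and 2 are shared (made-up sample).
print(map_test_disks([(0, False), (1, True), (2, True)]))  # {0: 1, 1: 2}
```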
Summary
This blog highlights the importance of WSFC validation and provides recommendations on how to properly prepare shared storage and test your VMs hosting a WSFC on VMware vSAN and VMware Cloud on AWS. Make sure to follow the recommendations outlined in the WSFC on VMware vSphere Deployment Guide and in the VMware Cloud on AWS documentation.
Happy Clustering in the New Year!