In a little more than 4 years, VMware Hyperconverged Infrastructure (HCI) powered by vSAN has accumulated over 17K customers. One of the powerful tools that has helped enable this rapid vSAN adoption is Proof of Concept (PoC) testing.
Converging HCI Buyers
Hyperconverged Infrastructure has evolved to now incorporate all major data center functions including software-defined compute, storage, networking, security and cloud management to deliver a single, ubiquitous control plane agnostic of hardware, transforming enterprise IT in terms of acquisition, consumption, and operation.
Traditionally, server buyers, storage buyers, and application buyers are different crowds with very diverse requirements. HCI is breaking silos and bringing these different groups of buyers together to evaluate and choose a solution that converges their requirements at different layers of the full stack. A few examples include:
- The server team that normally puts a high priority on Power, Space, and Cooling (PSC) may need to opt for a rack server configuration that accommodates storage density requirements instead of a blade architecture.
- The storage admins who tend to be conservative by using mature technologies to minimize the risk of losing data may now embrace server platforms that enable fast adoption of innovations in compute and storage such as composable infrastructure and NVDIMM.
- The application architects who in the past preferred local storage for performance and relied on an application’s built-in replication for availability will need to rethink performance with distributed, shared storage while also considering complementary data availability features provided at the storage layer for the best design.
As a result, today HCI is mostly evaluated as “IT infrastructure” according to our customer surveys.
Evaluating HCI with PoC Testing
Customers need to take an integrated view of an HCI solution to determine if the solution can satisfy IT infrastructure requirements across the stack. PoC testing is one of the most effective ways for customers to evaluate a solution objectively against common criteria. Regardless of whether PoCs are customer self-driven, conducted by partners, or initiated with VMware field personnel, VMware HCI specialists and PoC solution architects are always ready to engage and assist.
The Power of Choice
Choosing the appropriate hardware for a PoC is one of the most important factors in the successful validation of vSAN. Below is a list of the more common options for vSAN PoCs:
Bring your own: Organizations considering vSAN for existing workloads can choose their existing hardware. A key benefit of this option is that vSAN is validated against the success criteria on the hardware that will actually run it, so there are no surprises.
Virtual PoCs: Organizations solely interested in seeing vSAN functionality may be interested in the Virtual PoC. This is a virtual environment and is not a true test of performance or hardware compatibility, but it can help stakeholders feel more comfortable using vSAN. Please contact your VMware HCI specialist to take advantage of our “Test Drive” environment.
Loaner PoC: The vSAN PoC team maintains a collection of loaner gear to validate vSAN when there is no hardware available for testing. Please contact your VMware HCI specialist to take advantage of this option.
Hosted PoCs: Many resellers, partners, distributors, and OEMs recognize the power of vSAN and have procured hardware that they make available to current and future customers for conducting vSAN proofs of concept.
Try and Buy: Whether a VxRail or a vSAN ReadyNode, many partners will provide hardware for a vSAN PoC as a “try and buy” option.
The most important aspects to validate in a Proof of Concept are:
- Successful vSAN configuration and deployment
- VMs successfully deployed to vSAN Datastore
- Reliability: VMs and data remain available in the event of failure (host, disk, network, power)
- Serviceability: Maintenance of hosts, disk groups, disks, clusters
- Performance: vSAN and the selected hardware can meet both application and business needs
- Validation: vSAN data services working as expected (dedupe/compression, RAID-5/6, checksum, encryption)
- Day 2 Operations: Monitoring, management, troubleshooting, upgrades
These can be grouped into three common types of vSAN PoC testing: resiliency testing, performance testing, and operational testing.
Operational testing is a critical part of a vSAN PoC. Understanding how the solution behaves during day-2 operations is an important part of the evaluation. Fortunately, because vSAN is embedded in the hypervisor, many vSAN operations are also vSphere operations. Adding hosts, migrating VMs between nodes, and creating clusters are some of the many operations that are consistent between vSphere and vSAN, resulting in a smaller learning curve and eliminating the need for storage specialists.
Some of the Operational Tests include:
- Adding hosts to a vSAN Cluster
- Adding disks to a vSAN node
- Create/Delete a Disk Group
- Clone/vMotion VMs
- Create/edit/delete storage policies
- Assign storage policies to individual objects (VMDK, VM Home)
- Monitoring vSAN
  - Embedded vR Ops (vSAN 6.7 and above)
  - Performance Dashboard on H5 client
  - Monitor Resync components
  - Monitor via vRealize Log Insight
- Put vSAN nodes in Maintenance Mode
  - Evacuate Disks
  - Evacuate Disk Groups
For more information about operational tests, please see the following sections of the vSAN PoC Guide:
- Basic vSphere Functionality on vSAN
- Scale Out vSAN
- Monitoring vSAN
- vSAN Storage Policies
- vSAN Management Tasks
Of all the PoC tests, performance testing tends to get the most attention during a vSAN PoC. The goal is to determine whether vSAN and the selected hardware can meet both application and business needs.
Prior to conducting a performance test, it is important to have a clear direction on whether it is a benchmark test or real application test.
Ideally, testing against a clone of the production environment will yield the most accurate results. Understanding the applications and the use case is important because this determines the storage policies for objects and/or VMs.
On the other hand, a synthetic benchmark test can also be helpful because it tends to be faster: there is no need to clone production VMs or do additional configuration. However, synthetic tests require that you understand the workload profile to be tested, such as block size, read/write percentage, and sequential/random percentage. Conducting such a test requires knowledge of testing tools like Oracle’s vdbench. There is a fling available called HCIBench that automates the deployment of Linux VMs with vdbench as a way to generate load on the cluster. HCIBench also provides a web UI for configuration and a results view. HCIBench will create the desired number of VMs with the desired number of VMDKs in just minutes. It is a great tool that allows for faster testing using well-known industry benchmarking tools.
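To make the idea of a workload profile concrete, the sketch below renders a minimal vdbench parameter file from a block size, read percentage, and random percentage. The `WorkloadProfile` class, the function name, and the device paths are hypothetical and only for illustration. HCIBench normally generates these files for you, so treat this as a way to see how the profile values map onto vdbench settings rather than as a required step.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class WorkloadProfile:
    """Synthetic workload description (field names are illustrative)."""
    block_size: str   # transfer size, e.g. "4k"
    read_pct: int     # percentage of I/O that is reads (0-100)
    random_pct: int   # percentage of I/O that is random (0-100)
    duration_s: int   # run time in seconds


def render_vdbench_params(profile: WorkloadProfile, devices: Sequence[str]) -> str:
    """Build the text of a vdbench parameter file for the given raw devices."""
    for pct in (profile.read_pct, profile.random_pct):
        if not 0 <= pct <= 100:
            raise ValueError("percentages must be between 0 and 100")

    lines = []
    # One storage definition (sd) per device under test.
    for i, dev in enumerate(devices, start=1):
        lines.append(f"sd=sd{i},lun={dev},openflags=o_direct")
    # One workload definition (wd) applied to every sd.
    lines.append(
        f"wd=wd1,sd=sd*,xfersize={profile.block_size},"
        f"rdpct={profile.read_pct},seekpct={profile.random_pct}"
    )
    # One run definition (rd): uncapped I/O rate, reporting every 5 seconds.
    lines.append(
        f"rd=run1,wd=wd1,iorate=max,elapsed={profile.duration_s},interval=5"
    )
    return "\n".join(lines)


if __name__ == "__main__":
    # Assumed example: 4 KB blocks, 70% reads, fully random, 10 minutes.
    profile = WorkloadProfile("4k", 70, 100, 600)
    print(render_vdbench_params(profile, ["/dev/sdb", "/dev/sdc"]))
```

The same profile values are what you would enter when configuring a synthetic run, so it is worth agreeing on them with the application owners before the PoC starts.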
For more information about HCIBench, please refer to the following blog series.
Conducting resiliency testing on a vSAN cluster is an eye-opening experience. By default, vSAN protects your information with 2 replicas of data, based on the vSAN default storage policy. As the number of nodes increases, you can increase the number of failures to tolerate.
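The relationship between cluster size and failures to tolerate can be sketched as simple arithmetic. The example below assumes RAID-1 mirroring only, where tolerating n failures takes n+1 replicas and at least 2n+1 hosts (the extra hosts hold witness components). It deliberately ignores erasure coding and two-node clusters with an external witness, and the function names are hypothetical.

```python
MAX_MIRRORING_FTT = 3  # upper limit for failures to tolerate with mirroring


def mirroring_requirements(failures_to_tolerate: int) -> dict:
    """Replicas and minimum host count for a RAID-1 mirroring policy."""
    if not 0 <= failures_to_tolerate <= MAX_MIRRORING_FTT:
        raise ValueError("failures_to_tolerate must be between 0 and 3")
    return {
        "replicas": failures_to_tolerate + 1,
        "min_hosts": 2 * failures_to_tolerate + 1,
    }


def max_ftt_for_hosts(hosts: int) -> int:
    """Highest mirroring FTT a cluster of this size can satisfy."""
    if hosts < 1:
        raise ValueError("a cluster needs at least one host")
    return min(MAX_MIRRORING_FTT, (hosts - 1) // 2)


if __name__ == "__main__":
    for hosts in (3, 4, 5, 7):
        ftt = max_ftt_for_hosts(hosts)
        print(hosts, "hosts ->", ftt, mirroring_requirements(ftt))
```

In a PoC, this is a quick check that the test cluster has enough hosts for the policy you intend to validate, with a spare host if you also want to observe rebuilds after a failure.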
Just like with any other storage solution, failures can occur on different components at any time due to age, temperature, firmware, etc. Such failures can occur at the storage controller level, disks, nodes, and network devices among others. A failure on any of these components can manifest itself as a failure in vSAN.
When a failure occurs, the objects can go into an absent state or a degraded state. Depending on the state of the components after the failure, they will either rebuild immediately (degraded) or wait for the timeout (absent). By default, the repair delay is set to 60 minutes because vSAN cannot be certain whether the failure is transient or permanent. One of the common tests is physically removing a drive from a live vSAN node. In this scenario, vSAN sees that the drive is missing and declares a failure. vSAN doesn’t know if the missing drive will return, so the objects on the drive are put in an absent state. vSAN notes the failure, updates the state of the objects, and starts the 60-minute repair timer. If the drive does not come back within that time, vSAN rebuilds the objects to restore policy compliance. If the drive was mistakenly pulled and is put back within the 60 minutes, there is no rebuild, and after a quick sync of metadata the objects are healthy again.
In the case of an actual drive failure (permanent device loss, PDL), vSAN receives error codes from the device, marks the drive as degraded, and begins the repair immediately.
Whether or not you have access to the physical nodes, testing a failed drive can be hard unless a drive happens to die during the PoC. Fortunately, ESXi includes a Python script called vsanDiskFaultInjection.pyc that allows you to inject various error codes to generate both absent and degraded states.
Apart from disk failure testing, we also recommend including the following tests to better understand the resiliency of vSAN:
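The difference between the absent and degraded paths can be modeled in a few lines. This is only an illustrative sketch of the decision described above, not vSAN code: the names are hypothetical, and the 60-minute value is the default repair delay mentioned in the text.

```python
from enum import Enum
from typing import Optional

DEFAULT_REPAIR_DELAY_MIN = 60  # default repair delay described above


class ComponentState(Enum):
    ABSENT = "absent"      # device disappeared; it may come back
    DEGRADED = "degraded"  # device reported a permanent error


def classify_failure(permanent_error_reported: bool) -> ComponentState:
    """A pulled drive reports nothing (absent); a dying drive returns errors."""
    if permanent_error_reported:
        return ComponentState.DEGRADED
    return ComponentState.ABSENT


def rebuild_start_minute(
    state: ComponentState,
    returned_after_min: Optional[float] = None,
    repair_delay_min: float = DEFAULT_REPAIR_DELAY_MIN,
) -> Optional[float]:
    """Minutes after the failure at which a rebuild starts, or None if no
    rebuild is needed because the device returned before the timer expired."""
    if state is ComponentState.DEGRADED:
        return 0  # repair begins immediately
    if returned_after_min is not None and returned_after_min < repair_delay_min:
        return None  # drive came back in time; only a metadata resync is needed
    return repair_delay_min  # timer expired; rebuild to restore compliance


if __name__ == "__main__":
    pulled = classify_failure(permanent_error_reported=False)
    print(rebuild_start_minute(pulled, returned_after_min=20))  # drive reinserted
    print(rebuild_start_minute(pulled))                         # drive never returns
    failed = classify_failure(permanent_error_reported=True)
    print(rebuild_start_minute(failed))                         # real drive failure
```

When planning a drive-pull test, this is why the observation window needs to be longer than the repair delay if you want to see the rebuild itself.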
- Simulate node failure with HA enabled
- Introduce network outage
  - With and without redundancy
  - Physical cable pull
  - Network switch failure
- vCenter failure
  - Not considered a vSAN failure, as vSAN keeps running
For additional information about Failure Testing please refer to the vSAN PoC Guide.
As customers evaluate HCI solutions, it’s important to determine if the solution can satisfy their IT infrastructure requirements across the stack. PoC testing is one of the most effective ways to evaluate a solution objectively against common criteria. Regardless of whether the PoC is customer self-driven, conducted by partners, or initiated with VMware field personnel, VMware HCI specialists and PoC solution architects are always ready to engage and assist.