It is the time of year to prepare the infrastructure for the Hands-on Labs at the annual VMware conference in the United States.
The conference may have a new name this year, but the Hands-on Labs will still be there for you. We have been a cornerstone of VMware conferences since the beginning. Each year, our team of full-time employees, contractors, volunteers, and interns works to create the best experience possible.
My team is known as vPod Architecture, which is a bit of an odd name and may not mean much to those outside our program. We handle the platforms used to build and host our lab environments — we call these vPods. We are the part of Hands-on Labs (HOL) that works with VMware’s internal Cloud Platform Engineering (CPE) team to manage the cloud capacity, and with the VMware Lab Platform (VLP) team, which provides the front-end interface, pool-of-running-labs management, and more. We sit in between these groups, providing the content and a framework to ensure labs are deployed in a logical, standardized, and reliable manner. All teams involved work with VMware Performance Engineering to bring our latest content to the conference.
This is a drastic simplification, but gives a rough idea of the teamwork involved.
None of this would be possible without the business units across VMware who facilitate our access to passionate and skilled employees that create the content we share with our conference attendees and, ultimately, our online users.
These partnerships are a shining example of what we like to call One VMware.
Each year, we maintain a portion of stable capacity — “stable” referring to the persistence of the environment rather than its functional characteristics. These are private clouds that run our production service at https://labs.hol.vmware.com/ or Pathfinder, so we have an idea of their capabilities and limitations. Around July of each year, we stop sending production workloads to these clouds, give them a refresh of the latest GA software, security patches, and firmware updates, and load them up with the latest lab content for the conference.
While some people object to the phrase “eating our own dog food,” I think it is appropriate in this case. We build one of our private clouds using the next-generation VMware products — typically, this is based on versions that have yet to be released or even announced. Our goal is to find the bugs before you even know the product is coming! Sometimes, these builds are not pretty, but that’s the point.
Depending on what we have available during our conference timelines, release cycles and team workloads, we also try to have public cloud capacity to run our labs. This can be challenging due to the interesting requirements of our workloads: running ESXi nested is not a supported configuration and the networking gymnastics required to support it are not permitted in many clouds.
Our goal is to have as much redundancy as we can afford and as much variety as we can support.
I have been asked at various events to draw our infrastructure, and there is a nice official diagram somewhere with pretty icons and colors.
For this post, let’s just do some high-level visualization to show the redundancy we’re planning this year.
Note that any of our labs should be able to be run in any cluster within any cloud instance and be reachable from any station in our room via the VMware Lab Platform, which is itself a highly-redundant, distributed application.
Let’s dig a little deeper into some of the layers.
At the highest level, we want to have geographically-distributed cloud instances. In the past, we have had instances in various locations across the US and Europe. This year, all of our capacity is on the west coast of the US in 3 states: California, Oregon and Washington. This redundancy may not help us if there is an issue along the coast, but since our event is also running there, the consumers would be affected as well!
This year, our workloads will be hosted in 3 different physical datacenters. We rely on datacenter providers to ensure that the typical services are provided in a reliable and redundant manner: power, cooling, network access and physical security, to name a few.
Each of our labs runs as a set of virtual machines within a cloud instance managed by VMware Cloud Director.
Each datacenter has its own cloud instance, and the one in Santa Clara, California has two. Each cloud instance is made up of multiple redundant cells, so loss of any cell may result in an interruption of access, but a retry re-connects a user to a lab that has remained running on the backing vSphere.
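The retry behavior is simple to model: a lab’s VMs keep running on the backing vSphere no matter which front-end cell a user happens to reach them through. A minimal sketch, assuming a hypothetical `open_console` callable standing in for whatever console-brokering call the real platform makes:

```python
import random
import time

def connect_to_lab(cell_endpoints, lab_id, open_console, retries=3, delay=0.1):
    """Try each redundant cell in turn; the lab's VMs keep running on
    vSphere regardless of which front-end cell serves the console."""
    last_error = None
    for _ in range(retries):
        # Shuffle so retries spread load across the surviving cells.
        for cell in random.sample(cell_endpoints, len(cell_endpoints)):
            try:
                return open_console(cell, lab_id)  # hypothetical broker call
            except ConnectionError as exc:
                last_error = exc
        time.sleep(delay)  # brief pause before the next round of attempts
    raise RuntimeError(f"no cell reachable for lab {lab_id}") from last_error
```

The key property is that a dead cell only costs the user a reconnect, never their running lab.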
We try to minimize the number of cloud instances for management simplicity and content efficiency: more instances means more copies of the content, more settings and configurations to manage, more testing and more resources consumed. My advice: Keep it simple!
vSphere: vCenters and Clusters
Each of our cloud instances is backed by multiple vCenter servers, each managing a cluster of ESXi hosts. The construct in Cloud Director is known as a virtual datacenter (VDC) and there are multiple types and configuration options. Once again, we go as simple as we can: one cluster per vCenter server.
A single vCenter server managing a single cluster helps keep the design simple and enables greater compartmentalization.
From a performance perspective, we deploy tens of thousands of virtual machines (VMs) each evening to prepare for the next day and thousands more VMs during each day of the event. While vCenter server is scalable on its own, being able to parallelize even more helps us get through the nightly deployment wave faster so we can… get to sleep… earlier! The sheer volume of tasks required to deploy that many VMs with the associated storage and networking is mind-boggling. If something is going to fail, we want to minimize the impact as much as possible. Breaking it up means that a portion of the work can proceed while we troubleshoot any failures in another portion.
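The one-cluster-per-vCenter layout maps naturally onto a fan-out pattern: one worker per cluster, failures isolated per cluster. A sketch of that idea (the `deploy_pod` callable is hypothetical; the real tooling talks to Cloud Director and vCenter):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def deploy_wave(clusters, deploy_pod):
    """Run the nightly deployment wave in parallel, one worker per
    vCenter/cluster, so a failure in one cluster never blocks the rest."""
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=len(clusters)) as pool:
        futures = {pool.submit(deploy_pod, c): c for c in clusters}
        for fut in as_completed(futures):
            cluster = futures[fut]
            try:
                results[cluster] = fut.result()
            except Exception as exc:
                # Park the failure for troubleshooting; others keep going.
                failures[cluster] = exc
    return results, failures
```

Breaking the wave up this way means a stuck cluster is a troubleshooting task, not a show-stopper.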
VLP helps us distribute the lab instances across our active cloud capacity, maps our attendees to running environments, and manages pools of environments to ensure that we always have new ones starting up to replace those being consumed.
Some of our lab environments take an hour to start up and be ready for use. We don’t want you to wait that long, so we start the labs in advance, based on anticipated demand. Our team stages the lab templates in each cloud and uses VLP to manage all of that.
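The pool idea boils down to: keep a target number of labs warm, and every checkout triggers a replacement. A toy model of that VLP-style behavior (the class and `start_lab` callable are hypothetical, not the actual VLP API):

```python
class LabPool:
    """Toy model of pool management: keep `target` ready-to-use labs
    warm; each checkout immediately kicks off a replacement deployment."""

    def __init__(self, target, start_lab):
        self.start_lab = start_lab  # hypothetical: begins a lab deployment
        self.ready = [start_lab() for _ in range(target)]

    def checkout(self):
        lab = self.ready.pop()               # hand a pre-started lab out
        self.ready.append(self.start_lab())  # start a fresh one behind it
        return lab
```

Because replacements start the moment a lab is consumed, the hour-long boot time is paid in the background rather than by the attendee.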
Every user is provided with a dedicated set of virtual machines (a “vPod”) for the duration of their lab session. Each set of machines runs within a single vSphere cluster. This takes advantage of vSphere’s high availability and load balancing functionality while not creating a binding across datacenters, cloud instances, vCenter servers or clusters. Following the theme, this keeps the impact of failures localized: a cloud instance, vCenter or cluster going on vacation for some reason will only impact users who have labs running within that failure domain.
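The placement rule above — every VM of a vPod lands in exactly one cluster — can be sketched as a simple capacity check. The cluster names and free-slot counts here are made up for illustration:

```python
def place_vpod(vpod_vms, clusters):
    """Place every VM of a vPod in ONE cluster (one failure domain).
    `clusters` maps a cluster name to its remaining VM slots; we pick
    the candidate with the most headroom."""
    need = len(vpod_vms)
    candidates = {name: free for name, free in clusters.items() if free >= need}
    if not candidates:
        raise RuntimeError("no single cluster can host the whole vPod")
    chosen = max(candidates, key=candidates.get)
    clusters[chosen] -= need            # reserve the slots
    return {vm: chosen for vm in vpod_vms}
```

Refusing to split a vPod across clusters is what keeps a failure's blast radius limited to the labs actually running in that domain.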
If a portion of the infrastructure decides to have an unscheduled out-of-office experience, affected attendees can be instructed to drop their current sessions and connect to a fresh environment running on healthy infrastructure.
In the event of that kind of catastrophic failure, individual lab state is lost, but because our content is designed to be modular, an attendee can generally pick up lab flow near the point at which the issue occurred. That is not optimal, but it does not happen frequently either.
For those who were with us in 2019 in San Francisco, you know that if we lose power to the entire venue (especially the entire city block!), there is not much we can do except wait. Your labs will continue running in the clouds, but you will be unable to reach them until power and Internet access are restored.
Come See Us!
If you would like to hear more, stop by the Hands-on Labs room at VMware Explore – We’ll be in Moscone West on the 3rd floor. We have drop-in, behind-the-scenes tours scheduled each day and we’re happy to answer questions either on the tours or if you see one of us in black polo shirts milling about in front of the Command Center wall where we showcase our monitoring of the environment.
Register to attend the US event: REGISTER