In August 2019 VMware joined AMD on stage as they announced their second generation EPYC server-class CPUs. Together we announced support, in a future release of vSphere, for the AMD Secure Encrypted Virtualization (SEV) and Secure Encrypted Virtualization – Encrypted State (SEV-ES) advanced security features of those CPUs. vSphere 7 Update 1 keeps our promise, delivering that functionality to customers.
January 2018 was a turning point for security in infrastructure. For security researchers, the Spectre and Meltdown vulnerability announcements opened their eyes to a whole new world of hardware capabilities to be probed for their security properties. We’ve seen the results over the past few years, and it’s contributed to a much wider interest in infrastructure security. People are starting to ask the right questions about security in the cloud, defense-in-depth, and isolation between workloads even inside an on-premises data center. Do we have to trust the hardware so much? Do we have to trust the software so much? How do we keep others from seeing inside our workloads? How do we limit exposure, add more isolation, and limit our risk?
These are good questions, and they speak to the greater need for Trusted Execution Environment (TEE) capabilities in hardware, accessible by workloads.
AMD’s response was to add the AMD Secure Processor, a small coprocessor integrated into the AMD CPU die itself, that enables a hardware root of trust and supports the cryptographic operations that underpin technologies like SEV-ES. With the first generation of EPYC CPUs the Secure Processor could only handle 15 encryption keys, but the second generation EPYC CPUs extend that to more than 500 keys. One encryption key is for the hypervisor, and the rest can be used for workloads.
Isolation in ESXi
You might be thinking “why do we need this? ESXi has a lot of isolation already!” You’d be right; there are many layers of isolation inside a virtualized infrastructure, but it’s very nice to have more options. The guest OS has its own process protections and permission models (unless you’re still running Windows 3.1, of course). The VM Runtime, or VMX, is a form of isolation itself, as it is the process inside ESXi that runs the guest VM. Around the VM Runtime is what we call the sandbox, a hard layer of protection separating the guest and the rest of the hypervisor. These are all separations between workloads on the same host, but there is also the isolation that comes from workloads being on physically separated hosts, too.
All of this is software, though, and still requires a level of trust in the CPU, memory controllers, PCIe bus controllers, and the like. Similarly, VMs require a level of trust in the hypervisor, too. This is where SEV & SEV-ES come in. A guest OS that supports SEV can ask the AMD Secure Processor to issue it an encryption key, and enable full in-memory, in-hardware encryption. SEV-ES extends that to CPU registers, too, so that data in use inside the CPU itself is encrypted. This is a very powerful protection against entire classes of vulnerabilities, both in hardware and in the hypervisor. When a VM stops actively running on a CPU (what is known as a context switch, not a power-off of a VM) the contents of the CPU registers get copied into hypervisor memory. A compromised hypervisor could read or modify that register data to steal the data itself, to steal things like encryption keys that are held in CPU registers (to decrypt a disk, for example), or to alter the behavior of the VM itself. None of those things are good. With SEV-ES, the hypervisor does not have access to a guest’s register contents unless the guest explicitly allows it, greatly reducing the attack surface.
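To make the context-switch risk concrete, here is a toy Python model of a register save area. It is only a sketch: the register names and values, the function names, and the SHA-256-based keystream are all assumptions made for illustration. Real SEV-ES encryption happens in hardware with a key managed by the AMD Secure Processor, not in software like this.

```python
import hashlib
import json
import secrets
from typing import Dict, Optional


def _toy_keystream(key: bytes, length: int) -> bytes:
    """Toy keystream (SHA-256 in counter mode). A stand-in for hardware encryption."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def _xor(data: bytes, key: bytes) -> bytes:
    """XOR data with the toy keystream; applying it twice returns the input."""
    stream = _toy_keystream(key, len(data))
    return bytes(a ^ b for a, b in zip(data, stream))


def save_registers(registers: Dict[str, str],
                   guest_key: Optional[bytes] = None) -> bytes:
    """Model a context switch: return the bytes that land in hypervisor memory.

    With no guest key (no SEV-ES) the save area is plaintext. With a guest
    key the save area is encrypted before the hypervisor can see it.
    """
    plain = json.dumps(registers, sort_keys=True).encode("utf-8")
    if guest_key is None:
        return plain
    return _xor(plain, guest_key)


def restore_registers(save_area: bytes,
                      guest_key: Optional[bytes] = None) -> Dict[str, str]:
    """Model resuming the VM: recover the register contents from the save area."""
    plain = save_area if guest_key is None else _xor(save_area, guest_key)
    return json.loads(plain.decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical register contents, including a secret held in a register.
    registers = {"rax": "disk-key-1234", "rip": "0x401000"}
    guest_key = secrets.token_bytes(32)  # never handed to the hypervisor

    unprotected = save_registers(registers)
    protected = save_registers(registers, guest_key)

    print("secret readable without SEV-ES:", b"disk-key-1234" in unprotected)
    print("secret readable with SEV-ES:", b"disk-key-1234" in protected)
    print("guest resumes correctly:", restore_registers(protected, guest_key) == registers)
```

The point of the model is the difference between the two save areas: in the unprotected case anything with access to hypervisor memory can search the bytes for secrets, while in the protected case it only sees ciphertext. The sketch does not model integrity protection or how the guest selectively shares register state.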
It’s always hard to see the future, but a major benefit to these technologies is that they offer possible opportunities for protection against future hardware issues, too. Over the past three years we’ve seen where a lot of vulnerabilities lie, and risk is found in many places. As such, vSphere 7 Update 1 is the first hypervisor to provide full SEV-ES support, not just SEV. Protecting data in memory is good. Protecting data inside the CPU itself is great, and we at VMware believe that the full protection of SEV-ES is what customers need to enable for protection.
Security is always a tradeoff, whether it’s usability, performance, cost, opportunity cost, or a combination of factors. I like to call them “considerations” because they’re design points and not necessarily bad, just things to be traded against risk tolerance and compliance & regulatory needs. First, and most obvious, to use these technologies you need a CPU that supports them, which is an AMD EPYC 7xx2 CPU or newer. Second, and also somewhat expected, you need a guest OS that supports SEV-ES, and not just SEV, because we require both to be enabled. As of this writing the native support is found in particular Linux distributions and kernels. Customer feedback is a very powerful thing, and we encourage you to ask your guest OS vendors when they plan to fully support AMD CPUs and their advanced platform security features like SEV-ES.
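If you want to confirm from inside a Linux guest that both features are active, one option is to look at the kernel log. The Python sketch below parses a log excerpt for the line that Linux kernels with SEV support print at boot. The marker text and the sample line are assumptions based on typical kernel output, and the exact wording varies between kernel versions, so treat this as a starting point and check it against your own distribution. The function names are made up for this example.

```python
# Assumed marker text. Kernels word this line slightly differently
# (for example with or without an "AMD" prefix), so match on the common part.
MARKER = "Memory Encryption Features active:"

# The article states that both SEV and SEV-ES must be enabled.
REQUIRED = {"SEV", "SEV-ES"}


def active_features(kernel_log: str) -> set:
    """Return the feature names listed after MARKER, or an empty set."""
    for line in kernel_log.splitlines():
        if MARKER in line:
            tail = line.split(MARKER, 1)[1]
            return set(tail.split())
    return set()


def meets_requirement(kernel_log: str) -> bool:
    """True when the log reports both SEV and SEV-ES as active."""
    return REQUIRED <= active_features(kernel_log)


if __name__ == "__main__":
    # Hypothetical excerpt, e.g. captured from `dmesg` inside the guest.
    sample = "[    0.123456] AMD Memory Encryption Features active: SEV SEV-ES"
    print(sorted(active_features(sample)))
    print("SEV and SEV-ES both active:", meets_requirement(sample))
```

A guest that reports only SEV would fail this check, which mirrors the requirement above that SEV-ES, and not just SEV, be supported.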
Another set of considerations comes from the fact that the hypervisor doesn’t have access to a protected VM’s memory. When that happens, the hypervisor loses the ability to do certain things, like vMotion, memory snapshots, hot add of devices, suspend & resume, Fault Tolerance, and hot clones. Some of these things, like vMotion, might be important to operations in your environment. They also might not, because with good application design an application or service can be resilient to maintenance operations that require the host to restart. Furthermore, Kubernetes-based modern applications may not need vMotion. vSphere with Tanzu doesn’t vMotion its Kubernetes workloads. It restarts them on a new host and terminates the copy on the host entering maintenance mode. This makes the inability to use vMotion irrelevant to those applications – you don’t miss what you didn’t need!
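When planning a rollout, it can help to compare that list against what each workload actually depends on. The Python sketch below does this for a hypothetical inventory. The operation names are taken from the paragraph above, but the function name, the normalized spellings, and the sample workloads are assumptions for illustration only.

```python
# Operations the article lists as unavailable for a VM with SEV-ES enabled.
UNAVAILABLE_WITH_SEV_ES = frozenset({
    "vmotion",
    "memory snapshot",
    "hot add",
    "suspend/resume",
    "fault tolerance",
    "hot clone",
})


def blocked_operations(required_operations):
    """Return, sorted, the required operations that SEV-ES would rule out.

    An empty result only means nothing on the article's list was requested.
    It is not a guarantee that every other operation is supported.
    """
    normalized = {op.strip().lower() for op in required_operations}
    return sorted(normalized & UNAVAILABLE_WITH_SEV_ES)


if __name__ == "__main__":
    # Hypothetical workloads and the operations their runbooks rely on.
    inventory = {
        "legacy-db": ["vMotion", "Fault Tolerance"],
        "k8s-worker": ["restart"],
    }
    for name, needs in inventory.items():
        blocked = blocked_operations(needs)
        verdict = "candidate for SEV-ES" if not blocked else f"blocked by {blocked}"
        print(f"{name}: {verdict}")
```

In this made-up inventory the Kubernetes worker is a candidate because it is simply restarted on another host, while the database would first need a design that does not depend on vMotion or Fault Tolerance.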
There is considerable upside to the way AMD has designed SEV-ES. First, workloads running on a supported guest OS gain these deep protections without modifications to the applications themselves. This means that you don’t have to wait for a commercial application vendor to recompile their application to support this technology. Second, this technology isn’t all or nothing, even on a single host. You can enable it for certain workloads, leave it disabled for others, and they all coexist peacefully. That flexibility means that enabling and deploying this technology can be done at your own pace. Last, it’s easy to enable this for one or 10,000 VMs by using a PowerCLI cmdlet.
AMD has demonstrated considerable thoughtfulness towards consumers with how they’ve designed these features, and their ideals mesh well with our work to make security extremely easy to use inside vSphere. It’s always nice to have these types of tools in our collective security toolbox, helping to reduce risk, and we at VMware are proud to lead the industry in championing them.
We are excited about these new releases and how vSphere is always improving to serve our customers and workloads better in the hybrid cloud. We will continue posting new technical and product information about vSphere with Tanzu & vSphere 7 Update 1 on Tuesdays, Wednesdays, and Thursdays through the end of October 2020! Join us by following the blog directly using the RSS feed, on Facebook, and on Twitter, and by visiting our YouTube channel which has new videos about vSphere 7 Update 1, too. As always, thank you, and please stay safe.