VMware has released updates to mitigate several Intel CPU issues, two of which affect security, as well as a regression in how VMware mitigates the previous MDS vulnerabilities. As with other CPU errata and vulnerability mitigations, these topics are extremely complicated. Information security often speaks to three core tenets: confidentiality, integrity, and availability. These types of issues often touch all three areas, and require choices based on how your own environments and workloads are designed, what your performance needs are, and your tolerance for risk.

Our goal in this post is to help VMware customers figure out what action they need to take, and how to talk about any potential risk decisions with their CISOs, while not duplicating the VMware KB articles and other official guidance linked below. If you have questions please reach out to your account team, Technical Account Manager, and/or VMware Global Support Services.

VMSA-2019-0008.2 – MDS & L1TF Vulnerabilities

What it is: VMware has found situations where the MDS & L1TF mitigations in ESXi 6.7U2 (builds 13006603, 13473784, 13644319, 13981272 and 14320388, and/or other 6.7 U2 based hot-patches you might have received from VMware Support) were not effective. We strongly urge customers who have remediated, or are in the process of remediating, the MDS vulnerabilities (CVE-2018-12126, CVE-2018-12127, CVE-2018-12130, and CVE-2019-11091) to update to the latest releases.
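As a rough illustration of how you might triage an inventory against the build numbers listed above, here is a minimal Python sketch. The host names and the inventory mapping are hypothetical, and the build list is not exhaustive: 6.7 U2 based hot-patches from VMware Support are also affected but have no build numbers listed here, so treat a "not listed" result as inconclusive rather than safe.

```python
from __future__ import annotations

# ESXi 6.7 U2 build numbers named in this post as having ineffective
# MDS & L1TF mitigations. Hot-patch builds from VMware Support are NOT
# included because the post does not list them.
AFFECTED_67U2_BUILDS = frozenset(
    {13006603, 13473784, 13644319, 13981272, 14320388}
)


def is_listed_affected_build(build: int) -> bool:
    """Return True if the build number is one of those named in this post."""
    return build in AFFECTED_67U2_BUILDS


def hosts_to_prioritize(inventory: dict[str, int]) -> list[str]:
    """Return the sorted names of hosts running a listed build.

    `inventory` is a hypothetical {host name: ESXi build number} mapping
    that you would populate from your own tooling.
    """
    return sorted(
        name for name, build in inventory.items()
        if is_listed_affected_build(build)
    )


if __name__ == "__main__":
    # Made-up example data; the second build number is a placeholder.
    example = {"esx01.example.com": 13006603, "esx02.example.com": 99999999}
    print(hosts_to_prioritize(example))
```

Hosts that come back from this check are candidates for patching first; hosts that do not should still be verified against the official advisory.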

What you need to do: Patch your vSphere infrastructure.

What effect will it have: Your environment will return to the pre-6.7U2 performance levels.

TSX Asynchronous Abort (TAA) Speculative-Execution Vulnerability (CVE-2019-11135)

What it is: Cascade Lake CPUs have their own built-in mitigations for MDS, but there’s a piece missing that was already covered in the MDS mitigations for earlier CPUs (Skylake and earlier).

What you need to do: Patch your infrastructure and enable MDS remediations if they aren’t enabled already.

What effect will it have: It depends on your environment and whether you have MDS remediations enabled already. Part of the remediation is to stop presenting certain affected CPU instructions (HLE/RTM) that are found on Cascade Lake CPUs to VMs. If you have VMs already using those instructions there may be performance implications, but power-cycling the VM will fix that (look into the new vmx.reboot.PowerCycle feature). If you have workloads that require HLE/RTM, contact VMware Support for a workaround and information on potential performance implications.

Machine Check Error on Page Size Change (MCEPSC) Denial-of-Service Vulnerability (CVE-2018-12207)

What it is: A malicious kernel driver inside a VM could use this vulnerability to cause a CPU machine check. This will cause ESXi to crash, generating a purple crash screen (Purple Screen of Death, or PSOD) with a particular error code listed on it (0150H).

What you need to do: It depends on whether you’re running nested hypervisors or using the Virtualization-Based Security (VBS) feature in vSphere. VBS enables use of Microsoft Device Guard & Credential Guard inside Windows 10 and Windows Server 2016 & 2019. You can enable the MCEPSC mitigation at the host level or at the VM level, which lets you test it and apply it selectively to protect particular workloads. Refer to the KB articles on how to enable these.

What effect will it have: This vulnerability is trickier to mitigate because the mitigation has other implications, including performance. If you are using nested hypervisors or Virtualization-Based Security you absolutely need to test this before you enable it for all of your workloads.

Please see the questions below for more nuanced discussion around remediating MCEPSC.

Jump Conditional Code (JCC) Erratum

What it is: A situation where the way the CPU’s cache interacts with the “jump” instruction can cause unpredictable behavior. Intel fixed this in their new CPU microcode, but the fix can have performance implications for what Intel describes as “tight loops.” For infrastructure a tight loop often means things like compression and encryption, such as what is used for vSAN or VM Encryption. Intel has supplied VMware with fixes for potential performance issues in vSphere, which we’ve included in these updates. Other applications may be affected and would require updates from the respective application vendors.

What you need to do: Just as you patch vSphere, it’s a great idea to keep your server firmware updated, too. VMware ships CPU microcode as part of vSphere, but you can also get it as part of server firmware updates. If you apply the new microcode without the fixed version of vSphere installed you may notice performance degradations, depending on how busy your environment is.

If you use VM Encryption or vSAN compression, deduplication, or encryption you should consider patching both firmware and vSphere together to avoid issues. Many servers allow you to use out-of-band management controllers, like iDRACs or iLOs, to stage firmware updates for the next system reboot. Alternately, you could update vSphere first to gain the performance fixes, then update your server firmware immediately afterwards.

What effect will it have: For vSAN & vSphere the net effect will be zero once the vSphere patches are applied. However, other workloads and applications may be affected and will require an update from the appropriate vendor.

Other Intel Errata and Updates

What it is: There are many other components inside a typical server. Like CPUs, these sometimes have bugs, too. There are new vulnerability disclosures around components like the Intel Management Engine and the BIOS, as well as updates to how certain CPUs manage power to improve stability.

What you need to do: Patch your server’s firmware.

What effect will it have: Patching removes vulnerabilities, so you’ll be more secure, but if you use vSAN Encryption, vSAN Deduplication & Compression, or VM Encryption you might want to consider updating vSphere first (see the Jump Conditional Code entry above).

So… should I be worried?

VMware classifies these updates as “moderate” severity which reflects how we perceive the threats associated with them. That doesn’t mean you shouldn’t worry, though – your organization’s specific configuration might mean you change the priority based on your own risks. As with MDS and L1TF these remediations need testing and discussion inside your organization. Remember that security is confidentiality, integrity, and availability. For example, TAA is a confidentiality problem, while MCEPSC is an availability problem. Your organization’s responses to each will differ.

For perspective, always keep in mind what an attacker needs in order to exploit these vulnerabilities: they need administrator/root-level access to a guest VM. Chances are that if an attacker is that far into your environment many things have already gone wrong. Furthermore, crashing a host alerts vSphere admins to the attacker’s presence, which isn’t typically what the attacker wants. This isn’t to say you shouldn’t act on these issues – you absolutely should – but you might be able to do it in a more controlled & tested manner. This is especially true if your organization uses Virtualization-Based Security on Windows VMs or other forms of nested virtualization.

Frequently Asked Questions

Q1. Are the mitigations for these issues enabled by default when these patches are applied?

A1. The updates for the MDS & L1TF issues are enabled when you enable a Side-Channel Aware Scheduler. The mitigations for TAA & JCC are enabled by default once the patches are applied and the hosts restarted. The mitigations for MCEPSC are not enabled by default because of the potential performance implications for VBS.
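One way to reason about A1 is as a small lookup from host state to active mitigations. The Python sketch below encodes only what A1 states; the class and field names are hypothetical, and it is a reading aid rather than a description of how ESXi decides anything internally. The KB articles remain the authority, particularly for TAA, where enabling the MDS remediations is also part of the guidance above.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class HostState:
    """Hypothetical summary of one ESXi host's configuration."""
    patched: bool                    # updates from this advisory are installed
    restarted: bool                  # host has been rebooted since patching
    sca_scheduler_enabled: bool      # a Side-Channel Aware Scheduler is selected
    mcepsc_mitigation_enabled: bool  # an admin explicitly turned this on


def active_mitigations(state: HostState) -> set[str]:
    """Return the issues mitigated on a host, following the wording of A1."""
    if not (state.patched and state.restarted):
        return set()
    active = {"TAA", "JCC"}          # on by default after patch and restart
    if state.sca_scheduler_enabled:
        active |= {"MDS", "L1TF"}    # tied to the Side-Channel Aware Scheduler
    if state.mcepsc_mitigation_enabled:
        active.add("MCEPSC")         # never on by default
    return active


if __name__ == "__main__":
    host = HostState(patched=True, restarted=True,
                     sca_scheduler_enabled=False,
                     mcepsc_mitigation_enabled=False)
    print(sorted(active_mitigations(host)))
```

Modeling it this way makes the gap visible: a freshly patched and restarted host with default settings covers TAA and JCC only.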

Q2. You seem to have avoided talking about performance. I need to know how this will impact my workloads.

A2. Performance is tricky because every environment and workload is different. However, we understand, and have written a KB article that addresses the performance impacts based on VMware’s internal testing.

Q3. How will I know if someone is exploiting TAA?

A3. Once an attacker has access to a VM they can silently exploit TAA, just as with L1TF and MDS. Therefore it is very important to protect guests with an Endpoint Detection & Response (EDR) solution, as well as ensuring that both guest OSes and workload applications are patched, too.

Q4. How will I know if someone is exploiting MCEPSC?

A4. If an attacker exploits MCEPSC the ESXi host will crash and display the purple crash screen (PSOD). If you are experiencing PSODs please contact VMware Support for help diagnosing the issues.

Q5. If I choose not to mitigate MCEPSC directly what else can I do to protect my environment?

A5. In order to exploit CPU vulnerabilities the attacker has to have administrator/root-level access to a guest VM in your environment, and the ability to inject a malicious kernel driver into the guest OS. Continuing to protect your applications and guest OSes with regular patching, firewalls, using solutions like Carbon Black’s EDR products, implementing AppDefense/Carbon Black Defense, and other defense-in-depth methods always pays dividends by preventing attackers from gaining access to your applications and data. Generally speaking, attackers are usually not that interested in your infrastructure except as a means to steal data.

You can also selectively remediate your environment by enabling the MCEPSC mitigation for VMs that are not using VBS, and by using things like DRS groups and rules to “pin” unremediated VMs to particular groups of hosts to limit the risk. We always urge caution around making your environment more complex, though. Complexity increases the risk of human error.

Also consider that the risk from MCEPSC is to availability, not a data loss situation.
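The selective approach described in this answer amounts to sorting VMs into two placement groups. Here is a minimal planning sketch in Python, assuming you already know which VMs use VBS or run a nested hypervisor. The `VM` class and the group names are hypothetical, and the code only produces a plan; it does not create DRS groups or rules.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class VM:
    """Hypothetical inventory record for one virtual machine."""
    name: str
    uses_vbs: bool = False
    runs_nested_hypervisor: bool = False


def plan_drs_groups(vms: list[VM]) -> dict[str, list[str]]:
    """Split VMs between mitigated and unmitigated host groups.

    VMs using VBS or nested virtualization go to the unmitigated group
    (pending testing); everything else goes to hosts with the MCEPSC
    mitigation enabled.
    """
    plan: dict[str, list[str]] = {"mitigated-hosts": [], "unmitigated-hosts": []}
    for vm in vms:
        nested = vm.uses_vbs or vm.runs_nested_hypervisor
        plan["unmitigated-hosts" if nested else "mitigated-hosts"].append(vm.name)
    return plan


if __name__ == "__main__":
    inventory = [
        VM("web01"),
        VM("win-vbs01", uses_vbs=True),
        VM("lab-esxi", runs_nested_hypervisor=True),
    ]
    for group, names in plan_drs_groups(inventory).items():
        print(group, names)
```

Keeping the plan to two groups is deliberate: every additional group or rule is more complexity to maintain, which is the trade-off this answer warns about.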

Q6. If a host crashes because of MCEPSC, will vSphere HA restart the offending VM on another host and cause the same issue?

A6. It’s possible. You may want to consider disabling HA for VMs that are unmitigated.

Q7. If nested hypervisors are impacted by MCEPSC how does this affect my vSAN witness VM?

A7. The impact is to nested workloads (VM inside a VM, or Virtualization-Based Security which runs Windows components inside Hyper-V inside a VM). The witness functionality is implemented as a process on ESXi, but not nested, so it will operate normally.

Q8. Do the Side-Channel Aware Schedulers (SCAv1 and SCAv2) mitigate these issues?

A8. The versions of SCAv1 and SCAv2 in these patches help mitigate L1TF, MDS, and the TAA issues but do not affect MCEPSC.

Q9. I have EVC enabled, will these mitigations work?

A9. Yes. EVC enables lots of flexibility around CPUs and that flexibility lends itself well to mitigating CPU vulnerabilities.

Q10. How do I know if my CPU or server is vulnerable?

A10. Please check with your hardware vendor to determine if your server is vulnerable.

Q11. Are these issues being actively exploited?

A11. At the time of this writing VMware is not aware of these issues being actively exploited, but as with most security vulnerabilities it’s likely just a matter of time until exploit code is generally available. To reiterate: to exploit these issues an attacker will need high-level access to a guest VM, so they will need to break into an application and/or a VM first.

Q12. Is this a VMware issue or does it affect other hypervisors?

A12. This is a hardware issue. All operating systems and hypervisors are affected. VMware is simply helping our customers deal with these complicated and tricky issues.

Q13. Do these issues affect Kubernetes?

A13. Yes. These issues affect all operating systems and workloads, because they originate in the underlying CPU hardware.

Q14. Do all my VMs need to be power-cycled as they did with the MDS remediations?

A14. No. By vMotioning VMs to a patched host they gain the protections. However, if you haven’t fully remediated for MDS you might be interested in the new vmx.reboot.PowerCycle advanced parameter that has been added to vSphere.

Q15. How do I get the updated CPU microcode for my server?

A15. VMware ships CPU microcode as part of vSphere releases and updates. However, those microcode updates don’t include updates for things like the Intel Management Engine, BIOS issues, etc. Our recommendation is to update the firmware on your server using your hardware vendor’s preferred methods because that will include the updates for other hardware components. Please review the notes around the JCC issues above for thoughts on the order of operations (vSphere updates first, then hardware updates).

Q16. I am on an air-gapped network. Do I need to mitigate these issues?

A16. Only you and your organization can determine that. VMware always recommends that customers protect themselves by patching, but when issues like MCEPSC might trade confidentiality and availability there isn’t one right choice we can recommend.

Q17. The VMware Security Advisory does not list a source for vSphere 5.5 updates. Where can I get those?

A17. Security advisories do not include products past the end of general support. ESXi 5.5 fixes are not currently planned for extended support.

Q18. I don’t want to apply these patches. Is there a workaround I can use?

A18. No workarounds have been identified for any ESXi version.

Q19. I don’t want the new microcode. Can I remove it from the ESXi patches?

A19. Altering ESXi in that way isn’t supported. Altering products outside of the guidance of VMware Support may lead to instability and support issues. Please open a support case if you have concerns or questions.

Q20. This update does not appear to have a vCenter Server release associated with it, but the recommended patching sequence indicates that vCenter Server should always be patched first. What should I do?

A20. Always check to ensure you’re running the latest release of vCenter Server first. If there are no available updates for your vCenter Server and/or Platform Service Controllers then move on to updating ESXi.
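Combining this answer with the JCC notes above gives a simple ordering: vCenter Server (and Platform Service Controllers) first, then ESXi, then server firmware. The Python sketch below expresses that ordering as a small helper. The component labels are hypothetical names chosen for illustration, and patching firmware and vSphere together in one maintenance window, as suggested in the JCC section, remains an equally valid option.

```python
from __future__ import annotations

from typing import Iterable

# Order taken from this post: vCenter Server first, then ESXi,
# then server firmware (which carries the new microcode).
PATCH_ORDER = ("vcenter", "esxi", "firmware")


def order_updates(pending: Iterable[str]) -> list[str]:
    """Return the pending component updates in the suggested order.

    Raises ValueError for any component label this sketch does not know.
    """
    pending_set = set(pending)
    unknown = pending_set - set(PATCH_ORDER)
    if unknown:
        raise ValueError(f"unknown components: {sorted(unknown)}")
    return [component for component in PATCH_ORDER if component in pending_set]


if __name__ == "__main__":
    # No vCenter Server update is available, so only two steps remain.
    print(order_updates(["firmware", "esxi"]))
```

If no vCenter Server update is available, it simply drops out of the list and ESXi becomes the first step, which matches the guidance in A20.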

Q21. Do I need to disable Hyperthreading to mitigate any of these new vulnerabilities?

A21. No, but we recommend continuing to use SCAv1 or SCAv2 schedulers to mitigate MDS and L1TF on older CPUs. You might find our post, Which vSphere CPU Scheduler to Choose, to be helpful.

Q22. Do the updates released by VMware include previous security fixes?

A22. Yes. Patches are cumulative and the latest updates include all the features, bug fixes, and security remediations ever shipped as part of the particular vSphere version.

Q23. Are any of these issues considered Remote Code Execution (RCE) vulnerabilities?

A23. No — an attacker needs to break into a guest VM first and gain privileged access (administrator or root) before they would be in a position to exploit these issues.

Q24. The KB articles suggest changing configuration parameters via SSH, but I have SSH disabled for security reasons. Is there another way to make these changes?

A24. Not at this time.

Q25. I have a cluster with the MCEPSC mitigations enabled on the hosts. Can I set monitor.if_pschange_mc_workaround to FALSE for a specific VM to exclude it?

A25. No, a VM cannot opt out of host-level mitigations.

Have other questions? Feel free to add them to the comments.

(Many thanks to David Dunn, Edward Hawkins, and Mike Foley for their help in the creation of this article)