A serious situation is developing for some customers running vSphere 6.5 Update 2 and newer where the Security Token Service (STS) certificate is expiring after its two year lifespan and causing problems for authentication on vCenter Server. This post is intended to help vSphere Admins identify & repair the problem proactively.
When the STS certificate expires, users can no longer log in to the vSphere Client and instead see the error:
HTTP Status 400 – Bad Request Message BadRequest, Signing certificate is not valid
To quote the vSphere documentation, the Security Token Service “is a service inside vCenter Server that issues, validates, and renews security tokens.” Any time a user logs into vCenter Server they will be issued one of these tokens as part of the Single Sign-on process, which is then used for authentication whenever a request is made.
vSphere protects all communications between services with encryption. To enable TLS encryption you need a certificate, and that certificate is usually issued by the VMware Certificate Authority (VMCA). The VMCA is a part of vCenter Server that automates issuing certificates to these services. Because of industry-wide changes to certificate expiration standards, some certificates issued by vSphere 6.5 Update 2 and later 6.5 releases had a lifespan of only two years, rather than the usual ten years for that particular certificate. Normally this would not be a big problem, but three other issues have conspired to complicate it. First, vSphere upgrades do not refresh the STS certificate, so a two-year certificate may have been carried forward during an upgrade and is likely nearing expiration now. Second, there is no alarm that warns of STS certificate expiration, as there is for other certificates.
Third, when that certificate expires, vSphere does the right thing and stops trusting the communications with the service, because it no longer has a valid certificate. Unfortunately, that means that logins to vCenter Server, as well as other management operations like certificate management, stop until the STS certificate can be regenerated. Users suddenly start getting the “Signing certificate is not valid” error above at login, and vSphere Admins cannot use the certificate-manager tools to reset the certificates.
How do I know if I am impacted?
VMware KB article 79248, “Checking Expiration of STS Certificate on vCenter Server,” explains how to check whether you are affected. If you are running vSphere 6.5 or 6.7, the older Flash-based vSphere Web Client is the easiest way to check, and the procedure is documented in that KB article.
That KB article also has a Python script that can be run on the vCenter Server to check the certificate lifespan. On the vCenter Server Appliance, the “wget” command can be used to retrieve the script before executing it.
There are also some community-generated assets. VMware Code has “Get-STSCerts.ps1”, a user-contributed example of how to check the certificate validity through PowerCLI. As with other things on code.vmware.com, it isn’t supported by VMware directly, but it is the community helping others, which we appreciate very much!
Please note that all of these scripts need to be run against the appliance or system where the VMCA is running. If you have external Platform Services Controllers (PSCs) it will be one of those. If your PSCs have been converged, or those functions are part of vCenter Server, then you will need to run the script there. If there are questions or concerns please engage VMware support.
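As a rough illustration of what such a check boils down to, the Python sketch below takes a certificate's expiration timestamp and reports how many days remain. It is not the script from KB 79248. The function names, the 90-day warning threshold, and the sample date are assumptions made for this example, and the timestamp format is the one OpenSSL prints for a certificate's "notAfter" field.

```python
import ssl
import time

WARN_DAYS = 90  # assumed warning threshold, not a VMware recommendation


def days_until_expiry(not_after, now=None):
    """Return whole days until the certificate expires (negative if expired).

    not_after is a timestamp in the form OpenSSL prints for notAfter,
    for example "Oct 15 12:00:00 2020 GMT".
    """
    if now is None:
        now = time.time()
    # cert_time_to_seconds converts the OpenSSL-style string to epoch seconds.
    expires = ssl.cert_time_to_seconds(not_after)
    return int((expires - now) // 86400)


def describe(not_after, now=None):
    """Turn the remaining lifetime into a short status message."""
    days = days_until_expiry(not_after, now)
    if days < 0:
        return "EXPIRED %d days ago" % -days
    if days <= WARN_DAYS:
        return "WARNING: expires in %d days" % days
    return "OK: expires in %d days" % days


if __name__ == "__main__":
    # Hypothetical expiration date, for illustration only.
    print(describe("Oct 15 12:00:00 2020 GMT"))
```

Because there is no built-in alarm for this certificate, a check along these lines would have to be run on a schedule to be useful. The supported way to read the actual expiration date remains the procedure in the KB article.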
If this happens to me, what will be affected?
Logins to vCenter Server will be affected, so any system or solution that needs to authenticate will have trouble. Similarly, numerous vSphere management operations that need to verify the validity of a security token will also have trouble (SSO operations, console access, etc.). However, workloads running in guest VMs will remain online and accessible, and vSphere HA and DRS will continue to function. The ESXi consoles also continue functioning, so in an emergency you can access guest VM consoles and manage workloads that way.
What do I do if I am impacted?
There are two VMware KB articles written to help guide folks handling this situation:
Both contain scripts that will assist you in fixing this problem. Those scripts are listed in the “Attachments” sections of the KB articles.
If you are using a vCenter Server Appliance, you can copy the URL of the attachment and use the “wget” command on the appliance to download it. Also note that recent editions of Microsoft Windows 10, as well as Apple macOS, have SSH built in, so you can connect to the appliance from either one. In my own test I downloaded the file this way and renamed it.
To get the URL for the “wget” command, I right-clicked the attachment in the KB article, chose “Copy Link Address…” and then pasted it into the PowerShell window in Windows. This will only work if your vCenter Server has outbound access to the internet, but many people allow that for patching. If your environment does not permit that, you will likely need to use the “scp” or “wget” commands from the vCenter Server Appliance itself to retrieve the file from a place on your local network.
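If a scripted download is preferred over “wget”, the same retrieve-and-rename step can be approximated in Python. Since the KB's check script is written in Python, an interpreter should be available on the appliance, though that is an assumption worth verifying on your system. This is only a sketch: the function name `fetch_attachment` and the URL and file name in the usage comment are placeholders, not the real KB attachment link.

```python
import shutil
import urllib.request


def fetch_attachment(url, dest_path, timeout=30):
    """Download url and save it as dest_path (download and rename in one step)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        with open(dest_path, "wb") as out_file:
            # Stream the response to disk instead of holding it all in memory.
            shutil.copyfileobj(response, out_file)
    return dest_path


# Usage (placeholder values): paste the link copied from the KB article's
# "Attachments" section and choose any local file name, for example:
#
#   fetch_attachment("https://example.com/kb-attachment", "sts_fix_script.sh")
```

The same outbound-access caveat applies here as for “wget”. On an isolated network the URL would need to point at a copy of the file hosted locally.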
You can also always open a support case with VMware Global Support Services, especially if you have production systems that are down. Our Technical Support Engineers can open a Zoom call with you and restore functionality quickly.
We always encourage vSphere Admins to test changes prior to executing them in their production environments, and to ensure they have a backup of their vCenter Server and Platform Services Controllers prior to any work. While it isn’t supported directly, ESXi can run as a guest OS along with vCenter Server, and that makes for a wonderful test environment. It’s how the Hands-on Labs operate, for instance.
What is VMware doing about this?
As you’ve seen, we’ve identified a few areas of possible improvement, including certificate expiration length, alarms, and upgrade processes, and we’re looking at how to make those improvements to the product. Our goal in vSphere continues to be making it easy to be secure and reducing the time vSphere Admins spend on administration tasks, so any time we learn of issues like this, fixing them is of great concern to us.
As always we thank you for being our customers, and encourage you to reach out through your account teams with feedback or improvement suggestions if you have them. Help us help you!