Prepping an ESXi 6.7 host for Secure Boot

When 6.7 went “GA” or General Availability, I was excited to get it installed and running on my bare metal hosts in my lab here at VMware. I had gotten my Dell R630s updated with TPM 2.0 chips and was looking forward to booting with “attested” hosts. But I had a few issues before I was able to get everything working. This blog article is Part 1 of a two-part series on how to configure your hosts to use Secure Boot and TPM 2.0.

Technical Debt

All my prior testing with 6.5 and during the 6.7 beta was done with nested virtualization and fresh installations. But now it was time to update the bare metal hosts. Like many of you, I tend to just update in place, and this leads to the accumulation of “stuff” on my systems: older drivers that have been replaced with newer drivers, older drivers from the server vendor that are now deprecated, etc. As you’ll see, this caused me some heartache.

If I was using Host Profiles then I’d be booting a clean copy of ESXi and would probably not be encountering these issues!

PSOD!

When I went into the UEFI BIOS and enabled Secure Boot, I got this. See for yourself!

[Image: PSOD with failed VIBs]

Well, Secure Boot is working as designed! It encountered a number of VIBs whose signatures were not carried over during an update. Also, some of these drivers are not “Native” ESXi drivers. Before I could continue, I needed to fix this. I rebooted into the UEFI BIOS and turned off Secure Boot until I was sure it would work.

VIB Signatures

Why did I get this PSOD? Prior to ESXi 6.5, we did not save the VIB metadata that Secure Boot needs in order to work. Hence, if a VIB is not replaced during an update and is carried forward as is, we can’t verify its signature. If you updated using ESXCLI then you may encounter this issue. If you updated using VUM or via an ISO update then you may find that your VIBs have been updated with their signatures and you are good to go. I was not so fortunate.

The following KBs give you some guidance on VIB signature issues, but I’m hoping this blog will go into it a little deeper.

  • KB2147606 Cannot enable secure boot on ESXi 6.5 or 6.7 host that was upgraded
  • KB54481 Cannot enable secure boot on host upgraded to ESXi 6.7

Verifying Secure Boot – First Attempt

The first step I tried was installing 6.7 from an ISO over the existing installation of 6.7. This updated some of the VIBs but not nearly all of them. Using the KBs above as a starting point, I logged in to the host and ran the secureBoot.py script.

This provided me with the list of “offending” VIBs.  As you can see the list was kind of long.

Output edited for clarity

Using this command during this process became mandatory. As I removed VIBs I would always re-run it to ensure I caught everything. When it came up with no VIBs, I was confident that Secure Boot would work.

Native ESXi Drivers

I noticed a lot of drivers that seemed to be duplicates of existing drivers. For example, “net-tg3” and “ntg3” or “net-bnx2i” and “bnxnet”.

In many cases these were drivers that weren’t “Native”, meaning they were built using the vmklinux APIs. These drivers are already on the road to being deprecated (and native has performance advantages), so it was best to clear them up while I was in there.

Differentiating between native and non-native drivers is somewhat difficult. There’s no API or metadata that I could use to provide a list. However, what I did learn from Engineering is that vmklinux drivers usually (but not always) start with a prefix and a dash. For example: net-tg3 or scsi-megaraid-sas.
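As a rough illustration of that naming heuristic, here is a minimal Python sketch that flags VIB names starting with a known prefix and a dash. The function names are hypothetical, and the prefix list contains only the two prefixes mentioned above (net and scsi); it is an assumption, not a complete list. Because the rule only holds "usually", treat the result as a list of candidates to investigate, never as a verdict.

```python
# Prefixes taken from the two examples in this article. This tuple is an
# illustrative assumption, not a complete list: extend it only after
# checking what is actually installed on your own hosts.
LEGACY_PREFIXES = ("net", "scsi")


def looks_like_vmklinux(vib_name, prefixes=LEGACY_PREFIXES):
    """Return True if the name has the form '<prefix>-<rest>' for a listed prefix.

    This is only the "prefix and a dash" heuristic. It can produce both
    false positives and false negatives.
    """
    prefix, dash, rest = vib_name.partition("-")
    return bool(dash) and bool(rest) and prefix in prefixes


def split_candidates(vib_names, prefixes=LEGACY_PREFIXES):
    """Split names into (possible vmklinux drivers, everything else)."""
    legacy = [n for n in vib_names if looks_like_vmklinux(n, prefixes)]
    other = [n for n in vib_names if not looks_like_vmklinux(n, prefixes)]
    return legacy, other


# Example with names that appear in this article.
legacy, other = split_candidates(["net-tg3", "ntg3", "scsi-megaraid-sas", "qcnic"])
print("check these:", legacy)
print("left alone:", other)
```

Keeping the prefix list explicit and short is deliberate: a broad "anything with a dash" rule would also match VIBs that are not vmklinux drivers.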

So, I went ahead and dumped the list of VIBs using the secureBoot.py script, reformatted the list into a column and then proceeded to create a list of “esxcli software vib remove -n xxxxxxx” commands. In my case, I removed *18* VIBs!

I’m including the list here but this is NOT something you should just blindly copy and execute. You should do your homework first! This list is provided AS AN EXAMPLE ONLY!

Warning: You will absolutely want to verify that the driver is unused before removing it. In my case I went through each VIB that seemed like a non-native driver and looked for a native driver that could take over for it. For example, qcnic vs. net-cnic.
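The "one name per line, one remove command per name" step described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the function name is hypothetical, the sample names are just drivers mentioned in this article and are not my actual list of 18, and the code only prints the commands so that you can review them. It never runs anything against the host.

```python
def build_remove_commands(vib_list_text):
    """Turn a block of VIB names (one per line) into esxcli remove commands.

    Blank lines, lines starting with '#', and repeated names are skipped.
    The commands are returned as strings and are NOT executed.
    """
    commands = []
    seen = set()
    for line in vib_list_text.splitlines():
        name = line.strip()
        if not name or name.startswith("#") or name in seen:
            continue
        seen.add(name)
        commands.append("esxcli software vib remove -n " + name)
    return commands


# Illustrative input only: names taken from examples in this article.
sample = """
net-tg3
net-cnic
# scsi-megaraid-sas  (left commented out until a native replacement is confirmed)
"""

for command in build_remove_commands(sample):
    print(command)
```

Commenting out a name in the input is a handy way to park a VIB you have not finished checking, which fits the "do your homework first" rule above.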

I put the host in Maintenance Mode, executed the commands and rebooted. The host came up and was working just fine. All the new native drivers took over from the vmklinux drivers with barely a peep. I enabled Secure Boot in the BIOS and no more PSODs!

Well, except for one host….

Bootbanks

Warning: If you get to this point then I highly recommend that you work with GSS to guide you through all of these steps to ensure that things are done in the correct order. 

This scenario is “not typical”. I’m sharing my experiences for the purposes of education and for search engines to grab the error messages.

I had some further issues with one of my hosts. Again, most customers shouldn’t get these error messages, but in case you do, here’s what I got when I ran the secureBoot.py script:

After much back and forth with one of our engineers, he pointed me at two KBs that offered a way to address this problem.

  • KB2016147 “esxupdate error code 15” error after patching an ESXi host
  • KB2151655 SDDC Manager ESXi host shows “Unknown – No Profile Defined” next to Image Profile under Configuration in the Summary Tab

What happened on this finicky, non-typical host is that my imgdb.tgz file was corrupted. On a working system the copies in the bootbank and altbootbank were each about 19 KB. On the broken system they were about 175 bytes. Yes, bytes. Not good.
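A quick size check like the one I did by eye can be sketched in Python. Everything specific here is an assumption for illustration: the function name is hypothetical, the paths assume imgdb.tgz sits at the top level of /bootbank and /altbootbank, and the 1024-byte floor is an arbitrary value chosen between the roughly 175 bytes I saw on the broken host and the roughly 19 KB on a healthy one. It is not a documented limit, and a file that passes this check is not proven healthy.

```python
import os

# Assumed locations of the image database in each bootbank.
DEFAULT_PATHS = ("/bootbank/imgdb.tgz", "/altbootbank/imgdb.tgz")


def find_suspicious_imgdb(paths=DEFAULT_PATHS, min_bytes=1024):
    """Return (path, size) pairs for files that are missing or too small.

    min_bytes is an arbitrary illustrative threshold, not an official one.
    A missing or unreadable file is reported with a size of None.
    """
    suspicious = []
    for path in paths:
        try:
            size = os.path.getsize(path)
        except OSError:
            suspicious.append((path, None))
            continue
        if size < min_bytes:
            suspicious.append((path, size))
    return suspicious


for path, size in find_suspicious_imgdb():
    print("check with GSS before touching:", path, "size:", size)
```

The sketch only reads file sizes and reports them. Any repair should follow the KB steps, ideally with GSS on the line.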

At this point, for me, curiosity took over. I wanted to understand better what was going on. If I were a customer, however, I would most definitely be on the phone with GSS.

To fix it I had to create a working bootbank. The steps are outlined in the first KB. One thing that wasn’t clear is that the KB has you only copying /bootbank. You will also want to copy /altbootbank so I repeated the steps to do that as well.

But it wasn’t enough. I got the following error:

Output edited for clarity

This is where my desire to just get it working overrode my curiosity. I called in one of our engineers to sort out inconsistencies in the host, and with his help the check eventually passed.

End goal

Your end goal is for the secureBoot.py script to run without reporting any offending VIBs or errors.

Wrap Up

I hope this has been helpful. You shouldn’t have to go to the extremes that I did to get your hosts updated. This is just my experience and I think it’s helpful to share in case you DO run into it.

I want to thank all the engineers that helped out on this. It really helped me understand what’s going on under the covers and write these blogs.

Please note that while the title says “6.7”, that is only because this is a two-part series leading to the use of TPM 2.0, which is only available in 6.7. The content in this blog article is also valid for enabling Secure Boot for 6.5.

If you have questions that haven’t been answered you can reply here, send them to mfoley at vmware.com or via Twitter to @vspheresecurity or my personal Twitter account: @mikefoley

@vspheresecurity is a curated list of vSphere Security specific tweets.

Thanks for reading!

mike