That question came up recently for a customer when all of their vSphere hosts (8 in total) were pegged at 100% CPU.
Customer Environment
The environment consists of eight Nutanix nodes (servers) in a single cluster running VMware vSphere 6.7 U3. The Nutanix cluster hosts core services (AD, SQL, File & Print, etc.) along with a Citrix Virtual Apps and Desktops 7.15 LTSR environment. Users access the environment via a pair of Citrix ADC VPX 200s. The user load is split between Virtual App sessions and Windows 10 virtual desktops.
How did we get to vSphere hosts at 100% CPU?
We had a scheduled outage to replace a bad memory DIMM in one of the Nutanix nodes. We followed the typical procedure: evacuate the VMs from the host, put the host in maintenance mode, and shut down the host's controller VM (CVM). The CVM runs the Nutanix software and serves all I/O operations for the hypervisor, in this case VMware vSphere. Unfortunately, powering down the CVM snowballed into the Nutanix cluster shutting down, which brought down another host's CVM. The environment was down until we were able to reattach the CVM on the initial host. We thought we were all set at that point, but then noticed that the Nutanix cluster was showing 100% CPU in Prism (the Nutanix management interface), and vCenter showed the same. All virtual machines were powered on, and initially we did not see any issues with server access.
We opened cases with both VMware and Nutanix. The engineers started running esxtop on the hosts to find which VMs were consuming the CPU. Many VMs showed extremely high CPU ready values (%RDY above 5%), definitely not a good sign. We tried rebooting one of the hosts, but as soon as we vMotioned VMs back to the rebooted host, its CPUs were pegged at 100% again. While the VMware and Nutanix engineers were investigating the CPU issue, I searched the VMware and Nutanix forums for similar issues but didn’t find any.
I then checked Citrix Studio to determine the state of the Virtual App servers and the Windows 10 virtual desktops. The Virtual App servers were registered and did not have any issues. However, the Windows 10 virtual desktops (approximately 70) were unregistered even though they were powered on. I went through each Windows 10 VM in vCenter via the web console and discovered that they were all sitting at a blue screen of death (BSOD). They had lost storage access when the Nutanix cluster shut down. I force-restarted all of the affected Windows 10 VMs, which then began registering in Citrix Studio. Right after that, CPU usage on all hosts stabilized at 40%, which is the average load for this environment.
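The check that exposed the problem, comparing each desktop's power state with its registration state, can be sketched in a few lines. This is a minimal illustration only: the `DesktopState` record, the `find_suspect_desktops` helper, and the sample machine names are hypothetical. In practice the two states would come from whatever exports or tooling you have for vCenter and Citrix Studio.

```python
from dataclasses import dataclass


@dataclass
class DesktopState:
    name: str
    power_state: str         # as reported by the hypervisor, e.g. "poweredOn"
    registration_state: str  # as reported by the broker, e.g. "Registered"


def find_suspect_desktops(desktops):
    """Return the names of desktops that are powered on but not registered."""
    return [
        d.name
        for d in desktops
        if d.power_state == "poweredOn" and d.registration_state != "Registered"
    ]


# Hypothetical sample data, not taken from the environment in this post.
sample = [
    DesktopState("WIN10-001", "poweredOn", "Registered"),
    DesktopState("WIN10-002", "poweredOn", "Unregistered"),
    DesktopState("WIN10-003", "poweredOff", "Unregistered"),
]

print(find_suspect_desktops(sample))
```

A desktop that is powered on but unregistered is not necessarily at a BSOD, but it is a good candidate for a console check.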
Lesson Learned
Don’t look only at the hardware and software layers of the environment. Dig deeper and look at the state of the virtual machines being hosted. I had never seen an issue like this, and neither had the Nutanix or VMware engineers, or their escalation engineers. Also, with no reservations in place, virtual machines that are stuck at a BSOD will use as much CPU as they can get.