VMware ESXi VM CPU Performance: Overcommitment and "CPU Stuck"

Many VMware ESXi installations make the same mistake: they overcommit vCPUs, do not monitor CPU metrics such as %RDY and %CSTP, and then cannot explain why their virtual machines are slow or occasionally have performance problems, especially under load. Sometimes the logs contain hints such as "kernel BUG: soft lockup - CPU stuck for 22seconds", but most administrators never notice anything.

What can cause this issue?

A very good explanation about ESXi CPU Scheduling can be found here: https://www.youtube.com/watch?v=8jeBIvzyB80 

It explains how the ESXi hypervisor schedules the virtual CPUs (vCPUs) onto the physical CPUs, and this is where the problem arises. For example:
Picture from the YouTube video "The vSphere CPU Scheduler" by "TrainerTests"
As the screenshot shows, overcommitting the physical CPUs by assigning too many vCPUs to VMs can reduce VM performance, because time slots are wasted.
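
To get a first feeling for how heavily a host is overcommitted, you can compare the number of assigned vCPUs with the number of physical cores. The following is only a minimal sketch: the function name `vcpu_overcommit_ratio` and the example host (8 cores, five VMs) are made up for illustration and do not come from the video.

```python
def vcpu_overcommit_ratio(vcpus_per_vm, physical_cores):
    """Return the total number of assigned vCPUs divided by the physical cores."""
    if physical_cores <= 0:
        raise ValueError("physical_cores must be positive")
    return sum(vcpus_per_vm) / physical_cores


# Hypothetical host: 8 physical cores, five VMs with 4, 4, 2, 4 and 2 vCPUs.
example_vms = [4, 4, 2, 4, 2]
ratio = vcpu_overcommit_ratio(example_vms, physical_cores=8)
print(f"{sum(example_vms)} vCPUs on 8 cores -> {ratio:.1f}:1")
```

The ratio alone does not tell you whether the VMs suffer. It only shows how much contention is possible; whether it actually happens is what the CPU metrics reveal.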

You can detect this by monitoring the CPU metrics below.

Which ESXi metrics should be monitored?

· %USED tells you how much time the virtual machine spent executing CPU cycles on the physical CPU.
· %RDY (should be very low) is a very important performance indicator; start with this one. It shows how much time the virtual machine wanted to execute CPU cycles but could not get access to a physical CPU, i.e. how much time it spent in a “queue”. This value should stay below 5% (which equals 1000 ms in the vCenter performance graphs at the 20-second real-time interval).
· %CSTP (should be 0.0%) tells you how much time a vCPU of a multi-vCPU virtual machine spent stopped while waiting for the VM's other vCPUs to catch up. If this number is higher than 1%, consider reducing the number of vCPUs in the virtual machine.
· %WAIT is the percentage of time the virtual machine spent waiting for some VMkernel activity (such as I/O) to complete before it could continue.
· %IDLE (should be high) is the percentage of time a world spends in the idle loop.
From http://www.vstellar.com/2015/10/09/understanding-cpu-over-commitment/
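
The vCenter performance graphs report CPU ready as a summation in milliseconds, while the threshold above is a percentage. The conversion can be sketched as follows, assuming the real-time chart with its 20-second sample interval; the function name `ready_ms_to_percent` is made up for this example, and other chart intervals need a different `interval_s`.

```python
def ready_ms_to_percent(ready_ms, interval_s=20):
    """Convert a CPU ready summation (ms) into a percentage of the sample interval.

    interval_s defaults to 20, the sample interval of the real-time chart
    (assumption: you are reading the real-time graph, not a rollup).
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return ready_ms * 100 / (interval_s * 1000)


# 1000 ms of ready time within a 20-second sample corresponds to 5 %.
print(ready_ms_to_percent(1000))
```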


👉 High %RDY indicates that vCPUs are waiting for physical CPUs to become available to run their threads
👉 High %CSTP indicates that ESXi is making the vCPUs wait (a.k.a. co-stopping them) for scheduling purposes -> reduce the number of vCPUs of the VM
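
These two rules of thumb can be turned into a simple check. The sketch below is not an official VMware tool: the function `check_vm`, the VM names and the sample values are hypothetical, and the limits are simply the 5% (%RDY) and 1% (%CSTP) thresholds mentioned in this post. You would feed it values read from esxtop or your monitoring system.

```python
RDY_LIMIT = 5.0   # %RDY should stay below this value
CSTP_LIMIT = 1.0  # %CSTP above this value suggests too many vCPUs


def check_vm(name, rdy_pct, cstp_pct):
    """Return a list of human-readable findings for one VM."""
    findings = []
    if rdy_pct >= RDY_LIMIT:
        findings.append(
            f"{name}: %RDY {rdy_pct:.1f} is too high, vCPUs are waiting for physical CPUs"
        )
    if cstp_pct > CSTP_LIMIT:
        findings.append(
            f"{name}: %CSTP {cstp_pct:.1f} is too high, consider reducing the vCPUs of this VM"
        )
    return findings


# Hypothetical sample values for two VMs.
for finding in check_vm("db01", rdy_pct=7.5, cstp_pct=2.0) + check_vm("web01", 0.8, 0.0):
    print(finding)
```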
