VMware ESXi VM CPU Performance Overcommitment "CPU Stuck"

Many VMware ESXi installations make the same mistake: they overcommit vCPUs, don't monitor CPU metrics like %RDY and %CSTP, and don't know why their virtual machines are slow or occasionally suffer from performance issues, especially under load. Sometimes you can find hints like "kernel: BUG: soft lockup - CPU stuck for 22s" in your logs, but most admins aren't aware of anything.

What can be the cause of this issue?

A very good explanation about ESXi CPU Scheduling can be found here: https://www.youtube.com/watch?v=8jeBIvzyB80 

It explains how the ESXi hypervisor schedules the physical CPUs to the virtual vCPUs. And here is the issue. For example:
Picture from the YouTube video "The vSphere CPU Scheduler" by "TrainerTests"
As shown in the screenshot, overcommitting the physical CPUs by assigning too many vCPUs to VMs may decrease the VMs' performance, because scheduling time slots are wasted.

This can be detected by monitoring the following CPU metrics:

Which ESXi metrics should be monitored?

·         %USED tells you how much time the virtual machine spent executing CPU cycles on the physical CPU.
·         %RDY (should be very low) is a very important performance indicator - start with this one. It defines how much time your virtual machine wanted to execute CPU cycles but could not get access to the physical CPU, i.e. how much time it spent in a "queue". Expect this value to stay below 5% (this equals 1,000 ms of CPU Ready in the vCenter performance graphs; see the conversion sketch below this list).
·         %CSTP (should be 0.0%) tells you how much time a vCPU of a multi-vCPU virtual machine is waiting for its sibling vCPUs to catch up (co-stop). If this number is higher than 1% you should consider lowering the number of vCPUs of your virtual machine.
·         %WAIT is the percentage of time the virtual machine was waiting for some VMkernel activity (such as I/O) to complete before it could continue.
·         %IDLE (should be high) is the percentage of time a world sits in the idle loop.
From http://www.vstellar.com/2015/10/09/understanding-cpu-over-commitment/


👉 High %RDY indicates that vCPUs are waiting for actual physical CPUs to schedule their threads
👉 High %CSTP indicates that ESXi is asking the vCPUs to wait – a.k.a. co-stopping them for scheduling purposes -> decrease the number of vCPUs of the VM
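
If you want to relate the 5% %RDY threshold above to the millisecond-based "CPU Ready" counter in the vCenter performance charts, the conversion only needs the chart's sampling interval. Here is a minimal Python sketch, assuming the default 20-second real-time interval (the daily/weekly/monthly views use longer intervals):

# Convert a vCenter "CPU Ready" summation value (in ms) into a %RDY percentage.
# Assumption: the real-time chart samples every 20 seconds; adjust interval_s
# for chart views with longer sampling intervals.
def cpu_ready_percent(ready_ms: float, interval_s: float = 20.0, vcpus: int = 1) -> float:
    """%RDY = ready time / (sampling interval * number of vCPUs) * 100"""
    return ready_ms / (interval_s * 1000.0 * vcpus) * 100.0

# Example: 1,000 ms of CPU Ready in a 20 s sample on a 1-vCPU VM equals 5 %RDY,
# which is the threshold mentioned above.
print(cpu_ready_percent(1000))           # 5.0
print(cpu_ready_percent(2000, vcpus=2))  # 5.0 on average per vCPU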

Small/Medium Businesses - New network devices (switches, routers, ...) - Minimum to-do list

Most small/medium businesses don't do much configuration, monitoring or config baselining, and don't follow best practices with their network devices like switches, routers, wireless controllers, access points, etc. Here is a short list of things you should do at a minimum:

1. Extend your monitoring of your network devices

Don't only ping them and check their uptime via SNMP, but also:
1.1 Monitor all uplinks (e.g. SNMP bandwidth; see the polling sketch after this list)
1.2 Monitor all important ports (ports of the servers, firewalls, storage, etc; again e.g. with SNMP bandwidth)
1.3 Monitor device health, fan status, temperature, etc
1.4 Monitor the routing table, especially if you use dynamic routing protocols and/or have many routes
1.5 Monitor utilization of CPU, memory, I/O, etc.
1.6 Monitor everything with secure protocols like SSH, SNMPv3 AuthPriv AES+SHA
1.7 Send SNMP traps from devices to your monitoring system
1.8 Send Syslog from your devices to your monitoring system & logging solution
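
As a rough idea of what such SNMP bandwidth monitoring boils down to (most monitoring systems do this for you), here is a minimal Python sketch that polls the 64-bit ifHCInOctets counter twice over SNMPv3 (AuthPriv, SHA+AES) and calculates the average inbound rate. It assumes net-snmp's snmpget is installed; the hostname, credentials and the interface index ".1" are placeholders for your environment:

#!/usr/bin/env python3
# Minimal sketch: poll IF-MIB::ifHCInOctets twice via SNMPv3 and compute the
# average inbound bandwidth. Requires net-snmp's "snmpget" on the local host.
import subprocess
import time

HOST = "switch1.example.local"                               # placeholder
USER, AUTH_PW, PRIV_PW = "monuser", "authpass", "privpass"   # placeholders
OID = "IF-MIB::ifHCInOctets.1"   # 64-bit inbound octet counter of ifIndex 1

def snmp_get_counter(oid):
    out = subprocess.check_output(
        ["snmpget", "-v3", "-l", "authPriv",
         "-u", USER, "-a", "SHA", "-A", AUTH_PW,
         "-x", "AES", "-X", PRIV_PW,
         "-Oqv",                 # print only the value
         HOST, oid],
        text=True)
    return int(out.strip())

INTERVAL = 30                    # seconds between the two samples
first = snmp_get_counter(OID)
time.sleep(INTERVAL)
second = snmp_get_counter(OID)

# Counter delta * 8 bits / interval = average inbound rate (no wrap handling).
mbps = (second - first) * 8 / INTERVAL / 1_000_000
print(f"{OID}: ~{mbps:.2f} Mbit/s inbound over the last {INTERVAL}s")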

2. Harden your network

2.1 Disable Telnet
2.2 Disable HTTP
2.3 Implement ACLs to allow access only from dedicated trusted hosts
2.4 Implement ACLs for dynamic routing protocols like BGP, OSPF, etc
2.5 Use LDAPS/RADIUS for LDAP/AD-based authentication for device mgmt
2.6 Send Syslog from your devices to your monitoring system & logging solution
2.7 Disable SNMPv1/v2c
2.8 Use DHCP snooping for rogue DHCP server protection
2.9 Use ARP Spoofing Protection
2.10 Think about disabling link-layer discovery protocols like LLDP, CDP, EDP, etc
2.11 Allow local admin account login only if the LDAPS/RADIUS server is not reachable
2.12 Delete default users, groups and communities

3. Authentication & dynamic VLAN assignment

3.1 Use IEEE 802.1X with certificates (at least two AAA RADIUS servers (e.g. FreeRADIUS) with EAP-TLS)
3.2 Use RFC 3580 for dynamic VLAN assignment
3.3 Think about using either a quarantine fallback VLAN for unauthenticated clients or a guest VLAN with internet access only
3.4 Think about using DHCP snooping (forwarding) together with a solution which does device fingerprinting

4. Documentation

4.1 Create a layer 1 and layer 2 network plan (e.g. in Visio)
4.2 Create a layer 2 and layer 3 network plan (e.g. in Visio)
4.3 Use the L2/L3 plan as the background of a map in your monitoring system to get a live overview

5. Testing

5.1 Test your loop protection (M/R/STP, loop-protect, ELRP, plus broadcast limit thresholds like max. 200 broadcasts per second, etc) in a maintenance window
5.2 Test your "CrossVLAN protection" in a maintenance window. By "CrossVLAN" I mean unwanted connections between two VLANs which should be separated (M/R/STP, loop-protect, an extra VLAN which is tagged on all ports and sends ELRP or similar loop protection protocols, etc)
5.3 Test your monitoring alerting - is an alert really sent when e.g. an important uplink is full or disconnected, when an important LACP LAG is down, etc (test using simulation, e.g. via jPerf, Observer, etc)
5.4 Check and test whether all best practices of the vendor are applied

6. IP-Subnetting

Yes, so many small and medium companies still have a huge flat layer 2 network per site :(
6.1 The more subnets you have, the more likely a network issue stays contained within one small subnet
6.2 The smaller the subnet, the less background noise
6.3 Microsegmentation is key! The smaller the subnet and the more it is separated (using private VLANs, ACLs, a firewall, a filtering device, host firewalls, a microsegmentation solution, NSX-T or something similar), the better it is protected and the harder lateral movement gets. (A small sketch for splitting a flat network follows below.)
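
To get a feeling for how a flat network could be split up, Python's standard ipaddress module can do the subnetting arithmetic for you. A small sketch, assuming a hypothetical flat 10.0.0.0/16 per site:

import ipaddress

# Hypothetical flat site network that should be split into smaller segments.
flat = ipaddress.ip_network("10.0.0.0/16")

# Carve it into /24 subnets (254 usable hosts each), e.g. one per function:
# servers, clients, printers, management, voice, ...
subnets = list(flat.subnets(new_prefix=24))
for net in subnets[:5]:
    print(net, "-", net.num_addresses - 2, "usable hosts")
print(f"{len(subnets)} x /24 subnets available in {flat}")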

There are many more things, like using LACP instead of static link aggregation groups, using LACP mode fast instead of the default slow, using Bidirectional Forwarding Detection (BFD) for everything, using multi-chassis link aggregation (like MC-LAG, MLAG, etc) instead of stacking (firmware updates & reboots usually cause the whole stack topology to reboot, which is not the case with MLAG), using out-of-band management, and much more.

The listed items are the things that should be done at a minimum.

FortiGate allows Ping from "not trusted hosts" since FortiOS 6.0

Recently I discovered something after updating a FortiGate cluster which I monitor intensively, not only via working monitoring queries, but also by doing some negative monitoring*: Since FortiOS 6.0 the Fortinet FortiGate firewall answers ICMP Type 8 Echo Requests ("ping") from source addresses which are not included in the trusted hosts (config system admin -> edit admin -> set trusthost1). This is due to a change in FortiOS which allows ping from more addresses. To limit ping, you have to use Local-In Policies (which should always be used anyway).

*negative monitoring = I recommend not only monitoring what is allowed (like your SNMPv3 AuthPriv queries, your SSH or HTTPS calls), but also monitoring what is forbidden, in order to catch possible bugs or misconfigurations when something suddenly answers that really shouldn't. This can be done by trying to use insecure protocols like Telnet, SNMPv1/v2c, HTTP, etc., by sending requests from disallowed source IPs, or by trying to log in with deleted/altered accounts or with wrong credentials (a small sketch follows below).
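
As a small illustration of this negative monitoring idea, the Python sketch below tries to open TCP connections to services that should be disabled (Telnet and HTTP in this example) and reports a critical state if any of them unexpectedly answer. The target IP and the port list are placeholders; a real check would run from your monitoring system:

#!/usr/bin/env python3
# Negative monitoring sketch: alert if a service that should be disabled
# (e.g. Telnet or HTTP on a management interface) suddenly answers.
import socket

TARGET = "192.0.2.10"                  # placeholder management IP
FORBIDDEN_PORTS = {23: "telnet", 80: "http"}

def port_open(host, port, timeout=3.0):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

failures = [name for port, name in FORBIDDEN_PORTS.items()
            if port_open(TARGET, port)]

if failures:
    # Exit code 2 = CRITICAL in Nagios-compatible monitoring systems.
    print(f"CRITICAL: forbidden service(s) answering on {TARGET}: {', '.join(failures)}")
    raise SystemExit(2)
print(f"OK: no forbidden services answering on {TARGET}")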


Fortinet documentation:
In versions 5.x and below, trusted hosts configured by an administrator user only allow access from certain IP addresses configured in trusted hosts, to all services configured on the interface, including ping.

From version 6.0 onwards ping service on management interfaces are not included within the scope of trusted hosts. This means that you will be able to ping the interface from an IP that is not included within trusted hosts.

In order to only allow trusted hosts to be able to ping the interface and deny everyone else, you will need to configure a Local In Policy as below.

CLI configuration:
System > Administrators >
config system admin
    edit "admin"
        set trusthost1 172.26.73.48 255.255.255.255
        set accprofile "super_admin"
        set vdom "root"
    next
end

Configuring address and address group as per the trusted hosts:
config firewall address
    edit "trusted-1"
        set type ipmask
        set comment ''
        set visibility enable
        set associated-interface ''
        set color 0
        set allow-routing disable
        set subnet 172.26.73.48 255.255.255.255
    next
end

config firewall addrgrp
    edit "trusted_grp"
        set member "trusted-1"
        set comment ''
        set visibility enable
        set color 0
    next
end
Configuring Firewall local in policies:
config firewall local-in-policy
    edit 2
        set intf "port1"
        set srcaddr "trusted_grp"
        set dstaddr "all"
        set action accept
        set service "PING"
        set schedule "always"
        set status enable
        set comments ''
    next
    edit 1
        set intf "port1"
        set srcaddr "all"
        set dstaddr "all"
        set action deny
        set service "PING"
        set schedule "always"
        set status enable
        set comments ''
    next
end
Before configuring the local in policy:
diagnose sniffer packet any 'host 172.26.73.78 and icmp' 4
interfaces=[any]
filters=[host 172.26.73.48 and icmp]
3.647787 port1 in 172.26.73.78 -> 10.5.22.114: icmp: echo request
3.647850 port1 out 10.5.22.114 -> 172.26.73.78: icmp: echo reply
4.651341 port1 in 172.26.73.78 -> 10.5.22.114: icmp: echo request
4.651383 port1 out 10.5.22.114 -> 172.26.73.78: icmp: echo reply
5.657949 port1 in 172.26.73.78 -> 10.5.22.114: icmp: echo request
5.657992 port1 out 10.5.22.114 -> 172.26.73.78: icmp: echo reply
After configuring the local in policies:
diagnose sniffer packet any 'host 172.26.73.78 and icmp' 4
interfaces=[any]
filters=[host 172.26.73.48 and icmp]
4.264950 port1 in 172.26.73.78 -> 10.5.22.114: icmp: echo request
8.904217 port1 in 172.26.73.78 -> 10.5.22.114: icmp: echo request
13.906576 port1 in 172.26.73.78 -> 10.5.22.114: icmp: echo request
18.893835 port1 in 172.26.73.78 -> 10.5.22.114: icmp: echo request

trace_id=22 func=print_pkt_detail line=5497 msg="vd-root:0 received a packet(proto=1, 172.26.73.78:1->10.5.22.114:2048) from port1. type=8, code=0, id=1, seq=335."
id=20085 trace_id=22 func=init_ip_session_common line=5657 msg="allocate a new session-00874fe6"
id=20085 trace_id=22 func=vf_ip_route_input_common line=2591 msg="find a route: flag=80000000 gw-10.5.22.114 via root"
id=20085 trace_id=22 func=fw_local_in_handler line=409 msg="iprope_in_check() check failed on policy 1, drop"
Reply for the trusted host:
diagnose sniffer packet any 'host 172.26.73.48 and icmp' 4
interfaces=[any]
filters=[host 172.26.73.48 and icmp]
7.239647 port1 in 172.26.73.48 -> 10.5.22.114: icmp: echo request
7.239743 port1 out 10.5.22.114 -> 172.26.73.48: icmp: echo reply
8.261081 port1 in 172.26.73.48 -> 10.5.22.114: icmp: echo request
8.261122 port1 out 10.5.22.114 -> 172.26.73.48: icmp: echo reply
9.276261 port1 in 172.26.73.48 -> 10.5.22.114: icmp: echo request
9.276321 port1 out 10.5.22.114 -> 172.26.73.48: icmp: echo reply
10.294536 port1 in 172.26.73.48 -> 10.5.22.114: icmp: echo request
10.294588 port1 out 10.5.22.114 -> 172.26.73.48: icmp: echo reply


Diag debug flow on FortiOS 6.0.9
Source IP of ping is not configured in trustedhosts

2020-02-26 08:38:47 id=20085 trace_id=255 func=init_ip_session_common line=5684 msg="allocate a new session-00018a90"
2020-02-26 08:38:47 id=20085 trace_id=255 func=vf_ip_route_input_common line=2591 msg="find a route: flag=80000000 gw-10.128.36.35 via root"
2020-02-26 08:38:47 id=20085 trace_id=256 func=print_pkt_detail line=5519 msg="vd-root:0 received a packet(proto=1, 10.128.36.35:1->10.240.161.178:0) from local. type=0, code=0, id=1, seq=1156."


Diag debug flow on FortiOS 5.6.12
Source IP of ping is not configured in trustedhosts

2020-02-26 08:46:05 id=20085 trace_id=10 func=print_pkt_detail line=5375 msg="vd-root received a packet(proto=1, 10.240.161.178:1->10.128.36.4:2048) from mgmt1. type=8, code=0, id=1, seq=1166."
2020-02-26 08:46:05 id=20085 trace_id=10 func=init_ip_session_common line=5534 msg="allocate a new session-fd0ad6af"
2020-02-26 08:46:05 id=20085 trace_id=10 func=vf_ip_route_input_common line=2574 msg="find a route: flag=80000000 gw-10.128.36.4 via root"
2020-02-26 08:46:05 id=20085 trace_id=10 func=fw_local_in_handler line=402 msg="iprope_in_check() check failed on policy 0, drop"

Source: https://kb.fortinet.com/kb/documentLink.do?externalID=FD44156

Cribl - Change values to lowerCase

Some logs (e.g. from Microsoft Azure) are sometimes not fully normalized to lowercase characters. You can use Cribl to adjust those values by...