
Windows VMs have issues resolving DNS names and run into network timeouts or packet loss

Problem

Windows VMs running on VMware vSphere have issues resolving DNS names and also run into network timeouts or packet loss on other protocols.

For example, running a simple PowerShell script demonstrates the issue (replace *YourFQDN* with your FQDN and '*DNS-Server-IP*' with your DNS server's IP address):
 

1..1000 | Foreach-Object -Process {
    [pscustomobject]@{
        Try         = $_
        ElapsedTime = (Measure-Command -Expression {
                Resolve-DnsName -DnsOnly -QuickTimeout -NoHostsFile -Name '*YourFQDN*' -Server '*DNS-Server-IP*'
            }).TotalMilliseconds -as [int]
    }
} |
    Group-Object -Property 'ElapsedTime' |
    Sort-Object -Property 'Count'

PowerShell DNS query test script

From 1000 DNS queries, 541x were answered within 2 ms
From 1000 DNS queries, 243x were answered within 1 ms
From 1000 DNS queries, 57x were answered within 3 ms
From 1000 DNS queries, 153x were not answered (timeout >1000 ms)
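For Linux VMs in the same segment you can run a rough shell equivalent of the PowerShell test above. This is only a sketch: 'localhost' stands in for your FQDN, getent resolves via NSS rather than querying the DNS server directly (unlike Resolve-DnsName -DnsOnly), and 10 iterations keep the demo short — use 1000 for a real test.

```shell
# Time repeated name lookups and bucket the results by elapsed milliseconds.
for i in $(seq 1 10); do
    start=$(date +%s%N)                      # nanoseconds before the lookup
    getent hosts localhost > /dev/null       # resolve the name via NSS
    end=$(date +%s%N)                        # nanoseconds after the lookup
    echo $(( (end - start) / 1000000 ))      # elapsed time in ms
done | sort -n | uniq -c                     # count occurrences per ms bucket
```

The `uniq -c` output plays the role of `Group-Object`: the left column is the count, the right column the elapsed milliseconds.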

Debug logs of vnetWFP show the event "DEBUG: ALEInspectInjectComplete : Packet injection status is : c000021b".

Solution

Update your VMware Tools 11.x with the Guest Introspection Driver to version 11.2.6 and reboot the VM, or uninstall the Guest Introspection Driver. We first suspected VMware NSX-T or VMware Carbon Black EDR, but neither was at fault: it was the NSX Guest Introspection Driver.

Root Cause: Packet drop is seen due to intermittent failure reported by the Microsoft WFP packet injection API.

https://kb.vmware.com/s/article/79185

After the update or removal of the driver the issues were gone:

PowerShell DNS query test script after the VMware Tools update

From 1000 DNS queries, 985x were answered within 1 ms
From 1000 DNS queries, 10x were answered within 2 ms
From 1000 DNS queries, 3x were answered within 3 ms
From 1000 DNS queries, 1x was answered within 4 ms
From 1000 DNS queries, 1x was answered within 35 ms
From 1000 DNS queries, 0x timed out.

Veeam backup causes BGP route flapping on VMware NSX-T Edge VMs

If you are running VMware NSX-T with BGP and BFD and use Veeam backup, you may see BGP route flapping, BGP neighbor ADJCHANGE events, or "Down BGP Notification FSM-ERR" messages.

The issue can be caused by Veeam backup, which creates a snapshot of your NSX-T Edge VM in order to back it up; the snapshot operations briefly stun the VM, which can be enough for BFD/BGP to drop the session.

Logs show something like:
2020-12-20T20:38:05.278Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 142898 us
2020-12-20T20:35:05.806Z| vcpu-0| I125: SnapshotVMXTakeSnapshotComplete: Done with snapshot 'VEEAM BACKUP TEMPORARY SNAPSHOT': 153
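To see how long the snapshot operations actually stunned the VM, you can grep the VM's vmware.log for the Checkpoint_Unstun lines and convert microseconds to milliseconds. The snippet below runs against a sample file mimicking the excerpt above; the real log sits in the VM's folder on the datastore (e.g. /vmfs/volumes/<datastore>/<vm>/vmware.log — path is an example):

```shell
# Sample log line, copied from the excerpt above, so the pipeline is runnable:
cat > /tmp/vmware.log <<'EOF'
2020-12-20T20:38:05.278Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 142898 us
EOF

# Pull out each stun duration and convert microseconds to milliseconds:
grep -o 'vm stopped for [0-9]* us' /tmp/vmware.log |
    awk '{ printf "stun: %.1f ms\n", $4 / 1000 }'
```

A 142.9 ms stun per snapshot operation is short, but with aggressive BFD intervals it can be enough to trip the session.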

Router logs show something like:
date=2020-12-20,time=20:35:10,devname="fwdev01",logid="0103020300",type="event",subtype="router",level="warning",vd="dev",eventtime=693310,logdesc="BGP neighbor status changed",msg="BGP: %BGP-5-ADJCHANGE: neighbor 172.23.39.35 Up "
date=2020-12-20,time=20:35:10,devname="fwdev01",logid="0103020301",type="event",subtype="router",level="warning",vd="dev",eventtime=693310,logdesc="Routing log",msg="BGP: 172.23.39.35-Outgoing [DECODE] Open Cap: unrecognized capability code 73 len 8"
date=2020-12-20,time=20:35:10,devname="fwdev01",logid="0103020301",type="event",subtype="router",level="warning",vd="dev",eventtime=693310,logdesc="Routing log",msg="BGP: 172.23.39.35-Outgoing [DECODE] Open Cap: unrecognized capability code 69 len 4"
date=2020-12-20,time=20:35:06,devname="fwdev01",logid="0103020300",type="event",subtype="router",level="warning",vd="dev",eventtime=693306,logdesc="BGP neighbor status changed",msg="BGP: %BGP-5-ADJCHANGE: neighbor 172.23.39.35 Down BGP Notification FSM-ERR"
date=2020-12-20,time=20:35:06,devname="fwdev01",logid="0103020301",type="event",subtype="router",level="warning",vd="dev",eventtime=693306,logdesc="Routing log",msg="BGP: %BGP-3-NOTIFICATION: received from 172.23.39.35 6/2 (Cease/Administratively Shutdown.) 0 data-bytes

VMware ESXi VM CPU Performance Overcommitment "CPU Stuck"

Many VMware ESXi installations make the same mistake: they overcommit vCPUs, don't monitor CPU metrics like %RDY and %CSTP, and don't know why their virtual machines are slow or sometimes have performance issues, especially under load. Sometimes you find hints like "kernel BUG: soft lockup - CPU stuck for 22s" in your logs, but most admins aren't aware of anything.

What can be the cause of this issue?

A very good explanation of ESXi CPU scheduling can be found here: https://www.youtube.com/watch?v=8jeBIvzyB80

It explains how the ESXi hypervisor schedules the physical CPUs across the virtual vCPUs, and this is exactly where the issue lies. For example:
Picture from YouTube Video "The vSphere CPU Scheduler" of "TrainerTests"
As shown in the screenshot, overcommitting the physical CPUs by assigning too many vCPUs to VMs may decrease VM performance, because scheduling time slots are wasted.

You can detect this by monitoring the following CPU metrics:

Which ESXi metrics should be monitored?

- %USED tells you how much time the virtual machine spent executing CPU cycles on the physical CPU.
- %RDY (should be very low) is a very important performance indicator; start with this one. It defines how much time the virtual machine wanted to execute CPU cycles but could not get access to a physical CPU, i.e. how much time it spent in a "queue". This value should stay below 5% (which equals 1000 ms in the vCenter Performance Graphs).
- %CSTP (should be 0.0%) tells you how much time a virtual machine with multiple vCPUs spends waiting for its vCPUs to catch up with each other. If this number is higher than 1%, consider lowering the number of vCPUs of that virtual machine.
- %WAIT is the percentage of time the virtual machine was waiting for some VMkernel activity to complete (such as I/O) before it could continue.
- %IDLE (should be high) is the percentage of time a world spends in the idle loop.
From http://www.vstellar.com/2015/10/09/understanding-cpu-over-commitment/


👉 High %RDY indicates that vCPUs are waiting for actual physical CPUs to schedule their threads
👉 High %CSTP indicates that ESXi is asking the vCPUs to wait – A.K.A. co-stopping them for scheduling purposes -> decrease vCPUs of VM
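The "5% equals 1000 ms" relationship mentioned above comes from the vCenter real-time charts, which report CPU Ready as a millisecond summation over a 20-second sampling interval. A quick sketch of the conversion:

```shell
# Convert a vCenter "CPU Ready" summation value (ms) into %RDY.
# Real-time charts sample every 20 seconds = 20000 ms.
ready_ms=1000
interval_ms=20000
awk -v r="$ready_ms" -v i="$interval_ms" \
    'BEGIN { printf "%%RDY = %.1f%%\n", r / i * 100 }'
```

So 1000 ms of accumulated ready time inside one 20-second sample equals 5% %RDY for that vCPU.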

Intel NUC running VMware ESXi 6.5.3

Since some people on the internet keep asking "does my Intel NUC support ESXi version x.y.z?", here is some short feedback: on an Intel NUC NUC7i7DNHE "Intel NUC 8th Gen Commercial" (https://www.intel.de/content/www/de/de/products/boards-kits/nuc.html) I'm running VMware ESXi 6.5 Update 3 without any additional drivers.

VMware updates can be found here: http://www.vmware.com/patchmgr/download.portal

Extend ZFS zpool of Linux Ubuntu NextCloud VM

If you are running a virtual machine with NextCloud v16 on Ubuntu Linux and you want to extend /mnt/ncdata, then:

  1. Make sure your backup is fine and consider creating a VM snapshot
  2. Extend the second virtual hard disk in the VMware virtual machine settings (e.g. from 40 GB to 170 GB)
  3. Log in using SSH/VMware console and change to the root user using sudo su or sudo -i
  4. Check the zpool size using zpool list
  5. Check the /mnt/ncdata size using df -h
  6. Read the new disk size using parted -l and answer "Fix" when asked to adjust the GPT to use all of the space
  7. Delete the buffer partition 9 using parted /dev/sdb rm 9
  8. Extend the first partition to 100% of the available size using parted /dev/sdb resizepart 1 100%
  9. Expand the pool onto the resized partition using zpool online -e ncdata /dev/sdb
  10. Check the new zpool size using zpool list
  11. Check the new /mnt/ncdata size using df -h

Example with nextcloud 16 on Ubuntu 18.04:

root@nextcloud:/mnt# zpool list
NAME     SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
ncdata  39.8G  15.2M  39.7G         -     0%     0%  1.00x  ONLINE  -
 
root@nextcloud:/mnt#
root@nextcloud:/mnt#
root@nextcloud:/mnt# df -Th
Filesystem                     Type      Size  Used Avail Use% Mounted on
[...]
ncdata                         zfs        39G  896K   39G   1% /mnt/ncdata
[...]
root@nextcloud:/mnt#
root@nextcloud:/mnt# parted -l
[...]
Model: VMware Virtual disk (scsi)
Disk /dev/sdb: 183GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:

Number  Start   End     Size    File system  Name                  Flags
 1      1049kB  42.9GB  42.9GB  zfs          zfs-28cd7
 9      42.9GB  42.9GB  8389kB
[...]
root@nextcloud:/mnt#
root@nextcloud:/mnt#
root@nextcloud:/mnt# parted /dev/sdb rm 9
Information: You may need to update /etc/fstab.
root@nextcloud:/mnt#
root@nextcloud:/mnt# parted /dev/sdb resizepart 1 100%
Information: You may need to update /etc/fstab.
root@nextcloud:/mnt#
root@nextcloud:/mnt# zpool online -e ncdata /dev/sdb
root@nextcloud:/mnt#
root@nextcloud:/mnt# zpool list
NAME     SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
ncdata   170G  2.51M   170G         -     0%     0%  1.00x  ONLINE  -
root@nextcloud:/mnt#
root@nextcloud:/mnt# df -h
Filesystem                      Size  Used Avail Use% Mounted on
[...]
ncdata                          165G  896K  165G   1% /mnt/ncdata
[...]
root@nextcloud:/mnt#


Update 25.12.2020: If you are running Ubuntu 20/Nextcloud VM 20, then follow the instructions here, starting at point 7: https://how2itsec.blogspot.com/2020/12/increase-disk-and-zfs-of-nextcloud-vm.html

Hint when VMware doesn't show a snapshot

If VMware ESXi/vCenter/your vSphere environment does not show a virtual machine snapshot although it exists and the virtual machine is using it, here is a little trick that might help you:

  1. Make sure you have some free space left in the datastore on which the VM is stored.
  2. Create another virtual machine snapshot.
  3. Click on "Delete all snapshots".
This can trigger ESXi to delete all snapshots, including the one that isn't shown. However, this does not always work; in that case, working with a clone of the VM and vmkfstools can help.

Monitor a UniFi WLAN Access Point with PRTG using SNMPv3 (Auth+Encrypted)

This is a tiny guide on how to monitor your UniFi wireless access point, in this case a UniFi U7 Pro, with SNMPv3 using AES encryption and SHA authentication...