
Troubleshooting

Regenerate Standalone ESXi Host Certificate

On a freshly installed ESXi host, the following error is displayed:

The certificate assigned to this host is not valid yet. You should install a valid certificate.

The issue is caused by a system time that was set in the future during the ESXi installation. An incorrect system time can also cause problems when adding the ESXi host to vCenter Server. To solve the issue, set the correct time (best practice is to use an NTP server) and regenerate the certificate.
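For a standalone host, a minimal sketch of the fix from an SSH session looks like this; it assumes ESXi 7.0-style esxcli syntax (the esxcli system ntp namespace is not available on older releases, where NTP is configured through the Host Client instead) and that briefly restarting the management agents is acceptable:

# Point the host at an NTP server and verify the time
esxcli system ntp set --server=pool.ntp.org --enabled=true
esxcli system time get

# Regenerate the self-signed host certificate and restart the management agents
/sbin/generate-certificates
/etc/init.d/hostd restart && /etc/init.d/vpxa restart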

Read More »Regenerate Standalone ESXi Host Certificate

Installation or Removal of VIB Packages in ESXi 7.0 fails with Error: Failed to query file system stats:

While installing ESXi updates, I noticed that on one of my hosts, the installation or removal of VIB packages fails with the following error message:

# esxcli software vib install -d [package]
# esxcli software vib remove -n [package]
[InstallationError]
Failed to query file system stats: Errors:
Error getting data for filesystem on '/vmfs/volumes/59a83d9c-628c6ae0-7b35-f44d306ec05a': Cannot open volume: /vmfs/volumes/59a83d9c-628c6ae0-7b35-f44d306ec05a, skipping.
cause = Errors:
Error getting data for filesystem on '/vmfs/volumes/59a83d9c-628c6ae0-7b35-f44d306ec05a': Cannot open volume: /vmfs/volumes/59a83d9c-628c6ae0-7b35-f44d306ec05a, skipping.
Please refer to the log file for more details.

The device 59a83d9c-628c6ae0-7b35-f44d306ec05a was a non-existent volume, referenced by a VFFS mount. VFFS (Virtual Flash File System) was used in earlier ESXi releases by vSphere Flash Read Cache. I'm not sure where that mount comes from, but this is how you can remove the stale mount:
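Before removing anything, a quick sanity check from the ESXi shell confirms that the UUID really has no backing volume (the removal steps themselves follow after the break); both commands are standard on ESXi 7.0:

# List all mounted filesystems; the stale VFFS UUID should not resolve to a healthy volume
esxcli storage filesystem list
# Querying the path directly typically fails for a stale mount
vmkfstools -P /vmfs/volumes/59a83d9c-628c6ae0-7b35-f44d306ec05a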

Read More »Installation or Removal of VIB Packages in ESXi 7.0 fails with Error: Failed to query file system stats:

Troubleshooting CSE 3.1 TKGm Integration with VMware Cloud Director 10.3

This article recaps issues that I had during the integration of VMware Container Service Extension 3.1, which allows the deployment of Tanzu Kubernetes Grid clusters (TKGm) in VMware Cloud Director 10.3.

If you are interested in an Implementation Guide, refer to Deploy CSE 3.1 with TKGm Support in VCD 10.3 and First Steps with TKGm Guest Clusters in VCD 10.3.

  • CSE Log File Location
  • DNS Issues during Photon Image Creation
  • Disable rollbackOnFailure to troubleshoot TKGm deployment errors
  • Template cookbook version 1.0.0 is incompatible with CSE running in non-legacy mode
  • https://[IP-ADDRESS] should have a https scheme and match CSE server config file
  • 403 Client Error: Forbidden for url: https://[VCD]/oauth/tenant/demo/register
  • NodeCreationError: failure on creating nodes ['mstr-xxxx']
  • Force Delete TKGm Clusters / Can't delete TKGm Cluster / Delete Stuck in DELETE:IN_PROGRESS

Read More »Troubleshooting CSE 3.1 TKGm Integration with VMware Cloud Director 10.3

Deploy NSX-T Edge VM SSH Keys with Ansible

While working with NSX-T, there are many reasons to access edge appliances using SSH. Most troubleshooting options are only available through nsxcli on the appliance itself. During the deployment, each appliance gets three user accounts: root, admin, and audit. All accounts are configured with password-based authentication. In a previous article, I've already described how to deploy SSH keys using nsxcli, which allows a secure and convenient authentication method. In this article, I'm explaining how to use Ansible to deploy SSH public keys to NSX-T Edges, which makes it easy to manage keys on a large platform.
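As a rough sketch of the idea (not the full playbook from the article), an ad-hoc Ansible run with the raw module can push a key to every edge in an inventory group. The raw module is used because the edge appliances present nsxcli instead of a regular shell; the exact ssh-keys command syntax shown here is an assumption based on the nsxcli article mentioned above and should be verified against your NSX-T version:

# inventory.ini contains an [edges] group with the edge management addresses
# -u admin -k prompt for the admin password, -m raw sends the command to nsxcli as-is
ansible edges -i inventory.ini -u admin -k -m raw \
  -a "set user admin ssh-keys label ansible-key type ssh-rsa value AAAAB3NzaC1yc2E..."

A playbook-based version of the same idea scales better when keys have to be rotated regularly across many edges.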

Read More »Deploy NSX-T Edge VM SSH Keys with Ansible

Error when connecting Virtual Machine to NSX-T Segments

When you try to connect an NSX-T based Segment to a virtual machine, the task fails with the following error message:

Reconfigure virtual machine - An error occurred during host configuration

In the nsx logfile on the ESXi host where the VM is located, the following error is displayed:

/var/log/nsx-syslog.log
2021-03-13T19:00:36Z nsx-opsagent[527252]: NSX 527252 - [nsx@6876 comp="nsx-esx" subcomp="opsagent" s2comp="nsxa" tid="527596" level="ERROR" errorCode="MPA44211"] [PortOp] Failed to create port 780b915d-1479-4eed-8e29-2364d9563f95 with VIF f3f605f2-38a1-4263-bbbd-81b189077f69 because DVS id is not found by transport-zone id 1b3a2f36-bfd1-443e-a0f6-4de01abc963e
2021-03-13T19:00:36Z nsx-opsagent[527252]: NSX 527252 - [nsx@6876 comp="nsx-esx" subcomp="opsagent" s2comp="nsxa" tid="527596" level="ERROR" errorCode="MPA42001"] [CreateLocalDvPort] createPort(uuid=780b915d-1479-4eed-8e29-2364d9563f95, zone=1b3a2f36-bfd1-443e-a0f6-4de01abc963e) failed: Failed to create port 780b915d-1479-4eed-8e29-2364d9563f95 with VIF f3f605f2-38a1-4263-bbbd-81b189077f69 because DVS id is not found by transport-zone id 1b3a2f36-bfd1-443e-a0f6-4de01abc963e

 

Read More »Error when connecting Virtual Machine to NSX-T Segments

vSphere with Tanzu - SupervisorControlPlaneVM Excessive Disk WRITE IO

After deploying the latest version of VMware vSphere with Tanzu (vCenter Server 7.0 U1d / v1.18.2-vsc0.0.7-17449972), I noticed that the Virtual Machines running the Control Plane (SupervisorControlPlaneVM) had a constant disk write IO of 15 MB/s with over 3000 IOPS. This was something I hadn't seen in previous versions, and as this was a completely new setup with no namespaces created yet, there had to be an issue.

After troubleshooting the Supervisor Control Plane, it turned out that the problem was caused by fluent-bit, the log processor used by the Kubernetes control plane. The log was constantly being spammed with debug messages. Reducing the log level solved the problem for me.
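For reference, fluent-bit's verbosity is controlled by the Log_Level key in the [SERVICE] section of its configuration; a snippet like the one below turns debug spam down to info. Where exactly that configuration lives on the SupervisorControlPlaneVM, and how to change it there, is covered in the full article:

[SERVICE]
    # valid values: error, warning, info, debug, trace
    Log_Level    info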

[Update: 2021-03-14 - The problem is not resolved in vSphere 7.0 Update 2]

Read More »vSphere with Tanzu - SupervisorControlPlaneVM Excessive Disk WRITE IO

Heads Up: VMFS6 Heap Exhaustion in ESXi 7.0

In ESXi 7.0 (Build 15843807) and 7.0b (Build 16324942), there is a known issue with the VMFS6 filesystem: in certain workflows, memory is not freed correctly, resulting in VMFS heap exhaustion. The problem is solved in ESXi 7.0 Update 1. You might be affected when your system shows the following symptoms:

  • Datastores are showing "Not consumed" on hosts
  • Virtual Machines fail to vMotion
  • Virtual Machines become orphaned when powered off
  • Snapshot creation fails with "An error occurred while saving the snapshot: Error."

In the vmkernel.log, you see the following error messages:

  • Heap vmfs3 already at its maximum size. Cannot expand
  • Heap vmfs3: Maximum allowed growth (#) too small for size (#)
  • Failed to initialize VMFS distributed locking on volume #: Out of memory
  • Failed to get object 28 type 1 uuid # FD 0 gen 0: Out of memory
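A quick way to check whether a host is already hitting the issue is to search vmkernel.log for the heap messages (default log location assumed):

# Any match indicates VMFS heap pressure on this host
grep -i "heap vmfs3" /var/log/vmkernel.log
grep -i "Out of memory" /var/log/vmkernel.log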

Read More »Heads Up: VMFS6 Heap Exhaustion in ESXi 7.0

Quick Tip: Reset Tanzu SupervisorControlPlaneVM Alarms

When you are working with the Kubernetes integration in vSphere 7.0, you might come into the situation where a SupervisorControlPlaneVM has an active alarm. Those Virtual Machines are deployed and controlled by the WCP agent, and even as an Administrator, you are not allowed to touch those objects. You can't power them off, reboot them, or migrate them using vMotion. The problem is that you can't even clear alarms. One alarm I recently had was the "vSphere HA virtual machine failover failed" alarm, which you usually see when the ESXi hostd crashed but the Virtual Machines are still running.

Read More »Quick Tip: Reset Tanzu SupervisorControlPlaneVM Alarms