Today I tried to explain why this Storage vMotion / dvSwitch / HA problem actually existis, how the virtual machines getting affected and what I can do to mitigate the problem. This issue seems to be hard to explain so i started to search the internet for pictures. I could find many explanations, but no one draws a picture of it. I've done it:
- We have 2 ESXi hosts, both connected to a shared storage containing 2 LUNs. The first ESXi runs two virtual machines (VM1 & VM2), the second runs only one (VM3). All three virtual machines are connected to a Distributed Virtual Switch. There is an additional folder on each LUN containing the dvSwitch Port information (.dvsData). This information is required for ESXi servers to know where the ports belong to, without asking the vCenter.
- Using storage vMotion VM2 is migrated to LUN 2. This could be done either manually or triggered by Storage DRS. Actually, this is where the bug happens. All files like .vmx or .vmdk are moved to LUN 2 and whyever the dvSwitch port information remains on LUN 1. The bug has happend, but nothing noticeable at this point. VM2 stays up and running without any network issues.
- The first ESXi dies. This is where HA shoud get active and initiates a restart on another host.
- HA initiates VM1 to be started on the second ESXi. Everything fine.
- HA initiates VM2 to be started on the second ESXi. During this process, HA tries to access the port information inside the .dvsData directory. HA fails as it can't find the port information within LUN2 .dvsData directory.
Operation failed, diagnostics report: Failed to open file /vmfs/volumes/UUID/.dvsData/ID/Port Status (bad0003)= Not found
Issue explaind by Duncan Epping @ Yellow-Bricks
Script to identify and fix affected VMs by Alan Renouf