How long are Virtual Machines stunned for Snapshots and vMotion?

I've read this excellent article by Cormac Hogan explaining why and when a Virtual Machine receives a "stun". This post is a follow-up explaining how you can determine the duration of a Virtual Machine stun on the most common vSphere functions:

  • Create a Snapshot
  • Delete a Snapshot
  • vMotion

Create a snapshot

When you create a snapshot the original virtual disk (*-flat.vmdk) is closed and a new delta disk is created (*-delta.vmdk). All blocks that are changed while the snapshot is active are written to the delta disk. The original virtual disk is in read-only mode. The VM is stunned to switch the virtual disk write stream from the original VMDK to the to the delta VMDK.

The stun time is written to the vmware.log file in the VM home directory:

2015-10-27T22:01:42.149Z| vcpu-0| I120: Checkpoint_Unstun: vm stopped for 188405 us

The time is in microseconds (us) so the stun time for this snapshot was 188ms or 0,18 seconds. You can also determine the time by comparing the destory event and the unstun:

2015-10-27T22:01:41.961Z| vcpu-0| I120: Destroying virtual dev for scsi0:0 vscsi=8198
2015-10-27T22:01:42.149Z| vcpu-0| I120: Checkpoint_Unstun: vm stopped for 188405 us

The impact you might have in your VM is not only the stun itself, but also the quiescence if you create the snapshot with the "Quiesce guest file system" option. The quiescence option allows the virtual machine to flush dirty buffers from the operating systems memory cache to disk. This makes the disk suitable for backups. The quiescence time is not written to the log like the stun time, but you can determine it very easily from the vmware.log file in the VM home directory:

When the quiescence option was used, you should see quiesced=1 in the TakeSnapshot event:

2015-10-27T22:01:39.876Z| vmx| I120: SnapshotVMX_TakeSnapshot start: 'Backup', deviceState=0, lazy=0, logging=0, quiesced=1, forceNative=0, tryNative=0, saveAllocMaps=0 cb=2E9B7630, cbData=32AD80F0

Prior to create the snapshot, the guest file system is quiesced. This is also called "freeze":

2015-10-27T22:01:39.938Z| vcpu-0| I120: ToolsBackup: changing quiesce state: IDLE -> STARTED
2015-10-27T22:01:41.961Z| vcpu-0| I120: ToolsBackup: changing quiesce state: STARTED -> COMMITTED

After creating the snapshot, the quiescence is released. This is also called "thaw":

2015-10-27T22:01:42.201Z| vcpu-0| I120: ToolsBackup: changing quiesce state: COMMITTED -> FINISHING
2015-10-27T22:01:44.166Z| vcpu-0| I120: ToolsBackup: changing quiesce state: FINISHING -> IDLE
2015-10-27T22:01:44.166Z| vcpu-0| I120: ToolsBackup: changing quiesce state: IDLE -> DONE

As you can see, the Virtual Machine was quiesced from 22:01:39.938 until 22:01:44.166. In this time, the Virtual Machine is unable to write to the disk.

Delete a snapshot

When you delete a VM snapshot the delta virtual disk has to be consolidated to the original virtual disk. This requires an additional snapshot, the "consolidate helper". This step is required because VM disk writes during snapshot consolidation must also be committed. This delta information is kept in a temporary invisible snapshot, committed at the end of snapshot removal. If you have a very busy VM it might require multiple iterations of consolidate helper snapshots. This leads to having at least two stuns required to delete a snapshot.

The stun time is written to the vmware.log file in the VM home directory:

2015-10-27T23:01:29.851Z| vcpu-0| I120: Checkpoint_Unstun: vm stopped for 52024 us
2015-10-27T23:01:30.697Z| vcpu-0| I120: Checkpoint_Unstun: vm stopped for 112790 us

The time is in microseconds (us).

vMotion

A stun is required when you migrate a VM from one ESXi host to another. The stun happens in the switchover phase at the end of vMotion. When you start a migration the first iteration copies all of the VMs memory. Subsequent iterations copy only the memory pages that have been modified (dirtied) during the previous iteration. When all dirtied pages can be transfered in less than 500ms the virtual machine is stunned and switched to the destination ESXi host.

To investigate the stun time you first need to know the migration id. You can find it in the vmware.log:

~ # cat vmware.log |grep "migration id"
2015-10-24T15:38:20.200Z| vmx| I120: Migrate_Open: Restoring from <10.0.0.2> with migration id 1446046703473901
2015-10-24T15:38:20.244Z| vmx| I120: Migrate_Open: Restoring from <10.0.0.2> with migration id 1446046703473901

With the migration id you can search through log files to see what happens. Very valuable information can be found in the vmkernel.log. It clearly states when the switchover phase happens:

Source ESXi vmkernel.log:

2015-10-24T15:38:20.160Z cpu0:151370)VMotion: 4650: 1446046703473901 S: Stopping pre-copy: only 1296 pages left to send, which can be sent within the switchover time goal of 0.500 seconds (network bandwidth ~145.404 MB/s, 101088% t2d)
2015-10-24T15:38:20.213Z cpu1:151421)VMotionSend: 4156: 1446046703473901 S: Sent all modified pages to destination (network bandwidth ~355.662 MB/s)

Destination ESXi vmkernel.log:

2015-10-24T15:38:20.272Z cpu0:151065)VMotion: 5994: 1446046703473901 D: Received all changed pages.
2015-10-24T15:38:20.275Z cpu0:151065)VMotion: 5773: 1446046703473901 D: Resume handshake successful

Another good place is the vpxa.log in ESXi 6.0 or the hostd.log in ESXi 5.x, where the vMotion downtime is recorded:

Destination ESXi vpxa.log/hostd.log:

2015-10-24T15:38:20.471Z verbose vpxa[64732B70] [Originator@6876 sub=vpxaMoVMotion opID=5473CC5D-00006321-7b-ce-29-ab] [MIGRATE] (1446046703473901) vmotion result has downtime value 89339

The time is in microseconds (us) so the stun time for this vMotion was 89ms or 0,09 seconds.

Another place for logs is the vmware.log file in the VM home directory. During vMotion a new vmware.log is created and the old is rotated, so you have to compare tha current and the latest vmware.log.

Latest vmware.log (Source ESXi):

2015-10-24T15:38:18.854Z| vmx| I120: MigrateSetState: Transitioning from state 2 to 3.
2015-10-24T15:38:20.177Z| vcpu-0| I120: MigrateSetState: Transitioning from state 3 to 4.
2015-10-24T15:38:20.177Z| vcpu-0| I120: Migrate: Preparing to suspend.
2015-10-24T15:38:20.177Z| vcpu-0| I120: Migrate: VM starting stun, waiting 100 seconds for go/no-go message.
2015-10-24T15:38:20.179Z| vcpu-0| I120: Destroying virtual dev for scsi0:0 vscsi=8194

Current vmware.log (Destination ESXi):

2015-10-24T15:38:20.244Z| vmx| I120: Migrate_Open: Restoring from <10.0.0.2> with migration id 1446046703473901
2015-10-24T15:38:20.265Z| vcpu-0| W110: Restoring VMM

The time between "VM starting stun" and "Restoring VMM" was 90ms.

  1. Thank you for sharing ....

  2. Thank you for sharing! Great write-up

  3. Newsletter: October 31, 2015 | Notes from MWhite - pingback on November 1, 2015 at 6:10 pm

Leave a Comment

NOTE - You can use these HTML tags and attributes:
<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>

Trackbacks and Pingbacks: