When you've configured automated backups in NSX-T, you might be unaware that failed backup jobs do not trigger alarms in the integrated NSX-T alarm dashboard. When a backup fails, the error message is only visible in the Backup & Restore configuration.
At the moment, you have to manually check that the backup is running as expected. This can also be done using the API:
    > GET /api/v1/cluster/backups/history HTTP/1.1

    {
      "cluster_backup_statuses": [
        {
          "backup_id": "5cf21742-091a-b9b9-1f24-ad75ede2d23b-1615489436",
          "start_time": 1615489436085,
          "end_time": 1615489440865,
          "success": false,
          "error_code": "BACKUP_AUTHENTICATION_FAILURE",
          "error_message": "either backup server login failed or unauthorized access to backup directory"
        }
      ],
      "node_backup_statuses": [
        {
          "backup_id": "5cf21742-091a-b9b9-1f24-ad75ede2d23b-1615403036",
          "start_time": 1615403036017,
          "end_time": 1615403354709,
          "success": true
        }
      ],
      "inventory_backup_statuses": [
        {
          "backup_id": "inventory-1615490636",
          "start_time": 1615490636254,
          "end_time": 1615490641758,
          "success": true
        }
      ]
    }
In this example, the cluster backup failed. Besides the backup status itself, you should also check when the last backup finished, because a backup job that has stopped running does not report a failure either. The end_time is given as a Unix timestamp in milliseconds.
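Both checks (status and age) can be sketched in a few lines of Python. This is only an illustration built on the response shown above, not the code of any published script: the function name `find_backup_problems`, the `max_age_minutes` default of 1440, and the choice to pick the newest entry by `end_time` are assumptions made for this example.

```python
import time

# Keys as they appear in the /api/v1/cluster/backups/history response above.
STATUS_KEYS = {
    "cluster": "cluster_backup_statuses",
    "node": "node_backup_statuses",
    "inventory": "inventory_backup_statuses",
}

# Trimmed copy of the example response (only the fields the check needs).
SAMPLE = {
    "cluster_backup_statuses": [
        {"end_time": 1615489440865, "success": False,
         "error_code": "BACKUP_AUTHENTICATION_FAILURE"}
    ],
    "node_backup_statuses": [{"end_time": 1615403354709, "success": True}],
    "inventory_backup_statuses": [{"end_time": 1615490641758, "success": True}],
}


def find_backup_problems(history, max_age_minutes=1440, now_ms=None):
    """Return a list of problem descriptions for a parsed backup history."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)  # end_time is in milliseconds
    problems = []
    for name, key in STATUS_KEYS.items():
        statuses = history.get(key, [])
        if not statuses:
            problems.append(f"no {name} backup found")
            continue
        # Assumption: use the entry with the newest end_time instead of
        # relying on the order of the list.
        latest = max(statuses, key=lambda s: s.get("end_time", 0))
        if not latest.get("success", False):
            code = latest.get("error_code", "unknown error")
            problems.append(f"{name} backup failed: {code}")
            continue
        age_minutes = (now_ms - latest.get("end_time", 0)) // 60000
        if age_minutes > max_age_minutes:
            problems.append(f"{name} backup is too old ({age_minutes} minutes)")
    return problems


# Evaluate the sample as if "now" were shortly after the inventory backup.
print(find_backup_problems(SAMPLE, now_ms=1615490700000))
# ['cluster backup failed: BACKUP_AUTHENTICATION_FAILURE', 'node backup is too old (1455 minutes)']
```

Passing `now_ms` explicitly keeps the function easy to test. In a real check you would leave it out so the current time is used.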
I've published a Nagios check to monitor the status and age of NSX-T backups.
    usage: check_nsxt_backup.py [-h] -n NSX_HOST [-t TCP_PORT] -u USER -p PASSWORD [-i] [-a MAX_AGE]

    # python check_nsxt_backup.py -n nsx.virten.lab -u audit -p password
    NSX-T cluster backup failed
    NSX-T node backup is to old (1461 minutes)
The script is available on GitHub: github.com/fgrehl/virten-scripts/blob/master/python/check_nsxt_backup.py
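If you would rather wire the API call into your own monitoring, the general shape of such a check is roughly as follows. This is a minimal sketch and not the script linked above: it assumes the NSX-T API accepts HTTP Basic authentication for the given user, and all function names (`fetch_backup_history`, `failed_backup_types`, `plugin_result`, `run_check`) and the `verify_tls` parameter are made up for this example. The exit codes follow the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import base64
import json
import ssl
import urllib.request

# Conventional Nagios plugin exit codes (WARNING is unused in this sketch).
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def fetch_backup_history(host, user, password, port=443, verify_tls=True):
    """GET /api/v1/cluster/backups/history and return the parsed JSON."""
    url = f"https://{host}:{port}/api/v1/cluster/backups/history"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(
        url,
        headers={"Authorization": f"Basic {token}", "Accept": "application/json"},
    )
    context = ssl.create_default_context()
    if not verify_tls:
        # Assumption: lab setups often use self-signed certificates.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(request, context=context, timeout=10) as response:
        return json.load(response)


def failed_backup_types(history):
    """Return the status keys whose most recent entry was not successful."""
    failed = []
    for key, statuses in history.items():
        if not key.endswith("_backup_statuses") or not statuses:
            continue
        latest = max(statuses, key=lambda s: s.get("end_time", 0))
        if not latest.get("success", False):
            failed.append(key)
    return failed


def plugin_result(failed):
    """Map the list of failed backup types to (exit_code, status_line)."""
    if failed:
        return CRITICAL, "CRITICAL - failed: " + ", ".join(failed)
    return OK, "OK - no failed NSX-T backups"


def run_check(host, user, password):
    """Query NSX-T and return (exit_code, status_line); UNKNOWN on any error."""
    try:
        history = fetch_backup_history(host, user, password)
    except Exception as exc:
        return UNKNOWN, f"UNKNOWN - could not query NSX-T: {exc}"
    return plugin_result(failed_backup_types(history))
```

A caller would then do something like `code, line = run_check("nsx.virten.lab", "audit", "password")`, print `line`, and exit with `code` so that the monitoring system picks up the state.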