Determine TBW from SSDs with S.M.A.R.T Values in ESXi (smartctl)

Solid-state drives are becoming more and more common in ESXi hosts. They are used for caching (vFlash Read Cache, PernixData FVP), Virtual SAN, or plain datastores. One problem with SSDs is the limited lifetime of each cell. Depending on the technology, each cell can be overwritten from about 1,000 times in consumer TLC SSDs up to 100,000 times in enterprise SLC-based SSDs.

The value to keep an eye on is the guaranteed TBW (Total Bytes Written or Terabytes Written), which vendors typically provide in their specifications. It describes how many terabytes can be written to the entire device before the warranty expires. The current value can be read out with S.M.A.R.T. from the Total_LBAs_Written field.

Unfortunately, VMware makes it hard to read out raw S.M.A.R.T. values on ESXi hosts. For that reason I've ported a version of smartctl, which is part of smartmontools, to ESXi. I've made the package available as a VIB. The download link is at the bottom of this post.

First, let's look at what an ESXi host can tell you about endurance without smartctl. In this example I'm using a Samsung SSD 850 EVO M.2 250GB, which is currently in use as a local datastore. The warranty for this device covers 75 TBW. Note that this is a consumer-grade SSD; the lowest endurance class for Virtual SAN, for example, starts at 365 TBW.

ESXCLI can display S.M.A.R.T. stats with:
esxcli storage core device smart get -d [device]

# esxcli storage core device smart get -d t10.ATA_____Samsung_SSD_850_EVO_M.2_250GB___________S24BNXAG805065D_____
Parameter                     Value  Threshold  Worst
----------------------------  -----  ---------  -----
Health Status                 OK     N/A        N/A
Media Wearout Indicator       N/A    N/A        N/A
Write Error Count             N/A    N/A        N/A
Read Error Count              N/A    N/A        N/A
Power-on Hours                99     0          99
Power Cycle Count             99     0          99
Reallocated Sector Count      100    10         100
Raw Read Error Rate           N/A    N/A        N/A
Drive Temperature             N/A    N/A        N/A
Driver Rated Max Temperature  49     0          34
Write Sectors TOT Count       100    0          100
Read Sectors TOT Count        N/A    N/A        N/A
Initial Bad Block Count       N/A    N/A        N/A

What do these values mean? Really only that the drive is "healthy". The output does not provide the information we are looking for. ESXi also keeps track of the health status with smartd and writes it to /var/log/syslog.log, as in the following example:

2016-05-18T14:54:23Z smartd: [warn] t10.ATA_____ST9500325AS_________________________________________S2WB2XXB: above TEMPERATURE threshold (40 > 0)
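
If you want to collect such warnings from syslog.log, a small parser can split them into fields. The following is only a sketch: the line layout is inferred from the single sample above, and the names SMARTD_WARNING and parse_smartd_warning are made up for this illustration, not part of ESXi.

```python
import re

# Assumed layout, inferred from the one sample line in this post:
# <timestamp> smartd: [<level>] <device>: above <PARAMETER> threshold (<value> > <threshold>)
SMARTD_WARNING = re.compile(
    r"^(?P<timestamp>\S+) smartd: \[(?P<level>\w+)\] (?P<device>\S+): "
    r"above (?P<parameter>.+?) threshold \((?P<value>\d+) > (?P<threshold>\d+)\)$"
)


def parse_smartd_warning(line):
    """Return the fields of a smartd threshold warning, or None if the line does not match."""
    match = SMARTD_WARNING.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["value"] = int(fields["value"])
    fields["threshold"] = int(fields["threshold"])
    return fields


sample = (
    "2016-05-18T14:54:23Z smartd: [warn] "
    "t10.ATA_____ST9500325AS_________________________________________S2WB2XXB: "
    "above TEMPERATURE threshold (40 > 0)"
)
warning = parse_smartd_warning(sample)
print(warning["device"], warning["parameter"], warning["value"], warning["threshold"])
```

Lines that do not follow the assumed layout simply return None, so other syslog entries are skipped rather than misread.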

ESXCLI can also display device stats, which are very close to what we are looking for:

# esxcli storage core device stats get -d t10.ATA_____Samsung_SSD_850_EVO_M.2_250GB___________S24BNXAG805065D_____
t10.ATA_____Samsung_SSD_850_EVO_M.2_250GB___________S24BNXAG805065D_____
   Device: t10.ATA_____Samsung_SSD_850_EVO_M.2_250GB___________S24BNXAG805065D_____
   Successful Commands: 93483233
   Blocks Read: 205579211
   Blocks Written: 2123298938
   Read Operations: 3240880
   Write Operations: 90144369
   Reserve Operations: 39107
   Reservation Conflicts: 0
   Failed Commands: 22
   Failed Blocks Read: 0
   Failed Blocks Written: 0
   Failed Read Operations: 0
   Failed Write Operations: 0
   Failed Reserve Operations: 0

ESXi keeps track of all read and write operations to the disk, but these counters are reset when ESXi is rebooted, so they do not help to determine the drive's lifetime wear either.

This is where smartctl comes into play:

# smartctl -d sat --all /dev/disks/t10.ATA_____Samsung_SSD_850_EVO_M.2_250GB___________S24BNXAG805065D_____
smartctl 6.6 2016-05-10 r4321 [x86_64-linux-6.0.0] (daily-20160510)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org


=== START OF INFORMATION SECTION ===
Model Family:     Samsung based SSDs
Device Model:     Samsung SSD 850 EVO M.2 250GB
Serial Number:    S24BNXAG805065D
LU WWN Device Id: 5 002538 d404b9f9f
Firmware Version: EMT21B6Q
User Capacity:    250,059,350,016 bytes [250 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4c
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed May 16 15:25:26 2016 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
[...]
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       5039
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       35
177 Wear_Leveling_Count     0x0013   094   094   000    Pre-fail  Always       -       122
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   049   034   000    Old_age   Always       -       51
195 ECC_Error_Rate          0x001a   200   200   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       26
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       6343034492

In the SMART attributes section, ID #241 holds our Total_LBAs_Written value. This value needs to be multiplied by the sector size, which is 512 bytes, and divided by 1099511627776 (1024^4) to get terabytes (strictly speaking, tebibytes).

Total_LBAs_Written * Sector Size / 1024^4 = TBW

6343034492 * 512 / 1099511627776 = 2.95 TBW
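
The same arithmetic as a small Python sketch. The function name lbas_to_tbw is made up for this example; the 512-byte sector size and the raw value are the ones shown in the smartctl output above.

```python
SECTOR_SIZE_BYTES = 512    # logical sector size reported by smartctl for this drive
BYTES_PER_TB = 1024 ** 4   # 1099511627776, the divisor used in this post


def lbas_to_tbw(total_lbas_written, sector_size=SECTOR_SIZE_BYTES):
    """Convert the raw Total_LBAs_Written value (SMART ID 241) to terabytes written."""
    return total_lbas_written * sector_size / BYTES_PER_TB


tbw = lbas_to_tbw(6343034492)
print(f"{tbw:.2f} TBW")  # prints "2.95 TBW"
```

Drives with a different logical sector size can pass it as the second argument instead of relying on the 512-byte default.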

I've used about 3 TBW of my guaranteed 75 TBW. According to Power_On_Hours (SMART ID #9), the device has been in use for about 210 days (online 24/7, of course). I guess I have another 13 years to go...
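
The lifetime guess can be written down as a simple linear extrapolation. This is only a rough sketch: it assumes the write rate stays constant and that the drive was powered on the whole time, and the function name estimate_remaining_years is invented for this example.

```python
def estimate_remaining_years(tbw_used, tbw_rated, power_on_hours):
    """Linearly extrapolate how long it takes to reach the rated TBW.

    Assumes a constant write rate over the drive's whole life, which is
    a simplification; real workloads change over time.
    """
    days_in_use = power_on_hours / 24
    tb_per_day = tbw_used / days_in_use
    remaining_days = (tbw_rated - tbw_used) / tb_per_day
    return remaining_days / 365


# Values from the smartctl output above: 2.95 TBW used, 75 TBW rated, 5039 power-on hours
years = estimate_remaining_years(2.95, 75, 5039)
print(f"about {years:.0f} more years at the current write rate")
```

With the unrounded figures the result is roughly 14 years, in the same ballpark as the rough estimate above, which uses rounded numbers.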

This also shows that the value in "esxcli storage core device stats get" is not a lifetime total; it only counts since the last reboot. Blocks written according to this command is 2123298938, which works out to about 1 TB.
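
To put the two counters side by side, here is a short sketch that converts both to terabytes. It assumes that the "Blocks Written" counter from esxcli also uses 512-byte blocks, which is the assumption behind the "about 1 TB" figure; the variable and function names are made up for this example.

```python
BLOCK_SIZE_BYTES = 512     # assumed block size for both counters
BYTES_PER_TB = 1024 ** 4


def blocks_to_tb(blocks, block_size=BLOCK_SIZE_BYTES):
    """Convert a block or LBA counter to terabytes."""
    return blocks * block_size / BYTES_PER_TB


esxcli_blocks_written = 2123298938   # "Blocks Written" from esxcli, since last reboot
smart_total_lbas = 6343034492        # Total_LBAs_Written from smartctl, drive lifetime

since_boot_tb = blocks_to_tb(esxcli_blocks_written)
lifetime_tb = blocks_to_tb(smart_total_lbas)
share = since_boot_tb / lifetime_tb

print(f"since reboot: {since_boot_tb:.2f} TB")
print(f"lifetime:     {lifetime_tb:.2f} TB")
print(f"esxcli only covers about {share:.0%} of the lifetime writes")
```

On this host the esxcli counter accounts for only about a third of what the drive itself has recorded.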

How to get smartctl
!!! Please note that the use of this VIB is absolutely unsupported. Use at your own risk !!!
I've tested the package with ESXi 6.0 only

  1. Download smartctl-6.6-4321.x86_64.vib
  2. Copy the VIB to the /tmp/ directory of an ESXi host
  3. SSH to the ESXi host
  4. Set the VIB acceptance level to CommunitySupported
    # esxcli software acceptance set --level=CommunitySupported
  5. Install the package (Maintenance Mode or Reboot is not required)
    # esxcli software vib install -v /tmp/smartctl-6.6-4321.x86_64.vib

The tool is located at /opt/smartmontools/smartctl and works just like the Linux version.
Locate physical disks with ls -l /dev/disks/

/opt/smartmontools/smartctl -d [Device Type] --all /dev/disks/[DISK]

# smartctl -d sat --all /dev/disks/t10.ATA_____Samsung_SSD_850_EVO_M.2_250GB___________S24BNXAG805065D_____
smartctl 6.6 2016-05-10 r4321 [x86_64-linux-6.0.0] (daily-20160510)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Samsung based SSDs
Device Model:     Samsung SSD 850 EVO M.2 250GB
Serial Number:    S24BNXAG805065D
LU WWN Device Id: 5 002538 d404b9f9f
Firmware Version: EMT21B6Q
User Capacity:    250,059,350,016 bytes [250 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4c
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Status not supported: Incomplete response, ATA output registers missing
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (    0) seconds.
Offline data collection
capabilities:                    (0x53) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 133) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       5040
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       35
177 Wear_Leveling_Count     0x0013   094   094   000    Pre-fail  Always       -       122
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   049   034   000    Old_age   Always       -       51
195 ECC_Error_Rate          0x001a   200   200   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       26
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       6345601655

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

Warning! SMART Selective Self-Test Log Structure error: invalid SMART checksum.
SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
  255        0    65535  Read_scanning was never started
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
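
If you want to pull the raw values out of this output automatically, the attribute table is easy to parse because every row has ten whitespace-separated columns. The following is a minimal sketch, not an official smartmontools interface: parse_attributes is a name invented for this example, and the sample text is copied from the output above.

```python
SAMPLE_OUTPUT = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       5040
177 Wear_Leveling_Count     0x0013   094   094   000    Pre-fail  Always       -       122
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       6345601655
"""


def parse_attributes(text):
    """Map SMART attribute ID to name and raw value for each table row.

    A row is recognised by a numeric ID in the first column, at least ten
    columns, and a numeric raw value; everything else is ignored.
    """
    attributes = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            attributes[int(fields[0])] = {"name": fields[1], "raw": int(fields[9])}
    return attributes


attributes = parse_attributes(SAMPLE_OUTPUT)
total_lbas = attributes[241]["raw"]
tbw = total_lbas * 512 / 1024 ** 4   # 512-byte sectors, as reported for this drive
print(f"{attributes[241]['name']}: {total_lbas} LBAs = {tbw:.2f} TBW")
```

In practice the text would come from the smartctl command shown above instead of a hard-coded sample. Rows from other tables in the full output, such as the selective self-test spans, have fewer columns and are skipped.
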

10 Comments.

  1. Looks like it doesn't work properly for the SM951 NVMe device:

    [root@esxi:/tmp] /opt/smartmontools/smartctl -d nvme --all /dev/disks/t10.NVMe____SAMSUNG_MZVPV512HDGL2D00000______________xxxxxxxxxxxxxx______00000001
    smartctl 6.6 2016-05-10 r4321 [x86_64-linux-6.0.0] (daily-20160510)
    Copyright (C) 2002-16, Bruce Allen, Christian Franke, http://www.smartmontools.org

    Read NVMe Identify Controller failed: NVME_IOCTL_ADMIN_CMD: Function not implemented

    I tried a scan but that failed:

    [root@esxi:/tmp] /opt/smartmontools/smartctl --scan
    Segmentation fault

  2. How does it work behind a hardware RAID controller?

  3. Just out of curiosity what did you use for a compile environment for the static smartctl?
    I'm currently using an aged CentOS 3.9 and was wondering if something newer was valid.
    I've experimented with a couple other options but always seem to go back to that one for one reason or another.
    (The latest addition to my custom local vib is a static version of whiptail for an experimental frontend to ghettoVCB)

  4. Thank you for smartctl, I installed it on all my ESXi hosts
    (I found 3 dead disks!)

    I would like to check the disks behind a MegaRAID controller

    /opt/lsi/storcli/storcli -CfgDsply -a0 | grep "Device Id\|DISK"
    Number of DISK GROUPS: 1
    DISK GROUP: 0
    Device Id: 5
    Device Id: 4

    /opt/smartmontools/smartctl -d sat+megaraid,5 -a /dev/disks/naa.600605b006eb32f01a806e721f93a9a4
    smartctl 6.6 2016-05-10 r4321 [x86_64-linux-6.0.0] (daily-20160510)
    Copyright (C) 2002-16, Bruce Allen, Christian Franke, http://www.smartmontools.org

    Smartctl open device: /dev/disks/naa.600605b006eb32f01a806e721f93a9a4 [megaraid_disk_05] [SAT] failed: can't get bus number

    http://guides.ovh.com/LsiMegaraid (replace MegaCli with storcli)

    All the best

  5. SSD Total Bytes Written Calculator | Virten.net - pingback on December 28, 2016 at 6:27 pm
  6. esxi 6.5

    Installation Result
    Message: The update completed successfully, but the system needs to be rebooted for the changes to be effective.
    Reboot Required: true
    VIBs Installed: smartmontools_bootbank_smartctl_6.6-4321
    VIBs Removed:
    VIBs Skipped:

    can't find smartctl, there is no /opt/smartmontools
