I recently ran into strange issues with Hewlett-Packard servers. ESXi hosts randomly showed a couple of different symptoms:
- ESXi host unmanageable
- ESXi host grayed out in vCenter
- Starting host services fails with an error message:
Call "HostServiceSystem.Restart" for object "serviceSystem-[*]" on vCenter Server * failed.
- Cannot perform vMotion to or from the host
- Starting virtual machine fails with an error message:
Power On virtual machine *
A general system error occurred: The virtual machine could not start
- Restarting services in DCUI fails
A general system error occurred: Command /bin/sh failed
- SSH connection to the host is possible, but no response after login
- Local console displays an error message:
/bin/sh cannot fork
- Error messages received at the syslog server:
sfcb-HTTPS-Daemon[*]: handleHttpRequest fork failed: Cannot allocate memory
crond[*]: can't vfork
cpu*:*)WARNING: Heap: *: Heap_Align(globalCartel-1, 136/136 bytes, 8 align) failed.
cpu*:*)WARNING: Heap: *: Heap globalCartel-1 already at its maximum size. Cannot expand)
- DCUI message log (ALT+F12) displays an error message:
WARNING: Heap: *: Heap globalCartel-1 already at its maximum size. Cannot expand.
The problem was caused by the hp-ams module (HP Agentless Management Service), which has a known problem in these versions:
- hp-ams 9.5
- hp-ams 9.6
- hp-ams 10.0
You can verify the version with the following command:
# esxcli software vib list | grep hp-ams
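If you need to check many hosts, the version test can be scripted with a small POSIX shell helper. This is a sketch of my own, not part of the original procedure, and the sample version string is only illustrative of the VIB version format; check the actual "Version" column of `esxcli software vib list` on your host.

```shell
#!/bin/sh
# Hypothetical helper (not from the post): returns success when a given
# hp-ams version string belongs to an affected release (9.5, 9.6, 10.0).
is_affected_hp_ams() {
  case "$1" in
    9.5*|9.6*|10.0.0*) return 0 ;;
    *) return 1 ;;
  esac
}

# Illustrative only - assumes the VIB version looks roughly like
# 500.9.6.0-12.434156, with the hp-ams release in fields 2-4.
sample_version="500.9.6.0-12.434156"
short=$(printf '%s\n' "$sample_version" | cut -d. -f2-4)
if is_affected_hp_ams "$short"; then
  echo "hp-ams $short is affected - plan an upgrade to 10.0.1"
fi
```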
The issue has been resolved in hp-ams 10.0.1, which can be downloaded from the HP website:
- HP Agentless Management Service Offline Bundle for VMware ESXi 5.0 and vSphere 5.1
- HP Agentless Management Service Offline Bundle for VMware vSphere 5.5
If you cannot upgrade the server immediately due to change management processes, you can also mitigate the issue by stopping the hp-ams service and removing the package:
# /etc/init.d/hp-ams.sh stop
# esxcli software vib remove -n hp-ams
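In a change-controlled environment it can help to wrap the two commands above in a dry-run guard, so they are only printed until explicitly armed. This is a sketch I added, not HP's or the post author's procedure:

```shell
#!/bin/sh
# Sketch with a dry-run guard (my addition): commands are only echoed
# unless DRY_RUN=0 is set in the environment.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

run /etc/init.d/hp-ams.sh stop
run esxcli software vib remove -n hp-ams
```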
Yeah! I got this problem on my HP DL360 servers with ESXi 5.5. Thanks for the solution.
Interesting... last Friday I experienced exactly the same problem.
I am still puzzled about what causes this bug. I had ESXi hosts with hp-ams 9.6 running for half a year without issues.
Me too... Some die after 4-5 weeks, some are still alive after months. Maybe something or someone triggers the hp-ams service, which causes the issue.
It's caused by a memory leak that fills the RAM allocation on the ESXi host, meaning it cannot respond to requests; everything else is just a knock-on symptom, tbh.
If you SSH onto the host, you will find it gives the cannot-fork message.
I wrote a post on this a while ago.
I would recommend you add the HP repository to your Update Manager download sources to capture the latest HP customized drivers.
Thanks for this post. I'm not sure the RAM usage has anything to do with it. For example, I've got a host with 213GB out of 255GB used (MemoryUsageGB from Get-VMHost in PowerCLI) where starting SSH works, while other hosts at a lower memory usage level (180GB, for example) are unable to have SSH started. Are you looking at another metric when you mention a RAM leak? Thanks
The physical memory usage has nothing to do with this issue. The problem can happen with 180GB of free physical memory.
There is a limit in the vmkernel, but I haven't figured out what the limit is or how to measure it.
Many thanks. Until we refine it with something more specific, this at least tells us when a host has hit the issue (Splunk search: "esx5_syslog" "globalCartel-1 "). I'll update here if I find a way to measure it.
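For hosts without Splunk, the same signature can be checked on a collected log file with plain grep. This is a hedged sketch of my own; the log path is only a placeholder, and the match string follows the warning quoted earlier in the post.

```shell
#!/bin/sh
# Sketch: detect the globalCartel-1 heap warning in a collected log file.
has_heap_exhaustion() {
  grep -q "globalCartel-1 already at its maximum size" "$1" 2>/dev/null
}

# Usage (path is a placeholder for your collected vmkernel/syslog file):
if has_heap_exhaustion "/var/log/vmkernel.log"; then
  echo "heap exhaustion signature found"
else
  echo "no signature found"
fi
```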
Don't forget that we have a fix for that issue ;-)
There is always a fix. The hard part is figuring out exactly when to implement it (wait till a new release that includes the fix, etc.). I get you, though. FYI, the first response from sneddo here - https://communities.vmware.com/message/2468620#2468620 - works well as a report. The script from HP is good for the console, but I needed report-type info.
Does this problem occur on all G6/G7 or Gen8 servers?