Sunday, February 09, 2014

CPU wait due to a process blocked on I/O: a story about VMware Tools in an Ubuntu guest

When the CPU waits, there is a problem: processes are stalled on I/O, the server slows down, and other issues follow. We were receiving alerts from monit about high CPU wait percentages, and we found the culprit to be VMware Tools:
$ while true; do date; ps auxf | awk '{if($8=="D") print $0;}'; sleep 1; done
Sun Feb  9 08:59:15 EST 2014
root      1233  0.0  0.0  87576   980 ?        D     2013  79:20 /usr/sbin/vmtoolsd
Sun Feb  9 08:59:16 EST 2014
root      1233  0.0  0.0  87576   980 ?        D     2013  79:20 /usr/sbin/vmtoolsd
Sun Feb  9 08:59:17 EST 2014
...
Clearly this process has been running for a while (since 2013), and it stays in state D, uninterruptible sleep. I went ahead and traced the PID:
$ sudo strace -p1233
Process 1233 attached - interrupt to quit
But it stayed there for minutes with no trace output. On a similar box I was able to see activity:
$ ps -ef|grep vm
root      1240     1  0  2013 ?        01:08:34 /usr/sbin/vmtoolsd
$ sudo strace -p1240
Process 1240 attached - interrupt to quit
restart_syscall(<... resuming interrupted call ...>) = 0
times({tms_utime=306343, tms_stime=105067, tms_cutime=4, tms_cstime=0}) = 2260068445
times({tms_utime=306343, tms_stime=105067, tms_cutime=4, tms_cstime=0}) = 2260068445
poll([{fd=4, events=POLLIN}, {fd=13, events=POLLIN}, {fd=13, events=POLLIN}, {fd=13, events=POLLIN}, {fd=13, events=POLLIN}, {fd=13, events=POLLIN}], 6, 100) = 0 (Timeout)
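When strace attaches but prints nothing, as with pid 1233 above, the process is most likely stuck inside the kernel rather than making system calls, and /proc can sometimes show where. A minimal sketch, assuming a Linux guest with /proc mounted; proc_wait_info is a hypothetical helper name, and the stack file is usually readable only by root and only on kernels that expose it:

```shell
#!/bin/sh
# Hypothetical helper: print what the kernel reports about a process.
# Usage: proc_wait_info PID
proc_wait_info() {
    pid="$1"
    if [ ! -d "/proc/$pid" ]; then
        echo "no such process: $pid" >&2
        return 1
    fi
    # State line, for example "State:  D (disk sleep)"
    grep '^State:' "/proc/$pid/status"
    # Kernel function the task is sleeping in (may be 0 or empty on some kernels)
    printf 'wchan: %s\n' "$(cat "/proc/$pid/wchan" 2>/dev/null)"
    # Kernel stack of the task, if this kernel and our privileges allow reading it
    if [ -r "/proc/$pid/stack" ]; then
        cat "/proc/$pid/stack" 2>/dev/null || true
    fi
    return 0
}
```

Running something like `sudo sh -c '. ./proc_wait_info.sh; proc_wait_info 1233'` (the file name is an assumption) on the affected box would give a wait channel and possibly a stack to attach to a bug report.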
My first guess was that there is a deadlock in the following version of VMware Tools:
$ vmware-toolbox-cmd -v
8.6.5.16159 (build-821615)
$ sudo vmware-toolbox-cmd upgrade status
VMware Tools are up-to-date.
After a restart:
$ sudo service vmware-tools restart
vmware-tools stop/waiting
vmware-tools start/running
I actually got a new instance, but the old instance was still there:
$ ps auxf|grep vm
root      1233  0.0  0.0  87576   980 ?        D     2013  79:20 /usr/sbin/vmtoolsd
root       953  0.1  0.0  87576  3976 ?        S    09:59   0:00 /usr/sbin/vmtoolsd
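The old process could not be killed with either SIGTERM or SIGKILL, because a process in uninterruptible sleep does not act on signals until it returns from the kernel call it is blocked in, so I had to restart the box. If this ever happens again, it would probably be a good idea to notify VMware of a possible bug.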
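If this does repeat, it may help to capture more than the bare ps line before rebooting. The one-liner at the top compares the STAT column with exactly "D", so it would miss variants such as "D+" or "Ds". Below is a slightly more tolerant sketch that also prints the wait channel. It assumes the procps version of ps (the `wchan:32` width syntax is procps-specific), and filter_dstate and list_dstate are hypothetical names:

```shell
#!/bin/sh
# Keep only ps lines whose STAT field (column 2 in the listing below)
# starts with "D", so "D+", "Ds" or "D<" are matched as well.
# The header line is dropped because "STAT" does not start with "D".
filter_dstate() {
    awk '$2 ~ /^D/'
}

# List pid, state, wait channel, elapsed time and command line
# of every process currently in uninterruptible sleep.
list_dstate() {
    ps -eo pid,stat,wchan:32,etime,args | filter_dstate
}

# Example loop, same idea as the original one-liner:
#   while true; do date; list_dstate; sleep 1; done
```

The wait channel column shows the kernel function the process is sleeping in, which is the kind of detail a vendor would likely ask for in a bug report.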
