Bug 209421 - Processes hang in D state with suspfs or vofflock wchan under FreeBSD 10.X-11.X
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 11.2-RELEASE
Hardware: amd64 Any
Importance: --- Affects Many People
Assignee: freebsd-bugs (Nobody)
Reported: 2016-05-10 12:02 UTC by vvv
Modified: 2018-10-16 07:06 UTC

Description vvv 2016-05-10 12:02:46 UTC
Sometimes processes enter the D state and never leave it, under FreeBSD 10.0 through 10.3. Then more and more processes enter the D state until the server hangs completely. There is no disk activity at that time.

uname -a:
FreeBSD hostname 10.3-RELEASE FreeBSD 10.3-RELEASE #0 r297264: Fri Mar 25 02:10:02 UTC 2016     root@releng1.nyi.freebsd.org:/usr/obj/usr/src/sys/GENERIC  amd64

ps auxwwO wchan:
USER            PID  %CPU %MEM     VSZ     RSS TT  STAT STARTED         TIME COMMAND            PID WCHAN    TT  STAT         TIME COMMAND
www            1395   0,0  0,0   97932    7704  -  D    11:12AM      0:00,02 /usr/local/sbin/  1395 suspfs    -  D         0:00,02 /usr/local/sbin/httpd -f /usr/local/etc/apache24/httpd_ssl.conf -c PidFile /var/run/httpd.ssl.pid
www            2273   0,0  0,0   97932    7652  -  D    11:14AM      0:00,02 /usr/local/sbin/  2273 vofflock  -  D         0:00,02 /usr/local/sbin/httpd -f /usr/local/etc/apache24/httpd_ssl.conf -c PidFile /var/run/httpd.ssl.pid
www            2831   0,0  0,0   97932    7660  -  D    11:14AM      0:00,02 /usr/local/sbin/  2831 vofflock  -  D         0:00,02 /usr/local/sbin/httpd -f /usr/local/etc/apache24/httpd_ssl.conf -c PidFile /var/run/httpd.ssl.pid
www            3627   0,0  0,0   97932    7652  -  D    11:16AM      0:00,02 /usr/local/sbin/  3627 vofflock  -  D         0:00,02 /usr/local/sbin/httpd -f /usr/local/etc/apache24/httpd_ssl.conf -c PidFile /var/run/httpd.ssl.pid
www            3634   0,0  0,0   97932    7612  -  D    11:16AM      0:00,01 /usr/local/sbin/  3634 vofflock  -  D         0:00,01 /usr/local/sbin/httpd -f /usr/local/etc/apache24/httpd_ssl.conf -c PidFile /var/run/httpd.ssl.pid
www            3635   0,0  0,0   97932    7588  -  D    11:16AM      0:00,00 /usr/local/sbin/  3635 vofflock  -  D         0:00,00 /usr/local/sbin/httpd -f /usr/local/etc/apache24/httpd_ssl.conf -c PidFile /var/run/httpd.ssl.pid
www            4912   0,0  0,3  158388   71048  -  D    11:18AM      0:00,05 /usr/local/sbin/  4912 suspfs    -  D         0:00,05 /usr/local/sbin/httpd -f /usr/local/etc/apache24/httpd.conf -c PidFile /var/run/httpd.users.pid
www            4913   0,0  0,3  158388   70960  -  D    11:18AM      0:00,05 /usr/local/sbin/  4913 suspfs    -  D         0:00,05 /usr/local/sbin/httpd -f /usr/local/etc/apache24/httpd.conf -c PidFile /var/run/httpd.users.pid
www            5258   0,0  0,3  158388   71228  -  D    11:19AM      0:00,09 /usr/local/sbin/  5258 suspfs    -  D         0:00,09 /usr/local/sbin/httpd -f /usr/local/etc/apache24/httpd.conf -c PidFile /var/run/httpd.users.pid
www            5361   0,0  0,0   97932    7624  -  D    11:19AM      0:00,01 /usr/local/sbin/  5361 vofflock  -  D         0:00,01 /usr/local/sbin/httpd -f /usr/local/etc/apache24/httpd_ssl.conf -c PidFile /var/run/httpd.ssl.pid
www            5362   0,0  0,0   97932    7624  -  D    11:19AM      0:00,01 /usr/local/sbin/  5362 vofflock  -  D         0:00,01 /usr/local/sbin/httpd -f /usr/local/etc/apache24/httpd_ssl.conf -c PidFile /var/run/httpd.ssl.pid
www            5381   0,0  0,3  158388   70952  -  D    11:19AM      0:00,03 /usr/local/sbin/  5381 suspfs    -  D         0:00,03 /usr/local/sbin/httpd -f /usr/local/etc/apache24/httpd.conf -c PidFile /var/run/httpd.users.pid
www            5382   0,0  0,3  158388   70944  -  D    11:19AM      0:00,04 /usr/local/sbin/  5382 suspfs    -  D         0:00,04 /usr/local/sbin/httpd -f /usr/local/etc/apache24/httpd.conf -c PidFile /var/run/httpd.users.pid
latokar        5424   0,0  0,0   41736    3180  -  Ds   11:19AM      0:01,55 ftpd: 31.129.249  5424 suspfs    -  Ds        0:01,55 ftpd: XXX.XXX.XXX.XXX: user/latokar: STOR icons.php\r\n (ftpd)
www            5431   0,0  0,3  158388   70972  -  D    11:19AM      0:00,02 /usr/local/sbin/  5431 suspfs    -  D         0:00,02 /usr/local/sbin/httpd -f /usr/local/etc/apache24/httpd.conf -c PidFile /var/run/httpd.users.pid
.....

procstat -kk 1395:
  PID    TID COMM             TDNAME           KSTACK                       
 1395 102111 httpd            -                mi_switch+0xe1 sleepq_wait+0x3a _sleep+0x287 vn_start_write_locked+0xa7 vn_start_write+0xa3 vn_write+0xb0 vn_io_fault_doio+0x22 vn_io_fault1+0x1ac vn_io_fault+0x18b dofilewrite+0x87 kern_writev+0x68 sys_write+0x63 amd64_syscall+0x40f Xfast_syscall+0xfb

procstat -kk 2273:
  PID    TID COMM             TDNAME           KSTACK                       
 2273 100096 httpd            -                mi_switch+0xe1 sleepq_wait+0x3a _sleep+0x287 foffset_lock+0xaa vn_io_fault+0x5c dofilewrite+0x87 kern_writev+0x68 sys_write+0x63 amd64_syscall+0x40f Xfast_syscall+0xfb 

I don't know how to reproduce it. It happens occasionally on different servers.
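
A minimal sketch of how the stacks shown above could be collected for every stuck process at once. It assumes the input is the output of `ps -axo pid,state,wchan`; the helper name `dstate_pids` is hypothetical and not part of any FreeBSD tool.

```shell
#!/bin/sh
# Hypothetical helper: print the PIDs of processes whose state starts with "D".
# Expects lines of "PID STATE WCHAN" on stdin, e.g. from `ps -axo pid,state,wchan`.
dstate_pids() {
    # The numeric check on $1 skips the header line and any malformed rows.
    awk '$1 ~ /^[0-9]+$/ && $2 ~ /^D/ { print $1 }'
}
```

Possible usage: `ps -axo pid,state,wchan | dstate_pids | while read -r pid; do procstat -kk "$pid"; done`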
Comment 1 Christos Chatzaras 2016-10-01 14:31:00 UTC
Same issue here. It happens randomly, for example during tar or rsync:

procstat -kk 66610
  PID    TID COMM             TDNAME           KSTACK
66610 101290 bsdtar           -                mi_switch+0xe1 sleepq_wait+0x3a _sleep+0x287 vnode_create_vobject+0x100 ufs_open+0x6d VOP_OPEN_APV+0xa1 vn_open_vnode+0x234 vn_open_cred+0x36a kern_openat+0x26f amd64_syscall+0x40f Xfast_syscall+0xfb

fstat -p 66610
USER     CMD          PID   FD MOUNT      INUM MODE         SZ|DV R/W
root     bsdtar     66610 text /usr     903274 -r-xr-xr-x   58392  r
root     bsdtar     66610   wd -        290045184 d---------     512  r
root     bsdtar     66610 root /             2 drwxr-xr-x     512  r
root     bsdtar     66610    0* pipe fffff804efe05000 <-> fffff804efe05160      0 rw
root     bsdtar     66610    1* pipe fffff8000e2d1730 <-> fffff8000e2d15d0      0 rw
root     bsdtar     66610    2* pipe fffff804e9480448 <-> fffff804e94802e8      0 rw
root     bsdtar     66610    3 -        290045184 d---------     512  r
root     bsdtar     66610    4 -        293018454 drwxr-xr-x  15111168  r
root     bsdtar     66610    5 -        293018454 drwxr-xr-x  15111168  r
Comment 2 Christos Chatzaras 2016-10-01 15:17:51 UTC
It's possible this issue is the same as https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=204764
Comment 3 vvv 2016-10-11 13:36:41 UTC
The same problem occurs under 11.0-RELEASE.
Comment 4 Christos Chatzaras 2016-10-11 13:45:58 UTC
For me it was resolved when I upgraded to 10.3-STABLE
Comment 5 Christos Chatzaras 2016-10-11 13:46:24 UTC
I mean 10-STABLE
Comment 6 vvv 2017-01-30 16:54:10 UTC
uname -a:
FreeBSD hostname 11.0-RELEASE-p1 FreeBSD 11.0-RELEASE-p1 #0 r306420: Thu Sep 29 01:43:23 UTC 2016     root@releng2.nyi.freebsd.org:/usr/obj/usr/src/sys/GENERIC  amd64

ps auxwwO wchan:
user123          21246   0,0  0,2  192920   45800  -  D    18:20        0:00,19 /usr/local/bin/p 21246 suspfs    -  D         0:00,19 /usr/local/bin/php-cgi
root             21442   0,0  0,0    8396    2676  -  Ds   18:21        0:00,01 find -H /tmp -na 21442 suspfs    -  Ds        0:00,01 find -H /tmp -name sess_* -mtime +1h -delete
user234          21669   0,0  0,0   11544    3420  -  D    18:21        0:00,02 unzip -ao /tmp/f 21669 suspfs    -  D         0:00,02 unzip -ao /tmp/fm/55BA42987CD9F1D024546F689963E25C/dist.zip -d /tmp/fm/55BA42987CD9F1D024546F689963E25C
www              22188   0,0  0,2  146188   62200  -  D    18:23        0:00,24 /usr/local/sbin/ 22188 suspfs    -  D         0:00,24 /usr/local/sbin/httpd -f /usr/local/etc/apache24/httpd.conf -c PidFile /var/run/httpd.users.pid
www              22439   0,0  0,2  146188   61772  -  D    18:24        0:00,03 /usr/local/sbin/ 22439 suspfs    -  D         0:00,03 /usr/local/sbin/httpd -f /usr/local/etc/apache24/httpd.conf -c PidFile /var/run/httpd.users.pid
www              22442   0,0  0,2  146188   61768  -  D    18:24        0:00,02 /usr/local/sbin/ 22442 suspfs    -  D         0:00,02 /usr/local/sbin/httpd -f /usr/local/etc/apache24/httpd.conf -c PidFile /var/run/httpd.users.pid
www              22789   0,0  0,2  146188   61780  -  D    18:26        0:00,03 /usr/local/sbin/ 22789 suspfs    -  D         0:00,03 /usr/local/sbin/httpd -f /usr/local/etc/apache24/httpd.conf -c PidFile /var/run/httpd.users.pid
www              22790   0,0  0,2  146188   61772  -  D    18:26        0:00,02 /usr/local/sbin/ 22790 suspfs    -  D         0:00,02 /usr/local/sbin/httpd -f /usr/local/etc/apache24/httpd.conf -c PidFile /var/run/httpd.users.pid
www              22799   0,0  0,2  146188   61800  -  D    18:26        0:00,03 /usr/local/sbin/ 22799 suspfs    -  D         0:00,03 /usr/local/sbin/httpd -f /usr/local/etc/apache24/httpd.conf -c PidFile /var/run/httpd.users.pid
www              22802   0,0  0,2  146188   61764  -  D    18:26        0:00,02 /usr/local/sbin/ 22802 suspfs    -  D         0:00,02 /usr/local/sbin/httpd -f /usr/local/etc/apache24/httpd.conf -c PidFile /var/run/httpd.users.pid
user345          22821   0,0  0,0   44148    3504  -  Ds   18:26        0:00,12 ftpd: bzq-79-183 22821 range     -  Ds        0:00,12 ftpd: ???.red.bezeqint.net: user/user345: STOR pack-473a0372073c2c7baee9ef960158fa0d3fa750e4.idx\r\n (ftpd)
www              22834   0,0  0,2  146188   61764  -  D    18:26        0:00,02 /usr/local/sbin/ 22834 suspfs    -  D         0:00,02 /usr/local/sbin/httpd -f /usr/local/etc/apache24/httpd.conf -c PidFile /var/run/httpd.users.pid
www              22935   0,0  0,2  146188   61828  -  D    18:26        0:00,04 /usr/local/sbin/ 22935 vofflock  -  D         0:00,04 /usr/local/sbin/httpd -f /usr/local/etc/apache24/httpd.conf -c PidFile /var/run/httpd.users.pid
www              23009   0,0  0,2  146188   61772  -  D    18:27        0:00,01 /usr/local/sbin/ 23009 vofflock  -  D         0:00,01 /usr/local/sbin/httpd -f /usr/local/etc/apache24/httpd.conf -c PidFile /var/run/httpd.users.pid
.....

procstat -kk 22188:
  PID    TID COMM             TDNAME           KSTACK                       
22188 100207 httpd            -                mi_switch+0xd2 sleepq_wait+0x3a _sleep+0x2a1 vn_start_write_locked+0xa6 vn_start_write+0xdf vn_close+0x5b vn_closefile+0x4a _fdrop+0x1a closef+0x2d4 closefp+0xb6 amd64_syscall+0x4ce Xfast_syscall+0xfb 

procstat -kk 22935:
  PID    TID COMM             TDNAME           KSTACK                       
22935 100997 httpd            -                mi_switch+0xd2 sleepq_wait+0x3a _sleep+0x2a1 foffset_lock+0xda vn_io_fault+0x5a dofilewrite+0x87 kern_writev+0x68 sys_write+0x84 amd64_syscall+0x4ce Xfast_syscall+0xfb
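
When many processes are stuck, it can help to see which kernel function each one went to sleep in. The sketch below assumes `procstat -kk` output in the format shown above and extracts the frame that immediately follows `_sleep`; the helper name `sleep_caller` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: for each procstat -kk line, print the symbol that
# called _sleep (the frame right after "_sleep+0x..."), without its offset.
sleep_caller() {
    awk '{
        for (i = 1; i < NF; i++) {
            if ($i ~ /^_sleep\+/) {
                f = $(i + 1)
                sub(/\+.*/, "", f)   # strip the "+0x..." offset
                print f
                break
            }
        }
    }'
}
```

Possible usage: `procstat -kk 22188 22935 | sleep_caller | sort | uniq -c`. For the two stacks above this would list `vn_start_write_locked` and `foffset_lock`.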
Comment 7 vvv 2018-03-09 09:50:47 UTC
The same problem occurs under 11.1-RELEASE.
Comment 8 Konstantin Belousov freebsd_committer freebsd_triage 2018-03-09 11:15:09 UTC
(In reply to vvv from comment #7)
Either you use journaled soft updates, or your disk controller has stopped processing I/O requests. If you do use journaling, try switching to plain soft updates.
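
A minimal sketch of how the journaling state could be checked before and after the switch. It assumes `tunefs -p` prints a line of the form `tunefs: soft update journaling: (-j) enabled|disabled`; the helper name `suj_status` and the device `/dev/ada0p2` are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read `tunefs -p` output on stdin and print the
# soft update journaling state ("enabled" or "disabled").
suj_status() {
    awk '/soft update journaling:/ { print $NF }'
}

# Example (commented out; the device name is an assumption):
#   tunefs -p /dev/ada0p2 2>&1 | suj_status
#   tunefs -j disable /dev/ada0p2   # file system must be unmounted or mounted read-only
```

The `2>&1` is there because `tunefs -p` may write its report to standard error rather than standard output. `tunefs -j disable` turns off only the journal and leaves plain soft updates (`-n`) untouched.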
Comment 9 vvv 2018-03-09 11:50:18 UTC
It isn't a controller problem, because the behavior occurs randomly on different servers with different hardware.

Yes, SU+J is enabled. Is it a known problem?

Disabling SU+J is undesirable because fsck will take a very long time on unclean file systems. But I'll try.
Comment 10 lampa 2018-03-09 17:11:56 UTC
Try sysctl -w vfs.lookup_shared=0

In our case it helped; the lockup was in lockmgr due to heavy NFS load.
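
A value set with `sysctl` at runtime does not survive a reboot. A minimal sketch of one way to persist it, assuming the usual `/etc/sysctl.conf` mechanism; the helper name `persist_sysctl` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: append "name=value" to a sysctl.conf-style file
# unless a line for that name is already present.
persist_sysctl() {
    name=$1 value=$2 file=$3
    # awk exits 0 only if some line has exactly this name before the "=".
    if ! awk -F= -v n="$name" '$1 == n { found = 1 } END { exit !found }' "$file" 2>/dev/null; then
        printf '%s=%s\n' "$name" "$value" >> "$file"
    fi
}

# Example (commented out; requires root):
#   sysctl vfs.lookup_shared=0
#   persist_sysctl vfs.lookup_shared 0 /etc/sysctl.conf
```

The helper never rewrites an existing entry, so a different value already in the file has to be changed by hand.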
Comment 11 vvv 2018-03-09 17:29:23 UTC
Thanks. I'll try.
Comment 12 vvv 2018-04-12 07:44:15 UTC
vfs.lookup_shared=0 didn't help. I'm now trying to disable journaling and keep plain SU.
Comment 13 Christos Chatzaras 2018-10-05 12:57:34 UTC
Did disabling SU+J help?
Comment 14 vvv 2018-10-05 16:02:53 UTC
I've disabled soft update journaling (-j) and left soft updates (-n) enabled on two servers. Now they work fine.
Comment 15 vvv 2018-10-16 07:06:54 UTC
I hit the problem again with soft update journaling disabled and soft updates enabled:
tunefs: soft updates: (-n)                                 enabled
tunefs: soft update journaling: (-j)                       disabled

So, disabling SU+J didn't help.