Bug 240047

Summary: more and more processes get stuck waiting for ufs and zfs until system is rendered inaccessible
Product: Base System Reporter: Robert Clausecker <fuz>
Component: kern Assignee: freebsd-bugs mailing list <bugs>
Status: Open ---    
Severity: Affects Some People CC: chris, koobs, ota
Priority: --- Keywords: needs-qa
Version: 12.0-RELEASE   
Hardware: amd64   
OS: Any   
Attachments:
Description          Flags
/var/run/dmesg.boot  none
pkg version -v       none
ps -auxc             none
vmstat -z            none

Description Robert Clausecker 2019-08-22 21:41:27 UTC
I'm at a conference, running an open FTP server.  Files are served by FTP via ftpd(8), NFS via nfsd(8), and HTTP via Apache 2.4.  The server has its root on UFS and the remaining files spread over three ZFS pools, one of which is currently replacing a (working) disk:

$ zpool list -v
NAME                                     SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
disk12                                  18.1T  14.0T  4.16T        -         -     3%    77%  1.00x  ONLINE  -
  da3                                   9.06T  6.98T  2.08T        -         -     3%    77%
  diskid/DISK-7JG9E40C%20%20%20%20%20%20%20%20%20%20%20%20  9.06T  6.98T  2.08T        -         -     3%    77%
cache                                       -      -      -         -      -      -
  ada0p2                                 170G  3.98G   166G        -         -     0%     2%
disk34                                  18.1T  14.8T  3.33T        -         -     4%    81%  1.00x  ONLINE  -
  da2                                   9.06T  7.39T  1.67T        -         -     4%    81%
  da1                                   9.06T  7.41T  1.66T        -         -     4%    81%
cache                                       -      -      -         -      -      -
  ada0p5                                 170G  5.14G   165G        -         -     0%     3%
disk56                                  18.1T  14.0T  4.15T        -         -     1%    77%  1.00x  ONLINE  -
  replacing                             9.06T  6.97T  2.10T        -         -     1%    76%
    da0                                     -      -      -        -         -      -      -
    da4                                     -      -      -        -         -      -      -
  diskid/DISK-7PGVBGZC%20%20%20%20%20%20%20%20%20%20%20%20  9.06T  7.01T  2.06T        -         -     1%    77%
cache                                       -      -      -         -      -      -
  ada0p6                                 170G  6.03G   164G        -         -     0%     3%

$ df -h
Filesystem         Size    Used   Avail Capacity  Mounted on
/dev/ada0p4        375G     68G    278G    20%    /
devfs              1.0K    1.0K      0B   100%    /dev
tmpfs               33G     76K     33G     0%    /var/run
tmpfs               33G    4.0K     33G     0%    /tmp
tmpfs               33G    156K     33G     0%    /var/log
fdescfs            1.0K    1.0K      0B   100%    /dev/fd
procfs             4.0K    4.0K      0B   100%    /proc
disk12              18T     14T    3.6T    80%    /disk12
disk34              17T     14T    2.8T    83%    /disk34
disk56              18T     14T    3.6T    80%    /disk56
disk34/zeug        3.6T    864G    2.8T    23%    /usr/home/fuz/zeug
<above>:/disk12     18T     14T    3.6T    80%    /export
<above>:/disk34     35T     32T    2.8T    92%    /export
<above>:/disk56     52T     49T    3.6T    93%    /export

Files are served over a 10 GbE connection with an average bandwidth of around 200 MB/s; the limit seems to be the number of IOPS:

$ zpool iostat
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
disk12      14.0T  4.16T    254      0  34.8M  6.16K
disk34      14.8T  3.33T    261     29  35.0M  1.20M
disk56      14.0T  4.15T    882     29   118M   191K
----------  -----  -----  -----  -----  -----  -----
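To put the IOPS observation in numbers, the average read size per operation can be derived from the output above by dividing read bandwidth by read operations. The following is a minimal sketch, not part of the original report; the function name avg_read_size is hypothetical, and it assumes the operations column is a plain number and that the K/M/G suffixes are powers of 1024.

```shell
# Print the average read size per operation for each pool.
# Reads `zpool iostat` output on stdin.
# Assumptions: read operations ($4) carry no unit suffix, and the
# bandwidth suffixes K/M/G are powers of 1024.
avg_read_size() {
    awk '
    function bytes(s,   u, v) {      # convert values such as "34.8M" to bytes
        u = substr(s, length(s)); v = s + 0
        if (u == "K") return v * 1024
        if (u == "M") return v * 1024 * 1024
        if (u == "G") return v * 1024 * 1024 * 1024
        return v
    }
    # Pool rows have seven fields and a numeric, non-zero read-operations column.
    NF == 7 && $4 ~ /^[0-9]/ && $4 > 0 {
        printf "%s %.0fK\n", $1, bytes($6) / $4 / 1024
    }'
}

# Usage (assumed invocation on the affected host):
#   zpool iostat | avg_read_size
```

For the figures above this works out to roughly 137 to 140 KiB per read on each pool. Note that zpool iostat without an interval argument reports averages since boot, so these are long-term figures rather than a snapshot of the stuck state.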

RAM is about half used and nothing seems to indicate any resource exhaustion.

$ vmstat
procs  memory       page                    disks     faults         cpu
r b w  avm   fre   flt  re  pi  po    fr   sr ad0 da0   in    sy    cs us sy id
0 0 0 1.0T  666M   451 1197 436   0 64834 14532   0   0 28631 18084 93822  0 17 83

The only sysctl set is kern.racct.enable=1



After a while, more and more httpd and ftpd processes get stuck in a ufs or zfs wait state.  They cannot be killed.  I have since rebooted the server a bunch of times and the problem keeps appearing.
Comment 1 Kubilay Kocak freebsd_committer freebsd_triage 2019-08-23 06:04:18 UTC
Thank you for the report, Robert.

Can you provide the exact system information (uname -a), and include

- pkg version -v output (as an attachment) 
- /var/run/dmesg.boot output (as an attachment) 

And when the symptoms are observable:

- ps (-aux at least) output (as an attachment)
- vmstat -z output (as an attachment)
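When the symptoms are observable, it can also help to summarize how many processes are blocked on each wait channel. The following is a minimal sketch, not something requested in the original comment; the function name count_wchan is hypothetical, and it assumes the input comes from ps -axo pid,state,wchan,comm, with "D" marking uninterruptible (disk) wait.

```shell
# Count processes in uninterruptible sleep (state starting with "D"),
# grouped by wait channel.
# Reads `ps -axo pid,state,wchan,comm` output on stdin; the first line
# is assumed to be the column header and is skipped.
count_wchan() {
    awk 'NR > 1 && $2 ~ /^D/ { n[$3]++ }
         END { for (w in n) printf "%s %d\n", w, n[w] }' | sort
}

# Usage (assumed invocation on the affected host):
#   ps -axo pid,state,wchan,comm | count_wchan
```

A steadily growing count for the ufs or zfs channels across repeated runs would match the behaviour described in this report.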
Comment 2 Robert Clausecker 2019-08-23 08:06:46 UTC
Created attachment 206815 [details]
/var/run/dmesg.boot
Comment 3 Robert Clausecker 2019-08-23 08:07:09 UTC
Created attachment 206816 [details]
pkg version -v
Comment 4 Robert Clausecker 2019-08-23 09:13:16 UTC
Created attachment 206819 [details]
ps -auxc
Comment 5 Robert Clausecker 2019-08-23 09:13:35 UTC
Created attachment 206820 [details]
vmstat -z
Comment 6 Robert Clausecker 2019-08-23 09:14:15 UTC
The situation has reappeared with a bunch of ftpd instances being stuck in ufs and zfs wait channels.  I'll leave the stuck box up for an hour or so in case you need further information.
Comment 7 Robert Clausecker 2019-08-23 09:36:04 UTC
Once processes start to lock up, the machine kicks me out of my SSH session when I run "pkg update".  I can't log in again (via FTP, SSH, or the console), but existing connections continue to work.  I then have to hard reboot the machine.  Weird.
Comment 8 Christos Chatzaras 2019-08-28 22:25:38 UTC
Today my backup script hung with a chflags process (I run chflags recursively on a lot of files) stuck in the ufs state.

I use FreeBSD 12-STABLE (kernel/userland from 9 August) and UFS SU+J.

SSH was unresponsive, but the server was still pingable.

The only way out was to hard reset the server.
Comment 9 ota 2019-08-29 05:08:47 UTC
(In reply to Robert Clausecker from comment #7)

By the way, did all of the problems happen during the disk replacement?
Comment 10 Robert Clausecker 2019-08-29 08:52:22 UTC
(In reply to ota from comment #9)

Yes, but I recall that I had another lockup after the disk replacement was done.  What finally mitigated the problem was locking down the number of simultaneous FTP connections to an unreasonably low number (200) and disabling Apache 2.4.  I think this reduced the load sufficiently to avoid the issue.