|Summary:||more and more processes get stuck waiting for ufs and zfs until system is rendered inaccessible|
|Product:||Base System||Reporter:||Robert Clausecker <fuz>|
|Component:||kern||Assignee:||freebsd-bugs mailing list <bugs>|
|Severity:||Affects Some People||CC:||chris, koobs, ota|
Description Robert Clausecker 2019-08-22 21:41:27 UTC
I'm at a conference running an open FTP server. Files are served by FTP via ftpd(8), NFS via nfsd(8), and HTTP via Apache 2.4. The server has its root on UFS and the remaining files spread over three ZFS pools, one of which is currently replacing a (working) disk:

$ zpool list -v
NAME           SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ  FRAG   CAP  DEDUP  HEALTH  ALTROOT
disk12        18.1T  14.0T  4.16T        -         -    3%   77%  1.00x  ONLINE  -
  da3         9.06T  6.98T  2.08T        -         -    3%   77%
  diskid/DISK-7JG9E40C%20%20%20%20%20%20%20%20%20%20%20%20  9.06T  6.98T  2.08T        -         -    3%   77%
cache             -      -      -        -         -     -
  ada0p2       170G  3.98G   166G        -         -    0%    2%
disk34        18.1T  14.8T  3.33T        -         -    4%   81%  1.00x  ONLINE  -
  da2         9.06T  7.39T  1.67T        -         -    4%   81%
  da1         9.06T  7.41T  1.66T        -         -    4%   81%
cache             -      -      -        -         -     -
  ada0p5       170G  5.14G   165G        -         -    0%    3%
disk56        18.1T  14.0T  4.15T        -         -    1%   77%  1.00x  ONLINE  -
  replacing   9.06T  6.97T  2.10T        -         -    1%   76%
    da0           -      -      -        -         -     -     -
    da4           -      -      -        -         -     -     -
  diskid/DISK-7PGVBGZC%20%20%20%20%20%20%20%20%20%20%20%20  9.06T  7.01T  2.06T        -         -    1%   77%
cache             -      -      -        -         -     -
  ada0p6       170G  6.03G   164G        -         -    0%    3%

$ df -h
Filesystem         Size    Used   Avail  Capacity  Mounted on
/dev/ada0p4        375G     68G    278G       20%  /
devfs              1.0K    1.0K      0B      100%  /dev
tmpfs               33G     76K     33G        0%  /var/run
tmpfs               33G    4.0K     33G        0%  /tmp
tmpfs               33G    156K     33G        0%  /var/log
fdescfs            1.0K    1.0K      0B      100%  /dev/fd
procfs             4.0K    4.0K      0B      100%  /proc
disk12              18T     14T    3.6T       80%  /disk12
disk34              17T     14T    2.8T       83%  /disk34
disk56              18T     14T    3.6T       80%  /disk56
disk34/zeug        3.6T    864G    2.8T       23%  /usr/home/fuz/zeug
<above>:/disk12     18T     14T    3.6T       80%  /export
<above>:/disk34     35T     32T    2.8T       92%  /export
<above>:/disk56     52T     49T    3.6T       93%  /export

Files are served over a 10 GbE connection with an average bandwidth of around 200 MB/s; the limit seems to be the number of IOPS:

$ zpool iostat
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
disk12      14.0T  4.16T    254      0  34.8M  6.16K
disk34      14.8T  3.33T    261     29  35.0M  1.20M
disk56      14.0T  4.15T    882     29   118M   191K
----------  -----  -----  -----  -----  -----  -----

RAM is about half used and nothing seems to indicate any resource exhaustion.

$ vmstat
procs  memory       page                           disks    faults              cpu
r b w  avm   fre    flt   re  pi po     fr    sr  ad0 da0     in    sy    cs  us sy id
0 0 0  1.0T  666M   451 1197 436  0  64834 14532    0   0  28631 18084 93822   0 17 83

The only sysctl set is kern.racct.enable=1.

After a while, more and more httpd and ftpd processes get stuck in a ufs or zfs wait state. They cannot be killed. I have since rebooted the server a bunch of times and the problem keeps appearing.
Comment 1 Kubilay Kocak 2019-08-23 06:04:18 UTC
Thank you for the report, Robert.

Can you provide the exact system information (uname -a), and include:

- pkg version -v output (as an attachment)
- /var/run/dmesg.boot output (as an attachment)

And when the symptoms are observable:

- ps (-aux at least) output (as an attachment)
- vmstat -z output (as an attachment)
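The requested output could be gathered in one pass with something like the following. This is only a sketch: the `collect` helper, the temporary output directory, and the file names are assumptions made for illustration and are not part of the request. Only the commands themselves come from the list above.

```shell
#!/bin/sh
# Sketch: gather the diagnostics requested above into one directory.
# The directory location and file names are illustrative assumptions.

outdir=$(mktemp -d "${TMPDIR:-/tmp}/diag.XXXXXX") || exit 1

# collect NAME CMD [ARGS...]
# Run CMD and save its stdout and stderr to $outdir/NAME.txt.
# A failing command is noted in the file instead of aborting the run,
# since some commands may fail on a half-stuck system.
collect() {
    name=$1
    shift
    if ! "$@" > "$outdir/$name.txt" 2>&1; then
        echo "command failed: $*" >> "$outdir/$name.txt"
    fi
}

collect uname       uname -a
collect pkg-version pkg version -v
collect dmesg-boot  cat /var/run/dmesg.boot

# These two are most useful while the symptoms are observable.
collect ps          ps -aux
collect vmstat-z    vmstat -z

echo "diagnostics written to $outdir"
```

Each resulting file can then be attached separately. Note that on a system where processes are already blocked, any of these commands might itself hang, so running them from an existing session is probably safer than opening a new one.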
Comment 2 Robert Clausecker 2019-08-23 08:06:46 UTC
Created attachment 206815 [details] /var/run/dmesg.boot
Comment 3 Robert Clausecker 2019-08-23 08:07:09 UTC
Created attachment 206816 [details] pkg version -v
Comment 6 Robert Clausecker 2019-08-23 09:14:15 UTC
The situation has reappeared with a bunch of ftpd instances being stuck in ufs and zfs wait channels. I'll leave the stuck box up for an hour or so in case you need further information.
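While the box is in this state, the stuck processes and their kernel stacks could be captured roughly as follows. This is a sketch under assumptions: it was not requested in this thread, the `stuck_pids` helper is a hypothetical name, and it assumes the wait channels appear in ps(1) literally as `ufs` and `zfs`. It relies on FreeBSD's procstat(1) with `-kk` and skips that step where procstat is not installed.

```shell
#!/bin/sh
# Sketch: find processes sleeping on a ufs or zfs wait channel.
# Assumption: the wait channel names are exactly "ufs" and "zfs".

# stuck_pids: read "PID WCHAN COMMAND" lines on stdin and print the
# PIDs whose wait channel is ufs or zfs.
stuck_pids() {
    awk '$2 == "ufs" || $2 == "zfs" { print $1 }'
}

# Dump the kernel stack of each stuck process (FreeBSD only).
# Inspecting other users' processes will likely require root.
if command -v procstat >/dev/null 2>&1; then
    for pid in $(ps -axo pid=,wchan=,comm= | stuck_pids); do
        procstat -kk "$pid"
    done
fi
```

The filter is kept separate from the ps call so it can be checked against saved ps output as well as a live system.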
Comment 7 Robert Clausecker 2019-08-23 09:36:04 UTC
Once processes start to lock up, the machine kicks me out of my SSH session when I run "pkg update". I can't log in again (not via FTP, SSH, or the console), but existing connections continue to work. I then have to hard reboot the machine. Weird.
Comment 8 Christos Chatzaras 2019-08-28 22:25:38 UTC
Today my backup script hung with a chflags process (I run chflags recursively on a lot of files) in the ufs state. I use FreeBSD 12-STABLE (kernel/userland from 9 August) and UFS with SU+J. SSH was unresponsive but the server was pingable. The only way out was to hard reset the server.
Comment 9 ota 2019-08-29 05:08:47 UTC
(In reply to Robert Clausecker from comment #7) By the way, did all of the problems happen during the disk replacement?
Comment 10 Robert Clausecker 2019-08-29 08:52:22 UTC
(In reply to ota from comment #9) Yes, but I recall that I had another lockup after the disk replacement was done. What finally mitigated the problem was locking down the number of simultaneous FTP connections to an unreasonably low number (200) and disabling Apache 2.4. I think this reduced the load sufficiently to avoid the issue.