NFS client (13.0p5) to NFS server (12.2p6), hangs and unkillable processes

When trying to pinpoint the processes involved, using fstat:

fstat proc hangs here, unkillable:

mi_switch+0xc1
sleeplk+0xec
lockmgr_slock_hard+0x382
nfs_lock+0x2c
vop_sigdefer+0x2b
vn_fill_kinfo_vnode+0xd5
export_vnode_to_sb+0x84
kern_proc_filedesc_out+0x1ee
sysctl_kern_proc_filedesc+0x7d
sysctl_root_handler_locked+0x91
sysctl_root+0x24c
userland_sysctl+0x173
sys___sysctl+0x5f
amd64_syscall+0x10c
fast_syscall_common+0xf8

nfsdumpstate on both server and client does not show anything.

- tcpdump sample between server (12.2p6) and client (13.0p5), taken on server:

20:38:29.912755 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 176)
    <client>.1015 > <server>.2049: Flags [P.], cksum 0x95b3 (correct), seq 1984:2108, ack 1857, win 4353, options [nop,nop,TS val 1900843486 ecr 3042085228], length 124: NFS request xid 1920691674 120 getattr fh Unknown/ED995FFBDE3515680A00080000000000B3310E000000000000000000
        0x0000:  4500 00b0 0000 4000 4006 0b5b d447 c32d  E.....@.@..[.G.-
        0x0010:  d447 c330 03f7 0801 61be 799f 625c c9d9  .G.0....a.y.b\..
        0x0020:  8018 1101 95b3 0000 0101 080a 714c 91de  ............qL..
        0x0030:  b552 896c 8000 0078 727b 6dda 0000 0000  .R.l...xr{m.....
        0x0040:  0000 0002 0001 86a3 0000 0003 0000 0001  ................
        0x0050:  0000 0001 0000 0030 61a6 30b7 0000 0018  .......0a.0.....
        0x0060:  xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx  xxxxxxxxxxxxxxxx
        0x0070:  xxxx xxxx xxxx xxxx 0000 03e8 0000 03e8  xxxxxxxx........
        0x0080:  0000 0001 0000 0000 0000 0000 0000 0000  ................
        0x0090:  0000 001c ed99 5ffb de35 1568 0a00 0800  ......_..5.h....
        0x00a0:  0000 0000 b331 0e00 0000 0000 0000 0000  .....1..........
20:38:29.912762 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 168)
    <server>.2049 > <client>.1015: Flags [P.], cksum 0xce36 (correct), seq 1857:1973, ack 2108, win 29128, options [nop,nop,TS val 3042085228 ecr 1900843486], length 116: NFS reply xid 1920691674 reply ok 112 getattr DIR 755 ids 0/0 sz 15
        0x0000:  4500 00a8 0000 4000 4006 0b63 d447 c330  E.....@.@..c.G.0
        0x0010:  d447 c32d 0801 03f7 625c c9d9 61be 7a1b  .G.-....b\..a.z.
        0x0020:  8018 71c8 ce36 0000 0101 080a b552 896c  ..q..6.......R.l
        0x0030:  714c 91de 8000 0070 727b 6dda 0000 0001  qL.....pr{m.....
        0x0040:  0000 0000 0000 0000 0000 0000 0000 0000  ................
        0x0050:  0000 0000 0000 0002 0000 01ed 0000 000f  ................
        0x0060:  0000 0000 0000 0000 0000 0000 0000 000f  ................
        0x0070:  0000 0000 0000 3000 0000 0000 0000 0008  ......0.........
        0x0080:  0000 0000 fb5f 99ed 0000 0000 0000 0008  ....._..........
        0x0090:  5878 e8b4 39f5 8258 5824 2e46 0000 0000  Xx..9..XX$.F....
        0x00a0:  5878 e8b4 39f5 8a28                      Xx..9..(

This condition has already survived an NFS client reboot; we'll reboot the NFS server in approx. 12 hours.

So if someone has ideas what we can try to catch before we reboot the NFS server... ?
ktrace of the hanging processes did not show anything, but maybe I made some errors using ktrace here...
Rebooting the NFS server did not change the problem.

But: using konsole (/usr/local/bin/konsole was installed by package konsole-21.08.3_1) triggers the problem; using xterm works.
Hmm, looks like you are using NFSv3. If you are running rpc.lockd, it is a fundamentally broken protocol (NLM).
- If this is the case, try either the "nolockd" mount option or "nfsv4,minorversion=1" for your mount.

If this doesn't resolve the problem, please provide the output for all of the following when the hang has occurred.
nfsstat -m
ps axHl
procstat -kk
netstat -a
nfsstat -E -c <-- done at least twice about a minute apart

Also, since you know how to reproduce the problem, you can capture packets while it happens. Before the hang run:
tcpdump -s 0 -w out.pcap host <nfs-server>
--> Then reproduce the problem and kill the tcpdump 1 minute after the hang has occurred. Put out.pcap here as an attachment.
(Btw, if you want to look at out.pcap, use wireshark. tcpdump knows almost nothing about NFS packets, whereas wireshark can decode them nicely.)

I have not heard of a FreeBSD client hang problem. There is a known TCP layer issue, but it causes hangs when the server is running 13.0 and I do not think the problem occurs when the server is 12.3.

I have heard of issues when using vnet jails, but I doubt you are doing that.

Also, hangs are often network fabric related. Disabling TSO on both client and server often helps. Or using a different network driver/chip if that is feasible (for example, for re devices the driver in ports sometimes works better than what is in the system, depending on which re chip you have.)
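For convenience, the outputs requested above could be gathered in one pass while the hang is in progress. The following is only a minimal sketch: the function name `collect_nfs_hang_info`, the file names, and the 60-second default are made up for illustration, and the `-a` passed to procstat (all processes) is an assumption, since the request only says "procstat -kk".

```shell
#!/bin/sh
# Sketch: gather the requested outputs into one directory during the hang.
# Function name, file names and default interval are illustrative only.
collect_nfs_hang_info() {
    diag_dir=$1
    diag_interval=${2:-60}   # seconds between the two "nfsstat -E -c" runs

    mkdir -p "$diag_dir" || return 1

    # A failing command still leaves its error text in its file,
    # so one missing tool does not abort the whole collection.
    nfsstat -m      > "$diag_dir/nfsstat-m.txt"     2>&1
    ps axHl         > "$diag_dir/ps-axHl.txt"       2>&1
    # "-a" (all processes) is an assumption added here.
    procstat -kk -a > "$diag_dir/procstat-kk.txt"   2>&1
    netstat -a      > "$diag_dir/netstat-a.txt"     2>&1

    # Two samples of the extended client statistics, spaced apart.
    nfsstat -E -c   > "$diag_dir/nfsstat-E-c.1.txt" 2>&1
    sleep "$diag_interval"
    nfsstat -E -c   > "$diag_dir/nfsstat-E-c.2.txt" 2>&1
    return 0
}
```

It would be called as, for example, `collect_nfs_hang_info /tmp/nfs-hang` on the client once the hang has occurred. The output directory should be on a local file system, not on the hung NFS mount.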
Oh, and if "netstat -a" when it is hung shows the TCP connection to the server as ESTABLISHED and Recv-Q is non-zero, you probably are hitting the TCP issue. See PR#254590 for what to do about it. (It is specific to 13.0 and is fixed in stable/13.)
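The netstat check described above can be done by eye, but a small filter may help on a busy machine. This is a rough sketch, not a definitive test: the function name `find_stuck_nfs_conns` is made up, it assumes the usual FreeBSD column layout (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, state), and it assumes the server port shows up as either 2049 or the service name nfsd.

```shell
#!/bin/sh
# Sketch: read netstat output on stdin and print TCP connections to the
# NFS port that are ESTABLISHED and have a non-zero Recv-Q.
# Assumed columns: Proto Recv-Q Send-Q Local-Address Foreign-Address (state)
find_stuck_nfs_conns() {
    awk '$1 ~ /^tcp/ && $NF == "ESTABLISHED" && $2 + 0 > 0 &&
         $5 ~ /[.:](nfsd|2049)$/ { print }'
}
```

Usage on the hung client would be something like `netstat -an | find_stuck_nfs_conns`. Any line it prints matches the pattern described above; no output means the pattern was not seen at that moment.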
(In reply to Rick Macklem from comment #3)
Thanks for the pointers. We're now testing with rw,soft,bg,tcp via v6 and will wait 24 hours to see if the problem re-occurs.
(In reply to Kurt Jaeger from comment #5)
The error did not re-appear. It's unclear whether the mount option change is a good fix (and we close this PR) or whether some part of KDE, NFS, ... needs changes.
(In reply to Kurt Jaeger from comment #6)
Ah, I have an idea. We have a test desktop and will try to reproduce with that and the old fstab settings 8-}
(In reply to Kurt Jaeger from comment #7)
If I remember correctly, we have had a long-standing issue with FAM on NFS that led to hangs. (That's why it is off by default in devel/kf5-kcoreaddons.)

mfg Tobias