Bug 260146 - Using x11/konsole causes NFS hangs
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 13.0-RELEASE
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: freebsd-kde (Team)
Reported: 2021-11-30 19:53 UTC by Kurt Jaeger
Modified: 2021-12-03 09:44 UTC

Description Kurt Jaeger freebsd_committer 2021-11-30 19:53:35 UTC
NFS client (13.0p5) to NFS server (12.2p6), hangs and unkillable processes

When trying to pinpoint the processes involved using fstat, the fstat process itself hangs here, unkillable:
mi_switch+0xc1 sleeplk+0xec lockmgr_slock_hard+0x382 nfs_lock+0x2c vop_sigdefer+0x2b vn_fill_kinfo_vnode+0xd5 export_vnode_to_sb+0x84 kern_proc_filedesc_out+0x1ee sysctl_kern_proc_filedesc+0x7d sysctl_root_handler_locked+0x91 sysctl_root+0x24c userland_sysctl+0x173 sys___sysctl+0x5f amd64_syscall+0x10c fast_syscall_common+0xf8 

nfsdumpstate on both server and client does not show anything.

- tcpdump sample between server (12.2p6) and client (13.0p5), taken on server:
20:38:29.912755 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 176)
    <client>.1015 > <server>.2049: Flags [P.], cksum 0x95b3 (correct), seq 1984:2108, ack 1857, win 4353, options [nop,nop,TS val 1900843486 ecr 3042085228], length 124: NFS request xid 1920691674 120 getattr fh Unknown/ED995FFBDE3515680A00080000000000B3310E000000000000000000
        0x0000:  4500 00b0 0000 4000 4006 0b5b d447 c32d  E.....@.@..[.G.-
        0x0010:  d447 c330 03f7 0801 61be 799f 625c c9d9  .G.0....a.y.b\..
        0x0020:  8018 1101 95b3 0000 0101 080a 714c 91de  ............qL..
        0x0030:  b552 896c 8000 0078 727b 6dda 0000 0000  .R.l...xr{m.....
        0x0040:  0000 0002 0001 86a3 0000 0003 0000 0001  ................
        0x0050:  0000 0001 0000 0030 61a6 30b7 0000 0018  .......0a.0.....
        0x0060:  xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx  xxxxxxxxxxxxxxxx
        0x0070:  xxxx xxxx xxxx xxxx 0000 03e8 0000 03e8  xxxxxxxx........
        0x0080:  0000 0001 0000 0000 0000 0000 0000 0000  ................
        0x0090:  0000 001c ed99 5ffb de35 1568 0a00 0800  ......_..5.h....
        0x00a0:  0000 0000 b331 0e00 0000 0000 0000 0000  .....1..........
20:38:29.912762 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 168)
    <server>.2049 > <client>.1015: Flags [P.], cksum 0xce36 (correct), seq 1857:1973, ack 2108, win 29128, options [nop,nop,TS val 3042085228 ecr 1900843486], length 116: NFS reply xid 1920691674 reply ok 112 getattr DIR 755 ids 0/0 sz 15
        0x0000:  4500 00a8 0000 4000 4006 0b63 d447 c330  E.....@.@..c.G.0
        0x0010:  d447 c32d 0801 03f7 625c c9d9 61be 7a1b  .G.-....b\..a.z.
        0x0020:  8018 71c8 ce36 0000 0101 080a b552 896c  ..q..6.......R.l
        0x0030:  714c 91de 8000 0070 727b 6dda 0000 0001  qL.....pr{m.....
        0x0040:  0000 0000 0000 0000 0000 0000 0000 0000  ................
        0x0050:  0000 0000 0000 0002 0000 01ed 0000 000f  ................
        0x0060:  0000 0000 0000 0000 0000 0000 0000 000f  ................
        0x0070:  0000 0000 0000 3000 0000 0000 0000 0008  ......0.........
        0x0080:  0000 0000 fb5f 99ed 0000 0000 0000 0008  ....._..........
        0x0090:  5878 e8b4 39f5 8258 5824 2e46 0000 0000  Xx..9..XX$.F....
        0x00a0:  5878 e8b4 39f5 8a28                      Xx..9..(

This condition has already survived an NFS client reboot; we'll reboot the NFS server in approx. 12 hours. So if someone has ideas about what we can try to capture before we reboot the NFS server...?
Comment 1 Kurt Jaeger freebsd_committer 2021-11-30 19:54:40 UTC
ktrace of the hanging processes did not show anything, but maybe I made some errors using ktrace here...
Comment 2 Kurt Jaeger freebsd_committer 2021-12-01 08:28:26 UTC
Rebooting the NFS server did not change the problem.

But: using konsole from this package:

/usr/local/bin/konsole was installed by package konsole-21.08.3_1

triggers the problem, while using xterm works.
Comment 3 Rick Macklem freebsd_committer 2021-12-01 12:53:14 UTC
Hmm, looks like you are using NFSv3. If you are running
rpc.lockd, it uses NLM, which is a fundamentally broken protocol.
- If this is the case, try either the "nolockd" mount
  option or "nfsv4,minorversion=1" for your mount.
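
As a sketch only, the two alternatives could look like this in the client's /etc/fstab. The server name, export path and mount point are placeholders, not values from this report. Note that with "nolockd" locks are handled locally on the client and are not seen by other clients.

```text
# /etc/fstab on the NFS client (nfs-server:/export and /mnt/export are placeholders)

# Alternative 1: stay on NFSv3, but do not forward locks via rpc.lockd (NLM)
nfs-server:/export   /mnt/export   nfs   rw,nolockd                  0   0

# Alternative 2: mount with NFSv4.1 instead
nfs-server:/export   /mnt/export   nfs   rw,nfsv4,minorversion=1     0   0
```

Only one of the two lines would be used for a given mount, and the NFSv4.1 variant assumes the server has NFSv4 enabled.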

If this doesn't resolve the problem, please provide the
output for all of the following when the hang has occurred.

nfsstat -m
ps axHl
procstat -kk
netstat -a
nfsstat -E -c  <-- done at least twice about a minute apart
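
A minimal sketch of how the outputs listed above could be gathered in one go while the hang is present. The function names, the output directory and the NFS_HANG_GAP variable are made up for illustration, and "-a" on procstat (all processes) is an assumption about what is wanted.

```shell
#!/bin/sh
# Sketch: save each requested diagnostic into its own file under $OUTDIR.

# run_to_file NAME CMD...: run CMD and store its stdout+stderr in $OUTDIR/NAME.txt.
# The first line of the file records the command that was run.
run_to_file() {
    name=$1
    shift
    {
        printf '%s\n' "# $*"
        "$@" 2>&1 || printf '%s\n' "# command failed or not available: $1"
    } > "$OUTDIR/$name.txt"
}

# collect_nfs_hang_info [DIR]: run the requested commands, with the two
# "nfsstat -E -c" samples taken NFS_HANG_GAP seconds apart (default 60).
collect_nfs_hang_info() {
    OUTDIR=${1:-./nfs-hang-$(date +%Y%m%d-%H%M%S)}
    mkdir -p "$OUTDIR" || return 1
    run_to_file nfsstat-m    nfsstat -m
    run_to_file ps-axHl      ps axHl
    run_to_file procstat-kk  procstat -kk -a   # -a is an assumption
    run_to_file netstat-a    netstat -a
    run_to_file nfsstat-Ec-1 nfsstat -E -c
    sleep "${NFS_HANG_GAP:-60}"
    run_to_file nfsstat-Ec-2 nfsstat -E -c
    printf '%s\n' "$OUTDIR"
}
```

Calling collect_nfs_hang_info /tmp/nfs-hang (as root, on the hung client) would leave one text file per command in that directory. Any of these commands may itself block on the hung mount, so it is safer to write the output to a local file system.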

Also, since you know how to reproduce the problem, you can
capture packets while it happens. Before the hang run:
tcpdump -s 0 -w out.pcap host <nfs-server>
--> Then reproduce the problem and kill the tcpdump 1 minute
    after the hang has occurred.
Put out.pcap here as an attachment.
(Btw, if you want to look at out.pcap, use wireshark. tcpdump
 knows almost nothing about NFS packets, whereas wireshark
 can decode them nicely.)

I have not heard of a FreeBSD client hang problem. There is
a known TCP layer issue, but it causes hangs when the server
is running 13.0 and I do not think the problem occurs when
the server is 12.3.

I have heard of issues when using vnet jails, but I doubt you
are doing that.

Also, hangs are often network fabric related. Disabling TSO on
both client and server often helps. Or using a different network
driver/chip, if that is feasible (for example, for re devices the
driver in ports sometimes works better than what is in the system,
depending on which re chip you have).
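
For reference, a sketch of two common ways to turn TSO off on FreeBSD. The interface name em0 and the "DHCP" keyword are placeholders for whatever the machine actually uses.

```text
# /etc/sysctl.conf: disable TSO for all TCP connections on this host
net.inet.tcp.tso=0

# /etc/rc.conf: or disable it on one interface only (em0 is a placeholder)
ifconfig_em0="DHCP -tso"
```

The same settings can be applied at runtime with "sysctl net.inet.tcp.tso=0" or "ifconfig em0 -tso" for a quick test before making them permanent.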
Comment 4 Rick Macklem freebsd_committer 2021-12-01 13:01:03 UTC
Oh, and if "netstat -a" when it is hung shows the
TCP connection to the server as ESTABLISHED and Recv-Q
is non-zero, you probably are hitting the TCP issue.

See PR#254590 for what to do about it.
(It is specific to 13.0 and is fixed in stable/13.)
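
A small sketch of the check described above, as a filter over netstat output. The function name is made up, and it assumes the usual FreeBSD column layout (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, state) with the NFS port shown either as 2049 or as the service name nfsd.

```shell
# nfs_recvq_stuck: read "netstat -an" (or "netstat -a") output on stdin and
# print TCP connections to the NFS port that are ESTABLISHED with Recv-Q > 0.
nfs_recvq_stuck() {
    awk '$1 ~ /^tcp/ && $6 == "ESTABLISHED" && $2 + 0 > 0 &&
         $5 ~ /\.(2049|nfsd)$/ { print }'
}
```

On the hung client this would be used as "netstat -an | nfs_recvq_stuck"; no output means no connection matches the pattern.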
Comment 5 Kurt Jaeger freebsd_committer 2021-12-01 13:55:24 UTC
(In reply to Rick Macklem from comment #3)
Thanks for the pointers. We're now testing with 

  rw,soft,bg,tcp

via v6 for now, and will wait 24 hours to see whether the problem recurs.
Comment 6 Kurt Jaeger freebsd_committer 2021-12-02 19:44:22 UTC
(In reply to Kurt Jaeger from comment #5)
The error did not re-appear.

It's unclear whether the mount option change is a good fix (and we close this PR)
or whether some part of KDE, NFS, ... needs changes.
Comment 7 Kurt Jaeger freebsd_committer 2021-12-02 19:56:50 UTC
(In reply to Kurt Jaeger from comment #6)
Ah, I have an idea. We have a test desktop and will try to reproduce with that and the old fstab settings 8-}
Comment 8 Tobias C. Berner freebsd_committer 2021-12-03 09:44:45 UTC
(In reply to Kurt Jaeger from comment #7)
If I remember correctly, we have had a long-standing issue with FAM on NFS that led to hangs.

(that's why it is off by default in devel/kf5-kcoreaddons).


Best regards, Tobias