Bug 259163 - 12.2-p9 & -p10 NFSv3 server times out for CentOS 8 client after ~13-14 minutes of file copying
Summary: 12.2-p9 & -p10 NFSv3 server times out for CentOS 8 client after ~13-14 minutes...
Status: Closed Overcome By Events
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 12.2-RELEASE
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: freebsd-bugs (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-10-14 01:28 UTC by parv
Modified: 2022-03-10 06:55 UTC
CC List: 3 users

See Also:


Attachments

Description parv 2021-10-14 01:28:37 UTC
Our ZFS-based storage system tracks the FreeBSD 12.2 release. A directory from a ZFS RAID-Z2 dataset is exported via the NFSv3 server. (The NFSv4 server is not being used because it would require setting up idmapd.)

The FreeBSD server has this in /etc/exports ...

  /real/dir  -alldirs,-maproot=root  -network IP-address/netmask


... and the CentOS client mounts the exported directory via autofs with this map entry ...

  SERVER-addr  -soft,retrans=1,retry=1,vers=3  SERVER-addr:/real/dir 
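
For reference, a minimal sketch of how the export and the effective client-side options can be checked; the mount point /mnt/test is just a placeholder, the option list is the one from the autofs map above:

  # On the CentOS client: list what the FreeBSD server exports
  showmount -e SERVER-addr

  # Mount by hand with the same options the autofs map uses
  mount -t nfs -o soft,retrans=1,retry=1,vers=3 SERVER-addr:/real/dir /mnt/test

  # Show the options the kernel actually applied to the mount
  nfsstat -m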


After updating 12.2 from -p8 to -p9, which brought changes to libcasper, the CentOS 8 NFSv3 client started seeing the NFS connection time out after ~13-14 minutes of a file copy job. Going to -p10 did not help the situation any.

Moving back to -p8 does not cause any timeout issues on the CentOS side.

Around the time of the timeout, tcpdump of the NFS connection on both the CentOS client and the FreeBSD server only showed that the FreeBSD server was indeed not responding within the expected time.
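
(The captures were taken with plain tcpdump; something along the lines below is enough to reproduce them. The interface name em0 and CLIENT-addr are placeholders.)

  # Run on the server (and similarly on the client); NFS over TCP uses port 2049
  tcpdump -i em0 -s 0 -w nfs-timeout.pcap host CLIENT-addr and port 2049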

Output on the FreeBSD server from the "vfs.nfsd.debuglevel=2" MIB did not enlighten me; it produced a flood of seemingly irrelevant output, so it is entirely possible that I missed something.
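
(The debug knob was set roughly as below; whether the messages land in /var/log/messages depends on the local syslog configuration.)

  # On the FreeBSD server: enable NFS server debug output, watch the log, then disable it
  sysctl vfs.nfsd.debuglevel=2
  tail -f /var/log/messages
  sysctl vfs.nfsd.debuglevel=0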

There have not been any messages on the FreeBSD side related to, or around the time of, the NFS connection timeouts.


When the CentOS client experiences the NFS server timeout, it prints "CPU stuck" messages for multiple CPUs on the tty; dmesg shows ...

...
2021-08-23T21:20:08,041420 watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kworker/u24:3:856093]
2021-08-23T21:20:08,043420 watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [kworker/u24:7:856767]
2021-08-23T21:20:08,044420 watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [kworker/u24:5:856593]
2021-08-23T21:20:08,044420 Modules linked in: ib_core nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache xt_CHECKSUM ipt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 nft_compat nft_counter nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables libcrc32c nfnetlink tun bridge stp llc sunrpc vfat fat intel_rapl_msr intel_rapl_common snd_hda_codec_hdmi snd_sof_pci snd_sof_intel_byt snd_sof_intel_ipc snd_sof_intel_hda_common snd_soc_hdac_hda snd_sof_xtensa_dsp snd_sof_intel_hda snd_hda_codec_realtek snd_sof snd_hda_codec_generic ledtrig_audio snd_soc_skl snd_soc_sst_ipc snd_soc_sst_dsp x86_pkg_temp_thermal intel_powerclamp snd_hda_ext_core coretemp snd_soc_acpi_intel_match snd_soc_acpi kvm_intel snd_hda_intel snd_intel_dspcfg soundwire_intel soundwire_generic_allocation iTCO_wdt mei_wdt iTCO_vendor_support intel_wmi_thunderbolt wmi_bmof snd_soc_core kvm snd_compress soundwire_cadence soundwire_bus irqbypass snd_hda_codec snd_hda_core crct10dif_pclmul
2021-08-23T21:20:08,044438  crc32_pclmul snd_hwdep snd_seq ghash_clmulni_intel rapl snd_seq_device intel_cstate snd_pcm intel_uncore snd_timer snd pcspkr soundcore i2c_i801 mei_me mei intel_pch_thermal ie31200_edac wmi acpi_pad intel_pmc_core ip_tables ext4 mbcache jbd2 sd_mod t10_pi sg i915 cec intel_gtt drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm igb ahci e1000e libahci libata dca i2c_algo_bit crc32c_intel video pinctrl_cannonlake fuse
2021-08-23T21:20:08,044451 CPU: 2 PID: 856593 Comm: kworker/u24:5 Kdump: loaded Tainted: G        W    L   --------- -  - 4.18.0-305.10.2.el8_4.x86_64 #1
2021-08-23T21:20:08,044451 Hardware name: Puget Systems C246M-WU4/C246M-WU4-CF, BIOS F4e 09/16/2020
2021-08-23T21:20:08,044466 Workqueue: rpciod rpc_async_schedule [sunrpc]
2021-08-23T21:20:08,044469 RIP: 0010:native_queued_spin_lock_slowpath+0x18f/0x1d0
2021-08-23T21:20:08,044469 Code: c1 e9 12 83 e0 03 83 e9 01 48 c1 e0 05 48 63 c9 48 05 00 ac 02 00 48 03 04 cd a0 b8 b7 a7 48 89 10 8b 42 08 85 c0 75 09 f3 90 <8b> 42 08 85 c0 74 f7 48 8b 02 48 85 c0 0f 84 6e ff ff ff 48 89 c1
2021-08-23T21:20:08,044470 RSP: 0018:ffff9d9bc9ea7de8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
2021-08-23T21:20:08,044471 RAX: 0000000000000000 RBX: ffff8a8ac1679000 RCX: 0000000000000001
2021-08-23T21:20:08,044471 RDX: ffff8a8b7c2aac00 RSI: 00000000000c0000 RDI: ffff8a8ac1679670
2021-08-23T21:20:08,044472 RBP: ffff8a8ac1679670 R08: 0000000000200000 R09: 0000000000000000
2021-08-23T21:20:08,044472 R10: ffff8a7c88217200 R11: 0000000000000001 R12: ffff8a80d33fc7d0
2021-08-23T21:20:08,044472 R13: 0000000000000000 R14: ffffffffc0c52b20 R15: ffffffffc0c44cf0
2021-08-23T21:20:08,044473 FS:  0000000000000000(0000) GS:ffff8a8b7c280000(0000) knlGS:0000000000000000
2021-08-23T21:20:08,044473 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2021-08-23T21:20:08,044474 CR2: 000055a105562088 CR3: 0000000eb0610004 CR4: 00000000003706e0
2021-08-23T21:20:08,044474 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
2021-08-23T21:20:08,044475 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
2021-08-23T21:20:08,044475 Call Trace:
2021-08-23T21:20:08,044478  _raw_spin_lock+0x1c/0x20
2021-08-23T21:20:08,044487  xprt_prepare_transmit+0x4a/0xb0 [sunrpc]
2021-08-23T21:20:08,044495  call_transmit+0x2e/0x80 [sunrpc]
2021-08-23T21:20:08,044503  __rpc_execute+0x85/0x3c0 [sunrpc]
2021-08-23T21:20:08,044506  ? try_to_wake_up+0x1cd/0x540
2021-08-23T21:20:08,044514  rpc_async_schedule+0x29/0x40 [sunrpc]
2021-08-23T21:20:08,044515  process_one_work+0x1a7/0x360
2021-08-23T21:20:08,044516  worker_thread+0x30/0x390
2021-08-23T21:20:08,044517  ? create_worker+0x1a0/0x1a0
2021-08-23T21:20:08,044518  kthread+0x116/0x130
2021-08-23T21:20:08,044519  ? kthread_flush_work_fn+0x10/0x10
2021-08-23T21:20:08,044520  ret_from_fork+0x35/0x40
...
2021-08-23T21:20:41,880955 nfs: server SERVER not responding, timed out
2021-08-23T21:20:41,881036 nfs: server SERVER not responding, timed out
2021-08-23T21:20:41,881039 nfs: server SERVER not responding, timed out
...


Earlier, mostly one-sided, discussion elsewhere ...

- NFS v3 - FreeBSD 12 server connection getting timed out on CentOS 8 client, https://lists.freebsd.org/pipermail/freebsd-questions/2021-August/294634.html

- centos-freebsd-nfs-issue, https://github.com/parv/centos8-nfs3-issue
  (dmesg from CentOS; tcpdumps from CentOS & FreeBSD)


- https://twitter.com/_parv/status/1429889151155916800
Comment 1 parv 2021-10-14 02:27:36 UTC
Added Rick M to the email recipients for input and suggestions.
Comment 2 Rick Macklem freebsd_committer freebsd_triage 2021-10-14 02:36:52 UTC
Well, since no errata have been issued w.r.t. the NFS code,
I'm not sure why such an upgrade would cause problems.

The known issue with the NFS server in FreeBSD 12.2 is
described in PR#254590. If this is the problem,
# netstat -a
--> will show the stuck client's TCP connection in CLOSE_WAIT.

If that is the case, the patch in the first attachment
of PR#254590 fixes the problem. (This patch will be in 12.3,
but was not issued as an erratum.)
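
(For anyone hitting this later: a quick way to check for that symptom on the FreeBSD server is something like the following; the grep is only a convenience filter.)

  # A client stuck as described in PR#254590 shows its TCP connection in CLOSE_WAIT
  netstat -an -p tcp | grep CLOSE_WAIT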
Comment 3 parv 2022-03-10 06:38:22 UTC
I could not find the cause of the issues with 12.x.

After going to 13.0, the NFSv3 server timeouts have not recurred, for whatever reason.
Comment 4 parv 2022-03-10 06:55:23 UTC
I could not try the suggested patch, as the system was installed via "freebsd-update" and no sources are installed, per work policy.
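
(For context, trying the patch would have meant roughly the steps below, i.e. a populated /usr/src, which the policy rules out; the patch file name is a placeholder.)

  # Fetch the releng/12.2 sources, apply the patch from PR#254590, rebuild and install the kernel
  git clone -b releng/12.2 https://git.FreeBSD.org/src.git /usr/src
  cd /usr/src && patch < /path/to/pr254590-first-attachment.patch
  make buildkernel && make installkernel && shutdown -r now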