Bug 244713 - Processes hanging in "nfs" state
Summary: Processes hanging in "nfs" state
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 12.1-RELEASE
Hardware: Any Any
Importance: --- Affects Only Me
Assignee: freebsd-bugs (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-03-10 09:54 UTC by Julien Cigar
Modified: 2020-04-09 14:41 UTC

See Also:
julien: mfc-stable12?


Attachments
ps (804 bytes, text/plain)
2020-03-10 09:54 UTC, Julien Cigar
procstat (1.28 KB, text/plain)
2020-03-10 09:55 UTC, Julien Cigar
procstat -kk -a (279.46 KB, text/plain)
2020-03-11 09:02 UTC, Julien Cigar
Port of D24038 to stable/12 (1.89 KB, patch)
2020-03-11 22:33 UTC, Konstantin Belousov

Description Julien Cigar 2020-03-10 09:54:28 UTC
Hello,

I have a Python webapp running in a 10.3-RELEASE jail on a 12.1-RELEASE host, and its processes randomly hang in the "nfs" state. The hung processes are unkillable, and recovering requires a hard reboot.

The host is running UFS (with SU, but without +J) and the NFS server is running 10.3-RELEASE. The NFSv4 shares are mounted on the host with the "nfsv4,ro,bg,late" options (NFSv4 only).

It looks like a deadlock to me, and I've attached a procstat -kk of the offending processes.
Comment 1 Julien Cigar 2020-03-10 09:54:52 UTC
Created attachment 212298
ps
Comment 2 Julien Cigar 2020-03-10 09:55:19 UTC
Created attachment 212299
procstat
Comment 3 Konstantin Belousov freebsd_committer 2020-03-10 22:08:12 UTC
The 'grbmaw' thread waits for the xbusy state of the page to pass, so that the caller can sbusy it. There must be another thread that owns the xbusy. Please provide the procstat -kk output for all processes.

It might also be worth trying a 12-stable kernel, where a number of bugs in nearby areas were fixed.
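
As a rough aid for reading a full procstat -kk -a dump, the sketch below lists the threads whose row mentions a given kernel function, which can help when looking for a thread that might own the busy state. It assumes each data row starts with numeric PID and TID columns followed by the command, thread name, and kernel stack. The function name threads_with_frame and whatever needle is passed in are illustrative choices, not something taken from this report.

```python
import re
from typing import Iterator, Tuple

# A data row is assumed to begin with two integers (PID, TID).
# The header row does not match and is skipped automatically.
_ROW = re.compile(r"^\s*(\d+)\s+(\d+)\s+(.*)$")


def threads_with_frame(procstat_text: str, needle: str) -> Iterator[Tuple[int, int, str]]:
    """Yield (pid, tid, rest_of_row) for rows whose remainder contains needle.

    The remainder covers the command, thread name, and stack columns, so the
    needle is matched against all three, not only the stack.
    """
    for line in procstat_text.splitlines():
        match = _ROW.match(line)
        if match and needle in match.group(3):
            yield int(match.group(1)), int(match.group(2)), match.group(3)
```

Feeding it the saved dump and a function name of interest gives a short list of candidate threads to inspect by hand. It is a plain text filter and does no lock analysis.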
Comment 4 Julien Cigar 2020-03-11 09:02:04 UTC
Thanks for your reply, kib. I've attached the full procstat -kk -a output.
Comment 5 Julien Cigar 2020-03-11 09:02:49 UTC
Created attachment 212324
procstat -kk -a
Comment 6 Konstantin Belousov freebsd_committer 2020-03-11 22:33:14 UTC
I think I see the problem.

The proposed (untested) patch for CURRENT is at
https://reviews.freebsd.org/D24038

Since it does not apply to stable/12 without conflicts and needed a lot of rewriting, I attached a port of the patch to 12. It is untested as well. Please try it and report the results.
Comment 7 Konstantin Belousov freebsd_committer 2020-03-11 22:33:50 UTC
Created attachment 212343
Port of D24038 to stable/12
Comment 8 Julien Cigar 2020-03-12 09:02:03 UTC
(In reply to Konstantin Belousov from comment #6)

Thanks for your time and analysis! Unfortunately, I can hardly test it on the affected host, as it is a production machine, but I'll try to reproduce the problem on another machine (with a 12-STABLE kernel).

Do you have any idea what I could do to reproduce the problem faster? (It takes several days on the heavily loaded production machine before it happens.)

Are UFS and NFS required to reproduce the issue?
Comment 9 Konstantin Belousov freebsd_committer 2020-03-12 16:11:54 UTC
(In reply to Julien Cigar from comment #8)
It is sendfile over an NFS file that triggers the issue. You need to hit a very specific layout of cached vs. non-cached pages for this file to run into it.
Comment 10 Julien Cigar 2020-03-12 16:27:57 UTC
(In reply to Konstantin Belousov from comment #9)

OK, I'll try to isolate the webapp, raise the number of worker processes to something like 1000, and hammer the webapp heavily to reproduce the issue.
Comment 11 Konstantin Belousov freebsd_committer 2020-03-12 21:22:41 UTC
(In reply to Julien Cigar from comment #10)
If you just increase parallelism but do the sendfile over the same file, I doubt the issue will be easier to reproduce; quite the contrary. The problem appears when some pages in the file cache are reclaimed and sendfile(2) is then called over a region that contains holes in the cache. So I would suggest not overdoing the parallelism in your app; some modest memory pressure might be more useful instead.
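
A minimal sketch of the sendfile half of such a reproducer, written in Python since the affected webapp is a Python one. It assumes socket.sendfile() ends up in sendfile(2) (CPython uses os.sendfile when it is available and otherwise falls back to plain send()), and the path argument stands in for a file on the NFS mount. The helper name sendfile_once is made up for illustration. The sketch does not create memory pressure or evict cached pages; that has to come from something else on the host.

```python
import socket
import threading
from typing import Optional


def sendfile_once(path: str, offset: int = 0, count: Optional[int] = None) -> int:
    """Send part of the file at path over a local socket pair.

    Returns the number of bytes the receiving end read. With count=None the
    file is sent from offset to its end.
    """
    left, right = socket.socketpair()  # blocking stream sockets on Unix
    received = 0

    def drain() -> None:
        # Read until the sender closes its end, so sendfile never blocks
        # on a full socket buffer.
        nonlocal received
        while True:
            chunk = right.recv(65536)
            if not chunk:
                break
            received += len(chunk)

    reader = threading.Thread(target=drain)
    reader.start()
    try:
        with open(path, "rb") as f:
            left.sendfile(f, offset=offset, count=count)
    finally:
        left.close()   # signals EOF to the reader thread
        reader.join()
        right.close()
    return received
```

Calling it in a loop with varying offset and count values over a large NFS-backed file, while something else applies modest memory pressure, is one way to try this. Whether that produces the required layout of cached and non-cached pages is not guaranteed.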
Comment 12 commit-hook freebsd_committer 2020-03-30 22:03:21 UTC
A commit references this bug:

Author: kib
Date: Mon Mar 30 21:42:47 UTC 2020
New revision: 359464
URL: https://svnweb.freebsd.org/changeset/base/359464

Log:
  buffer pager: skip bogus pages.

  We cannot validate bogus page by reading a buffer.

  PR:	244713
  Reviewed by:	glebius, markj
  Tested by:	pho
  Sponsored by:	The FreeBSD Foundation
  MFC after:	1 week
  Differential revision:	https://reviews.freebsd.org/D24038

Changes:
  head/sys/kern/vfs_bio.c
Comment 13 commit-hook freebsd_committer 2020-03-30 22:24:25 UTC
A commit references this bug:

Author: kib
Date: Mon Mar 30 22:13:32 UTC 2020
New revision: 359473
URL: https://svnweb.freebsd.org/changeset/base/359473

Log:
  kern_sendfile.c: fix bugs with handling of busy page states.

  - Do not call into a vnode pager while leaving some pages from the
    same block as the current run, xbusy. This immediately deadlocks if
    pager needs to instantiate the buffer.
  - Only relookup bogus pages after io finished, otherwise we might
    obliterate the valid pages by out of date disk content.  While there,
    expand the comment explaining this pecularity.
  - Do not double-unbusy on error.  Split unbusy for error case, which
    is left in the sendfile_swapin(), from the more properly coded
    normal case in sendfile_iodone().
  - Add an XXXKIB comment explaining the serious bug in the validation
    algorithm, not fixed by this patch series.

  PR:	244713
  Reviewed by:	glebius, markj
  Tested by:	pho
  Sponsored by:	The FreeBSD Foundation
  MFC after:	1 week
  Differential revision:	https://reviews.freebsd.org/D24038

Changes:
  head/sys/kern/kern_sendfile.c
Comment 14 Julien Cigar 2020-04-02 08:43:55 UTC
Thank you very much for solving the issue and the time spent.
Comment 15 commit-hook freebsd_committer 2020-04-06 18:47:48 UTC
A commit references this bug:

Author: kib
Date: Mon Apr  6 18:47:16 UTC 2020
New revision: 359664
URL: https://svnweb.freebsd.org/changeset/base/359664

Log:
  MFC r359464:
  buffer pager: skip bogus pages.

  PR:	244713

Changes:
_U  stable/12/
  stable/12/sys/kern/vfs_bio.c