Bug 242272 - LinuxKPI combines all RCU and SRCU domains together, leading to deadlock
Status: In Progress
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: CURRENT
Hardware: Any Any
Importance: --- Affects Some People
Assignee: Hans Petter Selasky
Reported: 2019-11-27 18:57 UTC by Andrew Boyer
Modified: 2020-03-03 18:43 UTC


Attachments
Possible patch (11.92 KB, text/plain)
2020-03-03 18:43 UTC, Andrew Boyer

Description Andrew Boyer 2019-11-27 18:57:21 UTC
In Linux, all RCU code shares one domain, and each SRCU user gets its own domain (more or less). In FreeBSD they are all combined into a single domain, since the LinuxKPI has only a bare-bones implementation.

This is a problem because ib_uverbs holds an SRCU read lock when calling into the provider's destroy_cq function. Providers may expect to be able to use RCU primitives when tearing down, but calling synchronize_rcu() or synchronize_srcu() from there will deadlock, even when the call targets a completely separate SRCU domain.

Fixing this will require adding real multiple-domain support to the LinuxKPI.
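
The failure mode described above can be sketched with a small userland model. This is only an illustration under stated assumptions: `struct domain`, the `model_*` helpers, and `destroy_cq_would_deadlock()` are hypothetical names invented for this sketch and are not part of the LinuxKPI or of any provider. The model counts read-side sections per domain and reports whether a synchronize call would have to wait on a reader that is the caller itself.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical userland model of a read-side domain. This is NOT the
 * LinuxKPI implementation; it only counts active read-side sections.
 */
struct domain {
	int readers;
};

static void
model_read_lock(struct domain *d)
{
	d->readers++;
}

static void
model_read_unlock(struct domain *d)
{
	assert(d->readers > 0);
	d->readers--;
}

/*
 * A real synchronize call sleeps until every reader in the domain has
 * left. The model only reports whether such a call would have to wait.
 */
static bool
model_synchronize_would_wait(const struct domain *d)
{
	return (d->readers > 0);
}

/*
 * Models the destroy_cq path: the caller enters an SRCU read section,
 * then the provider synchronizes on its own domain from inside it.
 * Waiting on a reader that is the caller itself can never finish.
 */
static bool
destroy_cq_would_deadlock(struct domain *uverbs_srcu,
    struct domain *provider_domain)
{
	bool stuck;

	model_read_lock(uverbs_srcu);
	stuck = model_synchronize_would_wait(provider_domain);
	model_read_unlock(uverbs_srcu);
	return (stuck);
}
```

Passing the same `struct domain` for both arguments models the combined behavior and returns true; passing two distinct domains models separated domains and returns false.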
Comment 1 commit-hook freebsd_committer 2019-11-29 13:34:15 UTC
A commit references this bug:

Author: pkubaj
Date: Fri Nov 29 13:33:14 UTC 2019
New revision: 518649
URL: https://svnweb.freebsd.org/changeset/ports/518649

Log:
  games/stuntrally: fix build on PPC with clang and ARM

  Upstream PR:
  https://github.com/stuntrally/stuntrally/pull/18

  We had until now CXXFLAGS_gcc=-Wno-narrowing, but it looks like this was incorrect because it did not fix the original issue.

  PR:		242272
  Approved by:	linimon (mentor), amdmi3 (maintainer)

Changes:
  head/games/stuntrally/Makefile
  head/games/stuntrally/files/patch-source_editor_CApp.h
Comment 2 Hans Petter Selasky freebsd_committer 2020-03-03 15:54:25 UTC
Sorry I didn't see this earlier on. Has this deadlock been observed in practice?
Comment 3 Andrew Boyer 2020-03-03 15:56:56 UTC
It was easy to hit in code that we have not yet upstreamed.

The design has changed in the meantime and we're not using RCU to protect our QP/CQ objects any more.

This was a right pain to debug though - maybe it could be documented as an XXX somewhere?
Comment 4 Hans Petter Selasky freebsd_committer 2020-03-03 16:01:27 UTC
Would it help if SRCU and RCU were separated, so that all SRCU users shared one domain and all RCU users shared another?

The RCU implementation in the LinuxKPI does not have as many asserts and debug knobs as the EPOCH() implementation in the kernel.

--HPS
Comment 5 Andrew Boyer 2020-03-03 18:43:14 UTC
That would work for what we tried to do. SRCU should still be marked XXX IMO.
Comment 6 Andrew Boyer 2020-03-03 18:43:55 UTC
Created attachment 212127
Possible patch

Attached is a quick & dirty split of RCU/SRCU.