In that other OS, all RCU code is in the one domain, and each SRCU user gets its own domain. (More or less.) In FreeBSD they are all combined into a single domain, since the LinuxKPI has only a bare-bones implementation.
This is a problem because ib_uverbs holds an SRCU read lock when calling into the provider's destroy_cq function. Providers may expect to be able to use RCU primitives when tearing down, but calling synchronize_rcu() or synchronize_srcu() will lead to a deadlock, even on a completely separate SRCU domain.
Fixing this will require adding real multiple-domain support to the LinuxKPI.
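The failure mode described above can be sketched with a toy model. This is not the LinuxKPI code: `toy_domain` and the `toy_*` helpers are hypothetical names invented for illustration, and a "domain" is reduced to a simple count of active readers. The sketch only shows why a caller that holds a read lock and then waits for a grace period in the same domain ends up waiting for itself, while a separate domain would not have that problem.

```c
#include <stdbool.h>

/* Toy model (hypothetical, not the LinuxKPI): a grace-period domain is
 * reduced to a count of readers currently inside a read-side section. */
struct toy_domain {
    int active_readers;
};

static void toy_read_lock(struct toy_domain *d)   { d->active_readers++; }
static void toy_read_unlock(struct toy_domain *d) { d->active_readers--; }

/* A real synchronize call would sleep until all pre-existing readers in
 * the domain have left. Here we only report whether it would have to wait. */
static bool toy_synchronize_would_block(const struct toy_domain *d)
{
    return d->active_readers > 0;
}

/* Models the scenario from the report: the caller holds a read lock on
 * `held` and, while still holding it, the callee waits for a grace period
 * on `waited`. If the wait can only end when the caller itself drops the
 * read lock, the thread is stuck forever. */
static bool toy_destroy_would_deadlock(struct toy_domain *held,
                                       struct toy_domain *waited)
{
    bool stuck;

    toy_read_lock(held);
    stuck = toy_synchronize_would_block(waited);
    toy_read_unlock(held);
    return stuck;
}
```

Passing the same domain for both arguments models the current combined setup and reports a deadlock. Passing two distinct domains models what real multiple-domain support would allow, and the wait does not depend on the caller's own read lock.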
A commit references this bug:
Date: Fri Nov 29 13:33:14 UTC 2019
New revision: 518649
games/stuntrally: fix build on PPC with clang and ARM
We had until now CXXFLAGS_gcc=-Wno-narrowing, but it looks like this was incorrect because it did not fix the original issue.
Approved by: linimon (mentor), amdmi3 (maintainer)
Sorry I didn't see this earlier on. Has this deadlock been observed in practice?
It was easy to hit in code that we have not yet upstreamed.
The design has changed in the meantime and we're not using RCU to protect our QP/CQ objects any more.
This was a right pain to debug though - maybe it could be documented as an XXX somewhere?
Would it help if SRCU and RCU were separated, so that all SRCU users share one domain and all RCU users share another?
The RCU implementation in the LinuxKPI does not have as many asserts and debug knobs as the EPOCH() implementation in the kernel.
That would work for what we tried to do. SRCU should still be marked XXX IMO.
Created attachment 212127
Attached is a quick & dirty split of RCU/SRCU.