Summary: | [lor] Possibly two LORs: entropy harvest mutex and scrlock, and entropy harvest mutex and sleepq chain | ||||||
---|---|---|---|---|---|---|---|
Product: | Base System | Reporter: | Ellis H. Wilson III <ellisw> | ||||
Component: | kern | Assignee: | freebsd-bugs (Nobody) <bugs> | ||||
Status: | Closed Overcome By Events | ||||||
Severity: | Affects Many People | CC: | delphij, ellisw, jhb, markm, rpokala | ||||
Priority: | --- | ||||||
Version: | 10.1-RELEASE | ||||||
Hardware: | amd64 | ||||||
OS: | Any | ||||||
Attachments: |
|
Description
Ellis H. Wilson III
2014-11-21 20:58:53 UTC
Also, as a brief explanation as to why I marked it "Affects Many People," I have been able to reproduce this bug on real hardware and on 10.0 as have my colleagues in our testing of FreeBSD 10.0 and 10.1. So I am strongly led to believe the only reason Many People aren't seeing it is because of the decision to default to testing with WITNESS + WITNESS_SKIPSPIN. That doesn't mean this bug isn't affecting each and every one of them, it sure is; it simply may not be causing any major problems yet. Created attachment 149695 [details]
A possible fix
I think these are false positives. harvest_mtx is a spinning mutex that is only acquired in random_harvestq.c, and it's already done quite carefully. The problem is raised when calling msleep_spin_sbt(), which in turn tries to acquire sleepq chain lock (sys/kern/subr_sleepqueue.c). In witness, the "blessed" order is sleepq chain and then entropy harvest mutex, this is not right (the system does not poke with entropy harvesting when manipulating sleepq chain), so you see the second LOR warning. When it tries to print the LOR warning, the code eventually calls printf() which in turn would go to syscons(4) where scrlock is acquired. The defined lock order wants scrlock be acquired before harvest_mtx and therefore you would see the first LOR warning. Could you please try the attached patch and see if it solves the problem? What it does is to move entropy mutex slightly higher, allowing it to be held before acquiring sleepq chain lock. This would eliminate the (false) LOR warning and make the first LOR go away at the same time. Adding jhb@ and markm@ for review. Many thanks to Xin Li for the rapid patch and excellent explanation! I can confirm with the attached patch applied to my 10.1 release bits, and the kernel recompiled and reinstalled, we boot without panic. Tested via ten reboots of my VM instance of 10.1 with the patch, with both clean and unclean reboots. I'm not sure what protocol is for declaring this fixed or not as this is my first bug report for FreeBSD. So I apologize if I should be marking this closed or what. Please advise. Last, it would be ideal in my (perhaps naive) mind if we could return to a default kernel configuration without WITNESS_SKIPSPIN so these bugs are given attention naturally, rather than people having to run a non-standard configuration in order to expose them. Thoughts? I believe we have a patch here that works, at least per my testing. What are the correct next steps in terms of moving this along? If this fixes the LOR report, then this looks good to me. Xin, can you commit this? (In reply to John Baldwin from comment #6) The code in question have received a major refactor and the issue no longer exists (in r262496: mutex is no longer held in random_kthread before sleeping). In the new world order, the spin lock is only acquired as a leaf mutex in random_harvestq_internal() and that would avoided all LOR cases (I would probably move get_cyclecount() out from the locked scope but that's a different issue). So I think the _right_ fix is probably just merge the refactor back to stable/10 instead. Ok, that sound good to me then. Since this no longer impacts -HEAD, but *does* impact stable/10, does anyone object to me applying Xin's patch directly to stable/10? If that's okay, I'll try to do it before the 10.3 slush |