Bug 260591 - kern/subr_bus.c: device_probe_and_{attach|detach}: Insufficient multi-thread protection causes crash
Summary: kern/subr_bus.c: device_probe_and_{attach|detach}: Insufficient multi-thread ...
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: 13.0-RELEASE
Hardware: Any Any
: --- Affects Some People
Assignee: Warner Losh
URL:
Keywords: crash, needs-qa
Depends on:
Blocks:
 
Reported: 2021-12-21 16:17 UTC by ghuckriede
Modified: 2024-10-04 11:08 UTC (History)
3 users (show)

See Also:
koobs: maintainer-feedback? (imp)
koobs: maintainer-feedback? (mw)


Attachments
main-n251848-3e01ee76f20 Panic (13.12 KB, text/plain)
2021-12-21 16:17 UTC, ghuckriede
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description ghuckriede 2021-12-21 16:17:14 UTC
Created attachment 230290 [details]
main-n251848-3e01ee76f20 Panic

The protection for probe/attach/detach appears to be the 'Giant' mutex lock.  This is insufficient to protect against multiple thread attaching/detaching at the same time.  Some attach/detach actions require "sleeps", this releases the 'Giant' lock, allowing other threads to also attach/detach.  There are no checks for `DS_ATTACHING` state in device_attach() and there is currently no `DS_DETACHING` state to check in device_detach().

Steps to Reproduce:
This is easily reproduced with devices that "sleep" during attach/detach, as Giant lock is dropped in these cases.

https://www.freebsd.org/cgi/man.cgi?locking(9). "Giant is dropped during unbounded sleeps and reacquired after wakeup." 

igb devices enter e1000_get_cfg_done_i210()->safe_pause_sbt()->pause_sbt() during detach, thus releasing the Giant lock.  As show in the attached backtrace.

root@FBSDCURRENT:/ # devinfo -v | grep igb0
            igb0 pnpinfo vendor=0x8086 device=0x1533 subvendor=0x8086 subdevice=0x0002 class=0x020000 at slot=0 function=0 dbsf=pci0:6:0:0 handle=\_SB_.PCI0.RP05.PXSX
root@FBSDCURRENT:/ #

Terminal #1:
# sh
root@FBSDCURRENT:/ # while [ true ] ; do devctl attach pci0:6:0:0;devctl detach pci0:6:0:0;done

Terminal #2:
# sh
root@FBSDCURRENT:/ # while [ true ] ; do devctl attach pci0:6:0:0;devctl detach pci0:6:0:0;done

Actual Results:
This causes an immediate panic when Terminal #2 starts loop.

A backtrace of the 2 threads both running the devctl's devctl2_ioctl() is attached (locally build kernel). `vmcore` file from "13.0-RELEASE" can also be provided.

Expected Results:
The kernel should not panic/crash.  First thread should complete.  The second thread should return an error (or wait until the other thread is complete).

Build Date & Hardware: 
Locally compiled with git updated Dec 21 2021:

FreeBSD FBSDCURRENT 14.0-CURRENT FreeBSD 14.0-CURRENT #3 main-n251848-3e01ee76f20: Tue Dec 21 00:33:45 EST 2021     root@FBSDCURRENT:/usr/obj/usr/src/amd64.amd64/sys/GENERIC  amd64

Also reproducible on 13.0-RELEASE (Downloaded/Updated):

FreeBSD TrafficHammerTwoHanded 13.0-RELEASE FreeBSD 13.0-RELEASE #0 releng/13.0-n244733-ea31abc261f: Fri Apr  9 04:24:09 UTC 2021     root@releng1.nyi.freebsd.org:/usr/obj/usr/src/amd64.amd64/sys/GENERIC  amd64
Comment 1 Kubilay Kocak freebsd_committer freebsd_triage 2021-12-21 22:11:28 UTC
Some recent work [1] with Giant replacement in subr_bus and portentially relatged work with respect to [2], which mentions "patch works around the real issue, which is a locking deficiency of the busses.

Request feedback from those committers accordingly. 

[1] https://reviews.freebsd.org/D31833
[2] https://reviews.freebsd.org/D19375

^Triage: Set version to earliest supported and confirmed reproduction.
Comment 2 Warner Losh freebsd_committer freebsd_triage 2021-12-21 22:22:40 UTC
This is a know bug since FreeBSD 4.0. devctl just makes it a lot easier to tickle. As you note, the only real way to fix it is to move away from Giant, but even that exposes a few holes w/o additional work.
Comment 3 ghuckriede 2021-12-22 14:53:05 UTC
Is there a time frame for a fix?
Comment 4 Warner Losh freebsd_committer freebsd_triage 2021-12-22 15:02:00 UTC
The general fix is not simple. Simple hacks might limit the blast radius.. loops with devctl won't work for some time because it does unnatural things that normal drivers never do. I'll be concentrating on the typical scenarios first...
Comment 5 Mark Linimon freebsd_committer freebsd_triage 2024-10-04 11:08:21 UTC
^Triage: clear unneeded flags.  Nothing has yet been committed to be merged.