Bug 250309

Summary: devmatch: panic: general protection fault: sysctl_devices() -> sysctl_root_handler_locked() in hw.bus.devices sysctl handler when plugging in USB mouse
Product: Base System Reporter: sigsys
Component: kernAssignee: freebsd-bugs (Nobody) <bugs>
Status: Open ---    
Severity: Affects Some People CC: emaste, hselasky, imp, usb
Priority: --- Keywords: crash, needs-qa
Version: CURRENTFlags: koobs: maintainer-feedback? (imp)
koobs: maintainer-feedback? (hselasky)
koobs: mfc-stable13?
koobs: mfc-stable12?
Hardware: Any   
OS: Any   
URL: https://twitter.com/ngortheone/status/1566802139858763776
Attachments:
Description Flags
updated patch none

Description sigsys 2020-10-12 23:01:52 UTC
I had panics on boot related to the hw.bus.devices sysctl.

[15] Fatal trap 9: general protection fault while in kernel mode
[15] cpuid = 7; apic id = 09
[15] instruction pointer   = 0x20:0xffffffff80c167fc
[15] stack pointer         = 0x28:0xfffffe00e78b8860
[15] frame pointer         = 0x28:0xfffffe00e78b88a0
[15] code segment          = base rx0, limit 0xfffff, type 0x1b
[15]                       = DPL 0, pres 1, long 1, def32 0, gran 1
[15] processor eflags      = interrupt enabled, resume, IOPL = 0
[15] current process       = 9241 (devmatch)
[15] trap number           = 9
[15] panic: general protection fault
[15] cpuid = 7
[15] time = 1598332353
[15] KDB: stack backtrace:
[15] db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00e78b8570
[15] vpanic() at vpanic+0x182/frame 0xfffffe00e78b85c0
[15] panic() at panic+0x43/frame 0xfffffe00e78b8620
[15] trap_fatal() at trap_fatal+0x387/frame 0xfffffe00e78b8680
[15] trap() at trap+0xa4/frame 0xfffffe00e78b8790
[15] calltrap() at calltrap+0x8/frame 0xfffffe00e78b8790
[15] --- trap 0x9, rip = 0xffffffff80c167fc, rsp = 0xfffffe00e78b8860, rbp = 0xfffffe00e78b88a0 ---
[15] sysctl_devices() at sysctl_devices+0x21c/frame 0xfffffe00e78b88a0
[15] sysctl_root_handler_locked() at sysctl_root_handler_locked+0x9c/frame 0xfffffe00e78b88f0
[15] sysctl_root() at sysctl_root+0x20a/frame 0xfffffe00e78b8970
[15] userland_sysctl() at userland_sysctl+0x17d/frame 0xfffffe00e78b8a20
[15] sys___sysctl() at sys___sysctl+0x5f/frame 0xfffffe00e78b8ad0
[15] amd64_syscall() at amd64_syscall+0x140/frame 0xfffffe00e78b8bf0
[15] fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe00e78b8bf0
[15] --- syscall (202, FreeBSD ELF64, sys___sysctl), rip = 0x1df89305e12a, rsp = 0x7fffffe4b628, rbp = 0x7fffffe4b660 

It would happen from time to time before but it started happening more and more until I pretty much wasn't able to boot to multi-user mode anymore.  Turns out I had a USB mouse that did that.  The mouse would constantly disconnect and reconnect from the USB and it must have been triggering race conditions in that sysctl's handler.

The following patch fixed it for me but not sure if correct.  It's been working fine for more than a month though (so I don't have to unplug my mouse to be able to boot).


diff --git a/sys/kern/subr_bus.c b/sys/kern/subr_bus.c
index 78d07796659..c42160162ed 100644
--- a/sys/kern/subr_bus.c
+++ b/sys/kern/subr_bus.c
@@ -854,6 +854,7 @@ devctl_safe_quote_sb(struct sbuf *sb, const char *src)
 /* End of /dev/devctl code */
 
 static TAILQ_HEAD(,device)	bus_data_devices;
+static struct sx		bus_data_sx;
 static int bus_data_generation = 1;
 
 static kobj_method_t null_methods[] = {
@@ -1817,7 +1818,9 @@ make_device(device_t parent, const char *name, int unit)
 
 	dev->state = DS_NOTPRESENT;
 
+	sx_xlock(&bus_data_sx);
 	TAILQ_INSERT_TAIL(&bus_data_devices, dev, devlink);
+	sx_xunlock(&bus_data_sx);
 	bus_data_generation_update();
 
 	return (dev);
@@ -1957,9 +1960,11 @@ device_delete_child(device_t dev, device_t child)
 		devclass_delete_device(child->devclass, child);
 	if (child->parent)
 		BUS_CHILD_DELETED(dev, child);
+	sx_xlock(&bus_data_sx);
 	TAILQ_REMOVE(&dev->children, child, link);
 	TAILQ_REMOVE(&bus_data_devices, child, devlink);
 	kobj_delete((kobj_t) child, M_BUS);
+	sx_xunlock(&bus_data_sx);
 
 	bus_data_generation_update();
 	return (0);
@@ -5165,6 +5170,7 @@ root_bus_module_handler(module_t mod, int what, void* arg)
 	switch (what) {
 	case MOD_LOAD:
 		TAILQ_INIT(&bus_data_devices);
+		sx_init(&bus_data_sx, "bus_data_sx");
 		kobj_class_compile((kobj_class_t) &root_driver);
 		root_bus = make_device(NULL, "root", 0);
 		root_bus->desc = "System root bus";
@@ -5507,19 +5513,24 @@ sysctl_devices(SYSCTL_HANDLER_ARGS)
 	/*
 	 * Scan the list of devices, looking for the requested index.
 	 */
+	sx_slock(&bus_data_sx);
 	TAILQ_FOREACH(dev, &bus_data_devices, devlink) {
 		if (index-- == 0)
 			break;
 	}
-	if (dev == NULL)
-		return (ENOENT);
+	if (dev == NULL) {
+		error = ENOENT;
+		goto out;
+	}
 
 	/*
 	 * Populate the return item, careful not to overflow the buffer.
 	 */
 	udev = malloc(sizeof(*udev), M_BUS, M_WAITOK | M_ZERO);
-	if (udev == NULL)
-		return (ENOMEM);
+	if (udev == NULL) {
+		error = ENOMEM;
+		goto out;
+	}
 	udev->dv_handle = (uintptr_t)dev;
 	udev->dv_parent = (uintptr_t)dev->parent;
 	udev->dv_devflags = dev->devflags;
@@ -5550,6 +5561,7 @@ sysctl_devices(SYSCTL_HANDLER_ARGS)
 		error = SYSCTL_OUT(req, udev, sizeof(*udev));
 	sbuf_delete(&sb);
 	free(udev, M_BUS);
+out:	sx_sunlock(&bus_data_sx);
 	return (error);
 }
 
@@ -5586,11 +5598,13 @@ device_lookup_by_name(const char *name)
 {
 	device_t dev;
 
+	sx_slock(&bus_data_sx);
 	TAILQ_FOREACH(dev, &bus_data_devices, devlink) {
 		if (dev->nameunit != NULL && strcmp(dev->nameunit, name) == 0)
-			return (dev);
+			break;
 	}
-	return (NULL);
+	sx_sunlock(&bus_data_sx);
+	return (dev);
 }
 
 /*
Comment 1 Kubilay Kocak freebsd_committer freebsd_triage 2022-09-07 00:01:19 UTC
Received report of what appears to be the same issue from a user on Twitter. Have requested they attach system information and crash traceback, pending.

We have a patch in comment 0, which has one confirmation of resolution, but may require a rebase against latest CURRENT.

@Reporter If you're able to produce an updated patch, please do so, thank you

^Triage: Request feedback from folks recently in and around kern/subr_bus, devmatch and USB
Comment 2 Warner Losh freebsd_committer freebsd_triage 2022-09-07 00:19:01 UTC
I'll take a look at this. I think this is due to the nature of Giant unlocking itself when we think it is protecting the bus stuff.

I have a different patch than sigsys' patch that converts Giant into a real lock, which sigsys' patch doesn't quite do. I worry that it will cause a lot of other problems if we don't do that and/or expose deadlock situations in other drivers / busses. All the places that use the lock *should* have giant locked already.
Comment 3 sigsys 2022-09-07 00:30:49 UTC
Created attachment 236405 [details]
updated patch

Here's the refreshed patch in case it helps until a better fix is developed.  Looking at it, I suspect it would be better to unlock before calling kobj_delete() in device_delete_child()...  But I no longer have the panic-inducing mouse to test it.  It gave out.
Comment 4 Hans Petter Selasky freebsd_committer freebsd_triage 2022-09-07 06:44:44 UTC
Warner: It's just a list, why can't you use a regular mutex to protect it?

--HPS
Comment 5 Hans Petter Selasky freebsd_committer freebsd_triage 2022-09-07 06:49:40 UTC
(In reply to Hans Petter Selasky from comment #4)

Warner: Looked at the wrong patch.
Comment 6 Warner Losh freebsd_committer freebsd_triage 2022-09-07 22:07:01 UTC
I think we can just switch the bus lock routines over to using an sx lock w/o adding or removing other locking which may be tricky for some of the consumers of newbus (they already cope with Giant though, so a giant-like sx lock I think will be fine).
Comment 7 Warner Losh freebsd_committer freebsd_triage 2022-09-07 22:08:16 UTC
I've been looking for a way to reproduce this locally, since otherwise it's very hard to test that the fix works for sure... and my new-bus pessimism leads me to believe that any fix in this area will lead to further problem reports, maybe delayed by weeks or months.