The deadlock is between the geom code and the cam code. It occurred when a fibre channel cable was removed while a user process was still accessing a disk through it.

The system is set up to do a 'camcontrol rescan' upon indication from the HBA driver that the storage devices in the system may have changed. 'camcontrol rescan' triggers a succession of SCSI commands that are driven by the cambio/camisr() software interrupt. When the cable was unplugged, this led to cambio calling disk_destroy() on the disks that were now lost. disk_destroy() led to an attempt to acquire topology_lock() in the g_event thread.

Meanwhile, the user app (dd) received an I/O error and closed the device. This led to a call to g_dev_close(), which acquired topology_lock() and then went down to daclose(), which sent a SCSI SYNC_CACHE command and waited for the command to complete. The SYNC_CACHE command completes, but the syscall is never told so by cambio, which is frozen waiting for the lock that the syscall is holding.

Fix:

One perspective on this is that cambio inverted the layers; normally, geom code calls cam code, but in the 'camcontrol rescan' case, cam code calls geom code, resulting in locks being taken in opposite order. Perhaps disk_destroy() could just queue to g_event and not wait for completion.

How-To-Repeat:

Do 'camcontrol rescan' either continuously or upon driver notification of changes. Set up a bunch of processes (I was using 'dd') to read a removable disk, then remove it while the processes are running. There may also be a scenario with disk_create().
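The suggestion under Fix: can be illustrated with a small user-space model. This is only a sketch built on POSIX threads, not kernel code: the three threads, the flag names and run_scenario() are hypothetical stand-ins for g_dev_close()/daclose(), cambio and g_event, and the model assumes that a non-blocking disk_destroy() would be acceptable to its callers. The interleaving is forced to the worst case, where the closing thread already holds the topology lock before the destroy is requested.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* User-space stand-ins; none of these are the real kernel objects. */
static pthread_mutex_t topology_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t state_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t state_cv = PTHREAD_COND_INITIALIZER;

static bool close_holds_topology; /* closer has taken topology_lock */
static bool sync_cache_done;      /* "cambio" delivered the completion */
static bool destroy_queued;       /* event pending for the event thread */
static bool disk_destroyed;       /* event thread finished the teardown */

static void set_flag(bool *flag)
{
    pthread_mutex_lock(&state_lock);
    *flag = true;
    pthread_cond_broadcast(&state_cv);
    pthread_mutex_unlock(&state_lock);
}

static void wait_flag(const bool *flag)
{
    pthread_mutex_lock(&state_lock);
    while (!*flag)
        pthread_cond_wait(&state_cv, &state_lock);
    pthread_mutex_unlock(&state_lock);
}

/* Models g_dev_close() -> daclose(): hold the lock, wait for SYNC_CACHE. */
static void *closer_thread(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&topology_lock);
    set_flag(&close_holds_topology);
    wait_flag(&sync_cache_done);
    pthread_mutex_unlock(&topology_lock);
    return NULL;
}

/* Models cambio: queue the destroy, do not wait, keep completing I/O. */
static void *cambio_thread(void *arg)
{
    (void)arg;
    wait_flag(&close_holds_topology); /* force the worst-case ordering */
    set_flag(&destroy_queued);        /* hypothetical non-blocking destroy */
    set_flag(&sync_cache_done);       /* completion still reaches the closer */
    return NULL;
}

/* Models g_event: the only thread that takes the lock for the teardown. */
static void *event_thread(void *arg)
{
    (void)arg;
    wait_flag(&destroy_queued);
    pthread_mutex_lock(&topology_lock); /* blocks until the closer is done */
    disk_destroyed = true;
    pthread_mutex_unlock(&topology_lock);
    return NULL;
}

/* Runs the scenario once; returns true if the teardown completed. */
bool run_scenario(void)
{
    pthread_t closer, cambio, event;

    close_holds_topology = sync_cache_done = false;
    destroy_queued = disk_destroyed = false;

    pthread_create(&closer, NULL, closer_thread, NULL);
    pthread_create(&cambio, NULL, cambio_thread, NULL);
    pthread_create(&event, NULL, event_thread, NULL);

    pthread_join(closer, NULL);
    pthread_join(cambio, NULL);
    pthread_join(event, NULL);
    return disk_destroyed;
}
```

In this model, if cambio_thread() instead waited for disk_destroyed before setting sync_cache_done, the three threads would block each other in the same way as described above: the closer would hold the lock waiting for the completion, the event thread would wait for the lock, and cambio would wait for the event thread.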
In message <200409231827.i8NIR3TK071354@www.freebsd.org>, Brian Eng writes:

>One perspective on this is that cambio inverted the layers; normally,
>geom code calls cam code, but in the 'camcontrol rescan' case, cam
>code calls geom code, resulting in locks being taken in opposite
>order. Perhaps disk_destroy could just queue to g_event and not
>wait for completion.

A lot of the trouble in this area comes from the fact that the disk_*() API has to isolate Giant-free code from Giant-infected code.

There are no easy fixes to the currently remaining problems; they will take serious work to fix.

I know there is some work on CAM going on behind the scenes and hopefully they will take this sort of thing into account.

--
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk@FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
State Changed
From-To: open->suspended

Mark suspended as there do not seem to be any quick fixes forthcoming.
For bugs matching the following conditions:
- Status == In Progress
- Assignee == "bugs@FreeBSD.org"
- Last Modified Year <= 2017

Do
- Set Status to "Open"
Haven't seen any issues like this in a long time.