Bug 72041 - [cam] [hang] Deadlock when disk is destroyed while user process closes
Status: Closed Overcome By Events
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 5.2.1-RELEASE
Hardware: Any Any
Importance: Normal Affects Only Me
Assignee: freebsd-bugs (Nobody)
 
Reported: 2004-09-23 19:30 UTC by Brian Eng
Modified: 2018-05-21 07:34 UTC (History)

Description Brian Eng 2004-09-23 19:30:27 UTC
The deadlock is between the GEOM code and the CAM code.  It occurred when a Fibre Channel cable was unplugged while a user process was still accessing a disk through it.

The system is set up to run 'camcontrol rescan' whenever the HBA driver indicates that the storage devices in the system may have changed.  'camcontrol rescan' triggers a succession of SCSI commands that are driven by the cambio/camisr() software interrupt.  When the cable was unplugged, cambio called disk_destroy() on the disks that were now lost.  disk_destroy() in turn caused the g_event thread to try to acquire topology_lock(), while cambio waited for that work to complete.

Meanwhile, the user application (dd) received an I/O error and closed the device.  The close called g_dev_close(), which acquired topology_lock() and then called down into daclose(), which sent a SCSI SYNC_CACHE command and waited for it to complete.

The SYNC_CACHE command completes, but cambio never notifies the syscall, because cambio is itself blocked waiting for the lock that the syscall is holding.
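
The circular wait described above can be written down as a small wait-for graph.  This is only a model of the report, not kernel code: the enum labels and function names are made up for illustration, and the cambio -> g_event edge assumes that disk_destroy() blocks until the g_event thread has finished, as the report implies.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical labels for the three contexts named in the report. */
enum ctx { CTX_SYSCALL, CTX_CAMBIO, CTX_GEVENT, CTX_COUNT };

#define WAITS_FOR_NOBODY (-1)

/*
 * waits_for[i] is the context that i is blocked on, or WAITS_FOR_NOBODY.
 * Each context blocks on at most one other, so following the chain from
 * `start` either runs out or comes back to `start`.
 */
bool
in_wait_cycle(const int waits_for[CTX_COUNT], int start)
{
	int cur = waits_for[start];
	size_t hops;

	for (hops = 0; hops < CTX_COUNT && cur != WAITS_FOR_NOBODY; hops++) {
		if (cur == start)
			return (true);
		cur = waits_for[cur];
	}
	return (false);
}

/* The state described in the report. */
bool
report_state_deadlocks(void)
{
	int waits_for[CTX_COUNT];

	/* daclose() waits to be told that SYNC_CACHE has completed. */
	waits_for[CTX_SYSCALL] = CTX_CAMBIO;
	/* Assumption: disk_destroy() waits for the g_event thread. */
	waits_for[CTX_CAMBIO] = CTX_GEVENT;
	/* g_event wants the topology lock taken in g_dev_close(). */
	waits_for[CTX_GEVENT] = CTX_SYSCALL;

	return (in_wait_cycle(waits_for, CTX_SYSCALL));
}

/* The same state if disk_destroy() returned without waiting. */
bool
nonblocking_destroy_deadlocks(void)
{
	int waits_for[CTX_COUNT];

	waits_for[CTX_SYSCALL] = CTX_CAMBIO;
	waits_for[CTX_CAMBIO] = WAITS_FOR_NOBODY;
	waits_for[CTX_GEVENT] = CTX_SYSCALL;

	return (in_wait_cycle(waits_for, CTX_SYSCALL));
}
```

In this model, removing any one edge breaks the cycle; the second function removes the cambio -> g_event edge.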

Fix: 

One perspective on this is that cambio inverted the layers; normally, geom code calls cam code, but in the 'camcontrol rescan' case, cam code calls geom code, resulting in locks being taken in opposite order.  Perhaps disk_destroy could just queue to g_event and not wait for completion.
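
A rough sketch of what "queue to g_event and not wait" might look like, as a single-threaded model.  None of these names exist in the tree: struct topo_lock, struct destroy_queue, disk_destroy_deferred() and event_thread_pass() are hypothetical.  A real event thread would sleep on the lock rather than skip the pass as this model does.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_PENDING 8

/* Hypothetical stand-in for the topology lock. */
struct topo_lock {
	bool held;
};

/* Hypothetical queue of disks waiting to be torn down. */
struct destroy_queue {
	int pending[MAX_PENDING];	/* disk unit numbers */
	size_t count;			/* entries currently queued */
	size_t destroyed;		/* total torn down so far */
};

/*
 * Called from the interrupt side.  Records the request and returns
 * immediately; it never touches the lock and never waits.
 * Returns false only if the queue is full.
 */
bool
disk_destroy_deferred(struct destroy_queue *q, int unit)
{
	if (q->count == MAX_PENDING)
		return (false);
	q->pending[q->count++] = unit;
	return (true);
}

/*
 * One pass of the event thread.  The queue is drained only when the
 * lock is free.  Returns the number of disks torn down in this pass.
 */
size_t
event_thread_pass(struct destroy_queue *q, struct topo_lock *lock)
{
	size_t done;

	if (lock->held)
		return (0);	/* lock is busy, retry on a later pass */
	lock->held = true;
	done = q->count;
	q->destroyed += done;
	q->count = 0;
	lock->held = false;
	return (done);
}
```

Because the interrupt side returns straight away, it remains free to deliver the SYNC_CACHE completion; the closing process can then drop the lock and a later pass of the event thread does the teardown.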
How-To-Repeat: Run 'camcontrol rescan' either continuously or upon driver notification of changes.  Start several processes (I was using 'dd') reading from a removable disk, then remove the disk while the processes are running.

There may also be a similar scenario with disk_create().
Comment 1 Poul-Henning Kamp 2004-09-23 19:48:58 UTC
In message <200409231827.i8NIR3TK071354@www.freebsd.org>, Brian Eng writes:

>One perspective on this is that cambio inverted the layers; normally,
>geom code calls cam code, but in the 'camcontrol rescan' case, cam
>code calls geom code, resulting in locks being taken in opposite
>order.  Perhaps disk_destroy could just queue to g_event and not
>wait for completion.

A lot of the trouble in this area comes from the fact that the disk_*()
API has to isolate Giant-free code from Giant-infected code.

There are no easy fixes for the remaining problems; they will take
serious work to fix.

I know there is some work on CAM going on behind the scenes, and hopefully
it will take this sort of thing into account.

-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk@FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe    
Never attribute to malice what can adequately be explained by incompetence.
Comment 2 Mark Linimon freebsd_committer freebsd_triage 2005-10-24 10:38:26 UTC
State Changed
From-To: open->suspended

Mark suspended as there do not seem to be any quick fixes forthcoming.
Comment 3 Eitan Adler freebsd_committer freebsd_triage 2018-05-20 23:57:14 UTC
For bugs matching the following conditions:
- Status == In Progress
- Assignee == "bugs@FreeBSD.org"
- Last Modified Year <= 2017

Do
- Set Status to "Open"
Comment 4 Andriy Gapon freebsd_committer freebsd_triage 2018-05-21 07:34:29 UTC
Haven't seen any issues like this in a long time.