72041 – [cam] [hang] Deadlock when disk is destroyed while user process closes

Summary: [cam] [hang] Deadlock when disk is destroyed while user process closes

Status:	Closed Overcome By Events

Alias:	None

Product:	Base System
Classification:	Unclassified
Component:	kern
Version:	5.2.1-RELEASE
Hardware:	Any Any

Importance:	Normal Affects Only Me
Assignee:	freebsd-bugs (Nobody)

URL:
Keywords:

Depends on:
Blocks:

Reported:	2004-09-23 19:30 UTC by Brian Eng
Modified:	2018-05-21 07:34 UTC
CC List:	0 users

See Also:

Description Brian Eng 2004-09-23 19:30:27 UTC

The deadlock is between the GEOM code and the CAM code. It occurred when a fibre channel cable was removed while a user process was still accessing a disk through it.

The system is set up to run 'camcontrol rescan' whenever the HBA driver indicates that the storage devices in the system may have changed. 'camcontrol rescan' triggers a succession of SCSI commands that are driven by the cambio/camisr() software interrupt. When the cable was unplugged, cambio called disk_destroy() on the disks that were now lost. disk_destroy() in turn led to an attempt to acquire the GEOM topology lock (topology_lock) in the g_event thread.

Meanwhile, the user app (dd) received an I/O error and closed the device. This led to a call to g_dev_close(), which acquired the topology lock and then went down to daclose(), which sent a SCSI SYNC_CACHE command and waited for the command to complete.

The SYNC_CACHE command completes, but cambio never notifies the syscall, because cambio is itself blocked waiting on the lock that the syscall is holding.
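
The sequence above can be read as a three-way wait-for cycle. The following is a minimal user-space sketch of that reading, not kernel code: the enum labels, the array, and in_wait_cycle() are hypothetical names made up for illustration, and the edge from cambio to g_event assumes that disk_destroy() waits for its event to finish, which is what the report implies.

```c
#include <stdbool.h>

/* Hypothetical labels for the three contexts named in the report. */
enum actor { SYSCALL_DD, CAMBIO_SWI, G_EVENT, NACTORS };

#define WAITS_FOR_NOBODY (-1)

/*
 * waits_for[a] is the actor that 'a' is blocked on, or WAITS_FOR_NOBODY.
 * Each actor waits on at most one other, so following the chain for at
 * most NACTORS steps is enough to find out whether it leads back to 'start'.
 */
static bool
in_wait_cycle(const int waits_for[NACTORS], int start)
{
	int cur = waits_for[start];

	for (int steps = 0; steps < NACTORS && cur != WAITS_FOR_NOBODY; steps++) {
		if (cur == start)
			return (true);
		cur = waits_for[cur];
	}
	return (false);
}

/* Wait-for edges as this sketch reads them from the report (an assumption). */
static const int reported_waits[NACTORS] = {
	/* dd holds the topology lock and sleeps until SYNC_CACHE completion
	 * is delivered by cambio. */
	[SYSCALL_DD] = CAMBIO_SWI,
	/* cambio is inside disk_destroy(), waiting for the g_event thread. */
	[CAMBIO_SWI] = G_EVENT,
	/* g_event wants the topology lock that dd's close path still holds. */
	[G_EVENT] = SYSCALL_DD,
};
```

Calling in_wait_cycle(reported_waits, SYSCALL_DD) follows the chain dd -> cambio -> g_event -> dd and reports a cycle. Removing any one of the three edges breaks it.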

Fix:

One perspective on this is that cambio inverted the layers; normally, geom code calls cam code, but in the 'camcontrol rescan' case, cam code calls geom code, resulting in locks being taken in opposite order. Perhaps disk_destroy could just queue to g_event and not wait for completion.
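
A rough single-threaded model of the "queue and do not wait" idea is sketched below. It is an illustration only and does not use the real GEOM or disk(9) interfaces: struct disk, the sketch_* functions, the fixed-size queue, and the boolean standing in for the topology lock are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_EVENTS 8

/* Hypothetical stand-in for a disk object; not the kernel's struct disk. */
struct disk {
	const char *name;
	bool destroyed;
};

typedef void (*event_fn)(struct disk *);

struct event {
	event_fn fn;
	struct disk *dp;
};

static struct event queue[MAX_EVENTS];	/* simple ring buffer */
static size_t q_head, q_len;
static bool topology_locked;		/* simulated topology lock */

/* Runs in the (simulated) event thread with the topology lock held. */
static void
sketch_destroy_event(struct disk *dp)
{
	assert(topology_locked);
	dp->destroyed = true;
}

/*
 * Called from the CAM side. It only queues the work: it never takes the
 * topology lock and never waits for the event to run.
 * Returns false if the queue is full.
 */
static bool
sketch_disk_destroy_deferred(struct disk *dp)
{
	if (q_len == MAX_EVENTS)
		return (false);
	queue[(q_head + q_len) % MAX_EVENTS] =
	    (struct event){ sketch_destroy_event, dp };
	q_len++;
	return (true);
}

/*
 * One pass of the event thread. Returns the number of events run, or 0 if
 * the lock is busy (a real thread would sleep on the lock instead).
 */
static size_t
sketch_event_thread_pass(void)
{
	size_t ran = 0;

	if (topology_locked)
		return (0);
	topology_locked = true;
	while (q_len > 0) {
		struct event ev = queue[q_head];

		q_head = (q_head + 1) % MAX_EVENTS;
		q_len--;
		ev.fn(ev.dp);
		ran++;
	}
	topology_locked = false;
	return (ran);
}
```

In this model the caller returns even while the closing process holds the lock, and the destroy runs on a later pass once the lock is free. The cost of such a design is that the caller can no longer assume the disk is gone when the call returns, so it would have to keep the object valid until the event has run.
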

How-To-Repeat:

Do 'camcontrol rescan' either continuously or upon driver notification of changes. Set up a bunch of processes (I was using 'dd') to read a removable disk, then remove it while the processes are running.

There may also be a scenario with disk_create.

Comment 1 Poul-Henning Kamp 2004-09-23 19:48:58 UTC

In message <200409231827.i8NIR3TK071354@www.freebsd.org>, Brian Eng writes:

>One perspective on this is that cambio inverted the layers; normally,
>geom code calls cam code, but in the 'camcontrol rescan' case, cam
>code calls geom code, resulting in locks being taken in opposite
>order.  Perhaps disk_destroy could just queue to g_event and not
>wait for completion.

A lot of the trouble in this area comes from the fact that the disk_*()
API has to isolate Giant-free code from Giant-infected code.

There are no easy fixes for the remaining problems; they will take
serious work to fix.

I know there is some work on CAM going on behind the scenes and hopefully
they will take this sort of thing into account.

-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk@FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe    
Never attribute to malice what can adequately be explained by incompetence.

Comment 2 Mark Linimon freebsd_committer 2005-10-24 10:38:26 UTC

State Changed
From-To: open->suspended

Mark suspended as there do not seem to be any quick fixes forthcoming.

Comment 3 Eitan Adler freebsd_committer 2018-05-20 23:57:14 UTC

For bugs matching the following conditions:
- Status == In Progress
- Assignee == "bugs@FreeBSD.org"
- Last Modified Year <= 2017

Do
- Set Status to "Open"

Comment 4 Andriy Gapon freebsd_committer 2018-05-21 07:34:29 UTC

Haven't seen any issues like this in a long time.