Bug 219857 - panic in scsi_cd code
Summary: panic in scsi_cd code
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: CURRENT
Hardware: Any Any
: --- Affects Only Me
Assignee: Kenneth D. Merry
: 226870 232044 (view as bug list)
Depends on:
Reported: 2017-06-08 08:53 UTC by Andriy Gapon
Modified: 2019-06-15 08:11 UTC (History)
7 users (show)

See Also:

Proposed patch to make the media check async (21.87 KB, patch)
2018-07-11 15:37 UTC, Kenneth D. Merry
no flags Details | Diff
The good log (41.57 KB, text/plain)
2018-07-21 00:22 UTC, Dexuan Cui
no flags Details
The badlog (38.69 KB, text/plain)
2018-07-21 00:23 UTC, Dexuan Cui
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Andriy Gapon freebsd_committer 2017-06-08 08:53:58 UTC
The following panic happened right after cdrecord quit with an error while writing to a CD-RW disk.

(cd0:ahcich10:0:0:0): READ(10). CDB: 28 00 00 00 00 00 00 00 01 00 
(cd0:ahcich10:0:0:0): CAM status: SCSI Status Error
(cd0:ahcich10:0:0:0): SCSI status: Check Condition
(cd0:ahcich10:0:0:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range)
(cd0:ahcich10:0:0:0): Error 22, Unretryable error
(cd0:ahcich10:0:0:0): cddone: got error 0x16 back

panic: sleepq_add: td 0xfffff800060bf000 to sleep on wchan 0xfffff801b3f1a048 with sleeping prohibited
cpuid = 4
time = 1496908339
KDB: stack backtrace:
db_trace_self_wrapper() at 0xffffffff80439f4b = db_trace_self_wrapper+0x2b/frame 0xfffffe01ef5da770
kdb_backtrace() at 0xffffffff8068b369 = kdb_backtrace+0x39/frame 0xfffffe01ef5da820
vpanic() at 0xffffffff806521f2 = vpanic+0x162/frame 0xfffffe01ef5da860
panic() at 0xffffffff80651f23 = panic+0x43/frame 0xfffffe01ef5da8c0
sleepq_add() at 0xffffffff806979ff = sleepq_add+0x30f/frame 0xfffffe01ef5da910
_sleep() at 0xffffffff8065c0de = _sleep+0x2ae/frame 0xfffffe01ef5da9b0
cam_periph_ccbwait() at 0xffffffff802aac9d = cam_periph_ccbwait+0x5d/frame 0xfffffe01ef5da9e0
cam_periph_runccb() at 0xffffffff802aaac8 = cam_periph_runccb+0xb8/frame 0xfffffe01ef5daa40
cdrunccb() at 0xffffffff802cab8c = cdrunccb+0x3c/frame 0xfffffe01ef5daa60
cdprevent() at 0xffffffff802ca48b = cdprevent+0x7b/frame 0xfffffe01ef5daaa0
cdcheckmedia() at 0xffffffff802ca132 = cdcheckmedia+0x22/frame 0xfffffe01ef5daaf0
cdstrategy() at 0xffffffff802c8356 = cdstrategy+0x56/frame 0xfffffe01ef5dab20
g_disk_start() at 0xffffffff805c38f0 = g_disk_start+0x180/frame 0xfffffe01ef5dab70
g_io_schedule_down() at 0xffffffff805c760b = g_io_schedule_down+0x1eb/frame 0xfffffe01ef5dab90
g_down_procbody() at 0xffffffff805c7fed = g_down_procbody+0x6d/frame 0xfffffe01ef5daba0
fork_exit() at 0xffffffff80618680 = fork_exit+0xd0/frame 0xfffffe01ef5dabf0
fork_trampoline() at 0xffffffff8084304e = fork_trampoline+0xe/frame 0xfffffe01ef5dabf0
Comment 1 Kenneth D. Merry freebsd_committer 2017-06-13 21:32:50 UTC
I can see why this happened.  It is a little strange that I don't recall seeing this before, since the media check code was added in 2003, and it appears that sleeping has been prohibited for a similar amount of time.

The problem is:

1.  In g_io_schedule_down(), THREAD_NO_SLEEPING() is called before the start routine is called.

2.  In cdstrategy(), if the CD_FLAG_VALID_MEDIA flag isn't set, the cd(4) driver probes to see if there is valid media in the drive.

3.  In cdcheckmedia(), the cd(4) driver first prevents CD removal by calling cdprevent().

4.  In cdprevent(), the cd(4) driver calls cdrunccb(), which calls cam_periph_runccb(), which sleeps and runs into the panic.

In most cases, this won't happen because the cd(4) driver checks for media on open(), and prevents removal when it finds media.  The driver does allow the open() when no media is present, so that the user can call the CDIOCCLOSE and CDIOCEJECT ioctls.

One possible solution is just failing the I/O in cdstrategy() if the CD_FLAG_VALID_MEDIA flag is not set.  We would need to add code that checks for media if the user does a CDIOCCLOSE ioctl.

The challenge there, however, is that it would not catch the case where the user physically closes the drive while they have the device open and then sends a read.

In order to handle that case, and not sleep underneath cdstrategy(), we would need to do an asynchronous media check.

The way I would probably do it is to add a series of states into cdstart() and cddone() that mirror the way that cdcheckmedia() operates.  So, prevent, read capacity, read TOC, and so on.  We would still call xpt_schedule() from cdstrategy(), but cdstart() would go into a media probe state machine to detect the media and once it has validated the media or not, satisfy the I/O or send it back with an error.

As a bonus, we would re-implement the existing code so that it can work either asynchronously or synchronously.  (To avoid duplicate code paths.)
Comment 2 Andriy Gapon freebsd_committer 2018-03-28 10:09:07 UTC
*** Bug 226870 has been marked as a duplicate of this bug. ***
Comment 3 Dexuan Cui 2018-06-04 04:14:42 UTC
Our test team sees a similar call-trace:

panic: sleepq_add: td 0xfffff8010581a000 to sleep on wchan 0xfffff8010f924048 with sleeping prohibited
cpuid = 1
time = 1527752111
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00005815e0
vpanic() at vpanic+0x1a3/frame 0xfffffe0000581640
doadump() at doadump/frame 0xfffffe00005816c0
sleepq_add() at sleepq_add+0x378/frame 0xfffffe0000581710
_sleep() at _sleep+0x2c3/frame 0xfffffe00005817b0
cam_periph_runccb() at cam_periph_runccb+0x1ad/frame 0xfffffe0000581900
cdcheckmedia() at cdcheckmedia+0x8a/frame 0xfffffe0000581980
cdstrategy() at cdstrategy+0x55/frame 0xfffffe00005819b0
g_disk_start() at g_disk_start+0x48e/frame 0xfffffe0000581a20
g_io_schedule_down() at g_io_schedule_down+0x105/frame 0xfffffe0000581a60
g_down_procbody() at g_down_procbody+0x6d/frame 0xfffffe0000581a70
fork_exit() at fork_exit+0x84/frame 0xfffffe0000581ab0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0000581ab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
KDB: enter: panic
[ thread pid 13 tid 100113 ]
Stopped at      kdb_enter+0x3b: movq    $0,kdb_why

But our test team is just running some fio test with RAID (we run "kldload geom_stripe). I'm not sure why our test is triggering cdstrategy() and cdcheckmedia(), which seems to be related to CD? Our test team says the panic happens to FreeBSD 10.3 and the latest HEAD code, meaning this should be a longstanding issue...

Anyway, do we have a plan to fix the panic originally reported?
Comment 4 Andriy Gapon freebsd_committer 2018-06-04 06:23:56 UTC
(In reply to Dexuan Cui from comment #3)
>  I'm not sure why our test is triggering cdstrategy() and cdcheckmedia(), which seems to be related to CD?

It's probably because the new geom class wants to read media to check for on-disk metadata and it ends up trying to read from the empty CD drive too.
Comment 5 Dexuan Cui 2018-06-14 21:20:09 UTC
@ken: Is anyone working on this bug recently?
Comment 6 Kenneth D. Merry freebsd_committer 2018-06-15 01:57:30 UTC
I have not been working on it.  Feel free to work on the solution.  My comments from last year give a roadmap for how to do it.
Comment 7 Dexuan Cui 2018-06-15 05:25:46 UTC
(In reply to Kenneth D. Merry from comment #6)
I wish I could help, but unluckily the code here is quite unfamiliar to me. :-(
Comment 8 Kenneth D. Merry freebsd_committer 2018-06-15 20:47:40 UTC
(In reply to Dexuan Cui from comment #7)

Ok, I'll see if I can come up with a fix.  No promises on when/if it'll get done, of course. :)
Comment 9 Dexuan Cui 2018-06-15 21:04:24 UTC
(In reply to Kenneth D. Merry from comment #8)
Understood. :-)  Thanks!
Comment 10 Kenneth D. Merry freebsd_committer 2018-07-11 15:37:18 UTC
Created attachment 195055 [details]
Proposed patch to make the media check async

Here is a proposed patch against FreeBSD/head as of revision 335477 on June 21st, 2018.

This makes the media check process asynchronous, so we no longer block in cdstrategy() to check for media.

This still needs some additional testing in various situations.  I think there may still be one panic (GEOM assertion I believe) I saw with an audio CD in a drive, and I have not tested this with a SCSI CD/DVD drive.

I have only tested so far with a SATA drive running ATAPI.

In any case, give this a try and let me know if it improves things for you.
Comment 11 Dexuan Cui 2018-07-11 18:34:58 UTC
(In reply to Kenneth D. Merry from comment #10)
Thanks very much! We'll test it and report back.
Comment 12 Dexuan Cui 2018-07-17 21:48:36 UTC
(In reply to Dexuan Cui from comment #11)
Hi Kenneth, we tested your patch with 12-CURRENT, and we believe the panic we saw is fixed by the patch! Thank you very much!
Comment 13 Kenneth D. Merry freebsd_committer 2018-07-18 15:04:36 UTC
(In reply to Dexuan Cui from comment #12)

Great, I'm glad it is working!  I'll test it in a few more scenarios before I commit it.
Comment 14 Dexuan Cui 2018-07-20 23:19:02 UTC
(In reply to Kenneth D. Merry from comment #13)
Hi Ken, we found a regression caused by the patch: when we tested the patch with recent 12-CURRENT (July 11) in a *UP* FreeBSD Virtual Machine running on Windows Server 2012 R2, the VM hangs after printing 

Trying to mount root from ufs:/dev/da0s1a [rw]...

If I use a SMP VM (e.g. 2 CPUs or more), the issue will go away.
If I use a UP VM, and insert a .iso file into the virtual DVD drive, the issue will go away as well.

However, the same UP VM image can run fine on a Windows Server 2016 host with 1 CPU, and no .iso file in the virtual DVD drive.

So I can't say this patch has a bug. My gut feeling is that: maybe the patch somehow triggers a host issue, e.g. maybe the host's emulation of DVD drive returns some invalid value to the VM and the VM hangs? When the issue happens, the VM's CPU utilization is 0%, so I'm sure the VM is not in a busy loop. Probably the VM is waiting for some condition?

I'm attaching the good log (I insert a .iso file into the virtual DVD) and the bad log (the virtual DVD is empty). I got the log with the same VM running on Windows Server 2012 R2. 

The good one is:
cd0 at ata1 bus 0 scbus1 target 0 lun 0
cd0: <Msft Virtual CD/ROM 1.0> Removable CD-ROM SPC-3 SCSI device
cd0: 16.700MB/s transfers (WDMA2, ATAPI 12bytes, PIO 65534bytes)
cd0: 669MB (342562 2048 byte sectors)
Trying to mount root from ufs:/dev/da0s1a [rw]...
GEOM: new disk da0
GEOM: new disk cd0
start_init: trying /sbin/init

The bad one is:
cd0 at ata1 bus 0 scbus1 target 0 lun 0
cd0: <Msft Virtual CD/ROM 1.0> Removable CD-ROM SPC-3 SCSI device
cd0: 16.700MB/s transfers (WDMA2, ATAPI 12bytes, PIO 65534bytes)
cd0: Attempt to query device size failed: NOT READY, Medium not present
Trying to mount root from ufs:/dev/da0s1a [rw]...
GEOM: new disk cd0
GEOM: new disk da0
(Dexuan: it hangs here for ever)

How can I know where it hangs? Any insight or suggestion?  Thanks!
Comment 15 Dexuan Cui 2018-07-21 00:22:49 UTC
Created attachment 195314 [details]
The good log
Comment 16 Dexuan Cui 2018-07-21 00:23:08 UTC
Created attachment 195315 [details]
The badlog
Comment 17 Dexuan Cui 2018-07-21 00:29:04 UTC
(In reply to Dexuan Cui from comment #14)
BTW, the issue can't reproduce if we don't apply the patch.
Comment 18 Kenneth D. Merry freebsd_committer 2018-07-21 18:39:07 UTC
Ok, there is likely a bug in the patch, we'll just have to track it down.

Here are a few things to try that may help get to the bottom of it.

First, make sure you're running with WITNESS and INVARIANTS enabled, since those may catch locking or other issues that are causing problems.

1.  In the situation that hangs (Windows Server 2012, UP, no ISO in the drive), when the OS hangs, break into the debugger and run "ps" and "alltrace" and send the output. You can break into the debugger normally by sending a "break", but you can also configure ALT_BREAK_TO_DEBUGGER in the kernel config file.  Here is what sys/conf/NOTES says about triggering the debugger:

# Solaris implements a new BREAK which is initiated by a character
# sequence CR ~ ^b which is similar to a familiar pattern used on
# Sun servers by the Remote Console.  There are FreeBSD extensions:
# CR ~ ^p requests force panic and CR ~ ^r requests a clean reboot.

2.  Run the VM under Windows Server 2012 in SMP mode with no ISO in the drive, run these commands and send the output:

camcontrol cmd cd0 -v -c "1e 0 0 0 1 0"

camcontrol readcap cd0 -v

camcontrol cmd cd0 -v -t 60 -c "28 0 0 0 0 0 0 0 1 0" -i 2048 - | hd

The first is Prevent Media Removal, the second is Read Capacity, the third is a standard 10 byte read of one sector only.

My assumption is that there should be no difference in the behavior of the emulated CD device between SMP and UP mode, and we're just avoiding the problem in SMP mode due to timing.

I have tested in UP and SMP, and haven't seen this particular problem, but it could be that the emulated CD driver on Windows Server 2012 is responding in a way I didn't expect or the timing is just right in UP mode to trigger the bug.

Let me know what you figure out.
Comment 19 Dexuan Cui 2018-07-21 18:50:01 UTC
(In reply to Kenneth D. Merry from comment #18)
I'll try to debug the issue following the detailed instructions. Thanks so much!
Comment 20 Kenneth D. Merry freebsd_committer 2018-10-08 00:54:01 UTC
*** Bug 232044 has been marked as a duplicate of this bug. ***
Comment 21 Andriy Gapon freebsd_committer 2018-12-07 12:21:36 UTC
I am starting (at last!) to test the patch and here is a warning that I got during a build:
> /usr/devel/svn/head/sys/cam/scsi/scsi_cd.c:2862:1: warning: unused function 'cdsize' [-Wunused-function]
Comment 22 Harald Schmalzbauer 2018-12-07 12:31:22 UTC
Just to merge sideband communication:
I wasn't able to reproduce the UP problems described in comment#14.
The patch prevents my 12-stable production machines from panicing for a while; but I'm not extensively using ODDs over all...


Comment 23 Andriy Voskoboinyk freebsd_committer 2019-01-21 09:38:14 UTC
*** Bug 235081 has been marked as a duplicate of this bug. ***
Comment 24 Harald Schmalzbauer 2019-06-15 08:11:00 UTC
Kenneth's patch successfully prevented my machines from panicing for almost one year. I don't use ODD regularly, so the fact that I don't see any other side effect nor regression isn't counting much, but I remember staring at console while hypervisor reboots after ejecting (removing USB-) OD(D)...
Unfortunately it seems that 11.3 again ships with this trip-wire/trip-mine.

I'd vote for commiting it as soon as possible.