Bug 210316 - panic after trying to r/w mount msdosfs on write protected media
Summary: panic after trying to r/w mount msdosfs on write protected media
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: CURRENT
Hardware: Any
OS: Any
Importance: --- Affects Only Me
Assignee: freebsd-fs mailing list
URL:
Keywords:
Duplicates: 67326 212043 224037 (view as bug list)
Depends on:
Blocks:
 
Reported: 2016-06-16 08:41 UTC by Andriy Gapon
Modified: 2018-12-06 13:52 UTC (History)
CC: 9 users (show)

See Also:


Attachments

Description Andriy Gapon freebsd_committer 2016-06-16 08:41:56 UTC
At first there were some error messages:
(da1:umass-sim0:0:0:1): WRITE(10). CDB: 2a 20 00 00 00 04 00 00 08 00
(da1:umass-sim0:0:0:1): CAM status: SCSI Status Error
(da1:umass-sim0:0:0:1): SCSI status: Check Condition
(da1:umass-sim0:0:0:1): SCSI sense: DATA PROTECT asc:27,0 (Write protected)
(da1:umass-sim0:0:0:1): Error 19, Unretryable error
g_vfs_done():da1s1[WRITE(offset=512, length=4096)]error = 19

(da1:umass-sim0:0:0:1): WRITE(10). CDB: 2a 20 00 00 00 04 00 00 08 00
(da1:umass-sim0:0:0:1): CAM status: SCSI Status Error
(da1:umass-sim0:0:0:1): SCSI status: Check Condition
(da1:umass-sim0:0:0:1): SCSI sense: DATA PROTECT asc:27,0 (Write protected)
(da1:umass-sim0:0:0:1): Error 19, Unretryable error
g_vfs_done():da1s1[WRITE(offset=512, length=4096)]error = 19

(da1:umass-sim0:0:0:1): WRITE(10). CDB: 2a 20 00 00 00 04 00 00 08 00
(da1:umass-sim0:0:0:1): CAM status: SCSI Status Error
(da1:umass-sim0:0:0:1): SCSI status: Check Condition
(da1:umass-sim0:0:0:1): SCSI sense: DATA PROTECT asc:27,0 (Write protected)
(da1:umass-sim0:0:0:1): Error 19, Unretryable error
g_vfs_done():da1s1[WRITE(offset=512, length=4096)]error = 19

fsync: giving up on dirty
0xfffff802c126c938: tag devfs, type VCHR
    usecount 1, writecount 0, refcount 9 mountedhere 0xfffff80168b65000
    flags (VI_ACTIVE)
    v_object 0xfffff8029ed6e318 ref 0 pages 6 cleanbuf 6 dirtybuf 1
    lock type devfs: UNLOCKED
        dev da1s1
(da1:umass-sim0:0:0:1): SYNCHRONIZE CACHE(10). CDB: 35 20 00 00 00 00 00 00 00 00
(da1:umass-sim0:0:0:1): CAM status: SCSI Status Error
(da1:umass-sim0:0:0:1): SCSI status: Check Condition
(da1:umass-sim0:0:0:1): SCSI sense: DATA PROTECT asc:27,0 (Write protected)
(da1:umass-sim0:0:0:1): Error 19, Unretryable error
g_vfs_done():da1s1[WRITE(offset=512, length=4096)]error = 1

fsync: giving up on dirty
0xfffff802c126c938: tag devfs, type VCHR
    usecount 1, writecount 0, refcount 12 mountedhere 0xfffff80168b65000
    flags (VI_ACTIVE)
    v_object 0xfffff8029ed6e318 ref 0 pages 10 cleanbuf 9 dirtybuf 1
    lock type devfs: UNLOCKED
        dev da1s1


And then BOOM:

Fatal trap 9: general protection fault while in kernel mode
cpuid = 2; apic id = 02
instruction pointer     = 0x20:0xffffffff8061eb51
stack pointer           = 0x28:0xfffffe0517dff8a0
frame pointer           = 0x28:0xfffffe0517dff910
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 25 (syncer)
trap number             = 9
panic: general protection fault
cpuid = 2

KDB: stack backtrace:
db_trace_self_wrapper() at 0xffffffff80420f0b = db_trace_self_wrapper+0x2b/frame 0xfffffe0517dff400
kdb_backtrace() at 0xffffffff80673439 = kdb_backtrace+0x39/frame 0xfffffe0517dff4b0
vpanic() at 0xffffffff8063c94c = vpanic+0x14c/frame 0xfffffe0517dff4f0
panic() at 0xffffffff8063c693 = panic+0x43/frame 0xfffffe0517dff550
trap_fatal() at 0xffffffff8083e99f = trap_fatal+0x33f/frame 0xfffffe0517dff5b0
trap() at 0xffffffff8083e603 = trap+0x823/frame 0xfffffe0517dff7c0
trap_check() at 0xffffffff8083ecba = trap_check+0x2a/frame 0xfffffe0517dff7e0
calltrap() at 0xffffffff808273ea = calltrap+0x8/frame 0xfffffe0517dff7e0
--- trap 0x9, rip = 0xffffffff8061eb51, rsp = 0xfffffe0517dff8b0, rbp = 0xfffffe0517dff910 ---
__mtx_lock_flags() at 0xffffffff8061eb51 = __mtx_lock_flags+0x51/frame 0xfffffe0517dff910
g_vfs_strategy() at 0xffffffff805b8776 = g_vfs_strategy+0x36/frame 0xfffffe0517dff940
bstrategy() at 0xffffffff806cc946 = bstrategy+0x26/frame 0xfffffe0517dff950
bufwrite() at 0xffffffff806cab2e = bufwrite+0x20e/frame 0xfffffe0517dff9b0
bwrite() at 0xffffffff806cd8f6 = bwrite+0x26/frame 0xfffffe0517dff9c0
bawrite() at 0xffffffff806cd3ad = bawrite+0xd/frame 0xfffffe0517dff9d0
vop_stdfsync() at 0xffffffff806d7660 = vop_stdfsync+0x1a0/frame 0xfffffe0517dffa20
devfs_fsync() at 0xffffffff80531d66 = devfs_fsync+0x26/frame 0xfffffe0517dffa50
VOP_FSYNC_APV() at 0xffffffff80884be0 = VOP_FSYNC_APV+0x120/frame 0xfffffe0517dffaa0
VOP_FSYNC() at 0xffffffff806e7d58 = VOP_FSYNC+0x28/frame 0xfffffe0517dffad0
sync_vnode() at 0xffffffff806eb99d = sync_vnode+0x11d/frame 0xfffffe0517dffb20
sched_sync() at 0xffffffff806eb63e = sched_sync+0x23e/frame 0xfffffe0517dffb80
fork_exit() at 0xffffffff8060280f = fork_exit+0xdf/frame 0xfffffe0517dffbf0
fork_trampoline() at 0xffffffff8082791e = fork_trampoline+0xe/frame 0xfffffe0517dffbf0
Comment 1 Andriy Gapon freebsd_committer 2016-06-16 08:47:29 UTC
My preliminary analysis of the problem:
mountmsdosfs() calls markvoldirty() near the end, and markvoldirty() failed because of the read-only media.  bwrite() called brelse(), which marked the buffer as dirty because of the write error.  Because of the markvoldirty() failure, mountmsdosfs() failed overall and thus called g_vfs_destroy(), which destroyed the filesystem's geom.  When the syncer later tried to sync the still-dirty buffer, g_vfs_strategy() accessed the destroyed consumer / geom, and that resulted in the crash.
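The chain above can be sketched as a small userspace C model.  Everything here is illustrative, not the real kernel layout: the structs merely stand in for g_consumer and struct buf, and a boolean `destroyed` flag replaces the actual freed geom memory that the kernel dereferences.

```c
#include <assert.h>
#include <stdbool.h>

struct consumer { bool destroyed; };        /* stand-in for g_consumer */
struct buf {
    bool dirty;
    struct consumer *b_consumer;            /* goes stale after destroy */
};

/* markvoldirty()'s write fails; brelse() re-dirties the buffer. */
static void failed_write(struct buf *bp) { bp->dirty = true; }

/* Mount error path: consumer destroyed, dirty buffer NOT invalidated. */
static void mount_error_exit(struct consumer *cp) { cp->destroyed = true; }

/* Syncer: in the kernel this dereference of destroyed state is the panic. */
static bool syncer_touches_destroyed(const struct buf *bp)
{
    return bp->dirty && bp->b_consumer->destroyed;
}
```

Running the three steps in order shows the hazard: the buffer outlives the consumer it points at.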
Comment 2 Andriy Gapon freebsd_committer 2017-02-15 19:06:06 UTC
A reference to a similar report: https://lists.freebsd.org/pipermail/freebsd-current/2016-June/062061.html
Comment 3 Andriy Gapon freebsd_committer 2017-06-26 09:29:02 UTC
Another panic of a similar nature:
Fatal trap 9: general protection fault while in kernel mode
cpuid = 1; apic id = 01
instruction pointer     = 0x20:0xffffffff80633e06
stack pointer           = 0x28:0xfffffe07169c6240
frame pointer           = 0x28:0xfffffe07169c6290
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 14163 (gvfs-hal-volume-mon)
trap number             = 9
panic: general protection fault
cpuid = 1
time = 1498465614
KDB: stack backtrace:
db_trace_self_wrapper() at 0xffffffff80439f4b = db_trace_self_wrapper+0x2b/frame 0xfffffe07169c5df0
kdb_backtrace() at 0xffffffff8068b369 = kdb_backtrace+0x39/frame 0xfffffe07169c5ea0
vpanic() at 0xffffffff806521f2 = vpanic+0x162/frame 0xfffffe07169c5ee0
panic() at 0xffffffff80651f23 = panic+0x43/frame 0xfffffe07169c5f40
trap_fatal() at 0xffffffff8085a170 = trap_fatal+0x310/frame 0xfffffe07169c5f90
trap() at 0xffffffff80859717 = trap+0x97/frame 0xfffffe07169c6150
trap_check() at 0xffffffff8085a49a = trap_check+0x2a/frame 0xfffffe07169c6170
calltrap() at 0xffffffff80842b1a = calltrap+0x8/frame 0xfffffe07169c6170
--- trap 0x9, rip = 0xffffffff80633e06, rsp = 0xfffffe07169c6240, rbp = 0xfffffe07169c6290 ---
__mtx_lock_flags() at 0xffffffff80633e06 = __mtx_lock_flags+0x46/frame 0xfffffe07169c6290
g_vfs_strategy() at 0xffffffff805cc586 = g_vfs_strategy+0x36/frame 0xfffffe07169c62c0
bstrategy() at 0xffffffff806e4326 = bstrategy+0x26/frame 0xfffffe07169c62d0
bufwrite() at 0xffffffff806e258f = bufwrite+0x1df/frame 0xfffffe07169c6310
bwrite() at 0xffffffff806e51d6 = bwrite+0x26/frame 0xfffffe07169c6320
bawrite() at 0xffffffff806e4cbd = bawrite+0xd/frame 0xfffffe07169c6330
vop_stdfsync() at 0xffffffff806f11b0 = vop_stdfsync+0x1b0/frame 0xfffffe07169c6380
VOP_FSYNC_APV() at 0xffffffff808a4fea = VOP_FSYNC_APV+0xfa/frame 0xfffffe07169c63b0
VOP_FSYNC() at 0xffffffff806e49f8 = VOP_FSYNC+0x28/frame 0xfffffe07169c63e0
bufsync() at 0xffffffff806e2687 = bufsync+0x27/frame 0xfffffe07169c6400
bufobj_invalbuf() at 0xffffffff806fe937 = bufobj_invalbuf+0x287/frame 0xfffffe07169c6460
vinvalbuf() at 0xffffffff806fec1a = vinvalbuf+0x8a/frame 0xfffffe07169c6490
vgonel() at 0xffffffff80701fbe = vgonel+0x16e/frame 0xfffffe07169c64d0
vgone() at 0xffffffff8070236f = vgone+0x2f/frame 0xfffffe07169c64f0
devfs_delete() at 0xffffffff805425ae = devfs_delete+0x15e/frame 0xfffffe07169c6550
devfs_populate_loop() at 0xffffffff80542d4f = devfs_populate_loop+0x3cf/frame 0xfffffe07169c65a0
devfs_populate() at 0xffffffff8054296a = devfs_populate+0x4a/frame 0xfffffe07169c65c0
devfs_populate_vp() at 0xffffffff80546a6e = devfs_populate_vp+0x5e/frame 0xfffffe07169c65f0
devfs_lookup() at 0xffffffff80545636 = devfs_lookup+0x16/frame 0xfffffe07169c6620
VOP_LOOKUP_APV() at 0xffffffff808a2d93 = VOP_LOOKUP_APV+0xf3/frame 0xfffffe07169c6650
VOP_LOOKUP() at 0xffffffff806f5e29 = VOP_LOOKUP+0x29/frame 0xfffffe07169c6680
lookup() at 0xffffffff806f52b8 = lookup+0x3e8/frame 0xfffffe07169c6700
namei() at 0xffffffff806f4b5d = namei+0x3ad/frame 0xfffffe07169c67a0
kern_statat() at 0xffffffff8070b3fb = kern_statat+0x9b/frame 0xfffffe07169c69b0
sys_fstatat() at 0xffffffff8070b7af = sys_fstatat+0x2f/frame 0xfffffe07169c6ab0
syscallenter() at 0xffffffff8085aaf6 = syscallenter+0x316/frame 0xfffffe07169c6b00
amd64_syscall() at 0xffffffff8085a69f = amd64_syscall+0x1f/frame 0xfffffe07169c6bf0
Xfast_syscall() at 0xffffffff80842dfb = Xfast_syscall+0xfb/frame 0xfffffe07169c6bf0
--- syscall (552, FreeBSD ELF64, sys_fstatat), rip = 0x80234f65a, rsp = 0x7fffffffd798, rbp = 0x7fffffffd8e0 ---
Uptime: 3d14h59m16s
Comment 4 Conrad Meyer freebsd_committer 2017-12-03 02:17:02 UTC
*** Bug 224037 has been marked as a duplicate of this bug. ***
Comment 5 Damjan Jovanovic 2018-04-30 05:18:12 UTC
Still a problem on CURRENT. Very easy to reproduce:

dd if=/dev/zero of=/path/to/image bs=1M count=100
newfs_msdos /path/to/image
ggatel create -o ro /path/to/image
mount /dev/ggate0 /mnt

A few seconds later: crash.

I wonder what the best way to fix this is:
- Should the referenced geom explicitly be cleared from the buffer after the failed write?
- Should the msdosfs driver call fsync() before g_vfs_destroy()-ing the geom on mount failure, so that the dirty buffer is deleted before the geom device it references is deleted?
- Should geom devices be reference-counted by their buffers and forced to live until all buffers that reference them are deleted?
Comment 6 Damjan Jovanovic 2018-04-30 06:49:41 UTC
Repeating that test with UFS (newfs -U -j) also freezes the system, so sadly msdosfs is not the only filesystem affected.

Using the ext2fs driver on a read-only EXT3 filesystem however works, printing some errors in dmesg:

g_vfs_done():ggate0[WRITE(offset=1024, length=1024)]error = 1
g_vfs_done():ggate0[WRITE(offset=1024, length=1024)]error = 1
g_vfs_done():ggate0[WRITE(offset=2048, length=1024)]error = 1
g_vfs_done():ggate0[WRITE(offset=268288, length=1024)]error = 1
g_vfs_done():ggate0[WRITE(offset=268288, length=1024)]error = 1

but it mounts successfully, denies writes with EPERM, and doesn't crash. Unmounting however fails, with this ominous error:

fsync: giving up on dirty (error = 1) 0xfffff801df6a45a0: tag devfs, type VCHR
    usecount 1, writecount 0, refcount 7 rdev 0xfffff800d1092800
    flags (VI_ACTIVE)
    v_object 0xfffff80146eb3900 ref 0 pages 2 cleanbuf 1 dirtybuf 4
    lock type devfs: EXCL by thread 0xfffff800d1017000 (pid 1494, umount, tid 100515)
	dev ggate0

but still no crash.
Comment 7 Damjan Jovanovic 2018-05-01 06:12:45 UTC
If I add this code snippet (copied from somewhere in the ext2fs driver) so that vinvalbuf() is called before g_vfs_close(), the r/w mount of a r/o geom fails (EPERM), and the panic doesn't happen:

diff --git a/sys/fs/msdosfs/msdosfs_vfsops.c b/sys/fs/msdosfs/msdosfs_vfsops.c
index 76a238c5ff77..7477d5b03a1f 100644
--- a/sys/fs/msdosfs/msdosfs_vfsops.c
+++ b/sys/fs/msdosfs/msdosfs_vfsops.c
@@ -717,6 +717,12 @@ mountmsdosfs(struct vnode *devvp, struct mount *mp)
 error_exit:
        if (bp)
                brelse(bp);
+
+       vn_lock(devvp, LK_EXCLUSIVE | LK_RETRY);
+       if (vinvalbuf(devvp, 0, 0, 0) != 0)
+               panic("mountmsdosfs: dirty");
+       VOP_UNLOCK(devvp, 0);
+
        if (cp != NULL) {
                g_topology_lock();
                g_vfs_close(cp);

Apart from not really understanding it, I don't like it, because if every filesystem should be calling vinvalbuf() before g_vfs_close(), then shouldn't g_vfs_close() rather be calling vinvalbuf() itself? Also msdosfs_unmount() doesn't call vinvalbuf() before its own g_vfs_close(), but it does call msdosfs_sync() and vflush(). Would those also work, or be even better?

Fixing UFS will be a different story: msdosfs writes to the filesystem during mount [in markvoldirty()], so its panic is almost immediate, whereas I could only crash UFS by trying to write to a file after mounting.
Comment 8 Andriy Gapon freebsd_committer 2018-05-02 10:01:31 UTC
*** Bug 67326 has been marked as a duplicate of this bug. ***
Comment 9 Konstantin Belousov freebsd_committer 2018-05-02 11:48:31 UTC
(In reply to Andriy Gapon from comment #1)
Assuming that this analysis covers all cases, what would be your suggestion there, Andriy?

Maybe in g_vfs_close() we should set bo_ops to some vector that contains a NOP strategy and returns success?  Perhaps this should be an opt-in from the caller of g_vfs_close().  In that case, we might also stop passing V_SAVE to vinvalbuf().
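A minimal sketch of that idea, assuming an illustrative bufobj/buf_ops layout (not the real kernel structures): after the close, the strategy slot points at a routine that completes writes as successful no-ops instead of touching the destroyed geom.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct buf { bool done; int error; };       /* toy stand-in for struct buf */
struct bufobj;
struct buf_ops { void (*bop_strategy)(struct bufobj *, struct buf *); };
struct bufobj { struct buf_ops *bo_ops; };  /* toy stand-in for bufobj */

/* "Dead" strategy: complete the I/O successfully without device access. */
static void nop_strategy(struct bufobj *bo, struct buf *bp)
{
    (void)bo;
    bp->error = 0;
    bp->done = true;
}

static struct buf_ops dead_bufops = { nop_strategy };

/* What a hypothetical opt-in g_vfs_close(NOSAVE) could do before
 * destroying the consumer: leave a harmless ops vector behind. */
static void g_vfs_close_nosave(struct bufobj *bo)
{
    bo->bo_ops = &dead_bufops;
}
```

Any buffer synced afterwards then "succeeds" and can be cleaned, rather than chasing a dangling consumer pointer.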
Comment 10 Andriy Gapon freebsd_committer 2018-05-16 12:38:37 UTC
(In reply to Konstantin Belousov from comment #9)

Kostik,
I need to refresh my memory of this problem.

Meanwhile, I have a counter-question.
Is there a bwrite variant where a failure is a final failure with no side effects?  I mean, if that bwriteX fails, then the buffer is just discarded.
Something with semantics similar to write(2).

My understanding of what happens now:
bufwrite() calls this code:
        if ((oldflags & B_ASYNC) == 0) {
                int rtval = bufwait(bp);
                brelse(bp);
                return (rtval);
        }

Then in brelse() the following block is executed:
	if (bp->b_iocmd == BIO_WRITE && (bp->b_ioflags & BIO_ERROR) &&
	    (bp->b_error != ENXIO || !LIST_EMPTY(&bp->b_dep)) &&
	    !(bp->b_flags & B_INVAL)) {
		/*
		 * Failed write, redirty.  All errors except ENXIO (which
		 * means the device is gone) are expected to be potentially
		 * transient - underlying media might work if tried again
		 * after EIO, and memory might be available after an ENOMEM.
		 *
		 * Do this also for buffers that failed with ENXIO, but have
		 * non-empty dependencies - the soft updates code might need
		 * to access the buffer to untangle them.
		 *
		 * Must clear BIO_ERROR to prevent pages from being scrapped.
		 */
		bp->b_ioflags &= ~BIO_ERROR;
		bdirty(bp);
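The redirty condition above can be modeled as a standalone predicate.  The flag values below are illustrative placeholders, not the kernel's definitions; only the boolean logic mirrors the quoted block.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Illustrative values; the real constants live in sys/bio.h and sys/buf.h. */
#define BIO_WRITE  1
#define BIO_ERROR  0x0001
#define B_INVAL    0x0001

/* Mirrors brelse()'s decision: redirty a failed write unless the device is
 * gone (ENXIO with no dependencies) or the buffer is already invalidated. */
static bool
should_redirty(int iocmd, int ioflags, int error, bool has_deps, int bflags)
{
    return (iocmd == BIO_WRITE && (ioflags & BIO_ERROR) &&
        (error != ENXIO || has_deps) && !(bflags & B_INVAL));
}
```

In particular, an EIO from write-protected media redirties the buffer, which is exactly how markvoldirty()'s failed write gets stuck.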

I think that it would be nice to have a way to instruct the buffer code to never retry a specific synchronous write request.
If we had such a mechanism, we could use it in a filesystem's mount code to check whether we can really write to the backing media.

Or should a caller of bwrite() explicitly manipulate b_flags (e.g. B_INVAL, B_RELBUF) to get the desired effect?
Comment 11 Konstantin Belousov freebsd_committer 2018-05-16 12:56:44 UTC
(In reply to Andriy Gapon from comment #10)
There is no flag meaning 'write once and destroy on error', and it is impossible to achieve this with flags.  The buffer cache tries hard not to lose user data; it is the caller's duty to destroy dirty buffers that are no longer needed.
Comment 12 Andriy Gapon freebsd_committer 2018-05-16 13:36:57 UTC
(In reply to Konstantin Belousov from comment #11)
Well, it's just very counter-intuitive (for someone who is not deeply familiar with the buffer cache, like me) that the default (and, in fact, the only) behaviour for a failed _synchronous_ request is to retry.  I mean, I've got the error, I might have acted on it, and I might even be surprised if the data gets written later.

But this is buffer cache, so I will just shut up.
Comment 13 Conrad Meyer freebsd_committer 2018-05-16 18:34:57 UTC
"Write once and, on error, destroy user data" is a very surprising semantic for users.  Not sure why we should do that.
Comment 14 Andriy Gapon freebsd_committer 2018-05-17 05:48:21 UTC
(In reply to Conrad Meyer from comment #13)

Indeed, if we talk about the general behaviour.

I see that I utterly failed to explain that I was thinking purely in the context of what msdosfs does in markvoldirty().
Right now that code can leave behind a perpetually dirty buffer and I was thinking how that can be avoided.

Maybe markvoldirty should do
    bp = getblk(...)
    bp->b_flags |= B_INVAL | B_RELBUF | B_NOCACHE;
    bp->b_flags &= ~(B_ASYNC | B_CACHE);
    brelse(bp);
after a failed write?
Looks clumsy, but should work.

Or maybe markvoldirty should not use buffer cache for its write?
It could use g_write_data(), for example, but that sounds like a layering violation.
Comment 15 Andriy Gapon freebsd_committer 2018-05-17 06:00:30 UTC
(In reply to Konstantin Belousov from comment #9)

This proposal looks like something that can help with the problem at hand in a generic way.
I see that at present g_vfs_close() does not even check the return code of bufobj_invalbuf().  That looks dangerous, but I am not sure the code can do better, e.g., whether looping on an error would be okay.

Then, there remains a question of when to use the proposed g_vfs_close(NOSAVE).
There are obvious cases like a failed mount.
And that's what we want to address now, so that's good.

But there can be a more complex situation.
For example, we can mount a filesystem r/w and do some writes to it.
But then the hardware suddenly decides to go to read-only mode (e.g., because some internal counter crosses its threshold).
So, we will get some failed writes and some dirty buffers that cannot possibly be made clean by writing out.
Do we need to somehow handle them?
And then we decide to unmount the filesystem.
How do we handle those buffers?
How do we know if we should call g_vfs_close(SAVE) or g_vfs_close(NOSAVE) ?
Comment 16 Warner Losh freebsd_committer 2018-05-17 16:29:10 UTC
(In reply to Andriy Gapon from comment #14)
    bp = getblk(...)
    bp->b_flags |= B_INVAL | B_RELBUF | B_NOCACHE;
    bp->b_flags &= ~(B_ASYNC | B_CACHE);
    brelse(bp);

Maybe we need a bfinval(bp) that does that.  "Forced" is the 'f', but I'm open to any other name.  Basically: "throw away this buffer that I know is dirty."  We have a 'flush' operation but no 'invalidate' or 'throw away' operation for these cache entries.  If we centralize it, we'll have fewer gross hacks like this in the system that "should work" but gradually drift away from working because of some subtle change in what B_NOCACHE means 5 years from now...
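A sketch of what such a bfinval() could centralize, with illustrative flag values and a toy struct buf (the real flags are in sys/buf.h, and a real implementation would finish with brelse(bp)):

```c
#include <assert.h>

/* Illustrative flag values only; the kernel's differ. */
#define B_INVAL   0x0001
#define B_RELBUF  0x0002
#define B_NOCACHE 0x0004
#define B_ASYNC   0x0008
#define B_CACHE   0x0010

struct buf { unsigned b_flags; };           /* toy stand-in */

/* Hypothetical helper: "throw away this buffer I know is dirty". */
static void
bfinval(struct buf *bp)
{
    /* Mark the buffer invalid, release its backing, and bypass the cache. */
    bp->b_flags |= B_INVAL | B_RELBUF | B_NOCACHE;
    /* Force a synchronous, non-cached release. */
    bp->b_flags &= ~(B_ASYNC | B_CACHE);
    /* ...then brelse(bp) in a real implementation. */
}
```

Centralizing the flag dance in one helper is exactly what keeps callers like markvoldirty() from open-coding it and drifting out of date.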
Comment 17 Kirk McKusick freebsd_committer 2018-05-17 16:36:33 UTC
(In reply to Warner Losh from comment #16)
I very much agree with the idea of adding bfinval.
Comment 18 Andriy Gapon freebsd_committer 2018-12-06 13:52:57 UTC
*** Bug 212043 has been marked as a duplicate of this bug. ***