At first there were some error messages:

(da1:umass-sim0:0:0:1): WRITE(10). CDB: 2a 20 00 00 00 04 00 00 08 00
(da1:umass-sim0:0:0:1): CAM status: SCSI Status Error
(da1:umass-sim0:0:0:1): SCSI status: Check Condition
(da1:umass-sim0:0:0:1): SCSI sense: DATA PROTECT asc:27,0 (Write protected)
(da1:umass-sim0:0:0:1): Error 19, Unretryable error
g_vfs_done():da1s1[WRITE(offset=512, length=4096)]error = 19
(da1:umass-sim0:0:0:1): WRITE(10). CDB: 2a 20 00 00 00 04 00 00 08 00
(da1:umass-sim0:0:0:1): CAM status: SCSI Status Error
(da1:umass-sim0:0:0:1): SCSI status: Check Condition
(da1:umass-sim0:0:0:1): SCSI sense: DATA PROTECT asc:27,0 (Write protected)
(da1:umass-sim0:0:0:1): Error 19, Unretryable error
g_vfs_done():da1s1[WRITE(offset=512, length=4096)]error = 19
(da1:umass-sim0:0:0:1): WRITE(10). CDB: 2a 20 00 00 00 04 00 00 08 00
(da1:umass-sim0:0:0:1): CAM status: SCSI Status Error
(da1:umass-sim0:0:0:1): SCSI status: Check Condition
(da1:umass-sim0:0:0:1): SCSI sense: DATA PROTECT asc:27,0 (Write protected)
(da1:umass-sim0:0:0:1): Error 19, Unretryable error
g_vfs_done():da1s1[WRITE(offset=512, length=4096)]error = 19
fsync: giving up on dirty
0xfffff802c126c938: tag devfs, type VCHR
    usecount 1, writecount 0, refcount 9 mountedhere 0xfffff80168b65000
    flags (VI_ACTIVE)
    v_object 0xfffff8029ed6e318 ref 0 pages 6 cleanbuf 6 dirtybuf 1
    lock type devfs: UNLOCKED
    dev da1s1
(da1:umass-sim0:0:0:1): SYNCHRONIZE CACHE(10). CDB: 35 20 00 00 00 00 00 00 00 00
(da1:umass-sim0:0:0:1): CAM status: SCSI Status Error
(da1:umass-sim0:0:0:1): SCSI status: Check Condition
(da1:umass-sim0:0:0:1): SCSI sense: DATA PROTECT asc:27,0 (Write protected)
(da1:umass-sim0:0:0:1): Error 19, Unretryable error
g_vfs_done():da1s1[WRITE(offset=512, length=4096)]error = 1
fsync: giving up on dirty
0xfffff802c126c938: tag devfs, type VCHR
    usecount 1, writecount 0, refcount 12 mountedhere 0xfffff80168b65000
    flags (VI_ACTIVE)
    v_object 0xfffff8029ed6e318 ref 0 pages 10 cleanbuf 9 dirtybuf 1
    lock type devfs: UNLOCKED
    dev da1s1

And then BOOM:

Fatal trap 9: general protection fault while in kernel mode
cpuid = 2; apic id = 02
instruction pointer	= 0x20:0xffffffff8061eb51
stack pointer		= 0x28:0xfffffe0517dff8a0
frame pointer		= 0x28:0xfffffe0517dff910
code segment		= base rx0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 25 (syncer)
trap number		= 9
panic: general protection fault
cpuid = 2
KDB: stack backtrace:
db_trace_self_wrapper() at 0xffffffff80420f0b = db_trace_self_wrapper+0x2b/frame 0xfffffe0517dff400
kdb_backtrace() at 0xffffffff80673439 = kdb_backtrace+0x39/frame 0xfffffe0517dff4b0
vpanic() at 0xffffffff8063c94c = vpanic+0x14c/frame 0xfffffe0517dff4f0
panic() at 0xffffffff8063c693 = panic+0x43/frame 0xfffffe0517dff550
trap_fatal() at 0xffffffff8083e99f = trap_fatal+0x33f/frame 0xfffffe0517dff5b0
trap() at 0xffffffff8083e603 = trap+0x823/frame 0xfffffe0517dff7c0
trap_check() at 0xffffffff8083ecba = trap_check+0x2a/frame 0xfffffe0517dff7e0
calltrap() at 0xffffffff808273ea = calltrap+0x8/frame 0xfffffe0517dff7e0
--- trap 0x9, rip = 0xffffffff8061eb51, rsp = 0xfffffe0517dff8b0, rbp = 0xfffffe0517dff910 ---
__mtx_lock_flags() at 0xffffffff8061eb51 = __mtx_lock_flags+0x51/frame 0xfffffe0517dff910
g_vfs_strategy() at 0xffffffff805b8776 = g_vfs_strategy+0x36/frame 0xfffffe0517dff940
bstrategy() at 0xffffffff806cc946 = bstrategy+0x26/frame 0xfffffe0517dff950
bufwrite() at 0xffffffff806cab2e = bufwrite+0x20e/frame 0xfffffe0517dff9b0
bwrite() at 0xffffffff806cd8f6 = bwrite+0x26/frame 0xfffffe0517dff9c0
bawrite() at 0xffffffff806cd3ad = bawrite+0xd/frame 0xfffffe0517dff9d0
vop_stdfsync() at 0xffffffff806d7660 = vop_stdfsync+0x1a0/frame 0xfffffe0517dffa20
devfs_fsync() at 0xffffffff80531d66 = devfs_fsync+0x26/frame 0xfffffe0517dffa50
VOP_FSYNC_APV() at 0xffffffff80884be0 = VOP_FSYNC_APV+0x120/frame 0xfffffe0517dffaa0
VOP_FSYNC() at 0xffffffff806e7d58 = VOP_FSYNC+0x28/frame 0xfffffe0517dffad0
sync_vnode() at 0xffffffff806eb99d = sync_vnode+0x11d/frame 0xfffffe0517dffb20
sched_sync() at 0xffffffff806eb63e = sched_sync+0x23e/frame 0xfffffe0517dffb80
fork_exit() at 0xffffffff8060280f = fork_exit+0xdf/frame 0xfffffe0517dffbf0
fork_trampoline() at 0xffffffff8082791e = fork_trampoline+0xe/frame 0xfffffe0517dffbf0
My preliminary analysis of the problem: mountmsdosfs() calls markvoldirty() near the end. markvoldirty() failed because of the read-only media, and bwrite() called brelse(), which marked the buffer as dirty because of the write error. Because of the markvoldirty() failure, mountmsdosfs() failed overall and thus called g_vfs_destroy(), which destroyed the filesystem's geom. When the syncer later tried to sync the still-dirty buffer, g_vfs_strategy() accessed the destroyed consumer / geom, and that resulted in the crash.
A reference to a similar report: https://lists.freebsd.org/pipermail/freebsd-current/2016-June/062061.html
Another panic of a similar nature:

Fatal trap 9: general protection fault while in kernel mode
cpuid = 1; apic id = 01
instruction pointer	= 0x20:0xffffffff80633e06
stack pointer		= 0x28:0xfffffe07169c6240
frame pointer		= 0x28:0xfffffe07169c6290
code segment		= base rx0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 14163 (gvfs-hal-volume-mon)
trap number		= 9
panic: general protection fault
cpuid = 1
time = 1498465614
KDB: stack backtrace:
db_trace_self_wrapper() at 0xffffffff80439f4b = db_trace_self_wrapper+0x2b/frame 0xfffffe07169c5df0
kdb_backtrace() at 0xffffffff8068b369 = kdb_backtrace+0x39/frame 0xfffffe07169c5ea0
vpanic() at 0xffffffff806521f2 = vpanic+0x162/frame 0xfffffe07169c5ee0
panic() at 0xffffffff80651f23 = panic+0x43/frame 0xfffffe07169c5f40
trap_fatal() at 0xffffffff8085a170 = trap_fatal+0x310/frame 0xfffffe07169c5f90
trap() at 0xffffffff80859717 = trap+0x97/frame 0xfffffe07169c6150
trap_check() at 0xffffffff8085a49a = trap_check+0x2a/frame 0xfffffe07169c6170
calltrap() at 0xffffffff80842b1a = calltrap+0x8/frame 0xfffffe07169c6170
--- trap 0x9, rip = 0xffffffff80633e06, rsp = 0xfffffe07169c6240, rbp = 0xfffffe07169c6290 ---
__mtx_lock_flags() at 0xffffffff80633e06 = __mtx_lock_flags+0x46/frame 0xfffffe07169c6290
g_vfs_strategy() at 0xffffffff805cc586 = g_vfs_strategy+0x36/frame 0xfffffe07169c62c0
bstrategy() at 0xffffffff806e4326 = bstrategy+0x26/frame 0xfffffe07169c62d0
bufwrite() at 0xffffffff806e258f = bufwrite+0x1df/frame 0xfffffe07169c6310
bwrite() at 0xffffffff806e51d6 = bwrite+0x26/frame 0xfffffe07169c6320
bawrite() at 0xffffffff806e4cbd = bawrite+0xd/frame 0xfffffe07169c6330
vop_stdfsync() at 0xffffffff806f11b0 = vop_stdfsync+0x1b0/frame 0xfffffe07169c6380
VOP_FSYNC_APV() at 0xffffffff808a4fea = VOP_FSYNC_APV+0xfa/frame 0xfffffe07169c63b0
VOP_FSYNC() at 0xffffffff806e49f8 = VOP_FSYNC+0x28/frame 0xfffffe07169c63e0
bufsync() at 0xffffffff806e2687 = bufsync+0x27/frame 0xfffffe07169c6400
bufobj_invalbuf() at 0xffffffff806fe937 = bufobj_invalbuf+0x287/frame 0xfffffe07169c6460
vinvalbuf() at 0xffffffff806fec1a = vinvalbuf+0x8a/frame 0xfffffe07169c6490
vgonel() at 0xffffffff80701fbe = vgonel+0x16e/frame 0xfffffe07169c64d0
vgone() at 0xffffffff8070236f = vgone+0x2f/frame 0xfffffe07169c64f0
devfs_delete() at 0xffffffff805425ae = devfs_delete+0x15e/frame 0xfffffe07169c6550
devfs_populate_loop() at 0xffffffff80542d4f = devfs_populate_loop+0x3cf/frame 0xfffffe07169c65a0
devfs_populate() at 0xffffffff8054296a = devfs_populate+0x4a/frame 0xfffffe07169c65c0
devfs_populate_vp() at 0xffffffff80546a6e = devfs_populate_vp+0x5e/frame 0xfffffe07169c65f0
devfs_lookup() at 0xffffffff80545636 = devfs_lookup+0x16/frame 0xfffffe07169c6620
VOP_LOOKUP_APV() at 0xffffffff808a2d93 = VOP_LOOKUP_APV+0xf3/frame 0xfffffe07169c6650
VOP_LOOKUP() at 0xffffffff806f5e29 = VOP_LOOKUP+0x29/frame 0xfffffe07169c6680
lookup() at 0xffffffff806f52b8 = lookup+0x3e8/frame 0xfffffe07169c6700
namei() at 0xffffffff806f4b5d = namei+0x3ad/frame 0xfffffe07169c67a0
kern_statat() at 0xffffffff8070b3fb = kern_statat+0x9b/frame 0xfffffe07169c69b0
sys_fstatat() at 0xffffffff8070b7af = sys_fstatat+0x2f/frame 0xfffffe07169c6ab0
syscallenter() at 0xffffffff8085aaf6 = syscallenter+0x316/frame 0xfffffe07169c6b00
amd64_syscall() at 0xffffffff8085a69f = amd64_syscall+0x1f/frame 0xfffffe07169c6bf0
Xfast_syscall() at 0xffffffff80842dfb = Xfast_syscall+0xfb/frame 0xfffffe07169c6bf0
--- syscall (552, FreeBSD ELF64, sys_fstatat), rip = 0x80234f65a, rsp = 0x7fffffffd798, rbp = 0x7fffffffd8e0 ---
Uptime: 3d14h59m16s
*** Bug 224037 has been marked as a duplicate of this bug. ***
Still a problem on CURRENT. Very easy to reproduce:

dd if=/dev/zero of=/path/to/image bs=1M count=100
newfs_msdos /path/to/image
ggatel create -o ro /path/to/image
mount /dev/ggate0 /mnt

A few seconds later: crash.

I wonder what the best way to fix this is. Should the referenced geom be explicitly cleared from the buffer after the failed write? Should the msdosfs driver call fsync() before g_vfs_destroy()-ing the geom on mount failure, so that the dirty buffer is deleted before the geom device it references is deleted? Should geom devices be reference-counted by their buffers, and be forced to live until all buffers that reference them are deleted?
Repeating that test with UFS (newfs -U -j) also freezes the system, so sadly msdosfs is not the only filesystem affected. Using the ext2fs driver on a read-only EXT3 filesystem, however, works, printing some errors in dmesg:

g_vfs_done():ggate0[WRITE(offset=1024, length=1024)]error = 1
g_vfs_done():ggate0[WRITE(offset=1024, length=1024)]error = 1
g_vfs_done():ggate0[WRITE(offset=2048, length=1024)]error = 1
g_vfs_done():ggate0[WRITE(offset=268288, length=1024)]error = 1
g_vfs_done():ggate0[WRITE(offset=268288, length=1024)]error = 1

but it mounts successfully, denies writes with EPERM, and doesn't crash. Unmounting, however, fails with this ominous error:

fsync: giving up on dirty (error = 1)
0xfffff801df6a45a0: tag devfs, type VCHR
    usecount 1, writecount 0, refcount 7 rdev 0xfffff800d1092800
    flags (VI_ACTIVE)
    v_object 0xfffff80146eb3900 ref 0 pages 2 cleanbuf 1 dirtybuf 4
    lock type devfs: EXCL by thread 0xfffff800d1017000 (pid 1494, umount, tid 100515)
    dev ggate0

but still no crash.
If I add this code snippet (copied from somewhere in the ext2fs driver) so that vinvalbuf() is called before g_vfs_close(), the r/w mount of a r/o geom fails (EPERM) and the panic doesn't happen:

diff --git a/sys/fs/msdosfs/msdosfs_vfsops.c b/sys/fs/msdosfs/msdosfs_vfsops.c
index 76a238c5ff77..7477d5b03a1f 100644
--- a/sys/fs/msdosfs/msdosfs_vfsops.c
+++ b/sys/fs/msdosfs/msdosfs_vfsops.c
@@ -717,6 +717,12 @@ mountmsdosfs(struct vnode *devvp, struct mount *mp)
 error_exit:
 	if (bp)
 		brelse(bp);
+
+	vn_lock(devvp, LK_EXCLUSIVE | LK_RETRY);
+	if (vinvalbuf(devvp, 0, 0, 0) != 0)
+		panic("mountmsdosfs: dirty");
+	VOP_UNLOCK(devvp, 0);
+
 	if (cp != NULL) {
 		g_topology_lock();
 		g_vfs_close(cp);

Apart from not really understanding it, I don't like it: if every filesystem should call vinvalbuf() before g_vfs_close(), then shouldn't g_vfs_close() be calling vinvalbuf() itself? Also, msdosfs_unmount() doesn't call vinvalbuf() before its own g_vfs_close(), but it does call msdosfs_sync() and vflush(). Would those also work, or even be better? Fixing UFS will be a different story: msdosfs writes to the filesystem during mount [in markvoldirty()], so its panic is almost immediate, whereas I could only crash UFS by trying to write to a file after mounting.
*** Bug 67326 has been marked as a duplicate of this bug. ***
(In reply to Andriy Gapon from comment #1) Assuming that this analysis covers all cases, what would be your suggestion there, Andriy? Maybe in g_vfs_close() we should set bo_ops to a vector that contains a NOP strategy and returns success? Perhaps this should be an opt-in from the caller of g_vfs_close(). In that case, we might also stop passing V_SAVE to invalbuf().
(In reply to Konstantin Belousov from comment #9) Kostik, I need to refresh my memory of this problem. Meanwhile, I have a counter-question: is there a bwrite variant where a failure is a final failure with no side effects? I mean, if that bwrite variant fails, the buffer is just discarded; something with semantics similar to write(2).

My understanding of what happens now: bufwrite() runs this code:

	if ((oldflags & B_ASYNC) == 0) {
		int rtval = bufwait(bp);
		brelse(bp);
		return (rtval);

Then in brelse() the following block is executed:

	if (bp->b_iocmd == BIO_WRITE && (bp->b_ioflags & BIO_ERROR) &&
	    (bp->b_error != ENXIO || !LIST_EMPTY(&bp->b_dep)) &&
	    !(bp->b_flags & B_INVAL)) {
		/*
		 * Failed write, redirty.  All errors except ENXIO (which
		 * means the device is gone) are expected to be potentially
		 * transient - underlying media might work if tried again
		 * after EIO, and memory might be available after an ENOMEM.
		 *
		 * Do this also for buffers that failed with ENXIO, but have
		 * non-empty dependencies - the soft updates code might need
		 * to access the buffer to untangle them.
		 *
		 * Must clear BIO_ERROR to prevent pages from being scrapped.
		 */
		bp->b_ioflags &= ~BIO_ERROR;
		bdirty(bp);

I think it would be nice to have a way to instruct the buffer code to never retry a specific synchronous write request. If we had such a mechanism, we could use it in filesystems' mount code to check whether we can really write to the backing media. Or should a caller of bwrite() explicitly manipulate b_flags (e.g. B_INVAL, B_RELBUF) to get the desired effect?
(In reply to Andriy Gapon from comment #10) There is no 'write once and destroy on error' flag; it is impossible to achieve this with flags. The buffer cache tries hard not to lose user data; it is the caller's duty to destroy no-longer-needed dirty buffers.
(In reply to Konstantin Belousov from comment #11) Well, it's just very counter-intuitive (for someone not deeply familiar with the buffer cache, like me) that the default (and, in fact, only) behaviour for a failed _synchronous_ request is to retry. I mean, I've got the error, I might have acted on it, and I might even be surprised if the data gets written later. But this is the buffer cache, so I will just shut up.
"Write once and destroy user data on error" is a very surprising semantic for users. I'm not sure why we should do that.
(In reply to Conrad Meyer from comment #13) Indeed, if we are talking about the general behaviour. I see that I utterly failed to explain that I was thinking purely in the context of what msdosfs does in markvoldirty(). Right now that code can leave behind a perpetually dirty buffer, and I was thinking about how that can be avoided. Maybe markvoldirty() should do

	bp = getblk(...);
	bp->b_flags |= B_INVAL | B_RELBUF | B_NOCACHE;
	bp->b_flags &= ~(B_ASYNC | B_CACHE);
	brelse(bp);

after a failed write? Looks clumsy, but it should work. Or maybe markvoldirty() should not use the buffer cache for its write at all; it could use g_write_data(), for example. But that sounds like a layering violation.
(In reply to Konstantin Belousov from comment #9) This proposal looks like something that can help with the problem at hand in a generic way. I see that at present g_vfs_close() does not even check the return code of bufobj_invalbuf(). That looks dangerous, but I'm not sure the code can do better; e.g., whether looping on an error would be okay. Then there remains the question of when to use the proposed g_vfs_close(NOSAVE). There are obvious cases like a failed mount, and that's what we want to address now, so that's good. But there can be more complex situations. For example, we can mount a filesystem r/w and do some writes to it, but then the hardware suddenly decides to go read-only (e.g., because some internal counter crosses its threshold). So we will get some failed writes and some dirty buffers that cannot possibly be made clean by writing them out. Do we need to somehow handle them? And when we then decide to unmount the filesystem, how do we handle those buffers? How do we know whether to call g_vfs_close(SAVE) or g_vfs_close(NOSAVE)?
(In reply to Andriy Gapon from comment #14)

	bp = getblk(...);
	bp->b_flags |= B_INVAL | B_RELBUF | B_NOCACHE;
	bp->b_flags &= ~(B_ASYNC | B_CACHE);
	brelse(bp);

Maybe we need a bfinval(bp) that does that. "Forced" is the 'f', but I'm open to any other name. Basically: "Throw away this buffer I know is dirty." We have a 'flush' operation but no 'invalidate' or 'throw away' operation for these cache entries. If we centralize it, we'll have fewer gross hacks in the system that "should work" but gradually drift away from working because of some subtle change in what B_NOCACHE means five years from now...
(In reply to Warner Losh from comment #16) I very much agree with the idea of adding bfinval.
*** Bug 212043 has been marked as a duplicate of this bug. ***
*** Bug 211959 has been marked as a duplicate of this bug. ***
A commit references this bug:

Author: cem
Date: Wed Sep 11 21:24:17 UTC 2019
New revision: 352233
URL: https://svnweb.freebsd.org/changeset/base/352233

Log:
  buf: Add B_INVALONERR flag to discard data

  Setting the B_INVALONERR flag before a synchronous write causes the
  buf cache to forcibly invalidate contents if the write fails
  (BIO_ERROR).

  This is intended to be used to allow layers above the buffer cache to
  make more informed decisions about when discarding dirty buffers
  without successful write is acceptable.

  As a proof of concept, use in msdosfs to handle failures to mark the
  on-disk 'dirty' bit during rw mount or ro->rw update.

  Extending this to other filesystems is left as future work.

  PR:		210316
  Reviewed by:	kib (with objections)
  Sponsored by:	Dell EMC Isilon
  Differential Revision:	https://reviews.freebsd.org/D21539

Changes:
  head/sys/fs/msdosfs/fat.h
  head/sys/fs/msdosfs/msdosfs_fat.c
  head/sys/fs/msdosfs/msdosfs_vfsops.c
  head/sys/kern/vfs_bio.c
  head/sys/sys/buf.h
  head/usr.sbin/makefs/msdos/msdosfs_denode.c
  head/usr.sbin/makefs/msdos/msdosfs_fat.c
  head/usr.sbin/makefs/msdos/msdosfs_lookup.c
  head/usr.sbin/makefs/msdos/msdosfs_vfsops.c
  head/usr.sbin/makefs/msdos/msdosfs_vnops.c