I reproducible get the following (handtranscribed) panic when sending an zfs snapshot to a geli provider based on a labeled USB stick that disappears (due to a bug, or because it's unplugged): Fatal trap 12: page fault while in kernel mode cpuid = 0: apic id = 00 fault virtual address = 0x288 fault code = supervisor write data, page not present instruction pointer = 0x20:0xffffffff808e2984 stack pointer = 0x28:0xffffff800023fba0 frame pointer = 0x28:0xffffff800023fbb0 code segment = base rx0, limit 0xfffff, type 0x1b = DPL 0, pres 1, long 1, def32 0, gran 1 processor eflags = interrupt enabled, resume, IOPL = 0 current process = 13 (g_up) [ thread pid 13 tid 100010 ] Stopped at atomic_subtract_int+0x4: lock subl %esi,(%rdi) db> where Tracing pid 13 tid 100010 td 0xfffffe00027998c0 atomic_subtract_int() at atomic_subtract_int+0x4 g_io_schdule_up() at g_io_schedule_up+0xa6 g_up_procbody() at g_up_procbody+0x5c fork_exit() at fork_exit+0x11f fork_trampoline() at fork_trampoline+0xe --- trap 0, rip = 0, rsp = 0xffffff800023fd00, rbp 0 --- It seems to be important that ZFS is actually writing to the stick. If the stick is unplugged while the operation is stalled for other reasons, this particular panic doesn't seem to occur. In case the zpool layout isn't clear, it's the same as used in: http://www.freebsd.org/cgi/query-pr.cgi?pr=162010 While I end up in the debugger, dumping core doesn't work and produces a double fault. The panic above is from a custom kernel without WITNESS and INVARIANTS (a bit older than the uname -a output above), but enabling them doesn't seem to affect the trace before the double fault. I wasn't able to reproduce the panic by unplugging the stick while writing to the pool using dd (but only tried once). How-To-Repeat: Either use a flaky USB stick as geli-encrypted vdev and wait for it to disappear by itself while a zfs receive is in progress and actually writing to the device, or use a reliable stick and unplug it manually while zpool iostat is showing writes.
With the patch from: http://lists.freebsd.org/pipermail/freebsd-current/2011-November/029226.html and sources after r227637 dumping core works. I didn't test with r227637 and without the scheduler-stopping patch. fk@r500 /usr/crash $kgdb kernel.6/kernel vmcore.6 GNU gdb 6.1.1 [FreeBSD] Copyright 2004 Free Software Foundation, Inc. GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions. Type "show copying" to see the conditions. There is absolutely no warranty for GDB. Type "show warranty" for details. This GDB was configured as "amd64-marcel-freebsd"... Unread portion of the kernel message buffer: (da0:ugen7.2: <vendor 0x14cd> at usbus7 (disconnected) umass-sim0:0:0:0): Request completed with CAM_REQ_CMP_ERR umass0: at uhub7, port 2, addr 2 (disconnected) (da0:umass-sim0:0:0:0): Retrying command (da0:umass-sim0:0:0:0): Selection timeout (da0:umass-sim0:0:0:0): Retrying command (da0:umass-sim0:0:0:0): Selection timeout (da0:umass-sim0:0:0:0): Retrying command (da0:umass-sim0:0:0:0): Selection timeout (da0:umass-sim0:0:0:0): Retrying command (da0:umass-sim0:0:0:0): lost device - 1 outstanding GEOM_ELI(pass2:: umass-sim0:0:Crypto WRITE request failed (error=6).0: 0): lost device label/extreme.eli[WRITE(offset=2791427072, length=131072)] (pass2:umass-sim0:0:0:0): removing device entry GEOM_ELI(da0:umass-sim0:0:0:0): Error 6, Retries exhausted (da0:umass-sim0:0:0:0): oustanding 0 : Crypto WRITE request failed (error=6). label/extreme.eli[WRITE(offset=2791558144, length=131072)] GEOM_ELI: Crypto WRITE request failed (error=6). label/extreme.eli[WRITE(offset=2791689216, length=131072)] GEOM_ELI: Crypto WRITE request failed (error=6). label/extreme.eli[WRITE(offset=2791886848, length=131072)] GEOM_ELI: Crypto WRITE request failed (error=6). label/extreme.eli[WRITE(offset=2792017920, length=131072)] GEOM_ELI: Crypto WRITE request failed (error=6). label/extreme.eli[WRITE(offset=2792148992, length=131072)] GEOM_ELI: Crypto WRITE request failed (error=6). label/extreme.eli[WRITE(offset=2792280064, length=131072)] GEOM_ELI: Crypto WRITE request failed (error=6). label/extreme.eli[WRITE(offset=2792411136, length=131072)] GEOM_ELI: Crypto WRITE request failed (error=6). label/extreme.eli[WRITE(offset=2792542208, length=131072)] GEOM_ELI: Crypto WRITE request failed (error=6). label/extreme.eli[WRITE(offset=2791296000, length=131072)] GEOM_ELI: g_eli_read_done() failed label/extreme.eli[READ(offset=270336, length=8192)] GEOM_ELI: g_eli_read_done() failed label/extreme.eli[READ(offset=4022607872, length=8192)] GEOM_ELI: g_eli_read_done() failed label/extreme.eli[READ(offset=4022870016, length=8192)] zio_vdev_io_start: Setting zio->io_error to ENXIO for vdev extreme /dev/label/extreme.eli zio_vdev_io_start: Setting zio->io_error to ENXIO for vdev extreme /dev/label/extreme.eli zio_vdev_io_start: Setting zio->io_error to ENXIO for vdev extreme /dev/label/extreme.eli zio_vdev_io_start: Setting zio->io_error to ENXIO for vdev extreme /dev/label/extreme.eli zio_vdev_io_start: Setting zio->io_error to ENXIO for vdev extreme /dev/label/extreme.eli zio_vdev_io_start: Setting zio->io_error to ENXIO for vdev extreme /dev/label/extreme.eli zio_vdev_io_start: Setting zio->io_error to ENXIO for vdev extreme /dev/label/extreme.eli zio_vdev_io_start: Setting zio->io_error to ENXIO for vdev extreme /dev/label/extreme.eli zio_vdev_io_start: Setting zio->io_error to ENXIO for vdev extreme /dev/label/extreme.eli zio_vdev_io_start: Setting zio->io_error to ENXIO for vdev extreme /dev/label/extreme.eli GEOM_ELI: Crypto WRITE request failed (error=6). label/extreme.eli[WRITE(offset=2792673280, length=131072)] zio_vdev_io_start: Setting zio->io_error to ENXIO for vdev extreme /dev/label/extreme.eli GEOM_ELI: Crypto WRITE request failed (error=6). label/extreme.eli[WRITE(offset=2792804352, length=131072)] zio_vdev_io_start: Setting zio->io_error to ENXIO for vdev extreme /dev/label/extreme.eli GEOM_ELI: Crypto WRITE request failed (error=6). label/extreme.eli[WRITE(offset=2792935424, length=131072)] Fatal trap 12: page fault while in kernel mode (da0:cpuid = 1; apic id = 01 umass-sim0:0:fault virtual address = 0x288 0:fault code = supervisor write data, page not present 0): removing device entry instruction pointer = 0x20:0xffffffff808dbef4 stack pointer = 0x28:0xffffff8000244ba0 frame pointer = 0x28:0xffffff8000244bb0 code segment = base rx0, limit 0xfffff, type 0x1b = DPL 0, pres 1, long 1, def32 0, gran 1 processor eflags = interrupt enabled, resume, IOPL = 0 current process = 13 (g_up) Reading symbols from /boot/kernel/zfs.ko...Reading symbols from /boot/kernel/zfs.ko.symbols...done. done. [...] Reading symbols from /boot/kernel/fdescfs.ko...Reading symbols from /boot/kernel/fdescfs.ko.symbols...done. done. Loaded symbols for /boot/kernel/fdescfs.ko #0 doadump (textdump=0) at /usr/src/sys/kern/kern_shutdown.c:267 267 if (textdump && textdump_pending) { (kgdb) where #0 doadump (textdump=0) at /usr/src/sys/kern/kern_shutdown.c:267 #1 0xffffffff80315990 in db_dump (dummy=Variable "dummy" is not available. ) at /usr/src/sys/ddb/db_command.c:537 #2 0xffffffff803152c1 in db_command (last_cmdp=0xffffffff80df1800, cmd_table=Variable "cmd_table" is not available. ) at /usr/src/sys/ddb/db_command.c:448 #3 0xffffffff80315510 in db_command_loop () at /usr/src/sys/ddb/db_command.c:501 #4 0xffffffff80317649 in db_trap (type=Variable "type" is not available. ) at /usr/src/sys/ddb/db_main.c:229 #5 0xffffffff80690581 in kdb_trap (type=12, code=0, tf=0xffffff8000244af0) at /usr/src/sys/kern/subr_kdb.c:626 #6 0xffffffff808f48cd in trap_fatal (frame=0xffffff8000244af0, eva=Variable "eva" is not available. ) at /usr/src/sys/amd64/amd64/trap.c:814 #7 0xffffffff808f4c36 in trap_pfault (frame=0xffffff8000244af0, usermode=0) at /usr/src/sys/amd64/amd64/trap.c:735 #8 0xffffffff808f50ff in trap (frame=0xffffff8000244af0) at /usr/src/sys/amd64/amd64/trap.c:474 #9 0xffffffff808def33 in calltrap () at /usr/src/sys/amd64/amd64/exception.S:228 #10 0xffffffff808dbef4 in atomic_subtract_int (p=0x288, v=1) at atomic.h:289 #11 0xffffffff805eeb16 in g_io_schedule_up (tp=Variable "tp" is not available. ) at /usr/src/sys/geom/geom_io.c:679 #12 0xffffffff805ef09c in g_up_procbody (arg=Variable "arg" is not available. ) at /usr/src/sys/geom/geom_kern.c:97 #13 0xffffffff80628b6f in fork_exit (callout=0xffffffff805ef040 <g_up_procbody>, arg=0x0, frame=0xffffff8000244c50) at /usr/src/sys/kern/kern_fork.c:995 #14 0xffffffff808df45e in fork_trampoline () at /usr/src/sys/amd64/amd64/exception.S:602 #15 0x0000000000000000 in ?? () [...] #38 0x0000000000000000 in ?? () #39 0xffffffff80e2cf80 in tdq_cpu () #40 0x0000000000000001 in ?? () #41 0x0000000000000000 in ?? () #42 0xfffffe0002723460 in ?? () #43 0xffffff8000244360 in ?? () #44 0xffffff8000244308 in ?? () #45 0xfffffe0004ca9000 in ?? () ---Type <return> to continue, or q <return> to quit--- #46 0xffffffff80683250 in sched_switch (td=0xffffffff805ef040, newtd=0x0, flags=Variable "flags" is not available. ) at /usr/src/sys/kern/sched_ule.c:1853 (kgdb) f 11 #11 0xffffffff805eeb16 in g_io_schedule_up (tp=Variable "tp" is not available. ) at /usr/src/sys/geom/geom_io.c:679 679 biodone(bp); Current language: auto; currently c (kgdb) l 674 g_bioq_unlock(&g_bio_run_up); 675 THREAD_NO_SLEEPING(); 676 CTR4(KTR_GEOM, "g_up biodone bp %p provider %s off " 677 "%jd len %ld", bp, bp->bio_to->name, 678 bp->bio_offset, bp->bio_length); 679 biodone(bp); 680 THREAD_SLEEPING_OK(); 681 continue; 682 } 683 CTR0(KTR_GEOM, "g_up going to sleep"); (kgdb) p *bp $1 = {bio_cmd = 2 '\002', bio_flags = 2 '\002', bio_cflags = 0 '\0', bio_pflags = 0 '\0', bio_dev = 0x0, bio_disk = 0x0, bio_offset = 2792935424, bio_bcount = 131072, bio_data = 0xffffff800a581000 <Address 0xffffff800a581000 out of bounds>, bio_error = 6, bio_resid = 131072, bio_done = 0xffffffff813597d0 <g_eli_write_done>, bio_driver1 = 0x0, bio_driver2 = 0x0, bio_caller1 = 0x0, bio_caller2 = 0x0, bio_queue = {tqe_next = 0x0, tqe_prev = 0xffffffff80e10f60}, bio_attribute = 0x0, bio_from = 0xfffffe0012877d00, bio_to = 0xfffffe0004c65100, bio_length = 131072, bio_completed = 0, bio_children = 0, bio_inbed = 0, bio_parent = 0xfffffe0017121bc8, bio_t0 = {sec = 201, frac = 2813861954672982723}, bio_task = 0, bio_task_arg = 0x0, bio_classifier1 = 0x0, bio_classifier2 = 0x0, bio_pblkno = 0} The "zio_vdev_io_start: ..." messages are the result of: commit 541719565e4daf8cb6f93896013ae57b867546c4 Author: Fabian Keil <fk@fabiankeil.de> Date: Mon Mar 14 18:27:13 2011 +0100 Print a message when ENXIOing a device diff --git a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c b/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c index b38fb0b..7269258 100644 --- a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c +++ b/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c @@ -2370,6 +2370,8 @@ zio_vdev_io_start(zio_t *zio) return (ZIO_PIPELINE_STOP); if (!vdev_accessible(vd, zio)) { + printf("%s: Setting zio->io_error to ENXIO for vdev %s %s\n", + __func__, vd->vdev_spa->spa_name, vd->vdev_path); zio->io_error = ENXIO; zio_interrupt(zio); return (ZIO_PIPELINE_STOP); Fabian
Created attachment 185874 [details] Let g_eli deal with lost devices without crashing I've been using the attached patchset since 2013 to prevent the crashes and just noticed that I hadn't attached it to the PR yet. Obtained from: ElectroBSD
batch change of PRs untouched in 2018 marked "in progress" back to open.
This code looks like it is something that should be validated/used. What do you think?
I just looked in to the patch and seems clean and straightforward. However I was trying to reproduce your problem by: - creating label on external device - encrypting the label - creating ZFS on top of it - during receiving snapshot disconnect the device And unfortunately I could reproduce bug. I know the bug is quite old but I very determined to reproduce that. Could you tell my a little more how can I reproduce that? (or if you still have this issue without the patch)
A commit references this bug: Author: oshogbo Date: Sun Jul 15 18:03:20 UTC 2018 New revision: 336310 URL: https://svnweb.freebsd.org/changeset/base/336310 Log: Let geli deal with lost devices without crashing. PR: 162036 Submitted by: Fabian Keil <fk@fabiankeil.de> Obtained from: ElectroBSD Discussed with: pjd@ Changes: head/sys/geom/eli/g_eli.c head/sys/geom/eli/g_eli_privacy.c
(In reply to commit-hook from comment #6) Mariusz: are you planning to MFC this, or can it be closed?
A commit references this bug: Author: oshogbo Date: Sun Sep 30 12:25:39 UTC 2018 New revision: 339023 URL: https://svnweb.freebsd.org/changeset/base/339023 Log: MFC r336310: Let geli deal with lost devices without crashing. PR: 162036 Submitted by: Fabian Keil <fk@fabiankeil.de> Obtained from: ElectroBSD Discussed with: pjd@ Changes: _U stable/11/ stable/11/sys/geom/eli/g_eli.c stable/11/sys/geom/eli/g_eli_privacy.c
Allan: Done!