Bug 162036 - [geom] Fatal trap 12: page fault while in kernel mode -- Stopped at atomic_subtract_int+0x4
Summary: [geom] Fatal trap 12: page fault while in kernel mode -- Stopped at atomic_su...
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: Unspecified
Hardware: Any Any
: Normal Affects Only Me
Assignee: freebsd-geom (Nobody)
URL:
Keywords: patch
Depends on:
Blocks:
 
Reported: 2011-10-26 18:40 UTC by Fabian Keil
Modified: 2018-09-30 12:26 UTC (History)
3 users (show)

See Also:


Attachments
Let g_eli deal with lost devices without crashing (2.44 KB, patch)
2017-08-29 12:52 UTC, Fabian Keil
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Fabian Keil 2011-10-26 18:40:08 UTC
I reproducible get the following (handtranscribed) panic
when sending an zfs snapshot to a geli provider based on a
labeled USB stick that disappears (due to a bug, or because
it's unplugged): 

Fatal trap 12: page fault while in kernel mode
cpuid = 0: apic id = 00
fault virtual address = 0x288
fault code	      = supervisor write data, page not present
instruction pointer   = 0x20:0xffffffff808e2984
stack pointer         = 0x28:0xffffff800023fba0
frame pointer         = 0x28:0xffffff800023fbb0
code segment          = base rx0, limit 0xfffff, type 0x1b
                      = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags      = interrupt enabled, resume, IOPL = 0
current process       = 13 (g_up)
[ thread pid 13 tid 100010 ]
Stopped at    atomic_subtract_int+0x4: lock subl %esi,(%rdi)
db> where  
Tracing pid 13 tid 100010 td 0xfffffe00027998c0
atomic_subtract_int() at atomic_subtract_int+0x4
g_io_schdule_up() at g_io_schedule_up+0xa6
g_up_procbody() at g_up_procbody+0x5c
fork_exit() at fork_exit+0x11f
fork_trampoline() at fork_trampoline+0xe
--- trap 0, rip = 0, rsp = 0xffffff800023fd00, rbp 0 ---

It seems to be important that ZFS is actually writing to the stick.
If the stick is unplugged while the operation is stalled for other
reasons, this particular panic doesn't seem to occur.

In case the zpool layout isn't clear, it's the same as used in:
http://www.freebsd.org/cgi/query-pr.cgi?pr=162010

While I end up in the debugger, dumping core doesn't work
and produces a double fault.

The panic above is from a custom kernel without WITNESS and
INVARIANTS (a bit older than the uname -a output above),
but enabling them doesn't seem to affect the trace
before the double fault.

I wasn't able to reproduce the panic by unplugging the stick
while writing to the pool using dd (but only tried once).

How-To-Repeat: Either use a flaky USB stick as geli-encrypted vdev and wait
for it to disappear by itself while a zfs receive is in progress
and actually writing to the device, or use a reliable stick and
unplug it manually while zpool iostat is showing writes.
Comment 1 Fabian Keil 2011-11-18 12:08:43 UTC
With the patch from:
http://lists.freebsd.org/pipermail/freebsd-current/2011-November/029226.html
and sources after r227637 dumping core works. I didn't test with r227637
and without the scheduler-stopping patch.

fk@r500 /usr/crash $kgdb kernel.6/kernel vmcore.6
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...

Unread portion of the kernel message buffer:
(da0:ugen7.2: <vendor 0x14cd> at usbus7 (disconnected)
umass-sim0:0:0:0): Request completed with CAM_REQ_CMP_ERR
umass0: at uhub7, port 2, addr 2 (disconnected)
(da0:umass-sim0:0:0:0): Retrying command
(da0:umass-sim0:0:0:0): Selection timeout
(da0:umass-sim0:0:0:0): Retrying command
(da0:umass-sim0:0:0:0): Selection timeout
(da0:umass-sim0:0:0:0): Retrying command
(da0:umass-sim0:0:0:0): Selection timeout
(da0:umass-sim0:0:0:0): Retrying command
(da0:umass-sim0:0:0:0): lost device - 1 outstanding
GEOM_ELI(pass2:: umass-sim0:0:Crypto WRITE request failed (error=6).0: 0): lost device
label/extreme.eli[WRITE(offset=2791427072, length=131072)]
(pass2:umass-sim0:0:0:0): removing device entry
GEOM_ELI(da0:umass-sim0:0:0:0): Error 6, Retries exhausted
(da0:umass-sim0:0:0:0): oustanding 0
: Crypto WRITE request failed (error=6). label/extreme.eli[WRITE(offset=2791558144, length=131072)]
GEOM_ELI: Crypto WRITE request failed (error=6). label/extreme.eli[WRITE(offset=2791689216, length=131072)]
GEOM_ELI: Crypto WRITE request failed (error=6). label/extreme.eli[WRITE(offset=2791886848, length=131072)]
GEOM_ELI: Crypto WRITE request failed (error=6). label/extreme.eli[WRITE(offset=2792017920, length=131072)]
GEOM_ELI: Crypto WRITE request failed (error=6). label/extreme.eli[WRITE(offset=2792148992, length=131072)]
GEOM_ELI: Crypto WRITE request failed (error=6). label/extreme.eli[WRITE(offset=2792280064, length=131072)]
GEOM_ELI: Crypto WRITE request failed (error=6). label/extreme.eli[WRITE(offset=2792411136, length=131072)]
GEOM_ELI: Crypto WRITE request failed (error=6). label/extreme.eli[WRITE(offset=2792542208, length=131072)]
GEOM_ELI: Crypto WRITE request failed (error=6). label/extreme.eli[WRITE(offset=2791296000, length=131072)]
GEOM_ELI: g_eli_read_done() failed label/extreme.eli[READ(offset=270336, length=8192)]
GEOM_ELI: g_eli_read_done() failed label/extreme.eli[READ(offset=4022607872, length=8192)]
GEOM_ELI: g_eli_read_done() failed label/extreme.eli[READ(offset=4022870016, length=8192)]
zio_vdev_io_start: Setting zio->io_error to ENXIO for vdev extreme /dev/label/extreme.eli
zio_vdev_io_start: Setting zio->io_error to ENXIO for vdev extreme /dev/label/extreme.eli
zio_vdev_io_start: Setting zio->io_error to ENXIO for vdev extreme /dev/label/extreme.eli
zio_vdev_io_start: Setting zio->io_error to ENXIO for vdev extreme /dev/label/extreme.eli
zio_vdev_io_start: Setting zio->io_error to ENXIO for vdev extreme /dev/label/extreme.eli
zio_vdev_io_start: Setting zio->io_error to ENXIO for vdev extreme /dev/label/extreme.eli
zio_vdev_io_start: Setting zio->io_error to ENXIO for vdev extreme /dev/label/extreme.eli
zio_vdev_io_start: Setting zio->io_error to ENXIO for vdev extreme /dev/label/extreme.eli
zio_vdev_io_start: Setting zio->io_error to ENXIO for vdev extreme /dev/label/extreme.eli
zio_vdev_io_start: Setting zio->io_error to ENXIO for vdev extreme /dev/label/extreme.eli
GEOM_ELI: Crypto WRITE request failed (error=6). label/extreme.eli[WRITE(offset=2792673280, length=131072)]
zio_vdev_io_start: Setting zio->io_error to ENXIO for vdev extreme /dev/label/extreme.eli
GEOM_ELI: Crypto WRITE request failed (error=6). label/extreme.eli[WRITE(offset=2792804352, length=131072)]
zio_vdev_io_start: Setting zio->io_error to ENXIO for vdev extreme /dev/label/extreme.eli
GEOM_ELI: Crypto WRITE request failed (error=6). label/extreme.eli[WRITE(offset=2792935424, length=131072)]

Fatal trap 12: page fault while in kernel mode
(da0:cpuid = 1; apic id = 01
umass-sim0:0:fault virtual address      = 0x288
0:fault code            = supervisor write data, page not present
0): removing device entry
instruction pointer     = 0x20:0xffffffff808dbef4
stack pointer           = 0x28:0xffffff8000244ba0
frame pointer           = 0x28:0xffffff8000244bb0
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 13 (g_up)

Reading symbols from /boot/kernel/zfs.ko...Reading symbols from /boot/kernel/zfs.ko.symbols...done.
done.
[...]
Reading symbols from /boot/kernel/fdescfs.ko...Reading symbols from /boot/kernel/fdescfs.ko.symbols...done.
done.
Loaded symbols for /boot/kernel/fdescfs.ko
#0  doadump (textdump=0) at /usr/src/sys/kern/kern_shutdown.c:267
267             if (textdump && textdump_pending) {
(kgdb) where
#0  doadump (textdump=0) at /usr/src/sys/kern/kern_shutdown.c:267
#1  0xffffffff80315990 in db_dump (dummy=Variable "dummy" is not available.
) at /usr/src/sys/ddb/db_command.c:537
#2  0xffffffff803152c1 in db_command (last_cmdp=0xffffffff80df1800, cmd_table=Variable "cmd_table" is not available.
) at /usr/src/sys/ddb/db_command.c:448
#3  0xffffffff80315510 in db_command_loop () at /usr/src/sys/ddb/db_command.c:501
#4  0xffffffff80317649 in db_trap (type=Variable "type" is not available.
) at /usr/src/sys/ddb/db_main.c:229
#5  0xffffffff80690581 in kdb_trap (type=12, code=0, tf=0xffffff8000244af0) at /usr/src/sys/kern/subr_kdb.c:626
#6  0xffffffff808f48cd in trap_fatal (frame=0xffffff8000244af0, eva=Variable "eva" is not available.
) at /usr/src/sys/amd64/amd64/trap.c:814
#7  0xffffffff808f4c36 in trap_pfault (frame=0xffffff8000244af0, usermode=0) at /usr/src/sys/amd64/amd64/trap.c:735
#8  0xffffffff808f50ff in trap (frame=0xffffff8000244af0) at /usr/src/sys/amd64/amd64/trap.c:474
#9  0xffffffff808def33 in calltrap () at /usr/src/sys/amd64/amd64/exception.S:228
#10 0xffffffff808dbef4 in atomic_subtract_int (p=0x288, v=1) at atomic.h:289
#11 0xffffffff805eeb16 in g_io_schedule_up (tp=Variable "tp" is not available.
) at /usr/src/sys/geom/geom_io.c:679
#12 0xffffffff805ef09c in g_up_procbody (arg=Variable "arg" is not available.
) at /usr/src/sys/geom/geom_kern.c:97
#13 0xffffffff80628b6f in fork_exit (callout=0xffffffff805ef040 <g_up_procbody>, arg=0x0, frame=0xffffff8000244c50) at /usr/src/sys/kern/kern_fork.c:995
#14 0xffffffff808df45e in fork_trampoline () at /usr/src/sys/amd64/amd64/exception.S:602
#15 0x0000000000000000 in ?? ()
[...]
#38 0x0000000000000000 in ?? ()
#39 0xffffffff80e2cf80 in tdq_cpu ()
#40 0x0000000000000001 in ?? ()
#41 0x0000000000000000 in ?? ()
#42 0xfffffe0002723460 in ?? ()
#43 0xffffff8000244360 in ?? ()
#44 0xffffff8000244308 in ?? ()
#45 0xfffffe0004ca9000 in ?? ()
---Type <return> to continue, or q <return> to quit---
#46 0xffffffff80683250 in sched_switch (td=0xffffffff805ef040, newtd=0x0, flags=Variable "flags" is not available.
) at /usr/src/sys/kern/sched_ule.c:1853
(kgdb) f 11
#11 0xffffffff805eeb16 in g_io_schedule_up (tp=Variable "tp" is not available.
) at /usr/src/sys/geom/geom_io.c:679
679                             biodone(bp);
Current language:  auto; currently c
(kgdb) l
674                             g_bioq_unlock(&g_bio_run_up);
675                             THREAD_NO_SLEEPING();
676                             CTR4(KTR_GEOM, "g_up biodone bp %p provider %s off "
677                                 "%jd len %ld", bp, bp->bio_to->name,
678                                 bp->bio_offset, bp->bio_length);
679                             biodone(bp);
680                             THREAD_SLEEPING_OK();
681                             continue;
682                     }
683                     CTR0(KTR_GEOM, "g_up going to sleep");
(kgdb) p *bp
$1 = {bio_cmd = 2 '\002', bio_flags = 2 '\002', bio_cflags = 0 '\0', bio_pflags = 0 '\0', bio_dev = 0x0, bio_disk = 0x0, bio_offset = 2792935424, bio_bcount = 131072,
  bio_data = 0xffffff800a581000 <Address 0xffffff800a581000 out of bounds>, bio_error = 6, bio_resid = 131072, bio_done = 0xffffffff813597d0 <g_eli_write_done>,
  bio_driver1 = 0x0, bio_driver2 = 0x0, bio_caller1 = 0x0, bio_caller2 = 0x0, bio_queue = {tqe_next = 0x0, tqe_prev = 0xffffffff80e10f60}, bio_attribute = 0x0,
  bio_from = 0xfffffe0012877d00, bio_to = 0xfffffe0004c65100, bio_length = 131072, bio_completed = 0, bio_children = 0, bio_inbed = 0, bio_parent = 0xfffffe0017121bc8,
  bio_t0 = {sec = 201, frac = 2813861954672982723}, bio_task = 0, bio_task_arg = 0x0, bio_classifier1 = 0x0, bio_classifier2 = 0x0, bio_pblkno = 0}

The "zio_vdev_io_start: ..." messages are the result of:

commit 541719565e4daf8cb6f93896013ae57b867546c4
Author: Fabian Keil <fk@fabiankeil.de>
Date:   Mon Mar 14 18:27:13 2011 +0100

    Print a message when ENXIOing a device

diff --git a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c b/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c
index b38fb0b..7269258 100644
--- a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c
+++ b/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c
@@ -2370,6 +2370,8 @@ zio_vdev_io_start(zio_t *zio)
                        return (ZIO_PIPELINE_STOP);

                if (!vdev_accessible(vd, zio)) {
+                       printf("%s: Setting zio->io_error to ENXIO for vdev %s %s\n",
+                              __func__, vd->vdev_spa->spa_name, vd->vdev_path);
                        zio->io_error = ENXIO;
                        zio_interrupt(zio);
                        return (ZIO_PIPELINE_STOP);

Fabian
Comment 2 Fabian Keil 2017-08-29 12:52:22 UTC
Created attachment 185874 [details]
Let g_eli deal with lost devices without crashing

I've been using the attached patchset since 2013 to prevent the crashes
and just noticed that I hadn't attached it to the PR yet.

Obtained from: ElectroBSD
Comment 3 Eitan Adler freebsd_committer freebsd_triage 2018-05-23 10:27:47 UTC
batch change of PRs untouched in 2018 marked "in progress" back to open.
Comment 4 Sean Bruno freebsd_committer freebsd_triage 2018-05-23 14:18:38 UTC
This code looks like it is something that should be validated/used.  What do you think?
Comment 5 Mariusz Zaborski freebsd_committer freebsd_triage 2018-06-05 22:59:09 UTC
I just looked in to the patch and seems clean and straightforward.
However I was trying to reproduce your problem by:
- creating label on external device
- encrypting the label
- creating ZFS on top of it
- during receiving snapshot disconnect the device

And unfortunately I could reproduce bug.
I know the bug is quite old but I very determined to reproduce that.
Could you tell my a little more how can I reproduce that? (or if you still have this issue without the patch)
Comment 6 commit-hook freebsd_committer freebsd_triage 2018-07-15 18:03:48 UTC
A commit references this bug:

Author: oshogbo
Date: Sun Jul 15 18:03:20 UTC 2018
New revision: 336310
URL: https://svnweb.freebsd.org/changeset/base/336310

Log:
  Let geli deal with lost devices without crashing.

  PR:		162036
  Submitted by:	Fabian Keil <fk@fabiankeil.de>
  Obtained from:	ElectroBSD
  Discussed with: pjd@

Changes:
  head/sys/geom/eli/g_eli.c
  head/sys/geom/eli/g_eli_privacy.c
Comment 7 Allan Jude freebsd_committer freebsd_triage 2018-09-29 14:10:38 UTC
(In reply to commit-hook from comment #6)
Mariusz: are you planning to MFC this, or can it be closed?
Comment 8 commit-hook freebsd_committer freebsd_triage 2018-09-30 12:25:45 UTC
A commit references this bug:

Author: oshogbo
Date: Sun Sep 30 12:25:39 UTC 2018
New revision: 339023
URL: https://svnweb.freebsd.org/changeset/base/339023

Log:
  MFC r336310:
    Let geli deal with lost devices without crashing.

    PR:           162036
    Submitted by: Fabian Keil <fk@fabiankeil.de>
    Obtained from:        ElectroBSD
    Discussed with: pjd@

Changes:
_U  stable/11/
  stable/11/sys/geom/eli/g_eli.c
  stable/11/sys/geom/eli/g_eli_privacy.c
Comment 9 Mariusz Zaborski freebsd_committer freebsd_triage 2018-09-30 12:26:16 UTC
Allan: Done!