Bug 221804

Summary: [patch] page fault during shutdown on a server with gjournaled partitions
Product: Base System Reporter: longwitz
Component: kernAssignee: freebsd-geom (Nobody) <geom>
Status: Closed FIXED    
Severity: Affects Some People CC: eugen, longwitz, mckusick
Priority: --- Keywords: patch-ready
Version: 10.3-STABLE   
Hardware: Any   
OS: Any   
Attachments:
Description Flags
Patch is against HEAD and should work for FreeBSD 10/11 too. none

Description longwitz 2017-08-25 14:55:37 UTC
I am running FreeBSD 10.3-STABLE #1 r317936 using gjournal/gmirror and got
a page fault of the process g_journal switcher during shutdown:

...
GEOM_JOURNAL: Shutting down geom gjournal 1247376231.
GEOM_JOURNAL: Shutting down geom gjournal 1068288894.

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address   = 0x378
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff806aa1ea
stack pointer           = 0x28:0xfffffe104445fa10
frame pointer           = 0x28:0xfffffe104445fab0
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 6 (g_journal switcher)
[ thread pid 6 tid 100132 ]
Stopped at      __mtx_lock_sleep+0x1da: movl    0x378(%rbx),%eax

(kgdb) where
...
#10 0xffffffff80954abc in calltrap () at /usr/src/sys/amd64/amd64/exception.S:238
#11 0xffffffff806aa1ea in __mtx_lock_sleep (c=0xfffff8003823d030, tid=18446735278556155904, opts=0, file=0x0, line=1145437376)
    at /usr/src/sys/kern/kern_mutex.c:506
#12 0xffffffff806a9ff7 in __mtx_lock_flags (c=<value optimized out>, opts=<value optimized out>, file=0x0, line=0)
    at /usr/src/sys/kern/kern_mutex.c:256
#13 0xffffffff811686bf in g_journal_switcher (arg=0xffffffff81170e48)
    at /usr/src/sys/modules/geom/geom_journal/../../../geom/journal/g_journal.c:2945
#14 0xffffffff8068daba in fork_exit (callout=0xffffffff81167a50 <g_journal_switcher>, arg=0xffffffff81170e48,
    frame=0xfffffe104445fc00) at /usr/src/sys/kern/kern_fork.c:1030
#15 0xffffffff80954ffe in fork_trampoline () at /usr/src/sys/amd64/amd64/exception.S:613

(kgdb) f 13
#13 0xffffffff811686bf in g_journal_switcher (arg=0xffffffff81170e48)
    at /usr/src/sys/modules/geom/geom_journal/../../../geom/journal/g_journal.c:2945
2945                            mtx_lock(&sc->sc_mtx);

(kgdb) list
2940                    g_topology_lock();
2941                    LIST_FOREACH(gp, &g_journal_class.geom, geom) {
2942                            sc = gp->softc;
2943                            if (sc == NULL)
2944                                    continue;
2945                            mtx_lock(&sc->sc_mtx);
2946                            if ((sc->sc_type & GJ_TYPE_COMPLETE) == GJ_TYPE_COMPLETE &&
2947                                !(sc->sc_flags & GJF_DEVICE_DESTROY) &&
2948                                (sc->sc_flags & GJF_DEVICE_BEFORE_SWITCH)) {
2949                                    break;

(kgdb) p *g_journal_class.geom->lh_first
$1 = {name = 0xfffff80038239560 "gjournal 1068288894", class = 0xffffffff81170e48, geom = {le_next = 0x0,
    le_prev = 0xffffffff81170ef0}, consumer = {lh_first = 0xfffff800382c5900}, provider = {lh_first = 0x0}, geoms = {
    tqe_next = 0xfffff8003822a300, tqe_prev = 0xfffff8003826c330}, rank = 4, start = 0xffffffff8116bcd0 <g_journal_start>,
  spoiled = 0, attrchanged = 0, dumpconf = 0xffffffff811674e0 <g_journal_dumpconf>, access = 0xffffffff8116bfe0 <g_journal_access>,
  orphan = 0xffffffff8116be30 <g_journal_orphan>, ioctl = 0, providergone = 0, resize = 0, spare0 = 0x0, spare1 = 0x0,
  softc = 0xfffff8003823d000, flags = 2}
(kgdb) p *(struct g_journal_softc *)((g_journal_class.geom->lh_first))->softc
$2 = {sc_id = 1068288894, sc_type = 3 '\003', sc_orig_type = 1 '\001', sc_geom = 0xfffff800380bce00, sc_flags = 144, sc_mtx = {
    lock_object = {lo_name = 0xffffffff8116f916 "gjournal", lo_flags = 16908288, lo_data = 0, lo_witness = 0x0}, mtx_lock = 6},
  sc_mediasize = 1073741312, sc_sectorsize = 512, sc_bio_flush = 3, sc_journal_id = 1627568564, sc_journal_next_id = 163691452,
  sc_journal_copying = 0, sc_journal_offset = 59226624, sc_journal_previous_id = 3706521218, sc_back_queue = {queue = {
      tqh_first = 0x0, tqh_last = 0xfffff8003823d068}, last_offset = 805502976, insert_point = 0x0}, sc_regular_queue = {queue = {
      tqh_first = 0x0, tqh_last = 0xfffff8003823d088}, last_offset = 65536, insert_point = 0x0}, sc_delayed_queue = {queue = {
      tqh_first = 0x0, tqh_last = 0xfffff8003823d0a8}, last_offset = 0, insert_point = 0x0}, sc_delayed_count = 0,
  sc_current_queue = 0x0, sc_current_count = 0, sc_flush_queue = 0x0, sc_flush_count = 0, sc_flush_in_progress = 0,
  sc_copy_queue = 0x0, sc_copy_in_progress = 0, sc_dconsumer = 0xfffff800382c5900, sc_jconsumer = 0xfffff8003822f900,
  sc_inactive = {jj_queue = 0x0, jj_offset = 59142656}, sc_active = {jj_queue = 0x0, jj_offset = 59226112}, sc_jstart = 0,
  sc_jend = 8589934080, sc_callout = {c_links = {le = {le_next = 0x0, le_prev = 0xfffffe0000803f00}, sle = {sle_next = 0x0}, tqe = {
        tqe_next = 0x0, tqe_prev = 0xfffffe0000803f00}}, c_time = 33825373681, c_precision = 1342177187,
    c_arg = 0xfffff8003823d000, c_func = 0xffffffff8116c190 <g_journal_timeout>, c_lock = 0x0, c_flags = 0, c_iflags = 16,
    c_cpu = 0}, sc_worker = 0x0, sc_rootmount = 0x0}

(kgdb) f 11
#11 0xffffffff806aa1ea in __mtx_lock_sleep (c=0xfffff8003823d030, tid=18446735278556155904, opts=0, file=0x0, line=1145437376)
    at /usr/src/sys/kern/kern_mutex.c:506
506                             owner = (struct thread *)(v & ~MTX_FLAGMASK);
(kgdb) list
501                      * If the owner is running on another CPU, spin until the
502                      * owner stops running or the state of the lock changes.
503                      */
504                     v = m->mtx_lock;
505                     if (v != MTX_UNOWNED) {
506                             owner = (struct thread *)(v & ~MTX_FLAGMASK);
507                             if (TD_IS_RUNNING(owner)) {
508                                     if (LOCK_LOG_TEST(&m->lock_object, 0))
509                                             CTR3(KTR_LOCK,
510                                                 "%s: spinning on %p held by %p",

(kgdb) p *c
$2 = 6
(kgdb) p lda
$3 = {config = 0xffffffff80d70e48, delay = 0, spin_cnt = 1}

The first parameter on call of __mtx_lock_sleep is the address of sc->sc_mtx->mtx_lock with value MTX_DESTROYED (=6), and so we have v = m->mtx_lock = 6, v & ~MTX_FLAGMASK) = 0 with page fault in line 506. The leak here ist that the case v = MTX_DESTROYED is not handled in this function __mtx_lock_sleep().

I think we have a race condition between g_journal_switcher and the process "PID 1" who calls g_journal_shutdown() for every gjournaled partition on shutdown.

(kgdb) proc 1
[Switching to thread 301 (Thread 100002)]#0  sched_switch (td=0xfffff8000d2ff500, newtd=<value optimized out>,
    flags=<value optimized out>) at /usr/src/sys/kern/sched_ule.c:1956
1956                    cpuid = PCPU_GET(cpuid);
(kgdb) where
#0  sched_switch (td=0xfffff8000d2ff500, newtd=<value optimized out>, flags=<value optimized out>)
    at /usr/src/sys/kern/sched_ule.c:1956
#1  0xffffffff806cb316 in mi_switch (flags=0, newtd=0x0) at /usr/src/sys/kern/kern_synch.c:473
#2  0xffffffff80710c1a in sleepq_wait (wchan=0x0, pri=0) at /usr/src/sys/kern/subr_sleepqueue.c:632
#3  0xffffffff806cacea in _sleep (ident=<value optimized out>, lock=<value optimized out>, priority=<value optimized out>,
    wmesg=<value optimized out>, sbt=<value optimized out>, pr=<value optimized out>, flags=<value optimized out>)
    at /usr/src/sys/kern/kern_synch.c:242
#4  0xffffffff8075e950 in biowait (bp=<value optimized out>, wchan=<value optimized out>) at /usr/src/sys/kern/vfs_bio.c:3674
#5  0xffffffff8064a0b5 in g_write_data (cp=<value optimized out>, offset=146815737344, ptr=0xfffff80609429c00, length=512)
    at /usr/src/sys/geom/geom_io.c:937
#6  0xffffffff8118c8df in g_mirror_write_metadata (disk=0xfffff800380b4100, md=<value optimized out>)
    at /usr/src/sys/modules/geom/geom_mirror/../../../geom/mirror/g_mirror.c:675
#7  0xffffffff8118c495 in g_mirror_update_metadata (disk=0xfffff800380b4100)
    at /usr/src/sys/modules/geom/geom_mirror/../../../geom/mirror/g_mirror.c:774
#8  0xffffffff8118ea6e in g_mirror_idle (sc=0xfffff80038059c00, acw=<value optimized out>)
    at /usr/src/sys/modules/geom/geom_mirror/../../../geom/mirror/g_mirror.c:860
#9  0xffffffff8118f058 in g_mirror_access (pp=<value optimized out>, acr=<value optimized out>, acw=<value optimized out>,
    ace=<value optimized out>) at /usr/src/sys/modules/geom/geom_mirror/../../../geom/mirror/g_mirror.c:2914
#10 0xffffffff8064c16e in g_access (cp=0xfffff80038261700, dcr=-1, dcw=-1, dce=-2) at /usr/src/sys/geom/geom_subr.c:919
#11 0xffffffff8064c16e in g_access (cp=0xfffff800382c5900, dcr=-1, dcw=-1, dce=-1) at /usr/src/sys/geom/geom_subr.c:919
#12 0xffffffff81169295 in g_journal_destroy (sc=0xfffff8003823d000)
    at /usr/src/sys/modules/geom/geom_journal/../../../geom/journal/g_journal.c:2431
#13 0xffffffff81167741 in g_journal_shutdown (arg=<value optimized out>, howto=0)
    at /usr/src/sys/modules/geom/geom_journal/../../../geom/journal/g_journal.c:2658
#14 0xffffffff806c11bc in kern_reboot (howto=16384) at /usr/src/sys/kern/kern_shutdown.c:483
#15 0xffffffff806c0acf in sys_reboot (td=<value optimized out>, uap=<value optimized out>) at /usr/src/sys/kern/kern_shutdown.c:229
#16 0xffffffff80970cd2 in amd64_syscall (td=0xfffff8000d2ff500, traced=0) at subr_syscall.c:141
#17 0xffffffff80954dab in Xfast_syscall () at /usr/src/sys/amd64/amd64/exception.S:398

(kgdb) f 12
#12 0xffffffff81169295 in g_journal_destroy (sc=0xfffff8003823d000)
    at /usr/src/sys/modules/geom/geom_journal/../../../geom/journal/g_journal.c:2431
2431                            g_access(cp, -1, -1, -1);

(kgdb) list
2426                        sc->sc_current_count);
2427            }
2428
2429            LIST_FOREACH(cp, &gp->consumer, consumer) {
2430                    if (cp->acr + cp->acw + cp->ace > 0)
2431                            g_access(cp, -1, -1, -1);
2432                    /*
2433                     * We keep all consumers open for writting, so if I'll detach
2434                     * and destroy consumer here, I'll get providers for taste, so
2435                     * journal will be started again.

(kgdb) p *sc
$4 = {sc_id = 1068288894, sc_type = 3 '\003', sc_orig_type = 1 '\001', sc_geom = 0xfffff800380bce00, sc_flags = 144, sc_mtx = {
    lock_object = {lo_name = 0xffffffff8116f916 "gjournal", lo_flags = 16908288, lo_data = 0, lo_witness = 0x0}, mtx_lock = 6},
  sc_mediasize = 1073741312, sc_sectorsize = 512, sc_bio_flush = 3, sc_journal_id = 1627568564, sc_journal_next_id = 163691452,
  sc_journal_copying = 0, sc_journal_offset = 59226624, sc_journal_previous_id = 3706521218, sc_back_queue = {queue = {
      tqh_first = 0x0, tqh_last = 0xfffff8003823d068}, last_offset = 805502976, insert_point = 0x0}, sc_regular_queue = {queue = {
      tqh_first = 0x0, tqh_last = 0xfffff8003823d088}, last_offset = 65536, insert_point = 0x0}, sc_delayed_queue = {queue = {
      tqh_first = 0x0, tqh_last = 0xfffff8003823d0a8}, last_offset = 0, insert_point = 0x0}, sc_delayed_count = 0,
  sc_current_queue = 0x0, sc_current_count = 0, sc_flush_queue = 0x0, sc_flush_count = 0, sc_flush_in_progress = 0,
  sc_copy_queue = 0x0, sc_copy_in_progress = 0, sc_dconsumer = 0xfffff800382c5900, sc_jconsumer = 0xfffff8003822f900,
  sc_inactive = {jj_queue = 0x0, jj_offset = 59142656}, sc_active = {jj_queue = 0x0, jj_offset = 59226112}, sc_jstart = 0,
  sc_jend = 8589934080, sc_callout = {c_links = {le = {le_next = 0x0, le_prev = 0xfffffe0000803f00}, sle = {sle_next = 0x0}, tqe = {
        tqe_next = 0x0, tqe_prev = 0xfffffe0000803f00}}, c_time = 33825373681, c_precision = 1342177187,
    c_arg = 0xfffff8003823d000, c_func = 0xffffffff8116c190 <g_journal_timeout>, c_lock = 0x0, c_flags = 0, c_iflags = 16,
    c_cpu = 0}, sc_worker = 0x0, sc_rootmount = 0x0}

Thats the same "struct g_journal_softc" as seen above for the g_journal_switcher.

Now the story is complete: "PID 1" calls g_journal_shutdown() and loops through the gjournaled partitions to destroy them. Last partition has sc_id = 1068288894 which is printed to console. Then in g_journal_destroy() we see
       mtx_unlock(&sc->sc_mtx);
       g_topology_lock();
       mtx_destroy(&sc->sc_mtx);      (That is mtx_lock = MTX_DESTROYED)
       ....
       g_topology_unlock();           (in g_mirror_access)
At this moment g_journal_switcher gets the topology_lock and starts running to the page fault described above.
Comment 1 longwitz 2017-08-26 21:34:42 UTC
I suggest the following patch to avoid the described problem:

--- g_journal.c.r322793 2017-08-24 15:58:33.702658000 +0200
+++ g_journal.c 2017-08-26 23:14:04.616049000 +0200
@@ -2433,6 +2433,7 @@
                    sc->sc_current_count);
        }

+       gp->softc = NULL;
        LIST_FOREACH(cp, &gp->consumer, consumer) {
                if (cp->acr + cp->acw + cp->ace > 0)
                        g_access(cp, -1, -1, -1);
@@ -2444,7 +2445,6 @@
                 */
                g_post_event(g_journal_destroy_consumer, cp, M_WAITOK, NULL);
        }
-       gp->softc = NULL;
        g_wither_geom(gp, ENXIO);
        free(sc, M_JOURNAL);
        return (0);
Comment 2 Eugene Grosbein freebsd_committer freebsd_triage 2017-09-03 15:54:27 UTC
Please use patch attachment instead of posting it inline.
Comment 3 longwitz 2017-09-18 14:29:13 UTC
Created attachment 186511 [details]
Patch is against HEAD and should work for FreeBSD 10/11 too.
Comment 4 Kirk McKusick freebsd_committer freebsd_triage 2018-01-31 23:22:17 UTC
I will get the suggested change committed.
Comment 5 commit-hook freebsd_committer freebsd_triage 2018-01-31 23:31:51 UTC
A commit references this bug:

Author: mckusick
Date: Wed Jan 31 23:30:49 UTC 2018
New revision: 328647
URL: https://svnweb.freebsd.org/changeset/base/328647

Log:
  Null out journal softc pointer earlier to avoid a segment fault
  that can otherwise occur.

  PR:           221804
  Submitted by: Andreas Longwitz <longwitz at incore.de>
  MFC after:    1 week

Changes:
  head/sys/geom/journal/g_journal.c
Comment 6 Kirk McKusick freebsd_committer freebsd_triage 2018-02-06 19:32:26 UTC
Fixed with -r328647.
MFC to 11 with -r328947.
MFC to 10 with -r328948.