I am running several servers with FreeBSD 10.3-STABLE #3 r307955, using gjournal/gmirror:

--> gjournal status
                    Name  Status  Components
 mirror/gmbkp4p7.journal  N/A     mirror/gmbkp4p3
                                  mirror/gmbkp4p7
 mirror/gmbkp4p8.journal  N/A     mirror/gmbkp4p4
                                  mirror/gmbkp4p8
 mirror/gmbkp4p9.journal  N/A     mirror/gmbkp4p5
                                  mirror/gmbkp4p9

== root@dssbkp4 (pts/2) ->
--> gmirror status
           Name    Status  Components
  mirror/gmbkp4  COMPLETE  da10 (ACTIVE)
                           da11 (ACTIVE)
 mirror/gmbkp4p10.journal  N/A  mirror/gmbkp4p6
                                mirror/gmbkp4p10

I do not use gjournal on other disks because I am aware of the trouble described in PR kern/164252. This configuration has been running correctly for many years; all journal partitions have a size of 2 GB. Now I have got a panic on a test machine with 48 GB RAM:

panic: Journal overflow (id = 1510160743 joffset=1950095360 active=863204864 inactive=1948974592)
cpuid = 4
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0c4ea36410
kdb_backtrace() at kdb_backtrace+0x39/frame 0xfffffe0c4ea364c0
vpanic() at vpanic+0x126/frame 0xfffffe0c4ea36500
panic() at panic+0x43/frame 0xfffffe0c4ea36560
g_journal_flush() at g_journal_flush+0x8f9/frame 0xfffffe0c4ea36840
g_journal_flush_send() at g_journal_flush_send+0xd1/frame 0xfffffe0c4ea36880
g_journal_worker() at g_journal_worker+0x170a/frame 0xfffffe0c4ea36bb0
fork_exit() at fork_exit+0x9a/frame 0xfffffe0c4ea36bf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0c4ea36bf0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
KDB: enter: panic
[ thread pid 27 tid 100182 ]
Stopped at      kdb_enter+0x3e: movq    $0,kdb_why
db:0:kdb.enter.panic> watchdog
No argument provided, disabling watchdog
db:0:kdb.enter.panic> run ddbinfo
db:1:ddbinfo> capture on
db:1:on> run lockinfo
db:2:lockinfo> show lock Giant
 class: sleep mutex
 name: Giant
 flags: {DEF, RECURSE}
 state: {UNOWNED}
db:2:Giant> show lockedvnods
Locked vnodes
0xfffff80b7e1acb10: tag ufs, type VREG
    usecount 2, writecount 1, refcount 3781 mountedhere 0
    flags (VI_ACTIVE)
    v_object 0xfffff8085348d600 ref 0 pages 30224 cleanbuf 548 dirtybuf 3230
    lock type ufs: EXCL by thread 0xfffff800117c1960 (pid 7, g_journal switcher, tid 100169)
        ino 18, on dev mirror/gmbkp4p8.journal
db:2:lockedvnods> show lockchain
thread 100182 (pid 27, g_journal mirror/gm) running on CPU 4
db:2:lockchain> show sleepchain
thread 100182 (pid 27, g_journal mirror/gm) running on CPU 4
db:1:sleepchain> show pcpu
cpuid        = 4
dynamic pcpu = 0xfffffe0c9c5dfe00
curthread    = 0xfffff80011e014b0: pid 27 "g_journal mirror/gm"
curpcb       = 0xfffffe0c4ea36cc0
fpcurthread  = none
idlethread   = 0xfffff8000b343960: tid 100007 "idle: cpu4"
curpmap      = 0xffffffff80efa9d8
tssp         = 0xffffffff80f11ab0
commontssp   = 0xffffffff80f11ab0

From the kernel dump:

(kgdb) bt
#0  doadump (textdump=-1319329552) at pcpu.h:219
#1  0xffffffff80340f05 in db_fncall (dummy1=<value optimized out>, dummy2=<value optimized out>, dummy3=<value optimized out>, dummy4=<value optimized out>) at /usr/src/sys/ddb/db_command.c:568
#2  0xffffffff80340bed in db_command (cmd_table=0x0) at /usr/src/sys/ddb/db_command.c:440
#3  0xffffffff80344e7f in db_script_exec (scriptname=0xfffffe0c4ea360c0 "kdb.enter.panic", warnifnotfound=<value optimized out>) at /usr/src/sys/ddb/db_script.c:302
#4  0xffffffff80344cc3 in db_script_kdbenter (eventname=0x0) at /usr/src/sys/ddb/db_script.c:324
#5  0xffffffff803432eb in db_trap (type=<value optimized out>, code=0) at /usr/src/sys/ddb/db_main.c:230
#6  0xffffffff806df959 in kdb_trap (type=3, code=0, tf=<value optimized out>) at /usr/src/sys/kern/subr_kdb.c:656
#7  0xffffffff809391d3 in trap (frame=0xfffffe0c4ea363f0) at /usr/src/sys/amd64/amd64/trap.c:561
#8  0xffffffff8091eb6c in calltrap () at /usr/src/sys/amd64/amd64/exception.S:238
#9  0xffffffff806df0be in kdb_enter (why=0xffffffff80a79f84 "panic", msg=<value optimized out>) at cpufunc.h:63
#10 0xffffffff806a12e6 in vpanic (fmt=<value optimized out>, ap=<value optimized out>) at /usr/src/sys/kern/kern_shutdown.c:882
#11 0xffffffff806a1193 in panic (fmt=0xffffffff80d24350 "PèŠ\200ÿÿÿÿ\001") at /usr/src/sys/kern/kern_shutdown.c:818
#12 0xffffffff8112bbc9 in g_journal_flush (sc=<value optimized out>) at /usr/src/sys/modules/geom/geom_journal/../../../geom/journal/g_journal.c:344
#13 0xffffffff8112bca1 in g_journal_flush_send (sc=0xfffff80011dee400) at /usr/src/sys/modules/geom/geom_journal/../../../geom/journal/g_journal.c:1443
#14 0xffffffff8112f8ea in g_journal_worker (arg=0xfffff80011dee400) at /usr/src/sys/modules/geom/geom_journal/../../../geom/journal/g_journal.c:2211
#15 0xffffffff8066f99a in fork_exit (callout=0xffffffff8112e1e0 <g_journal_worker>, arg=0xfffff80011dee400, frame=0xfffffe0c4ea36c00) at /usr/src/sys/kern/kern_fork.c:1030
#16 0xffffffff8091f0ae in fork_trampoline () at /usr/src/sys/amd64/amd64/exception.S:613

(kgdb) f 13
#13 0xffffffff8112bca1 in g_journal_flush_send (sc=0xfffff80011dee400) at /usr/src/sys/modules/geom/geom_journal/../../../geom/journal/g_journal.c:1443
1443                    g_journal_add_current(sc, bp);

(kgdb) p *sc
$21 = {sc_id = 1510160743, sc_type = 3 '\003', sc_orig_type = 1 '\001', sc_geom = 0xfffff80011de6b00, sc_flags = 64,
  sc_mtx = {lock_object = {lo_name = 0xffffffff811319f4 "gjournal", lo_flags = 16973824, lo_data = 0, lo_witness = 0x0}, mtx_lock = 4},
  sc_mediasize = 4294966784, sc_sectorsize = 512, sc_bio_flush = 3, sc_journal_id = 114030489, sc_journal_next_id = 3023870498,
  sc_journal_copying = 0, sc_journal_offset = 1950095360, sc_journal_previous_id = 2305361505,
  sc_back_queue = {queue = {tqh_first = 0xfffff8094ce6e1f0, tqh_last = 0xfffff8065935cd10}, last_offset = 1945507328, insert_point = 0xfffff8065935cc98},
  sc_regular_queue = {queue = {tqh_first = 0xfffff808536333e0, tqh_last = 0xfffff80836392740}, last_offset = 478609408, insert_point = 0xfffff808363926c8},
  sc_delayed_queue = {queue = {tqh_first = 0xfffff805bf282d90, tqh_last = 0xfffff805f2023360}, last_offset = 475987968, insert_point = 0xfffff805f20232e8},
  sc_delayed_count = 108, sc_current_queue = 0x0, sc_current_count = 0, sc_flush_queue = 0xfffff805f14fc3e0, sc_flush_count = 84,
  sc_flush_in_progress = 16, sc_copy_queue = 0x0, sc_copy_in_progress = 0, sc_dconsumer = 0xfffff80011dea880, sc_jconsumer = 0xfffff80011deb600,
  sc_inactive = {jj_queue = 0x0, jj_offset = 1948974592}, sc_active = {jj_queue = 0xfffff80a4748baa8, jj_offset = 863204864},
  sc_jstart = 0, sc_jend = 2147483136,
  sc_callout = {c_links = {le = {le_next = 0x0, le_prev = 0xfffffe0000806768}, sle = {sle_next = 0x0}, tqe = {tqe_next = 0x0, tqe_prev = 0xfffffe0000806768}},
    c_time = 55532158374, c_precision = 1342177187, c_arg = 0xfffff80011dee400, c_func = 0xffffffff8112e150 <g_journal_timeout>, c_lock = 0x0,
    c_flags = 0, c_iflags = 16, c_cpu = 0},
  sc_worker = 0xfffff80011df7500, sc_rootmount = 0x0}

(kgdb) p vm_kmem_size
$1 = 50189914112
(kgdb) p g_journal_cache_divisor
$2 = 2
(kgdb) p g_journal_cache_low
$3 = 3258108450
(kgdb) p g_journal_cache_limit
$4 = 3620120576
(kgdb) p g_journal_cache_switch
$5 = 90
(kgdb) p g_journal_force_switch
$6 = 70
(kgdb) p g_journal_cache_used
$7 = 1098408960
(kgdb) p g_journal_switch_time
$8 = 10
(kgdb) p g_journal_switcher_wokenup
$9 = 1
(kgdb) p g_journal_cache_alloc_failures
$10 = 0
(kgdb) p g_journal_cache_misses
$11 = 4
(kgdb) p g_journal_switcher_state
$12 = 0
(kgdb) p g_journal_sync_requested
$13 = 0
(kgdb) p g_journal_stats_journal_full
$14 = 2
(kgdb) p g_journal_switch_time
$15 = 10
(kgdb) p sc->sc_id
$22 = 1510160743
(kgdb) p sc->sc_journal_offset
$23 = 1950095360
(kgdb) p sc->sc_active.jj_offset
$24 = 863204864
(kgdb) p sc->sc_inactive.jj_offset
$25 = 1948974592
(kgdb) p g_journal_stats_wait_for_copy
$28 = 0

(kgdb) thread 465
[Switching to thread 465 (Thread 100169)]
#0  sched_switch (td=0xfffff800117c1960, newtd=<value optimized out>, flags=<value optimized out>) at /usr/src/sys/kern/sched_ule.c:1945
1945                    cpuid = PCPU_GET(cpuid);
(kgdb) where
#0  sched_switch (td=0xfffff800117c1960, newtd=<value optimized out>, flags=<value optimized out>) at /usr/src/sys/kern/sched_ule.c:1945
#1  0xffffffff806aabc1 in mi_switch (flags=260, newtd=0x0) at /usr/src/sys/kern/kern_synch.c:491
#2  0xffffffff806ee73a in sleepq_wait (wchan=0x0, pri=0) at /usr/src/sys/kern/subr_sleepqueue.c:624
#3  0xffffffff806aa5d7 in _sleep (ident=<value optimized out>, lock=<value optimized out>, priority=<value optimized out>, wmesg=<value optimized out>, sbt=<value optimized out>, pr=<value optimized out>, flags=<value optimized out>) at /usr/src/sys/kern/kern_synch.c:255
#4  0xffffffff80910c58 in getpbuf (pfreecnt=<value optimized out>) at /usr/src/sys/vm/vm_pager.c:336
#5  0xffffffff8073f453 in cluster_wbuild (vp=0xfffff80b7e1acb10, size=32768, start_lbn=<value optimized out>, len=<value optimized out>, gbflags=<value optimized out>) at /usr/src/sys/kern/vfs_cluster.c:840
#6  0xffffffff808d8748 in ffs_syncvnode (vp=0xfffff80b7e1acb10, waitfor=1, flags=<value optimized out>) at /usr/src/sys/ufs/ffs/ffs_vnops.c:298
#7  0xffffffff808d5f6d in ffs_sync (mp=0xfffff8002ba4b000, waitfor=Cannot access memory at address 0x1) at /usr/src/sys/ufs/ffs/ffs_vfsops.c:1591
#8  0xffffffff807618fc in vfs_write_suspend (mp=0xfffff8002ba4b000, flags=<value optimized out>) at /usr/src/sys/kern/vfs_vnops.c:1877
#9  0xffffffff8112a3e4 in g_journal_switcher (arg=0xffffffff81132f28) at /usr/src/sys/modules/geom/geom_journal/../../../geom/journal/g_journal.c:2964
#10 0xffffffff8066f99a in fork_exit (callout=0xffffffff81129a40 <g_journal_switcher>, arg=0xffffffff81132f28, frame=0xfffffe0c4e49cc00) at /usr/src/sys/kern/kern_fork.c:1030
#11 0xffffffff8091f0ae in fork_trampoline () at /usr/src/sys/amd64/amd64/exception.S:613
#12 0x0000000000000000 in ?? ()

(kgdb) f 4
#4  0xffffffff80910c58 in getpbuf (pfreecnt=<value optimized out>) at /usr/src/sys/vm/vm_pager.c:336
336                             msleep(pfreecnt, &pbuf_mtx, PVM, "wswbuf0", 0);
(kgdb) list
331             mtx_lock(&pbuf_mtx);
332
333             for (;;) {
334                     if (pfreecnt) {
335                             while (*pfreecnt == 0) {
336                                     msleep(pfreecnt, &pbuf_mtx, PVM, "wswbuf0", 0);
337                             }
338                     }

The panic was triggered by a vi problem: I tried to insert some bytes via cut/paste into an open file, and vi started (I do not know why) to loop, writing my pasted string over and over again into its own address space until it dumped core. The kernel wrote the core dump of vi to /var/core, and doing that triggered the gjournal panic.
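Looking at the numbers above, the panic itself is consistent with the journal area (sc_jstart = 0 .. sc_jend = 2147483136) simply running full: the current write offset sc_journal_offset = 1950095360 has just passed sc_inactive.jj_offset = 1948974592, i.e. new journal records would start to overwrite the inactive journal that has not yet been copied to the data provider. The following is only a small user-space sketch that recomputes this with the dumped values; the assumption that this is essentially the condition behind the "Journal overflow" panic is my reading of g_journal.c, not something confirmed elsewhere in this PR:

/*
 * Standalone user-space sketch, not kernel code: re-check the suspected
 * overflow condition with the values from the kernel dump.  Assumption:
 * gjournal panics once the write offset reaches the start of the
 * inactive (not yet committed) journal.
 */
#include <stdio.h>
#include <stdint.h>

int
main(void)
{
        int64_t jend     = 2147483136;  /* sc_jend */
        int64_t joffset  = 1950095360;  /* sc_journal_offset */
        int64_t active   = 863204864;   /* sc_active.jj_offset */
        int64_t inactive = 1948974592;  /* sc_inactive.jj_offset */

        printf("space left up to sc_jend: %jd bytes (~%jd MB)\n",
            (intmax_t)(jend - joffset), (intmax_t)((jend - joffset) >> 20));

        /* active journal starts below the inactive one and the write
         * offset has reached the inactive journal -> overflow */
        if (active < inactive && joffset >= inactive)
                printf("overflow: new records would overwrite the inactive journal at %jd\n",
                    (intmax_t)inactive);
        return (0);
}

This reports roughly 188 MB of space still left before sc_jend, yet the overflow condition already holds, which matches the panic message.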
(kgdb) thread 541   (vi)
[Switching to thread 541 (Thread 101086)]
#0  sched_switch (td=0xfffff805564a7960, newtd=<value optimized out>, flags=<value optimized out>) at /usr/src/sys/kern/sched_ule.c:1945
1945                    cpuid = PCPU_GET(cpuid);
(kgdb) bt
#0  sched_switch (td=0xfffff805564a7960, newtd=<value optimized out>, flags=<value optimized out>) at /usr/src/sys/kern/sched_ule.c:1945
#1  0xffffffff806aabc1 in mi_switch (flags=260, newtd=0x0) at /usr/src/sys/kern/kern_synch.c:491
#2  0xffffffff806ee73a in sleepq_wait (wchan=0x0, pri=0) at /usr/src/sys/kern/subr_sleepqueue.c:624
#3  0xffffffff806aa5d7 in _sleep (ident=<value optimized out>, lock=<value optimized out>, priority=<value optimized out>, wmesg=<value optimized out>, sbt=<value optimized out>, pr=<value optimized out>, flags=<value optimized out>) at /usr/src/sys/kern/kern_synch.c:255
#4  0xffffffff80761487 in vn_start_write_locked (mp=0xfffff8002ba4b000, flags=257) at /usr/src/sys/kern/vfs_vnops.c:1673
#5  0xffffffff8075fc23 in vn_start_write (vp=<value optimized out>, mpp=<value optimized out>, flags=<value optimized out>) at /usr/src/sys/kern/vfs_vnops.c:1725
#6  0xffffffff807605db in vn_rdwr (rw=UIO_WRITE, vp=0xfffff80b7e1acb10, base=<value optimized out>, len=<value optimized out>, offset=<value optimized out>, segflg=<value optimized out>, ioflg=<value optimized out>, active_cred=0xfffff8055623b280, file_cred=0x1, aresid=<value optimized out>, td=<value optimized out>) at /usr/src/sys/kern/vfs_vnops.c:530
#7  0xffffffff80760d47 in vn_rdwr_inchunks (rw=UIO_WRITE, vp=0xfffff80b7e1acb10, base=<value optimized out>, len=<value optimized out>, offset=<value optimized out>, segflg=UIO_USERSPACE, aresid=<value optimized out>) at /usr/src/sys/kern/vfs_vnops.c:630
#8  0xffffffff80645e14 in elf64_coredump (td=<value optimized out>, vp=0xfffff80b7e1acb10, limit=<value optimized out>, flags=<value optimized out>) at /usr/src/sys/kern/imgact_elf.c:1179
#9  0xffffffff806a4a6c in sigexit (td=0xfffff805564a7960, sig=11) at /usr/src/sys/kern/kern_sig.c:3366
#10 0xffffffff806a50c6 in postsig (sig=<value optimized out>) at /usr/src/sys/kern/kern_sig.c:2911
#11 0xffffffff806f40f7 in ast (framep=<value optimized out>) at /usr/src/sys/kern/subr_trap.c:316
#12 0xffffffff8091fe89 in doreti_ast () at /usr/src/sys/amd64/amd64/exception.S:685
#13 0x0000000801853c00 in ?? ()
#14 0x0000000801853c80 in ?? ()
#15 0x0000062e00000008 in ?? ()
#16 0x0000000000000604 in ?? ()
#17 0x0000000000000002 in ?? ()
#18 0x0000000000000001 in ?? ()
#19 0x0000000801815400 in ?? ()
#20 0x0000000000000008 in ?? ()
#21 0x00007fffffffe190 in ?? ()
#22 0x0000000000000000 in ?? ()

(kgdb) f 4
#4  0xffffffff80761487 in vn_start_write_locked (mp=0xfffff8002ba4b000, flags=257) at /usr/src/sys/kern/vfs_vnops.c:1673
1673                    error = msleep(&mp->mnt_flag, MNT_MTX(mp), mflags,
(kgdb) list
1668            while ((mp->mnt_kern_flag & MNTK_SUSPEND) != 0) {
1669                    if (flags & V_NOWAIT) {
1670                            error = EWOULDBLOCK;
1671                            goto unlock;
1672                    }
1673                    error = msleep(&mp->mnt_flag, MNT_MTX(mp), mflags,
1674                        "suspfs", 0);
1675                    if (error)
1676                            goto unlock;
1677            }

After the reboot I saw the following (inode 18 is the inode the gjournal switcher had locked):

== root@dssbkp4 (pts/1) -> ls -li /var/core
total 1704864
18 -rw-------  1 root  wheel  1745289216 29 Mär 22:34 vi-70628.core

== root@dssbkp4 (pts/1) -> ls -li /var/tmp/vi.recover/
total 395232
80384 -rw-------  1 root  wheel  404533248 29 Mär 22:34 vi.CqtHml

This server has faster disks than my others, so maybe the journal space is too small. But the g_journal switcher sits on "wswbuf0" (why?), so every bit of journal space gets filled up, which brings on the overflow.

Looking at the gjournal variables in /sys/geom/journal/g_journal.c, we have

static u_int g_journal_cache_limit = 64 * 1024 * 1024;
static u_int g_journal_cache_divisor = 2;
static u_int g_journal_cache_switch = 90;
static u_int g_journal_cache_low = 0;

and in g_journal_init()

        if (g_journal_cache_limit > 0) {
                g_journal_cache_limit = vm_kmem_size / g_journal_cache_divisor;
                g_journal_cache_low = (g_journal_cache_limit / 100) *
                    g_journal_cache_switch;
        }

But vm_kmem_size is defined as u_long. Therefore we end up with g_journal_cache_limit = 3,620,120,576, whereas it should be 50,189,914,112 / 2. This looks to me like a programming error in g_journal.c that matters on servers with more than 4 GB of RAM. Can this be responsible for the panic?
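Just to make the truncation explicit, here is a small user-space sketch using the value of vm_kmem_size from my dump (the variable names only mirror the kernel ones; this is of course not the kernel code itself, and it assumes an LP64 platform such as amd64):

#include <stdio.h>

/* same widths as in g_journal.c: the cache variables are u_int */
static unsigned int  g_journal_cache_limit = 64 * 1024 * 1024;
static unsigned int  g_journal_cache_divisor = 2;
static unsigned int  g_journal_cache_switch = 90;
static unsigned int  g_journal_cache_low = 0;
static unsigned long vm_kmem_size = 50189914112UL;      /* value from the dump */

int
main(void)
{
        if (g_journal_cache_limit > 0) {
                /* the u_long result is silently truncated to 32 bits here */
                g_journal_cache_limit = vm_kmem_size / g_journal_cache_divisor;
                g_journal_cache_low = (g_journal_cache_limit / 100) *
                    g_journal_cache_switch;
        }
        printf("g_journal_cache_limit = %u (untruncated: %lu)\n",
            g_journal_cache_limit, vm_kmem_size / 2);
        printf("g_journal_cache_low   = %u\n", g_journal_cache_low);
        return (0);
}

On amd64 this prints g_journal_cache_limit = 3620120576 and g_journal_cache_low = 3258108450, exactly the values $4 and $3 shown by kgdb above.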
All of the stress2 tests I have panic with "Journal overflow". Here is an example with an 8 GB disk and an excessive 50% journal:
https://people.freebsd.org/~pho/stress/log/gjournal5.txt
Thanks for the extensive example; the information from your kernel dump looks nearly identical to mine. I think the problem with the u_int/u_long variables in g_journal.c is not the reason for the panic; there is already PR kern/198500 for that. The panic has to do with the physical buffers (pbufs) described in pbuf(9). I see the following users of pbufs (I do not have nfs or fuse):

nbuf = 105931
nswbuf = min(nbuf / 4, 256) = 256

User of pbufs              boottime  kerneldump
md_vnode_pbuf_freecnt            25          25
smbfs_pbuf_freecnt              129         129
ncl_pbuf_freecnt                129         129
cluster_pbuf_freecnt            128           0
vnode_pbuf_freecnt              129         129
nsw_rcount                      128         128
nsw_wcount_sync                  64          64

The g_journal_switcher (pid 7) is waiting on channel "wswbuf0" because all the cluster pbufs are in use. Looking at the 256 swbufs, I found that 128 are free and 128 are in use by the g_journal_switcher itself. All of the in-use buffers have the same b_bufobj and the same b_iodone address, "cluster_callback". One example:

(kgdb) p swbuf[126]
$352 = {b_bufobj = 0xfffff80b7e1acbe0, b_bcount = 131072, b_caller1 = 0x0, b_data = 0xfffffe0c1c2ec000 "", b_error = 0, b_iocmd = 2 '\002',
  b_ioflags = 0 '\0', b_iooffset = 472973312, b_resid = 0, b_iodone = 0xffffffff8073fcb0 <cluster_callback>, b_blkno = 923776,
  b_offset = 2841903104, b_bobufs = {tqe_next = 0x0, tqe_prev = 0x0}, b_vflags = 0,
  b_freelist = {tqe_next = 0xfffffe0bafe182b8, tqe_prev = 0xffffffff80eaa460}, b_qindex = 0, b_flags = 1677721636, b_xflags = 0 '\0',
  b_lock = {lock_object = {lo_name = 0xffffffff80a829a7 "bufwait", lo_flags = 108199936, lo_data = 0, lo_witness = 0x0},
    lk_lock = 18446744073709551600, lk_exslpfail = 0, lk_timo = 0, lk_pri = 96},
  b_bufsize = 131072, b_runningbufspace = 131072, b_kvabase = 0xfffffe0c1c2ec000 "", b_kvaalloc = 0x0, b_kvasize = 131072,
  b_lblkno = 86728, b_vp = 0xfffff80b7e1acb10, b_dirtyoff = 0, b_dirtyend = 131072, b_rcred = 0x0, b_wcred = 0x0,
  b_saveaddr = 0xfffffe0c1c2ec000, b_pager = {pg_reqpage = 0},
  b_cluster = {cluster_head = {tqh_first = 0xfffffe0bb1bec410, tqh_last = 0xfffffe0bb1bebe20},
    cluster_entry = {tqe_next = 0xfffffe0bb1bec410, tqe_prev = 0xfffffe0bb1bebe20}},
  b_pages = {0xfffff80c097f97c0, 0xfffff80c097f9828, 0xfffff80c097f9890, 0xfffff80c097f98f8, 0xfffff80c097f9960, 0xfffff80c097f99c8,
    0xfffff80c097f9a30, 0xfffff80c097f9a98, 0xfffff80c097f9b00, 0xfffff80c097f9b68, 0xfffff80c097f9bd0, 0xfffff80c097f9c38,
    0xfffff80c097f9ca0, 0xfffff80c097f9d08, 0xfffff80c097f9d70, 0xfffff80c097f9dd8, 0xfffff80c097f9e40, 0xfffff80c097f9ea8,
    0xfffff80c097f9f10, 0xfffff80c097f9f78, 0xfffff80c097f9fe0, 0xfffff80c097fa048, 0xfffff80c097fa0b0, 0xfffff80c097fa118,
    0xfffff80c11547980, 0xfffff80c115479e8, 0xfffff80c11547a50, 0xfffff80c11547ab8, 0xfffff80c11547b20, 0xfffff80c11547b88,
    0xfffff80c11547bf0, 0xfffff80c11547c58},
  b_npages = 32, b_dep = {lh_first = 0x0}, b_fsprivate1 = 0x0, b_fsprivate2 = 0x0, b_fsprivate3 = 0x0, b_pin_count = 0}

Therefore the g_journal_switcher has all of its cluster pbufs in use and waits forever for one more, so the worker thread must eventually panic with a journal overflow. In cluster_wbuild() I cannot see a check for "cluster_pbuf_freecnt > 0" that would avoid the hang on "wswbuf0". I wonder why this seems to be a problem only with gjournal; other components in the kernel also use VFS_SYNC. I would like to know whether this problem can be fixed.
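For reference, the boot-time numbers in the table above follow directly from nswbuf. The little sketch below just redoes that arithmetic; the formulas are my reading of vm_pager_bufferinit() and the swap pager initialization, so treat them as an assumption rather than a quote of the source:

/*
 * User-space sketch of the pbuf pool sizing as I understand it; only the
 * nbuf value comes from my dump, the formulas are assumptions.
 */
#include <stdio.h>

static int
imin(int a, int b)
{
        return (a < b ? a : b);
}

int
main(void)
{
        int nbuf = 105931;                      /* from the dump */
        int nswbuf = imin(nbuf / 4, 256);       /* = 256 */

        printf("nswbuf               = %d\n", nswbuf);
        printf("cluster_pbuf_freecnt = %d\n", nswbuf / 2);       /* 128 */
        printf("vnode_pbuf_freecnt   = %d\n", nswbuf / 2 + 1);   /* 129 */
        printf("nsw_rcount           = %d\n", (nswbuf + 1) / 2); /* 128 */
        printf("nsw_wcount_sync      = %d\n", (nswbuf + 3) / 4); /* 64 */
        return (0);
}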
From my kernel dump I see the following backtrace for the g_journal_switcher:

#4  0xffffffff80910c58 in getpbuf (pfreecnt=<value optimized out>) at /usr/src/sys/vm/vm_pager.c:336
#5  0xffffffff8073f453 in cluster_wbuild (vp=0xfffff80b7e1acb10, size=32768, start_lbn=<value optimized out>, len=<value optimized out>, gbflags=<value optimized out>) at /usr/src/sys/kern/vfs_cluster.c:840
#6  0xffffffff808d8748 in ffs_syncvnode (vp=0xfffff80b7e1acb10, waitfor=1, flags=<value optimized out>) at /usr/src/sys/ufs/ffs/ffs_vnops.c:298
#7  0xffffffff808d5f6d in ffs_sync (mp=0xfffff8002ba4b000, waitfor=Cannot access memory at address 0x1) at /usr/src/sys/ufs/ffs/ffs_vfsops.c:1591
#8  0xffffffff807618fc in vfs_write_suspend (mp=0xfffff8002ba4b000, flags=<value optimized out>) at /usr/src/sys/kern/vfs_vnops.c:1877
#9  0xffffffff8112a3e4 in g_journal_switcher (arg=0xffffffff81132f28)

In the source /usr/src/sys/ufs/ffs/ffs_vfsops.c, at line 1591, I do see the call to ffs_syncvnode() as stated above (frame #7 --> frame #6). But in the source /usr/src/sys/ufs/ffs/ffs_vnops.c, at line 298, I see a call to vfs_bio_awrite() and not a call to cluster_wbuild() as stated above (frame #6 --> frame #5). Looking at /usr/src/sys/kern/vfs_bio.c, line 1971, I see the function vfs_bio_awrite(), which calls cluster_wbuild(). So it looks to me as if a frame is missing from the backtrace given by kgdb for the g_journal_switcher. In my ddb output - and also in Peter's ddb output - I see the same:

getpbuf() at getpbuf+0xb8/frame 0xfffffe0c4e49c8b0
cluster_wbuild() at cluster_wbuild+0x83/frame 0xfffffe0c4e49c950
ffs_syncvnode() at ffs_syncvnode+0x308/frame 0xfffffe0c4e49c9e0
ffs_sync() at ffs_sync+0x1fd/frame 0xfffffe0c4e49ca90
vfs_write_suspend() at vfs_write_suspend+0x19c/frame 0xfffffe0c4e49cad0
g_journal_switcher() at g_journal_switcher+0x9a4/frame 0xfffffe0c4e49cbb0

Can somebody please explain to me what is going on here?
With the help of Peter's stress2 test I was able to work out that the reason for the journal overflow panic was the 2 GB journal, which is too small for my new hardware. After moving to an 8 GB journal the server does not crash any more. I have also understood that the g_journal switcher sleeping on "wswbuf0" is woken up again as soon as the interrupt routine for the outstanding I/O becomes active. This PR can be closed.
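For anyone sizing a journal for similar hardware, a rough back-of-the-envelope check is sketched below. Only the 10-second switch interval (kern.geom.journal.switch_time, values $8/$15 in the dump above) comes from this PR; the sustained write rate of 400 MB/s is a purely hypothetical example figure, not a measurement:

/* Back-of-the-envelope journal sizing sketch.  The write rate is a
 * made-up example value; only the 10 s switch interval comes from the
 * dump in this PR. */
#include <stdio.h>

int
main(void)
{
        double write_rate_mbs = 400.0;  /* hypothetical sustained MB/s */
        double switch_time_s = 10.0;    /* kern.geom.journal.switch_time */
        double journal_gb[] = { 2.0, 8.0 };

        for (int i = 0; i < 2; i++) {
                double secs = journal_gb[i] * 1024.0 / write_rate_mbs;
                printf("%.0f GB journal absorbs ~%.1f s of writes (switch interval %.0f s)\n",
                    journal_gb[i], secs, switch_time_s);
        }
        return (0);
}

At that assumed rate a 2 GB journal covers only about 5 seconds of writes, less than one switch interval, while 8 GB covers about 20 seconds, which fits the observation that the larger journal no longer overflows.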