Created attachment 250626 [details]
core.txt

Fatal trap 12: page fault while in kernel mode
cpuid = 29; apic id = 1d
fault virtual address   = 0x0
fault code              = supervisor write data, page not present
instruction pointer     = 0x20:0xffffffff820975a1
stack pointer           = 0x28:0xfffffe022b901de0
frame pointer           = 0x28:0xfffffe022b901de0
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 6 (dp_sync_taskq_17)
rdi: fffff8000234cf60  rsi: fffff800022f8328  rdx: 0000000000000000
rcx: 0000000000000000   r8: 0000000000000000   r9: fffffe027bbf1a00
rax: 00000000000000e8  rbx: 0000000000000270  rbp: fffffe022b901de0
r10: 0000000000000000  r11: 0000000098ff24fe  r12: fffff800022f8328
r13: 0000000000000000  r14: fffff8000234cf40  r15: fffff80aa5726c00
trap number             = 12
panic: page fault
cpuid = 29
time = 1715524597
KDB: stack backtrace:
#0 0xffffffff80b9009d at kdb_backtrace+0x5d
#1 0xffffffff80b431a2 at vpanic+0x132
#2 0xffffffff80b43063 at panic+0x43
#3 0xffffffff8100c85c at trap_fatal+0x40c
#4 0xffffffff8100c8af at trap_pfault+0x4f
#5 0xffffffff80fe3ac8 at calltrap+0x8
#6 0xffffffff82105083 at sync_dnodes_task+0x63
#7 0xffffffff8209addf at taskq_run+0x1f
#8 0xffffffff80ba5992 at taskqueue_run_locked+0x182
#9 0xffffffff80ba6c22 at taskqueue_thread_loop+0xc2
#10 0xffffffff80afdb7f at fork_exit+0x7f
#11 0xffffffff80fe4b2e at fork_trampoline+0xe
Uptime: 4d20h39m45s

kgdb backtrace:

#0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:57
#1  doadump (textdump=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:405
#2  0xffffffff80b42d37 in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:526
#3  0xffffffff80b4320f in vpanic (fmt=0xffffffff81136b3b "%s", ap=ap@entry=0xfffffe022b901c30) at /usr/src/sys/kern/kern_shutdown.c:970
#4  0xffffffff80b43063 in panic (fmt=<unavailable>) at /usr/src/sys/kern/kern_shutdown.c:894
#5  0xffffffff8100c85c in trap_fatal (frame=0xfffffe022b901d20, eva=0) at /usr/src/sys/amd64/amd64/trap.c:952
#6  0xffffffff8100c8af in trap_pfault (frame=0xfffffe022b901d20, usermode=false, signo=<optimized out>, ucode=<optimized out>) at /usr/src/sys/amd64/amd64/trap.c:760
#7  <signal handler called>
#8  0xffffffff820975a1 in list_remove (list=0xfffff8000234cf60, object=object@entry=0xfffff800022f8328) at /usr/src/sys/contrib/openzfs/module/os/freebsd/spl/list.c:127
#9  0xffffffff8216158e in multilist_sublist_remove (mls=mls@entry=0xfffff8000234cf40, obj=obj@entry=0xfffff800022f8328) at /usr/src/sys/contrib/openzfs/module/zfs/multilist.c:363
#10 0xffffffff82105083 in dmu_objset_sync_dnodes (list=0xfffff8000234cf40, tx=0xfffff80aa5726c00) at /usr/src/sys/contrib/openzfs/module/zfs/dmu_objset.c:1557
#11 sync_dnodes_task (arg=0xfffff8083ac22e60) at /usr/src/sys/contrib/openzfs/module/zfs/dmu_objset.c:1638
#12 0xffffffff8209addf in taskq_run (arg=0xfffff8005728f900, pending=<optimized out>) at /usr/src/sys/contrib/openzfs/module/os/freebsd/spl/spl_taskq.c:320
#13 0xffffffff80ba5992 in taskqueue_run_locked (queue=queue@entry=0xfffff80002356300) at /usr/src/sys/kern/subr_taskqueue.c:512
#14 0xffffffff80ba6c22 in taskqueue_thread_loop (arg=arg@entry=0xfffff80003a91620) at /usr/src/sys/kern/subr_taskqueue.c:824
#15 0xffffffff80afdb7f in fork_exit (callout=0xffffffff80ba6b60 <taskqueue_thread_loop>, arg=0xfffff80003a91620, frame=0xfffffe022b901f40) at /usr/src/sys/kern/kern_fork.c:1160
#16 <signal handler called>

See the attachment for the full core.txt.

This seems to happen intermittently while running a backup, which is driven by a remote computer that runs `zfs send` on this host over SSH. If there's anything else you'd like to see, please let me know. I have the full vmcore as well if needed (11 GB), and I am able to run kgdb against it.
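If it's useful, I can start poking at frame 8 of the vmcore myself. Roughly what I have in mind (member names come from my reading of the OpenZFS list.h/list.c sources, so take this as a sketch):

(kgdb) frame 8
(kgdb) print *list
(kgdb) print *(list_node_t *)((char *)object + list->list_offset)

Given that the fault is a write to virtual address 0x0, I'd expect one of the node's list_next/list_prev pointers to be NULL, i.e. the object was already removed from, or never fully linked into, the sublist.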
Another panic this morning at 3:33 AM with an identical backtrace. Looking at the core.txt, there was a backup running this time as well: there is a running `zfs` process with `sudo` as the parent, and `sshd` as the next process up the tree.

A `zpool status` shows that our jail pool experienced an error:

  pool: zsmtp_jail
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 92K in 00:23:52 with 0 errors on Tue May 14 03:53:36 2024
config:

        NAME        STATE     READ WRITE CKSUM
        zsmtp_jail  ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            nda1p3  ONLINE       0     0     1
            nda2p3  ONLINE       0     0     0

It seems to me that the checksum error was likely caused by the panic corrupting something on that drive. `smartctl -a /dev/nvme1` doesn't seem out of the ordinary:

=== START OF INFORMATION SECTION ===
Model Number:                       Force MP600
Serial Number:                      2006823000012856205C
Firmware Version:                   EGFM11.3
PCI Vendor/Subsystem ID:            0x1987
IEEE OUI Identifier:                0x6479a7
Total NVM Capacity:                 2,000,398,934,016 [2.00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      1
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          2,000,398,934,016 [2.00 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            6479a7 2fc124183d
Local Time is:                      Tue May 14 10:42:58 2024 CDT
Firmware Updates (0x12):            1 Slot, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005d):     Comp DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x08):         Telmtry_Lg
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     90 Celsius
Critical Comp. Temp. Threshold:     95 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     9.78W       -        -    0  0  0  0        0       0
 1 +     6.75W       -        -    1  1  1  1        0       0
 2 +     5.23W       -        -    2  2  2  2        0       0
 3 -   0.0490W       -        -    3  3  3  3     2000    2000
 4 -   0.0018W       -        -    4  4  4  4    25000   25000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         2
 1 -    4096       0         1

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        39 Celsius
Available Spare:                    100%
Available Spare Threshold:          5%
Percentage Used:                    3%
Data Units Read:                    295,807,203 [151 TB]
Data Units Written:                 98,425,650 [50.3 TB]
Host Read Commands:                 1,499,250,766
Host Write Commands:                2,297,088,561
Controller Busy Time:               5,756
Power Cycles:                       108
Power On Hours:                     33,640
Unsafe Shutdowns:                   67
Media and Data Integrity Errors:    0
Error Information Log Entries:      543
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, 16 of 63 entries)
No Errors Logged

Self-test Log (NVMe Log 0x06)
Self-test status: No self-test in progress
Num  Test_Description  Status                       Power_on_Hours  Failing_LBA  NSID Seg SCT Code
 0   Short             Completed without error               30728            -     -   -   -    -

I will run a SMART self-test and report the results.
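For reference, the commands I plan to use (device and pool names as above; smartctl syntax per the smartmontools 7.x we have installed):

# start an extended (long) NVMe self-test on the suspect device
smartctl -t long /dev/nvme1
# once it completes, re-read the self-test log
smartctl -a /dev/nvme1
# and clear the pool's error counters, as the status output suggests
zpool clear zsmtp_jail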
The extended self-test also completed without error:

=== START OF SMART DATA SECTION ===
Self-test Log (NVMe Log 0x06)
Self-test status: No self-test in progress
Num  Test_Description  Status                       Power_on_Hours  Failing_LBA  NSID Seg SCT Code
 0   Extended          Completed without error               33641            -     -   -   -    -
 1   Short             Completed without error               33640            -     -   -   -    -
 2   Short             Completed without error               30728            -     -   -   -    -
There was another crash yesterday, this time with a different backtrace:

#0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:57
#1  doadump (textdump=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:405
#2  0xffffffff80b42d37 in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:526
#3  0xffffffff80b4320f in vpanic (fmt=0xffffffff81136b3b "%s", ap=ap@entry=0xfffffe0231e84ca0) at /usr/src/sys/kern/kern_shutdown.c:970
#4  0xffffffff80b43063 in panic (fmt=<unavailable>) at /usr/src/sys/kern/kern_shutdown.c:894
#5  0xffffffff8100c85c in trap_fatal (frame=0xfffffe0231e84d90, eva=164) at /usr/src/sys/amd64/amd64/trap.c:952
#6  0xffffffff8100c8af in trap_pfault (frame=0xfffffe0231e84d90, usermode=false, signo=<optimized out>, ucode=<optimized out>) at /usr/src/sys/amd64/amd64/trap.c:760
#7  <signal handler called>
#8  taskqueue_run_locked (queue=0x0, queue@entry=0xfffff800021dd100) at /usr/src/sys/kern/subr_taskqueue.c:514
#9  0xffffffff80ba6c22 in taskqueue_thread_loop (arg=arg@entry=0xfffff800020b30e0) at /usr/src/sys/kern/subr_taskqueue.c:824
#10 0xffffffff80afdb7f in fork_exit (callout=0xffffffff80ba6b60 <taskqueue_thread_loop>, arg=0xfffff800020b30e0, frame=0xfffffe0231e84f40) at /usr/src/sys/kern/kern_fork.c:1160
#11 <signal handler called>

I found a thread on the TrueNAS forums that looks similar:
https://www.truenas.com/community/threads/importing-pool-crashes-os.45049/

It points to an SSD failure as the cause. We are going to remove the suspect drive from the mirror and see if the crashes stop. If we don't see any panics for a week, I will close this bug as a hardware issue.
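Concretely, the plan is something along these lines (device names per the zpool status above; not yet executed):

# detach the drive that logged the checksum error from the mirror
zpool detach zsmtp_jail nda1p3
# if the crashes stop, attach a replacement alongside the surviving
# device to restore redundancy
zpool attach zsmtp_jail nda2p3 <new-device>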
We have not gotten around to replacing the drive yet. Meanwhile, we have a new panic:

Fatal trap 12: page fault while in kernel mode
cpuid = 28; apic id = 1c
fault virtual address   = 0x0
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff80eb9c53
stack pointer           = 0x28:0xfffffe02b60c7b90
frame pointer           = 0x28:0xfffffe02b60c7bb0
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 47378 (sshd)
rdi: ffffffff81a135a8  rsi: ffffffff81a135a8  rdx: 0000000000000000
rcx: fffff80a9be5dc60   r8: 0000000000000000   r9: fffff80a9be5dc60
rax: 0000000000000000  rbx: fffffe011d24f000  rbp: fffffe02b60c7bb0
r10: ffffffff81a135a8  r11: 0000000000000000  r12: 0000000000000000
r13: fffff8075191c2a0  r14: fffff807e1ed8240  r15: 0000000000000000
trap number             = 12
panic: page fault
cpuid = 28
time = 1717134122
KDB: stack backtrace:
#0 0xffffffff80b9009d at kdb_backtrace+0x5d
#1 0xffffffff80b431a2 at vpanic+0x132
#2 0xffffffff80b43063 at panic+0x43
#3 0xffffffff8100c85c at trap_fatal+0x40c
#4 0xffffffff8100c8af at trap_pfault+0x4f
#5 0xffffffff80fe3ac8 at calltrap+0x8
#6 0xffffffff80ebd5ab at vm_map_entry_delete+0x1b
#7 0xffffffff80eb8d0b at vm_map_delete+0x7b
#8 0xffffffff80ebd84c at vm_map_remove+0x9c
#9 0xffffffff80bb5f79 at pipeclose+0x2f9
#10 0xffffffff80bb5970 at pipe_close+0x60
#11 0xffffffff80ae2221 at _fdrop+0x11
#12 0xffffffff80ae543a at closef+0x24a
#13 0xffffffff80ae9194 at closefp_impl+0x64
#14 0xffffffff8100d119 at amd64_syscall+0x109
#15 0xffffffff80fe43db at fast_syscall_common+0xf8

Interestingly enough, the only process in the `ps` output of the core.txt that was in a `pipewr` state was the `zfs send` of a backup.

Also, my father and I have been working on our backup scripts, and we've run into other ZFS issues, described here:
https://forums.freebsd.org/threads/zfs-dataset-size-is-out-of-control-zfs-send-recv-hangs-on-random-datasets.93549/
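For context, the backup pipeline is of this general shape (dataset names and exact flags elided; the remote machine drives it):

# run on the remote backup machine; `sshd` on this host owns the pipe
# that the local `zfs send` writes into
ssh backup@thishost sudo zfs send ... | zfs receive ...

which would line up with this panic happening in pipeclose() under an `sshd` process while a `zfs send` sat blocked in `pipewr`.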