While trying to get more information related to bug #240201, which happens with a kernel build as GENERIC-NODEBUG, it turnes out that in the same environment (RPi3, writing from the network into a ZFS mirrored pool via doveadm) a kernel build as GENERIC panics in less than 30 seconds. As I do follow CURRENT the panic message changed with the latest changes related to vm_page from mtx_lock to spinlock. Because of the size of the stack backtrace, without a serial console and without a unsable keyboard after the panic I used screenshots to get - most of - the panic information. I collected three of them. Please expect some typos. * From bug #240201, GENERIC base r352023 panic: mtx_lock() by idle thread 0x... on sleep mutex pmap @ /usr/src/sys/arm6... cpuid = 0 time = ... KDB: stack backtrace: db_trace_self() ar db_trace_self_wrapper+0x28 db_trace_self_wrapper() at vpanic+0x18c vpanic() at panix+0x44 panic() at __mtx_lock_flages+0x1c4 __mtx_lock_flags() at pmap_fault+0x1bc pmap_fault() at data_abort+0xc0 data_abort() at do_el1h_sync+0x120 do_el1h_sync() at handle_el1h_sync+0x74 handle_el1h_sync() at dwc_otg_write_fifo+0x98 dwc_otg_write_fifo() at dwc_otg_host_data_tx+0x784 dwc_otg_host_data_tx() at dwc_otg_interrupt_poll_locked+0x894 dwc_otg_interrupt_poll_locked() at dwc_otg_filter_interrupt+0x134 dwc_otg_filter_interrupt() at intr_event_handle+0xac intr_event_handle() at intr_isrc_dispatch+0x34 intr_isrc_dispatch() at bmc2835_intc_intr+0x178 bmc2835_intc_intr() at intr_event_handle+0xac intr_event_handle() at intr_isrc_dispatch+0x34 intr_isrc_dispatch() at bcm_lintc_intr+0x1bc bcm_lintc_intr() at intr_irq_handler+0x74 intr_irq_handler() at handle_el1h_irq+0x78 handle_el1h_irq() at cpu_idle+0x54 cpu_idle() at sched_idletd+0x388 sched_idletd() at fork_exit+0x7c fork_exit() at fork_trampoline+0x10 KDB: enter: panic [ thread pid 11 tid 100003 ] Stopped at generic_bs_wr_4+0x0: ldr w1, [x3], #4 db> * From GENERIC base r352119 panic: acquiring blockable sleep lock with spinlock or cirtical section held (sleep mutex) pmap @ /usr/src/sys/arm64/arm64/pmap.c:5013 cpuid = 0 time = ... KDB: stack backtrace: db_trace_self() ar db_trace_self_wrapper+0x28 db_trace_self_wrapper() at vpanic+0x18c vpanic() at panix+0x44 panic() at witness_checkorder+0x... witness_checkorder() at __mtx_lock_flages __mtx_lock_flags() at pmap_fault pmap_fault() at data_abort data_abort() at do_el1h_sync do_el1h_sync() at handle_el1h_sync handle_el1h_sync() at dwc_otg_write_fifo dwc_otg_write_fifo() at dwc_otg_host_data_tx dwc_otg_host_data_tx() at dwc_otg_interrupt_poll_locked dwc_otg_interrupt_poll_locked() at dwc_otg_filter_interrupt dwc_otg_filter_interrupt() at intr_event_handle intr_event_handle() at intr_isrc_dispatch intr_isrc_dispatch() at bmc2835_intc_intr bmc2835_intc_intr() at intr_event_handle intr_event_handle() at intr_isrc_dispatch intr_isrc_dispatch() at bcm_lintc_intr bcm_lintc_intr() at intr_irq_handler intr_irq_handler() at handle_el1h_irq handle_el1h_irq() at lz4compress lz4compress() at zio_compress_data zio_compress_data() at zio_write_compress zio_write_compress() at zio_execute zio_execute() at taskqueue_run_locked taskqueue_run_locked() at taskqueue_thread_loop taskqueue_thread_loop() at fork_exit fork_exit() at fork_trampoline+0x10 KDB: enter: panic [ thread pid 0 tid 100224 ] Stopped at generic_bs_wr_4+0x0: ldr w1, [x3], #4 db> * From GENERIC base r352207 lock order reversal: 1st 0x... abd_chunk (UMA zone) 0 /usr/src/sys&vm/uma_core.c:4232 2nd 0x... pmap (pmap) 0 /usr/src/sys/arm64/arm64/pmap.c:5813 stack backtrace: #0 0x... at witness_debugger+0x64 #1 0x... at __mtx_lock_flags+0xb8 #2 0x... at pmap_fault+0x1bc #3 0x... at data_abort+0xc0 #4 0x... at do_el1h_sync+0x120 #5 0x... at handle_el1h_sync+0x74 #6 0x... at uma_dbg_free+0x58 #7 0x... at uma_zfree_arg+0x13c #8 0x... at abd_free+0xc0 #9 0x... at arc_hdr_free_pabd+0x94 #10 0x... at arc_write+0x1f8 #11 0x... at dbuf_write+0x630 #12 0x... at dbus_sync_leaf+0x5ac #13 0x... at dbus_sync_list+0xb4 #14 0x... at dbuf_sync_indirect+0x2e4 #15 0x... at dbus_sync_list+0xa4 #16 0x... at dnode_sync+0xdd4 #17 0x... at sync_dnodes_task+0x12c And a few seconds later: panic: acquiring blockable sleep lock with spinlock or cirtical section held (sleep mutex) pmap @ /usr/src/sys/arm64/arm64/pmap.c:5013 cpuid = 0 time = ... KDB: stack backtrace: db_trace_self() ar db_trace_self_wrapper+0x28 db_trace_self_wrapper() at vpanic+0x18c vpanic() at panix+0x44 panic() at witness_checkorder+0x... witness_checkorder() at __mtx_lock_flages __mtx_lock_flags() at pmap_fault pmap_fault() at data_abort data_abort() at do_el1h_sync do_el1h_sync() at handle_el1h_sync handle_el1h_sync() at dwc_otg_write_fifo dwc_otg_write_fifo() at dwc_otg_host_data_tx dwc_otg_host_data_tx() at dwc_otg_interrupt_poll_locked dwc_otg_interrupt_poll_locked() at dwc_otg_filter_interrupt+0x134 dwc_otg_filter_interrupt() at intr_event_handle+0xac intr_event_handle() at intr_isrc_dispatch+0x34 intr_isrc_dispatch() at bmc2835_intc_intr+0x178 bmc2835_intc_intr() at intr_event_handle+0xac intr_event_handle() at intr_isrc_dispatch+0x34 intr_isrc_dispatch() at bcm_lintc_intr+0x1bc bcm_lintc_intr() at intr_irq_handler+0x74 intr_irq_handler() at handle_el1h_irq+0x78 handle_el1h_irq() at cpu_idle+0x54 cpu_idle() at sched_idletd+0x388 sched_idletd() at fork_exit+0x7c fork_exit() at fork_trampoline+0x10 KDB: enter: panic [ thread pid 11 tid 100003 ] Stopped at generic_bs_wr_4+0x0: ldr w1, [x3], #4 db> Best regrads, Ralf
Might be an issue in "sys/arm64/arm64/pmap.c". Can you do this: pkg install gdb gdb83 /boot/kernel/kernel l *(pmap_fault+0x1bc) You might need to build a kernel with debugging symbols. --HPS
(In reply to Hans Petter Selasky from comment #1) > You might need to build a kernel with debugging symbols. Fortunately default make installkernel copies such a kernel in addition, so here it is: # gdb /usr/lib/debug/boot/kernel/kernel.debug ... Reading symbols from /usr/lib/debug/boot/kernel/kernel.debug... l *(pmap_fault+0x1bc) (gdb) l *(pmap_fault+0x1bc) 0xffff000000733e54 is in pmap_fault (/usr/src/sys/arm64/arm64/pmap.c:5813). 5808 break; 5809 case ISS_DATA_DFSC_TF_L0: 5810 case ISS_DATA_DFSC_TF_L1: 5811 case ISS_DATA_DFSC_TF_L2: 5812 case ISS_DATA_DFSC_TF_L3: 5813 PMAP_LOCK(pmap); 5814 /* Ask the MMU to check the address */ 5815 intr = intr_disable(); 5816 if (pmap == kernel_pmap) 5817 par = arm64_address_translate_s1e1r(far); (gdb) Ralf
Hi, Seems to be an issue after: https://reviews.freebsd.org/D6987 What happens if you compile a kernel without: PMAP_LOCK(pmap); and PMAP_UNLOCK(pmap); Index: sys/arm64/arm64/pmap.c =================================================================== --- sys/arm64/arm64/pmap.c (revision 352210) +++ sys/arm64/arm64/pmap.c (working copy) @@ -5810,7 +5810,7 @@ case ISS_DATA_DFSC_TF_L1: case ISS_DATA_DFSC_TF_L2: case ISS_DATA_DFSC_TF_L3: - PMAP_LOCK(pmap); + // PMAP_LOCK(pmap); /* Ask the MMU to check the address */ intr = intr_disable(); if (pmap == kernel_pmap) @@ -5818,7 +5818,7 @@ else par = arm64_address_translate_s1e0r(far); intr_restore(intr); - PMAP_UNLOCK(pmap); + // PMAP_UNLOCK(pmap); /* * If the translation was successful the address was invalid
Andrew, can you comment on this? --HPS
(In reply to Hans Petter Selasky from comment #3) > What happens if you compile a kernel without: > PMAP_LOCK(pmap); > and > PMAP_UNLOCK(pmap); It gets worser. Three times in a row the kernel panics at the end of booting. It is always the same: kbd1 at ukbd0 Warning: no-time-of-day clock registered, system time will not be set accurately panic: vm_fault_hold: fault on nofault entry, addr: 0xffff00005acc400 cpuid = 1 time = 11 KDB: stack backtrace: db_trace_self() ar db_trace_self_wrapper+0x28 db_trace_self_wrapper() at vpanic+0x18c vpanic() at panic+0x44 panic() at vm_fault_hold+0x1bc vm_fault() at data_abort+0x114 data_abort() at do_el1h_sync+0x128 do_el1h_sync() at handle_el1h_sync+0x74 handle_el1h_sync() at arc_cksum_verify+0xac arc_cksum_verify(9 at arc_buf_destroy_imlp+0x0c arc_buf_destroy_impl() at arc_buf_destroy+0x194 arc_buf_destroy() at dbuf_destroy+0x04 dbuf_destroy() at dbuf_evict_one+0x1a8 dbuf_evict_one() at dbuf_evict_thread+0x188 dbuf_evict_thraed() at fork_exit+0x7c fork_exit() at fork_trampoline+0x18 KDB: enter: panic [ thread pid 8 tid 100064 ] stopped at fletcher_2_native+0x38: undefined a8c1380d db> Ralf
This looks like a bug where dwc_otg_write_fifo tries to either access unmapped or write to read-only memory. On arm64 this needs to lock the pmap to ensure any page table updates by other threads have completed.
dwc_otg only access the pointers it is given. So that means dwc_otg is given an invalid pointer. --HPS
(In reply to iz-rpi03 from comment #5) Just to mention it: Because of the fletcher_*, dbuf_* and arc_* functions called in the last panic I imported the ZFS pool on a CURRENT amd64 system to check if the pool got damaged somehow. zpool(8) scrub did not have to repair anything. The devices are back in the arm64 hardware. Ralf
Can you try running "set vm.pmap.superpages_enabled=0" in loader before booting. It might be because a superpage is being promoted/demoted at the same time as dwc_otg_write_fifo is trying to access it.
(In reply to Andrew Turner from comment #9) > Can you try running "set vm.pmap.superpages_enabled=0" in loader before booting. I added vm.pmap.superpages_enabled=0 to /etc/loader.conf. The command is now running. At the moment ~23 times longer than it used to take to trigger the panic. I'll report again when it finishes.
(In reply to iz-rpi03 from comment #10) Meanwhile the command has finished successfully, without triggering a panic. Ralf
Could you (or anyone able to reproduce the "fault on nofault entry" panic) try the following? Add the following lines to the kernel configuration: options KTR options KTR_ENTRIES=(128*1024) options KTR_COMPILE=(KTR_PMAP) options KTR_MASK(KTR_PMAP) Then build a new kernel and trigger a panic. At the db> prompt, run "show ktr". Please save the output and post it here.
(In reply to Mark Johnston from comment #12) > At the db> prompt, run "show ktr". Please save the output and post it here. I will try it, but usually after the panic occured I can not input anything via the USB keyboard. So I will not be able to run "show ktr" execpt there is some kind of automatically running "show ktr" at the db> prompt in case of a panic. Ralf
If you can drop into single-user mode during boot, run the following command: # ddb script kdb.enter.default="show ktr" Then when the system panics it should run that command automatically.
(In reply to Mark Johnston from comment #14) It worked. Somehow. I suppose in comment #12 you ment "options KTR_MASK=KTR_PMAP"? Because "options KTR_MASK(KTR_PMAP)" gave me an syntax error which "options KTR_MASK=KTR_PMAP" did not. When the panic happend the "show ktr" output does not stop. After several minutes it is still running. That amount of output is too much to transfer it by hand. A screenshot from a not very sharp movie showed the beginning as: Stopped at generic_bs_wr_4+0x0: ldr w1, [x3], #4 db:0 kdb.enter.default> show ktr 1974 (0xfffff...:cpu1): pmap_enter: ffff00005e57f000 -> 000...35d7f000 1974 (0xfffff...:cpu1): pmat_enter: ffff00005e57e000 -> 000...35d7e000 ... Oh, meanwhile it has stopped. Keyboad input is not possible, as expected. I will attach a picture of the final screen. Hope this helps. Ralf
Created attachment 207428 [details] Final screen output of "sh ktr".
Created attachment 207536 [details] stack backtrace and show ktr output (In reply to Mark Johnston from comment #14) With my new USB to RPi serial cable I was able to store the output of a panic and the following "show ktr". Its size is ~11 Mbyte uncompressed. Now, via the serial console I am also able to enter commands to the debugger after a panic.
Created attachment 207574 [details] Early panic while booting: stack backtrace and show ktr output This time the panic occured very early, so I do not have to trigger it via doveadm.
Can you try the patch in https://reviews.freebsd.org/D21685?
(In reply to Andrew Turner from comment #19) > Can you try the patch in https://reviews.freebsd.org/D21685? Sure. Unfortunately the patch does not fix the panic. If you like to get the "show ktr" output just tell me. It's still running. If there are some kdb command outputs you like to see, tell me as well. Ralf # ./dovecrash panic: acquiring blockable sleep lock with spinlock or critical section held (sleep mutex) pmap @ /usr/src/sys/arm64/arm64/pmap.c:5845 cpuid = 0 time = 1568730961 KDB: stack backtrace: db_trace_self() at db_trace_self_wrapper+0x28 pc = 0xffff000000719d68 lr = 0xffff0000001061f4 sp = 0xffff000050185e30 fp = 0xffff000050186040 db_trace_self_wrapper() at vpanic+0x18c pc = 0xffff0000001061f4 lr = 0xffff0000003ff300 sp = 0xffff000050186050 fp = 0xffff000050186100 vpanic() at panic+0x44 pc = 0xffff0000003ff300 lr = 0xffff0000003ff0b0 sp = 0xffff000050186110 fp = 0xffff000050186190 panic() at witness_checkorder+0xa68 pc = 0xffff0000003ff0b0 lr = 0xffff000000465de0 sp = 0xffff0000501861a0 fp = 0xffff000050186210 witness_checkorder() at __mtx_lock_flags+0xb0 pc = 0xffff000000465de0 lr = 0xffff0000003df214 sp = 0xffff000050186220 fp = 0xffff000050186260 __mtx_lock_flags() at pmap_fault+0x338 pc = 0xffff0000003df214 lr = 0xffff000000734afc sp = 0xffff000050186270 fp = 0xffff000050186290 pmap_fault() at data_abort+0xc0 pc = 0xffff000000734afc lr = 0xffff000000736800 sp = 0xffff0000501862a0 fp = 0xffff000050186350 data_abort() at do_el1h_sync+0x128 pc = 0xffff000000736800 lr = 0xffff00000073663c sp = 0xffff000050186360 fp = 0xffff000050186390 do_el1h_sync() at handle_el1h_sync+0x74 pc = 0xffff00000073663c lr = 0xffff00000071c074 sp = 0xffff0000501863a0 fp = 0xffff0000501864b0 handle_el1h_sync() at dwc_otg_write_fifo+0x98 pc = 0xffff00000071c074 lr = 0xffff000000272c9c sp = 0xffff0000501864c0 fp = 0xffff0000501865a0 dwc_otg_write_fifo() at dwc_otg_host_data_tx+0x784 pc = 0xffff000000272c9c lr = 0xffff000000272264 sp = 0xffff0000501865b0 fp = 0xffff000050186600 dwc_otg_host_data_tx() at dwc_otg_interrupt_poll_locked+0x894 pc = 0xffff000000272264 lr = 0xffff00000026c15c sp = 0xffff000050186610 fp = 0xffff000050186680 dwc_otg_interrupt_poll_locked() at dwc_otg_filter_interrupt+0x134 pc = 0xffff00000026c15c lr = 0xffff00000026b874 sp = 0xffff000050186690 fp = 0xffff0000501866c0 dwc_otg_filter_interrupt() at intr_event_handle+0xac pc = 0xffff00000026b874 lr = 0xffff0000003c2408 sp = 0xffff0000501866d0 fp = 0xffff000050186710 intr_event_handle() at intr_isrc_dispatch+0x34 pc = 0xffff0000003c2408 lr = 0xffff0000007689a8 sp = 0xffff000050186720 fp = 0xffff000050186730 intr_isrc_dispatch() at bcm2835_intc_intr+0x170 pc = 0xffff0000007689a8 lr = 0xffff00000070a7d4 sp = 0xffff000050186740 fp = 0xffff000050186780 bcm2835_intc_intr() at intr_event_handle+0xac pc = 0xffff00000070a7d4 lr = 0xffff0000003c2408 sp = 0xffff000050186790 fp = 0xffff0000501867d0 intr_event_handle() at intr_isrc_dispatch+0x34 pc = 0xffff0000003c2408 lr = 0xffff0000007689a8 sp = 0xffff0000501867e0 fp = 0xffff0000501867f0 intr_isrc_dispatch() at bcm_lintc_intr+0x1bc pc = 0xffff0000007689a8 lr = 0xffff000000710504 sp = 0xffff000050186800 fp = 0xffff000050186880 bcm_lintc_intr() at intr_irq_handler+0x74 pc = 0xffff000000710504 lr = 0xffff00000076880c sp = 0xffff000050186890 fp = 0xffff0000501868b0 intr_irq_handler() at handle_el1h_irq+0x70 pc = 0xffff00000076880c lr = 0xffff00000071c130 sp = 0xffff0000501868c0 fp = 0xffff0000501869d0 handle_el1h_irq() at __mtx_unlock_flags+0x54 pc = 0xffff00000071c130 lr = 0xffff0000003df76c sp = 0xffff0000501869e0 fp = 0xffff000050186a90 __mtx_unlock_flags() at __mtx_unlock_flags+0x54 pc = 0xffff0000003df76c lr = 0xffff0000003df76c sp = 0xffff000050186aa0 fp = 0xffff000050186ac0 __mtx_unlock_flags() at ithread_loop+0x1f4 pc = 0xffff0000003df76c lr = 0xffff0000003c2b08 sp = 0xffff000050186ad0 fp = 0xffff000050186b50 ithread_loop() at fork_exit+0x7c pc = 0xffff0000003c2b08 lr = 0xffff0000003bf5f0 sp = 0xffff000050186b60 fp = 0xffff000050186b90 fork_exit() at fork_trampoline+0x10 pc = 0xffff0000003bf5f0 lr = 0xffff000000736394 sp = 0xffff000050186ba0 fp = 0x0000000000000000 KDB: enter: panic [ thread pid 12 tid 100035 ] Stopped at generic_bs_wr_4+0x8: ldr w1, [x3], #4 db:0:kdb.enter.default> show ktr
(In reply to iz-rpi03 from comment #20) Could you please retry with the updated patch?
(In reply to Mark Johnston from comment #21) > Could you please retry with the updated patch? Sure. I used the review D21685?id=62373 version from 6:29 AM today and tried to trigger the panic three times. The kernel does not panic any more. - Superpages are enabled and the kernel was still build with the "sh ktr" infrastructure. - During the first try those "smsc0: warning: Failed to read register 0x114 smsc0: warning: smsc_miibus_readreg MII is busy" messages from bug #240201 appeard. But even those get better as they do not flood the console any more and the kernel/process/shell responds to keyboard interaction now. - After a cold boot, the second an third try happend without any "smsc0" messages. Thank you all.
Fixed by base r352584.
(In reply to iz-rpi03 from comment #22) Thanks for testing.