Bug 257298

Summary: kernel panic with kern.cam.da.enable_uma_ccbs=1
Product: Base System Reporter: pr
Component: kernAssignee: freebsd-bugs (Nobody) <bugs>
Status: Closed Overcome By Events    
Severity: Affects Some People CC: grahamperrin, sigsys
Priority: ---    
Version: CURRENT   
Hardware: Any   
OS: Any   

Description pr 2021-07-20 15:43:17 UTC
Hi,
despite https://reviews.freebsd.org/D31054 I can still reproduce the same exact bt as reported in the mailing list https://lists.freebsd.org/archives/freebsd-current/2021-June/000267.html

In particular, with a GENERIC kernel I get:
panic: Duplicate free of 0xffffa02039d7a000 from zone 0xffff000166aec000(ada_ccb) slab 0xffffa02039d7afd8(0)          
cpuid = 10                                                                                                            
time = 1626781044                                                                                                     
KDB: stack backtrace:                                                                                                 
db_trace_self() at db_trace_self                                                                                      
db_trace_self_wrapper() at db_trace_self_wrapper+0x30                                                                 
vpanic() at vpanic+0x188                                                                                              
panic() at panic+0x44                                                                                                 
uma_dbg_free() at uma_dbg_free+0x1e4                                                                                  
uma_zfree_arg() at uma_zfree_arg+0x358                                                                                
ahci_end_transaction() at ahci_end_transaction+0x7a4
ahci_ch_intr_main() at ahci_ch_intr_main+0x660
ahci_ch_intr() at ahci_ch_intr+0x5c
ahci_intr() at ahci_intr+0xe4
ithread_loop() at ithread_loop+0x2a8
fork_exit() at fork_exit+0x74
fork_trampoline() at fork_trampoline+0x14
KDB: enter: panic
[ thread pid 12 tid 100137 ]
Stopped at      kdb_enter+0x48: undefined       f904411f

And with a GENERIC-NODEBUG kernel I get:
panic: vm_fault failed: ffff0000007a9b40 error 1
cpuid = 2
time = 1626773972
KDB: stack backtrace:
db_trace_self() at db_trace_self
db_trace_self_wrapper() at db_trace_self_wrapper+0x30
vpanic() at vpanic+0x188
panic() at panic+0x44
data_abort() at data_abort+0x1e0
handle_el1h_sync() at handle_el1h_sync+0x74
--- exception, esr 0x96000044
zone_release() at zone_release+0x224
bucket_drain() at bucket_drain+0xe8
bucket_cache_reclaim_domain() at bucket_cache_reclaim_domain+0x3b0
zone_reclaim() at zone_reclaim+0x194
uma_reclaim_domain() at uma_reclaim_domain+0xbc
vm_pageout_worker() at vm_pageout_worker+0x594
vm_pageout() at vm_pageout+0x1e0
fork_exit() at fork_exit+0x94
fork_trampoline() at fork_trampoline+0x14
KDB: enter: panic
[ thread pid 33 tid 100222 ]
Stopped at      kdb_enter+0x48: undefined       f903c11f

This is with CURRENT as of 439097486ba0453e057c05d548fa306d91c784e5
Author: Jessica Clarke <jrtc27@FreeBSD.org>
Date:   Mon Jul 19 17:19:23 2021 +0100

(This is just where I am now, nothing to do with Jessica commit).

Environment:
# uname -a
FreeBSD asn 14.0-CURRENT FreeBSD 14.0-CURRENT #0 main-n248066-439097486ba0: Mon Jul 19 21:33:35 CEST 2021     root@asn:/usr/obj/usr/src/arm64.aarch64/sys/GENERIC-NODEBUG  arm64

(or GENERIC instead than GENERIC-NODEBUG)

I have a board that is known to have low signal levels in the SATA subsystem and hits frequent minor troubles with ada disks, such as:
 (ada0:ahcich1:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 c0 ce 36 40 06 00 00 00 00 00
(ada0:ahcich1:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada0:ahcich1:0:0:0): Retrying command, 3 more tries remain

or

ahcich1: Timeout on slot 14 port 0
ahcich1: is 00000000 cs 0003c080 ss 0003c080 rs 0003c080 tfd 50 serr 00180000 cmd 0000c017
(ada0:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 20 e8 8f ed 40 08 00 00 00 00 00
(ada0:ahcich1:0:0:0): CAM status: Command timeout
(ada0:ahcich1:0:0:0): Retrying command, 3 more tries remain

This is ok, I mean, FreeBSD is solid enough to cope with that and get no data loss at fs level.

While I understand that this is sub optimal, the circumstance reveals the bug which is the object of this report: i.e. the kernel panics with faulty hardware.

Interestingly I can avoid the bug by setting kern.cam.da.enable_uma_ccbs=0

Note that I set .da., not .ada sysctl. I have no da disks in the system, only ada (two).

# sysctl -a | fgrep cbs
kern.cam.da.enable_uma_ccbs: 0
kern.cam.ada.enable_uma_ccbs: 0

I am unable to get a kernel dump with line numbers (RAM >> swap). Is there a workaround for this?

Regarding the bug, I can test further, please suggest the direction.
Comment 1 pr 2021-07-21 09:53:46 UTC
I hit the bug again with kern.cam.ada.enable_uma_ccbs=0

I also find out that my previous post was partially incorrect: I have to set the .ada. version of the sysctl, no the .da. Please forgive me.

So, to summarize, under heavy work (poudriere with 16 builders):
kern.cam.(a)da.enable_uma_ccbs=1 -> kernel panics quickly
kern.cam.(a)da.enable_uma_ccbs=0 -> kernel panics in a few hours

If the kernel is GENERIC (with INVARIANTS) sometimes it hangs at boot/rc, random positions (I assume then the first reset is issued). GENERIC-NODEBUG usually boots ok.
Comment 2 sigsys 2022-02-23 10:15:09 UTC
Getting those panics too (I think) with recent -CURRENT (9ad210c15933e5a49c51fba134e77d84cfdba94f).

All relevant sysctls at their default value.

#0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
#1  doadump (textdump=textdump@entry=0) at /usr/src/sys/kern/kern_shutdown.c:406
#2  0xffffffff804cadda in db_dump (dummy=<optimized out>, dummy2=<unavailable>, dummy3=<unavailable>, dummy4=<unavailable>) at /usr/src/sys/ddb/db_command.c:575
#3  0xffffffff804cac92 in db_command (last_cmdp=<optimized out>, cmd_table=<optimized out>, dopager=dopager@entry=1) at /usr/src/sys/ddb/db_command.c:482
#4  0xffffffff804ca8ed in db_command_loop () at /usr/src/sys/ddb/db_command.c:535
#5  0xffffffff804cdf26 in db_trap (type=<optimized out>, code=<optimized out>) at /usr/src/sys/ddb/db_main.c:270
#6  0xffffffff80c5b4ab in kdb_trap (type=type@entry=3, code=code@entry=0, tf=tf@entry=0xfffffe00c4e8d810) at /usr/src/sys/kern/subr_kdb.c:733
#7  0xffffffff810ef5ca in trap (frame=0xfffffe00c4e8d810) at /usr/src/sys/amd64/amd64/trap.c:609
#8  <signal handler called>
#9  kdb_enter (why=0xffffffff812eeecb "panic", msg=<optimized out>) at /usr/src/sys/kern/subr_kdb.c:506
#10 0xffffffff80c0cc90 in vpanic (fmt=0xffffffff8132fdc0 "malloc(M_WAITOK) with sleeping prohibited", ap=ap@entry=0xfffffe00c4e8d970) at /usr/src/sys/kern/kern_shutdown.c:953
#11 0xffffffff80c0ca23 in panic (fmt=0xffffffff81e962d0 <cnputs_mtx> "Ҳ*\201\377\377\377\377") at /usr/src/sys/kern/kern_shutdown.c:889
#12 0xffffffff80be0be4 in malloc_dbg (vap=<optimized out>, sizep=<optimized out>, mtp=<optimized out>, flags=<optimized out>) at /usr/src/sys/kern/kern_malloc.c:542
#13 0xffffffff80be0a0d in malloc (size=16, mtp=0xffffffff81b51520 <M_KMALLOC>, flags=258) at /usr/src/sys/kern/kern_malloc.c:645
#14 0xffffffff834ab0ab in drm_crtc_helper_set_config () from /boot/modules/drm.ko
#15 0xffffffff833db472 in radeon_crtc_set_config () from /boot/modules/radeonkms.ko
#16 0xffffffff834a9ccd in __drm_mode_set_config_internal () from /boot/modules/drm.ko
#17 0xffffffff834a59d0 in drm_client_modeset_commit_force () from /boot/modules/drm.ko
#18 0xffffffff834e6d62 in drm_fb_helper_restore_fbdev_mode_unlocked () from /boot/modules/drm.ko
#19 0xffffffff834e0cd3 in vt_kms_postswitch () from /boot/modules/drm.ko
#20 0xffffffff80a5a3ef in vt_window_switch (vw=0xfffff80067e1b600, vw@entry=0xffffffff81aeb930 <vt_conswindow>) at /usr/src/sys/dev/vt/vt_core.c:624
#21 0xffffffff80a57aef in vtterm_cngrab (tm=<optimized out>) at /usr/src/sys/dev/vt/vt_core.c:1875
#22 0xffffffff80b9ec06 in cngrab () at /usr/src/sys/kern/kern_cons.c:383
#23 0xffffffff80c0cbce in vpanic (fmt=0xffffffff81296a0c "Duplicate free of %p from zone %p(%s) slab %p(%d)", ap=ap@entry=0xfffffe00c4e8dca0) at /usr/src/sys/kern/kern_shutdown.c:937
#24 0xffffffff80c0ca23 in panic (fmt=0xffffffff81e962d0 <cnputs_mtx> "Ҳ*\201\377\377\377\377") at /usr/src/sys/kern/kern_shutdown.c:889
#25 0xffffffff80f800b1 in uma_dbg_free (zone=0xfffff800071ea500, slab=0xfffff8011dd4bfd8, item=0xfffff8011dd4b000) at /usr/src/sys/vm/uma_core.c:5749
#26 0xffffffff80f78663 in item_dtor (zone=0xfffff800071ea500, item=0xfffff8011dd4b000, size=272, udata=0x0, skip=SKIP_NONE) at /usr/src/sys/vm/uma_core.c:3465
#27 uma_zfree_arg (zone=0xfffff800071ea500, item=0xfffff8011dd4b000, udata=udata@entry=0x0) at /usr/src/sys/vm/uma_core.c:4448
#28 0xffffffff80391893 in uma_zfree (zone=0xffffffff81e962d0 <cnputs_mtx>, zone@entry=0xfffff8011dd4b000, item=0x80) at /usr/src/sys/vm/uma.h:408
#29 0xffffffff8050c2e9 in ahci_process_read_log (ch=0xfffffe00d84fa000, ccb=0xfffff8011dd4b000) at /usr/src/sys/dev/ahci/ahci.c:2282
#30 ahci_end_transaction (slot=<optimized out>, et=<optimized out>) at /usr/src/sys/dev/ahci/ahci.c:2104
#31 0xffffffff8050b790 in ahci_ch_intr_main (ch=<optimized out>, ch@entry=0xfffffe00d84fa000, istatus=istatus@entry=2) at /usr/src/sys/dev/ahci/ahci.c:1473
#32 0xffffffff8050a541 in ahci_ch_intr (arg=0xfffffe00d84fa000) at /usr/src/sys/dev/ahci/ahci.c:1328
#33 0xffffffff80508d41 in ahci_intr (data=<optimized out>) at /usr/src/sys/dev/ahci/ahci.c:529
#34 0xffffffff80bc74a9 in intr_event_execute_handlers (ie=0xfffff80003aa7a00, p=<optimized out>) at /usr/src/sys/kern/kern_intr.c:1205
#35 ithread_execute_handlers (ie=0xfffff80003aa7a00, p=<optimized out>) at /usr/src/sys/kern/kern_intr.c:1218
#36 ithread_loop (arg=<optimized out>, arg@entry=0xfffff80003ef1e00) at /usr/src/sys/kern/kern_intr.c:1306
#37 0xffffffff80bc3b40 in fork_exit (callout=0xffffffff80bc7230 <ithread_loop>, arg=0xfffff80003ef1e00, frame=0xfffffe00c4e8df40) at /usr/src/sys/kern/kern_fork.c:1106
#38 <signal handler called>

That's with an SSD often giving errors like this:

(ada1:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 40 d0 8a 63 40 12 00 00 00 00 00
(ada1:ahcich2:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada1:ahcich2:0:0:0): Retrying command, 3 more tries remain

I guess changing the SSD would make the problem go away but there might be a bug there.