276587 – ccp(4) causes 'sysctl -a' to hang when reading OID 'kern.geom.conftxt'

Status:	New

Alias:	None

Product:	Base System
Classification:	Unclassified
Component:	kern
Version:	14.0-RELEASE
Hardware:	amd64 Any

Importance:	--- Affects Only Me
Assignee:	freebsd-bugs (Nobody)

URL:
Keywords:

Depends on:
Blocks:

Reported:	2024-01-24 15:10 UTC by Joshua Kinard
Modified:	2024-01-29 17:52 UTC
CC List:	1 user

See Also:

Description Joshua Kinard 2024-01-24 15:10:32 UTC

The ccp(4) driver appears to work on my NAS machine's Ryzen 5 2200G a bit better under 14.0-RELEASE, in that GELI can create encrypted swap w/o hanging and as far as I can tell, that swap seems to work.  However, attempting to dump all sysctl values w/ 'sysctl -a' will cause sysctl to do a hard hang when it gets to OID 'kern.geom.conftxt'.  The process becomes unresponsive and cannot be exited w/ ctrl+c, or killed by any signals, including SIGKILL.  It goes into the D+ state and only a reboot can clear it.

Sample output from truss showing the hang-up on 'kern.geom.conftxt':
> 42815: 0.000007549 write(1,"\n",1)               = 1 (0x1)
> 42815: 0.000007590 __sysctl("sysctl.next",5,0x45a51ae7830,0x45a51ae7828,0x0,0) = 0 (0x0)
> 42815: 0.000010670 __sysctl("sysctl.name { 1.2147483316.2147483313 }",5,0x45a51ae6f70,0x45a51ae6af0,0x0,0) = 0 (0x0)
> 42815: 0.000008760 __sysctl("sysctl.oidfmt kern.geom.conftxt",5,0x45a51ae73e0,0x45a51ae6af8,0x0,0) = 0 (0x0)
> ^C^C

Some dmesg/pciconf info, in case it helps:
dmesg:
> # dmesg | grep ccp
> ccp0: <AMD CCP-5a> mem 0xfc700000-0xfc7fffff,0xfc884000-0xfc885fff irq 54 at device 0.2 on pci10

> [26] GEOM_ELI: Device da0p2.eli created.
> [26] GEOM_ELI: Encryption: AES-XTS 256
> [26] GEOM_ELI:     Crypto: hardware

pciconf -lvcV
ccp0@pci0:10:0:2:       class=0x108000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x15df subvendor=0x1043 subdevice=0x876b
    vendor     = 'Advanced Micro Devices, Inc. [AMD]'
    device     = 'Family 17h (Models 10h-1fh) Platform Security Processor'
    class      = encrypt/decrypt
    cap 09[48] = vendor (length 8)
    cap 01[50] = powerspec 3  supports D0 D3  current D0
    cap 10[64] = PCI-Express 2 endpoint max data 256(256) RO NS
                 max read 512
                 link x16(x16) speed 8.0(8.0) ASPM disabled(L0s/L1)
    cap 05[a0] = MSI supports 2 messages, 64 bit
    cap 11[c0] = MSI-X supports 2 messages, enabled
                 Table in map 0x24[0x0], PBA in map 0x24[0x1000]
    ecap 000b[100] = Vendor [1] ID 0001 Rev 1 Length 16

Comment 1 Joshua Kinard 2024-01-25 21:13:23 UTC

Per comment #10 on Bug 253860, this is the 'procstat -kk' command on the PID of a hung 'sysctl -a' process while ccp(4) is loaded and in use w/ GELI-encrypted swap partitions:

> # procstat -kk 95687
>   PID    TID COMM                TDNAME              KSTACK
> 95687 100677 sysctl              -                   mi_switch+0xbb _sleep+0x1ed g_waitfor_event+0x118 sysctl_kern_geom_confany+0xc1 sysctl_root_handler_locked+0x90 sysctl_root+0x23e userland_sysctl+0x176 sys___sysctl+0x5c amd64_syscall+0x109 fast_syscall_common+0xf8

My guess is it's hung up on 'g_waitfor_event', of which there are two in the 'sysctl_kern_geom_confany' function, and I am not sure which one is affected here.
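
To pick the sleeping frame out of a `procstat -kk` line without reading it by eye, a small filter can print whatever follows `_sleep` in the KSTACK column. This is only a sketch: `sleep_caller` is a hypothetical helper name, and it assumes the KSTACK layout shown above (space-separated frames, innermost first, TID in the second column).

```shell
# sleep_caller: read `procstat -kk` output on stdin and, for each thread
# that is sleeping, print its TID and the frame that called _sleep.
# Hypothetical helper; assumes frames are space-separated, innermost first.
sleep_caller() {
    awk '{
        for (i = 1; i < NF; i++)
            if (index($i, "_sleep+") == 1) {   # field starts with "_sleep+"
                print $2, $(i + 1)             # TID, then the caller of _sleep
                break
            }
    }'
}
```

Usage would look like `procstat -kk 95687 | sleep_caller`. It only names the function and offset; telling the two `g_waitfor_event` call sites in `sysctl_kern_geom_confany` apart still means resolving the `+0xc1` offset against the kernel source or symbols.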

Comment 2 John Baldwin (freebsd_committer) 2024-01-25 22:09:05 UTC

Can you find the PID of the "geom" kernel process and then provide `procstat -kk` output for that PID too?  In the case of GEOM events, the event is handed off to one of the threads in that process for handling.
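
One way to do the lookup requested here is to filter `ps` output for the process whose command name is exactly `geom`. This is a sketch under assumptions: `geom_pids` is a hypothetical helper, and it assumes the kernel process shows up with the command name `geom` in `ps -ax -o pid,comm`.

```shell
# geom_pids: read `ps -ax -o pid,comm` output on stdin and print the PID
# of every process whose command name is exactly "geom".
# Hypothetical helper; the header line is skipped because its second
# column is "COMMAND", not "geom".
geom_pids() {
    awk '$2 == "geom" { print $1 }'
}
```

A possible invocation is `ps -ax -o pid,comm | geom_pids | xargs -n 1 procstat -kk`, which prints the kernel stacks of all threads in that process.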

Comment 3 Joshua Kinard 2024-01-26 00:39:51 UTC

(In reply to John Baldwin from comment #2)

So it looks like I was wrong about ccp(4) seemingly working.  It is still hanging when GEOM is initializing GELI on my swap drives, of which I have two in this system.  Two things fooled me:
1. SSH came up and allowed me to log in, and I never thought to look at the console.
2. I recently changed the console resolution to 1280x1024, which generally works, but there is a large black rectangle being drawn over the bottom 1/8th of the monitor, so I *thought* it was sitting at a login prompt, because I could see the swap/GELI bits printed out just above the rectangle.  If that rectangle wasn't there, I'd have noticed it wasn't at a login prompt.

So the reason 'sysctl -a' is hanging is because g_waitfor_event is still waiting for GELI to finish doing whatever it's stuck doing in trying to init my encrypted swap volumes.

That means Bug #253860 is still unfixed, so it needs to be re-opened.  This one is probably a duplicate of that bug, being a side-effect of GELI lost somewhere in space.

If it helps, I can at least get you 'procstat -kk' outputs of the four GELI threads on the first encrypted swap volume, where it is stuck:

> root    36041   0.0  0.0      0     16  -  DL   18:50     0:00.00 [g_eli[0] da0p2]
> root    36255   0.0  0.0      0     16  -  DL   18:50     0:00.00 [g_eli[1] da0p2]
> root    36973   0.0  0.0      0     16  -  DL   18:50     0:00.00 [g_eli[2] da0p2]
> root    37530   0.0  0.0      0     16  -  DL   18:50     0:00.00 [g_eli[3] da0p2]

> # procstat -kk 36041
>   PID    TID COMM                TDNAME              KSTACK
> 36041 100531 g_eli[0] da0p2      -                   mi_switch+0xbb _sleep+0x1ed g_eli_worker+0x37e fork_exit+0x7f fork_trampoline+0xe
> 
> # procstat -kk 36255
>   PID    TID COMM                TDNAME              KSTACK
> 36255 100532 g_eli[1] da0p2      -                   mi_switch+0xbb _sleep+0x1ed g_eli_worker+0x37e fork_exit+0x7f fork_trampoline+0xe
> 
> # procstat -kk 36973
>   PID    TID COMM                TDNAME              KSTACK
> 36973 100686 g_eli[2] da0p2      -                   mi_switch+0xbb _sleep+0x1ed g_eli_worker+0x37e fork_exit+0x7f fork_trampoline+0xe
> 
> # procstat -kk 37530
>   PID    TID COMM                TDNAME              KSTACK
> 37530 100687 g_eli[3] da0p2      -                   mi_switch+0xbb _sleep+0x1ed g_eli_worker+0x37e fork_exit+0x7f fork_trampoline+0xe

This feels like there's a missed interrupt not being handled somewhere, so the process(es) are forever waiting.  Had my fair share of those on Linux when I played around with driver debugging long ago.
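
When many worker threads are involved, grouping them by identical kernel stack makes it quick to confirm that they are all parked in the same place, as the four `g_eli` workers above are. The following is a minimal sketch; `kstack_groups` is a hypothetical helper, and it assumes the first field containing `+0x` is the start of the KSTACK column.

```shell
# kstack_groups: read `procstat -kk` output on stdin and print each
# distinct kernel stack preceded by the number of threads that share it.
# Hypothetical helper; COMM may contain spaces (e.g. "g_eli[0] da0p2"),
# so the stack is located by its first "symbol+0xoffset" field.
kstack_groups() {
    awk '{
        for (i = 3; i <= NF; i++)
            if (index($i, "+0x")) {
                stack = $i
                for (j = i + 1; j <= NF; j++)
                    stack = stack " " $j
                count[stack]++
                break
            }
    }
    END { for (s in count) print count[s], s }'
}
```

For example: `for p in 36041 36255 36973 37530; do procstat -kk "$p"; done | kstack_groups`. A single output line with a count of 4 would mean every worker is sleeping at the same point in `g_eli_worker`.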

Comment 4 Joshua Kinard 2024-01-26 00:42:38 UTC

I've not tried learning it yet, but if you have any methods of using dtrace or the kernel debugger to poke the stuck processes some more and see what's "stuck", I can give them a try.  Eventually, though, I'll have to abandon ccp(4) to get the system back into a working state...it is my NAS server.

Comment 5 John Baldwin (freebsd_committer) 2024-01-29 17:52:18 UTC

A missed interrupt was the theory in the previous PR (and in that case the theory was that it was possibly a hardware issue).  You can use kgdb on a live system to investigate stack traces of sleeping threads and examine variables, etc.  Just 'pkg install devel/gdb' and run 'kgdb' as root.  You can then use normal GDB commands.  Each kernel thread is a thread in GDB in this mode.  That said, there is probably not a stack to examine if ccp(4) is missing an interrupt.  Instead, you would want to either see if ccp(4) maintains a linked-list of pending operations or dump the descriptor rings to infer if there are commands in flight.  Not sure if ccp has a counter for "in-flight" requests (ccr(4) which ccp(4) is roughly based on does have such a counter).
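
For the kgdb step, the thread IDs already collected with `procstat -kk` can be turned into a short GDB command list. This is a sketch, not a tested recipe: `kgdb_bt_script` is a hypothetical helper, and it assumes that kgdb provides a `tid` command for selecting a thread by kernel thread ID (`bt` is the standard GDB backtrace command).

```shell
# kgdb_bt_script: print a GDB command list that selects each kernel
# thread ID given as an argument and prints its backtrace.
# Assumption: `tid <n>` is the kgdb command for switching to the kernel
# thread with that TID; `bt` is the standard GDB backtrace command.
kgdb_bt_script() {
    for tid in "$@"; do
        printf 'tid %s\nbt\n' "$tid"
    done
}
```

For example, `kgdb_bt_script 100531 100532 100686 100687 > /tmp/geli.gdb` writes the commands to a file (the path is arbitrary), which can then be loaded from the kgdb prompt with GDB's `source /tmp/geli.gdb`. As noted above, this only shows where the threads sleep; it will not reveal a lost interrupt by itself.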