The ccp(4) driver appears to work a bit better on my NAS machine's Ryzen 5 2200G under 14.0-RELEASE, in that GELI can create encrypted swap w/o hanging and, as far as I can tell, that swap seems to work.

However, attempting to dump all sysctl values w/ 'sysctl -a' causes sysctl to hard hang when it gets to OID 'kern.geom.conftxt'. The process becomes unresponsive and cannot be exited w/ ctrl+c or killed by any signal, including SIGKILL. It goes into the D+ state and only a reboot can clear it.

Sample output from truss showing the hang-up on 'kern.geom.conftxt':

> 42815: 0.000007549 write(1,"\n",1) = 1 (0x1)
> 42815: 0.000007590 __sysctl("sysctl.next",5,0x45a51ae7830,0x45a51ae7828,0x0,0) = 0 (0x0)
> 42815: 0.000010670 __sysctl("sysctl.name { 1.2147483316.2147483313 }",5,0x45a51ae6f70,0x45a51ae6af0,0x0,0) = 0 (0x0)
> 42815: 0.000008760 __sysctl("sysctl.oidfmt kern.geom.conftxt",5,0x45a51ae73e0,0x45a51ae6af8,0x0,0) = 0 (0x0)
> ^C^C

Some dmesg/pciconf info, in case it helps:

dmesg:

> # dmesg | grep ccp
> ccp0: <AMD CCP-5a> mem 0xfc700000-0xfc7fffff,0xfc884000-0xfc885fff irq 54 at device 0.2 on pci10
> [26] GEOM_ELI: Device da0p2.eli created.
> [26] GEOM_ELI: Encryption: AES-XTS 256
> [26] GEOM_ELI: Crypto: hardware

pciconf -lvcV

ccp0@pci0:10:0:2: class=0x108000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x15df subvendor=0x1043 subdevice=0x876b
    vendor = 'Advanced Micro Devices, Inc. [AMD]'
    device = 'Family 17h (Models 10h-1fh) Platform Security Processor'
    class = encrypt/decrypt
    cap 09[48] = vendor (length 8)
    cap 01[50] = powerspec 3 supports D0 D3 current D0
    cap 10[64] = PCI-Express 2 endpoint max data 256(256) RO NS
                 max read 512
                 link x16(x16) speed 8.0(8.0) ASPM disabled(L0s/L1)
    cap 05[a0] = MSI supports 2 messages, 64 bit
    cap 11[c0] = MSI-X supports 2 messages, enabled
                 Table in map 0x24[0x0], PBA in map 0x24[0x1000]
    ecap 000b[100] = Vendor [1] ID 0001 Rev 1 Length 16
Per comment #10 on Bug 253860, this is the 'procstat -kk' output for the PID of a hung 'sysctl -a' process while ccp(4) is loaded and in use w/ GELI-encrypted swap partitions:

> # procstat -kk 95687
> PID TID COMM TDNAME KSTACK
> 95687 100677 sysctl - mi_switch+0xbb _sleep+0x1ed g_waitfor_event+0x118 sysctl_kern_geom_confany+0xc1 sysctl_root_handler_locked+0x90 sysctl_root+0x23e userland_sysctl+0x176 sys___sysctl+0x5c amd64_syscall+0x109 fast_syscall_common+0xf8

My guess is that it's hung in 'g_waitfor_event'. There are two calls to it in the 'sysctl_kern_geom_confany' function, and I am not sure which one is affected here.
Can you find the PID of the "geom" kernel process and then provide `procstat -kk` output for that PID too? In the case of GEOM events, the event is handed off to one of the threads in that process for handling.
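A minimal sketch of that lookup, assuming `ps -ax -o pid,comm` lists the kernel process under the command name `geom`. The helper name `kproc_pid` is made up for illustration:

```shell
#!/bin/sh
# kproc_pid NAME (hypothetical helper): read "PID COMMAND" lines, as printed
# by `ps -ax -o pid,comm`, on stdin and print the PID of the first process
# whose command name is exactly NAME.
kproc_pid() {
    awk -v name="$1" '$2 == name { print $1; exit }'
}

# Intended use on the affected machine (as root); assumes the GEOM kernel
# process shows up with the command name "geom":
#   pid=$(ps -ax -o pid,comm | kproc_pid geom)
#   procstat -kk "$pid"
```

Matching on the exact command name avoids also picking up userland processes whose arguments merely contain the string "geom".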
(In reply to John Baldwin from comment #2)

So it looks like I was wrong about ccp(4) seemingly working. It is still hanging when GEOM is initializing GELI on my swap drives, of which I have two in this system. Two things fooled me:

1. SSH came up and allowed me to log in, and I never thought to look at the console.
2. I recently changed the console resolution to 1280x1024, which generally works, but there is a large black rectangle being drawn over the bottom 1/8th of the monitor, so I *thought* it was sitting at a login prompt, because I could see the swap/GELI bits printed out just above the rectangle. If that rectangle wasn't there, I'd have noticed it wasn't at a login prompt.

So the reason 'sysctl -a' is hanging is that g_waitfor_event is still waiting for GELI to finish whatever it's stuck doing while trying to init my encrypted swap volumes. That means Bug #253860 is still unfixed, so it needs to be re-opened. This one is probably a duplicate of that bug, being a side-effect of GELI lost somewhere in space.
If it helps, I can at least get you 'procstat -kk' outputs of the four GELI threads on the first encrypted swap volume, where it is stuck:

> root 36041 0.0 0.0 0 16 - DL 18:50 0:00.00 [g_eli[0] da0p2]
> root 36255 0.0 0.0 0 16 - DL 18:50 0:00.00 [g_eli[1] da0p2]
> root 36973 0.0 0.0 0 16 - DL 18:50 0:00.00 [g_eli[2] da0p2]
> root 37530 0.0 0.0 0 16 - DL 18:50 0:00.00 [g_eli[3] da0p2]

> # procstat -kk 36041
> PID TID COMM TDNAME KSTACK
> 36041 100531 g_eli[0] da0p2 - mi_switch+0xbb _sleep+0x1ed g_eli_worker+0x37e fork_exit+0x7f fork_trampoline+0xe
>
> # procstat -kk 36255
> PID TID COMM TDNAME KSTACK
> 36255 100532 g_eli[1] da0p2 - mi_switch+0xbb _sleep+0x1ed g_eli_worker+0x37e fork_exit+0x7f fork_trampoline+0xe
>
> # procstat -kk 36973
> PID TID COMM TDNAME KSTACK
> 36973 100686 g_eli[2] da0p2 - mi_switch+0xbb _sleep+0x1ed g_eli_worker+0x37e fork_exit+0x7f fork_trampoline+0xe
>
> # procstat -kk 37530
> PID TID COMM TDNAME KSTACK
> 37530 100687 g_eli[3] da0p2 - mi_switch+0xbb _sleep+0x1ed g_eli_worker+0x37e fork_exit+0x7f fork_trampoline+0xe

This feels like there's a missed interrupt not being handled somewhere, so the process(es) are forever waiting. Had my fair share of those on Linux when I played around with driver debugging long ago.
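For anyone repeating this, the per-worker 'procstat -kk' runs above can be collected in one pass instead of by hand. A rough sketch, assuming `ps aux`-style input with the PID in the second column and the `[g_eli[N] provider]` naming shown in the listing; `geli_worker_pids` is a hypothetical helper name, not an existing tool:

```shell
#!/bin/sh
# geli_worker_pids PROVIDER (hypothetical helper): read `ps aux`-style lines
# on stdin and print the PID (second column) of every GELI worker thread
# attached to PROVIDER, e.g. da0p2.
geli_worker_pids() {
    awk -v prov="$1" '
        index($0, "[g_eli[") && index($0, " " prov "]") { print $2 }
    '
}

# Intended use on the affected machine (as root). The "ww" flags ask ps not
# to truncate the command column:
#   for pid in $(ps auxww | geli_worker_pids da0p2); do
#       procstat -kk "$pid"
#   done
```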
I've not tried learning them yet, but if you have any methods of using dtrace or the kernel debugger to poke the stuck processes some more and see what's "stuck", I can give them a try. Eventually, though, I'll have to abandon ccp(4) to get the system back into a working state... it is my NAS server.
A missed interrupt was the theory in the previous PR (and in that case the theory was that it was possibly a hardware issue).

You can use kgdb on a live system to investigate stack traces of sleeping threads, examine variables, etc. Just 'pkg install devel/gdb' and run 'kgdb' as root. You can then use normal GDB commands. Each kernel thread is a thread in GDB in this mode.

That said, there is probably not a stack to examine if ccp(4) is missing an interrupt. Instead, you would want to either see if ccp(4) maintains a linked list of pending operations or dump the descriptor rings to infer whether there are commands in flight. I'm not sure if ccp(4) has a counter for "in-flight" requests (ccr(4), which ccp(4) is roughly based on, does have such a counter).