Platform: AMD EPYC 9xx5 CPU (aka "Turin", aka "Zen5"). The serial console is uart.1 (0x2f8), which is connected to the BMC's Serial-Over-LAN service.

To install our system, we boot from LAN. The bootloader comes up, finds and loads the kernel and pre-load modules, and boots into an NFS root filesystem. It then runs `bsdinstall' with a pre-generated config file:

> bsdinstall script /bsdinstall.cfg

The install begins normally: it finds the NICs and storage devices, nukes the installation target and creates a new ZFS root there, expands the base system tarball (releng/14.2 + driver backports from stable/14) into the newly-created root, and bootstraps a newer version of `pkg'. It then starts installing our master PKG, which installs various other PKGs (some from the PKG mirrors, some built from ports, and some proprietary) as dependencies.

It is during this stage -- installing dependencies -- that the wheels sometimes fall off the wagon. The install will be humming along, and then it just stops. The serial console becomes unresponsive, even to the serial debugger sequence ([return] [tilde] [ctrl-b]). The kernel still responds to pings, but because we don't have SSH up and running yet, we can't get much more than that.

We determined that if we use IPMI to trigger an NMI -- from another machine, run `ipmitool -U ${BMC_USER} -P ${BMC_PASSWORD} -H ${BMC_ADDR} chassis power diag' -- then the hung node is able to enter `ddb'. At that point, the serial console becomes responsive again. Running `ps' from the debugger shows that `pkg' is waiting on the console:

```
 2267  2266    19     0  S+      ttyout   0xfffff8012e8c68c0  pkg
 2266  2224    19     0  S+      wait     0xfffffe14dd714ac0  pkg
 2224  1707    19     0  S+      wait     0xfffffe1176466ae0  sh
```

Running `show msgbuf' shows that the last line in the buffer is for a PKG that gets installed later than the PKG which is mentioned in the last line of console output.
That is consistent with the idea that output from `pkg' is getting stuck in the output queue and not being emitted to the console.

At that point, it is possible to `continue' to exit `ddb', and the install resumes and runs to completion.

This suggests a problem with the serial console drivers. The console config is as follows:

'loader.conf':
```
comconsole_port="0x2f8"
comconsole_speed="115200"
boot_serial="YES"
console="efi"
```

'device.hints':
```
hint.uart.0.at="acpi"
hint.uart.0.port="0x3F8"
hint.uart.0.flags="0x00"
hint.uart.1.at="acpi"
hint.uart.1.port="0x2F8"
hint.uart.1.flags="0x90"
hint.uart.1.baud="115200"
```
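For anyone cross-checking the hints against the loader settings, here is a rough sketch (Python, not part of our installer) that parses the 'device.hints' lines above and reports which uart unit sits on the `comconsole_port' address and what its flags request. The bit meanings (0x10 = console, 0x80 = debug port) are my reading of uart(4) and should be verified against the man page for your release; the helper names are made up for illustration.

```python
# Flag bits as I understand them from uart(4); treat these as assumptions
# and verify against the man page for the release in use.
UART_FLAG_CONSOLE = 0x10  # device is a potential system console
UART_FLAG_DEBUG = 0x80    # device may be used as the kernel debug port

# The device.hints content quoted in this report.
HINTS = """\
hint.uart.0.at="acpi"
hint.uart.0.port="0x3F8"
hint.uart.0.flags="0x00"
hint.uart.1.at="acpi"
hint.uart.1.port="0x2F8"
hint.uart.1.flags="0x90"
hint.uart.1.baud="115200"
"""


def parse_hints(text):
    """Parse 'hint.uart.N.key="value"' lines into {unit: {key: value}}."""
    units = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("hint.uart."):
            continue
        name, _, value = line.partition("=")
        _, _, unit, key = name.split(".", 3)
        units.setdefault(int(unit), {})[key] = value.strip('"')
    return units


def console_units(units, console_port):
    """Return {unit: [roles]} for units whose I/O port equals console_port."""
    result = {}
    for unit, hints in sorted(units.items()):
        if int(hints.get("port", "0"), 16) != console_port:
            continue
        flags = int(hints.get("flags", "0"), 16)
        roles = []
        if flags & UART_FLAG_CONSOLE:
            roles.append("console")
        if flags & UART_FLAG_DEBUG:
            roles.append("debug")
        result[unit] = roles
    return result


if __name__ == "__main__":
    # 0x2f8 is the comconsole_port value from loader.conf above.
    print(console_units(parse_hints(HINTS), 0x2F8))
```

Under those assumptions, uart.1 is the only unit at 0x2f8 and its flags mark it as both console and debug port, which matches what we intended to configure.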
My colleagues have gathered some additional information:

1. When in the non-responsive state, (some?) kernel messages are still emitted to the serial console. For example, resetting the BMC causes the virtual USB keyboard and mouse to detach and re-attach, and the corresponding messages show up on the console.

2. When in the non-responsive state, the hung process is always waiting on the 't_outwait' condvar in tty.c, the condvar that is signalled to wake up callers waiting on the TTY.

3. The main paths for triggering that event are 'ttydisc_getc()' or 'ttydisc_getc_uio()' in tty_ttydisc.c, which are in turn called by 'uart_tty_outwakeup()' in uart_tty.c. 'uart_tty_outwakeup()' is responsible for draining the TTY buffer and sending the contents to the UART driver.

4. Instrumenting the install, we found that while 'ixon' -- software flow-control -- is enabled, so is 'ixany' -- release the software flow-control pause when any character is input.

Our investigation continues. In the meantime, while this was seen in the context of `bsdinstall', this looks like a bug in the UART and/or TTY drivers; 'kern', not 'bin'.
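The flow-control check in item 4 can be reproduced along these lines. This is only a sketch of the idea, not the instrumentation we actually used: it reads the input-mode flags with the standard termios interface and reports whether 'ixon', 'ixoff' and 'ixany' are set. The function names are hypothetical, and which file descriptor to inspect (for example the installer's stdout when it is attached to the serial console) is left to the caller.

```python
import sys
import termios


def flow_control_state(iflag):
    """Decode the software flow-control bits from a termios c_iflag value."""
    return {
        "ixon": bool(iflag & termios.IXON),    # honour XOFF/XON on output
        "ixoff": bool(iflag & termios.IXOFF),  # send XOFF/XON for input
        "ixany": bool(iflag & termios.IXANY),  # any input char restarts output
    }


def read_flow_control(fd):
    """Return the flow-control state of the terminal open on fd."""
    iflag = termios.tcgetattr(fd)[0]  # index 0 of the result is c_iflag
    return flow_control_state(iflag)


if __name__ == "__main__":
    # Assumes stdout is a terminal; tcgetattr raises termios.error otherwise.
    print(read_flow_control(sys.stdout.fileno()))
```

From a shell, `stty -a' shows the same flags for the current terminal.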
One of my colleagues commented out the aforementioned 'boot_serial="YES"', causing the system to use the video console in preference to the serial console. With that change, they were able to successfully boot from LAN and install over a dozen times. That is a clear indication that the problem is on the serial side of the equation.
When using the video console -- which is implemented in console_tty.c -- the debugger key-combination ([ctrl]-[alt]-[esc]) successfully enters the debugger. That is another data point suggesting the problem is on the serial side of things, probably in uart_tty.c.