Using a Thinkpad T400 with amd64. I finally upgraded my 14.1-RELEASE install to 14.2-RELEASE, after which it hung at boot. I thought that maybe my KMS driver was out of date and causing the issue, so I went to reboot in single user mode. A few seconds after launching my shell, it hung. It also hangs at the shell prompt (in single user mode) if I take "too long." 14.2-RELEASE install media behaves the same way. If I boot in multi-user mode, fsck is able to run without issue on 14.2-RELEASE, after which it hangs. I downgraded my install to 14.1-RELEASE and I'm able to boot. There are no errors printed of any kind. I think this is a regression. Where should I start? What else would be helpful for me to provide? I have hardware of a similar vintage (other Thinkpads and more) to test on. Thank you!
I tested on some different hardware, including a Thinkpad R400 (very similar) and was not able to reproduce the issue. Maybe someone else has some similar hardware that exhibits the same problem with 14.2, but not 14.1?
It looks like 14.2-RELEASE will boot with safe mode on my T400. The R400 that boots 14.2-RELEASE (without safe mode) has an Intel Core2Duo P8600 (or maybe P8700). My T400 has a Core2Duo T9400. At first glance these seem very similar, but maybe there's enough of a difference? There's a similar bug report about 14.1, but with a P8600: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=281888
While I'm not using the system, twice now it's locked up. Fans spinning, putting out lots of heat. Nothing in /var/log/messages. Assuming that is a different issue than requiring safe mode to boot.
Read your post at freebsd-stable ML. Quite a wild guess, but are you loading any kmods, especially GPU-related ones from non-base (ports/pkgs or built directly from upstream code), in your /boot/loader.conf? If yes, and if the kmod is NOT clearly essential for FreeBSD to boot (e.g., a disk controller driver, but that's unlikely for ThinkPads with Intel or AMD CPUs), move them to the kld_list variable in /etc/rc.conf[.local]. Note that GPU drivers are not at all essential to boot FreeBSD base itself. A few details: loading kmods via kld_list, autoloading via devd, and manual loading with kldload are all done under control of the kernel, and the kernel checks on load whether each kmod is compatible with itself. But loading via /boot/loader.conf is different. The loader loads the kernel and the kmods specified in /boot/loader.conf into a staging area it allocated before handing over to the kernel. The loader is not aware of the versions of the kernel and kmods, so incompatible kmods can be loaded, thus causing problems. And as the staging area is limited, anything not fitting in it is forcibly truncated, causing undefined behavior. Single user mode is the mode in which nothing in /etc and/or /usr/local/etc (network, fstab, ...) is initialized yet. Modules loaded via /boot/loader.conf affect it, too. IIUC, safe mode disables some options/functionalities that are known to cause problems in certain cases. So this could (not always, though) keep problematic kmods from being loaded, and thus allowed you to boot.
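As a minimal sketch of the suggestion above: /etc/rc.conf is read as plain sh variable assignments, so moving a module out of /boot/loader.conf amounts to dropping its `*_load="YES"` line there and naming the module in kld_list instead. The module name i915kms (from the drm-kmod port) is only an assumed example; substitute whichever non-essential module is actually involved.

```sh
# /etc/rc.conf (or /etc/rc.conf.local) fragment.
# Assumption: i915kms is just an example module name, not taken from the report.
# The matching i915kms_load="YES" line, if any, would be removed from /boot/loader.conf.
kld_list="i915kms"
```

Several modules can be listed in kld_list separated by spaces; they are loaded by the running kernel during rc startup rather than by the loader.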
(In reply to Tomoaki AOKI from comment #4) Thank you for replying! That's interesting about loader.conf vs rc.conf for loading modules. Very good to know. I load i915_kms in rc.conf. However, it's not the issue -- I commented that out when upgrading since I knew it wasn't going to work before compiling it for 14.2. The installer also hangs in the same manner. No ctrl+t, no panic, etc. I have to use safe mode on it as well, which rules out any other non-base kernel modules. I feel that the most interesting detail is that fsck can run and complete, then it will hang. And that even in single user mode, it will hang a few seconds after getting to the shell prompt (booting from installer or already installed). Single user mode (without safe mode) behaves identically between the installed system and the installer. Now I did figure out a bit more about the crash with sway. I have a configuration to turn off displays after ten minutes. The hang starts as soon as this is run: swaymsg "output * power off". I can reproduce it easily. So for now I just have it lock the screen and not turn the screen off. I feel like that deserves its own issue if there isn't another one for it already.
(In reply to Henrich Hartzer from comment #5) OK. So modules loaded via /boot/loader.conf shouldn't be the cause of the issue. kmods (actually) loaded via devd and/or kld_list cannot matter for single user mode. Safe mode sets the loader variables below:
kern.smp.disabled=1
hw.ata.ata_dma=0
hw.ata.atapi_dma=0
kern.eventtimer.periodic=1
kern.geom.part.check_integrity=0
The relevant code is the function core.setSafeMode(safe_mode) here: https://cgit.freebsd.org/src/tree/stand/lua/core.lua?h=releng/14.2#n173
The first one forces cores other than the primary to be disabled, and is unlikely to be involved. The last one skips integrity checks of the partition info, and as your fsck runs sanely, it is unlikely to be involved either. So the next thing to try would be flipping the remaining 3 one by one to determine which one (or all, or any combination of 2) is causing the hang. You can set them via /boot/loader.conf.
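One way to organize that one-at-a-time testing is sketched below. It only prints the commands to type at the loader prompt for a given trial; `set` and `boot` are loader commands, while the helper name trial_commands is hypothetical. All five tunables from the list above are included, since any assumption about which ones are "unlikely" is cheap to double-check.

```sh
#!/bin/sh
# The five tunables that safe mode sets, as listed in the comment above.
safe_mode_tunables='kern.smp.disabled=1
hw.ata.ata_dma=0
hw.ata.atapi_dma=0
kern.eventtimer.periodic=1
kern.geom.part.check_integrity=0'

# Print the loader-prompt commands for trial N (1-based): set exactly one
# tunable, then boot normally (not through the menu's safe mode entry).
trial_commands() {
    setting=$(printf '%s\n' "$safe_mode_tunables" | sed -n "${1}p")
    [ -n "$setting" ] || return 1
    printf 'set %s\nboot\n' "$setting"
}

trial_commands 1
```

Each trial that boots cleanly with a single tunable set points at that tunable; the same line can then be made persistent in /boot/loader.conf.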
(In reply to Tomoaki AOKI from comment #6) Thank you! I happened to try disabling SMP first, and that's definitely the one. I was able to boot without safe mode, just setting kern.smp.disabled=1.
I can report the same issue with 14.2-RELEASE, and also 14-STABLE on a T400. I've spent weeks trying to diagnose this so I'm pleased it's not just me! The "solution" is the same as Henrich reports. Booting with kern.smp.disabled=1 is successful and I'm able to login graphically. If I build 14-STABLE with some debugging options enabled, it will boot with SMP enabled maybe 2 out of 3 times, and otherwise it locks up during boot (usually just after bringing the network interfaces up). Once it boots (with or without SMP), everything seems stable, suspend/resume works and I can do intensive tasks like buildkernel. But of course things are slower due to the debugging options. Happy to try things out, unfortunately the T400 is not getting any younger and make buildkernel without SMP takes well over 1 hour, so things take a bit of time to test! Regards, Anthony
Options set as follows:
$ diff GENERIC T400
21c21
< ident GENERIC
---
> ident T400
95a96,106
> options DDB # Support DDB
> options GDB # Support remote GDB
> options DEADLKRES # Enable the deadlock resolver
> options INVARIANTS # Enable calls of extra sanity checking
> options INVARIANT_SUPPORT # Extra sanity checks of internal structures
> options WITNESS # Enable checks to detect deadlocks and cycles
> options WITNESS_SKIPSPIN # Don't run witness on spinlocks for speed
> options ALT_BREAK_TO_DEBUGGER # Enter debugger on keyboard escape sequence
> options VERBOSE_SYSINIT=0 # Support debug.verbose_sysinit, off by default
Not sure about the T400, but this smells like an ACPI problem in the UEFI firmware / BIOS. If I recall correctly, FreeBSD currently depends on ACPI to detect the CPU spec, including (S)MP topology. And maybe unlikely, but if your firmware / BIOS allows selecting the MPS (Multi Processor Spec.) version, changing it to something else (usually 1.1 or 1.4, but possibly something more) could resolve the issue (quite old experience on another OS, if I recall correctly). Lastly, if you haven't done so, what happens if you add acpi_ibm_load="YES" to your /boot/loader.conf and reboot? Any differences?
Hi, acpi_ibm_load="YES" is set in loader.conf and has been for several years - this install started off originally as 12.1-RELEASE back in 2019. I believe I did try disabling it during my testing, with no change in behaviour. I've tried a fresh live USB image as well, which has the same problem. I'll have a dig through the BIOS to check for anything relevant to MPS. It turns out there is also a BIOS update to apply. I'm currently running BIOS rev. 7UET92WW (3.22), the changelog for 3.23 is described as "microcode update", so I'll apply that and see what happens.
I'm able to also boot the 14.2 installer on a R500 with a P8600 processor, without needing to disable SMP. I've been having crashes during video playback that occur pretty readily (did not have these on 14.1.) I'm thinking about swapping my SSD over to see if it might be hardware/kernel related or not.
The crash I had on the T400 (with 14.2 and kern.smp.disabled=1, so I can boot) with swaymsg "output * power off" is reproducible on the R500 *if* I boot with kern.smp.disabled=1. Thus, I think disabling SMP seems to cause, one way or another, other issues, that aren't related to the problem of requiring SMP disabled in the first place. With SMP enabled on the R500, I did some video playback and had no crashes. I'm assuming that the display blanking crashes and video crashes are related and happen when SMP is disabled. I also found this bug report that sounds very similar, but about different hardware: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=284275 I opened https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=286018 for the kern.smp.disabled=1 crashes because I feel it is probably a separate issue.
Some more testing... I'm able to disable multi-core in the BIOS, which has identical behavior to disabling SMP in loader.conf, allowing me to boot on the T400. I've been using my Thinkpad R500 with 2GB of memory. I put 8GB in and while it's able to boot the installer, it hangs around the Intel wifi firmware loading, or elsewhere. So I tested the T400 (which has an up-to-date BIOS, by the way. The R500 does not.) with 4GB of memory. Same issue booting the installer. With 2GB, it boots the installer just fine (with SMP) and my SSD with 14.2 just fine. I'm typing on it now. So it appears to be related to 14.2, maybe these old Core2Duos, and 4GB of memory or more. Anthony, would you mind confirming how much memory your T400 has?
My T400 has 8GB memory (2x4GB). I've just tried it with 1x4GB, and it has the same problem as before. I don't think I have any smaller DDR3 SODIMMs to hand at the moment. However, if I set hw.physmem at the loader prompt, I get some interesting results! All with 8GB installed:
hw.physmem="2G" -- boots 14-STABLE with SMP
hw.physmem="4G" -- boots 14-STABLE with SMP
hw.physmem not set -- fails during boot
Henrich, perhaps you could confirm that you can alter the behaviour by setting hw.physmem as well (hopefully it saves you the trouble of moving physical memory around also!)
(In reply to Anthony Williams from comment #14) Very interesting! Thank you for the tip about setting hw.physmem. I did not know that could be done. I reinstalled the two 4GB dimms and I can confirm that I can boot the T500, with SMP, when I use hw.physmem="4G". I tried hw.physmem="6G" and it did not boot, unless SMP was disabled. I'm perplexed as to why hw.physmem="4G" will work, but one 4GB dimm will not. Either way, this is a much better workaround and hopefully gives useful debugging data.
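To make the hw.physmem workaround persistent rather than typing it at the loader prompt each boot, the line can go into /boot/loader.conf. The sketch below is a hypothetical helper (set_tunable is not a FreeBSD tool) that replaces or appends a tunable in a loader.conf-style file; it is shown against a scratch file, and the sample contents of that file are assumptions for illustration.

```sh
#!/bin/sh
# set_tunable FILE NAME VALUE
# Replace any existing NAME=... line in a loader.conf-style FILE with
# NAME="VALUE", or append it if absent. Hypothetical helper for illustration.
set_tunable() {
    file=$1 name=$2 value=$3
    tmp=$(mktemp) || return 1
    # Keep every line that does not start with "NAME=" (literal match, so the
    # dots in tunable names are not treated as regex wildcards).
    if [ -f "$file" ]; then
        awk -v prefix="$name=" 'index($0, prefix) != 1' "$file" > "$tmp"
    fi
    printf '%s="%s"\n' "$name" "$value" >> "$tmp"
    # Write back in place so the file keeps its owner and mode.
    cat "$tmp" > "$file" && rm -f "$tmp"
}

# Try it on a scratch copy first; the two starting lines are made-up sample content.
conf=$(mktemp)
printf 'acpi_ibm_load="YES"\nhw.physmem="6G"\n' > "$conf"
set_tunable "$conf" hw.physmem 4G
cat "$conf"
```

Once the result looks right, the same call could be pointed at /boot/loader.conf as root; the helper would work equally for kern.smp.disabled if that workaround is preferred.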
(In reply to Henrich Hartzer from comment #15) Can you capture and upload verbose boot output for folks to look at, say from a boot with kern.smp.disabled=1 or full safe mode? (dmesg -a output) Another option might be USB boot media that could later be examined from a working system ( /mnt/var/log/messages ). The boot would have to get far enough along to have updated the file. A serial console capture would be nice but may be unlikely. Part of what I wonder about is what such output reports about memory address ranges when you have 8 GiBytes of RAM, if such information can be captured. Folks with appropriate background might be able to note if there are any oddities not handled by FreeBSD. Another idea would be to substitute the UEFI loader from FreeBSD 14.1-RELEASE to see if the problem is FreeBSD-loader-vintage specific in some way rather than just FreeBSD-kernel-specific.
Created attachment 259527 [details] SMP disabled verbose dmesg output
(In reply to Mark Millard from comment #16) Thanks for taking a look at this! Here's SMP disabled output. Not sure if I could do a serial console or not. This system is BIOS-only, so it's not a UEFI issue.
(In reply to Henrich Hartzer from comment #18) Your log indicates that ACPI is in use:
. . .
ACPI: RSDP 0x00000000000F6440 000024 (v02 LENOVO)
ACPI: XSDT 0x00000000BD949B3A 00009C (v01 LENOVO TP-7U 00003240 LTP 00000000)
ACPI: FACP 0x00000000BD949C00 0000F4 (v03 LENOVO TP-7U 00003240 LNVO 00000001)
Firmware Warning (ACPI): 32/64X length mismatch in FADT/Pm1aControlBlock: 16/32 (20221020/tbfadt-748)
Firmware Warning (ACPI): Invalid length for FADT/Pm1aControlBlock: 32, using default 16 (20221020/tbfadt-850)
ACPI: DSDT 0x00000000BD94A00E 00FB01 (v01 LENOVO TP-7U 00003240 MSFT 03000000)
ACPI: FACS 0x00000000BD98E000 000040
ACPI: SSDT 0x00000000BD949DB4 00025A (v01 LENOVO TP-7U 00003240 MSFT 03000000)
ACPI: ECDT 0x00000000BD959B0F 000052 (v01 LENOVO TP-7U 00003240 LNVO 00000001)
ACPI: APIC 0x00000000BD959B61 000078 (v01 LENOVO TP-7U 00003240 LNVO 00000001)
ACPI: MCFG 0x00000000BD959BD9 00003C (v01 LENOVO TP-7U 00003240 LNVO 00000001)
ACPI: HPET 0x00000000BD959C15 000038 (v01 LENOVO TP-7U 00003240 LNVO 00000001)
ACPI: SLIC 0x00000000BD959DC2 000176 (v01 LENOVO TP-7U 00003240 LTP 00000000)
ACPI: BOOT 0x00000000BD959F38 000028 (v01 LENOVO TP-7U 00003240 LTP 00000001)
ACPI: ASF! 0x00000000BD959F60 0000A0 (v16 LENOVO TP-7U 00003240 PTL 00000001)
ACPI: SSDT 0x00000000BD98D1FA 000568 (v01 LENOVO TP-7U 00003240 INTL 20050513)
ACPI: TCPA 0x00000000BD707000 000032 (v00 00000000 00000000)
ACPI: DMAR 0x00000000BD706000 000120 (v01 ? 00000001 00000000)
ACPI: SSDT 0x00000000BD6D3000 000655 (v01 PmRef CpuPm 00003000 INTL 20050624)
ACPI: SSDT 0x00000000BD6D2000 000274 (v01 PmRef Cpu0Tst 00003000 INTL 20050624)
ACPI: SSDT 0x00000000BD6D1000 000242 (v01 PmRef ApTst 00003000 INTL 20050624)
MADT: Found IO APIC ID 1, Interrupt 0 at 0xfec00000
ioapic0: MADT APIC ID 1 != hw id 2
ioapic0: ver 0x20 maxredir 0x17
. . .
(In reply to Mark Millard from comment #19) I missed a few even-earlier lines that mention ACPI:
MADT: Found CPU APIC ID 0 ACPI ID 0: enabled
SMP: Added CPU 0 (AP)
MADT: Found CPU APIC ID 1 ACPI ID 1: enabled
SMP: Added CPU 1 (AP)
MADT: Found CPU APIC ID 2 ACPI ID 2: disabled
MADT: Found CPU APIC ID 3 ACPI ID 3: disabled
Event timer "LAPIC" quality 100
LAPIC: ipi_wait() us multiplier 61 (r 41211 tsc 2527059940)
ACPI APIC Table: <LENOVO TP-7U >
APIC: CPU 0 has ACPI ID 0
Note also that this indicates 2 CPUs as enabled and 2 as not enabled as far as ACPI is concerned, whatever the details of the style of boot.
(In reply to Mark Millard from comment #20) If you want to avoid ACPI use, you might want to change what kenv reports as: hint.acpi.0.disabled="0" to "1" instead, such as via the loader prompt.
(In reply to Mark Millard from comment #21) The reference to the loader prompt was a silly mistake: ACPI vs. not is already established by then (it controlled which loader). Instead, likely use the line: hint.acpi.0.disabled="1" in: /boot/device.hints
Created attachment 259561 [details] 14.1-RELEASE verbose boot, SMP enabled
Created attachment 259562 [details] 14.2-RELEASE verbose boot, SMP enabled
Created attachment 259563 [details] 14.2-RELEASE verbose boot, SMP disabled
Created attachment 259564 [details] 14.1-RELEASE verbose boot, SMP disabled
Fortunately the T400 is old enough to have a serial port on the docking station. I've attached serial verbose boot outputs from memstick images of 14.1-RELEASE and 14.2-RELEASE, with SMP enabled and disabled. In both 14.1-R and 14.2-R, booting with ACPI disabled causes an instant panic:
> APIC: Could not find any APICs.
> panic: running without device atpic requires a local APIC
The most significant memory-related difference appears to be (diff the relevant attachments for context):
14.1-RELEASE:
0x0000000002422000 - 0x00000000bd4a0fff, 3137859584 bytes (766079 pages)
avail memory = 8167993344 (7789 MB)
x86bios: SSEG 0x09c000-0x09cfff at 0xfffffe00103dc000
14.2-RELEASE:
0x0000000002401000 - 0x00000000bd4a0fff, 3137994752 bytes (766112 pages)
avail memory = 8168169472 (7789 MB)
x86bios: SSEG 0x09c000-0x09cfff at 0xfffffe00103d4000
So 14.2-RELEASE is seeing 0x21000 more bytes / 33 pages in the above memory chunk, and has the SSEG (?) BIOS region mapped(?) 0x8000 bytes lower down. Sorry for any imprecision here, I've not had to look at memory mapping of a PC BIOS much before! Anthony
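The figures quoted from the two boot logs can be cross-checked with plain sh arithmetic. This is only a sketch that re-derives the differences from the numbers above; the variable names are made up, and the 4096-byte page size is the amd64 base page size.

```sh
#!/bin/sh
# Sizes of the large memory chunk, copied from the two boot logs quoted above.
bytes_141=3137859584
bytes_142=3137994752

extra_bytes=$((bytes_142 - bytes_141))
extra_pages=$((extra_bytes / 4096))      # amd64 base page size is 4 KiB
start_shift=$((0x2422000 - 0x2401000))   # chunk start address: 14.1 minus 14.2
# The two SSEG addresses share their upper bits, so compare only the low 32 bits.
sseg_shift=$((0x103dc000 - 0x103d4000))

printf 'extra bytes: 0x%x (%d pages)\n' "$extra_bytes" "$extra_pages"
printf 'chunk start moved down by: 0x%x\n' "$start_shift"
printf 'SSEG mapping moved down by: 0x%x\n' "$sseg_shift"
```

The extra 0x21000 bytes match the amount by which the chunk's start address moved down, so the chunk grew at its low end while its end address stayed the same.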
(In reply to Anthony Williams from comment #27) Looking at the 14.1R pair: (- lines are for smp-disabled only) (+ lines are for smp-enabled only)
# diff -u ~/T400-14.1R-verbose-boot-smp-*.txt | grep -i CPU
CPU: Intel(R) Core(TM)2 Duo CPU P8400 @ 2.26GHz (2261.12-MHz K8-class CPU)
MADT: Found CPU APIC ID 0 ACPI ID 0: enabled
SMP: Added CPU 0 (AP)
MADT: Found CPU APIC ID 2 ACPI ID 2: disabled
MADT: Found CPU APIC ID 3 ACPI ID 3: disabled
+FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
+ CPU0 (BSP): APIC ID: 0
+ CPU1 (AP): APIC ID: 1
APIC: CPU 0 has ACPI ID 0
+APIC: CPU 1 has ACPI ID 1
ULE: setup cpu 0
+ULE: setup cpu 1
+SMP: AP CPU #1 Launched!
+cpu1 AP:
+CPU1: local APIC error 0x40
CPU0: local APIC error 0x40
cpu0: <ACPI CPU> on acpi0
-ACPI: SSDT 0xFFFFF800038B1400 0002C8 (v01 PmRef Cpu0Ist 00003000 INTL 20050624)
+ACPI: SSDT 0xFFFFF80003832C00 0002C8 (v01 PmRef Cpu0Ist 00003000 INTL 20050624)
-ACPI: SSDT 0xFFFFF8000373F000 00087A (v01 PmRef Cpu0Cst 00003001 INTL 20050624)
+ACPI: SSDT 0xFFFFF8000382F000 00087A (v01 PmRef Cpu0Cst 00003001 INTL 20050624)
+cpu1: <ACPI CPU> on acpi0
+ACPI: Processor \_PR_.CPU2 (ACPI ID 2) ignored
+ACPI: Processor \_PR_.CPU3 (ACPI ID 3) ignored
est0: <Enhanced SpeedStep Frequency Control> on cpu0
+est1: <Enhanced SpeedStep Frequency Control> on cpu1
+ACPI: Processor \_PR_.CPU2 (ACPI ID 2) ignored
+ACPI: Processor \_PR_.CPU3 (ACPI ID 3) ignored
+ACPI: Processor \_PR_.CPU2 (ACPI ID 2) ignored
+ACPI: Processor \_PR_.CPU3 (ACPI ID 3) ignored
+cpu_reset: Stopping other CPUs
CPU2 and CPU3 are never enabled. smp-enabled does have CPU1 enabled, not just CPU0. How many FreeBSD cpus does top show for 14.1R with smp-enabled? 2? 4? (Cross checks on the above.) cpu1 is likely the hyper-threading on the same core as cpu0?
(In reply to Mark Millard from comment #28) Hi Mark, I haven't had a chance to try booting with ACPI disabled. I just wanted to chime in about cores 3 & 4. I suspect they're stubs in case a 2 core + 2 thread processor were installed, which might be a possible hardware configuration. Neither my processors nor Anthony's have SMT. They just have two cores. This may be an incorrect ACPI reading, although it appears to be how they've been for years. Intel's website says that the P8400 is a dual core without SMT.
(In reply to Henrich Hartzer from comment #29) FYI: I expect that this is the first time that I've run into an actual example of ACPI without UEFI. This explains my potential references to UEFI that do not apply. Part of the issue is that ACPI is under the control of the UEFI folks these days. That does not implicitly suggest ACPI without UEFI involved.
(In reply to Mark Millard from comment #28) top on 14.1 says 2 CPUs. Another piece of the puzzle, however! Booting up the 14.1-R memstick to gather the top data, I found it hung on boot... From a sample of several reboots, I found that with SMP enabled, and selecting the various options from the loader menu:
* No serial console, no verbose = hang
* Serial console, no verbose = boots
* No serial console, verbose = boots
* Serial console, verbose = boots
So there seems to be a variant of this issue in 14.1-RELEASE as well. Could it be that either the verbose or the serial console boot option slows the boot process down enough to escape whatever is causing the lockup? I now recall something like this happening when I upgraded from 14.0-RELEASE to 14.1-RELEASE, but I thought it was i915 related. I set (and kept) verbose boot in loader.conf when diagnosing it back then, and that seems to have been sufficient to boot 14.1-RELEASE reliably, but it's not "good enough" for 14.2.