Hi, I have a Dell R740 server with PERC H330 Mini card in HBA mode. The install was done using 12.1 and today I upgraded to 12.2 with freebsd-update. The kernel crashes now during boot. (I attached the captured serial logs below.) Following the recommendation from the friendly folks on #freebsdhelp I was able to boot up the system by setting hw.mfi.mrsas_enable="1" in the bootloader. This is not ideal at the moment as ZFS complains about difference in configured vs. native blocksizes. ZFS filesystem version: 5 ZFS storage pool version: features support (5000) Timecounters tick every 1.000 msec lo0: bpf attached vlan: initialized, using hash tables with chaining tcp_init: net.inet.tcp.tcbhashsize auto tuned to 1048576 IPsec: Initialized Security Association Processing. hptnr: no controller detected. hptrr: no controller detected. hpt27xx: no controller detected. ahcich0: AHCI reset... ugen0.1: <0x8086 XHCI root HUB> at usbus0 mfisyspd0 numa-domain 0 on mfi0 mfisyspd0: 7630885MB (15628053168 sectors) SYSPD volume (deviceid: 0) mfisyspd0: SYSPD volume attached GEOM: new disk mfisyspd0 ahcich0: SATA connect timeout time=10000us status=00000000 ahcich0: AHCI reset: device not found ahcich1: AHCI reset... uhub0: <0x8086 XHCI root HUB, class 9/0, rev 3.00/1.00, addr 1> on usbus0 ahcich1: SATA connect timeout time=10000us status=00000000 ahcich1: AHCI reset: device not found mfisyspd1 numa-domain 0 on mfi0 ahcich2: AHCI reset... mfi0: DJA NA XXX SYSPDIO Fatal trap 12: page fault while in kernel mode cpuid = 14; apic id = 14 fault virtual address = 0x0 fault code = supervisor read data, page not present instruction pointer = 0x20:0xffffffff806d040b stack pointer = 0x28:0xfffffe00faad61c0 frame pointer = 0x28:0xfffffe00faad61f0 code segment = base rx0, limit 0xfffff, type 0x1b = DPL 0, pres 1, long 1, def32 0, gran 1 processor eflags = interrupt enabled, resume, IOPL = 0 current process = 13 (g_down) trap number = 12 panic: page fault cpuid = 14 time = 1 KDB: stack backtrace: #0 0xffffffff80c0a8f5 at kdb_backtrace+0x65 #1 0xffffffff80bbeb1b at vpanic+0x17b #2 0xffffffff80bbe993 at panic+0x43 #3 0xffffffff8108f911 at trap_fatal+0x391 #4 0xffffffff8108f96f at trap_pfault+0x4f #5 0xffffffff8108efb6 at trap+0x286 #6 0xffffffff81066f38 at calltrap+0x8 #7 0xffffffff806c92b8 at mfi_send_frame+0x28 #8 0xffffffff806c91a9 at mfi_data_cb+0x359 #9 0xffffffff80c00382 at bus_dmamap_load_bio+0xb2 #10 0xffffffff806c8cce at mfi_mapcmd+0xae #11 0xffffffff806c802b at mfi_startio+0xdb #12 0xffffffff806ce3d9 at mfi_syspd_strategy+0x99 #13 0xffffffff80b07d85 at g_disk_start+0x325 #14 0xffffffff80b0bd49 at g_io_schedule_down+0x169 #15 0xffffffff80b0c57c at g_down_procbody+0x6c #16 0xffffffff80b8044e at fork_exit+0x7e #17 0xffffffff81067f6e at fork_trampoline+0xe Uptime: 1s these kernels work fine: Jul 27 12:55:31 backup20vh kernel: FreeBSD 12.1-RELEASE r354233 GENERIC amd64 Sep 2 15:56:28 backup20vh kernel: FreeBSD 12.1-RELEASE-p8 GENERIC amd64 Nov 30 14:49:50 backup20vh kernel: FreeBSD 12.1-RELEASE-p10 GENERIC amd64 Nov 30 15:03:10 backup20vh kernel: FreeBSD 12.1-RELEASE-p10 GENERIC amd64 this is the one with the issue: Nov 30 14:29:59 backup20vh kernel: FreeBSD 12.2-RELEASE r366954 GENERIC amd64 Regards, cstamas
Hi, It looks like commit 358689 introduced regression: https://svnweb.freebsd.org/base/head/sys/dev/mfi/mfi_tbolt.c?revision=358689&view=markup Added jhibbit@ into ticket. Thx!
Hm, it looks really strange. I can revert it, but Clang will complain, as it would be a tautology... 'if (cdb[0] != 0x28 || cdb[0] != 0x2A)' is *always* true, because if it is one, it's not the other.
Adding ambrisko@ for his thoughts, due to PR 187831.
I reverted the change and the kernel boots up properly now: root@backup20vh:/usr/src/sys # svnlite log | head -2 ------------------------------------------------------------------------ r368402 | kib | 2020-12-07 01:05:39 +0000 (Mon, 07 Dec 2020) | 3 lines root@backup20vh:/usr/src/sys # svnlite status M dev/mfi/mfi_tbolt.c root@backup20vh:/usr/src/sys # svnlite diff Index: dev/mfi/mfi_tbolt.c =================================================================== --- dev/mfi/mfi_tbolt.c (revision 368414) +++ dev/mfi/mfi_tbolt.c (working copy) @@ -1109,7 +1109,7 @@ if (hdr->cmd == MFI_CMD_PD_SCSI_IO) { /* check for inquiry commands coming from CLI */ - if (cdb[0] != 0x28 && cdb[0] != 0x2A) { + if (cdb[0] != 0x28 || cdb[0] != 0x2A) { if ((req_desc = mfi_tbolt_build_mpt_cmd(sc, cm)) == NULL) { device_printf(sc->mfi_dev, "Mapping from MFI " root@backup20vh:/usr/src/sys # uname -a FreeBSD backup20vh.removed-domain 12.2-STABLE FreeBSD 12.2-STABLE #1 r368414M: Mon Dec 7 17:35:28 UTC 2020 cstamas@backup20vh.removed-domain:/usr/obj/usr/src/amd64.amd64/sys/GENERIC amd64 I even ran zpool scrub and all seems fine.
(In reply to Justin Hibbits from comment #2) At the same time, it's strange that the comment talks about "inquiry commands" while 0x28 and 0x2A are READ_10 and WRITE_10 commands respectively. There could be a bug-cancels-bug situation there.
I never figured out what this code was dealing with. It came from LSI/Avago/Broadcom with their initial support of Thunderbolt family of RAID cards. This was before they decided to make mrsas. That is why I put in the printf. I assume that the CLI was MegaCLI. Note that the MFI_CMD_PD_SCSI_IO is for commands going to the raw disk versus to the RAID volume. This should mostly be handled as a "SYSPDIO" which they called system physical disk IO, which is a talking straight to the disk. That is handled the "mfisyspd" of the driver. it's kind of strange mode. The real fix is probably to remove the cdb[0] != 0x28 && cdb[0] != 0x2A and always run mfi_tbolt_build_mpt_cmd and not fall through to "DJA NA XXX SYSPDIO". I never found out why we would need to do that check. I put that message there so if it happened we can try to figure out why. Since I've never seen that happen we can probably remove it.
A commit references this bug: Author: jhibbits Date: Wed Dec 9 02:07:02 UTC 2020 New revision: 368473 URL: https://svnweb.freebsd.org/changeset/base/368473 Log: dev/mfi: Make a seemingly bogus conditional unconditional Summary: r358689 attempted to fix a clang warning/error by inferring the intent of the condition "(cdb[0] != 0x28 || cdb[0] != 0x2A)". Unfortunately, it looks like this broke things. Instead, fix this by making this path unconditional, effectively reverting to the previous state. PR: kern/251483 Reviewed By: ambrisko MFC after: 2 days Differential Revision: https://reviews.freebsd.org/D27515 Changes: head/sys/dev/mfi/mfi_tbolt.c
Could you please add this commit into 12.2-p2?
A commit references this bug: Author: jhibbits Date: Sun Dec 13 22:42:48 UTC 2020 New revision: 368620 URL: https://svnweb.freebsd.org/changeset/base/368620 Log: MFC r3684733: dev/mfi: Make a seemingly bogus conditional unconditional Summary: r358689 attempted to fix a clang warning/error by inferring the intent of the condition "(cdb[0] != 0x28 || cdb[0] != 0x2A)". Unfortunately, it looks like this broke things. Instead, fix this by making this path unconditional, effectively reverting to the previous state. PR: kern/251483 Changes: _U stable/12/ stable/12/sys/dev/mfi/mfi_tbolt.c
Is this an EN candidate? I just got hit by it badly on a 16 spindle machine as well after doing a freebsd-update to 12.2.
(In reply to Bjoern A. Zeeb from comment #10) Yeah, it's a good candidate for EN.
Just for the record: I was hit with this panic too, in similar conditions trying to install 12.2-RELEASE on new hardware. I worked this around switching to mrsas(4) that was my intention anyway. Also note that for ZFS it may be preferable configuring HDD/SSD/NVMe as "Non-RAID" media (as opposed to JBOD/RAID0 per drive) in the controller's SETUP and using mrsas(4). Such configuration gives transparent S.M.A.R.T. support for smartmontools, too.
I encountered the same problem on DELL PowerEdge T340. The solution is to use mrsas and D248727
^Triage: apparently fixed back in 2020. The comment about using mrsas may still be correct, however.