Bug 251483

Summary: after upgrading from 12.1 to 12.2 the mfi driver crashes during boot
Product: Base System Reporter: cstamas+fbsdbz
Component: kernAssignee: Justin Hibbits <jhibbits>
Status: Closed FIXED    
Severity: Affects Some People CC: afedorov, ambrisko, bz, eugen, jhibbits, lwhsu, mizhka, number201724, pfduch, zarychtam
Priority: ---    
Version: 12.2-RELEASE   
Hardware: amd64   
OS: Any   

Description cstamas+fbsdbz 2020-11-30 15:01:16 UTC
Hi,

I have a Dell R740 server with PERC H330 Mini card in HBA mode.
The install was done using 12.1 and today I upgraded to 12.2 with freebsd-update.

The kernel crashes now during boot. (I attached the captured serial logs below.)

Following the recommendation from the friendly folks on #freebsdhelp I was able to boot up
the system by setting hw.mfi.mrsas_enable="1" in the bootloader. This is not ideal at the moment
as ZFS complains about difference in configured vs. native blocksizes.

ZFS filesystem version: 5
ZFS storage pool version: features support (5000)
Timecounters tick every 1.000 msec
lo0: bpf attached
vlan: initialized, using hash tables with chaining
tcp_init: net.inet.tcp.tcbhashsize auto tuned to 1048576
IPsec: Initialized Security Association Processing.
hptnr: no controller detected.
hptrr: no controller detected.
hpt27xx: no controller detected.
ahcich0: AHCI reset...
ugen0.1: <0x8086 XHCI root HUB> at usbus0
mfisyspd0 numa-domain 0 on mfi0
mfisyspd0: 7630885MB (15628053168 sectors) SYSPD volume (deviceid: 0)
mfisyspd0:  SYSPD volume attached
GEOM: new disk mfisyspd0
ahcich0: SATA connect timeout time=10000us status=00000000
ahcich0: AHCI reset: device not found
ahcich1: AHCI reset...
uhub0: <0x8086 XHCI root HUB, class 9/0, rev 3.00/1.00, addr 1> on usbus0
ahcich1: SATA connect timeout time=10000us status=00000000
ahcich1: AHCI reset: device not found
mfisyspd1 numa-domain 0 on mfi0
ahcich2: AHCI reset...
mfi0: DJA NA XXX SYSPDIO


Fatal trap 12: page fault while in kernel mode
cpuid = 14; apic id = 14
fault virtual address	= 0x0
fault code		= supervisor read data, page not present
instruction pointer	= 0x20:0xffffffff806d040b
stack pointer	        = 0x28:0xfffffe00faad61c0
frame pointer	        = 0x28:0xfffffe00faad61f0
code segment		= base rx0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 13 (g_down)
trap number		= 12
panic: page fault
cpuid = 14
time = 1
KDB: stack backtrace:
#0 0xffffffff80c0a8f5 at kdb_backtrace+0x65
#1 0xffffffff80bbeb1b at vpanic+0x17b
#2 0xffffffff80bbe993 at panic+0x43
#3 0xffffffff8108f911 at trap_fatal+0x391
#4 0xffffffff8108f96f at trap_pfault+0x4f
#5 0xffffffff8108efb6 at trap+0x286
#6 0xffffffff81066f38 at calltrap+0x8
#7 0xffffffff806c92b8 at mfi_send_frame+0x28
#8 0xffffffff806c91a9 at mfi_data_cb+0x359
#9 0xffffffff80c00382 at bus_dmamap_load_bio+0xb2
#10 0xffffffff806c8cce at mfi_mapcmd+0xae
#11 0xffffffff806c802b at mfi_startio+0xdb
#12 0xffffffff806ce3d9 at mfi_syspd_strategy+0x99
#13 0xffffffff80b07d85 at g_disk_start+0x325
#14 0xffffffff80b0bd49 at g_io_schedule_down+0x169
#15 0xffffffff80b0c57c at g_down_procbody+0x6c
#16 0xffffffff80b8044e at fork_exit+0x7e
#17 0xffffffff81067f6e at fork_trampoline+0xe
Uptime: 1s

these kernels work fine:

Jul 27 12:55:31 backup20vh kernel: FreeBSD 12.1-RELEASE r354233 GENERIC amd64
Sep  2 15:56:28 backup20vh kernel: FreeBSD 12.1-RELEASE-p8 GENERIC amd64
Nov 30 14:49:50 backup20vh kernel: FreeBSD 12.1-RELEASE-p10 GENERIC amd64
Nov 30 15:03:10 backup20vh kernel: FreeBSD 12.1-RELEASE-p10 GENERIC amd64

this is the one with the issue:

Nov 30 14:29:59 backup20vh kernel: FreeBSD 12.2-RELEASE r366954 GENERIC amd64

Regards,
 cstamas
Comment 1 Michael Zhilin freebsd_committer freebsd_triage 2020-12-07 08:00:07 UTC
Hi,

It looks like commit 358689 introduced regression:

https://svnweb.freebsd.org/base/head/sys/dev/mfi/mfi_tbolt.c?revision=358689&view=markup

Added jhibbit@ into ticket. 

Thx!
Comment 2 Justin Hibbits freebsd_committer freebsd_triage 2020-12-07 15:48:48 UTC
Hm, it looks really strange.  I can revert it, but Clang will complain, as it would be a tautology... 'if (cdb[0] != 0x28 || cdb[0] != 0x2A)' is *always* true, because if it is one, it's not the other.
Comment 3 Justin Hibbits freebsd_committer freebsd_triage 2020-12-07 15:51:24 UTC
Adding ambrisko@ for his thoughts, due to PR 187831.
Comment 4 cstamas+fbsdbz 2020-12-07 20:38:02 UTC
I reverted the change and the kernel boots up properly now:

root@backup20vh:/usr/src/sys # svnlite log | head -2
------------------------------------------------------------------------
r368402 | kib | 2020-12-07 01:05:39 +0000 (Mon, 07 Dec 2020) | 3 lines
root@backup20vh:/usr/src/sys # svnlite status
M       dev/mfi/mfi_tbolt.c
root@backup20vh:/usr/src/sys # svnlite diff
Index: dev/mfi/mfi_tbolt.c
===================================================================
--- dev/mfi/mfi_tbolt.c	(revision 368414)
+++ dev/mfi/mfi_tbolt.c	(working copy)
@@ -1109,7 +1109,7 @@
 
 	if (hdr->cmd == MFI_CMD_PD_SCSI_IO) {
 		/* check for inquiry commands coming from CLI */
-		if (cdb[0] != 0x28 && cdb[0] != 0x2A) {
+		if (cdb[0] != 0x28 || cdb[0] != 0x2A) {
 			if ((req_desc = mfi_tbolt_build_mpt_cmd(sc, cm)) ==
 			    NULL) {
 				device_printf(sc->mfi_dev, "Mapping from MFI "
root@backup20vh:/usr/src/sys # uname -a
FreeBSD backup20vh.removed-domain 12.2-STABLE FreeBSD 12.2-STABLE #1 r368414M: Mon Dec  7 17:35:28 UTC 2020     cstamas@backup20vh.removed-domain:/usr/obj/usr/src/amd64.amd64/sys/GENERIC  amd64

I even ran zpool scrub and all seems fine.
Comment 5 Andriy Gapon freebsd_committer freebsd_triage 2020-12-08 14:49:48 UTC
(In reply to Justin Hibbits from comment #2)
At the same time, it's strange that the comment talks about "inquiry commands" while 0x28 and 0x2A are READ_10 and WRITE_10 commands respectively.
There could be a bug-cancels-bug situation there.
Comment 6 Doug Ambrisko freebsd_committer freebsd_triage 2020-12-08 16:28:58 UTC
I never figured out what this code was dealing with.  It came from LSI/Avago/Broadcom with their initial support of Thunderbolt family of RAID cards.  This was before they decided to make mrsas.  That is why I put in the printf.  I assume that the CLI was MegaCLI.  Note that the MFI_CMD_PD_SCSI_IO is for commands going to the raw disk versus to the RAID volume.  This should
mostly be handled as a "SYSPDIO" which they called system physical disk IO, which is a talking straight to the disk.  That is handled the "mfisyspd" of the driver.  it's kind of strange mode.

The real fix is probably to remove the
    cdb[0] != 0x28 && cdb[0] != 0x2A

and always run mfi_tbolt_build_mpt_cmd and not fall through to "DJA NA XXX SYSPDIO".  I never found out why we would need to do that check.  I put that message there so if it happened we can try to figure out why.  Since I've never seen that happen we can probably remove it.
Comment 7 commit-hook freebsd_committer freebsd_triage 2020-12-09 02:07:10 UTC
A commit references this bug:

Author: jhibbits
Date: Wed Dec  9 02:07:02 UTC 2020
New revision: 368473
URL: https://svnweb.freebsd.org/changeset/base/368473

Log:
  dev/mfi: Make a seemingly bogus conditional unconditional

  Summary:
  r358689 attempted to fix a clang warning/error by inferring the intent
  of the condition "(cdb[0] != 0x28 || cdb[0] != 0x2A)".  Unfortunately, it looks
  like this broke things.  Instead, fix this by making this path unconditional,
  effectively reverting to the previous state.

  PR:		kern/251483
  Reviewed By:	ambrisko
  MFC after:	2 days
  Differential Revision: https://reviews.freebsd.org/D27515

Changes:
  head/sys/dev/mfi/mfi_tbolt.c
Comment 8 Alexander Pravkin 2020-12-09 06:52:40 UTC
Could you please add this commit into 12.2-p2?
Comment 9 commit-hook freebsd_committer freebsd_triage 2020-12-13 22:42:57 UTC
A commit references this bug:

Author: jhibbits
Date: Sun Dec 13 22:42:48 UTC 2020
New revision: 368620
URL: https://svnweb.freebsd.org/changeset/base/368620

Log:
  MFC r3684733:

  dev/mfi: Make a seemingly bogus conditional unconditional

  Summary:
  r358689 attempted to fix a clang warning/error by inferring the intent
  of the condition "(cdb[0] != 0x28 || cdb[0] != 0x2A)".  Unfortunately, it looks
  like this broke things.  Instead, fix this by making this path unconditional,
  effectively reverting to the previous state.

  PR:             kern/251483

Changes:
_U  stable/12/
  stable/12/sys/dev/mfi/mfi_tbolt.c
Comment 10 Bjoern A. Zeeb freebsd_committer freebsd_triage 2021-01-15 15:35:41 UTC
Is this an EN candidate?  I just got hit by it badly on a 16 spindle machine as well after doing a freebsd-update to 12.2.
Comment 11 Justin Hibbits freebsd_committer freebsd_triage 2021-01-15 17:47:54 UTC
(In reply to Bjoern A. Zeeb from comment #10)

Yeah, it's a good candidate for EN.
Comment 12 Eugene Grosbein freebsd_committer freebsd_triage 2021-01-29 03:50:05 UTC
Just for the record: I was hit with this panic too, in similar conditions trying to install 12.2-RELEASE on new hardware. I worked this around switching to mrsas(4) that was my intention anyway.

Also note that for ZFS it may be preferable configuring HDD/SSD/NVMe as "Non-RAID" media (as opposed to JBOD/RAID0 per drive) in the controller's SETUP and using mrsas(4). Such configuration gives transparent S.M.A.R.T. support for smartmontools, too.
Comment 13 YUAN RUI 2021-03-10 10:36:04 UTC
I encountered the same problem on DELL PowerEdge T340. The solution is to use mrsas and D248727
Comment 14 Mark Linimon freebsd_committer freebsd_triage 2024-09-30 05:15:07 UTC
^Triage: apparently fixed back in 2020.  The comment about using mrsas may still be correct, however.