Bug 266419 - mrsas: Corrupts memory (crashes) when reading data from NVMe disk attached to PERC H755N controller
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: CURRENT
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: freebsd-bugs (Nobody)
URL:
Keywords: crash, needs-qa
Depends on:
Blocks:
 
Reported: 2022-09-15 00:44 UTC by Rebecca Cran
Modified: 2022-09-19 15:20 UTC
CC List: 2 users

See Also:
koobs: maintainer-feedback? (kadesai)
koobs: maintainer-feedback? (imp)
koobs: mfc-stable13?
koobs: mfc-stable12?


Attachments
Dell PowerEdge R7525 FreeBSD 14-CURRENT dmesg (19.27 KB, text/plain)
2022-09-15 00:44 UTC, Rebecca Cran
no flags
core dump text information from mrsas crash (127.32 KB, text/plain)
2022-09-15 01:45 UTC, Rebecca Cran
no flags

Description Rebecca Cran freebsd_committer freebsd_triage 2022-09-15 00:44:40 UTC
Created attachment 236559
Dell PowerEdge R7525 FreeBSD 14-CURRENT dmesg

On my Dell PowerEdge R7525 with a PERC H755N controller with 5 attached NVMe disks, reading from /dev/da0 causes a panic.

I'm running:
  FreeBSD 14.0-CURRENT #0 main-n257957-975407b1d8d-dirty: Wed Sep 14 17:16:44 MDT 2022

FreeBSD sees the controller as:
  AVAGO MegaRAID SAS FreeBSD mrsas driver version: 07.709.04.00-fbsd
  mrsas0: <BROADCOM AERO-10E2 SAS Controller> port 0x2000-0x20ff mem 

The following command triggers the panic:

dd if=/dev/da0 of=/dev/null bs=1M

Sometimes it only causes various processes to segfault (e.g. fsck_msdosfs, logger, csh); other times it causes a panic such as "Memory modified after free" or "_sleep: curthread not running".
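
The dd invocation above can be approximated in a few lines of Python, which may make it easier to vary the read size while narrowing down the trigger. This is only a sketch: the function name `read_sequential` and its parameters are hypothetical and not part of the original report, and it only mimics dd's read loop (the data is discarded, as with of=/dev/null).

```python
import time


def read_sequential(path, block_size=1 << 20, max_bytes=None):
    """Read `path` front to back in `block_size` chunks and discard the data.

    Returns (bytes_read, seconds_elapsed). On a raw disk device, keep
    `block_size` and `max_bytes` multiples of the sector size.
    """
    buf = bytearray(block_size)
    view = memoryview(buf)
    total = 0
    start = time.monotonic()
    # buffering=0 gives a raw file object, so each readinto() is one read(2)
    with open(path, "rb", buffering=0) as f:
        while max_bytes is None or total < max_bytes:
            want = block_size
            if max_bytes is not None:
                want = min(block_size, max_bytes - total)
            n = f.readinto(view[:want])
            if not n:  # 0 means end of file or end of device
                break
            total += n
    return total, time.monotonic() - start
```

Calling `read_sequential("/dev/da0")` corresponds to the bs=1M run; passing a smaller `block_size` issues smaller reads, and `max_bytes` bounds the run so it does not have to be interrupted.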

I've attached the full dmesg, and will also attach the kernel dump information when it finishes.
Comment 1 Kubilay Kocak freebsd_committer freebsd_triage 2022-09-15 00:56:28 UTC
Are you able to test 12/13 stable bootonly snapshots to reproduce in those versions and potentially isolate a regression window?
Comment 2 Rebecca Cran freebsd_committer freebsd_triage 2022-09-15 01:45:27 UTC
Created attachment 236560
core dump text information from mrsas crash
Comment 3 Rebecca Cran freebsd_committer freebsd_triage 2022-09-15 01:47:25 UTC
(In reply to Kubilay Kocak from comment #1)
Yes, I should be able to do that - though it might take a couple of weeks since I'm pretty busy with other stuff just now.
Comment 4 Rebecca Cran freebsd_committer freebsd_triage 2022-09-15 17:45:51 UTC
12.3-STABLE-20220826 r372448 appears to work well: I get 1.2 GB/s and didn't notice any instability (though I didn't try as hard as with 13.0-RELEASE).

13.1-STABLE-20220826 is broken: I get 168 MB/s, then segfaults when I cancel dd.

13.0-RELEASE appeared to work: I get 168 MB/s and no segfaults. But when I tried to log out from the installer shell, it hung. Logging into another VT, I ran dmesg and saw init was repeatedly crashing:

pid 49742 (init), jid 0, uid 0: exited on signal 11
pid 49743 (init), jid 0, uid 0: exited on signal 11
pid 49744 (init), jid 0, uid 0: exited on signal 11
pid 49745 (init), jid 0, uid 0: exited on signal 11
pid 49746 (init), jid 0, uid 0: exited on signal 11
pid 49747 (init), jid 0, uid 0: exited on signal 11
etc.
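
When comparing snapshots, it may help to tally these lines rather than read them by eye. The following is a minimal sketch that assumes the dmesg line format shown above. The names `SIGNAL_EXIT` and `count_signal_exits` are hypothetical helpers, not part of any FreeBSD tool.

```python
import re
from collections import Counter

# Matches lines such as:
#   pid 49742 (init), jid 0, uid 0: exited on signal 11
SIGNAL_EXIT = re.compile(
    r"pid (\d+) \(([^)]+)\), jid (\d+), uid (\d+): exited on signal (\d+)"
)


def count_signal_exits(dmesg_text):
    """Count signal exits in dmesg output, keyed by (process name, signal)."""
    counts = Counter()
    for m in SIGNAL_EXIT.finditer(dmesg_text):
        counts[(m.group(2), int(m.group(5)))] += 1
    return counts
```

Feeding it the saved output of dmesg from each snapshot gives one count per process name and signal, which makes runs on different versions easier to compare.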
Comment 5 Rebecca Cran freebsd_committer freebsd_triage 2022-09-19 04:46:58 UTC
Commit f28ecf2b63ff886fa59637c8aa8f0ce7b7f4202f from 13-CURRENT appears to work.
Comment 6 Warner Losh freebsd_committer freebsd_triage 2022-09-19 15:20:35 UTC
Commits of interest since then:
cd8537910406 MAXPHYS tunable
f83d3280f60d uintptr_t (maybe nop, commit message looks maybe interesting)
e34a057ca6eb and 59fffbcf46ba BIG ENDIAN SUPPORT + i386 fix
fa3d57c25610 calculate maximum transfer size correctly (if setting kern.maxphys back to 131072 fixes corruption, then study this and the first one).

The rest look like boring changes, or at the least well constrained bug fixes around the edges that likely wouldn't cause a problem.
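
To make the kern.maxphys suggestion concrete: 131072 bytes is 128 KiB, so under that limit a single 1 MiB read from the dd command would have to be issued as several smaller transfers. The sketch below shows only that size arithmetic. The helper `split_transfer` is hypothetical, the 1 MiB value for the newer limit is an assumption that is not stated in this report, and the code does not model how the kernel or the mrsas driver actually splits requests.

```python
def split_transfer(length, max_io):
    """Return the piece sizes when a `length`-byte request is cut so that
    no single piece exceeds `max_io` bytes."""
    if length < 0 or max_io <= 0:
        raise ValueError("length must be >= 0 and max_io must be > 0")
    full, rest = divmod(length, max_io)
    return [max_io] * full + ([rest] if rest else [])


DD_BLOCK = 1 << 20             # bs=1M from the reproduction command
OLD_MAXPHYS = 131072           # 128 KiB, the value suggested for testing
ASSUMED_NEW_MAXPHYS = 1 << 20  # assumption: larger limit after the MAXPHYS change

pieces_old = split_transfer(DD_BLOCK, OLD_MAXPHYS)          # eight 128 KiB pieces
pieces_new = split_transfer(DD_BLOCK, ASSUMED_NEW_MAXPHYS)  # one 1 MiB piece
```

If the corruption only appears when the request goes down as one large transfer, that would point at the maximum transfer size calculation, which is the reasoning behind looking at the first and last commits in the list.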