Bug 257890 - Storage controller lockup on zfs scrub [smartpqi][zfs]
Summary: Storage controller lockup on zfs scrub [smartpqi][zfs]
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 13.0-RELEASE
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: freebsd-fs (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-08-16 19:54 UTC by Peter
Modified: 2023-10-19 03:21 UTC
8 users

See Also:


Attachments
log/messages (66.81 KB, text/plain)
2021-08-17 17:47 UTC, Peter

Description Peter 2021-08-16 19:54:15 UTC
This is a repost of unresolved bug #240145

***HARDWARE IS VERIFIED OK BY ZFS SCRUB ON CENTOS 8.4 WITH 0 ERRORS***

Hardware:
HPE dl180 g10
HPE SmartArray p816i
12x Seagate ST16000NM002G

All combinations of BSD/driver/firmware are affected up to and including:
FreeBSD 13.0 Release/Stable
Microsemi smartpqi driver v4130 updated 8/5/2021
HPE SmartArray Firmware 3.53

The only error displayed/logged is of this form:
[167] [ERROR]::[178:655.0][0,64,0][CPU 15][pqi_map_request][540]:bus_dmamap_load_ccb failed = 36 count = 1044480
[167] [WARN]:[178:655.0][CPU 15][pqisrc_io_start][794]:In Progress on 64

This issue is 100% reproducible: it sometimes occurs within the first 1% of scrub progress, and never later than 8-9%.
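
For readers decoding the error line above, here is a minimal sketch that pulls the two numbers out of the logged message and interprets them. It assumes FreeBSD's errno numbering (where 36 is EINPROGRESS, which agrees with the driver's own "In Progress" warning) and a 4096-byte page size on amd64. The variable names are illustrative and are not part of the smartpqi driver.

```shell
#!/bin/sh
# Log line copied verbatim from the report above.
line='[167] [ERROR]::[178:655.0][0,64,0][CPU 15][pqi_map_request][540]:bus_dmamap_load_ccb failed = 36 count = 1044480'

# Extract the error code and the byte count with sed.
err=$(printf '%s\n' "$line" | sed -n 's/.*failed = \([0-9][0-9]*\).*/\1/p')
count=$(printf '%s\n' "$line" | sed -n 's/.*count = \([0-9][0-9]*\).*/\1/p')

# Assumption: FreeBSD errno values; Linux uses different numbers.
case "$err" in
    12) err_name=ENOMEM ;;
    27) err_name=EFBIG ;;
    36) err_name=EINPROGRESS ;;
    *)  err_name=UNKNOWN ;;
esac

# Assumption: 4096-byte pages (amd64 base page size).
page_size=4096
pages=$((count / page_size))
kib=$((count / 1024))

echo "errno $err = $err_name; request = $count bytes = $kib KiB = $pages pages of $page_size bytes"
```

Under these assumptions the failing request is 1020 KiB, i.e. 255 pages, and the mapping call reported EINPROGRESS. This only decodes the message; it does not by itself establish why the controller locks up.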
Comment 1 Peter 2021-08-17 17:47:28 UTC
Created attachment 227287
log/messages

See last lines
Comment 2 rainer 2021-08-17 20:32:16 UTC
This is split off from
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=240145

Please include papani.srikanth, emaste and imp in the CC list so that Papani can help gather debug information or provide a newer version of the driver.
Comment 3 Warner Losh freebsd_committer freebsd_triage 2021-08-17 22:56:51 UTC
Is there a newer driver ready to integrate?
Comment 4 Warner Losh freebsd_committer freebsd_triage 2021-08-17 22:57:29 UTC
Is there a new version of the smartpqi driver ready to integrate into FreeBSD?
Comment 5 Nils Beyer 2021-12-09 12:15:17 UTC
Hi,

Any updates on this? I'm using three Adaptec 1100-4i HBAs, each connected to a separate SuperMicro BPN-SAS3-216EL1 backplane, for a total of 72 bays.

My zpool is created with 67 SSDs in a simple "RAID0"-config:

        zpool create -O atime=off -O mountpoint=none test da0 [..] da66

and every time I can reliably lock up a random controller by creating enough load using:

        dd if=/dev/zero of=/mnt/test.dat bs=100M

and, after five minutes, a parallel

        zpool scrub test

with the following kernel messages:

        [...heartbeat...] controller is offline
        [...take_ctrl-offline...] Controller FW is not runniung. Lockup code = 1403a

After reboot, the Adaptec HBA shows:

        1719-Slot 10 A controller failure event occurred prior to this power-up
          Previous lock up code=0001403A
        POST Messages Ended. Press any key to continue.

I even tried a single Adaptec 1100 HBA with the three backplanes cascaded, but the controller locks up in this config as well...



TIA and BR,
Nils
Comment 6 Peter 2022-02-14 13:08:41 UTC
All: the resolution can be found in the thread for bug #240145.
Comment 7 Warner Losh freebsd_committer freebsd_triage 2023-10-19 03:21:59 UTC
New driver fixes this