| Summary: | FreeBSD 13.0: Corruption in SGL observed while running IO with smartpqi driver | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Base System | Reporter: | Hermes T K <Hermes.TK> | ||||
| Component: | kern | Assignee: | freebsd-bugs (Nobody) <bugs> | ||||
| Status: | Closed FIXED | ||||||
| Severity: | Affects Only Me | CC: | Hermes.TK, girgen, imp, nbe, scott.benesh | ||||
| Priority: | --- | ||||||
| Version: | 13.0-RELEASE | ||||||
| Hardware: | amd64 | ||||||
| OS: | Any | ||||||
| Attachments: |
|
||||||
Any update on this?

Hi,
This sounds similar to what I was seeing before updating to 13-stable. Read more at https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=259611
Palle

Hi,
any updates on this? I'm using three Adaptec 1100-4i HBAs, each connected to a separate SuperMicro BPN-SAS3-216EL1 backplane, for a total of 72 bays.
My zpool is created with 67 SSDs in a simple "RAID0"-config:
zpool create atime=off mountpoint=none test da0 [..] da66
and each time I can reliably lock up a random controller by creating enough load using:
dd if=/dev/zero of=/mnt/test.dat bs=100M
and, after about five minutes, a parallel
zpool scrub test
with the following kernel messages:
[...heartbeat...] controller is offline
[...take_ctrl-offline...] Controller FW is not running. Lockup code = 1403a
The Adaptec HBA shows after reboot:
1719-Slot 10 A controller failure event occurred prior to this power-up
Previous lock up code=0001403A
POST Messages Ended. Press any key to continue.
I even tried using only one Adaptec 1100 HBA with the three backplanes cascaded, but the controller locks up in this config as well...
TIA and BR,
Nils
The newest driver fixes it.
Created attachment 228651
Attaching kernel logs for reference

In FreeBSD 13.0, we observed corruption in the SGL (scatter-gather list) while running I/O with a 1 MB block size. The I/O was generated with fio (version 3.28).

Command used:
fio --filename=/dev/da0: -direct=1 -iodepth=32 -ioengine=posixaio -rw=randrw -bs=1024k -numjobs=8 -runtime=30 -group_reporting -name=stress

Transfer size used in the smartpqi driver, in the update_sim_properties function:
cpi->maxio = (softs->pqi_cap.max_sg_elem - 1) * PAGE_SIZE;
i.e. max transfer size: 0x21f000 (about 2.12 MiB)

While running I/O with a 1 MB block size, bus_dmamap_load_ccb returns EINPROGRESS in the pqi_map_request function.

Function stack:
smartpqi_cam_action() -> pqisrc_io_start() -> pqi_map_request() -> bus_dmamap_load_ccb() -> pqi_request_map_helper

After bus_dmamap_load_ccb() returns EINPROGRESS, the pqi_request_map_helper callback function is called, and the SGL it receives is corrupted.

E.g.: while running fio with a 1 MB block size, the total mapped SGL length received is 0x3002c0 (about 3.00 MiB), whereas the total transfer size for the I/O was 0x100000 (1 MiB).
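For anyone who wants to re-check the sizes quoted above, here is a small sketch of the arithmetic. It is not driver code: the helper names are made up for illustration, and it assumes PAGE_SIZE is 4096 (the amd64 base page size). It inverts the maxio formula to recover max_sg_elem and compares the mapped length against the limits.

```python
PAGE_SIZE = 4096  # assumption: amd64 base page size


def max_sg_elem_from_maxio(maxio, page_size=PAGE_SIZE):
    """Invert maxio = (max_sg_elem - 1) * page_size (hypothetical helper)."""
    pages, remainder = divmod(maxio, page_size)
    if remainder:
        raise ValueError("maxio is not a whole number of pages")
    return pages + 1


def mib(nbytes):
    """Convert a byte count to MiB (hypothetical helper)."""
    return nbytes / (1024 * 1024)


maxio = 0x21F000     # max transfer size from the report
transfer = 0x100000  # transfer size of the failing I/O
mapped = 0x3002C0    # total SGL length seen in the callback

max_sg_elem = max_sg_elem_from_maxio(maxio)
pages_needed = transfer // PAGE_SIZE

print("max_sg_elem:", max_sg_elem)                    # 544
print("pages needed for the I/O:", pages_needed)      # 256
print("maxio in MiB:", round(mib(maxio), 2))          # 2.12
print("mapped > maxio:", mapped > maxio)              # True
print("mapped % PAGE_SIZE:", hex(mapped % PAGE_SIZE)) # 0x2c0
```

So the mapped length is not only three times the request, it is also larger than maxio and not page aligned, which a sane 256-entry page-sized SGL for a 1 MiB transfer should never be.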
SGL dump (debug prints added to dump the SGL):
05:30:29 freebsd13 kernel: [NOTE]:[ pqi_map_request ] [ 615 ]Total transfer size : 100000 for tag value 37, Max transfer size: 21f000
Oct 6 05:30:29 freebsd13 kernel: [NOTE]:[ pqi_request_map_helper ] [ 577 ]Submitting IO With tag 30 after EINPROGRESS status[NOTE]:[ pqisrc_build_sgl ] [ 108 ]SGL_Count :256 for tag 30
Oct 6 05:30:29 freebsd13 kernel: [NOTE]:[ pqisrc_build_sgl ] [ 131 ]SGL Chaining
Oct 6 05:30:29 freebsd13 kernel: [NOTE]:[ pqisrc_build_sgl ] [ 140 ]TAG: 30 sgt->addr[0]: 5930000
Oct 6 05:30:29 freebsd13 kernel: [NOTE]:[ pqisrc_build_sgl ] [ 142 ]TAG: 30 sgt->len[0]: 1000
Oct 6 05:30:29 freebsd13 kernel: [NOTE]:[ pqisrc_build_sgl ] [ 140 ]TAG: 30 sgt->addr[1]: 592f000
Oct 6 05:30:29 freebsd13 kernel: [NOTE]:[ pqisrc_build_sgl ] [ 142 ]TAG: 30 sgt->len[1]: 1000
Oct 6 05:30:29 freebsd13 kernel: [NOTE]:[ pqisrc_build_sgl ] [ 140 ]TAG: 30 sgt->addr[255]: 5831000
Oct 6 05:30:29 freebsd13 kernel: [NOTE]:[ pqisrc_build_sgl ] [ 142 ]TAG: 30 sgt->len[255]: 1000
Oct 6 05:30:29 freebsd13 kernel: [NOTE]:[ pqisrc_build_sgl ] [ 147 ]Total sg length = 3002c0 for tag value 30
Oct 6 05:30:29 freebsd13 kernel: [NOTE]:[ pqisrc_show_cdb ] [ 1615 ]TAG: 30

When we tried the same on FreeBSD 12.2, the maximum block size fio was allowed to use was 128k. We suspect an issue in SGL handling in FreeBSD 13.0.
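The dump can be cross-checked with a short script. This is only a sketch: the regular expressions match the format of the custom debug prints shown in this report (not anything the stock driver emits), parse_dump is a made-up helper, and only the three sgt->len entries quoted here are available, so the "uniform" figure assumes every omitted entry is also 0x1000, which the excerpt does not confirm.

```python
import re

# Excerpt of the debug prints from this report, syslog prefix trimmed.
# Note the transfer-size line is printed for tag 37 and the rest for tag 30;
# the report states that the I/O in question was also 0x100000 bytes.
LOG = """\
[ pqi_map_request ] [ 615 ]Total transfer size : 100000 for tag value 37, Max transfer size: 21f000
[ pqisrc_build_sgl ] [ 108 ]SGL_Count :256 for tag 30
[ pqisrc_build_sgl ] [ 142 ]TAG: 30 sgt->len[0]: 1000
[ pqisrc_build_sgl ] [ 142 ]TAG: 30 sgt->len[1]: 1000
[ pqisrc_build_sgl ] [ 142 ]TAG: 30 sgt->len[255]: 1000
[ pqisrc_build_sgl ] [ 147 ]Total sg length = 3002c0 for tag value 30
"""

HEX = r"([0-9a-fA-F]+)"


def parse_dump(text):
    """Extract SGL figures from the debug prints (hypothetical helper)."""
    lens = {
        int(index): int(value, 16)
        for index, value in re.findall(r"sgt->len\[(\d+)\]: " + HEX, text)
    }
    return {
        "lens": lens,
        "count": int(re.search(r"SGL_Count :(\d+)", text).group(1)),
        "total": int(re.search(r"Total sg length = " + HEX, text).group(1), 16),
        "xfer": int(re.search(r"Total transfer size : " + HEX, text).group(1), 16),
    }


dump = parse_dump(LOG)
listed = sum(dump["lens"].values())  # only the entries quoted in the report
uniform = dump["count"] * 0x1000     # assumption: all 256 entries are 0x1000
excess = dump["total"] - dump["xfer"]

print("sum of listed entries:", hex(listed))
print("256 page-sized entries would give:", hex(uniform))
print("reported total:", hex(dump["total"]))
print("excess over transfer size:", hex(excess))
```

If the omitted entries really are all 0x1000, the list itself adds up to exactly the 1 MiB request, and the extra 0x2002c0 bytes would have to come from somewhere other than the 256 segments that were printed.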