Bug 259129

Summary: FreeBSD 13.0: Corruption in SGL observed while running IO with smartpqi driver
Product: Base System
Reporter: Hermes T K <Hermes.TK>
Component: kern
Assignee: freebsd-bugs (Nobody) <bugs>
Status: Closed FIXED
Severity: Affects Only Me
CC: Hermes.TK, girgen, imp, nbe, scott.benesh
Priority: ---    
Version: 13.0-RELEASE   
Hardware: amd64   
OS: Any   
Attachments:
Description: Attaching kernel logs for reference (Flags: none)

Description Hermes T K 2021-10-13 07:35:51 UTC
Created attachment 228651
Attaching kernel logs for reference

In FreeBSD 13.0, while running IO with a 1 MB block size, we observed corruption in the SGL. We were running IO with the FIO tool (version 3.28).
Command used:

fio --filename=/dev/da0: -direct=1 -iodepth=32 -ioengine=posixaio -rw=randrw -bs=1024k -numjobs=8 -runtime=30 -group_reporting -name=stress

Transfer size used in the smartpqi driver:

In the update_sim_properties function:
cpi->maxio = (softs->pqi_cap.max_sg_elem - 1) * PAGE_SIZE;
i.e. max transfer size: 0x21f000 (2,224,128 bytes, about 2.12 MiB)
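As a quick cross-check of the arithmetic above, here is a minimal sketch of the same expression. The values 544 for max_sg_elem and 4096 for the page size are assumptions inferred from the reported result (0x21f000 / 0x1000 = 0x21f = 543), not values read from the driver or the controller. The helper name is hypothetical.

```c
#include <stdint.h>

/* Assumed values, inferred from the reported maxio of 0x21f000;
 * they are NOT taken from the driver source or the controller. */
#define EXAMPLE_PAGE_SIZE   4096u
#define EXAMPLE_MAX_SG_ELEM 544u

/* Hypothetical helper mirroring the expression quoted from
 * update_sim_properties: one SG element is held back. */
static uint32_t
example_maxio(uint32_t max_sg_elem, uint32_t page_size)
{
	return (max_sg_elem - 1) * page_size;
}
```

Under these assumptions, example_maxio(EXAMPLE_MAX_SG_ELEM, EXAMPLE_PAGE_SIZE) evaluates to 0x21f000, so a 0x100000-byte request is well within the advertised limit.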

While running IO with a 1 MB block size, bus_dmamap_load_ccb returns EINPROGRESS in the pqi_map_request function.

Call stack: smartpqi_cam_action() -> pqisrc_io_start() -> pqi_map_request() -> bus_dmamap_load_ccb() -> pqi_request_map_helper()

After bus_dmamap_load_ccb() returns EINPROGRESS, the pqi_request_map_helper callback function is called.
We then observed corruption in the SGL it received.
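For readers unfamiliar with the deferred path: per bus_dma(9), a load that cannot get resources immediately returns EINPROGRESS and the callback runs later with the segment list. The following is a userland mock of that pattern only. Every name in it (mock_seg, mock_dmamap_load, mock_map_helper, and so on) is hypothetical; it does not use or reproduce the real bus_dma or smartpqi code.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration; NOT the bus_dma(9) types. */
struct mock_seg { uint64_t addr; uint32_t len; };
typedef void mock_callback_t(void *arg, const struct mock_seg *segs,
    int nseg, int error);

struct mock_request {
	int submitted;       /* set once the request is handed to hardware */
	int nseg;            /* segment count seen by the callback */
	uint32_t mapped_len; /* sum of segment lengths seen by the callback */
};

/* State remembered for a load that could not complete immediately. */
struct mock_deferred { mock_callback_t *cb; void *arg; int pending; };

/* Callback: the only place where the segment list is valid. */
static void
mock_map_helper(void *arg, const struct mock_seg *segs, int nseg, int error)
{
	struct mock_request *rq = arg;

	if (error != 0)
		return;
	rq->nseg = nseg;
	rq->mapped_len = 0;
	for (int i = 0; i < nseg; i++)
		rq->mapped_len += segs[i].len;
	rq->submitted = 1;
}

/* Mock load: when resources are short, save the callback and defer. */
static int
mock_dmamap_load(struct mock_deferred *d, int resources_short,
    mock_callback_t *cb, void *arg, const struct mock_seg *segs, int nseg)
{
	if (resources_short) {
		d->cb = cb;
		d->arg = arg;
		d->pending = 1;
		return EINPROGRESS; /* caller must not submit the request yet */
	}
	cb(arg, segs, nseg, 0);
	return 0;
}

/* Later, once resources free up, the deferred callback is run. */
static void
mock_run_deferred(struct mock_deferred *d, const struct mock_seg *segs,
    int nseg)
{
	if (d->pending) {
		d->pending = 0;
		d->cb(d->arg, segs, nseg, 0);
	}
}
```

In this mock, the request stays unsubmitted after EINPROGRESS and is only built and submitted from the callback, which is the point at which the reported SGL looked wrong.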
E.g., while running FIO with a 1 MB block size, the total mapped SGL length received is 0x3002c0 (3,146,432 bytes, about 3.00 MiB), whereas the total transfer size for the IO was 0x100000 (1,048,576 bytes, 1 MiB).
SGL dump (debug prints added to dump the SGL):

05:30:29 freebsd13 kernel: [NOTE]:[ pqi_map_request ] [ 615 ]Total transfer size : 100000 for tag value 37, Max transfer size: 21f000
Oct  6 05:30:29 freebsd13 kernel: [NOTE]:[ pqi_request_map_helper ] [ 577 ]Submitting IO With tag 30 after EINPROGRESS status[NOTE]:[ pqisrc_build_sgl ] [ 108 ]SGL_Count :256 for tag 30
Oct  6 05:30:29 freebsd13 kernel: [NOTE]:[ pqisrc_build_sgl ] [ 131 ]SGL Chaining
Oct  6 05:30:29 freebsd13 kernel: [NOTE]:[ pqisrc_build_sgl ] [ 140 ]TAG: 30 sgt->addr[0]: 5930000
Oct  6 05:30:29 freebsd13 kernel: [NOTE]:[ pqisrc_build_sgl ] [ 142 ]TAG: 30 sgt->len[0]: 1000
Oct  6 05:30:29 freebsd13 kernel: [NOTE]:[ pqisrc_build_sgl ] [ 140 ]TAG: 30 sgt->addr[1]: 592f000
Oct  6 05:30:29 freebsd13 kernel: [NOTE]:[ pqisrc_build_sgl ] [ 142 ]TAG: 30 sgt->len[1]: 1000


Oct  6 05:30:29 freebsd13 kernel: [NOTE]:[ pqisrc_build_sgl ] [ 140 ]TAG: 30 sgt->addr[255]: 5831000
Oct  6 05:30:29 freebsd13 kernel: [NOTE]:[ pqisrc_build_sgl ] [ 142 ]TAG: 30 sgt->len[255]: 1000
Oct  6 05:30:29 freebsd13 kernel: [NOTE]:[ pqisrc_build_sgl ] [ 147 ]Total sg length = 3002c0 for tag value 30
Oct  6 05:30:29 freebsd13 kernel: [NOTE]:[ pqisrc_show_cdb ] [ 1615 ]TAG: 30
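A consistency check along the following lines could flag this condition before submission. It is a sketch only: the example_sge layout and both function names are hypothetical and are not the driver's own structures. Note that the visible entries in the dump are each 0x1000 bytes, and 256 such entries would sum to exactly 0x100000; the log elides entries 2 through 254, so their actual lengths are unknown.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical SG entry layout for illustration; not the driver's type. */
struct example_sge {
	uint64_t addr;
	uint32_t len;
};

/* Sum the lengths of all entries in the list. */
static uint64_t
example_sgl_total(const struct example_sge *sgl, size_t count)
{
	uint64_t total = 0;

	for (size_t i = 0; i < count; i++)
		total += sgl[i].len;
	return total;
}

/* Return 1 if the mapped length equals the requested transfer size, else 0. */
static int
example_sgl_matches(const struct example_sge *sgl, size_t count,
    uint64_t xfer_len)
{
	return example_sgl_total(sgl, count) == xfer_len;
}
```

With 256 entries of 0x1000 bytes, example_sgl_matches() accepts a transfer size of 0x100000 and rejects 0x3002c0, the total reported in the log.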


When we tried the same test on FreeBSD 12.2, the maximum block size fio was allowed to run with was 128k.

We suspect an issue in SGL handling in FreeBSD 13.0.
Comment 1 Hermes T K 2021-10-25 06:24:23 UTC
Any update on this?
Comment 2 Palle Girgensohn freebsd_committer freebsd_triage 2021-11-02 14:10:01 UTC
Hi, 

This sounds similar to what I was seeing before updating to 13-stable. Read more at https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=259611

Palle
Comment 3 Nils Beyer 2021-12-09 12:15:11 UTC
Hi,

any updates on this? I'm using three Adaptec 1100-4i HBAs, each connected to a separate SuperMicro BPN-SAS3-216EL1 backplane, for a total of 72 bays.

My zpool is created with 67 SSDs in a simple "RAID0"-config:

        zpool create atime=off mountpoint=none test da0 [..] da66

and I can reliably lock up a random controller each time by creating enough load using:

        dd if=/dev/zero of=/mnt/test.dat bs=100M

and, after about five minutes, a parallel

        zpool scrub test

with the following kernel messages:

        [...heartbeat...] controller is offline
        [...take_ctrl-offline...] Controller FW is not running. Lockup code = 1403a

The Adaptec HBA shows after reboot:

        1719-Slot 10 A controller failure event occurred prior to this power-up
          Previous lock up code=0001403A
        POST Messages Ended. Press any key to continue.

I even tried using only one Adaptec 1100 HBA with the three backplanes cascaded, but the controller locks up with this config as well...



TIA and BR,
Nils
Comment 4 Warner Losh freebsd_committer freebsd_triage 2023-10-19 03:23:55 UTC
The newest driver fixes it.