Bug 237463

Summary: aacraid(4) doesn't work on powerpc64
Product: Base System
Reporter: Piotr Kubaj <pkubaj>
Component: kern
Assignee: freebsd-bugs (Nobody) <bugs>
Status: Closed FIXED
Severity: Affects Some People
CC: achim, alfredo, jhibbits, leandro.lupori, linimon, luporl, prasad.munirathnam, sobomax
Priority: ---
Keywords: patch
Version: CURRENT
Hardware: powerpc
OS: Any
See Also: https://reviews.freebsd.org/D23667
https://reviews.freebsd.org/D23668
https://reviews.freebsd.org/D23859
https://reviews.freebsd.org/D23887

Description Piotr Kubaj freebsd_committer freebsd_triage 2019-04-22 10:10:17 UTC
I use a Talos II board (powerpc64 architecture). This board has an integrated Adaptec Series 8 12G SAS/PCIe 3 controller.

Unfortunately, the aacraid driver doesn't work with it. Here's the relevant dmesg excerpt:
aacraid0: <Adaptec RAID Controller> mem 0x80000000-0x800fffff,0x80180000-0x801803ff irq 1038328 at device 0.0 numa-domain 0 on pci5
aacraid0: Async. mode not supported by current driver, sync. mode enforced.
Please update driver to get full performance.
aacraid0: Enable Raw I/O
aacraid0: Enable 64-bit array
aacraid0: RequestSupplementAdapterInfo failed
aacraid0: Adaptec RAID controller, aacraid driver 3.2.5-1
aacraid0: RequestAdapterInfo failed
aacraid0: Error probing container 0
aacraidp0 numa-domain 0 on aacraid0
aacraid0: Error 5 sending VM_ContainerConfig command

This is on CURRENT from today. I have three HDDs plugged into this adapter; none of them is detected by FreeBSD (they work on Linux).

root@:~ # uname -a
FreeBSD  13.0-CURRENT FreeBSD 13.0-CURRENT r346500 GENERIC  powerpc
Comment 1 Piotr Kubaj freebsd_committer freebsd_triage 2019-05-24 14:18:29 UTC
Update to r348091 didn't help.

If anyone wants to look into it, I can provide remote access to hardware.
Comment 2 Leandro Lupori 2020-01-16 11:50:07 UTC
The main issue is that the controller expects FIBs in little-endian (LE) format, while the driver assumes the host is LE and performs no endianness conversion.

There are several structures that a FIB may carry, and all of their fields need to be converted: to LE when going to the controller and to big-endian (BE) when coming from the controller.

So far, after converting device info FIBs, the driver is able to show correct device information.
I'll convert the other structures, and test if the driver can then start to perform disk I/O.
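
To illustrate the kind of conversion described above, here is a minimal sketch, not the actual driver code. The struct `demo_fib_header`, its fields, and all `demo_*` helpers are hypothetical names made up for this example; the real FIB layout has more fields. In the kernel one would presumably use `htole32()`/`le32toh()` from `<sys/endian.h>` instead of the hand-written helpers, which are spelled out here only so the sketch builds anywhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified header; not the real FIB layout. */
struct demo_fib_header {
	uint32_t xfer_state;
	uint16_t command;
	uint16_t size;
	uint32_t sender_data;
};

/* Host order -> LE storage. A no-op on LE hosts, a byte swap on BE hosts. */
static uint32_t demo_htole32(uint32_t v)
{
	uint8_t b[4] = {
		(uint8_t)(v & 0xff), (uint8_t)((v >> 8) & 0xff),
		(uint8_t)((v >> 16) & 0xff), (uint8_t)((v >> 24) & 0xff)
	};
	uint32_t out;

	memcpy(&out, b, sizeof(out));
	return (out);
}

/* LE storage -> host order. */
static uint32_t demo_le32toh(uint32_t v)
{
	uint8_t b[4];

	memcpy(b, &v, sizeof(b));
	return ((uint32_t)b[0] | ((uint32_t)b[1] << 8) |
	    ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24));
}

static uint16_t demo_htole16(uint16_t v)
{
	uint8_t b[2] = { (uint8_t)(v & 0xff), (uint8_t)(v >> 8) };
	uint16_t out;

	memcpy(&out, b, sizeof(out));
	return (out);
}

static uint16_t demo_le16toh(uint16_t v)
{
	uint8_t b[2];

	memcpy(b, &v, sizeof(b));
	return ((uint16_t)(b[0] | (b[1] << 8)));
}

/* Convert every field before handing the header to the controller. */
static void demo_fib_header_tole(struct demo_fib_header *h)
{
	assert(h != NULL);
	h->xfer_state = demo_htole32(h->xfer_state);
	h->command = demo_htole16(h->command);
	h->size = demo_htole16(h->size);
	h->sender_data = demo_htole32(h->sender_data);
}

/* Convert every field back after the controller has filled it in. */
static void demo_fib_header_toh(struct demo_fib_header *h)
{
	assert(h != NULL);
	h->xfer_state = demo_le32toh(h->xfer_state);
	h->command = demo_le16toh(h->command);
	h->size = demo_le16toh(h->size);
	h->sender_data = demo_le32toh(h->sender_data);
}
```

Every structure a FIB can carry would need a pair of functions like these, which is why the port touches so many structs.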
Comment 3 Leandro Lupori freebsd_committer freebsd_triage 2020-02-12 12:30:38 UTC
Status update:

After adding endianness conversion for all FIB structures used by aacraid.c and aacraid_cam.c, the driver got further, but there were still some issues:
- sense data was causing a buffer overflow, because the maximum sense length logic was wrong
- after several read/write operations, the kernel was panicking in bus_dmamap_sync(). This was due to a missing aac_unmap_command() in the SYNC path.
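
For the first item, the general shape of such a fix is to bound the copy by every buffer involved and not only by the length the adapter reports. The following is a minimal sketch under that assumption, not the actual change; `demo_copy_sense` and its parameters are hypothetical names for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Copy sense data into dst without trusting reported_len.
 * The copy length is the smallest of: the length reported by the
 * adapter, the capacity of the source buffer, and the size of dst.
 * Returns the number of bytes actually copied.
 */
static size_t
demo_copy_sense(uint8_t *dst, size_t dst_len, const uint8_t *src,
    size_t src_cap, size_t reported_len)
{
	size_t n = reported_len;

	assert(dst != NULL && src != NULL);
	if (n > src_cap)
		n = src_cap;
	if (n > dst_len)
		n = dst_len;

	memset(dst, 0, dst_len);	/* no stale bytes past the copy */
	memcpy(dst, src, n);
	return (n);
}
```

With this shape, a bogus or byte-swapped length coming back from the adapter can at worst truncate the sense data; it cannot write past the destination.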

(I'm currently holding the patches until I have a minimal working version of aacraid driver).

Now I'm able to perform several reads and writes, sometimes even complete a FreeBSD install from ISO to disk, but there is still (at least) one issue remaining. After some time of disk activity, write commands begin to fail with Selection Timeout errors. They usually fail 3 times in a row and the periph is then destroyed.

After some debugging, I've noticed the timeouts start to occur right after receiving an AIF interrupt. For some reason I don't know yet, the response to an AifRequest in SYNC mode comes with no AIF information, so I don't know what the adapter is trying to tell me.

Any help or debugging tip is appreciated!
Comment 4 Leandro Lupori freebsd_committer freebsd_triage 2020-02-13 18:15:56 UTC
The fixes for the 2 issues mentioned in the previous comment were posted for review:

[aacraid] Prevent sense data from causing a buffer overflow: https://reviews.freebsd.org/D23667

[aacraid] Add missing aac_unmap_command() call for SYNC mode: https://reviews.freebsd.org/D23668
Comment 5 Leandro Lupori freebsd_committer freebsd_triage 2020-02-26 18:03:09 UTC
I've noticed that the AIF interrupts always occur about 5 minutes after a reboot.
Luckily, they occur on Petitboot too, which made it possible to collect the following information about the remaining issue:


/ # dmesg | tail -20
[   40.494002] sd 1:2:23:0: [sdi] 4096-byte physical blocks
[   40.494004] scsi 1:3:123:0: Enclosure         ADAPTEC  Smart Adapter    4.02 PQ: 0 ANSI: 5
[   40.495376] sd 1:2:23:0: [sdi] Write Protect is off
[   40.495379] sd 1:2:23:0: [sdi] Mode Sense: 46 00 10 08
[   40.495520] scsi 1:3:123:0: Attached scsi generic sg11 type 13
[   40.498220] sd 1:2:23:0: [sdi] Write cache: enabled, read cache: enabled, supports DPO and FUA
[   40.533826] udevd[2649]: inotify_add_watch(6, /dev/dm-8, 10) failed: No such file or directory
[   40.585006] sd 1:2:23:0: [sdi] Attached SCSI disk
[   41.437318] udevd[2688]: inotify_add_watch(6, /dev/dm-11, 10) failed: No such file or directory
[  321.101655] sd 1:2:16:0: [sdb] Synchronizing SCSI cache
[  321.102364] sd 1:2:16:0: [sdb] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[  334.245061] scsi 1:2:16:0: Direct-Access     ATA      ST4000NM0115-1YZ SN04 PQ: 0 ANSI: 6
[  334.250710] sd 1:2:16:0: Attached scsi generic sg2 type 0
[  334.260739] sd 1:2:16:0: [sdb] 7814037168 512-byte logical blocks: (4.00 TB/3.64 TiB)
[  334.260742] sd 1:2:16:0: [sdb] 4096-byte physical blocks
[  334.261614] sd 1:2:16:0: [sdb] Write Protect is off
[  334.261616] sd 1:2:16:0: [sdb] Mode Sense: 46 00 10 08
[  334.264430] sd 1:2:16:0: [sdb] Write cache: disabled, read cache: enabled, supports DPO and FUA
[  334.325386]  sdb: sdb1 sdb2 sdb3
[  334.349896] sd 1:2:16:0: [sdb] Attached SCSI disk


/var/petitboot/mnt/dev/sda2/bsd # ./arcconf getlogs 1 event
Controllers found: 1
<ControllerLog controllerID="0" time="Wed Feb 26 16:52:47 2020">
    <eventlog>
        <event message="Previous Firmware Lockup Detected, Lockup Code=227 Detail=0x00000000" eventTag="1" relativeControllerTime="4" eventClassCode="12" eventSubClassCode="0" eventDetailCode="0"/>
        <event message="Cache battery/Super cap is missing" eventTag="2" relativeControllerTime="4" eventClassCode="2" eventSubClassCode="4" eventDetailCode="2"/>
        <event message="Encryption Self-Test failed" eventTag="3" relativeControllerTime="4" eventClassCode="2" eventSubClassCode="10" eventDetailCode="0"/>
        <event message="Hot-plug drive removed, Port=C0 Box=1 Bay=0 SN=            ZC19RD9E" eventTag="4" relativeControllerTime="335" eventClassCode="1" eventSubClassCode="0" eventDetailCode="0"/>
        <event message="Physical drive failure, Port=C0 Box=1 Bay=0 reason=0x14" eventTag="5" relativeControllerTime="335" eventClassCode="4" eventSubClassCode="0" eventDetailCode="0"/>
        <event message="Hot-plug drive inserted, Port=C0 Box=1 Bay=0 SN=            ZC19RD9E" eventTag="6" relativeControllerTime="348" eventClassCode="1" eventSubClassCode="0" eventDetailCode="1"/>
        <event message="Drive is re-enabled, Port=C0 Box=1 Bay=0" eventTag="7" relativeControllerTime="348" eventClassCode="4" eventSubClassCode="0" eventDetailCode="3"/>
    </eventlog>
</ControllerLog>


So, the AIFs are about the drive being removed and then re-inserted after a few seconds, which explains the "Target Selection Timeout" errors that were being seen right after the AIF interrupts occurred.

However, further investigation is needed to understand why the drive is being removed. It could be due to a bad HDD/SAS expander cable, a write cache issue, or maybe a setup issue with the 2 SAS controllers/cabling on the machine, or maybe something else.
Comment 6 Leandro Lupori freebsd_committer freebsd_triage 2020-02-27 18:51:54 UTC
Ignoring disk 1 of the test machine and using only disks 2 and 3 with FreeBSD, AIF interrupts still occur but cause no issues. So, at least for now, I'll consider this "drive being removed and re-inserted issue" as being some sort of HDD or SAS cable issue.

Now I just need to remove the debug/test stuff from my local changes in order to post an aacraid driver fix/port to powerpc64 for review. This way others can start to test the changes too.
Comment 7 Leandro Lupori freebsd_committer freebsd_triage 2020-02-28 19:45:48 UTC
I've posted the last 2 patches needed. With all patches applied to CURRENT, the aacraid driver *should* now work on PowerPC64 machines.
It works on power9-devref.

Piotr Kubaj, can you test this on your Talos II when you get a chance?
Comment 8 Piotr Kubaj freebsd_committer freebsd_triage 2020-03-01 11:45:27 UTC
(In reply to Leandro Lupori from comment #7)
When trying to load aacraid.ko, I'm getting:
link_elf: symbol aac_fib_xporthdr_tole undefined
Comment 9 Leandro Lupori freebsd_committer freebsd_triage 2020-03-02 14:01:46 UTC
(In reply to Piotr Kubaj from comment #8)

Right, I had only tested aacraid built into the kernel, not as a module.

I've added the missing aacraid module source file (aacraid_be.c) and enabled it for ppc in https://reviews.freebsd.org/D23887.
Comment 10 Piotr Kubaj freebsd_committer freebsd_triage 2020-03-02 14:53:21 UTC
(In reply to Leandro Lupori from comment #9)
It's detected:
aacraid0: <Adaptec RAID Controller> mem 0x80000000-0x800fffff,0x80180000-0x801803ff irq 1038328 at device 0.0 numa-domain 0 on pci5
aacraid0: Async. mode not supported by current driver, sync. mode enforced.
aacraid0: Enable Raw I/O
aacraid0: Enable 64-bit array
aacraid0: using line interrupts
aacraid0: SmartIOC 8i, aacraid driver 3.2.10-1
aacraidp0 numa-domain 0 on aacraid0
aacraidp1 numa-domain 0 on aacraid0
aacraidp2 numa-domain 0 on aacraid0
aacraidp3 numa-domain 0 on aacraid0
ses0 at aacraidp3 bus 0 scbus3 target 123 lun 0
da0 at aacraidp2 bus 0 scbus2 target 0 lun 0
da1 at aacraidp2 bus 0 scbus2 target 1 lun 0

I'll try to install FreeBSD on those disks later and run make buildworld on it to test whether it works.
Comment 11 Piotr Kubaj freebsd_committer freebsd_triage 2020-03-02 17:18:06 UTC
(In reply to Leandro Lupori from comment #9)
Just a thought, if compiling in-kernel is required, then GENERIC64 config should also include aacraid. This would also be in line with amd64.
Comment 12 Leandro Lupori freebsd_committer freebsd_triage 2020-03-02 18:01:41 UTC
(In reply to Piotr Kubaj from comment #11)

Compiling in-kernel is required to use a rootfs on aacraid-attached disks on PowerNV machines, which can't use loader to load modules.
I'll add it to GENERIC64 config, thanks for pointing it out.
Comment 13 Piotr Kubaj freebsd_committer freebsd_triage 2020-03-03 20:57:40 UTC
I made a custom ISO that uses a kernel with the patched aacraid and installed FreeBSD to an HDD plugged into the aacraid card.
Then I successfully ran buildworld on that installation. There weren't any I/O errors, so the driver seems stable.
Comment 14 commit-hook freebsd_committer freebsd_triage 2020-03-04 12:22:24 UTC
A commit references this bug:

Author: luporl
Date: Wed Mar  4 12:21:38 UTC 2020
New revision: 358613
URL: https://svnweb.freebsd.org/changeset/base/358613

Log:
  [aacraid] Add missing unmap call for SYNC mode

  This issue was observed on a PowerPC64 machine with an Adaptec RAID Controller
  with PCI device ID 0x028d. After several read/write operations, the kernel was
  panic'ing in bus_dmamap_sync(). This was due to a missing aac_unmap_command()
  in the SYNC path.

  PR:	237463
  Reviewed by:	jhibbits
  Differential Revision:	https://reviews.freebsd.org/D23668

Changes:
  head/sys/dev/aacraid/aacraid.c
Comment 15 commit-hook freebsd_committer freebsd_triage 2020-03-05 20:04:52 UTC
A commit references this bug:

Author: luporl
Date: Thu Mar  5 20:04:42 UTC 2020
New revision: 358684
URL: https://svnweb.freebsd.org/changeset/base/358684

Log:
  [aacraid] Port driver to big-endian

  Port aacraid driver to big-endian (BE) hosts.

  The immediate goal of this change is to make it possible to use the
  aacraid driver on PowerPC64 machines that have Adaptec Series 8 SAS
  controllers.

  Adapters supported by this driver expect FIB contents in little-endian
  (LE) byte order. All FIBs have a fixed header part as well as a data
  part that depends on the command being issued to the controller.

  In this way, on BE hosts, the FIB header and all FIB data structures
  used in aacraid.c and aacraid_cam.c need to be converted to LE before
  being sent to the adapter and converted to BE when coming from it.

  The functions to convert each struct are on aacraid_endian.c.
  For little-endian (LE) targets, they are macros that expand
  to nothing.
  In some cases, when only a few fields of a large structure are used,
  the fields are converted inline, by the code using them.

  PR:		237463
  Reviewed by:	jhibbits
  Sponsored by:	Eldorado Research Institute (eldorado.org.br)
  Differential Revision:	https://reviews.freebsd.org/D23887

Changes:
  head/sys/amd64/conf/NOTES
  head/sys/conf/NOTES
  head/sys/conf/files.powerpc
  head/sys/dev/aacraid/aacraid.c
  head/sys/dev/aacraid/aacraid_cam.c
  head/sys/dev/aacraid/aacraid_endian.c
  head/sys/dev/aacraid/aacraid_endian.h
  head/sys/modules/Makefile
  head/sys/modules/aacraid/Makefile
  head/sys/powerpc/conf/GENERIC64
Comment 16 commit-hook freebsd_committer freebsd_triage 2020-03-09 19:01:40 UTC
A commit references this bug:

Author: luporl
Date: Mon Mar  9 19:01:17 UTC 2020
New revision: 358814
URL: https://svnweb.freebsd.org/changeset/base/358814

Log:
  [aacraid] Handle both AIF and SYNC interrupts

  Without this change, if an AIF interrupt comes at the same time a SYNC
  command is finished, the SYNC interrupt will be lost. This happens because
  all interrupt bits (bellbits) are cleared, but only one of them is handled.

  Debugging shows that, (at least) when !sc->msi_enabled and (sc->flags &
  AAC_FLAGS_SYNC_MODE) is true (sync mode), both bits may be set at the same
  time.

  PR:		237463
  Reviewed by:	scottl
  Sponsored by:	Eldorado Research Institute (eldorado.org.br)
  Differential Revision:	https://reviews.freebsd.org/D23859

Changes:
  head/sys/dev/aacraid/aacraid.c
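
The lost-interrupt pattern described in the log above can be shown with a small sketch. This is not the driver's interrupt handler: the bit values, `struct demo_counts` and both `demo_intr_*` functions are hypothetical names for this example. It only contrasts handling the first matching bit with handling every bit that was set when the bits were cleared.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical doorbell bits; the real register layout differs. */
#define DEMO_BELL_AIF	0x1u
#define DEMO_BELL_SYNC	0x2u

struct demo_counts {
	unsigned aif;	/* AIF notifications handled */
	unsigned sync;	/* SYNC command completions handled */
};

/*
 * Lossy variant: all bits are acknowledged (cleared), but only the
 * first matching one is handled. If both are set, the SYNC completion
 * is dropped and nobody is ever told that the command finished.
 */
static void
demo_intr_lossy(uint32_t *bellbits, struct demo_counts *c)
{
	uint32_t bits;

	assert(bellbits != NULL && c != NULL);
	bits = *bellbits;
	*bellbits = 0;

	if (bits & DEMO_BELL_AIF)
		c->aif++;
	else if (bits & DEMO_BELL_SYNC)
		c->sync++;
}

/* Fixed variant: every bit that was cleared is also handled. */
static void
demo_intr_both(uint32_t *bellbits, struct demo_counts *c)
{
	uint32_t bits;

	assert(bellbits != NULL && c != NULL);
	bits = *bellbits;
	*bellbits = 0;

	if (bits & DEMO_BELL_AIF)
		c->aif++;
	if (bits & DEMO_BELL_SYNC)
		c->sync++;
}
```

The only difference is `else if` versus two independent `if` statements, which matters exactly when both bits arrive together.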
Comment 17 Leandro Lupori freebsd_committer freebsd_triage 2020-03-10 15:49:38 UTC
(In reply to Leandro Lupori from comment #5)

About this issue of the drive being removed and then re-inserted after a few seconds, further investigation revealed that the problem was caused by two SAS controllers (Adaptec and LSI) being connected to all HDDs through a SAS expander cable. When LSI controller's cable was disconnected, the issue stopped happening.

So apparently the issue was some kind of conflict or interference between the 2 adapters connected to the same disks and not a driver issue.