Summary: | aacraid(4) doesn't work on powerpc64 | ||
---|---|---|---|
Product: | Base System | Reporter: | Piotr Kubaj <pkubaj> |
Component: | kern | Assignee: | freebsd-bugs (Nobody) <bugs> |
Status: | Closed FIXED | ||
Severity: | Affects Some People | CC: | achim, alfredo, jhibbits, leandro.lupori, linimon, luporl, prasad.munirathnam, sobomax |
Priority: | --- | Keywords: | patch |
Version: | CURRENT | ||
Hardware: | powerpc | ||
OS: | Any | ||
See Also: | https://reviews.freebsd.org/D23667 https://reviews.freebsd.org/D23668 https://reviews.freebsd.org/D23859 https://reviews.freebsd.org/D23887 | ||
Description
Piotr Kubaj
2019-04-22 10:10:17 UTC
Update to r348091 didn't help.

If anyone wants to look into it, I can provide remote access to hardware.

The main issue is that the controller expects FIBs in little-endian (LE) format and the driver assumes the host is LE, performing no endianness conversion. There are several structures that a FIB may carry, and all of their fields would need to be converted: to LE when going to the controller and to big-endian (BE) when coming from the controller. So far, after converting device info FIBs, the driver is able to show correct device information. I'll convert the other structures and test whether the driver can then start to perform disk I/O.

Status update: After adding endianness conversion, where needed, to all FIB structures used by aacraid.c and aacraid_cam.c, the driver started going further, but there were still some issues:

- sense data was causing a buffer overflow, because of wrong max sense len logic
- after several read/write operations, the kernel was panic'ing in bus_dmamap_sync(). This was due to a missing aac_unmap_command() in the SYNC path.

(I'm currently holding the patches until I have a minimal working version of the aacraid driver.)

Now I'm able to perform several reads and writes, and sometimes even complete a FreeBSD install from ISO to disk, but there is still (at least) one issue remaining. After some time of disk activity, write commands begin to fail with Selection Timeout errors. They usually fail 3 times in a row and the periph is then destroyed. After some debugging, I've noticed that the timeouts start to occur right after receiving an AIF interrupt. For some reason I don't know yet, the response to an AifRequest in SYNC mode comes with no AIF information, so I don't know what the adapter is trying to tell me.

Any help or debugging tip is appreciated!
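The conversion described above (to LE on the way to the controller, back to host order on the way from it) can be sketched as follows. This is only an illustration under stated assumptions: `struct demo_fib_header` and every `demo_*` function are hypothetical names invented for this example and do not reflect the driver's real FIB layout or its conversion routines. The byte swaps are written by hand so the sketch builds anywhere; kernel code would normally rely on the `htole32()`/`le32toh()` family from `<sys/endian.h>` instead.

```c
#include <assert.h>	/* static_assert */
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified FIB header; NOT the driver's real layout. */
struct demo_fib_header {
	uint32_t xfer_state;
	uint16_t command;
	uint16_t size;
};

static_assert(sizeof(struct demo_fib_header) == 8, "no padding expected");

/* Returns 1 on a big-endian host, 0 on a little-endian one. */
int demo_host_is_be(void)
{
	const uint16_t probe = 1;
	uint8_t first_byte;

	memcpy(&first_byte, &probe, sizeof(first_byte));
	return (first_byte == 0);
}

uint16_t demo_swap16(uint16_t v)
{
	return ((uint16_t)((v >> 8) | (v << 8)));
}

uint32_t demo_swap32(uint32_t v)
{
	return ((v >> 24) | ((v >> 8) & 0x0000ff00u) |
	    ((v << 8) & 0x00ff0000u) | (v << 24));
}

/* Host order to LE: swap on BE hosts, do nothing on LE hosts. */
uint16_t demo_htole16(uint16_t v)
{
	return (demo_host_is_be() ? demo_swap16(v) : v);
}

uint32_t demo_htole32(uint32_t v)
{
	return (demo_host_is_be() ? demo_swap32(v) : v);
}

/* Convert every field before handing the header to the controller. */
void demo_fib_header_tole(struct demo_fib_header *h)
{
	h->xfer_state = demo_htole32(h->xfer_state);
	h->command = demo_htole16(h->command);
	h->size = demo_htole16(h->size);
}

/* Convert a header received from the controller back to host order.
 * A byte swap is its own inverse, so the same per-field work applies. */
void demo_fib_header_toh(struct demo_fib_header *h)
{
	demo_fib_header_tole(h);
}
```

After `demo_fib_header_tole()`, the bytes in memory are in LE order regardless of the host, and `demo_fib_header_toh()` restores the original values. Every multi-byte field has to be handled individually, which is why each FIB payload structure needs its own conversion routine.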
The fixes to the 2 issues mentioned in the previous comment were posted for review:

[aacraid] Prevent sense data from causing a buffer overflow: https://reviews.freebsd.org/D23667
[aacraid] Add missing aac_unmap_command() call for SYNC mode: https://reviews.freebsd.org/D23668

I've noticed that the AIF interrupts always occur about 5 minutes after a reboot. Luckily, they occur on Petitboot too, which made it possible to collect the following information about the remaining issue:

/ # dmesg | tail -20
[ 40.494002] sd 1:2:23:0: [sdi] 4096-byte physical blocks
[ 40.494004] scsi 1:3:123:0: Enclosure ADAPTEC Smart Adapter 4.02 PQ: 0 ANSI: 5
[ 40.495376] sd 1:2:23:0: [sdi] Write Protect is off
[ 40.495379] sd 1:2:23:0: [sdi] Mode Sense: 46 00 10 08
[ 40.495520] scsi 1:3:123:0: Attached scsi generic sg11 type 13
[ 40.498220] sd 1:2:23:0: [sdi] Write cache: enabled, read cache: enabled, supports DPO and FUA
[ 40.533826] udevd[2649]: inotify_add_watch(6, /dev/dm-8, 10) failed: No such file or directory
[ 40.585006] sd 1:2:23:0: [sdi] Attached SCSI disk
[ 41.437318] udevd[2688]: inotify_add_watch(6, /dev/dm-11, 10) failed: No such file or directory
[ 321.101655] sd 1:2:16:0: [sdb] Synchronizing SCSI cache
[ 321.102364] sd 1:2:16:0: [sdb] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[ 334.245061] scsi 1:2:16:0: Direct-Access ATA ST4000NM0115-1YZ SN04 PQ: 0 ANSI: 6
[ 334.250710] sd 1:2:16:0: Attached scsi generic sg2 type 0
[ 334.260739] sd 1:2:16:0: [sdb] 7814037168 512-byte logical blocks: (4.00 TB/3.64 TiB)
[ 334.260742] sd 1:2:16:0: [sdb] 4096-byte physical blocks
[ 334.261614] sd 1:2:16:0: [sdb] Write Protect is off
[ 334.261616] sd 1:2:16:0: [sdb] Mode Sense: 46 00 10 08
[ 334.264430] sd 1:2:16:0: [sdb] Write cache: disabled, read cache: enabled, supports DPO and FUA
[ 334.325386] sdb: sdb1 sdb2 sdb3
[ 334.349896] sd 1:2:16:0: [sdb] Attached SCSI disk

/var/petitboot/mnt/dev/sda2/bsd # ./arcconf getlogs 1 event
Controllers found: 1
<ControllerLog controllerID="0" time="Wed Feb 26 16:52:47 2020">
<eventlog>
<event message="Previous Firmware Lockup Detected, Lockup Code=227 Detail=0x00000000" eventTag="1" relativeControllerTime="4" eventClassCode="12" eventSubClassCode="0" eventDetailCode="0"/>
<event message="Cache battery/Super cap is missing" eventTag="2" relativeControllerTime="4" eventClassCode="2" eventSubClassCode="4" eventDetailCode="2"/>
<event message="Encryption Self-Test failed" eventTag="3" relativeControllerTime="4" eventClassCode="2" eventSubClassCode="10" eventDetailCode="0"/>
<event message="Hot-plug drive removed, Port=C0 Box=1 Bay=0 SN= ZC19RD9E" eventTag="4" relativeControllerTime="335" eventClassCode="1" eventSubClassCode="0" eventDetailCode="0"/>
<event message="Physical drive failure, Port=C0 Box=1 Bay=0 reason=0x14" eventTag="5" relativeControllerTime="335" eventClassCode="4" eventSubClassCode="0" eventDetailCode="0"/>
<event message="Hot-plug drive inserted, Port=C0 Box=1 Bay=0 SN= ZC19RD9E" eventTag="6" relativeControllerTime="348" eventClassCode="1" eventSubClassCode="0" eventDetailCode="1"/>
<event message="Drive is re-enabled, Port=C0 Box=1 Bay=0" eventTag="7" relativeControllerTime="348" eventClassCode="4" eventSubClassCode="0" eventDetailCode="3"/>
</eventlog>
</ControllerLog>

So, the AIFs are about the drive being removed and then re-inserted after a few seconds, which explains the "Target Selection Timeout" errors that were being seen right after the AIF interrupts occurred.

However, further investigation is needed to understand why the drive is being removed. It could be due to a bad HDD/SAS expander cable, a write cache issue, a setup issue with the 2 SAS controllers/cabling on the machine, or something else.

Ignoring disk 1 of the test machine and using only disks 2 and 3 with FreeBSD, AIF interrupts still occur but cause no issues.
So, at least for now, I'll consider this "drive being removed and re-inserted" issue to be some sort of HDD or SAS cable issue.

Now I just need to remove the debug/test stuff from my local changes in order to post an aacraid driver fix/port to powerpc64 for review. This way others can start to test the changes too.

Added last 2 needed patches.

With all patches applied to CURRENT, the aacraid driver now *should* work on PowerPC64 machines. It works on power9-devref.

Piotr Kubaj, can you test this on your Talos II when you get a chance?

(In reply to Leandro Lupori from comment #7)
When trying to load aacraid.ko, I'm getting:
link_elf: symbol aac_fib_xporthdr_tole undefined

(In reply to Piotr Kubaj from comment #8)
Right, I had tested aacraid only integrated into the kernel, not as a module. I've added the missing aacraid module source file (aacraid_endian.c) and enabled it for ppc in https://reviews.freebsd.org/D23887.

(In reply to Leandro Lupori from comment #9)
It's detected:
aacraid0: <Adaptec RAID Controller> mem 0x80000000-0x800fffff,0x80180000-0x801803ff irq 1038328 at device 0.0 numa-domain 0 on pci5
aacraid0: Async. mode not supported by current driver, sync. mode enforced.
aacraid0: Enable Raw I/O
aacraid0: Enable 64-bit array
aacraid0: using line interrupts
aacraid0: SmartIOC 8i, aacraid driver 3.2.10-1
aacraidp0 numa-domain 0 on aacraid0
aacraidp1 numa-domain 0 on aacraid0
aacraidp2 numa-domain 0 on aacraid0
aacraidp3 numa-domain 0 on aacraid0
ses0 at aacraidp3 bus 0 scbus3 target 123 lun 0
da0 at aacraidp2 bus 0 scbus2 target 0 lun 0
da1 at aacraidp2 bus 0 scbus2 target 1 lun 0

I'll try to install FreeBSD on those disks later and run make buildworld on it to test whether it works.

(In reply to Leandro Lupori from comment #9)
Just a thought: if compiling in-kernel is required, then the GENERIC64 config should also include aacraid. This would also be in line with amd64.
(In reply to Piotr Kubaj from comment #11)
Compiling in-kernel is required to use a rootfs on aacraid-attached disks on PowerNV machines, which can't use loader to load modules. I'll add it to the GENERIC64 config, thanks for pointing it out.

I made a custom ISO that uses a kernel with the patched aacraid and installed FreeBSD to an HDD plugged into the aacraid card. Then I successfully ran buildworld on that installation. There weren't any I/O errors, so the driver seems stable.

A commit references this bug:

Author: luporl
Date: Wed Mar 4 12:21:38 UTC 2020
New revision: 358613
URL: https://svnweb.freebsd.org/changeset/base/358613

Log:
  [aacraid] Add missing unmap call for SYNC mode

  This issue was observed on a PowerPC64 machine with an Adaptec RAID Controller with PCI device ID 0x028d. After several read/write operations, the kernel was panic'ing in bus_dmamap_sync(). This was due to a missing aac_unmap_command() in the SYNC path.

  PR: 237463
  Reviewed by: jhibbits
  Differential Revision: https://reviews.freebsd.org/D23668

Changes:
  head/sys/dev/aacraid/aacraid.c

A commit references this bug:

Author: luporl
Date: Thu Mar 5 20:04:42 UTC 2020
New revision: 358684
URL: https://svnweb.freebsd.org/changeset/base/358684

Log:
  [aacraid] Port driver to big-endian

  Port aacraid driver to big-endian (BE) hosts.

  The immediate goal of this change is to make it possible to use the aacraid driver on PowerPC64 machines that have Adaptec Series 8 SAS controllers.

  Adapters supported by this driver expect FIB contents in little-endian (LE) byte order. All FIBs have a fixed header part as well as a data part that depends on the command being issued to the controller.

  In this way, on BE hosts, the FIB header and all FIB data structures used in aacraid.c and aacraid_cam.c need to be converted to LE before being sent to the adapter and converted to BE when coming from it.

  The functions to convert each struct are on aacraid_endian.c. For little-endian (LE) targets, they are macros that expand to nothing.

  In some cases, when only a few fields of a large structure are used, the fields are converted inline, by the code using them.

  PR: 237463
  Reviewed by: jhibbits
  Sponsored by: Eldorado Research Institute (eldorado.org.br)
  Differential Revision: https://reviews.freebsd.org/D23887

Changes:
  head/sys/amd64/conf/NOTES
  head/sys/conf/NOTES
  head/sys/conf/files.powerpc
  head/sys/dev/aacraid/aacraid.c
  head/sys/dev/aacraid/aacraid_cam.c
  head/sys/dev/aacraid/aacraid_endian.c
  head/sys/dev/aacraid/aacraid_endian.h
  head/sys/modules/Makefile
  head/sys/modules/aacraid/Makefile
  head/sys/powerpc/conf/GENERIC64

A commit references this bug:

Author: luporl
Date: Mon Mar 9 19:01:17 UTC 2020
New revision: 358814
URL: https://svnweb.freebsd.org/changeset/base/358814

Log:
  [aacraid] Handle both AIF and SYNC interrupts

  Without this change, if an AIF interrupt comes at the same time a SYNC command is finished, the SYNC interrupt will be lost. This happens because all interrupt bits (bellbits) are cleared, but only one of them is handled.

  Debugging shows that, (at least) when !sc->msi_enabled and (sc->flags & AAC_FLAGS_SYNC_MODE) is true (sync mode), both bits may be set at the same time.

  PR: 237463
  Reviewed by: scottl
  Sponsored by: Eldorado Research Institute (eldorado.org.br)
  Differential Revision: https://reviews.freebsd.org/D23859

Changes:
  head/sys/dev/aacraid/aacraid.c

(In reply to Leandro Lupori from comment #5)
About this issue of the drive being removed and then re-inserted after a few seconds, further investigation revealed that the problem was caused by two SAS controllers (Adaptec and LSI) being connected to all HDDs through a SAS expander cable. When the LSI controller's cable was disconnected, the issue stopped happening.

So apparently the issue was some kind of conflict or interference between the 2 adapters connected to the same disks and not a driver issue. |
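For readers following the r358814 log ("all interrupt bits are cleared, but only one of them is handled"), the lost-interrupt pattern can be illustrated with a small sketch. It is not the driver's code: the `DEMO_BELL_*` bit values, `struct demo_irq_stats` and both handlers are hypothetical, and the `else if` shape of the pre-fix handler is an assumption chosen only to reproduce the behaviour the commit message describes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical doorbell bits; the driver's real constants differ. */
#define DEMO_BELL_AIF	0x1u
#define DEMO_BELL_SYNC	0x2u

struct demo_irq_stats {
	unsigned aif_handled;
	unsigned sync_handled;
};

/*
 * Assumed pre-fix shape: the caller has already acknowledged (cleared)
 * every pending bit, but only the first matching bit is serviced, so a
 * SYNC completion that arrives together with an AIF is lost.
 */
void demo_intr_one(uint32_t bellbits, struct demo_irq_stats *st)
{
	assert(st != NULL);
	if (bellbits & DEMO_BELL_AIF)
		st->aif_handled++;
	else if (bellbits & DEMO_BELL_SYNC)
		st->sync_handled++;
}

/* Post-fix shape: every bit that was set is serviced independently. */
void demo_intr_all(uint32_t bellbits, struct demo_irq_stats *st)
{
	assert(st != NULL);
	if (bellbits & DEMO_BELL_AIF)
		st->aif_handled++;
	if (bellbits & DEMO_BELL_SYNC)
		st->sync_handled++;
}
```

Calling both handlers with `DEMO_BELL_AIF | DEMO_BELL_SYNC` shows the difference: the first services only the AIF, while the second services both events.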