Bug 261324

Summary:

xhci1 resets on ROCKPro64 when reading from ZFS pool

Product:

Base System

Reporter:

Will <freebsd>

Component:

arm

Assignee:

freebsd-arm (Nobody) <freebsd-arm>

Status:

New ---

Severity:

Affects Only Me

CC:

hselasky

Priority:

---

Version:

13.0-STABLE

Hardware:

arm64

OS:

Any

Attachments:

Description	Flags
Console log of running usbtest unsuccessfully	none
gzip-ed console log of running usbtest against three disks	none

Description Will 2022-01-19 01:08:07 UTC

I have a ROCKPro64 single board computer with 4GB of RAM that I am attempting to use as a small NAS. The storage consists of an encrypted (OpenZFS native, not GELI) RAID-Z pool using three 8TB SATA spinning disks and one 512GB SATA SSD (used as cache) in a simple 4-disk JBOD USB enclosure (the "OWC Mercury Elite Pro Quad"). My understanding is that it contains a USB 3 hub with four SATA-to-USB bridges in it.

The issue I run into is that after reading ~40GB of data from the ZFS filesystem, disk I/O stops and I get the following messages in dmesg:

xhci1: Resetting controller
uhub4: at usbus5, port 1, addr 1 (disconnected)
ugen5.7: <OWC Mercury Elite Pro Quad> at usbus5 (disconnected)
uhub7: at uhub4, port 1, addr 6 (disconnected)
uhub7: detached
ugen5.2: <OWC Mercury Elite Pro Quad> at usbus5 (disconnected)
uhub6: at uhub4, port 2, addr 1 (disconnected)
ugen5.3: <OWC Mercury Elite Pro Quad A> at usbus5 (disconnected)
(da0:umass-sim0:0:0:0): READ(16). CDB: 88 00 00 00 00 01 01 89 5c f8 00 00 00 80 00 00 
umass0: at uhub6, port 1, addr 2 (disconnected)
(da0:umass-sim0:0:0:0): CAM status: CCB request completed with an error
(da0:umass-sim0:0:0:0): Retrying command, 3 more tries remain
(da0:umass-sim0:0:0:0): READ(16). CDB: 88 00 00 00 00 01 01 89 5c f8 00 00 00 80 00 00 
(da0:umass-sim0:0:0:0): CAM status: CCB request completed with an error
(da0:umass-sim0:0:0:0): Retrying command, 2 more tries remain
(da0:umass-sim0:0:0:0): READ(16). CDB: 88 00 00 00 00 01 01 89 5c f8 00 00 00 80 00 00 
(da0:umass-sim0:0:0:0): CAM status: CCB request completed with an error
(da0:umass-sim0:0:0:0): Retrying command, 1 more tries remain
da0 at umass-sim0 bus 0 scbus0 target 0 lun 0
da0: <Mercury Elite Pro Quad A 0>  s/n 123456789004 detached
(da1:umass-sim1:1:0:0): READ(16). CDB: 88 00 00 00 00 01 01 89 5a 00 00 00 00 80 00 00 
(da1:umass-sim1:1:0:0): CAM status: CCB request completed with an error
(da1:umass-sim1:1:0:0): Retrying command, 3 more tries remain
(da2:umass-sim2:2:0:0): READ(16). CDB: 88 00 00 00 00 01 01 89 5d 78 00 00 00 80 00 00 
(da2:umass-sim2:2:0:0): CAM status: CCB request completed with an error
(da2:umass-sim2:2:0:0): Retrying command, 3 more tries remain
(da3:umass-sim3:3:0:0): WRITE(10). CDB: 2a 00 2d 5d 61 31 00 01 00 00 
(da3:umass-sim3:3:0:0): CAM status: CCB request completed with an error
(da3:umass-sim3:3:0:0): Retrying command, 3 more tries remain
(da1:umass-sim1:1:0:0): READ(16). CDB: 88 00 00 00 00 01 01 89 5a 00 00 00 00 80 00 00 
(da1:umass-sim1:1:0:0): CAM status: CCB request completed with an error
(da1:umass-sim1:1:0:0): Retrying command, 2 more tries remain
(da2:umass-sim2:2:0:0): READ(16). CDB: 88 00 00 00 00 01 01 89 5d 78 00 00 00 80 00 00 
(da2:umass-sim2:2:0:0): CAM status: CCB request completed with an error
(da2:umass-sim2:2:0:0): Retrying command, 2 more tries remain
(da3:umass-sim3:3:0:0): WRITE(10). CDB: 2a 00 2d 5d 61 31 00 01 00 00 
(da3:umass-sim3:3:0:0): CAM status: CCB request completed with an error
(da3:umass-sim3:3:0:0): Retrying command, 2 more tries remain
(da1:umass-sim1:1:0:0): READ(16). CDB: 88 00 00 00 00 01 01 89 5a 00 00 00 00 80 00 00 
(da1:umass-sim1:1:0:0): CAM status: CCB request completed with an error
(da1:umass-sim1:1:0:0): Retrying command, 1 more tries remain
(da2:umass-sim2:2:0:0): READ(16). CDB: 88 00 00 00 00 01 01 89 5d 78 00 00 00 80 00 00 
(da2:umass-sim2:2:0:0): CAM status: CCB request completed with an error
(da2:umass-sim2:2:0:0): Retrying command, 1 more tries remain
(da3:umass-sim3:3:0:0): WRITE(10). CDB: 2a 00 2d 5d 61 31 00 01 00 00 
(da3:umass-sim3:3:0:0): CAM status: CCB request completed with an error
(da3:umass-sim3:3:0:0): Retrying command, 1 more tries remain
(da1:umass-sim1:1:0:0): READ(16). CDB: 88 00 00 00 00 01 01 89 5a 00 00 00 00 80 00 00 
(da1:umass-sim1:1:0:0): CAM status: CCB request completed with an error
(da1:umass-sim1:1:0:0): Retrying command, 0 more tries remain
(da2:umass-sim2:2:0:0): READ(16). CDB: 88 00 00 00 00 01 01 89 5d 78 00 00 00 80 00 00 
(da2:umass-sim2:2:0:0): CAM status: CCB request completed with an error
(da2:umass-sim2:2:0:0): Retrying command, 0 more tries remain
(da3:umass-sim3:3:0:0): WRITE(10). CDB: 2a 00 2d 5d 61 31 00 01 00 00 
(da3:umass-sim3:3:0:0): CAM status: CCB request completed with an error
(da3:umass-sim3:3:0:0): Retrying command, 0 more tries remain
(da1:umass-sim1:1:0:0): READ(16). CDB: 88 00 00 00 00 01 01 89 5a 00 00 00 00 80 00 00 
(da1:umass-sim1:1:0:0): CAM status: CCB request completed with an error
(da1:umass-sim1:1:0:0): Error 5, Retries exhausted
(da2:umass-sim2:2:0:0): READ(16). CDB: 88 00 00 00 00 01 01 89 5d 78 00 00 00 80 00 00 
(da2:umass-sim2:2:0:0): CAM status: CCB request completed with an error
(da2:umass-sim2:2:0:0): Error 5, Retries exhausted
(da3:umass-sim3:3:0:0): WRITE(10). CDB: 2a 00 2d 5d 61 31 00 01 00 00 
(da3:umass-sim3:3:0:0): CAM status: CCB request completed with an error
(da3:umass-sim3:3:0:0): Error 5, Retries exhausted
(da1:umass-sim1:1:0:0): READ(16). CDB: 88 00 00 00 00 01 01 89 5b 80 00 00 08 00 00 00 
(da1:umass-sim1:1:0:0): CAM status: CCB request completed with an error
(da1:umass-sim1:1:0:0): Retrying command, 3 more tries remain
(da2:umass-sim2:2:0:0): READ(16). CDB: 88 00 00 00 00 01 01 89 5e f8 00 00 01 80 00 00 
(da2:umass-sim2:2:0:0): CAM status: CCB request completed with an error
(da2:umass-sim2:2:0:0): Retrying command, 3 more tries remain
(da3:umass-sim3:3:0:0): WRITE(10). CDB: 2a 00 2d 5d 63 31 00 01 00 00 
(da3:umass-sim3:3:0:0): CAM status: CCB request completed with an error
(da3:umass-sim3:3:0:0): Retrying command, 3 more tries remain
(da1:umass-sim1:1:0:0): READ(16). CDB: 88 00 00 00 00 01 01 89 5b 80 00 00 08 00 00 00 
(da1:umass-sim1:1:0:0): CAM status: CCB request completed with an error
(da1:umass-sim1:1:0:0): Retrying command, 2 more tries remain
(da2:umass-sim2:2:0:0): READ(16). CDB: 88 00 00 00 00 01 01 89 5e f8 00 00 01 80 00 00 
(da2:umass-sim2:2:0:0): CAM status: CCB request completed with an error
(da2:umass-sim2:2:0:0): Retrying command, 2 more tries remain
(da3:umass-sim3:3:0:0): WRITE(10). CDB: 2a 00 2d 5d 63 31 00 01 00 00 
(da3:umass-sim3:3:0:0): CAM status: CCB request completed with an error
(da3:umass-sim3:3:0:0): Retrying command, 2 more tries remain
(da1:umass-sim1:1:0:0): READ(16). CDB: 88 00 00 00 00 01 01 89 5b 80 00 00 08 00 00 00 
(da1:umass-sim1:1:0:0): CAM status: CCB request completed with an error
(da1:umass-sim1:1:0:0): Retrying command, 1 more tries remain
(da2:umass-sim2:2:0:0): READ(16). CDB: 88 00 00 00 00 01 01 89 5e f8 00 00 01 80 00 00 
(da2:umass-sim2:2:0:0): CAM status: CCB request completed with an error
(da2:umass-sim2:2:0:0): Retrying command, 1 more tries remain
(da3:umass-sim3:3:0:0): WRITE(10). CDB: 2a 00 2d 5d 63 31 00 01 00 00 
(da3:umass-sim3:3:0:0): CAM status: CCB request completed with an error
(da3:umass-sim3:3:0:0): Retrying command, 1 more tries remain
(da1:umass-sim1:1:0:0): READ(16). CDB: 88 00 00 00 00 01 01 89 5b 80 00 00 08 00 00 00 
(da1:umass-sim1:1:0:0): CAM status: CCB request completed with an error
(da1:umass-sim1:1:0:0): Retrying command, 0 more tries remain
(da2:umass-sim2:2:0:0): READ(16). CDB: 88 00 00 00 00 01 01 89 5e f8 00 00 01 80 00 00 
(da2:umass-sim2:2:0:0): CAM status: CCB request completed with an error
(da2:umass-sim2:2:0:0): Retrying command, 0 more tries remain
(da3:umass-sim3:3:0:0): WRITE(10). CDB: 2a 00 2d 5d 63 31 00 01 00 00 
(da3:umass-sim3:3:0:0): CAM status: CCB request completed with an error
(da3:umass-sim3:3:0:0): Retrying command, 0 more tries remain
(da1:umass-sim1:1:0:0): READ(16). CDB: 88 00 00 00 00 01 01 89 5b 80 00 00 08 00 00 00 
(da1:umass-sim1:1:0:0): CAM status: CCB request completed with an error
(da1:umass-sim1:1:0:0): Error 5, Retries exhausted
Jan  1 05:47:13 generic ZFS[5395]: vdev probe failure, zpool=zraid-v5 path=/dev/da1
Jan  1 05:47:13 generic ZFS[5399]: vdev probe failure, zpool=zraid-v5 path=/dev/da1
(da2:umass-sim2:2:0:0): READ(16). CDB: 88 00 00 00 00 01 01 89 5e f8 00 00 01 80 00 00 
(da2:umass-sim2:2:0:0): CAM status: CCB request completed with an error
(da2:umass-sim2:2:0:0): Error 5, Retries exhausted
Solaris: WARNING: Pool 'zraid-v5' has encountered an uncorrectable I/O failure and has been suspended.

Solaris: WARNING: Pool 'zraid-v5' has encountered an uncorrectable I/O failure and has been suspended.

Solaris: WARNING: Pool 'zraid-v5' has encountered an uncorrectable I/O failure and has been suspended.

Solaris: WARNING: Pool 'zraid-v5' has encountered an uncorrectable I/O failure and has been suspended.

Solaris: WARNING: Pool 'zraid-v5' has encountered an uncorrectable I/O failure and has been suspended.

Solaris: WARNING: Pool 'zraid-v5' has encountered an uncorrectable I/O failure and has been suspended.

Solaris: WARNING: Pool 'zraid-v5' has encountered an uncorrectable I/O failure and has been suspended.

Solaris: WARNING: Pool 'zraid-v5' has encountered an uncorrectable I/O failure and has been suspended.

Solaris: WARNING: Pool 'zraid-v5' has encountered an uncorrectable I/O failure and has been suspended.

Solaris: WARNING: Pool 'zraid-v5' has encountered an uncorrectable I/O failure and has been suspended.

Solaris: WARNING: Pool 'zraid-v5' has encountered an uncorrectable I/O failure and has been suspended.

Solaris: WARNING: Pool 'zraid-v5' has encountered an uncorrectable I/O failure and has been suspended.

Solaris: WARNING: Pool 'zraid-v5' has encountered an uncorrectable I/O failure and has been suspended.

Solaris: WARNING: Pool 'zraid-v5' has encountered an uncorrectable I/O failure and has been suspended.

Solaris: WARNING: Pool 'zraid-v5' has encountered an uncorrectable I/O failure and has been suspended.

Solaris: WARNING: Pool 'zraid-v5' has encountered an uncorrectable I/O failure and has been suspended.

Solaris: WARNING: Pool 'zraid-v5' has encountered an uncorrectable I/O failure and has been suspended.

Jan  1 05:47:14 generic ZFS[5423]: vdev probe failure, zpool=zraid-v5 path=/dev/da2
Jan  1 05:47:14 generic ZFS[5447]: catastrophic pool I/O failure, zpool=zraid-v5
Jan  1 05:47:15 generic ZFS[5495]: catastrophic pool I/O failure, zpool=zraid-v5
Jan  1 05:47:15 generic ZFS[5499]: catastrophic pool I/O failure, zpool=zraid-v5
Jan  1 05:47:15 generic ZFS[5503]: catastrophic pool I/O failure, zpool=zraid-v5
Jan  1 05:47:15 generic ZFS[5507]: catastrophic pool I/O failure, zpool=zraid-v5
Jan  1 05:47:15 generic ZFS[5511]: catastrophic pool I/O failure, zpool=zraid-v5
Jan  1 05:47:15 generic ZFS[5515]: catastrophic pool I/O failure, zpool=zraid-v5
Jan  1 05:47:15 generic ZFS[5519]: catastrophic pool I/O failure, zpool=zraid-v5
Jan  1 05:47:15 generic ZFS[5523]: catastrophic pool I/O failure, zpool=zraid-v5
Jan  1 05:47:15 generic ZFS[5527]: catastrophic pool I/O failure, zpool=zraid-v5
Jan  1 05:47:15 generic ZFS[5531]: catastrophic pool I/O failure, zpool=zraid-v5
Jan  1 05:47:15 generic ZFS[5535]: catastrophic pool I/O failure, zpool=zraid-v5
Jan  1 05:47:15 generic ZFS[5539]: catastrophic pool I/O failure, zpool=zraid-v5
Jan  1 05:47:15 generic ZFS[5543]: catastrophic pool I/O failure, zpool=zraid-v5
Jan  1 05:47:15 generic ZFS[5547]: catastrophic pool I/O failure, zpool=zraid-v5
Jan  1 05:47:15 generic ZFS[5551]: catastrophic pool I/O failure, zpool=zraid-v5
Jan  1 05:47:15 generic ZFS[5555]: catastrophic pool I/O failure, zpool=zraid-v5
(da3:umass-sim3:3:0:0): WRITE(10). CDB: 2a 00 2d 5d 63 31 00 01 00 00 
(da3:umass-sim3:3:0:0): CAM status: CCB request completed with an error
(da3:umass-sim3:3:0:0): Error 5, Retries exhausted


...and so on, including further complaints from Solaris, ZFS, and CAM. The system is still operable, and I can SSH in, but the ZFS pool is unresponsive. I can tell FreeBSD to shut down, and although it will start to do so (disconnecting me from SSH), it will not complete powering off: SSH connections are refused and it still responds to pings. If I unplug the SBC's power, disconnect the enclosure's USB cable (since the ROCKPro64 won't boot with it connected, another issue I'll report separately), then boot the SBC and reconnect the USB, things will start working again. (The ZFS pool will sometimes complain about errors that I need to clear, but if I scrub the pool it confirms that no data was corrupted.)

Interestingly, I seem to be able to scrub the pool fine: despite the pool having several TB of data, the issue hasn't appeared then. Additionally, I can use dd to read from the individual block devices for many hours without issue, and can even do so from all the disks simultaneously (averaging ~100MB/s per disk, so ~400MB/s total) without seeing the issue. But if I mount the ZFS filesystem (I do so read-only during my tests to try to reduce risk to the data) and start making a tarball of said filesystem (writing said tar stream to /dev/null), it gets about 40GB in and then the issue appears. The exact amount of data it's able to read (as measured by piping the stream through pv) varies a bit: my last three tests have failed at 43.3, 40.6, and 54.2 GiB, respectively. The issue is 100% repeatable: although the amount of data it's able to read before failing varies, it always fails.

I've tried both the FreeBSD-13.0-RELEASE-arm64-aarch64-ROCKPRO64.img.xz and FreeBSD-13.0-STABLE-arm64-aarch64-ROCKPRO64-20211230-3684bb89d52-248759.img images, but they both have the issue. I've tried Linux on the same SBC with the same ZFS pool, and it had no issue. I tried a Raspberry Pi 4 running FreeBSD 13.0 RELEASE with the same ZFS pool, and it had no issue. I should note that both the Linux and Raspberry Pi tests lacked hardware crypto support (Linux because Linux apparently doesn't support ARM crypto in ZFS yet, and Raspberry Pi because those have no crypto hardware), so they both relied on software crypto which constrained performance to tens of MB/s, rather than hundreds as FreeBSD on the ROCKPro64 does in my problem case. To that end I also used the ZFS pool with an Intel system running FreeBSD 13, which has working hardware crypto and it was able to work with the disks even faster than the ROCKPro64, with no issues. So my suspicion is the enclosure is not the culprit, since it worked fine on RPi and Intel systems. I also suspect the SBC itself is not the culprit since it worked fine with Linux, though it's possible the issue only shows up when reading from the disks at a particular speed and Linux's constrained crypto performance just didn't reach that speed. I'm also able to work with other disks just fine with FreeBSD on the ROCKPro64: it's only this multi-disk enclosure that has trouble. I can create an encrypted ZFS pool on a USB 3 SSD connected directly to the SBC (no intermediary hub) and it works fine, no trouble at all.

Initially I suspected that my SBC or enclosure might be at fault, but through this testing I've come to instead suspect there is some sort of issue in the xhci driver for the ROCKPro64, perhaps somehow related to the enclosure having a USB hub in front of the disks. But I am only speculating! Could someone help me troubleshoot this issue further, to see why xhci1 is resetting unexpectedly? Thank you for your help!

Comment 1 Hans Petter Selasky freebsd_committer

2022-01-23 15:14:52 UTC

Hi,

Are there any USB power issues?

You'll find the "Resetting controller" print in /usr/src/sys/dev/usb/controller/xhci.c .

Maybe it is a false alarm. Do you see any errors before the Resetting controller print?

--HPS

Comment 2 Will 2022-01-23 23:28:07 UTC

(In reply to Hans Petter Selasky from comment #1)

I've not noticed any issues that appear related to USB power: the USB drive enclosure in question has its own power adapter, it is not bus-powered. Plus it's not had the problem when the ROCKPro64 runs Linux, or when the enclosure is used with Intel-based or Raspberry Pi 4 computers. It's just the specific combination of FreeBSD with the ROCKPro64 that gives me trouble.

I don't see any related messages before the error in the logs. Here are two examples:

Jan  1 06:22:32 generic kernel: ugen5.7: <OWC Mercury Elite Pro Quad D> at usbus5
Jan  1 06:22:32 generic kernel: umass3 on uhub7
Jan  1 06:22:32 generic kernel: umass3: <OWC Mercury Elite Pro Quad D, class 0/0, rev 3.10/1.00, addr 6> on usbus5
Jan  1 06:22:32 generic kernel: umass3:  SCSI over Bulk-Only; quirks = 0x0100
Jan  1 06:22:32 generic kernel: umass3:3:3: Attached to scbus3
Jan  1 06:22:32 generic kernel: da3 at umass-sim3 bus 3 scbus3 target 0 lun 0
Jan  1 06:22:32 generic kernel: da3: <Mercury Elite Pro Quad D 0> Fixed Direct Access SPC-4 SCSI device
Jan  1 06:22:32 generic kernel: da3: Serial Number 123456789015
Jan  1 06:22:32 generic kernel: da3: 400.000MB/s transfers
Jan  1 06:22:32 generic kernel: da3: 488386MB (1000215216 512 byte sectors)
Jan  1 06:22:32 generic kernel: da3: quirks=0x2<NO_6_BYTE>
Jan  1 06:22:45 generic kernel: ZFS filesystem version: 5
Jan  1 06:22:45 generic kernel: ZFS storage pool version: features support (5000)
Jan  1 06:33:27 generic kernel: xhci1: Resetting controller
Jan  1 06:33:27 generic kernel: (da3:umass-sim3:3:0:0): WRITE(10). CDB: 2a 00 2d c5 2f 44 00 01 00 00 
Jan  1 06:33:27 generic kernel: (da3:umass-sim3:3:0:0): CAM status: CCB request completed with an error
Jan  1 06:33:27 generic kernel: (da3:umass-sim3:3:0:0): Retrying command, 3 more tries remain
Jan  1 06:33:27 generic kernel: uhub4: at usbus5, port 1, addr 1 (disconnected)
Jan  1 06:33:27 generic kernel: ugen5.2: <OWC Mercury Elite Pro Quad> at usbus5 (disconnected)
Jan  1 06:33:27 generic kernel: uhub6: at uhub4, port 1, addr 1 (disconnected)
Jan  1 06:33:28 generic kernel: uhub6: detached
Jan  1 06:33:29 generic kernel: ugen5.3: <OWC Mercury Elite Pro Quad> at usbus5 (disconnected)
Jan  1 06:33:29 generic kernel: uhub7: at uhub4, port 2, addr 2 (disconnected)
Jan  1 06:33:30 generic kernel: ugen5.4: <OWC Mercury Elite Pro Quad A> at usbus5 (disconnected)
Jan  1 06:33:30 generic kernel: umass0: at uhub7, port 1, addr 3 (disconnected)

So there you see the last of the 4 drives coming up at 06:22 (I trimmed off the others, they're in immediate succession), then me mounting the ZFS filesystem a few seconds later, then 11 minutes of silence before xhci1 resets. Here's another example, earlier that day:

Jan  1 00:27:32 generic dhclient[724]: send_packet: Host is down
Jan  1 00:27:49 generic syslogd: last message repeated 2 times
Jan  1 01:15:09 generic syslogd: last message repeated 1 times
Jan  1 01:15:21 generic syslogd: last message repeated 1 times
Jan  1 03:00:28 generic syslogd: last message repeated 1 times
Jan  1 03:01:06 generic syslogd: last message repeated 3 times
Jan  1 04:39:06 generic syslogd: last message repeated 1 times
Jan  1 05:26:37 generic dhclient[724]: send_packet: Host is down
Jan  1 05:27:08 generic syslogd: last message repeated 2 times
Jan  1 05:27:32 generic syslogd: last message repeated 2 times
Jan  1 05:45:06 generic kernel: xhci1: Resetting controller
Jan  1 05:45:06 generic kernel: uhub4: at usbus5, port 1, addr 1 (disconnected)
Jan  1 05:45:06 generic kernel: ugen5.7: <OWC Mercury Elite Pro Quad> at usbus5 (disconnected)
Jan  1 05:45:06 generic kernel: uhub7: at uhub4, port 1, addr 6 (disconnected)
Jan  1 05:45:07 generic kernel: uhub7: detached
Jan  1 05:45:08 generic kernel: ugen5.2: <OWC Mercury Elite Pro Quad> at usbus5 (disconnected)
Jan  1 05:45:08 generic kernel: uhub6: at uhub4, port 2, addr 1 (disconnected)
Jan  1 05:45:10 generic kernel: ugen5.3: <OWC Mercury Elite Pro Quad A> at usbus5 (disconnected)
Jan  1 05:45:10 generic kernel: (da0:umass-sim0:0:0:0): READ(16). CDB: 88 00 00 00 00 01 01 89 5c f8 00 00 00 80 00 00 
Jan  1 05:45:10 generic kernel: umass0: at uhub6, port 1, addr 2 (disconnected)
Jan  1 05:45:10 generic kernel: (da0:umass-sim0:0:0:0): CAM status: CCB request completed with an error
Jan  1 05:45:10 generic kernel: (da0:umass-sim0:0:0:0): Retrying command, 3 more tries remain
Jan  1 05:45:10 generic kernel: (da0:umass-sim0:0:0:0): READ(16). CDB: 88 00 00 00 00 01 01 89 5c f8 00 00 00 80 00 00 
Jan  1 05:45:10 generic kernel: (da0:umass-sim0:0:0:0): CAM status: CCB request completed with an error
Jan  1 05:45:10 generic kernel: (da0:umass-sim0:0:0:0): Retrying command, 2 more tries remain
Jan  1 05:45:10 generic kernel: (da0:umass-sim0:0:0:0): READ(16). CDB: 88 00 00 00 00 01 01 89 5c f8 00 00 00 80 00 00 
Jan  1 05:45:10 generic kernel: (da0:umass-sim0:0:0:0): CAM status: CCB request completed with an error
Jan  1 05:45:10 generic kernel: (da0:umass-sim0:0:0:0): Retrying command, 1 more tries remain

Comment 3 Will 2022-03-11 20:35:49 UTC

I've done some further investigation that reinforces my suspicion that this is a FreeBSD issue, and not an issue with any of the hardware involved.

Previously, although my tests reproduced the problem consistently in FreeBSD and not in Linux on the same ROCKPro64, there was a complication around the speed at which the two operating systems could use the disk. FreeBSD could read the disk quite fast (~150MB/s), but Linux only read at a fraction of that because the pool was encrypted, and ZFS on Linux does not yet support ARM hardware crypto. I was thus concerned that if the issue was actually hardware (either the USB enclosure or the ROCKPro64) and only occurred at higher disk speeds, then the constrained Linux speed might be inadvertently preventing the issue, and it would appear under Linux if it could work with the disks faster.

So, I removed my encrypted pool's disks from the enclosure and replaced them with three smaller disks and created a new, unencrypted, lz4-compressed ZFS pool on it. Same enclosure, same ROCKPro64, same everything except for the disks in the enclosure. Previously I used an eMMC module for the OS, but to simplify switching between operating systems I removed that and flashed two SD cards with the following images:

FreeBSD-14.0-CURRENT-arm64-aarch64-ROCKPRO64-20220310-f1d450ddee6-253697.img.xz
Armbian_21.08.1_Rockpro64_focal_current_5.10.60.img.xz

(The latter is Armbian's Ubuntu 20.04.4 release.)

I wrote 1.67TB from /dev/urandom to a file on the ZFS filesystem, then tested reading it on both operating systems. FreeBSD still ran into the same issue I described previously, usually after just a few tens of GB but it did once make it to ~250GB before failing. But it failed 100% of the time, with the same symptoms as before: responsive OS, but any command dealing with the pool would stall indefinitely, and I couldn't complete a graceful shutdown. Similar CAM and USB errors in dimes. Hard resetting the ROCKPro64 would recover it. Ubuntu, by contrast, had no problems whatsoever and was able to read the file in its entirety repeatedly over the course of a day. I repeated all of the tests multiple times.

Given that Ubuntu on the same ROCKPro64 with the same enclosure and the same ZFS pool has no problem, and switching to FreeBSD with nothing else changed introduces the problem, I am much more confident in my prior assumption that the problem is within FreeBSD somewhere.

Is there any further testing or information that I could provide to help figure out what's going on here? I'm happy to help. Thanks to everyone here!

Comment 4 Hans Petter Selasky freebsd_committer

2022-03-11 21:03:59 UTC

Hi,

If you have a spare disk, it would be interesting to see the result from a long test using /usr/src/tools/tools/usbtest .

You basically select mass storage, read and write mode, duration 7000 seconds, for example.

--HPS

Comment 5 Will 2022-03-12 07:20:54 UTC

I git cloned /usr/src and did a make of /usr/src/tools/tools/usbtest. I run the built usbtest as root, I select the first drive in the USB enclosure, go into the "Mass Storage" section, toggle I/O mode to "Read and Write", set the "test duration" to 7000, then "start test", but I immediately get "Could not find USB endpoints". I've poked around in the menus but didn't find anything that changed that outcome. I couldn't find any documentation for the tool either. Sorry to demand hand-holding, but what am I missing here?

Comment 6 Hans Petter Selasky freebsd_committer

2022-03-12 08:44:53 UTC

Could you paste the console output of that usbtest run.

"usbtest" must be run as root.

Did you select like this in the menu:

   2) This computer is Running the Host Side

      1) Select USB device (VID=0x0000, PID=0x0000)

      4) Mass Storage (UMASS)

--HPS

Comment 7 Will 2022-03-12 18:35:10 UTC

Created attachment 232407 [details]
Console log of running usbtest unsuccessfully

Comment 8 Will 2022-03-12 18:37:08 UTC

Yes, that's how I'm running it. I've attached the console output for you. I also tried disconnecting the enclosure and instead connecting a USB 2.0 thumb drive, and usbtest seemed to work with that: it didn't get that that error and it ran for a while doing various tests. (I can include those results if you like, but assume they're not relevant since that's not the USB device I'm having trouble with.) I am not sure what "Could not find USB endpoints" means (my knowledge of both C code and USB's under-the-hood operation are limited), could it perhaps be related to the enclosure containing a USB hub? I believe that's how it's able to have 4 SATA-to-USB bridges in a single enclosure with a single USB connection.

Comment 9 Hans Petter Selasky freebsd_committer

2022-03-12 20:26:25 UTC

Could you run:

usbconfig -d ugen5.4 dump_device_desc dump_curr_config_desc

--HPS

Comment 10 Will 2022-03-12 20:42:45 UTC

[freebsd@generic ~]$ sudo usbconfig -d ugen5.4 dump_device_desc dump_curr_config_desc
ugen5.4: <OWC Mercury Elite Pro Quad A> at usbus5, cfg=0 md=HOST spd=SUPER (5.0Gbps) pwr=ON (0mA)

  bLength = 0x0012 
  bDescriptorType = 0x0001 
  bcdUSB = 0x0310 
  bDeviceClass = 0x0000  <Probed by interface class>
  bDeviceSubClass = 0x0000 
  bDeviceProtocol = 0x0000 
  bMaxPacketSize0 = 0x0009 
  idVendor = 0x1e91 
  idProduct = 0xa4a7 
  bcdDevice = 0x0100 
  iManufacturer = 0x0002  <OWC>
  iProduct = 0x0003  <Mercury Elite Pro Quad A>
  iSerialNumber = 0x0001  <123456789004>
  bNumConfigurations = 0x0001 


 Configuration index 0

    bLength = 0x0009 
    bDescriptorType = 0x0002 
    wTotalLength = 0x0079 
    bNumInterfaces = 0x0001 
    bConfigurationValue = 0x0001 
    iConfiguration = 0x0000  <no string>
    bmAttributes = 0x00c0 
    bMaxPower = 0x0000 

    Interface 0
      bLength = 0x0009 
      bDescriptorType = 0x0004 
      bInterfaceNumber = 0x0000 
      bAlternateSetting = 0x0000 
      bNumEndpoints = 0x0002 
      bInterfaceClass = 0x0008  <Mass storage>
      bInterfaceSubClass = 0x0006 
      bInterfaceProtocol = 0x0050 
      iInterface = 0x0000  <no string>

     Endpoint 0
        bLength = 0x0007 
        bDescriptorType = 0x0005 
        bEndpointAddress = 0x0081  <IN>
        bmAttributes = 0x0002  <BULK>
        wMaxPacketSize = 0x0400 
        bInterval = 0x0000 
        bRefresh = 0x0000 
        bSynchAddress = 0x0000 

      Additional Descriptor

      bLength = 0x06
      bDescriptorType = 0x30
      bDescriptorSubType = 0x0f
       RAW dump: 
       0x00 | 0x06, 0x30, 0x0f, 0x00, 0x00, 0x00


     Endpoint 1
        bLength = 0x0007 
        bDescriptorType = 0x0005 
        bEndpointAddress = 0x0002  <OUT>
        bmAttributes = 0x0002  <BULK>
        wMaxPacketSize = 0x0400 
        bInterval = 0x0000 
        bRefresh = 0x0000 
        bSynchAddress = 0x0000 

      Additional Descriptor

      bLength = 0x06
      bDescriptorType = 0x30
      bDescriptorSubType = 0x0f
       RAW dump: 
       0x00 | 0x06, 0x30, 0x0f, 0x00, 0x00, 0x00



    Interface 0 Alt 1
      bLength = 0x0009 
      bDescriptorType = 0x0004 
      bInterfaceNumber = 0x0000 
      bAlternateSetting = 0x0001 
      bNumEndpoints = 0x0004 
      bInterfaceClass = 0x0008  <Mass storage>
      bInterfaceSubClass = 0x0006 
      bInterfaceProtocol = 0x0062 
      iInterface = 0x0000  <no string>

     Endpoint 0
        bLength = 0x0007 
        bDescriptorType = 0x0005 
        bEndpointAddress = 0x0081  <IN>
        bmAttributes = 0x0002  <BULK>
        wMaxPacketSize = 0x0400 
        bInterval = 0x0000 
        bRefresh = 0x0000 
        bSynchAddress = 0x0000 

      Additional Descriptor

      bLength = 0x06
      bDescriptorType = 0x30
      bDescriptorSubType = 0x0f
       RAW dump: 
       0x00 | 0x06, 0x30, 0x0f, 0x05, 0x00, 0x00


      Additional Descriptor

      bLength = 0x04
      bDescriptorType = 0x24
      bDescriptorSubType = 0x03
       RAW dump: 
       0x00 | 0x04, 0x24, 0x03, 0x00


     Endpoint 1
        bLength = 0x0007 
        bDescriptorType = 0x0005 
        bEndpointAddress = 0x0002  <OUT>
        bmAttributes = 0x0002  <BULK>
        wMaxPacketSize = 0x0400 
        bInterval = 0x0000 
        bRefresh = 0x0000 
        bSynchAddress = 0x0000 

      Additional Descriptor

      bLength = 0x06
      bDescriptorType = 0x30
      bDescriptorSubType = 0x0f
       RAW dump: 
       0x00 | 0x06, 0x30, 0x0f, 0x05, 0x00, 0x00


      Additional Descriptor

      bLength = 0x04
      bDescriptorType = 0x24
      bDescriptorSubType = 0x04
       RAW dump: 
       0x00 | 0x04, 0x24, 0x04, 0x00


     Endpoint 2
        bLength = 0x0007 
        bDescriptorType = 0x0005 
        bEndpointAddress = 0x0083  <IN>
        bmAttributes = 0x0002  <BULK>
        wMaxPacketSize = 0x0400 
        bInterval = 0x0000 
        bRefresh = 0x0000 
        bSynchAddress = 0x0000 

      Additional Descriptor

      bLength = 0x06
      bDescriptorType = 0x30
      bDescriptorSubType = 0x0f
       RAW dump: 
       0x00 | 0x06, 0x30, 0x0f, 0x05, 0x00, 0x00


      Additional Descriptor

      bLength = 0x04
      bDescriptorType = 0x24
      bDescriptorSubType = 0x02
       RAW dump: 
       0x00 | 0x04, 0x24, 0x02, 0x00


     Endpoint 3
        bLength = 0x0007 
        bDescriptorType = 0x0005 
        bEndpointAddress = 0x0004  <OUT>
        bmAttributes = 0x0002  <BULK>
        wMaxPacketSize = 0x0400 
        bInterval = 0x0000 
        bRefresh = 0x0000 
        bSynchAddress = 0x0000 

      Additional Descriptor

      bLength = 0x06
      bDescriptorType = 0x30
      bDescriptorSubType = 0x00
       RAW dump: 
       0x00 | 0x06, 0x30, 0x00, 0x00, 0x00, 0x00


      Additional Descriptor

      bLength = 0x04
      bDescriptorType = 0x24
      bDescriptorSubType = 0x01
       RAW dump: 
       0x00 | 0x04, 0x24, 0x01, 0x00




[freebsd@generic ~]$

Comment 11 Hans Petter Selasky freebsd_committer

2022-03-13 13:43:37 UTC

I don't see why usbtest is not attaching.

Could you add some printf's to the "find_usb_endpoints()" function in usbtest, why it does not enter and find two endpoints on interface zero?

--HPS

Comment 12 Hans Petter Selasky freebsd_committer

2022-03-13 14:21:37 UTC

I found it.

You have multiple devices with same USB vendor and product ID.

Try this usbtest patch:
https://cgit.freebsd.org/src/commit/?id=16346e1401b8b369e251bc70781349fb9b813cef

Then re-run.

Thank you!

--HPS

Comment 13 Will 2022-03-30 15:11:29 UTC

Created attachment 232819 [details]
gzip-ed console log of running usbtest against three disks

Comment 14 Will 2022-03-30 15:20:10 UTC

Apologies for the delay, I was away from home (and the devices in question) for two weeks. I  rebuilt usbtest from the "main" branch (which appeared to include your changes) and am now able to run it. I have drives in the first three bays of the enclosure ("A", "B", & "C") so I ran the "mass storage" test against each of them in order, using read/write mode and 7000 second duration as you suggested. I've attached a gzip-compressed log of the console output. I did notice timeouts on the third one ("C"), but will leave the interpretation to you. Thanks again for your help!

Comment 15 Hans Petter Selasky freebsd_committer

2022-03-31 06:21:39 UTC

Hi Will,

There are several timeouts on device "C":

TIMEOUT: Trying to get CSW again. 8 tries left.

But not always the timeout fails! I suspect this may not be a fault of your USB devices, but the XHCI controller. Did you get any errors in dmesg while running usbtest?

I suspect there is a IRQ problem or something XHCI'ish is bad.

A wild guess, does entering:

set hw.usb.xhci.dcepquirk=1

In the loader help?

--HPS

Comment 16 Will 2022-04-14 18:35:20 UTC

So I have spent a lot of time trying to reproduce those timeout errors in usbtest, but have been unsuccessful. I have done dozens of tests, both read and read/write, many as long as 8 hours in length, dozens of times on all three drives in the bay. None have reproduced those timeout errors we saw before. However, I can reproduce the ZFS freezing issue 100% reliably, so that issue persists unaffected. I did try toggling hw.usb.xhci.dcepquirk between 0 and 1 and it seemed to make no difference on either subtest's ability to cause timeouts, or the appearance of the ZFS freezing issue. Do you have any other ideas for how to try to reproduce whatever underlies the ZFS freezing issue, using usbtest? I see a lot of options in there but do not understand them well enough to know what to try.

Thank you again for your help!

Comment 17 Hans Petter Selasky freebsd_committer

2022-04-15 12:29:42 UTC

1)
Maybe you can use usbdump to catch the exact traffic pattern before freeze?

Use the -s parameter to limit how much is captured, like "-s 128" .

2)
Maybe that the test is read-only makes a difference:

 1) Toggle I/O mode: <Read Only>

Make sure R/W test will destroy the contents of your disk!

--HPS