Summary: | xhci1 resets on ROCKPro64 when reading from ZFS pool | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | Base System | Reporter: | Will <freebsd> | ||||||
Component: | arm | Assignee: | freebsd-arm (Nobody) <freebsd-arm> | ||||||
Status: | New --- | ||||||||
Severity: | Affects Only Me | CC: | hselasky | ||||||
Priority: | --- | ||||||||
Version: | 13.0-STABLE | ||||||||
Hardware: | arm64 | ||||||||
OS: | Any | ||||||||
Attachments: |
|
Description
Will
2022-01-19 01:08:07 UTC
Hi, Are there any USB power issues? You'll find the "Resetting controller" print in /usr/src/sys/dev/usb/controller/xhci.c . Maybe it is a false alarm. Do you see any errors before the Resetting controller print? --HPS (In reply to Hans Petter Selasky from comment #1) I've not noticed any issues that appear related to USB power: the USB drive enclosure in question has its own power adapter, it is not bus-powered. Plus it's not had the problem when the ROCKPro64 runs Linux, or when the enclosure is used with Intel-based or Raspberry Pi 4 computers. It's just the specific combination of FreeBSD with the ROCKPro64 that gives me trouble. I don't see any related messages before the error in the logs. Here are two examples: Jan 1 06:22:32 generic kernel: ugen5.7: <OWC Mercury Elite Pro Quad D> at usbus5 Jan 1 06:22:32 generic kernel: umass3 on uhub7 Jan 1 06:22:32 generic kernel: umass3: <OWC Mercury Elite Pro Quad D, class 0/0, rev 3.10/1.00, addr 6> on usbus5 Jan 1 06:22:32 generic kernel: umass3: SCSI over Bulk-Only; quirks = 0x0100 Jan 1 06:22:32 generic kernel: umass3:3:3: Attached to scbus3 Jan 1 06:22:32 generic kernel: da3 at umass-sim3 bus 3 scbus3 target 0 lun 0 Jan 1 06:22:32 generic kernel: da3: <Mercury Elite Pro Quad D 0> Fixed Direct Access SPC-4 SCSI device Jan 1 06:22:32 generic kernel: da3: Serial Number 123456789015 Jan 1 06:22:32 generic kernel: da3: 400.000MB/s transfers Jan 1 06:22:32 generic kernel: da3: 488386MB (1000215216 512 byte sectors) Jan 1 06:22:32 generic kernel: da3: quirks=0x2<NO_6_BYTE> Jan 1 06:22:45 generic kernel: ZFS filesystem version: 5 Jan 1 06:22:45 generic kernel: ZFS storage pool version: features support (5000) Jan 1 06:33:27 generic kernel: xhci1: Resetting controller Jan 1 06:33:27 generic kernel: (da3:umass-sim3:3:0:0): WRITE(10). CDB: 2a 00 2d c5 2f 44 00 01 00 00 Jan 1 06:33:27 generic kernel: (da3:umass-sim3:3:0:0): CAM status: CCB request completed with an error Jan 1 06:33:27 generic kernel: (da3:umass-sim3:3:0:0): Retrying command, 3 more tries remain Jan 1 06:33:27 generic kernel: uhub4: at usbus5, port 1, addr 1 (disconnected) Jan 1 06:33:27 generic kernel: ugen5.2: <OWC Mercury Elite Pro Quad> at usbus5 (disconnected) Jan 1 06:33:27 generic kernel: uhub6: at uhub4, port 1, addr 1 (disconnected) Jan 1 06:33:28 generic kernel: uhub6: detached Jan 1 06:33:29 generic kernel: ugen5.3: <OWC Mercury Elite Pro Quad> at usbus5 (disconnected) Jan 1 06:33:29 generic kernel: uhub7: at uhub4, port 2, addr 2 (disconnected) Jan 1 06:33:30 generic kernel: ugen5.4: <OWC Mercury Elite Pro Quad A> at usbus5 (disconnected) Jan 1 06:33:30 generic kernel: umass0: at uhub7, port 1, addr 3 (disconnected) So there you see the last of the 4 drives coming up at 06:22 (I trimmed off the others, they're in immediate succession), then me mounting the ZFS filesystem a few seconds later, then 11 minutes of silence before xhci1 resets. Here's another example, earlier that day: Jan 1 00:27:32 generic dhclient[724]: send_packet: Host is down Jan 1 00:27:49 generic syslogd: last message repeated 2 times Jan 1 01:15:09 generic syslogd: last message repeated 1 times Jan 1 01:15:21 generic syslogd: last message repeated 1 times Jan 1 03:00:28 generic syslogd: last message repeated 1 times Jan 1 03:01:06 generic syslogd: last message repeated 3 times Jan 1 04:39:06 generic syslogd: last message repeated 1 times Jan 1 05:26:37 generic dhclient[724]: send_packet: Host is down Jan 1 05:27:08 generic syslogd: last message repeated 2 times Jan 1 05:27:32 generic syslogd: last message repeated 2 times Jan 1 05:45:06 generic kernel: xhci1: Resetting controller Jan 1 05:45:06 generic kernel: uhub4: at usbus5, port 1, addr 1 (disconnected) Jan 1 05:45:06 generic kernel: ugen5.7: <OWC Mercury Elite Pro Quad> at usbus5 (disconnected) Jan 1 05:45:06 generic kernel: uhub7: at uhub4, port 1, addr 6 (disconnected) Jan 1 05:45:07 generic kernel: uhub7: detached Jan 1 05:45:08 generic kernel: ugen5.2: <OWC Mercury Elite Pro Quad> at usbus5 (disconnected) Jan 1 05:45:08 generic kernel: uhub6: at uhub4, port 2, addr 1 (disconnected) Jan 1 05:45:10 generic kernel: ugen5.3: <OWC Mercury Elite Pro Quad A> at usbus5 (disconnected) Jan 1 05:45:10 generic kernel: (da0:umass-sim0:0:0:0): READ(16). CDB: 88 00 00 00 00 01 01 89 5c f8 00 00 00 80 00 00 Jan 1 05:45:10 generic kernel: umass0: at uhub6, port 1, addr 2 (disconnected) Jan 1 05:45:10 generic kernel: (da0:umass-sim0:0:0:0): CAM status: CCB request completed with an error Jan 1 05:45:10 generic kernel: (da0:umass-sim0:0:0:0): Retrying command, 3 more tries remain Jan 1 05:45:10 generic kernel: (da0:umass-sim0:0:0:0): READ(16). CDB: 88 00 00 00 00 01 01 89 5c f8 00 00 00 80 00 00 Jan 1 05:45:10 generic kernel: (da0:umass-sim0:0:0:0): CAM status: CCB request completed with an error Jan 1 05:45:10 generic kernel: (da0:umass-sim0:0:0:0): Retrying command, 2 more tries remain Jan 1 05:45:10 generic kernel: (da0:umass-sim0:0:0:0): READ(16). CDB: 88 00 00 00 00 01 01 89 5c f8 00 00 00 80 00 00 Jan 1 05:45:10 generic kernel: (da0:umass-sim0:0:0:0): CAM status: CCB request completed with an error Jan 1 05:45:10 generic kernel: (da0:umass-sim0:0:0:0): Retrying command, 1 more tries remain I've done some further investigation that reinforces my suspicion that this is a FreeBSD issue, and not an issue with any of the hardware involved. Previously, although my tests reproduced the problem consistently in FreeBSD and not in Linux on the same ROCKPro64, there was a complication around the speed at which the two operating systems could use the disk. FreeBSD could read the disk quite fast (~150MB/s), but Linux only read at a fraction of that because the pool was encrypted, and ZFS on Linux does not yet support ARM hardware crypto. I was thus concerned that if the issue was actually hardware (either the USB enclosure or the ROCKPro64) and only occurred at higher disk speeds, then the constrained Linux speed might be inadvertently preventing the issue, and it would appear under Linux if it could work with the disks faster. So, I removed my encrypted pool's disks from the enclosure and replaced them with three smaller disks and created a new, unencrypted, lz4-compressed ZFS pool on it. Same enclosure, same ROCKPro64, same everything except for the disks in the enclosure. Previously I used an eMMC module for the OS, but to simplify switching between operating systems I removed that and flashed two SD cards with the following images: FreeBSD-14.0-CURRENT-arm64-aarch64-ROCKPRO64-20220310-f1d450ddee6-253697.img.xz Armbian_21.08.1_Rockpro64_focal_current_5.10.60.img.xz (The latter is Armbian's Ubuntu 20.04.4 release.) I wrote 1.67TB from /dev/urandom to a file on the ZFS filesystem, then tested reading it on both operating systems. FreeBSD still ran into the same issue I described previously, usually after just a few tens of GB but it did once make it to ~250GB before failing. But it failed 100% of the time, with the same symptoms as before: responsive OS, but any command dealing with the pool would stall indefinitely, and I couldn't complete a graceful shutdown. Similar CAM and USB errors in dimes. Hard resetting the ROCKPro64 would recover it. Ubuntu, by contrast, had no problems whatsoever and was able to read the file in its entirety repeatedly over the course of a day. I repeated all of the tests multiple times. Given that Ubuntu on the same ROCKPro64 with the same enclosure and the same ZFS pool has no problem, and switching to FreeBSD with nothing else changed introduces the problem, I am much more confident in my prior assumption that the problem is within FreeBSD somewhere. Is there any further testing or information that I could provide to help figure out what's going on here? I'm happy to help. Thanks to everyone here! Hi, If you have a spare disk, it would be interesting to see the result from a long test using /usr/src/tools/tools/usbtest . You basically select mass storage, read and write mode, duration 7000 seconds, for example. --HPS I git cloned /usr/src and did a make of /usr/src/tools/tools/usbtest. I run the built usbtest as root, I select the first drive in the USB enclosure, go into the "Mass Storage" section, toggle I/O mode to "Read and Write", set the "test duration" to 7000, then "start test", but I immediately get "Could not find USB endpoints". I've poked around in the menus but didn't find anything that changed that outcome. I couldn't find any documentation for the tool either. Sorry to demand hand-holding, but what am I missing here? Could you paste the console output of that usbtest run. "usbtest" must be run as root. Did you select like this in the menu: 2) This computer is Running the Host Side 1) Select USB device (VID=0x0000, PID=0x0000) 4) Mass Storage (UMASS) --HPS Created attachment 232407 [details]
Console log of running usbtest unsuccessfully
Yes, that's how I'm running it. I've attached the console output for you. I also tried disconnecting the enclosure and instead connecting a USB 2.0 thumb drive, and usbtest seemed to work with that: it didn't get that that error and it ran for a while doing various tests. (I can include those results if you like, but assume they're not relevant since that's not the USB device I'm having trouble with.) I am not sure what "Could not find USB endpoints" means (my knowledge of both C code and USB's under-the-hood operation are limited), could it perhaps be related to the enclosure containing a USB hub? I believe that's how it's able to have 4 SATA-to-USB bridges in a single enclosure with a single USB connection. Could you run: usbconfig -d ugen5.4 dump_device_desc dump_curr_config_desc --HPS [freebsd@generic ~]$ sudo usbconfig -d ugen5.4 dump_device_desc dump_curr_config_desc ugen5.4: <OWC Mercury Elite Pro Quad A> at usbus5, cfg=0 md=HOST spd=SUPER (5.0Gbps) pwr=ON (0mA) bLength = 0x0012 bDescriptorType = 0x0001 bcdUSB = 0x0310 bDeviceClass = 0x0000 <Probed by interface class> bDeviceSubClass = 0x0000 bDeviceProtocol = 0x0000 bMaxPacketSize0 = 0x0009 idVendor = 0x1e91 idProduct = 0xa4a7 bcdDevice = 0x0100 iManufacturer = 0x0002 <OWC> iProduct = 0x0003 <Mercury Elite Pro Quad A> iSerialNumber = 0x0001 <123456789004> bNumConfigurations = 0x0001 Configuration index 0 bLength = 0x0009 bDescriptorType = 0x0002 wTotalLength = 0x0079 bNumInterfaces = 0x0001 bConfigurationValue = 0x0001 iConfiguration = 0x0000 <no string> bmAttributes = 0x00c0 bMaxPower = 0x0000 Interface 0 bLength = 0x0009 bDescriptorType = 0x0004 bInterfaceNumber = 0x0000 bAlternateSetting = 0x0000 bNumEndpoints = 0x0002 bInterfaceClass = 0x0008 <Mass storage> bInterfaceSubClass = 0x0006 bInterfaceProtocol = 0x0050 iInterface = 0x0000 <no string> Endpoint 0 bLength = 0x0007 bDescriptorType = 0x0005 bEndpointAddress = 0x0081 <IN> bmAttributes = 0x0002 <BULK> wMaxPacketSize = 0x0400 bInterval = 0x0000 bRefresh = 0x0000 bSynchAddress = 0x0000 Additional Descriptor bLength = 0x06 bDescriptorType = 0x30 bDescriptorSubType = 0x0f RAW dump: 0x00 | 0x06, 0x30, 0x0f, 0x00, 0x00, 0x00 Endpoint 1 bLength = 0x0007 bDescriptorType = 0x0005 bEndpointAddress = 0x0002 <OUT> bmAttributes = 0x0002 <BULK> wMaxPacketSize = 0x0400 bInterval = 0x0000 bRefresh = 0x0000 bSynchAddress = 0x0000 Additional Descriptor bLength = 0x06 bDescriptorType = 0x30 bDescriptorSubType = 0x0f RAW dump: 0x00 | 0x06, 0x30, 0x0f, 0x00, 0x00, 0x00 Interface 0 Alt 1 bLength = 0x0009 bDescriptorType = 0x0004 bInterfaceNumber = 0x0000 bAlternateSetting = 0x0001 bNumEndpoints = 0x0004 bInterfaceClass = 0x0008 <Mass storage> bInterfaceSubClass = 0x0006 bInterfaceProtocol = 0x0062 iInterface = 0x0000 <no string> Endpoint 0 bLength = 0x0007 bDescriptorType = 0x0005 bEndpointAddress = 0x0081 <IN> bmAttributes = 0x0002 <BULK> wMaxPacketSize = 0x0400 bInterval = 0x0000 bRefresh = 0x0000 bSynchAddress = 0x0000 Additional Descriptor bLength = 0x06 bDescriptorType = 0x30 bDescriptorSubType = 0x0f RAW dump: 0x00 | 0x06, 0x30, 0x0f, 0x05, 0x00, 0x00 Additional Descriptor bLength = 0x04 bDescriptorType = 0x24 bDescriptorSubType = 0x03 RAW dump: 0x00 | 0x04, 0x24, 0x03, 0x00 Endpoint 1 bLength = 0x0007 bDescriptorType = 0x0005 bEndpointAddress = 0x0002 <OUT> bmAttributes = 0x0002 <BULK> wMaxPacketSize = 0x0400 bInterval = 0x0000 bRefresh = 0x0000 bSynchAddress = 0x0000 Additional Descriptor bLength = 0x06 bDescriptorType = 0x30 bDescriptorSubType = 0x0f RAW dump: 0x00 | 0x06, 0x30, 0x0f, 0x05, 0x00, 0x00 Additional Descriptor bLength = 0x04 bDescriptorType = 0x24 bDescriptorSubType = 0x04 RAW dump: 0x00 | 0x04, 0x24, 0x04, 0x00 Endpoint 2 bLength = 0x0007 bDescriptorType = 0x0005 bEndpointAddress = 0x0083 <IN> bmAttributes = 0x0002 <BULK> wMaxPacketSize = 0x0400 bInterval = 0x0000 bRefresh = 0x0000 bSynchAddress = 0x0000 Additional Descriptor bLength = 0x06 bDescriptorType = 0x30 bDescriptorSubType = 0x0f RAW dump: 0x00 | 0x06, 0x30, 0x0f, 0x05, 0x00, 0x00 Additional Descriptor bLength = 0x04 bDescriptorType = 0x24 bDescriptorSubType = 0x02 RAW dump: 0x00 | 0x04, 0x24, 0x02, 0x00 Endpoint 3 bLength = 0x0007 bDescriptorType = 0x0005 bEndpointAddress = 0x0004 <OUT> bmAttributes = 0x0002 <BULK> wMaxPacketSize = 0x0400 bInterval = 0x0000 bRefresh = 0x0000 bSynchAddress = 0x0000 Additional Descriptor bLength = 0x06 bDescriptorType = 0x30 bDescriptorSubType = 0x00 RAW dump: 0x00 | 0x06, 0x30, 0x00, 0x00, 0x00, 0x00 Additional Descriptor bLength = 0x04 bDescriptorType = 0x24 bDescriptorSubType = 0x01 RAW dump: 0x00 | 0x04, 0x24, 0x01, 0x00 [freebsd@generic ~]$ I don't see why usbtest is not attaching. Could you add some printf's to the "find_usb_endpoints()" function in usbtest, why it does not enter and find two endpoints on interface zero? --HPS I found it. You have multiple devices with same USB vendor and product ID. Try this usbtest patch: https://cgit.freebsd.org/src/commit/?id=16346e1401b8b369e251bc70781349fb9b813cef Then re-run. Thank you! --HPS Created attachment 232819 [details]
gzip-ed console log of running usbtest against three disks
Apologies for the delay, I was away from home (and the devices in question) for two weeks. I rebuilt usbtest from the "main" branch (which appeared to include your changes) and am now able to run it. I have drives in the first three bays of the enclosure ("A", "B", & "C") so I ran the "mass storage" test against each of them in order, using read/write mode and 7000 second duration as you suggested. I've attached a gzip-compressed log of the console output. I did notice timeouts on the third one ("C"), but will leave the interpretation to you. Thanks again for your help! Hi Will, There are several timeouts on device "C": TIMEOUT: Trying to get CSW again. 8 tries left. But not always the timeout fails! I suspect this may not be a fault of your USB devices, but the XHCI controller. Did you get any errors in dmesg while running usbtest? I suspect there is a IRQ problem or something XHCI'ish is bad. A wild guess, does entering: set hw.usb.xhci.dcepquirk=1 In the loader help? --HPS So I have spent a lot of time trying to reproduce those timeout errors in usbtest, but have been unsuccessful. I have done dozens of tests, both read and read/write, many as long as 8 hours in length, dozens of times on all three drives in the bay. None have reproduced those timeout errors we saw before. However, I can reproduce the ZFS freezing issue 100% reliably, so that issue persists unaffected. I did try toggling hw.usb.xhci.dcepquirk between 0 and 1 and it seemed to make no difference on either subtest's ability to cause timeouts, or the appearance of the ZFS freezing issue. Do you have any other ideas for how to try to reproduce whatever underlies the ZFS freezing issue, using usbtest? I see a lot of options in there but do not understand them well enough to know what to try. Thank you again for your help! 1) Maybe you can use usbdump to catch the exact traffic pattern before freeze? Use the -s parameter to limit how much is captured, like "-s 128" . 2) Maybe that the test is read-only makes a difference: 1) Toggle I/O mode: <Read Only> Make sure R/W test will destroy the contents of your disk! --HPS |