Bug 249520 - IO errors while mounting ZFS root on UEFI-booted RPi4 from USB3-attached external USB drive
Status: Closed Not A Bug
Alias: None
Product: Base System
Classification: Unclassified
Component: arm
Version: CURRENT
Hardware: arm64 Any
Importance: --- Affects Some People
Assignee: freebsd-arm (Nobody)
Reported: 2020-09-22 09:22 UTC by Robert Clausecker
Modified: 2021-04-18 21:18 UTC (History)

Attachments
boot/dmesg log (19.51 KB, text/plain)
2020-09-22 09:22 UTC, Robert Clausecker
usbconfig dump_all_desc output (9.21 KB, text/plain)
2021-04-17 10:06 UTC, Robert Clausecker

Description Robert Clausecker freebsd_committer freebsd_triage 2020-09-22 09:22:11 UTC
Created attachment 218172
boot/dmesg log

I compiled a FreeBSD kernel from CURRENT with patches D25219, D26493, D26494, D26495, and D26495 yesterday and attempted to boot it on my Raspberry Pi 4B with 8 GB memory.  The kernel version is reported as:

    FreeBSD 13.0-CURRENT #0 5942f048f5c-c271691(master)-dirty

To this end, I prepared a USB drive (an M.2 SSD attached through an M.2 SATA-to-USB bridge) with a UEFI bootloader and a FreeBSD installation in a zpool.  When trying to boot the system with the drive attached to a USB2 port, everything works fine.  When I instead use a USB3 port, mounting root fails with a series of IO errors:

da0 at umass-sim0 bus 0 scbus0 target 0 lun 0
da0: <WDC WDS2 40G2G0B-00EP UJ43> Fixed Direct Access SPC-4 SCSI device
da0: Serial Number ABCDEFA74566
da0: 400.000MB/s transfers
da0: 228936MB (468862128 512 byte sectors)
da0: quirks=0x2<NO_6_BYTE>
(da0:umass-sim0:0:0:0): READ(10). CDB: 28 00 02 c7 77 2e 00 00 05 00 
(da0:umass-sim0:0:0:0): CAM status: CCB request completed with an error
(da0:umass-sim0:0:0:0): Retrying command, 3 more tries remain
(da0:umass-sim0:0:0:0): READ(10). CDB: 28 00 02 c7 77 2e 00 00 05 00 
(da0:umass-sim0:0:0:0): CAM status: CCB request completed with an error
(da0:umass-sim0:0:0:0): Retrying command, 2 more tries remain
(da0:umass-sim0:0:0:0): READ(10). CDB: 28 00 02 c7 77 2e 00 00 05 00 
(da0:umass-sim0:0:0:0): CAM status: CCB request completed with an error
(da0:umass-sim0:0:0:0): Retrying command, 1 more tries remain
(da0:umass-sim0:0:0:0): READ(10). CDB: 28 00 02 c7 77 2e 00 00 05 00 
(da0:umass-sim0:0:0:0): CAM status: CCB request completed with an error
(da0:umass-sim0:0:0:0): Retrying command, 0 more tries remain
(da0:umass-sim0:0:0:0): READ(10). CDB: 28 00 02 c7 77 2e 00 00 05 00 
(da0:umass-sim0:0:0:0): CAM status: CCB request completed with an error
(da0:umass-sim0:0:0:0): Error 5, Retries exhausted

The same happened before I applied the D26493 series of patches.  The drive works just fine on a USB3 port on my Haswell-based laptop.  Enabling/disabling the RAM limiter in the UEFI configuration does not change the symptoms.

See attached boot console log for details.
Comment 1 Hans Petter Selasky freebsd_committer freebsd_triage 2020-09-22 11:47:28 UTC
Does the device respond to "usbconfig dump_all_desc" after this error?

--HPS
Comment 2 Robert Clausecker freebsd_committer freebsd_triage 2020-09-22 16:32:30 UTC
I cannot tell because as the root file system fails to be mounted, I don't get a shell to type commands into.
Comment 3 Mark Millard 2020-09-22 18:06:21 UTC
I am running a head -r365932 build on a RPi4B with 8GiByte of
RAM booting from a USB3 SSD via uefi/ACPI v1.20, no ZFS involved.
Previously it was head -r363590 that also worked just fine. All
were based on non-debug buildworld buildkernel. Historically
tuned via -mcpu=cortex-a53 but recently via -mcpu=cortex-a72.
I use the 3072 MiByte RAM limit setting.

If I gather right, you are at head -r365941 in svn terms.
So you should have picked up -r365918 that is a xhci fix
(important for -mcpu=cortex-a72 based kernels).

I have https://reviews.freebsd.org/D26495 applied and have
had it applied for a long time.

I do not have any of the following applied:

D26493, D26494, D26495 (you listed this one twice),
D26496 (so I guessed this one)

In other words, my software context seems similar to yours
from before you "applied the D26493 series of patches".

This suggests that something more specific to your context
is involved and it may be difficult for others to duplicate
your problem.
Comment 4 Robert Clausecker freebsd_committer freebsd_triage 2020-09-23 08:59:10 UTC
(In reply to Mark Millard from comment #3)

It is plausible that the USB bridge does some funky things.  It's a cheapo
Chinesium bridge, specifically a FIDECO brand model M203CP B-key M.2 (SATA) to USB 3.1 bridge.  Attached to that bridge is a Western Digital WDS240G2G0B SSD.

The kernel has been compiled with default options; only -mcpu=cortex-a72 has been added to src.conf.  The UEFI code is version 1.20, too.  The system appears to have crashed a few hours ago (doesn't ping), so I'll have to wait until I'm back home to bring it back up.

What further information can I supply to help you debug this?  I could apply some patches or even break out the kernel debugger following your instructions if that helps.
Comment 5 Mark Millard 2020-09-23 09:29:10 UTC
(In reply to Robert Clausecker from comment #4)

I do not have much to suggest other than booting off of
other media via, say, a USB2 device or a microsd card,
and then use that context to investigate plugging in
and using the USB3 materials. (I've got very little
background in the subjects involved.)

This might get you to the point of being able to do
something like what Hans Petter Selasky suggested
and report the results back to him.

I've no clue at this point how to find what code
initiated the failing read. A stack backtrace from
that code would be nice for identifying the context
that gets the problem.

Comment #3 was more about the environment not being
generally broken and replicating the problem elsewhere
possibly being problematical.
Comment 6 Robert Clausecker freebsd_committer freebsd_triage 2020-09-23 10:30:47 UTC
Thanks,

I'll try building a UFS-based FreeBSD setup on a separate USB drive and boot from that for testing purposes.  Could take a few days to get done with it.
Comment 7 Mark Millard 2020-09-25 08:17:26 UTC
(In reply to Mark Millard from comment #3)

When I listed what I have applied, I messed up. It should
have listed:

 https://reviews.freebsd.org/D25219

I do not have D26495 applied.
Comment 8 Robert Clausecker freebsd_committer freebsd_triage 2020-09-26 10:48:18 UTC
Issue still occurs on r366144 with D25219 applied.  The XHCI fixes apparently have not affected whatever the underlying problem is here.
Comment 9 Robert Clausecker freebsd_committer freebsd_triage 2021-04-16 22:50:41 UTC
Issue still occurs with the 13.0 release.
Comment 10 Mark Millard 2021-04-17 02:45:13 UTC
(In reply to Robert Clausecker from comment #9)

It is not clear what version(s) of sysutils/rpi-firmware type
materials that you are using. A way of getting solid information
about the RPi firmware (unless it has been mixed-and-matched across
releases) is:

# strings start4.elf | grep VC_BUILD_ID_
VC_BUILD_ID_USER: dom
VC_BUILD_ID_TIME: 12:10:40
VC_BUILD_ID_VARIANT: start
VC_BUILD_ID_TIME: Feb 25 2021
VC_BUILD_ID_BRANCH: bcm2711_2
VC_BUILD_ID_HOSTNAME: buildbot
VC_BUILD_ID_PLATFORM: raspberrypi_linux
VC_BUILD_ID_VERSION: 564e5f9b852b23a330b1764bcf0b2d022a20afd0 (clean)

If you are using a build dated before 2021-Feb-21, you likely
have problems in part from the firmware. The status for ones
after 2021-Feb-21 is not well known as far as I can tell.

If you are still using releases from https://github.com/pftf/RPi4/releases/
to have UEFI (possibly used in ACPI mode), that and the specific version
is not clear. (V1.26 is new as of today.) As far as I know no one is
officially supporting use of these releases and it is known that the
3 GiByte limitation must be selected for reliable operation to even
be a potential. From that point of view, these reports might someday
be classified as "not a bug".

Whether you are using ACPI mode is also not clear.

If you are using sysutils/u-boot-rpi4 or sysutils/u-boot-rpi-arm64 that
also is not clear, including which version(s). These also present a
UEFI interface, not necessarily with ACPI as an option but historically
with a Device Tree. These also end up using EFI/BOOT/bootaa64.efi ( a.k.a.
/boot/loader.efi but copied).

None of that is identified by the FreeBSD version(s) unless you also
indicate something like that you did a dd of something like
FreeBSD-13.0-RELEASE-arm64-aarch64-RPI.img that has more than FreeBSD
involved.

So far I've still never had a problem like you report. But I have a
UFS context, not ZFS.


In comment #6 you wrote:

QUOTE
I'll try building a UFS-based FreeBSD setup on a separate USB drive and boot from that for testing purposes
END QUOTE.

But, I do not see any explicit reports of what was discovered or if you abandoned
the effort.

You could potentially try a pure FreeBSD-13.0-RELEASE-arm64-aarch64-RPI.img
context and if it worked, then try to substitute more of your context to the
media and see what step starts the failures. (Then possibly start over, making
just that last substitution to see if it is sufficient.)

It appears that such would be required for you to supply enough information
for someone to repeat the problem. Of course, if
FreeBSD-13.0-RELEASE-arm64-aarch64-RPI.img failed up front, it almost certainly
means lack of hardware support in some way --and that would mean needing to
replicate the hardware context in order for someone to investigate.

As stands it is unclear how anyone can help you or investigate.
Comment 11 Mark Millard 2021-04-17 04:59:16 UTC
(In reply to Robert Clausecker from comment #9)

Just for the record: on the lists you have reported:

There's some stuff about UEFI booting in there which you can ignore.
The same problem also appears when booting via U-Boot.
Comment 12 Mark Millard 2021-04-17 08:45:25 UTC
(In reply to Robert Clausecker from comment #9)

As detailed in the below-noted list submittal, I used bsdinstall to
set up a RPi4B 8 GiByte with a ZFS USB3 SSD boot/root-file-system
media via FreeBSD-13.0-RELEASE-arm64-aarch64-RPI.img on a microsd
card as the context bsdinstall ran in. Some RPi4 specific materials
had to be copied to the file system in /dev/gpt/efiboot0 since
bsdinstall does not deal with such things. The resultant ZFS USB3
SSD worked fine for booting and operating the RPi4B 8 GiByte.

See:
https://lists.freebsd.org/pipermail/freebsd-arm/2021-April/023648.html

From the booted system:

root@RPi4_8G_ZFS:~ # uname -apKU
FreeBSD RPi4_8G_ZFS 13.0-RELEASE FreeBSD 13.0-RELEASE #0 releng/13.0-n244733-ea31abc261f: Fri Apr  9 03:54:53 UTC 2021     root@releng1.nyi.freebsd.org:/usr/obj/usr/src/arm64.aarch64/sys/GENERIC  arm64 aarch64 1300139 1300139

root@RPi4_8G_ZFS:~ # df -m
Filesystem         1M-blocks Used  Avail Capacity  Mounted on
zroot/ROOT/default    196003 1110 194893     1%    /
devfs                      0    0      0   100%    /dev
/dev/gpt/efiboot0        259   18    241     7%    /boot/efi
zroot/tmp             194893    0 194893     0%    /tmp
zroot/usr/home        194893    0 194893     0%    /usr/home
zroot/var/log         194893    0 194893     0%    /var/log
zroot/var/mail        194893    0 194893     0%    /var/mail
zroot                 194893    0 194893     0%    /zroot
zroot/var/tmp         194893    0 194893     0%    /var/tmp
zroot/usr/src         195594  701 194893     0%    /usr/src
zroot/var/audit       194893    0 194893     0%    /var/audit
zroot/usr/ports       195593  700 194893     0%    /usr/ports
zroot/var/crash       194893    0 194893     0%    /var/crash

Robert's problem seems to be based on some detail(s) specific
to his environment, not some sort of general problem with
root on ZFS via USB3 for RPi4B's.

The problem is to isolate what detail(s). As stands, it appears
only Robert has a context to do that in.
Comment 13 Robert Clausecker freebsd_committer freebsd_triage 2021-04-17 09:45:18 UTC
(In reply to Mark Millard from comment #10)

Hi Mark,

> It is not clear what version(s) of sysutils/rpi-firmware type
> materials that you are using. A way of getting solid information
> about the RPi firmware (unless it has been mixed-and-matched across
> releases) is:

I'm using the exact same firmware version shipped on the FreeBSD 13 release images.  The strings output is identical to yours.

> If you are still using releases from https://github.com/pftf/RPi4/releases/
> to have UEFI (possibly used in ACPI mode), that and the specific version
> is not clear.

Nope.  I gave up on these attempts when it became clear that UEFI
is not going to be supported going forwards.  Interestingly, I recall
that the problem might not have occurred on UEFI, but I'm not sure.
So it's standard U-Boot stuff right now.

> So far I've still never had a problem like you report. But I have a
> UFS context, not ZFS.

I did a UFS reinstall some months ago and to my surprise, the problem
went away when I did that.  This was very surprising to me and I chalked
it up to perhaps the problem having been fixed by an update in CURRENT
back then.  I had updated that install all the way to FreeBSD 13.0-RELEASE
before trying to reinstall on ZFS and never had any problems.  Those seem
to occur only when installing the system on ZFS.

> But, I do not see any explicit reports of what was discovered or if you abandoned the effort.

After reinstalling on UFS, the problem went away so I thought the problem had been addressed and kinda forgot about the bug report.  I'll try to set up a separate UFS-based disk and boot from that, mounting the zpool later on, to see if it changes anything.

> It appears that such would be required for you to supply enough information
> for someone to repeat the problem. Of course, if
> FreeBSD-13.0-RELEASE-arm64-aarch64-RPI.img failed up front, it almost certainly
> means lack of hardware support in some way --and that would mean needing to
> replicate the hardware context in order for someone to investigate.

Yes, this is very unfortunate.

> (comment #12)

Very strange.  Perhaps it is indeed a power issue (as alluded to by some people on the list).
Comment 14 Robert Clausecker freebsd_committer freebsd_triage 2021-04-17 10:06:57 UTC
Created attachment 224184
usbconfig dump_all_desc output

I've flashed the UFS based default installer image to a separate USB drive attached by USB 2.0 and then tried to import the zpool manually.  The observed errors are similar:

# zpool import
ZFS filesystem version: 5
ZFS storage pool version: features support (5000)
   pool: tau
     id: 11171206566155786428
  state: ONLINE
status: Some supported features are not enabled on the pool.
 action: The pool can be imported using its name or numeric identifier, though
        some features will not be available without an explicit 'zpool upgrade'.
 config:

        tau                            ONLINE
          diskid/DISK-ABCDEFA74566s2a  ONLINE
# zpool import -R /tau tau
(da1:umass-sim1:1:0:0): READ(10). CDB: 28 00 19 81 f3 ad 00 00 07 00 
(da1:umass-sim1:1:0:0): CAM status: CCB request completed with an error
(da1:umass-sim1:1:0:0): Retrying command, 3 more tries remain
(da1:umass-sim1:1:0:0): READ(10). CDB: 28 00 19 81 f3 ad 00 00 07 00 
(da1:umass-sim1:1:0:0): CAM status: CCB request completed with an error
(da1:umass-sim1:1:0:0): Retrying command, 2 more tries remain
(da1:umass-sim1:1:0:0): READ(10). CDB: 28 00 19 81 f3 ad 00 00 07 00 
(da1:umass-sim1:1:0:0): CAM status: CCB request completed with an error
(da1:umass-sim1:1:0:0): Retrying command, 1 more tries remain
(da1:umass-sim1:1:0:0): READ(10). CDB: 28 00 19 81 f3 ad 00 00 07 00 
(da1:umass-sim1:1:0:0): CAM status: CCB request completed with an error
(da1:umass-sim1:1:0:0): Retrying command, 0 more tries remain
(da1:umass-sim1:1:0:0): READ(10). CDB: 28 00 19 81 f3 ad 00 00 07 00 
(da1:umass-sim1:1:0:0): CAM status: CCB request completed with an error
(da1:umass-sim1:1:0:0): Error 5, Retries exhausted

Other operations on the drive, like mounting the UFS partition or reading the whole disk front to back succeed without problems.

I've been able to run the usbconfig command asked for by comment #1, so at least I can attach that.

If all fails, I can mail someone the drive so you can reproduce this for yourself.
Comment 15 Robert Clausecker freebsd_committer freebsd_triage 2021-04-18 09:56:55 UTC
I've now tried a different drive (an ST2000LM007-1R8174; spinning rust) in an external USB-3 enclosure and it works just fine.  Either the drive is faulty or perhaps there is some sort of problem that only affects that drive.
Comment 16 Robert Clausecker freebsd_committer freebsd_triage 2021-04-18 18:49:01 UTC
I found another clue:

If I create a zpool with ashift=12 on the disk in question, it works fine on the RPi 4.  Perhaps the disk does not actually support 512 byte access (you can see from the IO errors that they try to do a transfer of 5 sectors of 512 bytes). Could this perhaps be some sort of ZFS regression?  The other system I tested the drive on is still on FreeBSD 12 with the old ZFS code.  Perhaps that code did not try to perform such accesses?  Or perhaps there was some sort of fallback?
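
To make the ashift arithmetic concrete, here is a minimal sketch in Python. It assumes that ashift is the base-2 logarithm of the pool's smallest allocation size and that the drive reports 512 byte logical sectors (as the da0 probe output does); the helper name is made up for illustration.

```python
# Sketch: relate a ZFS ashift value to the number of 512-byte logical
# sectors in the smallest block the pool allocates.
# sectors_per_min_block is a hypothetical helper, not a ZFS interface.
LOGICAL_SECTOR = 512  # bytes, as reported for da0 in the boot log

def sectors_per_min_block(ashift: int) -> int:
    """Return how many 512-byte sectors make up one 2**ashift-byte block."""
    block_bytes = 1 << ashift
    return block_bytes // LOGICAL_SECTOR

for ashift in (9, 12):
    print(f"ashift={ashift}: {1 << ashift} bytes, "
          f"{sectors_per_min_block(ashift)} sector(s) of 512 bytes")
```

With ashift=9 a single 512 byte sector is a valid unit, so reads of 5 or 7 sectors can occur; with ashift=12 the unit is 8 sectors, which would be consistent with the pool working once it was recreated that way.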
Comment 17 Mark Millard 2021-04-18 20:57:47 UTC
(In reply to Robert Clausecker from comment #16)

Ronald Klop in

https://lists.freebsd.org/pipermail/freebsd-arm/2021-April/023650.html

had written:

QUOTE
Could it be a partitioning difference that you are crossing 4K-sector boundaries or something else that amplifies the traffic when using ZFS?
END QUOTE

So it sounds like it may be a well-known type of issue that
one is supposed to well-manage when setting up ZFS.

I would guess that bsdinstall in auto mode for creating a
ZFS context likely just uses figures such that it avoids
running into such issues. Other contexts may fairly
generally require more explicit handling to avoid creating
issues. (I'm no ZFS expert.)
Comment 18 Robert Clausecker freebsd_committer freebsd_triage 2021-04-18 21:18:13 UTC
(In reply to Mark Millard from comment #17)

Hi Mark,

Apparently bsdinstall sets up

    vfs.zfs.min_auto_ashift=12

in /etc/sysctl.conf to address this potential problem.  So you are right in that this possibility is already addressed.

However, as I manually set up the zpool, this was not the case for me and I ran head first into the problem.
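
For anyone who sets up a pool by hand and wants to check for this setting, a small Python sketch follows. It only scans the text of a sysctl.conf-style file for the variable named above; the helper name and the sample text are assumptions made for illustration.

```python
# Sketch: look for vfs.zfs.min_auto_ashift in the text of a sysctl.conf file.
# find_min_auto_ashift is a hypothetical helper; the sample text is made up.

def find_min_auto_ashift(conf_text: str):
    """Return the last vfs.zfs.min_auto_ashift value set, or None."""
    value = None
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "=" not in line:
            continue
        name, _, raw = line.partition("=")
        if name.strip() == "vfs.zfs.min_auto_ashift":
            value = int(raw.strip())
    return value

sample = """
# example sysctl.conf content
vfs.zfs.min_auto_ashift=12
"""
print(find_min_auto_ashift(sample))
```

On a real system the text would come from /etc/sysctl.conf; a result of None would mean the file does not set the variable, as was the case for my manual setup.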

For future reference: the thing that made me diagnose the problem is the CDB showing a 5 sector read.  5 is not a multiple of 8, the number of 512 byte sectors in a 4096 byte block.
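
The check can be sketched in Python as follows. It decodes the two READ(10) CDBs from the logs in this report using the standard READ(10) layout (opcode 0x28, big-endian LBA in bytes 2 to 5, big-endian transfer length in bytes 7 and 8) and tests them against an assumed 4096 byte block size. The helper name is hypothetical, and the 4096 byte figure is a guess about the drive, not something it reports.

```python
# Sketch: decode the READ(10) CDBs printed by CAM and check them against an
# assumed 4 KiB block (8 logical sectors of 512 bytes each).
# decode_read10 is a hypothetical helper, not part of any FreeBSD tool.

SECTORS_PER_4K = 4096 // 512  # 8

def decode_read10(cdb_hex: str) -> tuple:
    """Return (lba, sector_count) from a READ(10) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")    # bytes 2-5: logical block address
    count = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length
    return lba, count

for cdb in ("28 00 02 c7 77 2e 00 00 05 00",   # from the failed root mount
            "28 00 19 81 f3 ad 00 00 07 00"):  # from the failed zpool import
    lba, count = decode_read10(cdb)
    aligned = lba % SECTORS_PER_4K == 0 and count % SECTORS_PER_4K == 0
    print(f"LBA {lba}, {count} sectors, 4K-aligned: {aligned}")
```

Both failing commands request a sector count (5 and 7) that is not a multiple of 8, so neither can cover whole 4096 byte blocks.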

Thanks for your excellent help anyway!