Bug 258057 - muge(4) crashes with large tx batches
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: usb
Version: CURRENT
Hardware: arm64 Any
Importance: --- Affects Only Me
Assignee: freebsd-usb (Nobody)
Reported: 2021-08-26 11:18 UTC by Dan Kotowski
Modified: 2021-10-05 16:51 UTC

Description Dan Kotowski 2021-08-26 11:18:19 UTC
When initiating a large number of flows, the muge(4) driver seems to fall over and the interface flaps until physically removed from and reseated in the USB port.

I've been able to reliably reproduce this with the following:

* freebsd-src/main@9781c28c6d63
* ports/main@58a8a0aa37a8
* poudriere bulk -j $JAIL -p $PORTS devel/gh

The devel/gh port seems to initiate a large number of outbound TCP sessions, which seem to trash the TX queue to the point of making the interface unusable.

* dmesg flooded with `ue0: link state changed to DOWN|UP`
* ping returns "sendto: No buffer space available" until exit

Hardware: https://bsd-hardware.info/?probe=8a7b477512
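For anyone trying to quantify the flapping, here is a minimal Python sketch that counts link-state transitions per interface in saved dmesg or /var/log/messages text. It assumes the message format quoted above ("ue0: link state changed to DOWN|UP"); the function name and the sample text are illustrative only and are not taken from an actual capture.

```python
import re
from collections import Counter

# Matches lines such as "ue0: link state changed to DOWN".
LINK_RE = re.compile(r"(\w+): link state changed to (UP|DOWN)")

def count_link_events(log_text):
    """Count (interface, state) link transitions found in log text."""
    counts = Counter()
    for match in LINK_RE.finditer(log_text):
        counts[(match.group(1), match.group(2))] += 1
    return counts

# Illustrative sample only, written in the format described above.
SAMPLE = """\
ue0: link state changed to DOWN
ue0: link state changed to UP
ue0: link state changed to DOWN
ue0: link state changed to UP
"""

events = count_link_events(SAMPLE)
for (ifname, state), n in sorted(events.items()):
    print(f"{ifname} {state}: {n}")
```

Feeding it the real log text (for example the contents of /var/log/messages) instead of SAMPLE gives a rough flap count that can be compared between test runs.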
Comment 1 Hans Petter Selasky freebsd_committer 2021-08-26 12:04:34 UTC
Could you use netstat to estimate how much traffic (packets and bytes per second) is causing this?

I know that some devices can send bigger blocks of data, and this needs to be updated in the muge(4) driver itself, if that is the case.

There is also a tool called usbdump which may shed some light on what is going on there.

--HPS
Comment 2 Dan Kotowski 2021-08-26 13:09:32 UTC
$ netstat -I ue0
Name    Mtu Network       Address              Ipkts Ierrs Idrop    Opkts Oerrs  Coll
ue0    1500 <Link#2>      00:1e:c0:e1:2c:60    94112     5     0    80241     0     0
ue0       - 192.168.1.0/2 192.168.1.99         92610     -     -    81539     -     -
ue0       - fe80::%ue0/64 fe80::21e:c0ff:fe       18     -     -       70     -     -
ue0       - 2001:470:e35c 2001:470:e35c:1:b      461     -     -     1301     -     -


Before:
netstat -I ue0 -s | grep -vF $'\t0'
ip6 on ue0:
        312 total input datagrams
        4 input datagrams discarded
        312 datagrams delivered to an upper layer protocol
        186 datagrams sent from an upper layer protocol
        79 multicast datagrams received
        10 multicast datagrams sent
icmp6 on ue0:
        15 total input messages
        7 input router advertisements
        4 input neighbor solicitations
        4 input neighbor advertisements
        14 total output messages
        6 output neighbor solicitations
        4 output neighbor advertisements
        4 output MLD reports


After:
netstat -I ue0 -s | grep -vF $'\t0'
ip6 on ue0:
        1081 total input datagrams
        48 input datagrams discarded
        1081 datagrams delivered to an upper layer protocol
        1309 datagrams sent from an upper layer protocol
        662 multicast datagrams received
        489 multicast datagrams sent
icmp6 on ue0:
        77 total input messages
        40 input router advertisements
        31 input neighbor solicitations
        6 input neighbor advertisements
        536 total output messages
        501 output neighbor solicitations
        31 output neighbor advertisements
        4 output MLD reports



I didn't see anything meaningful from usbdump on ugen0.5 (parent to muge0), but also I'm not sure I understand the output very well. @hps let me know if you want a dump and how you'd like me to capture it.
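To make the before/after comparison easier to read, here is a rough Python sketch that parses two `netstat -I ue0 -s` style outputs and prints the change in each counter. The parsing rules (a section header ending in ":", followed by indented "count description" lines) are an assumption based on the output pasted above, and the sample strings are a small subset of those figures.

```python
import re

def parse_counters(text):
    """Map 'section: description' -> count for `netstat -I <if> -s` style text."""
    counters = {}
    section = None
    for line in text.splitlines():
        if not line.strip():
            continue
        # Section headers are unindented and end with a colon.
        if not line[0].isspace() and line.rstrip().endswith(":"):
            section = line.strip().rstrip(":")
            continue
        match = re.match(r"\s+(\d+)\s+(.*)", line)
        if match and section:
            counters[f"{section}: {match.group(2).strip()}"] = int(match.group(1))
    return counters

def deltas(before, after):
    """Return after - before for every counter seen in either snapshot."""
    keys = sorted(set(before) | set(after))
    return {k: after.get(k, 0) - before.get(k, 0) for k in keys}

# Subset of the counters pasted above.
BEFORE = """\
ip6 on ue0:
        312 total input datagrams
        4 input datagrams discarded
icmp6 on ue0:
        6 output neighbor solicitations
"""
AFTER = """\
ip6 on ue0:
        1081 total input datagrams
        48 input datagrams discarded
icmp6 on ue0:
        501 output neighbor solicitations
"""

changes = deltas(parse_counters(BEFORE), parse_counters(AFTER))
for name, diff in changes.items():
    print(f"{diff:+6d}  {name}")
```

Dividing each delta by the number of seconds between the two snapshots would give the per-second estimate asked for in comment 1; the elapsed time is not recorded in this thread, so it is left out here.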
Comment 3 Hans Petter Selasky freebsd_committer 2021-09-17 07:48:37 UTC
Can you share the usbdump command line you used?

Maybe you missed some parameters like "-s 1000000" to have a larger receive buffer for USB sniffing?
Comment 4 Dan Kotowski 2021-09-23 14:50:33 UTC
I used the default -s value.

I was finally able to recreate the problem over the past 2 days after some time of things working as expected again, so I ran

# usbdump -d ugen0.5 -s 1000000

but the raw dump is >500MB, even when isolated with as much disabled as I reasonably could and only building net-mgmt/telegraf with all dependencies already built. I tried using xz and uuencode to make a GitHub gist, but it's still way too big.

Is there a good place to upload the raw dump or some guidance on what to look for/how to filter it down further?
Comment 5 Dan Kotowski 2021-09-23 20:17:39 UTC
Is it possible that a Logitech universal receiver could be causing problems?

I unplugged everything and began testing, adding peripherals back 1 at a time. It worked just fine all the way up until I reattached the receiver for my mouse to the bus, at which point the console started getting spammed with "ue0: link state changed to (DOWN|UP)" messages every 6 seconds. Even after removing the receiver, the spam continued up until I rebooted.
Comment 6 Dan Kotowski 2021-09-23 20:47:40 UTC
Never mind... I can still recreate it without that attached, even after a full power cycle and unplug.
Comment 7 Dan Kotowski 2021-09-23 21:10:51 UTC
Some logging here: https://gist.github.com/103b17cc17bf805d6b91bb0221b3fde7

And a dump captured with:

# usbdump -i usbus0 -s 0

Uncompressed it's 681M though...
Comment 8 Hans Petter Selasky freebsd_committer 2021-09-24 07:30:23 UTC
Try to use a lower value for the "-s" parameter, so that less data is captured per packet.

600MBytes is too much to analyze.

--HPS
Comment 9 Dan Kotowski 2021-09-24 11:41:10 UTC
Still playing with `usbdump -s` values to get a reasonably-sized dump.

In the meantime, I set hw.usb.muge.debug=1 and uploaded a segment of /var/log/messages during a period where the interface was flapping under heavy use:

https://gist.github.com/agrajag9/07b68d7fe2ea9bd54ad9d2eeb9dcf27d

A tip from the motherboard vendor was to disable USB power saving as the hubs can behave poorly with that enabled, but `usbconfig ugen0.1 power_on` throws an error:

# usbconfig -u 0
ugen0.1: <Generic XHCI root HUB> at usbus0, cfg=0 md=HOST spd=SUPER (5.0Gbps) pwr=SAVE (0mA)
ugen0.2: <Microchip LAN7800> at usbus0, cfg=0 md=HOST spd=HIGH (480Mbps) pwr=ON (500mA)
# usbconfig ugen0.1 power_on
usbconfig: could not set power ON: Invalid argument

Is there another way I can try that while I'm still tuning the "usbdump -s" value?
Comment 10 Gary Jennejohn 2021-09-24 14:38:29 UTC
(In reply to Dan Kotowski from comment #9)
I'm inclined to think that the motherboard vendor means a BIOS setting to disable USB power saving.  Could be worth taking a look in the BIOS.
Comment 11 Dan Kotowski 2021-09-24 15:23:01 UTC
(In reply to Gary Jennejohn from comment #10)

Unfortunately not there - the source is jnettlet at SolidRun (OEM) and he recommended doing it in the OS, as can be done in Linux by setting values in /sys/bus/usb/devices/*/power/level
Comment 12 Dan Kotowski 2021-09-27 12:11:43 UTC
A thought occurs to me:

The device I'm using: https://www.microchip.com/en-us/development-tool/EVB-LAN7800LC-1

"USB3 Gen1 device"

But:

# usbconfig -u 0
ugen0.1: <Generic XHCI root HUB> at usbus0, cfg=0 md=HOST spd=SUPER (5.0Gbps) pwr=SAVE (0mA)
ugen0.2: <Microchip LAN7800> at usbus0, cfg=0 md=HOST spd=HIGH (480Mbps) pwr=ON (500mA)

Shouldn't ugen0.2 show `spd=SUPER`?
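A quick way to spot devices that enumerated below SuperSpeed is to filter the `usbconfig` listing on the spd= field. The sketch below is a minimal Python example that assumes the line format shown above; the regular expression and function names are illustrative and are not part of any FreeBSD tool.

```python
import re

# Matches device lines such as:
# ugen0.2: <Microchip LAN7800> at usbus0, cfg=0 md=HOST spd=HIGH (480Mbps) pwr=ON (500mA)
LINE_RE = re.compile(
    r"^(?P<ugen>ugen\d+\.\d+): <(?P<name>[^>]*)> at (?P<bus>usbus\d+), "
    r".*?spd=(?P<speed>\w+) \((?P<rate>[^)]*)\)"
)

def parse_usbconfig(text):
    """Return one dict per device line; interface lines (ugenX.Y.Z) are skipped."""
    devices = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            devices.append(match.groupdict())
    return devices

def below_super(devices):
    """Devices that did not enumerate at SuperSpeed."""
    return [d for d in devices if d["speed"] != "SUPER"]

LISTING = """\
ugen0.1: <Generic XHCI root HUB> at usbus0, cfg=0 md=HOST spd=SUPER (5.0Gbps) pwr=SAVE (0mA)
ugen0.2: <Microchip LAN7800> at usbus0, cfg=0 md=HOST spd=HIGH (480Mbps) pwr=ON (500mA)
"""

slow = below_super(parse_usbconfig(LISTING))
for dev in slow:
    print(f'{dev["ugen"]} <{dev["name"]}> on {dev["bus"]}: {dev["speed"]} ({dev["rate"]})')
```

Note that this flags every non-SUPER device, including keyboards and mice that are legitimately FULL or HIGH speed, so the result still needs to be checked against what each device is expected to support.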
Comment 13 Hans Petter Selasky freebsd_committer 2021-09-27 12:51:53 UTC
If the device enumerates as a 3.x device, then yes, it should show spd=SUPER for the device.
Comment 14 Gary Jennejohn 2021-09-27 13:14:49 UTC
(In reply to Hans Petter Selasky from comment #13)
But if he plugged it into a USB2 hub it will enumerate as 480Mbps.  My motherboard has a number of USB3 and USB2 hubs/ports.  When I plug a USB3 stick into a USB2 port it shows up as 40MB/s whereas it shows up as 400MB/s when plugged into a USB3 port.
He should look at how his USB hubs enumerate in /var/run/dmesg.boot.  For me some are USB3 and others are USB2 although I only have Ryzen USB3 controllers.
Comment 15 Dan Kotowski 2021-09-27 13:27:25 UTC
(In reply to Gary Jennejohn from comment #14)

Everything upstream is coming up as expected.

xhci0: <Generic USB 3.0 controller> iomem 0x3100000-0x310ffff irq 25 on acpi0
xhci0: 64 bytes context size, 32-bit DMA
usbus0 on xhci0
xhci0: usbpf: Attached
uhub1 on usbus0
uhub1: <Generic XHCI root HUB, class 9/0, rev 3.00/1.00, addr 1> on usbus0
muge0 on uhub1
muge0: <Microchip LAN7800, rev 2.10/3.00, addr 1> on usbus0
muge0: Chip ID 0x7800 rev 0002
miibus0: <MII bus> on muge0
ukphy0: <Generic IEEE 802.3u media interface> PHY 1 on miibus0
ukphy0: OUI 0x00800f, model 0x0013, rev. 2
ukphy0:  none, 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-master, 1000baseT-FDX, 1000baseT-FDX-master, auto
ue0: <USB Ethernet> on muge0
ue0: bpf attached
ue0: Ethernet address: 00:1e:c0:e1:2c:60

jnettlet at Solid-Run says usbus0 is supposed to be wired directly to the USB3.0 controller in the SoC.

SoC block diagram: https://www.nxp.com/assets/images/en/block-diagrams/LX2160-TN.jpg

Board block diagram: https://www.solid-run.com/wp-content/uploads/2020/10/HoneyComb-LX2K-block-diagram.png
Comment 16 Gary Jennejohn 2021-09-27 14:04:55 UTC
(In reply to Dan Kotowski from comment #15)
Wow, neat!  16 A72 CPUs.  But it's not the controller which matters, it's the hub.  The board apparently has 3 USB2 hubs.  Two go to the USB2 headers and one apparently goes to the USB3 headers, based on the board block diagram.  But the latter may only be enabled by a jumper.  Not enough detail in the diagram to tell.
But, if what I write above is correct, you have a 75% probability of plugging the muge into a USB2 hub.
Comment 17 Dan Kotowski 2021-09-27 15:14:12 UTC
(In reply to Gary Jennejohn from comment #16)

From dmesg:

* muge0 on uhub1 as <Generic XHCI root HUB>
* uhub1 on usbus0
* usbus0 on xhci0 as <Generic USB 3.0 controller>

If it helps here's dump_stats and dump_all_desc from `usbconfig -u 0`: https://gist.github.com/9f8da6b6148e8485fb75492bd158ad17
Comment 18 Gary Jennejohn 2021-09-27 15:45:39 UTC
(In reply to Dan Kotowski from comment #17)
Ok, but now I'm wondering why the USB 2.0 HUB shown in the block diagram doesn't show up in dmesg.  On my system all HUBs, including external ones, appear in dmesg.
Comment 19 Dan Kotowski 2021-09-27 15:54:16 UTC
(In reply to Gary Jennejohn from comment #18)
Because I redacted it for focus. You can see the 2.0 hub on usbus1 while I have the GbE adapter on usbus0

# usbconfig show_ifdrv | sort
ugen0.1: <Generic XHCI root HUB> at usbus0, cfg=0 md=HOST spd=SUPER (5.0Gbps) pwr=SAVE (0mA)
ugen0.1.0: uhub1: <Generic XHCI root HUB, class 9/0, rev 3.00/1.00, addr 1>
ugen0.2: <Microchip LAN7800> at usbus0, cfg=0 md=HOST spd=HIGH (480Mbps) pwr=ON (500mA)
ugen0.2.0: muge0: <Microchip LAN7800, rev 2.10/3.00, addr 1>
ugen1.1: <Generic XHCI root HUB> at usbus1, cfg=0 md=HOST spd=SUPER (5.0Gbps) pwr=SAVE (0mA)
ugen1.1.0: uhub0: <Generic XHCI root HUB, class 9/0, rev 3.00/1.00, addr 1>
ugen1.10: <GenesysLogic USB3.0 Hub> at usbus1, cfg=0 md=HOST spd=SUPER (5.0Gbps) pwr=SAVE (0mA)
ugen1.10.0: uhub6: <GenesysLogic>
ugen1.2: <vendor 0x04b4 product 0x6502> at usbus1, cfg=0 md=HOST spd=HIGH (480Mbps) pwr=SAVE (0mA)
ugen1.2.0: uhub2: <vendor 0x04b4 product 0x6502, class 9/0, rev 2.10/50.10, addr 1>
ugen1.3: <GenesysLogic USB2.0 Hub> at usbus1, cfg=0 md=HOST spd=HIGH (480Mbps) pwr=SAVE (500mA)
ugen1.3.0: uhub3: <GenesysLogic USB2.0 Hub, class 9/0, rev 2.10/92.26, addr 2>
ugen1.4: <Logitech USB Receiver> at usbus1, cfg=0 md=HOST spd=FULL (12Mbps) pwr=ON (98mA)
ugen1.4.0: ukbd0: <Logitech USB Receiver, class 0/0, rev 2.00/24.11, addr 3>
ugen1.4.1: ums0: <Logitech USB Receiver, class 0/0, rev 2.00/24.11, addr 3>
ugen1.4.2: uhid0: <Logitech USB Receiver, class 0/0, rev 2.00/24.11, addr 3>
ugen1.5: <Yubico Yubikey NEO OTP+U2F+CCID> at usbus1, cfg=0 md=HOST spd=FULL (12Mbps) pwr=ON (30mA)
ugen1.5.0: ukbd1: <Yubico Yubikey NEO OTP+U2F+CCID, class 0/0, rev 2.00/3.50, addr 10>
ugen1.5.1: uhid1: <Yubico Yubikey NEO OTP+U2F+CCID, class 0/0, rev 2.00/3.50, addr 10>
ugen1.6: <vendor 0x04d9 USB Keyboard> at usbus1, cfg=0 md=HOST spd=FULL (12Mbps) pwr=ON (100mA)
ugen1.6.0: ukbd2: <vendor 0x04d9 USB Keyboard, class 0/0, rev 1.10/1.12, addr 5>
ugen1.6.1: uhid2: <vendor 0x04d9 USB Keyboard, class 0/0, rev 1.10/1.12, addr 5>
ugen1.6.2: ums1: <vendor 0x04d9 USB Keyboard, class 0/0, rev 1.10/1.12, addr 5>
ugen1.7: <vendor 0x0424 product 0x2514> at usbus1, cfg=0 md=HOST spd=HIGH (480Mbps) pwr=SAVE (2mA)
ugen1.7.0: uhub4: <vendor 0x0424 product 0x2514, class 9/0, rev 2.00/b.b3, addr 6>
ugen1.9: <vendor 0x04b4 product 0x6500> at usbus1, cfg=0 md=HOST spd=SUPER (5.0Gbps) pwr=SAVE (0mA)
ugen1.9.0: uhub5: <vendor 0x04b4 product 0x6500, class 9/0, rev 3.00/50.10, addr 8>

And a complete dmesg.boot: https://gist.github.com/abd1093d36334ecc008fe056931b2d30
Comment 20 Dan Kotowski 2021-09-27 17:00:22 UTC
On a whim I attached it to another system I had and it came up at full speed!

https://gist.github.com/7b3f21bbd71236ebd4aaafb500e7790f

Notice that on the Honeycomb (aka hc and the system I'm focused on) we see muge0 come up with `bcdUSB = 0x0210` but on a SuperMicro Xeon system we see `bcdUSB = 0x0310`. And you can even see it come up here with spd=SUPER:



# devinfo -p muge0
muge0 uhub2 usbus0 xhci0 pci1 pcib1 acpi0 nexus0

# usbconfig -u 0 show_ifdrv
ugen0.1: <0x8086 XHCI root HUB> at usbus0, cfg=0 md=HOST spd=SUPER (5.0Gbps) pwr=SAVE (0mA)
ugen0.1.0: uhub2: <0x8086 XHCI root HUB, class 9/0, rev 3.00/1.00, addr 1>
...
ugen0.5: <Microchip LAN7800> at usbus0, cfg=0 md=HOST spd=SUPER (5.0Gbps) pwr=ON (224mA)
ugen0.5.0: muge0: <Microchip LAN7800, rev 2.10/3.00, addr 2>



What's odd is that the only difference between `dump_all_desc` on the Honeycomb and the known-working SuperMicro server is the root hub Manufacturer ID:



1c1
< ugen0.1: <Generic XHCI root HUB> at usbus0, cfg=0 md=HOST spd=SUPER (5.0Gbps) pwr=SAVE (0mA)
---
> ugen0.1: <0x8086 XHCI root HUB> at usbus0, cfg=0 md=HOST spd=SUPER (5.0Gbps) pwr=SAVE (0mA)
13c13
<   iManufacturer = 0x0001  <Generic>
---
>   iManufacturer = 0x0001  <0x8086>



However the actual device comes up with some extra stuff on the SuperMicro (aka sm):



$ diff hc/ugen0.2.dump_all_desc sm/ugen0.5.dump_all_desc
1c1
< ugen0.2: <Microchip LAN7800> at usbus0, cfg=0 md=HOST spd=HIGH (480Mbps) pwr=ON (500mA)
---
> ugen0.5: <Microchip LAN7800> at usbus0, cfg=0 md=HOST spd=SUPER (5.0Gbps) pwr=ON (224mA)
5c5
<   bcdUSB = 0x0210
---
>   bcdUSB = 0x0310
9c9
<   bMaxPacketSize0 = 0x0040
---
>   bMaxPacketSize0 = 0x0009
22c22
<     wTotalLength = 0x0027
---
>     wTotalLength = 0x0039
27c27
<     bMaxPower = 0x00fa
---
>     bMaxPower = 0x0070
45c45
<         wMaxPacketSize = 0x0200
---
>         wMaxPacketSize = 0x0400
49a50,58
>       Additional Descriptor
>
>       bLength = 0x06
>       bDescriptorType = 0x30
>       bDescriptorSubType = 0x07
>        RAW dump:
>        0x00 | 0x06, 0x30, 0x07, 0x00, 0x00, 0x00
>
>
55c64
<         wMaxPacketSize = 0x0200
---
>         wMaxPacketSize = 0x0400
59a69,77
>       Additional Descriptor
>
>       bLength = 0x06
>       bDescriptorType = 0x30
>       bDescriptorSubType = 0x06
>        RAW dump:
>        0x00 | 0x06, 0x30, 0x06, 0x00, 0x00, 0x00
>
>
66c84
<         bInterval = 0x0004
---
>         bInterval = 0x0006
68a87,94
>
>       Additional Descriptor
>
>       bLength = 0x06
>       bDescriptorType = 0x30
>       bDescriptorSubType = 0x00
>        RAW dump:
>        0x00 | 0x06, 0x30, 0x00, 0x00, 0x04, 0x00
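The differences in the diff above are consistent with the same device simply enumerating at SuperSpeed on the SuperMicro. A minimal Python sketch of how those fields decode follows, assuming the standard USB 3.x descriptor layout: bcdUSB is binary-coded decimal, bMaxPacketSize0 is an exponent (2^n bytes) at SuperSpeed, and descriptor type 0x30 is the SuperSpeed Endpoint Companion descriptor, whose third byte (labelled bDescriptorSubType in the generic usbconfig dump) is bMaxBurst. The function names are illustrative only.

```python
import struct

def bcd_usb(value):
    """Decode a BCD bcdUSB field, e.g. 0x0210 -> '2.10'."""
    major = ((value >> 12) & 0xF) * 10 + ((value >> 8) & 0xF)
    minor = (value >> 4) & 0xF
    sub = value & 0xF
    return f"{major}.{minor}{sub}"

def ep0_max_packet(b_max_packet_size0, bcd):
    """EP0 max packet size in bytes; an exponent for USB 3.x, a byte count otherwise."""
    return 1 << b_max_packet_size0 if bcd >= 0x0300 else b_max_packet_size0

def parse_ss_companion(raw):
    """Decode a 6-byte SuperSpeed Endpoint Companion descriptor (type 0x30)."""
    b_length, b_type, max_burst, attrs, bytes_per_interval = struct.unpack("<BBBBH", raw)
    if b_length != 6 or b_type != 0x30:
        raise ValueError("not a SuperSpeed Endpoint Companion descriptor")
    return {
        "packets_per_burst": max_burst + 1,  # bMaxBurst is stored as count - 1
        "bmAttributes": attrs,
        "wBytesPerInterval": bytes_per_interval,
    }

print(bcd_usb(0x0210), ep0_max_packet(0x40, 0x0210))  # Honeycomb values
print(bcd_usb(0x0310), ep0_max_packet(0x09, 0x0310))  # SuperMicro values

# Raw dumps copied from the diff above: two bulk endpoints, then the interrupt endpoint.
for raw in ([0x06, 0x30, 0x07, 0x00, 0x00, 0x00],
            [0x06, 0x30, 0x06, 0x00, 0x00, 0x00],
            [0x06, 0x30, 0x00, 0x00, 0x04, 0x00]):
    print(parse_ss_companion(bytes(raw)))
```

Read this way, the "extra stuff" is the set of companion descriptors that only exist on a SuperSpeed link, and the wMaxPacketSize change from 0x0200 (512) to 0x0400 (1024) is the usual bulk packet size for high speed versus SuperSpeed.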
Comment 21 Dan Kotowski 2021-09-27 17:11:44 UTC
Sigh... I just tested a handful of other USB3.0 devices I have lying around and confirmed they ALL attach to the HC as spd=HIGH.

So the problem is upstream somewhere and it is very unlikely an issue with muge :(

And just to be sure I checked with jnettlet@SR and he said it should all be USB3.0 from the SoC to the specific port I'm using (rear IO, lower USB3.0).




At this point I'm not sure what to try next, but open to all ideas.
Comment 22 Gary Jennejohn 2021-09-27 18:26:49 UTC
(In reply to Dan Kotowski from comment #21)
Well, that's progress in a way.  If you haven't done it already I'd try every USB port available.  Might be a wiring error on the board.
Another thought is that the Generic driver might not be supporting all the functionality of the USB3 controller as well as a driver which targets a specific controller type, such as the Intel USB3 controller on the SuperMicro board.
Note that the Generic driver reports that the controller on the HC is limited to 32-bit DMA transfers.  Can't say whether that's correct since I know nothing about the controller being used.
Comment 23 Bjoern A. Zeeb freebsd_committer 2021-09-27 20:21:20 UTC
(In reply to Dan Kotowski from comment #21)

I have the same issue with an ure(4).
I might go ahead and spend some more time on this in the next few days unless you figure it out first;  I had always blamed the ure(4).
Comment 24 Dan Kotowski 2021-09-27 20:55:52 UTC
Do our xhci drivers not support full-power mode for hubs?

[root@honeycomb ~]# usbconfig ugen1.1 power_on
usbconfig: could not set power ON: Invalid argument

jnettlet has suggested that powersaving on the root hubs should be disabled as it can lead to improperly negotiating the max speed on the bus.
Comment 25 Mark Millard 2021-09-27 22:46:14 UTC
(In reply to Dan Kotowski from comment #21)

From a HoneyComb that was delivered in June 2021 (the same report I made in SolidRun's Discord area, then some extra notes):

# usbconfig
ugen1.1: <Generic XHCI root HUB> at usbus1, cfg=0 md=HOST spd=SUPER (5.0Gbps) pwr=SAVE (0mA)
ugen0.1: <Generic XHCI root HUB> at usbus0, cfg=0 md=HOST spd=SUPER (5.0Gbps) pwr=SAVE (0mA)
ugen1.2: <vendor 0x04b4 product 0x6502> at usbus1, cfg=0 md=HOST spd=HIGH (480Mbps) pwr=SAVE (0mA)
ugen0.2: <Realtek USB 10/100/1000 LAN> at usbus0, cfg=0 md=HOST spd=SUPER (5.0Gbps) pwr=ON (72mA)
ugen1.3: <vendor 0x0424 product 0x2514> at usbus1, cfg=0 md=HOST spd=HIGH (480Mbps) pwr=SAVE (2mA)
ugen1.4: <vendor 0x04b4 product 0x6500> at usbus1, cfg=0 md=HOST spd=SUPER (5.0Gbps) pwr=SAVE (0mA)
ugen1.5: <Samsung PSSD T7 Touch> at usbus1, cfg=0 md=HOST spd=SUPER (5.0Gbps) pwr=ON (224mA)
ugen1.6: <Samsung PSSD T7 Touch> at usbus1, cfg=0 md=HOST spd=SUPER (5.0Gbps) pwr=ON (224mA)

I've never had problems with USB3 devices that I've tried ending up without SUPER (5.0Gbps) status. (But it is only a few types of devices.) The Realtek USB
10/100/1000 LAN is plugged into the port nearest to the board.

I've been running various FreeBSD versions over time. I use bectl to boot
any of releng/13.0.0 , stable/13, and main [so: 14], mostly 13.0R and main.
For example (root-is-ZFS):

# uname -apKU
FreeBSD CA72_16Gp_ZFS 14.0-CURRENT FreeBSD 14.0-CURRENT #12 main-n249019-0637070b5bca-dirty: Tue Aug 31 02:24:20 PDT 2021     root@CA72_16Gp_ZFS:/usr/obj/BUILDs/main-CA72-nodbg-clang/usr/main-src/arm64.aarch64/sys/GENERIC-NODBG-CA72  arm64 aarch64 1400032 1400032

I usually boot an Optane in the PCIe slot but also boot an external USB3 SSD that
has a slightly modified copy of the Optane (recopied & adjusted as I progress).
The system is built with -mcpu=cortex-a72 (mentioned because it is unusual).
The USB3 ssd can boot RPi4B's with 8 GiBytes of RAM. (I avoid ZFS when there is
less than 8 GiBytes.) None of those systems have had problems identifying USB3
SUPER as appropriate. I've used the Realtek USB3 LAN type of dongle on all of
them.
Comment 26 Dan Kotowski 2021-09-28 12:45:57 UTC
I believe that Bjoern and I have earlier revisions of the boards, while Mark has a more recent one. And I know Solid-Run had to fix at least 1 USB erratum (flipped TX/RX on 1 port) so I can imagine there may be others.

Of note, jnettlet has mentioned the following:

"""
I have [usb powersaving] turned off on all my boards.  Synopsys did have a bug with powersave and hubs in their IP which we have the errata workaround enabled for...but even then some devices are just flaky

a patch is merged in [Linux].  the feature is called parkmode

I am setting that in UEFI though.  Unless maybe BSD is re-enabling it
"""

If it helps, this is the raw DSDT ASL for the USB parts of the firmware:

https://github.com/SolidRun/edk2-platforms/blob/24698f90b79facfbbfc4067b39a4ddf8c7fdfa88/Silicon/NXP/LX2160A/AcpiTables/Dsdt/Usb.asl

You can easily see it setting `snps,dis_rxdet_inp3_quirk` - perhaps we ignore or unset this?

https://github.com/torvalds/linux/blob/835d31d319d9c8c4eb6cac074643360ba0ecab10/drivers/usb/dwc3/core.h#L1065

 * @dis_rxdet_inp3_quirk: set if we disable Rx.Detect in P3
Comment 27 Bjoern A. Zeeb freebsd_committer 2021-09-29 17:12:22 UTC
(In reply to Dan Kotowski from comment #26)

Okay, so we are talking multiple issues here.

On this PR let us focus on the muge(4) issue.   The other HoneyComb related issues seem unrelated to this (probable) driver problem.

Dan, could you post a full

     usbconfig -d 1.9 dump_curr_config_desc

(replace 1.9 with the ugen your muge(4) is attached as currently), so we have a better idea what's going on there on the USB side.
Comment 28 Dan Kotowski 2021-09-29 17:17:12 UTC
usbconfig -d 1.9 dump_curr_config_desc: https://gist.github.com/7c2b64defe008cda815d8b4ac2158136

Also worth noting is that the ACTUAL controller in the SOC is a Synopsys DesignWare Core SuperSpeed: https://www.kernel.org/doc/html/latest/driver-api/usb/dwc3.html

Sure enough the 2 Known Limitations are OUT Transfer Size and TRB Ring Size. I'm not sure if that's necessarily related, but the device only seems to fail under large TX, not RX.

A few other notes:

1. ugen0.1 refuses to come up to superspeed, even under Linux - this may just be an issue with early revision boards. I intend to avoid using this bus entirely for now.

2. I was able to get the bus and devices up to superspeed hanging off of ugen1.1 under Linux _sometimes_ but only once I disabled powersaving on everything.
Comment 29 Gary Jennejohn 2021-09-29 18:44:09 UTC
(In reply to Dan Kotowski from comment #28)
There's a driver for this controller:
/sys/conf/files.arm64:dev/usb/controller/dwc3.c      optional fdt dwc3
Don't know whether you use fdt.
Comment 30 Bjoern A. Zeeb freebsd_committer 2021-09-29 20:03:37 UTC
(In reply to Gary Jennejohn from comment #29)

And I have ACPI attachments for it; cleaning this up and will post a review.
Again this all doesn't help with the muge(4) problem so can we stay focused here?
Comment 31 Dan Kotowski 2021-10-05 15:21:18 UTC
I've been able to trigger this behavior as well from Firefox by reloading ~20 tabs at once, even when I'm able to get the device to negotiate superspeed.

ugen1.8: <Microchip LAN7800> at usbus1, cfg=0 md=HOST spd=SUPER (5.0Gbps) pwr=ON (224mA)
ugen1.8.0: muge0: <Microchip LAN7800, rev 3.10/3.00, addr 8>

Something I noticed is that when the interface begins flapping, IPv4 will sometimes remain partially usable but IPv6 fails entirely. I'm not sure if it's related, but I also regularly see this in /var/log/messages:

lock order nd6 list -> lle established at:
#0 0xffff0000004e3368 at witness_checkorder+0x438
#1 0xffff000000470574 at _rw_wlock_cookie+0x74
#2 0xffff00000064ba14 at nd6_llinfo_timer+0xa8
#3 0xffff000000493260 at softclock_call_cc+0x13c
#4 0xffff00000049356c at softclock+0x44
#5 0xffff0000004329a0 at ithread_loop+0x2a8
#6 0xffff00000042efa8 at fork_exit+0x74
#7 0xffff00000077bbc4 at fork_trampoline+0x14
lock order lle -> nd6 list attempted at:
#0 0xffff0000004e3b3c at witness_checkorder+0xc0c
#1 0xffff000000470574 at _rw_wlock_cookie+0x74
#2 0xffff000000651a20 at defrouter_remove+0x40
#3 0xffff00000064e7c0 at nd6_na_input+0x848
#4 0xffff0000006244cc at icmp6_input+0xd2c
#5 0xffff00000063d078 at ip6_input+0xe5c
#6 0xffff0000005b4044 at netisr_dispatch_src+0xe4
#7 0xffff0000005980ac at ether_demux+0x174
#8 0xffff000000599734 at ether_nh_input+0x3f8
#9 0xffff0000005b4044 at netisr_dispatch_src+0xe4
#10 0xffff000000598584 at ether_input+0x80
#11 0xffff0000002f2b50 at uether_rxflush+0x8c
#12 0xffff0000002e1778 at muge_bulk_read_callback+0x120
#13 0xffff0000002dd9a0 at usbd_callback_wrapper+0x6b0
#14 0xffff0000002decb0 at usb_command_wrapper+0x124
#15 0xffff0000002ddb38 at usb_callback_proc+0x30
#16 0xffff0000002d8c6c at usb_process+0x10c
#17 0xffff00000042efa8 at fork_exit+0x74
Comment 32 Dan Kotowski 2021-10-05 15:27:43 UTC
$ curl -4 icanhazip.com ; curl -6 icanhazip.com
73.163.228.147
curl: (7) Couldn't connect to server

$ ifconfig ue0 inet
ue0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=80009<RXCSUM,VLAN_MTU,LINKSTATE>
        inet 192.168.1.99 netmask 0xffffff00 broadcast 192.168.1.255

$ ifconfig ue0 inet6
ue0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=80009<RXCSUM,VLAN_MTU,LINKSTATE>
        inet6 fe80::21e:c0ff:fee1:2c60%ue0 prefixlen 64 scopeid 0x3
        inet6 2001:470:e35c:REDACTED prefixlen 64
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>

$ netstat -6rnW
Routing tables

Internet6:
Destination                       Gateway                       Flags   Nhop#    Mtu    Netif Expire
::/96                             ::1                           UGRS        6  16384      lo0
::1                               link#1                        UHS         1  16384      lo0
::ffff:0.0.0.0/96                 ::1                           UGRS        6  16384      lo0
2001:470:e35c:1::/64              link#3                        U           8   1500      ue0
2001:470:e35c:1:REDACTED          link#3                        UHS         4  16384      lo0
fe80::/10                         ::1                           UGRS        6  16384      lo0
fe80::%lo0/64                     link#1                        U           3  16384      lo0
fe80::1%lo0                       link#1                        UHS         2  16384      lo0
fe80::%ue0/64                     link#3                        U           8   1500      ue0
fe80::21e:c0ff:fee1:2c60%ue0      link#3                        UHS         4  16384      lo0
ff02::/16                         ::1                           UGRS        6  16384      lo0
Comment 33 Dan Kotowski 2021-10-05 16:51:14 UTC
At boot, everything working:

$ netstat -6rnW
Routing tables

Internet6:
Destination                       Gateway                       Flags   Nhop#    Mtu    Netif Expire
::/96                             ::1                           UGRS        6  16384      lo0
default                           2001:470:e35c:1::1            UGS         7   1500      ue0
::1                               link#1                        UHS         1  16384      lo0
::ffff:0.0.0.0/96                 ::1                           UGRS        6  16384      lo0
2001:470:e35c:1::/64              link#3                        U           5   1500      ue0
2001:470:e35c:1:REDACTED          link#3                        UHS         4  16384      lo0
fe80::/10                         ::1                           UGRS        6  16384      lo0
fe80::%lo0/64                     link#1                        U           3  16384      lo0
fe80::1%lo0                       link#1                        UHS         2  16384      lo0
fe80::%ue0/64                     link#3                        U           5   1500      ue0
fe80::21e:c0ff:fee1:2c60%ue0      link#3                        UHS         4  16384      lo0
ff02::/16                         ::1                           UGRS        6  16384      lo0

It looks like it loses the default v6 route for some reason and never recovers it?
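The missing route can be confirmed mechanically by comparing the two `netstat -6rnW` tables. Here is a minimal Python sketch, assuming the column order shown above (Destination, Gateway, Flags, Nhop#, Mtu, Netif); the helper name and the shortened sample tables are illustrative.

```python
def destinations(netstat_output):
    """Return {destination: (gateway, netif)} from `netstat -rn` style rows."""
    routes = {}
    in_table = False
    for line in netstat_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Destination":
            in_table = True  # rows after the header line are routes
            continue
        if in_table and len(fields) >= 6:
            routes[fields[0]] = (fields[1], fields[5])
    return routes

# Shortened copies of the tables from comments 33 (at boot) and 32 (while flapping).
AT_BOOT = """\
Internet6:
Destination                       Gateway                       Flags   Nhop#    Mtu    Netif Expire
default                           2001:470:e35c:1::1            UGS         7   1500      ue0
2001:470:e35c:1::/64              link#3                        U           5   1500      ue0
fe80::%ue0/64                     link#3                        U           5   1500      ue0
"""
WHILE_FLAPPING = """\
Internet6:
Destination                       Gateway                       Flags   Nhop#    Mtu    Netif Expire
2001:470:e35c:1::/64              link#3                        U           8   1500      ue0
fe80::%ue0/64                     link#3                        U           8   1500      ue0
"""

boot = destinations(AT_BOOT)
flapping = destinations(WHILE_FLAPPING)
missing = sorted(set(boot) - set(flapping))
for dest in missing:
    gateway, netif = boot[dest]
    print(f"missing route: {dest} via {gateway} on {netif}")
```

With the full tables pasted in, the only destination present at boot and absent later is the default route via 2001:470:e35c:1::1, which matches the observation above.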