I'm using if_ure on a USB-C dock station. I'm having the following random issue: Dec 25 12:20:10 Fryzen495 kernel: ue0: link state changed to UP Dec 25 12:20:10 Fryzen495 kernel: ue0: link state changed to DOWN Dec 25 12:20:10 Fryzen495 kernel: ue0: link state changed to UP Dec 25 12:20:58 Fryzen495 kernel: ue0: link state changed to DOWN Dec 25 12:21:01 Fryzen495 kernel: ue0: link state changed to UP Dec 25 12:21:07 Fryzen495 dhclient[20388]: New IP Address (ue0): 192.168.1.145 Dec 25 12:21:07 Fryzen495 dhclient[21191]: New Subnet Mask (ue0): 255.255.255.0 Dec 25 12:21:07 Fryzen495 dhclient[22099]: New Broadcast Address (ue0): 192.168.1.255 Dec 25 12:21:07 Fryzen495 dhclient[22651]: New Routers (ue0): 192.168.1.254 After digging into the code, I've discovered that the issue is caused by the following race condition: In mii/mii.c, the function miibus_linkchg(device_t dev) reads mii->mii_media_status and sets the link state accordingly. static void miibus_linkchg(device_t dev) { ... if (mii->mii_media_status & IFM_AVALID) { if (mii->mii_media_status & IFM_ACTIVE) link_state = LINK_STATE_UP; else link_state = LINK_STATE_DOWN; ... } On the other hand, the function mii_pollstat running on another thread, polls the media status, which is using rgephy.c code in my case. (via PHY_SERVICE call) static void rgephy_status(struct mii_softc *sc) { mii->mii_media_status = IFM_AVALID; mii->mii_media_active = IFM_ETHER; /* HERE IS THE ISSUE The read in miibus_linkchg is randomly seeing mii->mii_media_status as IFM_AVALID, but it does not yet have the IFM_ACTIVE flag set, which makes it trigger a link state change... */ if (rgephy_linkup(sc) != 0) mii->mii_media_status |= IFM_ACTIVE; ... This is an issue in all miibus modules, rgephy.c, truephy.c, mlphy.c, etc... Please note that also in mii_pollstat, mii->mii_media_status is initialized to 0 (not sure why), and that also races with miibus_linkchg. There are many ways to fix it, locally I'm not resetting mii->mii_media_status until rgephy_linkup(sc) returns link down.
I believe I have seen this several times in the past, mostly when I am doing diskless boots via pxe/ipxe/nfs.
^Triage: set to net@ as this affect miibus.
Created attachment 221274 [details] Workaround race in link status change (In reply to Rodney W. Grimes from comment #1) The various mii modules are "terribly" SMP broken, they all reset the media status to mii->mii_media_status = IFM_AVALID; then call the hardware specific function to check the link and eventually set mii->mii_media_status |= IFM_ACTIVE; But during this time the call to miibus_linkchg might trigger a link change, as it randomly sees the IFM_AVALID flag only, before the IFM_ACTIVE being set. Attached is the workaround I'm currently applying to avoid the issue on my T495 with the Thinkpad USB-C gen2 docking station (My hardware uses rgephy.c). If you think this is the way to go, I can provide a full patch for the rest of the mii bus modules.
Could you extend your patch to fix the other cases you see aswell, and create a patch at https://reviews.freebsd.org/differential/ with full context?
(In reply to Hans Petter Selasky from comment #4) Yes I'll do that.
I had a deeper look into the issue, applying the work-around described in comment 3 is definitely not the way to go, the problem lies somewhere else. Actually the code path that end up calling miibus_linkchg is mii_tick and mii_poll (through mii_phy_update). At the first look they seem to be running without lock, but looking at the usb ethernet specific code (if_ure.c in my case), a common lock is acquired before calling those functions. in ure_tick the lock is already acquired beforehand, by the usb ethernet code static void ure_tick(struct usb_ether *ue) .... URE_LOCK_ASSERT(sc, MA_OWNED); mii_tick(mii); .... ure_ifmedia_sts is what polls for media status, and it acquires the same lock before calling mii_pollstat. static void ure_ifmedia_sts(struct ifnet *ifp, struct ifmediareq *ifmr) ... URE_LOCK(sc); mii_pollstat(mii); ... URE_UNLOCK(sc); What I'm seeing is that, despite the lock, sometimes they run both in parallel, which is causing the race in the media status... I'm not sure why, and it is kind of driving me crazy.
Did you try using dtrace, or adding simple prints: printf("%d %s:%d\n", ticks, __FILE__, __LINE__); --HPS
Yes I did add simple printf, showing the thread id, and this is what I have Jan 8 16:24:47 frix230 kernel: tick begin thread 0xfffff800a2343000 Jan 8 16:24:47 frix230 kernel: tick finish thread 0xfffff800a2343000 Jan 8 16:24:48 frix230 kernel: Jan 8 16:24:48 frix230 kernel: poll begin thread 0xfffff8001e67c000 Jan 8 16:24:48 frix230 kernel: poll finish thread 0xfffff8001e67c000 Jan 8 16:24:48 frix230 kernel: Jan 8 16:24:48 frix230 kernel: Jan 8 16:24:48 frix230 kernel: poll begin thread 0xfffff8001e67c000 Jan 8 16:24:48 frix230 kernel: poll finish thread 0xfffff8001e67c000 Jan 8 16:24:48 frix230 kernel: Jan 8 16:24:48 frix230 kernel: tick begin thread 0xfffff800a2343000 Jan 8 16:24:48 frix230 kernel: Jan 8 16:24:48 frix230 kernel: poll begin thread 0xfffff8001e67c000 Jan 8 16:24:48 frix230 kernel: Link down detected from thread 0xfffff800a2343000 Jan 8 16:24:48 frix230 kernel: tick finish thread 0xfffff800a2343000 Jan 8 16:24:48 frix230 kernel: ue0: link state changed to DOWN Jan 8 16:24:48 frix230 kernel: Jan 8 16:24:48 frix230 kernel: Link up detected from thread 0xfffff8001e67c000 Jan 8 16:24:48 frix230 kernel: poll finish thread 0xfffff8001e67c000 Jan 8 16:24:48 frix230 kernel: ue0: link state changed to UP Jan 8 16:24:48 frix230 kernel: All those print are after having acquired the lock and before releasing it. As you can see, the poll thread begins while the tick one didn't finish yet.
How can that be even possible? I've added also the following static variable in ure_tick URE_LOCK_ASSERT(sc, MA_OWNED); printf("\ntick begin thread %p\n", curthread); ticking = 1; mii_tick(mii); ticking = 0; printf("\ntick end thread %p\n", curthread); And in ure_ifmedia_sts URE_LOCK(sc); printf("poll begin %p\n", curthread); if (!ticking) mii_pollstat(mii); printf("poll end %p\n", curthread); ... URE_UNLOCK(sc); And it is running fine since last Friday, which proves further that the poll thread is being executed while the tick on is running, despite the fact that they use the same lock! The only difference I personally see here, unlike mii_tick, ure_ifmedia_sts is called at the end of a syscall, not sure if this is relevant here.
Hi, Can you print "%s" curthread->td_proc->p_comm and curthread->td_name Probably two threads are racing there! --HPS
The additional strings to print changed a bit the timing, it is no longer racing on media status, but media type (it went from 1000baseT <full-duplex> to nothing during the race below). Jan 11 14:40:42 frix230 kernel: tick begin thread 0xfffff8027121a740 ure0 usb Jan 11 14:40:42 frix230 kernel: Jan 11 14:40:42 frix230 kernel: poll begin 0xfffff801ca32f000 ifconfig ifconfig Jan 11 14:40:42 frix230 kernel: poll finish 0xfffff801ca32f000 ifconfig ifconfig Jan 11 14:40:42 frix230 kernel: Jan 11 14:40:42 frix230 kernel: poll begin 0xfffff801ca32f000 ifconfig ifconfig Jan 11 14:40:42 frix230 kernel: poll finish 0xfffff801ca32f000 ifconfig ifconfig Jan 11 14:40:42 frix230 kernel: Jan 11 14:40:42 frix230 kernel: poll begin 0xfffff801ca32f000 ifconfig ifconfig Jan 11 14:40:42 frix230 kernel: poll finish 0xfffff801ca32f000 ifconfig ifconfig Jan 11 14:40:42 frix230 kernel: tick end thread 0xfffff8027121a740 ure0 usb For the test, I'm calling ifconfig continuously from a script, but any network monitoring applet will trigger the issue.
Any possible explanation of what is happening here? Thanks.
I had a better look at the code, and I think I have an explanation of what is the root cause of the issue. The function uether_rxflush releases the lock, causing a mess up with lock/unlock orders. void uether_rxflush(struct usb_ether *ue) { ... UE_LOCK_ASSERT(ue, MA_OWNED); n = mbufq_flush(&ue->ue_rxq); UE_UNLOCK(ue); NET_EPOCH_ENTER(et); ... NET_EPOCH_EXIT(et); UE_LOCK(ue); } I've dtraced the calls 2 64153 ure_ifmedia_sts:return 155646206056 0 64153 ure_ifmedia_sts:return 155647214955 0 64152 ure_ifmedia_sts:entry 155647476515 2 64153 ure_ifmedia_sts:return 155648459628 2 64152 ure_ifmedia_sts:entry 155648490676 1 64220 uether_rxflush:entry 155649463422 <- 0 64153 ure_ifmedia_sts:return 155649466344 <- 1 64221 uether_rxflush:return 155649488201 When uether_rxflush:entry is called, it has the lock, but then it is releasing it and re-acquiring it later, causing the mess with ure_ifmedia_sts IMHO.
Even without UE_LOCK and UE_UNLOCK of comment 13 the issue is still occurring... The following dtrace is observed when the link state fictitiously changes 2 64192 ue_tick_task:entry 242528044179 2 64175 ure_tick:entry 242528050925 1 64152 ure_ifmedia_sts:entry 242529137165 2 64176 ure_tick:return 242529369904 1 64152 ure_ifmedia_sts:entry 242529474008 2 64153 ure_ifmedia_sts:return 242530486029 For whatever reason... Normally the sequence should be 2 64152 ure_ifmedia_sts:entry 243552084077 2 64153 ure_ifmedia_sts:return 243553071554 0 64192 ue_tick_task:entry 243553441743 0 64175 ure_tick:entry 243553444178 2 64176 ure_tick:return 243554572011 3 64152 ure_ifmedia_sts:entry 243663084286 3 64153 ure_ifmedia_sts:return 243664074467
Probably all of the chip configuration and status reads need to be serialized on a USB process callback thread. --HPS
(In reply to Hans Petter Selasky from comment #15) Probably, but I'm really not sure how to deal in a usb callback thread with the SIOCSIFMEDIA ioctl. Also, I'm still not able to identify the root cause of the issue, the fact that the usb thread and the ifconfig ioctl thread use the same lock should avoid any race, but for some reason it is not happening. I would still like to understand what is going wrong.
After digging further into the code, I'm still unable to understand how the inversion described in comment 14 is even possible... On a spare system T430 with a USB ethernet, I've compiled the kernel with options WITNESS options INVARIANTS options INVARIANT_SUPPORT options DEBUG_LOCKS options DIAGNOSTIC The up/down link issue is even more frequent with the debug kernel, but all I see is if_delmulti_locked: detaching ifnet instance 0xfffff8001880b800 ue0: link state changed to DOWN ue0: link state changed to UP ue0: link state changed to DOWN ue0: link state changed to UP ue0: link state changed to DOWN ue0: link state changed to UP This issue is driving me crazy, any fbsd kernel locking expert to help here?
dtrace does not work if a function was inlined in the same C-file it was declared! Try to compile a kernel w/o optimizations, -O0, first. --HPS
(In reply to Hans Petter Selasky from comment #18) Compiling the kernel with COPTFLAGS=-O0 and CFLAGS=-O0 resulted in non-working dtrace... (dtrace: invalid probe specifier... "/usr/lib/dtrace/mbuf.d", line 1: failed to copy type of 'm_data': Type information is in parent and unavailable.) But anyway the problem occurs, with no particular dmesg from witness or debug_lock. On the other hand, on the standard generic kernel, when the issue is not occurring, the following dtrace is shown # dtrace -n "fbt:kernel:mii_pollstat:*,fbt:uether:ue_tick_task:*,fbt:if_ure:ure_tick:*,fbt:if_ure:ure_ifmedia_sts:*,fbt:kernel:mii_tick:*" 0 92491 ure_ifmedia_sts:entry 0 35254 mii_pollstat:entry 2 35255 mii_pollstat:return 2 92492 ure_ifmedia_sts:return 3 92425 ue_tick_task:entry 3 92489 ure_tick:entry 3 35276 mii_tick:entry 3 35277 mii_tick:return 3 92490 ure_tick:return Which are the two "code-path" checking for link status. Now with the same dtrace probes, when the link up/down problem occurs, I see 3 92425 ue_tick_task:entry 3 92489 ure_tick:entry 3 35276 mii_tick:entry 0 92491 ure_ifmedia_sts:entry <---- 0 35254 mii_pollstat:entry <---- 3 35277 mii_tick:return 3 92490 ure_tick:return How that is possible? ure_ifmedia_sts should acquire the same mutex lock as ue_tick_task before calling mii_pollstat. And this is not happening apparently.
Can you make dtrace print the backtrace? Maybe it is another ethernet device? --HPS
(In reply to Hans Petter Selasky from comment #20) Not sure how to do that, but anyway, it is the only USB ethernet device I have that uses if_ure and mii bus, if not loaded, nothing is shown in the mentioned dtrace command.
I believe I am having the same issue with if_axe using a usb to ethernet adapter based on the ASIX AX88178 chipset. It works (sort of) and I can get the full speed I'm expecting but dmesg shows the interface as going UP and DOWN constantly which seems to manifest as dropped packets. If there's any information I can provide please ask. Has there been any progress on this? (In reply to Ali Abdallah from comment #6) I know it's not a good fix but would you be willing to apply that workaround to if_axe so I could build it myself?
(In reply to vxasxfepbikrfqdruz from comment #22) Unfortunately I had no time recently to continue debugging the issue, I'm still using the ugly workaround in my local branch
Created attachment 236901 [details] dtrace ure usb traces I got another machine on which I didn't apply the workaround, running 13.1, with a USB Lan adapter that uses if_ure, I used the following dtrace probes. dtrace -n 'fbt:miibus:mii_pollstat:*,fbt:uether:ue_tick_task:*,fbt:if_ure:ure_tick:*,fbt:if_ure:ure_ifmedia_sts:*,fbt:miibus:mii_tick:*,fbt:miibus:miibus_linkchg:*, fbt:usb:* { printf("%p", curthread); }' You can see from the attached debug log (numbered lines) 1 48573 ure_ifmedia_sts:entry fffffe00dc75c720 1 48475 mii_pollstat:entry fffffe00dc75c720 But before exit entry of re_ifmedia_sts:return of thread fffffe00dc75c720, you can see: 1052 0 48573 ure_ifmedia_sts:entry fffffe00e00271e0 1053 2 27791 usb_command_wrapper:entry fffffe0082a17ac0 1054 2 27738 usbd_callback_wrapper:entry fffffe0082a17ac0 1055 0 48475 mii_pollstat:entry fffffe00e00271e0 And that exactly when the link up/down happened. So thread fffffe00e00271e0 managed to go ahead and call mii_pollstat, acquiring the sc mutex, despite that thread fffffe00dc75c720 didn't release it yet, you can see that the return from "polling the media state" on thread fffffe00dc75c720 comes way after the ure_ifmedia_sts:entry of thread fffffe00e00271e0 1517 3 48476 mii_pollstat:return fffffe00dc75c720 1518 3 48574 ure_ifmedia_sts:return fffffe00dc75c720 __snip if_ure__ ure_ifmedia_sts { ... URE_LOCK(sc); mii_pollstat(mii); ifmr->ifm_active = mii->mii_media_active; ifmr->ifm_status = mii->mii_media_status; URE_UNLOCK(sc); } __snip__ The mutex locked/released by URE_LOCK, is the same mutex used in if_ure, USB xfer and USB transfer, so I guess something is not playing well in the release/acquire sequence... Will try if time permits to dig more into the USB transfer/process code, but I have to say that I'm extremely surprised that this bug didn't attract much attention, it actually renders any USB Ethernet device unusable on FreeBSD...
(In reply to Ali Abdallah from comment #24) Yeah, I was pretty surprised as well that this is entirely broken despite several USB to ethernet adapters being listed as supported by FreeBSD. This also affects pfSense and opnSense which you think would generate some attention about this, but they just seem to recommend not using USB adapters at all. If it's any help when I was getting my USB adapters I checked with the supported devices list and the comments on amazon. There were several comments saying their adapters worked fine with freebsd with the latest one being from October 2016. That would mean this problem was probably introduced sometime between v10.4 and now. This is kinda over my head but I would really appreciate if this were fixed and I was able to build the fixed mii module. Thank you for your efforts. P.S. This is only sort of related but there's another issue with USB adapters if you have more than one. They get picked up at boot in a random order so which adapter is which interface will get shuffled every boot. I was able to mitigate that using the ethname package.
These issues are not forgotten. It just takes a while to find good and generic solutions. Maybe all MII bus code should be executed for the same thread? Would remove the need for depending on locking. --HPS
(In reply to vxasxfepbikrfqdruz from comment #22) Just a me too here. I have an AX88179A USB-C adapter. Feel free to ask for any info.
(In reply to ml from comment #27) P.S. I tried the patch from comment #3, but it doesn't help in my case.
(In reply to ml from comment #28) your device doesn't use mii bus rgephy code, but something else. See which part your device uses from sys/dev/mii/ and adapt the workaround accordingly.
(In reply to Hans Petter Selasky from comment #26) > Maybe all MII bus code should be executed for the same thread? Not sure what do you mean. The issue is that the ioctl for media status (which invokes the mii bus) races with a currently running tick (ure_tick for example). So the MII bus code runs in a single thread always, but one invoked by ioctl and another one by the tick callback race between each other sometimes (and it is not clear why, as both code paths use the same lock...)
(In reply to Ali Abdallah from comment #29) Doh! Should have been obvious! (At least to someone more expert than me :) Anyway, I made the same changes to ukphy and now LAN seems to work perfectly: I was able to run a full "make installworld" via NFS (which previously would have required unplugging/replugging the adapter at least 4 or 5 times). Thanks.
Hello. I'm an OPNsense user, and I'm having this exact issue. What can I do to help? I have 2 different USB NICs (Anker and Ugreen), and both exhibit this behavior. I'd really like to help solve this issue, because it impacts many and I think it would be very nice to be able to use USB NICs on my firewall PC. Thanks!
(In reply to ml from comment #31) Hi! Any chance you could share what you changed for asix ethernet adapter? (In reply to ml from comment #31)
(In reply to kosolapiy from comment #33) Sure. I'm attaching the patch I applied to CURRENT. Notice: a) I'm not using current, so I also applied the same to 13.3R; b) this does not *solve*, only makes it usable (i.e. hangs and forces me to unplug/replug the adapter several times per days, instead of a few times per minute).
Created attachment 251770 [details] Partially solves for me
(In reply to Hans Petter Selasky from comment #26) The issue is that ioctl thread to get the media status, which basically started by userspace races with ue_tick_task. And I finally know why. The usbd_do_request_flags called releases the lock and then later acquire it (see the code). So, an ioctl waiting on the same lock will be executed together with ue_tick_task (because they aren't executed from the same thread), causing the race observed here and basically breaking all usb ethernet devices. I will shortly attach a patch to Sync ioctl/tick media status changes.
Created attachment 255142 [details] usb_ethernet Sync ioctl/tick media status changes This solves the issue without touching the mii code or any specific usb ethernet device code (like if_ure and others for example) I tested it and so far so good. Please review it/test it extensively.
(In reply to Ali Abdallah from comment #37) The patch in comment 37 could be a bit optimized, yes! The usb process/request code is completely broken when it comes to different threads sharing ue_lock (ioctl versus uether tick), the usb request code releases the acquired lock, making ioctl thread waiting for the same lock to be scheduled for execution, lovely! To be more precise. ioctly waiting for ue_lock, which is acquired by ue_tick, but then the tick code will call at some point usbd_do_request_flags, which will release the lock at the beginning (and then acquire it later), but in the meantime we have our "cute" ioctl thread waken-up only to report wrong media_status value!
Created attachment 255162 [details] usb_ethernet Sync ioctl/tick media status changes v2 patch v2, this should be enough to sync ioctl with tick usb net threads.
(In reply to Ali Abdallah from comment #39) I see UP / DOWN and network stall with v2. I'll try with v1
(In reply to Mikael Urankar from comment #40) forget my last comment, I forgot to patch my src tree :/
(In reply to Ali Abdallah from comment #39) None of the patches (v1 or v2) work for me.
(In reply to Mikael Urankar from comment #42) What do you mean by they don't work? Can you be more specific? Which usb network device you are using? I tested the patch (v2) on a couple of network devices I have on FreeBSD 14.1 (both use if_ure), I no longer can reproduce the issue and network is stable so far.
(In reply to Mikael Urankar from comment #42) I can still reproduce with both patches the state down/up issue. The patch solves tick vs ioctl. But running 2 or ifconfig from different screens lead to the issue (ioctl thread1 vs ioctl thread2), since usbd_do_request_flags releases the lock and so one of the ioctl thread waiting for it gets executed, altering the media_state flag. I will try to dedicate some time for it to see how this can be solved.
Created attachment 255301 [details] usb_ethernet Sync ioctl/tick media status changes v3 Please kindly give the attach v3 patch a try.
(In reply to Ali Abdallah from comment #43) It's a dell dockstation, rgephy0: <RTL8251/8153 1000BASE-T media interface> PHY 0 on miibus0 No packets are coming in or coming out. Same problem with v3
(In reply to Mikael Urankar from comment #46) But does your dell dockstation work without the patch?
(In reply to Ali Abdallah from comment #47) It kinda works, the interface goes up and down and I can transfer a few GB of data before any more packets pass through.
(In reply to Mikael Urankar from comment #48) The patch doesn't touch packets transferring/receiving. It is only trying to address the up/down issue. Are you using dhclient? can you attach ifconfig?