On my Mac mini (Late 2006, Macmini1,1, MA607*/A), the msk0 interface usually shows this:

Jan 1 22:29:17 beastli kernel: msk0: watchdog timeout
Jan 1 22:29:17 beastli kernel: msk0: link state changed to DOWN
Jan 1 22:29:20 beastli kernel: msk0: link state changed to UP

and after that the interface no longer passes any data. Only a reboot helps. This has been happening since 9.2 (for over a year). I have tried the patches I found in this bug forum, but they didn't help. I presume it happens when a lot of traffic goes over the interface. It happens anywhere from once a week to every 30 minutes.

Is there anything I can switch on or off to configure this? Or anything I can do to avoid the reboot (a kernel module could be a way, as I read in #150257)?

Thanks
Christian

Platform info: PF firewall configured

FreeBSD 9.3-RELEASE-p33 #0: Thu Jan 14 00:48:15 UTC 2016
mskc0: <Marvell Yukon 88E8053 Gigabit Ethernet> mem 0x90200000-0x90203fff irq 16 at device 0.0 on pci1
msk0: <Marvell Technology Group Ltd. Yukon EC Id 0xb6 Rev 0x02> on mskc0
msk0: Ethernet address: 00:16:xx:yy:aa:qq
miibus0: <MII bus> on msk0
e1000phy0: <Marvell 88E1111 Gigabit PHY> PHY 0 on miibus0
e1000phy0: none, 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-master, 1000baseT-FDX, 1000baseT-FDX-master, auto, auto-flow

root@beastli:~ # pciconf -lbceVv @pci0:1:0:0:
mskc0@pci0:1:0:0: class=0x020000 card=0x532111ab chip=0x436211ab rev=0x22 hdr=0x00
    vendor = 'Marvell Technology Group Ltd.'
    device = '88E8053 PCI-E Gigabit Ethernet Controller'
    class = network
    subclass = ethernet
    bar [10] = type Memory, range 64, base 0x90200000, size 16384, enabled
    bar [18] = type I/O Port, range 32, base 0x1000, size 256, disabled
    cap 01[48] = powerspec 2 supports D0 D1 D2 D3 current D0
    cap 03[50] = VPD
    cap 05[5c] = MSI supports 2 messages, 64 bit enabled with 1 message
    cap 10[e0] = PCI-Express 1 legacy endpoint max data 128(128) link x1(x1) speed 2.5(2.5) ASPM disabled(L0s)
    ecap 0001[100] = AER 1 0 fatal 0 non-fatal 1 corrected
    PCI-e errors = Correctable Error Detected
                   Non-Fatal Error Detected
                   Unsupported Request Detected
    Corrected = Receiver Error
    VPD ident = 'Marvell Yukon 88E8053 Gigabit Ethernet Controller'
    VPD ro PN = 'Yukon 88E8053'
    VPD ro EC = 'Rev. 2.2'
    VPD ro MN = 'Marvell'
    VPD ro SN = 'AbCdEfG334455'
    VPD ro CP = ID 01 in map 0x50[0x3cc]
    VPD rw VE = '00'
The same thing happens on my Mac mini (2006, upgraded to a Core2Duo and the firmware from the 2007 model; HardenedBSD 11-STABLE amd64, booted via GRUB2 i386 EFI). One workaround I found on the freebsd.org forum is suspend_bounce. That worked for me once, but now my mini does not come back from the suspended state. I guess using Wi-Fi is a better solution, since the Atheros card of the mini is very well supported :D

By the way, another msk problem: trying to add msk0 as a laggport to lagg0 results in a kernel panic.
I'm getting much the same behavior on an early 2006 Mac Mini (MA206LL/A). The ethernet interface shows up in dmesg just as quoted above:

...
pcib1: <ACPI PCI-PCI bridge> irq 17 at device 28.0 on pci0
pcib1: failed to allocate initial I/O port window: 0x1000-0x1fff
pci1: <ACPI PCI bus> on pcib1
mskc0: <Marvell Yukon 88E8053 Gigabit Ethernet> mem 0x90200000-0x90203fff irq 16 at device 0.0 on pci1
msk0: <Marvell Technology Group Ltd. Yukon EC Id 0xb6 Rev 0x02> on mskc0
msk0: Using defaults for TSO: 65518/35/2048
msk0: Ethernet address: 00:16:cb:a5:22:11
miibus0: <MII bus> on msk0
e1000phy0: <Marvell 88E1111 Gigabit PHY> PHY 0 on miibus0
e1000phy0: none, 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-master, 1000baseT-FDX, 1000baseT-FDX-master, auto, auto-flow
pcib2: <ACPI PCI-PCI bridge> irq 16 at device 28.1 on pci0
pci2: <ACPI PCI bus> on pcib2
...

Things work fine for quite a while (a month or more), and then suddenly the interface fails to operate, with repeating messages like this in syslog:

Mar 17 03:27:15 oldmini kernel: msk0: watchdog timeout
Mar 17 03:27:15 oldmini kernel: msk0: link state changed to DOWN
Mar 17 03:27:18 oldmini kernel: msk0: link state changed to UP
Mar 17 03:28:15 oldmini kernel: msk0: watchdog timeout
Mar 17 03:28:15 oldmini kernel: msk0: link state changed to DOWN
Mar 17 03:28:18 oldmini kernel: msk0: link state changed to UP
Mar 17 03:29:16 oldmini kernel: msk0: watchdog timeout
Mar 17 03:29:16 oldmini kernel: msk0: link state changed to DOWN
Mar 17 03:29:19 oldmini kernel: msk0: link state changed to UP
Mar 17 03:30:16 oldmini kernel: msk0: watchdog timeout
Mar 17 03:30:16 oldmini kernel: msk0: link state changed to DOWN
Mar 17 03:30:19 oldmini kernel: msk0: link state changed to UP

Rebooting fixes it; I've not tried any lesser solutions.

Currently running

FreeBSD oldmini.sss.pgh.pa.us 12.1-RELEASE-p2 FreeBSD 12.1-RELEASE-p2 GENERIC i386

but I also saw this with 12.0.
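Because the failure can take weeks to appear, it may help to count the watchdog messages rather than wait to notice the dead link. The following is only a sketch, assuming syslog lines in the format quoted above; the function name count_watchdog_timeouts is made up for illustration and is not part of FreeBSD.

```shell
#!/bin/sh
# Count "watchdog timeout" kernel messages for one interface.
# Reads syslog text on standard input.
# $1 = interface name, for example msk0
# (count_watchdog_timeouts is a hypothetical helper name.)
count_watchdog_timeouts() {
    # -F matches the string literally, -c prints the number of matching lines.
    # grep exits non-zero when the count is 0, so "|| true" keeps the
    # function's exit status clean in that case.
    grep -cF "kernel: $1: watchdog timeout" || true
}

# Example (not run here):
#   count_watchdog_timeouts msk0 < /var/log/messages
```

A cron job could compare the count against the previous run and send a notification when it grows; what to do in response (reboot, alert, nothing) is left open here, since the thread reports that only a reboot restores the link.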
IMHO this hardware is buggy and has never worked reliably for me. I am still using it as part of a lagg(4) on FreeBSD 11.3. It used to behave even worse with older FreeBSD releases. I recall similar failures on Linux, where rebooting the machine solved the issue. Bringing this card up and down, reloading drivers, etc. never helped, regardless of whether it was on FreeBSD or Linux. Please replace your NIC.
macOS has been rock solid reliable for ~14 years on that same hardware, so that sounds like a pretty lame excuse to me. (I wonder whether digging into the Darwin kernel sources would yield anything interesting about how to drive this interface...)
At the moment OpenBSD disables MSI interrupts for the 8053 chipset. Have any of you guys tried using the tunable to disable MSI and see if it makes any difference?
> Have any of you guys tried using the tunable to disable MSI and
> see if it makes any difference?

Oh, thanks for the suggestion. I'm happy to try, but what change are you suggesting exactly? I see multiple sysctl variables with "msi" in their names, so I'm confused.

$ sysctl -a | grep -i msi
hw.sdhci.enable_msi: 1
hw.puc.msi_disable: 0
hw.pci.honor_msi_blacklist: 1
hw.pci.msix_rewrite_table: 0
hw.pci.enable_msix: 1
hw.pci.enable_msi: 1
hw.mfi.msi: 1
hw.malo.pci.msi_disable: 0
hw.bce.msi_enable: 1
hw.aac.enable_msi: 1
machdep.disable_msix_migration: 0
machdep.num_msi_irqs: 512
Looks like hw.msk.msi_disable=1 in /boot/loader.conf.
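For anyone applying this by script, here is a minimal sketch of adding the setting without leaving duplicate entries behind. The helper name set_loader_tunable is hypothetical; only the tunable name and the file path come from this thread. Keep a backup of /boot/loader.conf before trying it, since the file is rewritten.

```shell
#!/bin/sh
# Set name="value" in a loader.conf-style file, dropping any earlier
# assignment of the same name so the file never holds duplicates.
# $1 = file, $2 = tunable name, $3 = value
# (set_loader_tunable is a hypothetical helper name.)
set_loader_tunable() {
    file=$1 name=$2 value=$3
    tmp="$file.tmp.$$"
    if [ -f "$file" ]; then
        # Keep every line that does not start with "name=".
        # index() compares literally, so dots in the name are not wildcards.
        awk -v n="$name=" 'index($0, n) != 1' "$file" > "$tmp"
    else
        : > "$tmp"
    fi
    # Append the new assignment in the quoted form loader.conf uses.
    printf '%s="%s"\n' "$name" "$value" >> "$tmp"
    mv "$tmp" "$file"
}

# Example (not run here, needs root):
#   set_loader_tunable /boot/loader.conf hw.msk.msi_disable 1
```

The setting only takes effect at the next boot. After rebooting, `kenv hw.msk.msi_disable` should print the value the loader handed to the kernel.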
Ah, now I see the tunable in msk(4). Installed and rebooted; no obvious change in dmesg output or performance. Since the MTBF was a month or two already, it'll be awhile before I can say if this fixed things ... but thanks for the tip!
Out of curiosity: if you turn up the verbosity (boot_verbose="YES" in /boot/loader.conf), what is the message regarding MSI-X? I ask because in most of the cases I've seen of this watchdog error on NICs, the cause is interrupt handling where MSI-X is concerned -- not enough interrupts available, mis-routed, poor implementation. So the suggestion has been to disable MSI-X. Just thought it worth mentioning.

--Chris
Created attachment 212548: verbose boot-time dmesg output from tgl's mac mini
(In reply to Chris Hutchinson from comment #9)
> Out of curiosity. If you turn up the verbosity
> (boot_verbose="YES" in /boot/loader.conf)
> What is the message regarding msiX.

Not sure which message you were looking for, so I uploaded the whole dmesg result from rebooting with boot_verbose="YES" (and without the hw.msk.msi_disable change that Brad suggested, in case that makes a difference).
(In reply to Tom Lane from comment #11)
OK. I only see one MSI-X message in dmesg(8). It's related to your (msk) card -- Found 1 of 1 interrupts.

I ran into the same error you're reporting, but on a dual-port re (Realtek) card. The most commonly proposed solution was to disable MSI-X on the card, on the grounds that its implementation is poor. I wasn't satisfied with that, as doing so would reduce performance, and I work that card pretty hard. So I did some additional investigation into exactly *why* this problem occurs, and ultimately decided to look at the kernel tunables kern.ipc.nmbjumbop and kern.ipc.nmbclusters. After tweaking them for a while, I found thresholds that eliminated the error, and I have been running the card blissfully for about 9 months now.

If I were you, I might be inclined to increase those numbers until the errors cease. There is no magic number I can provide, as it differs from card to card and load to load. But it's enough to say that the *default* number is inadequate (too small).

HTH

--Chris
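Before raising those limits, it may be worth checking whether mbuf allocations are actually being refused. The sketch below reads `netstat -m` output and looks at the "denied" counters. It assumes the output format of the FreeBSD releases discussed here (lines such as "0/0/0 requests for mbufs denied"), and the function name mbuf_requests_denied is made up for illustration.

```shell
#!/bin/sh
# Succeed (exit 0) if any "denied" counter in `netstat -m` output is non-zero.
# Reads the netstat output on standard input.
# (mbuf_requests_denied is a hypothetical helper name.)
mbuf_requests_denied() {
    awk '/ denied / {
             # The first field looks like "0/12/0": one counter per pool.
             n = split($1, c, "/")
             for (i = 1; i <= n; i++)
                 if (c[i] + 0 > 0)
                     found = 1
         }
         END { exit (found ? 0 : 1) }'
}

# Example (not run here):
#   netstat -m | mbuf_requests_denied && echo "mbuf requests were denied since boot"
#   sysctl kern.ipc.nmbclusters kern.ipc.nmbjumbop
```

If the counters stay at zero while the watchdog timeouts continue, that would suggest the mbuf limits are not the cause on that machine. This is an inference from the counters, not something established in this thread.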
(In reply to Chris Hutchinson from comment #12)
Oh, and make sure you re-enable MSI if you already disabled it. :)
FTR, I've now had three months of trouble-free uptime on that machine, using hw.msk.msi_disable="1" in /boot/loader.conf. I have not had the time to try Chris' more complex fix.