Bug 206567 - [msk] msk0: watchdog timeout - 88E8053 on i386
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 9.3-STABLE
Hardware: i386 Any
Importance: --- Affects Only Me
Assignee: freebsd-net (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-01-24 12:34 UTC by C. Bern
Modified: 2020-06-22 16:07 UTC (History)

See Also:


Attachments
verbose boot-time dmesg output from tgl's mac mini (43.83 KB, text/plain)
2020-03-20 15:04 UTC, Tom Lane

Description C. Bern 2016-01-24 12:34:54 UTC
On my MacMini (Late 2006, Macmini1,1, MA607*/A), the msk0 interface usually shows this:
Jan  1 22:29:17 beastli kernel: msk0: watchdog timeout
Jan  1 22:29:17 beastli kernel: msk0: link state changed to DOWN
Jan  1 22:29:20 beastli kernel: msk0: link state changed to UP

and thereafter the interface no longer passes any data. Only a reboot helps. This has been happening since 9.2 (for over a year).
I have looked at the patches found in this bug tracker, but they didn't help.

I presume this happens when a fair amount of traffic goes over the interface. It happens anywhere from once a week to every 30 minutes.

Is there anything I can switch on or off in the configuration?
Or is there anything I can do to avoid the reboot? (Loading the driver as a kernel module could be a way, as I read in #150257.)

Thanks
Christian

Platform info:

PF firewall configured

FreeBSD 9.3-RELEASE-p33 #0: Thu Jan 14 00:48:15 UTC 2016

mskc0: <Marvell Yukon 88E8053 Gigabit Ethernet> mem 0x90200000-0x90203fff irq 16 at device 0.0 on pci1
msk0: <Marvell Technology Group Ltd. Yukon EC Id 0xb6 Rev 0x02> on mskc0
msk0: Ethernet address: 00:16:xx:yy:aa:qq
miibus0: <MII bus> on msk0
e1000phy0: <Marvell 88E1111 Gigabit PHY> PHY 0 on miibus0
e1000phy0:  none, 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-master, 1000baseT-FDX, 1000baseT-FDX-master, auto, auto-flow

root@beastli:~ # pciconf -lbceVv @pci0:1:0:0:
mskc0@pci0:1:0:0:	class=0x020000 card=0x532111ab chip=0x436211ab rev=0x22 hdr=0x00
    vendor     = 'Marvell Technology Group Ltd.'
    device     = '88E8053 PCI-E Gigabit Ethernet Controller'
    class      = network
    subclass   = ethernet
    bar   [10] = type Memory, range 64, base rx90200000, size 16384, enabled
    bar   [18] = type I/O Port, range 32, base rx1000, size 256, disabled
    cap 01[48] = powerspec 2  supports D0 D1 D2 D3  current D0
    cap 03[50] = VPD
    cap 05[5c] = MSI supports 2 messages, 64 bit enabled with 1 message
    cap 10[e0] = PCI-Express 1 legacy endpoint max data 128(128) link x1(x1)
                 speed 2.5(2.5) ASPM disabled(L0s)
ecap 0001[100] = AER 1 0 fatal 0 non-fatal 1 corrected
  PCI-e errors = Correctable Error Detected
                 Non-Fatal Error Detected
                 Unsupported Request Detected
     Corrected = Receiver Error
    VPD ident  = 'Marvell Yukon 88E8053 Gigabit Ethernet Controller'
    VPD ro PN  = 'Yukon 88E8053'
    VPD ro EC  = 'Rev. 2.2'
    VPD ro MN  = 'Marvell'
    VPD ro SN  = 'AbCdEfG334455'
    VPD ro CP  = ID 01 in map 0x50[0x3cc]
    VPD rw VE  = '00'
Comment 1 Val Packett 2017-03-24 11:00:58 UTC
Same thing happens to my Mac mini (2006 upgraded to Core2Duo & firmware from the 2007 model, HardenedBSD 11-STABLE amd64, booted via GRUB2 i386 EFI).

One workaround I found on the freebsd.org forum is suspend_bounce, that worked for me once but now my mini is not coming back from that suspended state.

I guess using Wi-Fi is a better solution, since the Atheros card of the mini is very well supported :D

By the way, another msk problem — trying to add msk0 as a laggport to lagg0 results in a kernel panic.
Comment 2 Tom Lane 2020-03-19 17:20:20 UTC
I'm getting much the same behavior on an early 2006 Mac Mini (MA206LL/A).  The ethernet interface shows up in dmesg just as quoted above:

...
pcib1: <ACPI PCI-PCI bridge> irq 17 at device 28.0 on pci0
pcib1: failed to allocate initial I/O port window: 0x1000-0x1fff
pci1: <ACPI PCI bus> on pcib1
mskc0: <Marvell Yukon 88E8053 Gigabit Ethernet> mem 0x90200000-0x90203fff irq 16 at device 0.0 on pci1
msk0: <Marvell Technology Group Ltd. Yukon EC Id 0xb6 Rev 0x02> on mskc0
msk0: Using defaults for TSO: 65518/35/2048
msk0: Ethernet address: 00:16:cb:a5:22:11
miibus0: <MII bus> on msk0
e1000phy0: <Marvell 88E1111 Gigabit PHY> PHY 0 on miibus0
e1000phy0:  none, 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-master, 1000baseT-FDX, 1000baseT-FDX-master, auto, auto-flow
pcib2: <ACPI PCI-PCI bridge> irq 16 at device 28.1 on pci0
pci2: <ACPI PCI bus> on pcib2
...

Things work fine for quite a while (a month or more), and then suddenly the interface fails to operate, with repeating messages like this in syslog:

Mar 17 03:27:15 oldmini kernel: msk0: watchdog timeout
Mar 17 03:27:15 oldmini kernel: msk0: link state changed to DOWN
Mar 17 03:27:18 oldmini kernel: msk0: link state changed to UP
Mar 17 03:28:15 oldmini kernel: msk0: watchdog timeout
Mar 17 03:28:15 oldmini kernel: msk0: link state changed to DOWN
Mar 17 03:28:18 oldmini kernel: msk0: link state changed to UP
Mar 17 03:29:16 oldmini kernel: msk0: watchdog timeout
Mar 17 03:29:16 oldmini kernel: msk0: link state changed to DOWN
Mar 17 03:29:19 oldmini kernel: msk0: link state changed to UP
Mar 17 03:30:16 oldmini kernel: msk0: watchdog timeout
Mar 17 03:30:16 oldmini kernel: msk0: link state changed to DOWN
Mar 17 03:30:19 oldmini kernel: msk0: link state changed to UP

Rebooting fixes it; I've not tried any lesser solutions.

Currently running
FreeBSD oldmini.sss.pgh.pa.us 12.1-RELEASE-p2 FreeBSD 12.1-RELEASE-p2 GENERIC  i386
but I also saw this with 12.0.
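To put a number on how often the failure recurs, the timestamps of the watchdog messages can be pulled out of syslog. This is only a minimal sketch: `watchdog_times` is a hypothetical helper name, and it assumes the log lines have the same layout as the excerpts quoted in this report.

```shell
#!/bin/sh
# Hypothetical helper (not part of FreeBSD): read syslog lines on stdin and
# print the timestamp of every "<ifname>: watchdog timeout" kernel message.
# Assumes the "Mon DD HH:MM:SS host kernel: ..." layout quoted in this report.
watchdog_times() {
    ifname=$1
    awk -v pat="kernel: ${ifname}: watchdog timeout" \
        'index($0, pat) { print $1, $2, $3 }'
}

# Intended use on the affected machine:
#   watchdog_times msk0 < /var/log/messages
```

Roughly one timestamp per minute, as in the excerpt above, would indicate that the interface is stuck in the reset loop rather than recovering.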
Comment 3 Marek Zarychta 2020-03-19 17:49:31 UTC
IMHO this hardware is buggy; it has never worked reliably for me. I am still using it as part of a lagg(4) on FreeBSD 11.3. It behaved even worse with older FreeBSD releases. I recall similar failures on Linux, where rebooting the machine solved the issue. Bringing the card down and up, reloading drivers, etc. never helped, regardless of whether it was on FreeBSD or Linux. Please replace your NIC.
Comment 4 Tom Lane 2020-03-19 18:13:53 UTC
macOS has been rock solid reliable for ~14 years on that same hardware, so that sounds like a pretty lame excuse to me.  (I wonder whether digging into the Darwin kernel sources would yield anything interesting about how to drive this interface...)
Comment 5 Brad Smith 2020-03-19 23:37:29 UTC
At the moment OpenBSD disables MSI interrupts for the 8053 chipset.

Have any of you guys tried using the tunable to disable MSI and see
if it makes any difference?
Comment 6 Tom Lane 2020-03-20 03:06:26 UTC
> Have any of you guys tried using the tunable to disable MSI and
> see if it makes any difference?

Oh, thanks for the suggestion.  I'm happy to try, but what change are you suggesting exactly?  I see multiple sysctl variables with "msi" in their names, so I'm confused.

$ sysctl -a | grep -i msi
hw.sdhci.enable_msi: 1
hw.puc.msi_disable: 0
hw.pci.honor_msi_blacklist: 1
hw.pci.msix_rewrite_table: 0
hw.pci.enable_msix: 1
hw.pci.enable_msi: 1
hw.mfi.msi: 1
hw.malo.pci.msi_disable: 0
hw.bce.msi_enable: 1
hw.aac.enable_msi: 1
machdep.disable_msix_migration: 0
machdep.num_msi_irqs: 512
Comment 7 Brad Smith 2020-03-20 03:11:13 UTC
Looks like hw.msk.msi_disable=1 in /boot/loader.conf.
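For anyone wanting to apply this without hand-editing the file, the setting can be written idempotently. The following is a minimal sketch only: `set_loader_tunable` is a hypothetical helper, not a FreeBSD tool, and it assumes the usual one-assignment-per-line `name="value"` layout of /boot/loader.conf.

```shell
#!/bin/sh
# Hypothetical helper (not part of FreeBSD): set NAME="VALUE" in a
# loader.conf-style file, replacing any earlier assignment of NAME.
set_loader_tunable() {
    conf=$1
    name=$2
    value=$3
    tmp=$(mktemp) || return 1
    # Keep every existing line except previous assignments of this tunable.
    if [ -f "$conf" ]; then
        awk -v n="${name}=" 'index($0, n) != 1' "$conf" > "$tmp"
    fi
    printf '%s="%s"\n' "$name" "$value" >> "$tmp"
    # Overwrite in place so the file keeps its owner and mode.
    cat "$tmp" > "$conf" && rm -f "$tmp"
}

# Intended use on the affected machine (as root), followed by a reboot:
#   set_loader_tunable /boot/loader.conf hw.msk.msi_disable 1
```

The value only takes effect at the next boot, since the loader reads the file before the driver attaches.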
Comment 8 Tom Lane 2020-03-20 03:46:51 UTC
Ah, now I see the tunable in msk(4).  Installed and rebooted; no obvious change in dmesg output or performance.  Since the MTBF was a month or two already, it'll be awhile before I can say if this fixed things ... but thanks for the tip!
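Since dmesg shows no obvious change, one way to confirm that the setting took effect is to look at the MSI capability line that pciconf prints for the controller (the original report shows it as "MSI supports 2 messages, 64 bit enabled with 1 message"). The sketch below assumes pciconf omits the "enabled with ..." part when MSI is not in use; `msi_state` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: read `pciconf -lc <device>` output on stdin and report
# whether the MSI capability line says "enabled".
# Assumption: the line carries no "enabled with N message" text when MSI is off.
msi_state() {
    awk '/= MSI supports/ {
             print (/enabled/ ? "msi-enabled" : "msi-disabled")
             found = 1
         }
         END { if (!found) print "no-msi-capability" }'
}

# Intended use on the affected machine:
#   pciconf -lc mskc0 | msi_state
# The value handed over by the loader can also be read back with:
#   kenv hw.msk.msi_disable
```

This also suggests why `sysctl -a | grep -i msi` did not list the knob: it appears to be a loader tunable read at boot rather than a runtime sysctl.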
Comment 9 Chris Hutchinson 2020-03-20 14:22:44 UTC
Out of curiosity: if you turn up the verbosity
(boot_verbose="YES" in /boot/loader.conf),
what is the message regarding MSI-X? I ask
because in most of the cases I've seen of
this watchdog error on NICs, the cause is the
interrupt handling where MSI-X is concerned:
not enough interrupts available, mis-routed
interrupts, or a poor implementation.
So the suggestion has been to disable
MSI-X.
Just thought it worth mentioning.

--Chris
Comment 10 Tom Lane 2020-03-20 15:04:59 UTC
Created attachment 212548
verbose boot-time dmesg output from tgl's mac mini
Comment 11 Tom Lane 2020-03-20 15:07:46 UTC
(In reply to Chris Hutchinson from comment #9)
> Out of curiosity. If you turn up the verbosity
> (boot_verbose="YES" in /boot/loader.conf)
> What is the message regarding msiX.

Not sure which message you were looking for, so I uploaded the whole dmesg result from rebooting with boot_verbose="YES" (and without the hw.msk.msi_disable change that Brad suggested, in case that makes a difference).
Comment 12 Chris Hutchinson 2020-03-20 15:37:19 UTC
(In reply to Tom Lane from comment #11)
OK. I only see one MSI-X message in dmesg(8).
It's related to your (msk) card -- Found 1 of
1 interrupts. I ran into the same error you're
reporting, but on a dual-port re (Realtek)
card. The most commonly proposed solution was to
disable MSI-X on the card; the problem was blamed
on a poor implementation.
I wasn't satisfied with that, as doing so would
reduce performance, and I work that card pretty
hard. So I investigated exactly *why* this problem
occurs, and ultimately decided to look at the
kernel tunables
kern.ipc.nmbjumbop
and
kern.ipc.nmbclusters
After tweaking them for a while, I found thresholds
that eliminated the error, and have been running
the card blissfully for about 9 months now. If I
were you, I might be inclined to increase those
numbers until the errors cease. There is no magic
number I can provide, as it differs from card to
card and load to load. But it's enough to say that
the *default* number is inadequate (too small).

HTH

--Chris
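Before raising the two limits mentioned above, it may help to check whether the mbuf pools are actually being exhausted. The sketch below is an illustration only: `mbuf_denials` is a hypothetical helper, and it assumes the "N/N/N requests for ... denied" lines that `netstat -m` prints on FreeBSD.

```shell
#!/bin/sh
# Hypothetical helper: read `netstat -m` output on stdin and print only the
# "denied" lines whose leading counters (e.g. "0/3/0") are not all zero.
mbuf_denials() {
    awk '/ denied/ {
        n = split($1, c, "/")
        total = 0
        for (i = 1; i <= n; i++) total += c[i]
        if (total > 0) print
    }'
}

# Intended use on the affected machine:
#   netstat -m | mbuf_denials
# Current limits can be read with:
#   sysctl kern.ipc.nmbclusters kern.ipc.nmbjumbop
# Raising a limit at runtime (doubling is an arbitrary example, not a
# recommendation from this thread):
#   sysctl kern.ipc.nmbclusters=$(( $(sysctl -n kern.ipc.nmbclusters) * 2 ))
```

No output from `mbuf_denials` would suggest that mbuf exhaustion is not what triggers the watchdog on a given machine, in which case the MSI tunable is the more likely lever.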
Comment 13 Chris Hutchinson 2020-03-20 15:44:06 UTC
(In reply to Chris Hutchinson from comment #12)
Oh, and make sure you re-enable MSI if you already
disabled it. :)
Comment 14 Tom Lane 2020-06-22 16:07:37 UTC
FTR, I've now had three months of trouble-free uptime on that machine, using hw.msk.msi_disable="1" in /boot/loader.conf.  I have not had the time to try Chris' more complex fix.