Bug 208205 - re0 watchdog timeout
Summary: re0 watchdog timeout
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 11.1-RELEASE
Hardware: Any Any
Importance: --- Affects Many People
Assignee: freebsd-net mailing list
URL:
Keywords: needs-qa
Depends on: 212283
Blocks: 227979
Reported: 2016-03-22 17:44 UTC by Nick
Modified: 2020-03-24 14:16 UTC (History)
Description Nick 2016-03-22 17:44:10 UTC
I have an issue with re0 watchdog timeouts under moderate load with powerd enabled.

I'm using pfSense with the WAN link on a VLAN and the native (untagged) network as LAN.
Network speed is 180 Mbps down, 12 Mbps up. When I start a speed test, the test fails because re0 hits the watchdog timeout.

After disabling powerd the system appears to function as intended, with no re0 watchdog timeouts logged.

I'm not sure which chipset it is, but the machine is an ASRock Beebox N3000:
http://www.asrock.com/nettop/Intel/Beebox%20Series/
Comment 1 Sean Bruno freebsd_committer 2016-03-25 14:29:03 UTC
Can you post the output of pciconf -lvbc?
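Since the full listing is long, a small filter can cut it down to the re(4) stanzas. This is only a sketch: the helper name re_stanzas is made up for illustration, and it assumes the usual pciconf layout in which each device header starts in column 0 and its detail lines are indented.

```shell
#!/bin/sh
# re_stanzas: hypothetical helper, not part of pciconf.
# Reads `pciconf -lv`-style text on stdin and prints only the stanzas
# whose header names an re(4) device (re0@..., re1@..., ...).
re_stanzas() {
    awk '
        # A header line starts in column 0; decide whether to keep its stanza.
        /^[^ \t]/ { keep = ($0 ~ /^re[0-9]+@/) }
        # Print the header and the indented lines that follow while keep is set.
        keep
    '
}
```

Typical use would be `pciconf -lvbc | re_stanzas`, which leaves the unfiltered output available if someone asks for the other devices later.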
Comment 2 Nick 2016-03-25 14:52:48 UTC
Here is the output.

[2.3-BETA][root@pfSense.Home.lan]/root: pciconf -lvbc
hostb0@pci0:0:0:0:      class=0x060000 card=0x22b11849 chip=0x22808086 rev=0x21 hdr=0x00
    vendor     = 'Intel Corporation'
    class      = bridge
    subclass   = HOST-PCI
vgapci0@pci0:0:2:0:     class=0x030000 card=0x22b11849 chip=0x22b18086 rev=0x21 hdr=0x00
    vendor     = 'Intel Corporation'
    class      = display
    subclass   = VGA
    bar   [10] = type Memory, range 64, base rx90000000, size 16777216, enabled
    bar   [18] = type Prefetchable Memory, range 64, base rx80000000, size 268435456, enabled
    bar   [20] = type I/O Port, range 32, base rxf000, size 64, enabled
    cap 01[d0] = powerspec 2  supports D0 D3  current D0
    cap 05[90] = MSI supports 1 message
    cap 09[b0] = vendor (length 7) Intel cap 0 version 1
ahci0@pci0:0:19:0:      class=0x010601 card=0x22a31849 chip=0x22a38086 rev=0x21 hdr=0x00
    vendor     = 'Intel Corporation'
    class      = mass storage
    subclass   = SATA
    bar   [20] = type I/O Port, range 32, base rxf060, size 32, enabled
    bar   [24] = type Memory, range 32, base rx91415000, size 2048, enabled
    cap 05[80] = MSI supports 1 message enabled with 1 message
    cap 01[70] = powerspec 3  supports D0 D3  current D0
    cap 12[a8] = SATA Index-Data Pair
xhci0@pci0:0:20:0:      class=0x0c0330 card=0x22b51849 chip=0x22b58086 rev=0x21 hdr=0x00
    vendor     = 'Intel Corporation'
    class      = serial bus
    subclass   = USB
    bar   [10] = type Memory, range 64, base rx91400000, size 65536, enabled
    cap 01[70] = powerspec 2  supports D0 D3  current D0
    cap 05[80] = MSI supports 8 messages, 64 bit enabled with 1 message
none0@pci0:0:26:0:      class=0x108000 card=0x22981849 chip=0x22988086 rev=0x21 hdr=0x00
    vendor     = 'Intel Corporation'
    class      = encrypt/decrypt
    bar   [10] = type Memory, range 32, base rx91100000, size 1048576, enabled
    bar   [14] = type Memory, range 32, base rx91000000, size 1048576, enabled
    cap 01[80] = powerspec 3  supports D0 D3  current D0
    cap 05[a0] = MSI supports 1 message
hdac0@pci0:0:27:0:      class=0x040300 card=0x02831849 chip=0x22848086 rev=0x21 hdr=0x00
    vendor     = 'Intel Corporation'
    class      = multimedia
    subclass   = HDA
    bar   [10] = type Memory, range 64, base rx91410000, size 16384, enabled
    cap 01[50] = powerspec 2  supports D0 D3  current D0
    cap 05[60] = MSI supports 1 message, 64 bit enabled with 1 message
pcib1@pci0:0:28:0:      class=0x060400 card=0x22c81849 chip=0x22c88086 rev=0x21 hdr=0x01
    vendor     = 'Intel Corporation'
    class      = bridge
    subclass   = PCI-PCI
    cap 10[40] = PCI-Express 2 root port slot max data 128(128) link x1(x1)
                 speed 2.5(5.0) ASPM disabled(L0s/L1)
    cap 05[80] = MSI supports 1 message
    cap 0d[90] = PCI Bridge card=0x22c81849
    cap 01[a0] = powerspec 3  supports D0 D3  current D0
    ecap 0000[100] = unknown 0
    ecap 001e[200] = unknown 1
pcib2@pci0:0:28:1:      class=0x060400 card=0x22ca1849 chip=0x22ca8086 rev=0x21 hdr=0x01
    vendor     = 'Intel Corporation'
    class      = bridge
    subclass   = PCI-PCI
    cap 10[40] = PCI-Express 2 root port slot max data 128(128) link x1(x1)
                 speed 2.5(5.0) ASPM disabled(L0s/L1)
    cap 05[80] = MSI supports 1 message
    cap 0d[90] = PCI Bridge card=0x22ca1849
    cap 01[a0] = powerspec 3  supports D0 D3  current D0
    ecap 0000[100] = unknown 0
    ecap 001e[200] = unknown 1
isab0@pci0:0:31:0:      class=0x060100 card=0x229c1849 chip=0x229c8086 rev=0x21 hdr=0x00
    vendor     = 'Intel Corporation'
    class      = bridge
    subclass   = PCI-ISA
    cap 09[e0] = vendor (length 12) Intel cap 1 version 0
                 features: 4 PCI-e x1 slots
none1@pci0:0:31:3:      class=0x0c0500 card=0x22921849 chip=0x22928086 rev=0x21 hdr=0x00
    vendor     = 'Intel Corporation'
    class      = serial bus
    subclass   = SMBus
    bar   [10] = type Memory, range 32, base rx91414000, size 32, enabled
    bar   [20] = type I/O Port, range 32, base rxf040, size 32, enabled
    cap 01[50] = powerspec 3  supports D0 D3  current D0
none2@pci0:1:0:0:       class=0x028000 card=0x882110ec chip=0x882110ec rev=0x00 hdr=0x00
    vendor     = 'Realtek Semiconductor Co., Ltd.'
    device     = 'RTL8821AE 802.11ac PCIe Wireless Network Adapter'
    class      = network
    bar   [10] = type I/O Port, range 32, base rxe000, size 256, enabled
    bar   [18] = type Memory, range 64, base rx91300000, size 16384, enabled
    cap 01[40] = powerspec 3  supports D0 D1 D2 D3  current D0
    cap 05[50] = MSI supports 1 message, 64 bit
    cap 10[70] = PCI-Express 2 endpoint max data 128(128) RO link x1(x1)
                 speed 2.5(2.5) ASPM disabled(L0s/L1)
    ecap 0001[100] = AER 2 0 fatal 0 non-fatal 1 corrected
    ecap 0003[140] = Serial 1 00e04cfffe872b01
    ecap 0018[150] = LTR 1
    ecap 001e[158] = unknown 1
re0@pci0:2:0:0: class=0x020000 card=0x81681849 chip=0x816810ec rev=0x11 hdr=0x00
    vendor     = 'Realtek Semiconductor Co., Ltd.'
    device     = 'RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller'
    class      = network
    subclass   = ethernet
    bar   [10] = type I/O Port, range 32, base rxd000, size 256, enabled
    bar   [18] = type Memory, range 64, base rx91204000, size 4096, enabled
    bar   [20] = type Prefetchable Memory, range 64, base rx91200000, size 16384, enabled
    cap 01[40] = powerspec 3  supports D0 D1 D2 D3  current D0
    cap 05[50] = MSI supports 1 message, 64 bit
    cap 10[70] = PCI-Express 2 endpoint IRQ 1 max data 128(128) RO link x1(x1)
                 speed 2.5(2.5) ASPM disabled(L0s/L1)
    cap 11[b0] = MSI-X supports 4 messages, enabled
                 Table in map 0x20[0x0], PBA in map 0x20[0x800]
    cap 03[d0] = VPD
    ecap 0001[100] = AER 1 0 fatal 0 non-fatal 0 corrected
    ecap 0002[140] = VC 1 max VC0
    ecap 0003[160] = Serial 1 01000000684ce000
    ecap 0018[170] = LTR 1
Comment 3 Kubilay Kocak freebsd_committer freebsd_triage 2016-08-01 09:18:45 UTC
@Nick, is this still reproducible in the latest release?
Comment 4 Nick 2016-08-25 18:58:33 UTC
It's updated to 10.3-RELEASE-p5 and the issue is still there.
Comment 5 Sean Bruno freebsd_committer 2016-08-30 15:11:10 UTC
We're seeing this on one of our gateways in the FreeBSD cluster at bytemark.  The interface will not come back up unless the machine is rebooted.

FreeBSD igw0.bme.freebsd.org 11.0-ALPHA6 FreeBSD 11.0-ALPHA6 #0 r302331: Sun Jul  3 23:03:04 UTC 2016     peter@build-11.freebsd.org:/usr/obj/usr/src/sys/CLUSTER11  amd64


re0@pci0:3:0:0: class=0x020000 card=0x85051043 chip=0x816810ec rev=0x09 hdr=0x00
    vendor     = 'Realtek Semiconductor Co., Ltd.'
    device     = 'RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller'
    class      = network
    subclass   = ethernet
    bar   [10] = type I/O Port, range 32, base rxe800, size 256, enabled
    bar   [18] = type Prefetchable Memory, range 64, base rxfdfff000, size 4096, enabled
    bar   [20] = type Prefetchable Memory, range 64, base rxfdff8000, size 16384, enabled
    cap 01[40] = powerspec 3  supports D0 D1 D2 D3  current D0
    cap 05[50] = MSI supports 1 message, 64 bit 
    cap 10[70] = PCI-Express 2 endpoint MSI 1 max data 128(128) RO
                 link x1(x1) speed 2.5(2.5) ASPM disabled(L0s/L1)
    cap 11[b0] = MSI-X supports 4 messages, enabled
                 Table in map 0x20[0x0], PBA in map 0x20[0x800]
    cap 03[d0] = VPD
    ecap 0001[100] = AER 1 0 fatal 0 non-fatal 1 corrected
    ecap 0002[140] = VC 1 max VC0
    ecap 0003[160] = Serial 1 0000000000000000


re0: <RealTek 8168/8111 B/C/CP/D/DP/E/F/G PCIe Gigabit Ethernet> port 0xe800-0xe8ff mem 0xfdfff000-0xfdffffff,0xfdff8000-0xfdffbfff irq 18 at device 0.0 on pci3
re0: Using 1 MSI-X message
re0: turning off MSI enable bit.
re0: Chip rev. 0x48000000
re0: MAC rev. 0x00000000
miibus0: <MII bus> on re0
rgephy0: <RTL8169S/8110S/8211 1000BASE-T media interface> PHY 1 on miibus0
rgephy0:  none, 10baseT, 10baseT-FDX, 10baseT-FDX-flow, 100baseTX, 100baseTX-FDX, 100baseTX-FDX-flow, 1000baseT-FDX, 1000baseT-FDX-master, 1000baseT-FDX-flow, 1000baseT-FDX-flow-master, auto, auto-flow
re0: Using defaults for TSO: 65518/35/2048
re0: Ethernet address: 08:60:6e:d7:31:d2
Comment 6 Sean Bruno freebsd_committer 2016-08-30 15:13:15 UTC
(In reply to Sean Bruno from comment #5)
This machine does *not* run powerd as it's a gateway host.
Comment 7 c.kworr 2016-09-01 16:55:14 UTC
I'll try to chime in.

After upgrading one of my hosts to 11-STABLE I started to receive those timeouts. In my case this is a home router/storage. I'm using pf and bridge on that interface.

In my case the error is sometimes recoverable: I see a network glitch and transfers stop, but after a couple of seconds packets flow again. This can happen a few times in a 10-minute period before the network goes down completely.
Comment 8 Sean Bruno freebsd_committer 2016-09-01 17:44:36 UTC
(In reply to c.kworr from comment #7)
Does this happen without pf being used?
Comment 9 c.kworr 2016-09-01 18:01:51 UTC
(In reply to Sean Bruno from comment #8)
Never checked without pf… I have some LOR regarding pf but hadn't thought pf could be the one to blame.

To check without pf I need to rewrite all firewall rules for that one. I'm not using pf on the interface directly; pf is configured on the bridge that contains this interface. I'll see how I can move it out of the configuration.
Comment 10 ml 2016-09-07 10:29:35 UTC
Just a "me too" here...

The box is a 10.3p5 running as a router/server.
The internal interface (re0) locked up (luckily just once so far).

The outside interface (re1) was working, so I could log in remotely and reboot; ifconfig re0 down/up would not help.

powerd is running in its default config (i.e. just powerd_enable="YES" in /etc/rc.conf).
PF is not running, but IPFW is.

# pciconf -lv
...
re0@pci0:2:0:0: class=0x020000 card=0x78171462 chip=0x816810ec rev=0x0c hdr=0x00
    vendor     = 'Realtek Semiconductor Co., Ltd.'
    device     = 'RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller'
    class      = network
    subclass   = ethernet
re1@pci0:3:0:0: class=0x020000 card=0x34687470 chip=0x816810ec rev=0x06 hdr=0x00
    vendor     = 'Realtek Semiconductor Co., Ltd.'
    device     = 'RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller'
    class      = network
    subclass   = ethernet
Comment 11 c.kworr 2016-09-07 11:30:45 UTC
(In reply to Sean Bruno from comment #8)

Actually, you may be right. During boot I always have this LOR:

Sep  7 10:08:26 limbo kernel: lock order reversal:
Sep  7 10:08:26 limbo kernel: 1st 0xffffffff818f5b80 pf rulesets (pf rulesets) @ /usr/src/sys/modules/pf/../../netpfil/pf/pf.c:5879
Sep  7 10:08:26 limbo kernel: 2nd 0xffffffff80d4a6c8 pcbinfohash (pcbinfohash) @ /usr/src/sys/netinet/in_pcb.c:1957
Sep  7 10:08:26 limbo kernel: stack backtrace:
Sep  7 10:08:26 limbo kernel: #0 0xffffffff8040af80 at witness_debugger+0x70
Sep  7 10:08:26 limbo kernel: #1 0xffffffff8040ae74 at witness_checkorder+0xe54
Sep  7 10:08:26 limbo kernel: #2 0xffffffff803a8647 at __rw_rlock+0xa7
Sep  7 10:08:26 limbo kernel: #3 0xffffffff804d369f at in_pcblookup_hash+0x3f
Sep  7 10:08:26 limbo kernel: #4 0xffffffff818c75e5 at pf_socket_lookup+0xe5
Sep  7 10:08:26 limbo kernel: #5 0xffffffff818cdbc7 at pf_test_rule+0x1817
Sep  7 10:08:26 limbo kernel: #6 0xffffffff818c9254 at pf_test+0x18f4
Sep  7 10:08:26 limbo kernel: #7 0xffffffff818dc5dd at pf_check_out+0x1d
Sep  7 10:08:26 limbo kernel: #8 0xffffffff804b890b at pfil_run_hooks+0x8b
Sep  7 10:08:26 limbo kernel: #9 0xffffffff804d5bc5 at ip_tryforward+0x295
Sep  7 10:08:26 limbo kernel: #10 0xffffffff804d818f at ip_input+0x35f
Sep  7 10:08:26 limbo kernel: #11 0xffffffff804b77c0 at netisr_dispatch_src+0x80
Sep  7 10:08:26 limbo kernel: #12 0xffffffff804a2fea at ether_demux+0x14a
Sep  7 10:08:26 limbo kernel: #13 0xffffffff804a3de0 at ether_nh_input+0x340
Sep  7 10:08:26 limbo kernel: #14 0xffffffff804b77c0 at netisr_dispatch_src+0x80
Sep  7 10:08:26 limbo kernel: #15 0xffffffff804a3352 at ether_input+0x62
Sep  7 10:08:26 limbo kernel: #16 0xffffffff817d9f75 at re_rxeof+0x5c5
Sep  7 10:08:26 limbo kernel: #17 0xffffffff817d75ba at re_intr_msi+0xca

re0@pci0:2:0:0: class=0x020000 card=0x4b101186 chip=0x43001186 rev=0x06 hdr=0x00
    vendor     = 'D-Link System Inc'
    device     = 'DGE-528T Gigabit Ethernet Adapter'
    class      = network
    subclass   = ethernet
re1@pci0:3:0:0: class=0x020000 card=0x230e1565 chip=0x816810ec rev=0x07 hdr=0x00
    vendor     = 'Realtek Semiconductor Co., Ltd.'
    device     = 'RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller'
    class      = network
    subclass   = ethernet

Only re1 fails.
Comment 12 ml 2017-01-19 07:56:40 UTC
I disabled powerd, but the problem showed up again.

It only happened twice in some months, but it's still a critical problem for us.
Comment 13 Marc Mach 2017-01-27 13:05:59 UTC
Greetings! This issue seems not to be specific to FreeBSD 10: some other users (including me) had a discussion on the FreeBSD Forums about this annoying problem. Please see https://forums.freebsd.org/threads/55861/ for more details.

In short: it seems that the built-in driver for re NICs has a bug. The workaround is to take the latest driver version from Realtek and compile and load it as an external module.

I hope this helps you further.
Comment 14 Sean Bruno freebsd_committer 2017-01-27 15:05:26 UTC
(In reply to Marc Mach from comment #13)
If someone could identify what is in the Realtek driver that is not in the FreeBSD base driver, I'm willing to commit it.  The diffs are kind of ridiculous and may take a lot of sleuthing to figure out.
Comment 15 ml 2017-06-14 12:36:57 UTC
Hello.

Just to say that I have a new box which is showing this behaviour: re0 (on motherboard) locks, while re1 (PCI-X card) is still working.

# pciconf -lv
...
re0@pci0:1:0:0: class=0x020000 card=0x79821462 chip=0x816810ec rev=0x15 hdr=0x00
    vendor     = 'Realtek Semiconductor Co., Ltd.'
    device     = 'RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller'
    class      = network
    subclass   = ethernet
re1@pci0:2:0:0: class=0x020000 card=0x34687470 chip=0x816810ec rev=0x06 hdr=0x00
    vendor     = 'Realtek Semiconductor Co., Ltd.'
    device     = 'RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller'
    class      = network
    subclass   = ethernet


Those are the same cards as in my previous comment, but the box is a different one.
Comment 16 zjk 2017-09-04 13:42:22 UTC
The problem still exists in version 11.1.
Under low load "re" works fine, but heavy bidirectional traffic completely hangs the network.

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=166724

Is there a chance of reviving this topic?

zjk
Comment 17 Dirk Meyer freebsd_committer 2017-12-16 05:59:47 UTC
Hosted environment:

with FreeBSD 11.1-RELEASE.

Dual-Stack:
re0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=82088<VLAN_MTU,VLAN_HWCSUM,WOL_MAGIC,LINKSTATE>
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
        media: Ethernet autoselect (1000baseT <full-duplex,master>)
        status: active

Under load > 64 Mbit full duplex I see the problem:
 kernel: re0: watchdog timeout
 kernel: re0: link state changed to DOWN
 kernel: re0: link state changed to UP

after a few times the card goes offline until reboot.
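
To compare machines or driver versions it helps to count how often this happens. A minimal sketch, assuming the kernel messages end up in /var/log/messages in the form shown above; the helper name count_watchdog is made up for illustration:

```shell
#!/bin/sh
# count_watchdog: hypothetical helper for tallying re(4) watchdog timeouts.
# $1 is a log file (default /var/log/messages, assumed to hold kernel messages).
# Prints one "<interface> <count>" line per interface, sorted by name.
count_watchdog() {
    awk '
        # match() sets RSTART/RLENGTH to the "reN: watchdog timeout" part.
        match($0, /re[0-9]+: watchdog timeout/) {
            name = substr($0, RSTART, RLENGTH)
            sub(/:.*/, "", name)      # keep only the interface name
            n[name]++
        }
        END { for (i in n) print i, n[i] }
    ' "${1:-/var/log/messages}" | sort
}
```

Running `count_watchdog` before and after a change (for example a driver swap or a tunable) gives a rough measure of whether the change made any difference.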

A hardware swap did not help.

hardware1:
re0: <RealTek 8168/8111 B/C/CP/D/DP/E/F/G PCIe Gigabit Ethernet> port 0xe000-0xe0ff mem 0xf0004000-0xf0004fff,0xf0000000-0xf0003fff irq 17 at device 0.0 on pci2
re0: MSI count : 1
re0: MSI-X count : 4
re0: attempting to allocate 1 MSI-X vectors (4 supported)
re0: using IRQ 265 for MSI-X
re0: Using 1 MSI-X message
re0: turning off MSI enable bit.
re0: Chip rev. 0x2c800000
re0: MAC rev. 0x00100000
miibus0: <MII bus> on re0
re0: Using defaults for TSO: 65518/35/2048
re0: bpf attached

hardware2:
re0: <RealTek 8168/8111 B/C/CP/D/DP/E/F/G PCIe Gigabit Ethernet> port 0xe000-0xe0ff mem 0xf7c00000-0xf7c00fff,0xf0000000-0xf0003fff irq 16 at device 0.0 on pci1
re0: MSI count : 1
re0: MSI-X count : 4
re0: attempting to allocate 1 MSI-X vectors (4 supported)
re0: using IRQ 266 for MSI-X
re0: Using 1 MSI-X message
re0: Chip rev. 0x4c000000
re0: MAC rev. 0x00000000
miibus0: <MII bus> on re0
re0: Using defaults for TSO: 65518/35/2048
re0: bpf attached
re0: Ethernet address: 44:8a:5b:d4:49:6d
re0: netmap queues/slots: TX 1/256, RX 1/256
random: harvesting attach, 8 bytes (4 bits) from re0
Comment 18 Dirk Meyer freebsd_committer 2017-12-16 06:03:21 UTC
(In reply to Dirk Meyer from comment #17)

pciconf -v -l

hardware1:
re0@pci0:2:0:0: class=0x020000 card=0x78161462 chip=0x816810ec rev=0x06 hdr=0x00
    vendor     = 'Realtek Semiconductor Co., Ltd.'
    device     = 'RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller'
    class      = network
    subclass   = ethernet

hardware2:
re0@pci0:2:0:0: class=0x020000 card=0x78231462 chip=0x816810ec rev=0x0c hdr=0x00
    vendor     = 'Realtek Semiconductor Co., Ltd.'
    device     = 'RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller'
    class      = network
    subclass   = ethernet
Comment 19 Alex Dupre freebsd_committer 2018-02-14 22:37:30 UTC
(In reply to Sean Bruno from comment #14)
I don't know if there is something in the Realtek driver that isn't in ours, or maybe the opposite. For instance the watchdog functionality is commented out in the Realtek driver.
Comment 20 Nick 2018-02-14 22:59:43 UTC
I did some additional testing a while back, and it seems that on this setup powerd does nothing to save power. With it on or off the computer still consumed 8 watts. I guess at that level it's already pretty efficient.
Comment 21 Alex Dupre freebsd_committer 2018-02-15 07:01:56 UTC
The use of powerd can make the issue more evident / easier to reproduce, but it's not the root cause.
Comment 22 Arto Pekkanen 2018-05-02 15:53:16 UTC
I managed to solve this issue by disabling MSI and MSI-X. Put the following lines into /boot/loader.conf:

hw.re.msi_disable="1"
hw.re.msix_disable="1"

You see, the MSI/MSI-X interrupt processing supposedly eliminates the need to perform an extra read from device register after receiving an interrupt which tells that a DMA write is finished. However, there is some kind of problem either in the driver or the chip itself in the way it handles these interrupts.

By disabling MSI and MSI-X, the driver switches to the older interrupt filter handler, and thus probably performs an extra read from some device register to wait for the DMA transfer to memory to complete (according to Wikipedia, when using legacy interrupts this is the only way to ensure the DMA transfer wasn't buffered by the chipset etc.).

So, I would suggest everybody watching this thread to try if disabling MSI and MSI-X on their system helps. Might not apply to all Realtek NICs, but on my machine this workaround is valid.
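
Whether the tunables took effect can be checked after a reboot. A minimal sketch, assuming the driver announces MSI/MSI-X use with boot messages of the form "re0: Using 1 MSI-X message" (as in the dmesg excerpts in this thread) or "Using N MSI message", and prints neither when it falls back to a legacy interrupt; the helper name re_intr_mode is made up for illustration:

```shell
#!/bin/sh
# re_intr_mode: hypothetical helper. Reads dmesg-style text on stdin and
# reports which interrupt type the re(4) driver announced for interface $1.
# Assumption: the driver logs "Using N MSI-X message" or "Using N MSI message"
# when it allocates them, and logs neither when it uses a legacy interrupt.
re_intr_mode() {
    awk -v ifn="${1:-re0}" '
        # Only look at lines that begin with "<ifn>: Using ".
        index($0, ifn ": Using ") == 1 && /MSI-X message/ { mode = "msix" }
        index($0, ifn ": Using ") == 1 && / MSI message/  { mode = "msi" }
        END { print (mode == "" ? "legacy" : mode) }
    '
}
```

Used as `dmesg | re_intr_mode re0`, it should print "legacy" once both tunables are active. `kenv hw.re.msi_disable` and `kenv hw.re.msix_disable` show what the loader actually passed to the kernel.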

PS. the performance is still horrible when transferring to and from the machine, but at least now it doesn't hang sporadically.
Comment 23 Alex Dupre freebsd_committer 2018-05-05 17:20:58 UTC
Disabling MSI/MSI-X was proposed as a solution in the past. I've just tried again to be sure: it helps, but the issue doesn't disappear completely. With it I can successfully run the Google (M-Lab) speed test, but I still get a watchdog timeout and network reset as soon as I start the Ookla speed test. Fully reproducible.
Comment 24 zjk 2018-05-06 07:30:52 UTC
hw.re.msi_disable hw.re.msix_disable
I tested this solution for a few days (it already exists somewhere on the internet).
There is no visible effect (on my computers): the network still goes down very quickly.
But - maybe it depends on the network card chipset?

However, I highly recommend the analysis:
https://forums.freebsd.org/threads/10-2-release-re0-watchdog-timeout.55306/#post-337045
There are some extremely important remarks.
One important tip - this may be the result of overloading the processor. In general - a problem for low-performance processors. Or vice versa: for the "computationally demanding" chipset of the network card, and finally the "programmatically extended" driver.

Probably because the version of "built-in" driver for FreeBSD is so much "slimmed", in relation to the full version from Realtek (from the Realtek website). It may be intended to run on less-efficient processors.

But I can not fully appreciate everything from this analysis. "Watchdog timeout" messages - also occur after stopping the transmission. Processor load drops to several percent, but watchdog timeout messages still appear every few seconds.

In general - a reset is needed to restore the normal operation of the interface.

As a workaround (a "patch"), instead of, for example, limiting the link speed to 100 Mb, you can use dummynet for flow / bandwidth management.
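
A rough sketch of what such a dummynet setup could look like. Everything here is an assumption for illustration: the helper name dummynet_cmds, the 100 Mbit/s limit, pipe number 1, and an ipfw ruleset that is already enabled. The helper only prints the commands so they can be reviewed first; note that a single pipe like this shares the limit between both directions.

```shell
#!/bin/sh
# dummynet_cmds: hypothetical helper that only PRINTS the commands (dry run).
# $1 = interface, $2 = bandwidth limit in Mbit/s, $3 = pipe number.
dummynet_cmds() {
    ifn=${1:-re0}
    mbit=${2:-100}
    pipe=${3:-1}
    echo "kldload dummynet"                                   # load the shaper
    echo "ipfw pipe ${pipe} config bw ${mbit}Mbit/s"          # define the pipe
    echo "ipfw add pipe ${pipe} ip from any to any via ${ifn}" # send traffic through it
}
```

`dummynet_cmds re0 100 1` prints the three commands; once they look right they can be run by hand as root. The link-speed alternative mentioned above would be something like `ifconfig re0 media 100baseTX mediaopt full-duplex`.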

It is still not a solution to the problem of the driver itself.
Comment 25 Alex Dupre freebsd_committer 2018-06-29 13:15:38 UTC
After upgrading to 11.2-RELEASE the problem seems to have disappeared on my machine.

Looking at dmesg, the only difference is the absence of the following line at boot:

re0: turning off MSI enable bit.
Comment 26 zjk 2018-06-30 07:53:06 UTC
After upgrading several machines to 11.2 and all-night tests: nothing better, still a watchdog fault.
zjk
Comment 28 Alex Dupre freebsd_committer 2018-06-30 08:36:26 UTC
I still see a few watchdog errors in the logs, but I'm unable to trigger them deliberately, even with very high traffic. Before, it was enough to run a single speed test to drop the connection; now I can saturate the link without a watchdog timeout. The connection is quite stable now. The issue is likely not solved, but it's much harder to trigger in my scenario.
Comment 29 zjk 2018-06-30 18:32:31 UTC
The following configuration is very promising:
- kernel 11.2-RELEASE recompiled together,
- re driver v. 1.93 (from realtek site).

Effect:
- NO (absolutely none) watchdog timeout,
- FULL speed in both directions (I will still test different situations),
- works well with lagg(!).

I am now compiling Realtek version 1.94 with 11.2-RELEASE; I will let you know what the effects are.

zjk
Comment 30 Alex Dupre freebsd_committer 2018-06-30 20:26:45 UTC
Of course you won't get the watchdog timeout error with the driver taken from the Realtek website: the watchdog has been commented out in its source code, so that's not a real clue.

That said, with 11.0 and 11.1 I've always used the 1.93 version without issues.
Comment 31 zjk 2018-09-03 14:34:37 UTC
A. After longer tests I must retract the previous optimistic news. We are talking about 11.2-RELEASE + the 1.93 Realtek driver:

1. Hangs and stalls still occur. They are just shorter, though still troublesome.

Generally the interface works quickly at the beginning; after some time it slows down and shows signs of packet loss.

2. There are still messages about the interface suspension. Because I use lagg it looks like this:
+ [20445] re1: Interface stopped DISTRIBUTING, possible flapping
+ [48114] re0: Interface stopped DISTRIBUTING, possible flapping

B. Regarding Alex's statements. This is a real problem.
Of course, the "watchdog timeout" message itself is not harmful.
The important thing is that, in the function, the message is followed by a reset and re-initialisation of the interface, which unfortunately results in the loss or partial corruption of transmitted files / frames (which I have experienced many times).

Using version 1.93-1.94 is therefore an improvement in that not only does the message disappear (it is commented out of the function, as Alex correctly writes), but files are also not damaged during transmission (yet to be checked!).

Version 11.2-RELEASE certainly generates hundreds of "watchdog timeout" messages for me, but today I do not know whether it prevents damage or loss of transmitted data (to be checked).
I see:
	/* Cancel pending I/O and free all RX/TX buffers. */
	re_stop(sc);
	/* Put controller into known state. */
	re_reset(sc);
That means the transmitted data is dropped and lost.

C. However, I do not agree with Alex that it is good. Perhaps it is good enough for a laptop, but not for a server. It is still terrible.

D. Test 11.2 + 1.94 - I have not started yet.
Comment 32 Ralf Wostrack 2020-02-19 06:33:46 UTC
Hi,

Starting with version 12.0, I am facing this issue on my mini-ITX server.

Hardware Info:
re0@pci0:3:0:0:	class=0x020000 card=0xe0001458 chip=0x816810ec rev=0x0c hdr=0x00
    vendor     = 'Realtek Semiconductor Co., Ltd.'
    device     = 'RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller'
    class      = network
    subclass   = ethernet
re1@pci0:4:0:0:	class=0x020000 card=0xe0001458 chip=0x816810ec rev=0x0c hdr=0x00
    vendor     = 'Realtek Semiconductor Co., Ltd.'
    device     = 'RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller'
    class      = network
    subclass   = ethernet

I solved it by commenting out re_txeof in the re_tick function in if_re.c.
In my opinion it's a timing issue in high-load situations, blocking the interrupt/DMA of the device.
And of course the job is then done by the watchdog function after 5 timer ticks.

static void
re_tick(void *xsc)
{
	struct rl_softc		*sc;
	struct mii_data		*mii;

	sc = xsc;

	RL_LOCK_ASSERT(sc);

	mii = device_get_softc(sc->rl_miibus);
	mii_tick(mii);
	if ((sc->rl_flags & RL_FLAG_LINK) == 0)
		re_miibus_statchg(sc->rl_dev);
	/*
	 * Reclaim transmitted frames here. Technically it is not
	 * necessary to do here but it ensures periodic reclamation
	 * regardless of Tx completion interrupt which seems to be
	 * lost on PCIe based controllers under certain situations.
	 */
	// re_txeof(sc);
	re_watchdog(sc);
	callout_reset(&sc->rl_stat_callout, hz, re_tick, sc);
}
Comment 33 Chris Hutchinson 2020-02-19 16:32:18 UTC
FWIW we ran into this problem when we opted
to become a public CPAN mirror (perl.org),
which necessitated adding 2 more ports and
delegating 2 additional addresses. We used
a 2 port RealTek adapter. About an hour into
going live, the watchdog timeouts began
spamming the logs. The solution was to bump
2 entries in sysctl.conf(5):
kern.ipc.nmbjumbop (245550 by default)
and
kern.ipc.nmbclusters (491100 by default)
as in FreeBSD 11, these numbers are too small --
at least for these NICs.
How high you bump them depends upon the load
and traffic on your NICs. As a rule of thumb
I would suggest bumping them up a quarter of
their original values until watchdog shuts up.
All the while assessing any performance changes.
We're now on 12, and moving to 13 shortly. 12
did NOT exhibit this problem, because the
numbers are much higher by default.
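
The rule of thumb above is simple integer arithmetic. A sketch with made-up helper names (bump_quarter, suggest_bump); the starting values are whatever sysctl reports on the machine in question, not necessarily the defaults quoted above:

```shell
#!/bin/sh
# bump_quarter: hypothetical helper. Prints $1 raised by a quarter
# (integer arithmetic, any fraction is dropped).
bump_quarter() {
    echo $(( $1 + $1 / 4 ))
}

# suggest_bump: prints sysctl.conf-style lines for the two limits,
# given their current values as $1 (nmbjumbop) and $2 (nmbclusters).
suggest_bump() {
    echo "kern.ipc.nmbjumbop=$(bump_quarter "$1")"
    echo "kern.ipc.nmbclusters=$(bump_quarter "$2")"
}
```

For example, `suggest_bump "$(sysctl -n kern.ipc.nmbjumbop)" "$(sysctl -n kern.ipc.nmbclusters)"` prints the next step to try; starting from 245550 and 491100 it gives 306937 and 613875.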

HTH

--Chris
Comment 34 ml 2020-03-24 11:37:09 UTC
(In reply to Chris Hutchinson from comment #33)

Aren't those values' defaults proportional to system RAM?

From a quick survey on my systems I see:
RAM   version  kern.ipc.nmbjumbop  kern.ipc.nmbclusters
4GB   11.3     123975              247952
16GB  11.3     507532              1015064
32GB  11.3     1017580             2035160
32GB  12.1     1019729             2039460

I guess the last two differ not because of the FreeBSD version but because of a small difference in available memory.
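
The proportionality can be checked by dividing each value by the RAM size. A small sketch (the helper name per_gb is made up; it assumes rows of the form "RAM version value value" as in the survey above):

```shell
#!/bin/sh
# per_gb: hypothetical helper. Reads rows "<N>GB <version> <a> <b>" on stdin
# and prints the same row with <a> and <b> divided by N (fraction dropped).
per_gb() {
    awk 'NF == 4 && $1 + 0 > 0 {
        gb = $1 + 0                     # "16GB" converts to the number 16
        printf "%s %s %d %d\n", $1, $2, int($3 / gb), int($4 / gb)
    }'
}
```

Fed with the four survey rows, the last field comes out as 61988, 63441, 63598 and 63733 per GB, i.e. roughly constant, which is consistent with defaults that scale with memory.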

Also I've seen this problem on the third system above, where the values are 4 times the default. So increasing them a little is no solution.

I've not seen the problem on the last machine (with 12.1), but that might be because re0 is a WAN port on a slow link.
Comment 35 Chris Hutchinson 2020-03-24 14:16:55 UTC
(In reply to ml from comment #34)
For *me*, doubling the values from their defaults
fixed it. I only mentioned it as

1) disabling MSI-X (default solution) reduces
   performance -- even significantly
2) It worked.

So felt it worth mentioning. :)

--Chris