Bug 236962

Summary: Realtek RTL8111/8168/8411 erratically drops network connection
Product: Base System
Reporter: Marcel Bischoff <marcel>
Component: kern
Assignee: freebsd-net (Nobody) <net>
Status: New
Severity: Affects Some People
CC: eugen, kib, rgrimes
Priority: ---    
Version: 12.0-RELEASE   
Hardware: amd64   
OS: Any   

Description Marcel Bischoff 2019-04-02 14:16:50 UTC
I have encountered this issue multiple times on at least two dedicated servers, one running 12.0-RELEASE and one 11.2-RELEASE. Both have identical Realtek ethernet controllers.

"pciconf -lv" on 11.2 machine:

re0@pci0:2:0:0: class=0x020000 card=0x78161462 chip=0x816810ec rev=0x06 hdr=0x00
    vendor     = 'Realtek Semiconductor Co., Ltd.'
    device     = 'RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller'
    class      = network
    subclass   = ethernet

"pciconf -lv" on 12.0 machine:

re0@pci0:2:0:0: class=0x020000 card=0x78161462 chip=0x816810ec rev=0x06 hdr=0x00
    vendor     = 'Realtek Semiconductor Co., Ltd.'
    device     = 'RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller'
    class      = network
    subclass   = ethernet

The machine is not reachable because it drops the network connection repeatedly with the following output in "/var/log/messages":

[...]
Apr  2 10:29:01 <kern.crit> cloud kernel: [5054487] re0: watchdog timeout
Apr  2 10:29:01 <kern.notice> cloud kernel: [5054487] re0: link state changed to DOWN
Apr  2 10:29:04 <kern.notice> cloud kernel: [5054490] re0: link state changed to UP
Apr  2 10:33:31 <kern.crit> cloud kernel: [5054757] re0: watchdog timeout
Apr  2 10:33:31 <kern.notice> cloud kernel: [5054757] re0: link state changed to DOWN
Apr  2 10:33:34 <kern.notice> cloud kernel: [5054760] re0: link state changed to UP
Apr  2 10:38:01 <kern.crit> cloud kernel: [5055027] re0: watchdog timeout
Apr  2 10:38:01 <kern.notice> cloud kernel: [5055027] re0: link state changed to DOWN
Apr  2 10:38:04 <kern.notice> cloud kernel: [5055030] re0: link state changed to UP
Apr  2 10:42:35 <kern.crit> cloud kernel: [5055301] re0: watchdog timeout
Apr  2 10:42:35 <kern.notice> cloud kernel: [5055301] re0: link state changed to DOWN
Apr  2 10:42:38 <kern.notice> cloud kernel: [5055304] re0: link state changed to UP
[...]

This repeats indefinitely and renders the machine essentially unusable.

There is no apparent pattern to this behavior, such as higher-than-usual load or bandwidth usage. Neither machine has any special network configuration; everything is out-of-the-box.
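
The spacing of the timeouts can at least be measured from the log itself. The following is a minimal sketch, assuming each kernel message carries a bracketed counter such as "[5054487]" (which appears to be uptime in seconds, as in the excerpt above). The function name watchdog_intervals is made up for illustration.

```shell
#!/bin/sh
# watchdog_intervals: read syslog lines on stdin and print the number of
# seconds between consecutive "re0: watchdog timeout" messages.
# Assumption: each message contains a "[NNN]" field holding uptime in seconds.
watchdog_intervals() {
  awk '
    /re0: watchdog timeout/ {
      for (i = 1; i <= NF; i++) {
        # Pick the field that looks like "[digits]" and strip the brackets.
        if ($i ~ /^\[[0-9]+\]$/) {
          t = substr($i, 2, length($i) - 2) + 0
          if (seen) print t - prev
          prev = t
          seen = 1
        }
      }
    }
  '
}
```

Usage: "watchdog_intervals < /var/log/messages". For the four timeouts quoted above this gives 270, 270 and 274 seconds, which matches the wall-clock timestamps.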
Comment 1 Rodney W. Grimes freebsd_committer freebsd_triage 2019-04-22 21:56:17 UTC
The RTL 1G family has a fairly long history of issues, both with FreeBSD and other systems.  I discourage their use for much more than simple tasks and bench tests.  If you can, I would switch NICs; if not, this is going to be a long-tail debugging effort to try and find the problem.
Comment 2 Rodney W. Grimes freebsd_committer freebsd_triage 2019-04-22 21:58:34 UTC
You might start with this bugzilla search:
https://bugs.freebsd.org/bugzilla/buglist.cgi?bug_status=__all__&content=0x816810ec&list_id=292731&order=Importance&query_format=specific

I searched for your chipid in all bug reports...
Comment 3 Marcel Bischoff 2019-04-27 08:01:26 UTC
Rodney, thanks for the information. I cannot change the NICs since the machines affected are dedicated servers provided by a commercial data center. Since their pricing is quite competitive, it appears this may reflect on the hardware side. However, I have Linux installations running on similar hardware and they never displayed this behavior.

I'm aware that FreeBSD values clean implementation over quick hacks and issues like this one are probably hard to troubleshoot. On average, the issue comes up every two months and I wasn't able to reproduce it. From what I gather, this will likely remain unfixed.

I have asked the data center to switch hardware and see where this gets me. If this is not possible, I guess I'm up for some long-tail debugging, provided a team member like you feels this would benefit the project and is prepared to dive into this with me.
Comment 4 Eugene Grosbein freebsd_committer freebsd_triage 2019-04-27 10:48:09 UTC
(In reply to marcel from comment #3)

I was in exactly the same position, using a cheap hoster's hardware and getting re0 watchdog timeouts. There is a simple workaround that may be acceptable if the problem is rare. Add a single line to /etc/syslog.conf:

kern.* |/root/bin/monitor_nic

The simple script /root/bin/monitor_nic just does what the driver is supposed to do in such a case: reset the interface to revive it.

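#!/bin/sh
# Reads kernel log lines piped in by syslogd and bounces re0 after a
# watchdog timeout.
PATH=/bin:/sbin:/usr/bin:/usr/sbin
# Split off the leading fields; "rest" holds the remaining message text.
while read month day time s host kernel rest
do
  case "$rest" in
  "re0: watchdog timeout")
    sleep 5
    # Take the interface down and bring it back up.
    ifconfig re0 down
    sleep 1
    ifconfig re0 up
    # Pause before handling further messages.
    sleep 30
    ;;
  esac
done
# EOF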
Comment 5 Eugene Grosbein freebsd_committer freebsd_triage 2019-04-27 10:51:52 UTC
You may need to adjust the pattern matching, as your logs have a different format compared to the logs generated on my FreeBSD 11.2 boxes.
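
One possible adjustment, as a sketch only: match the message at the end of the whole line, so that an extra field such as "[5054487]" in front of "re0:" no longer matters. The function names is_watchdog, reset_nic and monitor are made up for illustration, and the guard on the script name assumes the file is installed as /root/bin/monitor_nic.

```shell
#!/bin/sh
PATH=/bin:/sbin:/usr/bin:/usr/sbin

# is_watchdog LINE: succeed if LINE ends with the re0 watchdog message,
# whatever fields precede it.
is_watchdog() {
  case "$1" in
  *"re0: watchdog timeout") return 0 ;;
  *) return 1 ;;
  esac
}

# reset_nic: bounce the interface with the same delays as the script
# in comment #4.
reset_nic() {
  sleep 5
  ifconfig re0 down
  sleep 1
  ifconfig re0 up
  sleep 30
}

# monitor: read log lines from stdin and reset the NIC on each match.
monitor() {
  while IFS= read -r line
  do
    if is_watchdog "$line"; then
      reset_nic
    fi
  done
}

# Run the loop only when invoked under the installed name (assumption:
# the script is saved as monitor_nic).
case "$0" in
*monitor_nic) monitor ;;
esac
```

Matching with a leading "*" accepts both log formats seen in this report. It would also match any other line that happens to end with the same text.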
Comment 6 Konstantin Belousov freebsd_committer freebsd_triage 2019-04-27 11:13:04 UTC
Some time ago I started using
https://www.gigabyte.com/Motherboard/GA-J3455N-D3H-rev-10#ov
for my home server.  The in-tree driver stopped operating with the dreaded
'device timeout', and the official Realtek driver caused some weird hangs
of the whole machine.

I was not able to figure out what is missing in the in-tree driver.  But
for the Realtek code, the cause appeared quite silly.  Since the chips are
able to do jumbo frames, but not scatter-gather, the driver always allocated
9K clusters for rx fill, even if the interface was configured for the
standard 1500 MTU.  After some time (2-3 weeks for my workload) memory
became fragmented enough that the driver could not refill rx, and due to
the interface mutex, this cascaded to everything that touched the network.

I added a knob to disable jumbo and re-imported several revisions of the
vendor driver there:
https://github.com/kostikbel/rere

After that I am quite happy running stable/11 for a year without an issue.
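
If 9K cluster allocations fail in the way described above, this would be expected to show up in the mbuf statistics as denied jumbo-cluster requests. The following is a minimal sketch, assuming the "netstat -m" summary contains a line of the form "0/0/0 requests for jumbo clusters denied (4k/9k/16k)". The exact wording may differ between releases, and the function name denied_9k is made up for illustration.

```shell
#!/bin/sh
# denied_9k: read "netstat -m" output on stdin and print the number of
# denied requests for 9k jumbo clusters.
# Assumption: the summary has a line shaped like
#   "0/0/0 requests for jumbo clusters denied (4k/9k/16k)"
denied_9k() {
  awk '/requests for jumbo clusters denied/ {
    split($1, n, "/")   # n[1] = 4k, n[2] = 9k, n[3] = 16k
    print n[2]
  }'
}
```

Usage: "netstat -m | denied_9k". A non-zero value that keeps growing would be consistent with the fragmentation explanation, though it does not prove it.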