| Summary: | Realtek RTL8111/8168/8411 erratically drops network connection | | |
|---|---|---|---|
| Product: | Base System | Reporter: | Marcel Bischoff <marcel> |
| Component: | kern | Assignee: | freebsd-net (Nobody) <net> |
| Status: | New --- | | |
| Severity: | Affects Some People | CC: | eugen, kib, rgrimes |
| Priority: | --- | | |
| Version: | 12.0-RELEASE | | |
| Hardware: | amd64 | | |
| OS: | Any | | |
The RTL 1G family has a fairly long history of issues, both with FreeBSD and other systems. I discourage their use for much more than simple tasks and bench tests. If you can, I would switch NICs; if not, this is going to be a long-tail debugging effort to try to find the problem. You might start with this Bugzilla search, which looks for your chip ID in all bug reports: https://bugs.freebsd.org/bugzilla/buglist.cgi?bug_status=__all__&content=0x816810ec&list_id=292731&order=Importance&query_format=specific

Rodney, thanks for the information. I cannot change the NICs since the affected machines are dedicated servers provided by a commercial data center. Their pricing is quite competitive, which may be reflected in the hardware quality. However, I have Linux installations running on similar hardware and they have never displayed this behavior. I'm aware that FreeBSD values clean implementation over quick hacks, and issues like this one are probably hard to troubleshoot. On average the issue comes up every two months, and I was not able to reproduce it, so from what I gather it will likely remain unfixed. I have asked the data center to switch the hardware and will see where that gets me. If that is not possible, I guess I'm up for some long-tail debugging, provided a team member like you feels this would benefit the project and is prepared to dive into it with me.

(In reply to marcel from comment #3)

I was in exactly the same position, using a cheap hoster's hardware and seeing re0 watchdog timeouts. There is a simple work-around that may be acceptable if the problem is rare. Add a single line to /etc/syslog.conf (the `|` pipes matching kernel messages to a program):

    kern.* |/root/bin/monitor_nic

The simple script /root/bin/monitor_nic just does what the driver is supposed to do in such a case: reset the interface to revive it.
```sh
#!/bin/sh
PATH=/bin:/sbin:/usr/bin:/usr/sbin
while read month day time s host kernel rest
do
	case "$rest" in
	"re0: watchdog timeout")
		sleep 5
		ifconfig re0 down
		sleep 1
		ifconfig re0 up
		sleep 30
		;;
	esac
done
# EOF
```

You may need to adjust the pattern matching, as your logs have a different format compared to the logs generated by my FreeBSD 11.2 boxes.

Some time ago I started using https://www.gigabyte.com/Motherboard/GA-J3455N-D3H-rev-10#ov for my home server. The in-tree driver stops operating with the dreaded 'device timeout', and the official Realtek driver caused some weird hangs of the whole machine. I was not able to figure out what is missing in the in-tree driver. But for the Realtek code, the cause appeared quite silly: since the chips can do jumbo frames but not scatter-gather, the driver always allocated 9K clusters for the rx fill, even when the interface was configured for the standard 1500-byte MTU. At some point (2-3 weeks for my workload) memory becomes fragmented enough that the driver cannot refill the rx ring, and due to the interface mutex this cascades to everything that touches the network. I added a knob to disable jumbo and re-imported several revisions of the vendor driver there: https://github.com/kostikbel/rere After that I have been quite happy running stable/11 for a year without an issue.
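One way to check whether this jumbo-cluster fragmentation scenario applies on a given machine is to look at the `mbuf_jumbo_9k` zone in `vmstat -z`: a nonzero FAIL count means the kernel could not satisfy 9K cluster allocations. A minimal sketch, parsing a fabricated sample of the output so it can be run anywhere (on a real FreeBSD box, pipe `vmstat -z` in instead):

```sh
#!/bin/sh
# Hypothetical sample of FreeBSD `vmstat -z` output; the numbers are
# made up for illustration. The FAIL column is the 7th field when
# splitting on colons and commas.
sample='ITEM                   SIZE  LIMIT     USED     FREE      REQ FAIL SLEEP
mbuf_jumbo_9k:         9216, 169409,    1024,      12,  7742123, 317,   0'

# On a live system, replace the printf with:  vmstat -z | awk ...
fails=$(printf '%s\n' "$sample" |
    awk -F'[:,]' '/^mbuf_jumbo_9k:/ { gsub(/ /, "", $7); print $7 }')

echo "mbuf_jumbo_9k allocation failures: $fails"
```

If the count keeps growing while the interface is wedged, that points at the refill failure described above rather than a link-level problem.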
I have encountered this issue multiple times on at least two dedicated servers, one running 12.0-RELEASE and one running 11.2-RELEASE. Both have identical Realtek ethernet controllers.

`pciconf -lv` on the 11.2 machine:

```
re0@pci0:2:0:0: class=0x020000 card=0x78161462 chip=0x816810ec rev=0x06 hdr=0x00
    vendor   = 'Realtek Semiconductor Co., Ltd.'
    device   = 'RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller'
    class    = network
    subclass = ethernet
```

`pciconf -lv` on the 12.0 machine:

```
re0@pci0:2:0:0: class=0x020000 card=0x78161462 chip=0x816810ec rev=0x06 hdr=0x00
    vendor   = 'Realtek Semiconductor Co., Ltd.'
    device   = 'RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller'
    class    = network
    subclass = ethernet
```

The machine becomes unreachable because it drops the network connection repeatedly, with the following output in /var/log/messages:

```
[...]
Apr 2 10:29:01 <kern.crit> cloud kernel: [5054487] re0: watchdog timeout
Apr 2 10:29:01 <kern.notice> cloud kernel: [5054487] re0: link state changed to DOWN
Apr 2 10:29:04 <kern.notice> cloud kernel: [5054490] re0: link state changed to UP
Apr 2 10:33:31 <kern.crit> cloud kernel: [5054757] re0: watchdog timeout
Apr 2 10:33:31 <kern.notice> cloud kernel: [5054757] re0: link state changed to DOWN
Apr 2 10:33:34 <kern.notice> cloud kernel: [5054760] re0: link state changed to UP
Apr 2 10:38:01 <kern.crit> cloud kernel: [5055027] re0: watchdog timeout
Apr 2 10:38:01 <kern.notice> cloud kernel: [5055027] re0: link state changed to DOWN
Apr 2 10:38:04 <kern.notice> cloud kernel: [5055030] re0: link state changed to UP
Apr 2 10:42:35 <kern.crit> cloud kernel: [5055301] re0: watchdog timeout
Apr 2 10:42:35 <kern.notice> cloud kernel: [5055301] re0: link state changed to DOWN
Apr 2 10:42:38 <kern.notice> cloud kernel: [5055304] re0: link state changed to UP
[...]
```

This repeats indefinitely and renders the machine essentially unusable.
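To judge how often the watchdog actually fires (and whether an automatic interface reset is a viable stopgap), the timeout events can simply be counted out of the log. A small sketch that embeds a few lines from the excerpt above as sample input; on an affected machine you would point it at the real /var/log/messages instead:

```sh
#!/bin/sh
# Count re0 watchdog timeouts in syslog-style input. The here-document
# reuses lines from the excerpt above; on a real machine, run:
#   awk '/re0: watchdog timeout/ { n++ } END { print n+0 }' /var/log/messages
count=$(awk '/re0: watchdog timeout/ { n++ } END { print n+0 }' <<'EOF'
Apr 2 10:29:01 <kern.crit> cloud kernel: [5054487] re0: watchdog timeout
Apr 2 10:29:01 <kern.notice> cloud kernel: [5054487] re0: link state changed to DOWN
Apr 2 10:33:31 <kern.crit> cloud kernel: [5054757] re0: watchdog timeout
EOF
)
echo "watchdog timeouts in sample: $count"
```

In the excerpt above the timeouts recur roughly every four and a half minutes once the condition sets in, which is why a reset-on-timeout workaround alone may not be enough during a bad episode.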
There is no apparent pattern to this behavior, such as higher-than-usual load or bandwidth usage, and neither machine has any special network configuration; everything is out of the box.