| Summary: | Realtek RTL8111/8168/8411 erratically drops network connection | | |
|---|---|---|---|
| Product: | Base System | Reporter: | Marcel Bischoff <marcel> |
| Component: | kern | Assignee: | freebsd-net (Nobody) <net> |
| Status: | New --- | | |
| Severity: | Affects Some People | CC: | eugen, kib, rgrimes |
| Priority: | --- | | |
| Version: | 12.0-RELEASE | | |
| Hardware: | amd64 | | |
| OS: | Any | | |

Description
Marcel Bischoff
2019-04-02 14:16:50 UTC
The RTL 1G family has a fairly long history of issues, both with FreeBSD and with other systems. I discourage their use for much more than simple tasks and bench tests. If you can, I would switch NICs; if not, this is going to be a long-tail debugging issue to try and find the problem. You might start with this Bugzilla search:

https://bugs.freebsd.org/bugzilla/buglist.cgi?bug_status=__all__&content=0x816810ec&list_id=292731&order=Importance&query_format=specific

I searched for your chip ID in all bug reports...

Rodney, thanks for the information. I cannot change the NICs, since the affected machines are dedicated servers provided by a commercial data center. Since their pricing is quite competitive, it appears this may reflect on the hardware side. However, I have Linux installations running on similar hardware and they have never displayed this behavior. I'm aware that FreeBSD values clean implementation over quick hacks, and issues like this one are probably hard to troubleshoot. On average, the issue comes up every two months, and I have not been able to reproduce it. From what I gather, this will likely remain unfixed. I have asked the data center to switch hardware and will see where this gets me. If this is not possible, I guess I'm up for some long-tail debugging, provided a team member like you feels this would benefit the project and is prepared to dive into this with me.

(In reply to marcel from comment #3) I was in exactly the same position, using a cheap hoster's hardware and seeing re0 watchdog timeouts. There is a simple workaround that may be acceptable if the problem is rare. Add a single line to /etc/syslog.conf:

    kern.* |/root/bin/monitor_nic

The simple script /root/bin/monitor_nic just does what the driver is supposed to do in such a case: it resets the interface to revive it.

    #!/bin/sh
    PATH=/bin:/sbin:/usr/bin:/usr/sbin

    # syslogd pipes each matching kernel log line to stdin;
    # split it into fields and keep the message text in $rest.
    while read month day time s host kernel rest
    do
      case "$rest" in
        "re0: watchdog timeout")
          # Bounce the interface to revive it, then pause
          # before handling further messages.
          sleep 5
          ifconfig re0 down
          sleep 1
          ifconfig re0 up
          sleep 30
          ;;
      esac
    done
    # EOF

You may need to adjust the pattern matching, as your logs may have a different format compared to the logs generated on my FreeBSD 11.2 boxes.

Some time ago I started using https://www.gigabyte.com/Motherboard/GA-J3455N-D3H-rev-10#ov for my home server. The in-tree driver stops operating with the dreaded 'device timeout', and the official Realtek driver caused some weird hangs of the whole machine. I was not able to figure out what is missing in the in-tree driver. But for the Realtek code, the cause appeared quite silly. Since the chips are able to do jumbo frames but not scatter-gather, the vendor driver always allocated 9K clusters for rx fill, even if the interface was configured for the standard 1500 MTU. After some time (2-3 weeks for my workload), memory becomes fragmented enough that the driver cannot refill rx, and due to the interface mutex, this cascaded to everything that touched the network. I added a knob to disable jumbo and re-imported several revisions of the vendor driver there: https://github.com/kostikbel/rere After that I am quite happy running stable/11 for a year without an issue.
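
One rough way to check whether a machine is running into the 9K cluster refill problem described above is to watch the jumbo cluster counters reported by `netstat -m`. The sketch below is only illustrative: it assumes the usual FreeBSD output line of the form "A/B/C requests for jumbo clusters denied (4k/9k/16k)", and the helper name `jumbo9k_denied` is made up for this example. A non-zero, growing value would be consistent with the fragmentation scenario, but it does not prove it.

```shell
#!/bin/sh
# Sketch: report how many 9k jumbo cluster requests the kernel has denied.
# Assumption: `netstat -m` prints a line of the form
#   "A/B/C requests for jumbo clusters denied (4k/9k/16k)"
# jumbo9k_denied is a hypothetical helper name, not part of any driver.

# Read `netstat -m` style output on stdin and print the 9k "denied" counter.
jumbo9k_denied() {
    awk '/requests for jumbo clusters denied/ {
        split($1, n, "/")   # n[1]=4k, n[2]=9k, n[3]=16k
        print n[2]
    }'
}

# Usage on a live FreeBSD system:
#   netstat -m | jumbo9k_denied
```

Reading from stdin keeps the helper easy to try against saved `netstat -m` output from an affected box, for example output captured by a cron job before the interface stalls.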