Bug 272416 - Seldom crash happening with RTL8125
Summary: Seldom crash happening with RTL8125
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 13.2-RELEASE
Hardware: Any Any
Importance: --- Affects Only Me
Assignee: freebsd-net (Nobody)
URL:
Keywords: crash
Depends on:
Blocks:
 
Reported: 2023-07-07 16:26 UTC by Jonathan Vasquez
Modified: 2024-09-13 06:18 UTC (History)
CC List: 6 users

See Also:


Attachments
re-driver-crash (288.40 KB, text/plain)
2023-07-07 16:26 UTC, Jonathan Vasquez
crash on fbsd 14-release-p2 (129.24 KB, text/plain)
2023-12-11 02:46 UTC, Jonathan Vasquez

Description Jonathan Vasquez 2023-07-07 16:26:52 UTC
Created attachment 243309 [details]
re-driver-crash

Hello all,

I'm not sure whether this should be filed against the base system or against the port, given that my system needs the "net/realtek-re-kmod" port to have a working Ethernet connection.

I'm using 13.2-RELEASE with a mostly stock kernel; the only change is a DELAY(5000) that I added while testing a separate crash involving my AMD Raven HDA Controller (https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=268393#c48). Since these are different subsystems, hopefully it isn't interfering. I believe I've experienced this crash before, but it happened only rarely. I wanted to report it so that we can start collecting data on the problem.

I've attached a crash dump as well.

re0@pci0:12:0:0:        class=0x020000 rev=0x05 hdr=0x00 vendor=0x10ec device=0x8125 subvendor=0x1043 subdevice=0x87d7
    vendor     = 'Realtek Semiconductor Co., Ltd.'
    device     = 'RTL8125 2.5GbE Controller'
    class      = network
    subclass   = ethernet

/etc/rc.conf

hostname="weshly"
dumpdev="AUTO"
kld_list="amdgpu vboxdrv fusefs"
clear_tmp_enable="YES"

# Disable Mail
sendmail_enable="NO"
sendmail_submit_enable="NO"
sendmail_outbound_enable="NO"
sendmail_msp_queue_enable="NO"

# Networking
ifconfig_re0="DHCP"
ifconfig_re0_ipv6="inet6 accept_rtadv"

# Services
zfs_enable="YES"
dbus_enable="YES"
seatd_enable="YES"
mixer_enable="YES"

syncthing_enable="YES"
syncthing_user="jon"
syncthing_group="jon"

# NFS
nfs_client_enable="YES"
rpc_lockd_enable="YES"
rpc_statd_enable="YES"

When the crash happened, I had VirtualBox running in the background (although it was mostly idle). The VM was a Windows 10 guest with its network adapter set to NAT.

Thank you,

Jonathan
Comment 1 Jonathan Vasquez 2023-07-07 16:27:30 UTC
I forgot to post this:

root@weshly:~ # pkg info net/realtek-re-kmod
realtek-re-kmod-198.00_2
Name           : realtek-re-kmod
Version        : 198.00_2
Installed on   : Mon Jul  3 10:19:36 2023 EDT
Origin         : net/realtek-re-kmod
Architecture   : FreeBSD:13:amd64
Prefix         : /usr/local
Categories     : net kld
Licenses       : BSD4CLAUSE
Maintainer     : ale@FreeBSD.org
WWW            : https://github.com/alexdupre/rtl_bsd_drv
Comment        : Kernel driver for Realtek PCIe Ethernet Controllers
Annotations    :
        FreeBSD_version: 1302001
        repo_type      : binary
        repository     : Weshly
Flat size      : 1023KiB
Description    :
Realtek PCIe FE / GBE / 2.5G / Gaming Ethernet Family Controller
kernel driver.

This is the official driver from Realtek with a few patches to
improve stability and performance. It can be loaded instead of
the FreeBSD driver built into the GENERIC kernel if you experience
issues with it (eg. watchdog timeouts), or your card is not supported.

Supported devices:

* 2.5G Gigabit Ethernet
  - RTL8125 / RTL8125B(S)(G)

* 10/100/1000M Gigabit Ethernet
  - RTL8111B / RTL8111C / RTL8111D / RTL8111E / RTL8111F / RTL8111G(S)
    RTL8111H(S) / RTL8118(A)(S) / RTL8119i / RTL8111L / RTL8111K
  - RTL8168B / RTL8168E / RTL8168H
  - RTL8111DP / RTL8111EP / RTL8111FP
  - RTL8411 / RTL8411B

* 10/100M Fast Ethernet
  - RTL8101E / RTL8102E / RTL8103E / RTL8105E / RTL8106E / RTL8107E
  - RTL8401 / RTL8402

See also: https://www.realtek.com/en/component/zoo/category/network-interface-controllers-10-100-1000m-gigabit-ethernet-pci-express-software

WWW: https://github.com/alexdupre/rtl_bsd_drv
Comment 2 Tino Engel 2023-07-08 06:04:22 UTC
Hey, I think I know how to fix this; you can assign it to me if you want.
Comment 3 Jonathan Vasquez 2023-12-11 02:46:15 UTC
Created attachment 246965 [details]
crash on fbsd 14-release-p2

I'm uploading a new kernel crash dump from the latest FreeBSD 14.0-RELEASE-p2, compiled from the latest releng/14.0 commit (06497fbd52e2f138b7d590c8499d9cebad182850). I can now reproduce the error essentially deterministically: I just need to stress the card and it will hang. Sometimes it hangs right after I log into the system and start my session. Opening "transmission-gtk" to help seed all of the AMD64 Torrent links at https://wiki.freebsd.org/Torrents is enough to crash the system. Please let me know if there is anything I can do to help debug this further.
Comment 4 Mina Galić freebsd_triage 2023-12-11 02:57:55 UTC
I've set the attachment type to text/plain, and as far as I can see, it looks the same as the previous crash, so let's extract the most important part:

Fatal trap 12: page fault while in kernel mode
cpuid = 16; apic id = 10
fault virtual address	= 0x10007
fault code		= supervisor read data, page not present
instruction pointer	= 0x20:0xffffffff80c686e0
stack pointer	        = 0x28:0xfffffe015f114d30
frame pointer	        = 0x28:0xfffffe015f114d80
code segment		= base rx0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 0 (re0 taskq)
rdi: 0000000000000000 rsi: 00000000040003f4 rdx: 00000000ffffffff
rcx: 0000000000000001  r8: 0000000000000403  r9: 0000000000000403
rax: 0000000000000000 rbx: 000000000000ffff rbp: fffffe015f114d80
r10: 00000000000100a4 r11: 00000000000072c3 r12: 0000000000008803
r13: 000000000000ffff r14: fffffe015ecb9c80 r15: 0000000000000000
trap number		= 12
panic: page fault
cpuid = 16
time = 1702262468
KDB: stack backtrace:
#0 0xffffffff80b9002d at kdb_backtrace+0x5d
#1 0xffffffff80b43132 at vpanic+0x132
#2 0xffffffff80b42ff3 at panic+0x43
#3 0xffffffff8100c85c at trap_fatal+0x40c
#4 0xffffffff8100c8af at trap_pfault+0x4f
#5 0xffffffff80fe39c8 at calltrap+0x8
#6 0xffffffff8253edd0 at re_rxeof+0x2c0
#7 0xffffffff8252f87a at re_int_task_8125+0xba
#8 0xffffffff80ba5922 at taskqueue_run_locked+0x182
#9 0xffffffff80ba6bb2 at taskqueue_thread_loop+0xc2
#10 0xffffffff80afdb0f at fork_exit+0x7f
#11 0xffffffff80fe4a2e at fork_trampoline+0xe
Uptime: 38s
Dumping 2707 out of 65221 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%

__curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:57
57		__asm("movq %%gs:%P1,%0" : "=r" (td) : "n" (offsetof(struct pcpu,
(kgdb) #0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:57
#1  doadump (textdump=<optimized out>)
    at /usr/src/sys/kern/kern_shutdown.c:405
#2  0xffffffff80b42cc7 in kern_reboot (howto=260)
    at /usr/src/sys/kern/kern_shutdown.c:526
#3  0xffffffff80b4319f in vpanic (fmt=0xffffffff81136b3b "%s", 
    ap=ap@entry=0xfffffe015f114b80) at /usr/src/sys/kern/kern_shutdown.c:970
#4  0xffffffff80b42ff3 in panic (fmt=<unavailable>)
    at /usr/src/sys/kern/kern_shutdown.c:894
#5  0xffffffff8100c85c in trap_fatal (frame=0xfffffe015f114c70, eva=65543)
    at /usr/src/sys/amd64/amd64/trap.c:952
#6  0xffffffff8100c8af in trap_pfault (frame=0xfffffe015f114c70, 
    usermode=false, signo=<optimized out>, ucode=<optimized out>)
    at /usr/src/sys/amd64/amd64/trap.c:760
#7  <signal handler called>
#8  ether_input (ifp=<optimized out>, m=0xffff)
    at /usr/src/sys/net/if_ethersubr.c:849
#9  0xffffffff8253edd0 in re_rxeof () from /boot/modules/if_re.ko
#10 0xffffffff8252f87a in re_int_task_8125 () from /boot/modules/if_re.ko
#11 0xffffffff80ba5922 in taskqueue_run_locked (queue=0xfffffe006ba84168, 
    queue@entry=0xfffff800025df700) at /usr/src/sys/kern/subr_taskqueue.c:512
#12 0xffffffff80ba6bb2 in taskqueue_thread_loop (
    arg=arg@entry=0xfffffe006ba84238)
    at /usr/src/sys/kern/subr_taskqueue.c:824
#13 0xffffffff80afdb0f in fork_exit (
    callout=0xffffffff80ba6af0 <taskqueue_thread_loop>, 
    arg=0xfffffe006ba84238, frame=0xfffffe015f114f40)
    at /usr/src/sys/kern/kern_fork.c:1160
#14 <signal handler called>
#15 0x3b1daa5d375daa59 in ?? ()
Backtrace stopped: Cannot access memory at address 0x9c2a3111906a3115
(kgdb)
Comment 5 Jonathan Vasquez 2023-12-11 03:22:06 UTC
Thanks Mina. That’s correct, it’s mostly the same. I posted it so that we have the most recent data possible (on a 14.X base; my previous dump was from a while ago on 13.X).
Comment 6 Jonathan Vasquez 2023-12-11 14:52:19 UTC
This is the output of 'list *[addr]' and 'backtrace' from kgdb for the dump I provided yesterday. Let me know if you need me to recompile the kernel with a debugging configuration; that may provide further information if the current info isn't enough. Thank you!

(kgdb) list *0xffffffff80c686e0
0xffffffff80c686e0 is in ether_input (/usr/src/sys/net/if_ethersubr.c:849).
844		 */
845		CURVNET_SET_QUIET(ifp->if_vnet);
846		if (__predict_false(needs_epoch))
847			NET_EPOCH_ENTER(et);
848		while (m) {
849			mn = m->m_nextpkt;
850			m->m_nextpkt = NULL;
851	
852			/*
853			 * We will rely on rcvif being set properly in the deferred
(kgdb) backtrace
#0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:57
#1  doadump (textdump=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:405
#2  0xffffffff80b42cc7 in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:526
#3  0xffffffff80b4319f in vpanic (fmt=0xffffffff81136b3b "%s", ap=ap@entry=0xfffffe015f114b80)
    at /usr/src/sys/kern/kern_shutdown.c:970
#4  0xffffffff80b42ff3 in panic (fmt=<unavailable>) at /usr/src/sys/kern/kern_shutdown.c:894
#5  0xffffffff8100c85c in trap_fatal (frame=0xfffffe015f114c70, eva=65543) at /usr/src/sys/amd64/amd64/trap.c:952
#6  0xffffffff8100c8af in trap_pfault (frame=0xfffffe015f114c70, usermode=false, signo=<optimized out>,
    ucode=<optimized out>) at /usr/src/sys/amd64/amd64/trap.c:760
#7  <signal handler called>
#8  ether_input (ifp=<optimized out>, m=0xffff) at /usr/src/sys/net/if_ethersubr.c:849
#9  0xffffffff8253edd0 in re_rxeof () from /boot/modules/if_re.ko
#10 0xffffffff8252f87a in re_int_task_8125 () from /boot/modules/if_re.ko
#11 0xffffffff80ba5922 in taskqueue_run_locked (queue=0xfffffe006ba84168, queue@entry=0xfffff800025df700)
    at /usr/src/sys/kern/subr_taskqueue.c:512
#12 0xffffffff80ba6bb2 in taskqueue_thread_loop (arg=arg@entry=0xfffffe006ba84238)
    at /usr/src/sys/kern/subr_taskqueue.c:824
#13 0xffffffff80afdb0f in fork_exit (callout=0xffffffff80ba6af0 <taskqueue_thread_loop>, arg=0xfffffe006ba84238,
    frame=0xfffffe015f114f40) at /usr/src/sys/kern/kern_fork.c:1160
#14 <signal handler called>
#15 0x3b1daa5d375daa59 in ?? ()
Backtrace stopped: Cannot access memory at address 0x9c2a3111906a3115
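
The numbers above appear to be consistent with one another: `ether_input` was entered with `m=0xffff`, the faulting statement is `mn = m->m_nextpkt`, and the trap reports a fault virtual address of 0x10007 (`eva=65543`). The sketch below only checks that arithmetic. It assumes a simplified stand-in for the head of `struct mbuf` in which `m_nextpkt` is the second 8-byte field; `fake_mbuf`, `predicted_fault_address` and `matches_trap` are hypothetical names, not kernel APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical, simplified stand-in for the first two fields of the
 * kernel's struct mbuf. Assumption: m_nextpkt directly follows m_next,
 * and both are 8 bytes wide as pointers are on amd64.
 */
struct fake_mbuf {
	uint64_t m_next;
	uint64_t m_nextpkt;
};

/* The assumed layout puts m_nextpkt 8 bytes into the structure. */
static_assert(offsetof(struct fake_mbuf, m_nextpkt) == 8,
    "m_nextpkt is expected at offset 8 in this sketch");

/* Address that a read of m->m_nextpkt touches for a given value of m. */
static uint64_t
predicted_fault_address(uint64_t m)
{
	return (m + offsetof(struct fake_mbuf, m_nextpkt));
}

/* Non-zero when the pointer from the backtrace explains the trap address. */
static int
matches_trap(uint64_t m, uint64_t fault_va)
{
	return (predicted_fault_address(m) == fault_va);
}
```

Under those assumptions, `predicted_fault_address(0xffff)` gives 0x10007, the address in the trap output, which suggests the bad value is the mbuf pointer that `re_rxeof` passed down rather than anything inside `ether_input` itself.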
Comment 7 Tino Engel 2023-12-11 18:49:19 UTC
I had the same problem about a year ago. Unfortunately, I have forgotten what workaround I used.
Anyhow, this is the part of /usr/ports/net/realtek-re-kmod/work/rtl_bsd_drv-d3a7a3d/if_re.c where the problem seems to happen:

  7103	#if OS_VER < VERSION(4,9)
  7104	                /* Remove header from mbuf and pass it on. */
  7105	                m_adj(m, sizeof(struct ether_header));
  7106	                ether_input(ifp, eh, m);
  7107	#else
  7108	                (*ifp->if_input)(ifp, m);
  7109	#endif
  7110	                RE_LOCK(sc);

From the crash dump stacks, it seems to end up in line 7106, which is between 
#if OS_VER < VERSION(4,9)
and
#else

see also Comment #4:
#8  ether_input (ifp=<optimized out>, m=0xffff)
    at /usr/src/sys/net/if_ethersubr.c:849

That seems wrong to me for FreeBSD 13. See if_rereg.h:
#define OS_VER	__FreeBSD_version
Comment 8 Zhenlei Huang freebsd_committer freebsd_triage 2023-12-12 04:43:41 UTC
(In reply to Tino Engel from comment #7)
> From the crash dump stacks, it seems to end up in line 7106, which is between 
> #if OS_VER < VERSION(4,9)
> and
> #else

> see also Comment #4:
> #8  ether_input (ifp=<optimized out>, m=0xffff)
>     at /usr/src/sys/net/if_ethersubr.c:849

> That seems wrong to me for FreeBSD 13. See if_rereg.h:
> #define OS_VER	__FreeBSD_version

See /usr/src/sys/net/if_ethersubr.c:

```
void
ether_ifattach(struct ifnet *ifp, const u_int8_t *lla)
{
...
    ifp->if_input = ether_input;
...
}
```
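
This suggests the `#else` branch is the one actually in use: the driver calls `(*ifp->if_input)(ifp, m)`, and because `ether_ifattach` stored `ether_input` in that pointer, the backtrace shows `ether_input` directly above `re_rxeof`. A minimal userland sketch of that indirection follows; every `fake_*` name is a hypothetical stand-in, not one of the kernel's real types or functions.

```c
#include <assert.h>
#include <stddef.h>

struct fake_mbuf {
	struct fake_mbuf *m_nextpkt;
};

/* Hypothetical stand-in for struct ifnet: only the input hook matters here. */
struct fake_ifnet {
	void (*if_input)(struct fake_ifnet *, struct fake_mbuf *);
	int input_calls;	/* how often the input handler ran */
};

/* Stand-in for ether_input(): just records that it was reached. */
static void
fake_ether_input(struct fake_ifnet *ifp, struct fake_mbuf *m)
{
	(void)m;
	ifp->input_calls++;
}

/* Stand-in for ether_ifattach(): installs the handler in the hook. */
static void
fake_ether_ifattach(struct fake_ifnet *ifp)
{
	ifp->if_input = fake_ether_input;
}

/*
 * Stand-in for the driver's receive path: it never names the handler,
 * it only calls through the pointer, as the #else branch does.
 */
static int
fake_rxeof(struct fake_ifnet *ifp, struct fake_mbuf *m)
{
	assert(ifp->if_input != NULL);
	(*ifp->if_input)(ifp, m);
	return (ifp->input_calls);
}
```

After `fake_ether_ifattach()`, a call to `fake_rxeof()` lands in `fake_ether_input()` even though the receive path only dereferences `if_input`, which is the same shape as frames #8 and #9 in the backtrace.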
Comment 9 Jonathan Vasquez 2023-12-14 00:10:50 UTC
I've installed a PCIe Intel(R) X520-1 82599EN SFP+ 10GbE NIC (and a TP-Link transceiver) in the server, which will replace the Realtek onboard NIC (and thus avoid this issue). I'll still have the onboard NIC available, so if anyone wants me to try patches, let me know. I'll be happy to send traffic through it.
Comment 10 Mithun 2024-09-12 06:33:52 UTC
(In reply to Jonathan Vasquez from comment #9)

Hi Jonathan Vasquez,

I’m experiencing a similar problem with the if_re module crashing on my FreeBSD system when I load the module. I saw your comment and bug report about the Realtek onboard NIC issue. Would it be possible for you to share the patch you mentioned? I would greatly appreciate any guidance on how to apply and test it.

Thanks for your help!

Best regards,
Mithun
Comment 11 Jonathan Vasquez 2024-09-12 13:20:05 UTC
Hey Mithun,

I don't have a patch for this. I installed an Intel NIC on a spare PCIe slot that I have and it's been rock solid and fast now. So I'm no longer using Realtek for my networking.
Comment 12 Mithun 2024-09-13 06:18:49 UTC
(In reply to Jonathan Vasquez from comment #11)

Hi Jonathan,

Thanks for your response! Unfortunately, I don't have a spare slot to switch NICs, so I'll continue looking for a solution with the if_re module.