Created attachment 223748 [details]
Kernel Panic

I am currently seeing a panic on a Hyper-V based virtual machine when mounting an NFS share. The -CURRENT build is from today (1st of April). I suspect that this panic is Hyper-V or amd64 related: I have the same share mounted on an RPi4B, and with a build from today the share is accessible there and a stress test via a buildworld was successful. The system hangs in an endless loop after the panic, so I can currently only provide a screenshot of it.
BTW, it looks more like an error in the TCP code; this might be related: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=254725
(In reply to Li-Wen Hsu from comment #1)

It's mainly the call chain from vmbus_chan_task() -> hn_chan_callback() that leads me towards a bug in the Hyper-V implementation, since hn0 is the default network interface on Hyper-V. I got an old kernel from around March 19th booted and will examine the crash dump as soon as I find some free time.
The KERNCONF is the following
---------------------------------------------------
include GENERIC

options RATELIMIT
options TCPHPTS
options KERN_TLS
options ROUTE_MPATH
options RANDOM_FENESTRASX
---------------------------------------------------

The dump information is
---------------------------------------------------
Dump header from device: /dev/da0p3
  Architecture: amd64
  Architecture Version: 2
  Dump Length: 282898432
  Blocksize: 512
  Compression: none
  Dumptime: 2021-04-01 13:05:24 +0200
  Hostname: fbsd-dev.0xfce3.net
  Magic: FreeBSD Kernel Dump
  Version String: FreeBSD 14.0-CURRENT #25 main-n245771-529a2a0f2765: Thu Apr  1 11:36:01 CEST 2021
    root@lion.0xfce3.net:/boiler/nfs/obj/boiler/nfs/src/amd64.amd64/sys/GENERIC-TCP
  Panic String: Assertion in_epoch(net_epoch_preempt) failed at /boiler/nfs/src/sys/netinet/tcp_lro.c:915
  Dump Parity: 4289937466
  Bounds: 3
  Dump Status: good
---------------------------------------------------

src.conf is
---------------------------------------------------
WITH_EXTRA_TCP_STACKS=1
WITH_BEARSSL=1
WITH_PIE=1
WITH_RETPOLINE=1
WITH_INIT_ALL_ZERO=1
---------------------------------------------------

I will upload the dump in a minute.
The crash dump is too large to upload. The stacktrace is the following.

KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0051761840
vpanic() at vpanic+0x181/frame 0xfffffe0051761890
panic() at panic+0x43/frame 0xfffffe00517618f0
tcp_lro_lookup() at tcp_lro_lookup+0xef/frame 0xfffffe0051761920
tcp_lro_rx2() at tcp_lro_rx2+0x7da/frame 0xfffffe00517619f0
hn_chan_callback() at hn_chan_callback+0x1eb/frame 0xfffffe0051761ad0
vmbus_chan_task() at vmbus_chan_task+0x2f/frame 0xfffffe0051761b00
taskqueue_run_locked() at taskqueue_run_locked+0xaa/frame 0xfffffe0051761b80
taskqueue_thread_loop() at taskqueue_thread_loop+0x94/frame 0xfffffe0051761bb0
fork_exit() at fork_exit+0x80/frame 0xfffffe0051761bf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0051761bf0

If I can provide more information, just let me know.
Backtrace

#0  __curthread () at /boiler/nfs/src/sys/amd64/include/pcpu_aux.h:55
#1  doadump (textdump=textdump@entry=1) at /boiler/nfs/src/sys/kern/kern_shutdown.c:399
#2  0xffffffff80c132f0 in kern_reboot (howto=260) at /boiler/nfs/src/sys/kern/kern_shutdown.c:486
#3  0xffffffff80c13750 in vpanic (fmt=<optimized out>, ap=<optimized out>) at /boiler/nfs/src/sys/kern/kern_shutdown.c:919
#4  0xffffffff80c134a3 in panic (fmt=<unavailable>) at /boiler/nfs/src/sys/kern/kern_shutdown.c:843
#5  0xffffffff80ded83f in tcp_lro_lookup (lc=0xfffffe0057299ad0, le=0xfffffe00573a3b48) at /boiler/nfs/src/sys/netinet/tcp_lro.c:915
#6  0xffffffff80dee79a in tcp_lro_rx2 (lc=<optimized out>, lc@entry=0xfffffe0057299ad0, m=<optimized out>, m@entry=0xfffff80003956d00, csum=<optimized out>, csum@entry=0, use_hash=<optimized out>, use_hash@entry=1) at /boiler/nfs/src/sys/netinet/tcp_lro.c:1946
#7  0xffffffff80def51f in tcp_lro_rx (lc=<unavailable>, lc@entry=0xfffffe0057299ad0, m=<unavailable>, m@entry=0xfffff80003956d00, csum=<unavailable>, csum@entry=0) at /boiler/nfs/src/sys/netinet/tcp_lro.c:2060
#8  0xffffffff8103822b in hn_lro_rx (lc=0xfffffe0057299ad0, m=0xfffff80003956d00) at /boiler/nfs/src/sys/dev/hyperv/netvsc/if_hn.c:3421
#9  hn_rxpkt (rxr=0xfffffe0057298000) at /boiler/nfs/src/sys/dev/hyperv/netvsc/if_hn.c:3722
#10 hn_rndis_rx_data (rxr=<optimized out>, data=<optimized out>, dlen=<optimized out>) at /boiler/nfs/src/sys/dev/hyperv/netvsc/if_hn.c:7392
#11 hn_rndis_rxpkt (rxr=<optimized out>, data=<optimized out>, dlen=<optimized out>) at /boiler/nfs/src/sys/dev/hyperv/netvsc/if_hn.c:7413
#12 hn_nvs_handle_rxbuf (rxr=<optimized out>, chan=0xfffff800039f7400, pkthdr=<optimized out>) at /boiler/nfs/src/sys/dev/hyperv/netvsc/if_hn.c:7512
#13 hn_chan_callback (chan=chan@entry=0xfffff800039f7400, xrxr=<optimized out>, xrxr@entry=0xfffffe0057298000) at /boiler/nfs/src/sys/dev/hyperv/netvsc/if_hn.c:7604
#14 0xffffffff810454df in vmbus_chan_task (xchan=0xfffff800039f7400, pending=<optimized out>) at /boiler/nfs/src/sys/dev/hyperv/vmbus/vmbus_chan.c:1381
#15 0xffffffff80c752fa in taskqueue_run_locked (queue=queue@entry=0xfffff800038b9300) at /boiler/nfs/src/sys/kern/subr_taskqueue.c:476
#16 0xffffffff80c76384 in taskqueue_thread_loop (arg=arg@entry=0xfffffe000428f090) at /boiler/nfs/src/sys/kern/subr_taskqueue.c:793
#17 0xffffffff80bcd540 in fork_exit (callout=0xffffffff80c762f0 <taskqueue_thread_loop>, arg=0xfffffe000428f090, frame=0xfffffe0051761c00) at /boiler/nfs/src/sys/kern/kern_fork.c:1077
#18 <signal handler called>
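For reference, the same backtrace can be pulled out of the vmcore with kgdb (from the devel/gdb port/package). The paths below assume the default crash directory and that the "Bounds: 3" value from the dump header above maps to vmcore.3:

# kgdb /boot/kernel/kernel /var/crash/vmcore.3
(kgdb) bt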
It hit this assert in tcp_lro.c:

911 tcp_lro_lookup(struct lro_ctrl *lc, struct lro_entry *le)
912 {
913         struct inpcb *inp = NULL;
914
915         NET_EPOCH_ASSERT();    <--- panic here
916         switch (le->eh_type) {

How often does it occur? I am not familiar with the LRO and epoch code. The Hyper-V hn driver has had a couple of commits since March 12; those commits are about RSC support for packets from the same host. Is the NFS server VM running on the same Hyper-V host? If it is easy for you to reproduce on the current build, can you try any build from before March 12 to see if it is still reproducible?
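For context, the assertion only checks that the caller has entered the network epoch (the panic string is literally "Assertion in_epoch(net_epoch_preempt) failed") before the LRO code walks its state. A driver RX completion path is normally expected to bracket the LRO calls roughly like the fragment below. This is a minimal sketch using the epoch(9) KPI, not the actual if_hn.c code; the function and variable names are made up for illustration.

/*
 * Sketch only: the point is that tcp_lro_rx()/tcp_lro_flush_all()
 * must run inside the network epoch, otherwise NET_EPOCH_ASSERT()
 * in tcp_lro_lookup() fires on an INVARIANTS kernel.
 */
static void
example_rxeof(struct ifnet *ifp, struct lro_ctrl *lc, struct mbuf *m)
{
        struct epoch_tracker et;

        NET_EPOCH_ENTER(et);                    /* enter net_epoch_preempt */
        if (tcp_lro_rx(lc, m, 0) != 0)          /* try to aggregate ... */
                (*ifp->if_input)(ifp, m);       /* ... or fall back to plain input */
        tcp_lro_flush_all(lc);
        NET_EPOCH_EXIT(et);                     /* leave the epoch before returning */
}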
(In reply to Wei Hu from comment #6)

This panic occurs right at boot time when an NFS share is mounted. I have a 12-STABLE system, also virtualized via Hyper-V on the same Windows 10 system, which acts as the NFS server. I can try to bisect this, based on the timeframe from March 12 to April 1, but this will take some time.
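The rough plan would be a standard git bisect over that window; the two hashes below are only placeholders for the last known-good commit from around March 12 and the first known-bad one from April 1:

# in the src checkout
git bisect start
git bisect bad  <first-bad-commit>      # placeholder, e.g. the April 1 build
git bisect good <last-good-commit>      # placeholder, e.g. a commit before March 12
# for every revision git checks out: rebuild and install the kernel,
# reboot, mount the NFS share, then mark the result:
git bisect good    # or: git bisect bad
git bisect reset   # when finished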
(In reply to Gordon Bergling from comment #7)

I tried to reproduce this in my test environment without luck. I created two FreeBSD guests on the same Windows Server 2019 host, one being the NFS server and the other the test client running the latest commit. I tried two builds on the client:

1. a091c353235e0ee97d2531e80d9d64e1648350f4, from April 11
2. b6fd00791f2b9690b0a5d8670fc03f74eda96da2, from March 22

buildworld and buildkernel on the NFS share both succeeded on the client. I may not have the same environment as you do. Here is how my test environment looks:

- Hyper-V Version: 10.0.17763 [SP1]
- The NFS mount is not done at boot time. I tried automounting during boot, but it keeps complaining that the file system is not clean, cannot find fsck_nfs to check it, and kicks me into single-user mode. I think that is something in my configuration, but it is irrelevant to this bug.
- Both the NFS server and the client are running DEBUG builds.
- Both the NFS server and the client are Gen-1 VMs.

It would be really helpful if you could bisect this in your environment.
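As a side note, the "cannot find fsck_nfs" complaint during boot typically points at a nonzero fsck pass field on the NFS line in /etc/fstab. An entry roughly like the following (server and path are made-up placeholders) mounts at boot without being handed to fsck:

nfsserver:/export/src  /mnt/src  nfs  rw,bg  0  0

The trailing "0 0" disables the dump and fsck pass fields for that entry.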
(In reply to Wei Hu from comment #8)

Thanks for the investigation. I was able to boot a kernel from today (15 April) on this machine and have tracked the issue down to tcp_bbr or cc_htcp. I build the system with WITH_EXTRA_TCP_STACKS=1 and have

tcp_bbr_load="YES"
cc_htcp_load="YES"

in /boot/loader.conf and

net.inet.tcp.cc.algorithm=htcp
net.inet.tcp.functions_default=bbr

in /etc/sysctl.conf. I first disabled only the sysctl.conf settings and the panic still happened, so loading the modules at boot time is enough to trigger it. If I disable both modules the system starts as usual; if either of these modules is loaded at boot time, the system panics. Maybe it is something locking related. Hope that helps to track down the issue.
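To double-check after boot which of the two is actually registered, the usual base-system commands are (nothing specific to this setup):

kldstat | grep -E 'tcp_bbr|cc_htcp'
sysctl net.inet.tcp.functions_available
sysctl net.inet.tcp.cc.available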
(In reply to Gordon Bergling from comment #9)

With the tcp_bbr module loaded, have you ever successfully booted the system with NFS before? The BBR code was introduced in commit 35c7bb340788f0ce9347b7066619d8afb31e2123 on Sept 24, 2019. I wonder if this problem has existed since day 1 of that commit and has only now been tested on Hyper-V by you.