Bug 254695 - Hyper-V + TCP_BBR: Kernel Panic: Assertion in_epoch(net_epoch_preempt) failed at netinet/tcp_lro.c:915
Summary: Hyper-V + TCP_BBR: Kernel Panic: Assertion in_epoch(net_epoch_preempt) failed...
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: CURRENT
Hardware: Any Any
Importance: --- Affects Some People
Assignee: freebsd-virtualization (Nobody)
URL:
Keywords: panic
Depends on:
Blocks:
 
Reported: 2021-04-01 14:02 UTC by Gordon Bergling
Modified: 2021-04-19 15:34 UTC
CC: 4 users

See Also:


Attachments
Kernel Panic (127.23 KB, image/png)
2021-04-01 14:02 UTC, Gordon Bergling

Description Gordon Bergling freebsd_committer 2021-04-01 14:02:36 UTC
Created attachment 223748 [details]
Kernel Panic

I am currently seeing a panic on a Hyper-V based virtual machine when mounting an NFS share. The -CURRENT build is from today (1st of April). I would think that this panic is Hyper-V or amd64 related. I have the same share mounted on an RPi4B, and with a build from today the share is accessible and a stress test via a buildworld was successful.

The system hangs in an endless loop, so I can currently only provide a screenshot of the panic.
Comment 1 Li-Wen Hsu freebsd_committer 2021-04-06 15:25:12 UTC
BTW, it looks more like an error in the TCP code. This might be related: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=254725
Comment 2 Gordon Bergling freebsd_committer 2021-04-06 16:01:55 UTC
(In reply to Li-Wen Hsu from comment #1)

It's mainly the call chain from vmbus_chan_task() -> hn_chan_callback() that leads me towards a bug in the Hyper-V implementation, since hn0 is the default network interface on Hyper-V.

I got an old kernel from around March 19th booted and will examine the crash dump as soon as I find some free time.
Comment 3 Gordon Bergling freebsd_committer 2021-04-07 06:51:26 UTC
The KERNCONF is the following:

---------------------------------------------------
include		GENERIC
options		RATELIMIT
options		TCPHPTS
options		KERN_TLS
options		ROUTE_MPATH
options		RANDOM_FENESTRASX
---------------------------------------------------

The dump information is:
---------------------------------------------------
Dump header from device: /dev/da0p3
  Architecture: amd64
  Architecture Version: 2
  Dump Length: 282898432
  Blocksize: 512
  Compression: none
  Dumptime: 2021-04-01 13:05:24 +0200
  Hostname: fbsd-dev.0xfce3.net
  Magic: FreeBSD Kernel Dump
  Version String: FreeBSD 14.0-CURRENT #25 main-n245771-529a2a0f2765: Thu Apr  1 11:36:01 CEST 2021
    root@lion.0xfce3.net:/boiler/nfs/obj/boiler/nfs/src/amd64.amd64/sys/GENERIC-TCP
  Panic String: Assertion in_epoch(net_epoch_preempt) failed at /boiler/nfs/src/sys/netinet/tcp_lro.c:915
  Dump Parity: 4289937466
  Bounds: 3
  Dump Status: good
---------------------------------------------------

src.conf is:
---------------------------------------------------
WITH_EXTRA_TCP_STACKS=1
WITH_BEARSSL=1
WITH_PIE=1
WITH_RETPOLINE=1
WITH_INIT_ALL_ZERO=1
---------------------------------------------------


I'll upload the dump in a minute.
Comment 4 Gordon Bergling freebsd_committer 2021-04-07 07:35:26 UTC
The crash dump is too large to upload. The stack trace is the following:

KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0051761840
vpanic() at vpanic+0x181/frame 0xfffffe0051761890
panic() at panic+0x43/frame 0xfffffe00517618f0
tcp_lro_lookup() at tcp_lro_lookup+0xef/frame 0xfffffe0051761920
tcp_lro_rx2() at tcp_lro_rx2+0x7da/frame 0xfffffe00517619f0
hn_chan_callback() at hn_chan_callback+0x1eb/frame 0xfffffe0051761ad0
vmbus_chan_task() at vmbus_chan_task+0x2f/frame 0xfffffe0051761b00
taskqueue_run_locked() at taskqueue_run_locked+0xaa/frame 0xfffffe0051761b80
taskqueue_thread_loop() at taskqueue_thread_loop+0x94/frame 0xfffffe0051761bb0
fork_exit() at fork_exit+0x80/frame 0xfffffe0051761bf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0051761bf0

If I can provide more information, just let me know.
Comment 5 Gordon Bergling freebsd_committer 2021-04-07 07:36:35 UTC
Backtrace

#0  __curthread () at /boiler/nfs/src/sys/amd64/include/pcpu_aux.h:55
#1  doadump (textdump=textdump@entry=1) at /boiler/nfs/src/sys/kern/kern_shutdown.c:399
#2  0xffffffff80c132f0 in kern_reboot (howto=260) at /boiler/nfs/src/sys/kern/kern_shutdown.c:486
#3  0xffffffff80c13750 in vpanic (fmt=<optimized out>, ap=<optimized out>) at /boiler/nfs/src/sys/kern/kern_shutdown.c:919
#4  0xffffffff80c134a3 in panic (fmt=<unavailable>) at /boiler/nfs/src/sys/kern/kern_shutdown.c:843
#5  0xffffffff80ded83f in tcp_lro_lookup (lc=0xfffffe0057299ad0, le=0xfffffe00573a3b48)
    at /boiler/nfs/src/sys/netinet/tcp_lro.c:915
#6  0xffffffff80dee79a in tcp_lro_rx2 (lc=<optimized out>, lc@entry=0xfffffe0057299ad0, m=<optimized out>,
    m@entry=0xfffff80003956d00, csum=<optimized out>, csum@entry=0, use_hash=<optimized out>, use_hash@entry=1)
    at /boiler/nfs/src/sys/netinet/tcp_lro.c:1946
#7  0xffffffff80def51f in tcp_lro_rx (lc=<unavailable>, lc@entry=0xfffffe0057299ad0, m=<unavailable>,
    m@entry=0xfffff80003956d00, csum=<unavailable>, csum@entry=0) at /boiler/nfs/src/sys/netinet/tcp_lro.c:2060
#8  0xffffffff8103822b in hn_lro_rx (lc=0xfffffe0057299ad0, m=0xfffff80003956d00)
    at /boiler/nfs/src/sys/dev/hyperv/netvsc/if_hn.c:3421
#9  hn_rxpkt (rxr=0xfffffe0057298000) at /boiler/nfs/src/sys/dev/hyperv/netvsc/if_hn.c:3722
#10 hn_rndis_rx_data (rxr=<optimized out>, data=<optimized out>, dlen=<optimized out>)
    at /boiler/nfs/src/sys/dev/hyperv/netvsc/if_hn.c:7392
#11 hn_rndis_rxpkt (rxr=<optimized out>, data=<optimized out>, dlen=<optimized out>)
    at /boiler/nfs/src/sys/dev/hyperv/netvsc/if_hn.c:7413
#12 hn_nvs_handle_rxbuf (rxr=<optimized out>, chan=0xfffff800039f7400, pkthdr=<optimized out>)
    at /boiler/nfs/src/sys/dev/hyperv/netvsc/if_hn.c:7512
#13 hn_chan_callback (chan=chan@entry=0xfffff800039f7400, xrxr=<optimized out>, xrxr@entry=0xfffffe0057298000)
    at /boiler/nfs/src/sys/dev/hyperv/netvsc/if_hn.c:7604
#14 0xffffffff810454df in vmbus_chan_task (xchan=0xfffff800039f7400, pending=<optimized out>)
    at /boiler/nfs/src/sys/dev/hyperv/vmbus/vmbus_chan.c:1381
#15 0xffffffff80c752fa in taskqueue_run_locked (queue=queue@entry=0xfffff800038b9300)
    at /boiler/nfs/src/sys/kern/subr_taskqueue.c:476
#16 0xffffffff80c76384 in taskqueue_thread_loop (arg=arg@entry=0xfffffe000428f090)
    at /boiler/nfs/src/sys/kern/subr_taskqueue.c:793
#17 0xffffffff80bcd540 in fork_exit (callout=0xffffffff80c762f0 <taskqueue_thread_loop>, arg=0xfffffe000428f090,
    frame=0xfffffe0051761c00) at /boiler/nfs/src/sys/kern/kern_fork.c:1077
#18 <signal handler called>
Comment 6 Wei Hu 2021-04-09 06:43:17 UTC
It hit this assertion in tcp_lro.c:

 911 tcp_lro_lookup(struct lro_ctrl *lc, struct lro_entry *le)
 912 {
 913         struct inpcb *inp = NULL;
 914
 915         NET_EPOCH_ASSERT();          <--- panic here
 916         switch (le->eh_type) {


How often does it occur? I am not familiar with the LRO and epoch code. The Hyper-V hn driver has had a couple of commits since March 12; they are about RSC support for packets from the same host. Is the NFS server VM running on the same Hyper-V host?

If it is easy for you to reproduce on the current build, can you try a build from before March 12 to see if it is still reproducible?
Comment 7 Gordon Bergling freebsd_committer 2021-04-09 10:30:58 UTC
(In reply to Wei Hu from comment #6)

This panic occurs right at boot time when an NFS share is mounted. I have a 12-STABLE system, also virtualized via Hyper-V on the same Windows 10 system, which acts as the NFS server. I can try to bisect this, based on the timeframe from March 12 to April 1, but it will take some time.
Comment 8 Wei Hu 2021-04-11 13:05:58 UTC
(In reply to Gordon Bergling from comment #7)
I tried to reproduce this in my test env without luck. I created two FreeBSD guests on the same Windows Server 2019 host, one being the NFS server and the other the test client running the latest commit.

I tried two builds on the client:
1. a091c353235e0ee97d2531e80d9d64e1648350f4 on April 11
2. b6fd00791f2b9690b0a5d8670fc03f74eda96da2 on March 22

Buildworld and buildkernel with the NFS share both succeeded on the client. I may not have the same environment as you do. Here is how my test env looks:

- Hyper-V Version: 10.0.17763 [SP1]
- NFS mount not at boot time. I tried automounting during boot, but it keeps complaining that the file system is not clean, can't find fsck_nfs to check it, and kicks me into single-user mode. I think that's something with my configuration, however, and irrelevant to this bug.
- Both NFS server and client are running a DEBUG build.
- Both NFS server and client are Gen-1 VMs.

It would be really helpful if you could bisect in your env.
Comment 9 Gordon Bergling freebsd_committer 2021-04-15 19:18:23 UTC
(In reply to Wei Hu from comment #8)

Thanks for the investigation. I was able to boot a kernel from today (15 April) on this machine. I tracked the issue down to either tcp_bbr or cc_htcp. I built the system with WITH_EXTRA_TCP_STACKS=1 and have

tcp_bbr_load="YES"
cc_htcp_load="YES"

in /boot/loader.conf

and

net.inet.tcp.cc.algorithm=htcp
net.inet.tcp.functions_default=bbr

in /etc/sysctl.conf.

I first disabled the sysctl.conf settings, and the panic still happens, so it is enough to just load the modules at boot time. If I disable both modules, the system starts as usual. If either of these modules is loaded at boot time, the system panics. Maybe it is something locking-related.
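
As an aside (not part of the original report), which TCP stack is actually active after boot can be double-checked independently of the sysctl.conf settings, for example with a small userland probe using sysctlbyname(3); the program below is only an illustrative sketch:

/*
 * Print the default TCP stack and the list of available stacks.  With only
 * tcp_bbr_load="YES" in loader.conf (and no functions_default setting), bbr
 * is expected to show up as available while the default stays "freebsd".
 */
#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdio.h>

static void
show(const char *name)
{
        char buf[4096];
        size_t len = sizeof(buf);

        if (sysctlbyname(name, buf, &len, NULL, 0) == -1) {
                perror(name);
                return;
        }
        printf("%s:\n%.*s\n", name, (int)len, buf);
}

int
main(void)
{
        show("net.inet.tcp.functions_default");
        show("net.inet.tcp.functions_available");
        return (0);
}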

Hope that helps to track down that issue.
Comment 10 Wei Hu 2021-04-19 15:34:34 UTC
(In reply to Gordon Bergling from comment #9)
With the tcp_bbr module loaded, have you ever successfully booted the system with NFS before?

The BBR code was introduced in commit 35c7bb340788f0ce9347b7066619d8afb31e2123 on Sept 24, 2019. I wonder if this problem has existed since day one of this commit and has only recently been tested on Hyper-V by you.