I'm running network benchmarks with IPsec in transport mode. It is a simple "iperf3" run after setting up IPsec with "setkey". The network adapter is igb (I210). Here is the backtrace:

Fatal trap 9: general protection fault while in kernel mode
cpuid = 3; apic id = 06
instruction pointer = 0x20:0xffffffff8077ae62
stack pointer = 0x28:0xfffffe00004de280
frame pointer = 0x28:0xfffffe00004de2a0
code segment = base rx0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = resume, IOPL = 0
current process = 0 (if_io_tqg_3)
[ thread pid 0 tid 100068 ]
Stopped at intr_execute_handlers+0x12: addq $0x1,(%rax)
db> bt
Tracing pid 0 tid 100068 td 0xfffff80002970000
intr_execute_handlers() at intr_execute_handlers+0x12/frame 0xfffffe00004de2a0
lapic_handle_intr() at lapic_handle_intr+0x44/frame 0xfffffe00004de2c0
Xapic_isr1() at Xapic_isr1+0xd9/frame 0xfffffe00004de2c0
--- interrupt, rip = 0xffffffff80721d1a, rsp = 0xfffffe00004de390, rbp = 0xfffffe00004de3a0 ---
spinlock_exit() at spinlock_exit+0x3a/frame 0xfffffe00004de3a0
putchar() at putchar+0x14e/frame 0xfffffe00004de420
kvprintf() at kvprintf+0x106/frame 0xfffffe00004de540
vprintf() at vprintf+0x84/frame 0xfffffe00004de610
printf() at printf+0x43/frame 0xfffffe00004de670
trap_fatal() at trap_fatal+0x9d/frame 0xfffffe00004de6c0
trap() at trap+0x6d/frame 0xfffffe00004de7d0
calltrap() at calltrap+0x8/frame 0xfffffe00004de7d0
--- trap 0x9, rip = 0xffffffff804b13e9, rsp = 0xfffffe00004de8a0, rbp = 0xfffffe00004de8f0 ---
__rw_rlock_hard() at __rw_rlock_hard+0xb9/frame 0xfffffe00004de8f0
bpf_mtap() at bpf_mtap+0x46/frame 0xfffffe00004de970
ether_nh_input() at ether_nh_input+0xca/frame 0xfffffe00004de9c0
netisr_dispatch_src() at netisr_dispatch_src+0xa1/frame 0xfffffe00004dea20
ether_input() at ether_input+0x26/frame 0xfffffe00004dea40
_task_fn_rx() at _task_fn_rx+0x7ea/frame 0xfffffe00004deb30
gtaskqueue_run_locked() at gtaskqueue_run_locked+0xe3/frame 0xfffffe00004deb80
gtaskqueue_thread_loop() at gtaskqueue_thread_loop+0x88/frame 0xfffffe00004debb0
fork_exit() at fork_exit+0x76/frame 0xfffffe00004debf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00004debf0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
Steps to reproduce for me:

(1) Two hosts:
192.168.134.1, 12-ALPHA7, slow, without AES-NI
192.168.134.2, 11-STABLE, fast, with AES-NI

(2) Set up IPsec transport mode for TCP port 5201 (the iperf3 port):

(a) on 192.168.134.2

setkey -c <<__END
flush;
spdflush;
add 192.168.134.1 192.168.134.2 esp 0x10001 -E rijndael-cbc "0123456789abcdef";
add 192.168.134.2 192.168.134.1 esp 0x10002 -E rijndael-cbc "0123456789abcdef";
spdadd 192.168.134.2/32[5201] 192.168.134.1/32 tcp -P out ipsec esp/transport//require;
spdadd 192.168.134.1/32 192.168.134.2/32[5201] tcp -P in ipsec esp/transport//require;
__END

(b) on 192.168.134.1

setkey -c <<__END
flush;
spdflush;
add 192.168.134.1 192.168.134.2 esp 0x10001 -E rijndael-cbc "0123456789abcdef";
add 192.168.134.2 192.168.134.1 esp 0x10002 -E rijndael-cbc "0123456789abcdef";
spdadd 192.168.134.1/32 192.168.134.2/32[5201] tcp -P out ipsec esp/transport//require;
spdadd 192.168.134.2/32[5201] 192.168.134.1/32 tcp -P in ipsec esp/transport//require;
__END

(3) Run "iperf3 -s" on 192.168.134.2

(4) Run "iperf3 -c 192.168.134.2 -R" on 192.168.134.1

(5) Almost instant crash on 192.168.134.1.

It looks like it has something to do with timing: in the same setup with the slow 192.168.134.1 replaced by a much faster, AES-NI-capable system (same FreeBSD version), the crash is much harder to reproduce. I've got only one crash in 6 hours of testing with the fast system.
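The two setkey inputs in step (2) differ only in policy direction, so they can be generated from one template. The following is a minimal sketch, not part of the original report: the function name gen_setkey_conf and its argument order are made up for illustration, and the key and SPIs are the throwaway test values quoted above.

```shell
#!/bin/sh
# Print the setkey(8) input for one end of the test setup.
# Usage: gen_setkey_conf <client_ip> <server_ip> <role>
#   role "server" = the host running "iperf3 -s" (TCP port 5201)
#   role "client" = the host running the iperf3 client
# gen_setkey_conf is a hypothetical helper; key and SPIs are test values.
gen_setkey_conf() {
    client=$1
    server=$2
    role=$3
    key='"0123456789abcdef"'

    # The SAs are identical on both hosts; only the policy direction flips.
    if [ "$role" = "server" ]; then
        to_client_dir=out
        to_server_dir=in
    else
        to_client_dir=in
        to_server_dir=out
    fi

    cat <<EOF
flush;
spdflush;
add $client $server esp 0x10001 -E rijndael-cbc $key;
add $server $client esp 0x10002 -E rijndael-cbc $key;
spdadd $server/32[5201] $client/32 tcp -P $to_client_dir ipsec esp/transport//require;
spdadd $client/32 $server/32[5201] tcp -P $to_server_dir ipsec esp/transport//require;
EOF
}
```

On the router this would be used as "gen_setkey_conf 192.168.134.1 192.168.134.2 client | setkey -c", and with the role "server" on the other host. The two spdadd lines come out in a fixed order, which differs from the order in (b) above; the order of the two policies should not matter here.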
Stopping all bpf consumers (dhcp client, dhcp server) doesn't help.
Do you have kernel.debug and crashdump handy to make kgdb backtrace?
(In reply to Eugene Grosbein from comment #3) Unfortunately, not right now. It is the NanoBSD installation of my home router, so it doesn't have permanent writable storage and it auto-reboots. I could reproduce the crash at the weekend (Saturday/Sunday) with a serial console attached and kgdb & kernel.debug installed.
(In reply to Lev A. Serebryakov from comment #4) If your kernel has "options NETDUMP" configured, you can try configuring your router to dump to a different host on the local network. See netdump(4) and the netdumpd port.
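A minimal sketch of what the netdump setup could look like on the router, assuming the FreeBSD 12 form of dumpon(8) (-c client, -s server, then the interface). The function name enable_netdump is hypothetical, and the defaults are only guesses taken from the addresses in this report: igb0 as the interface, 192.168.134.1 as the crashing router, and 192.168.134.2 as the host running netdumpd.

```shell
#!/bin/sh
# Point kernel crash dumps at a netdumpd server on the local network.
# Needs a kernel built with "options NETDUMP"; run as root on the router.
# enable_netdump is a hypothetical helper; all defaults are assumptions.
enable_netdump() {
    iface=${1:-igb0}             # interface the dump is sent from
    client=${2:-192.168.134.1}   # address of the crashing host
    server=${3:-192.168.134.2}   # host running netdumpd
    dumpon -c "$client" -s "$server" "$iface"
}
```

Calling "enable_netdump" with no arguments uses the defaults above; see netdump(4) and dumpon(8) for the authoritative option list on the release in use.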
(In reply to Mark Johnston from comment #5) Ooooh, thank you, I've missed this feature, I'll rebuild it with it! Great!
Adding NETBOOT doesn't help. I've checked at console and system simply reboots now, without panic, crash report or debugger or anything like this. Ok. I'll add INVARIANTS and WITNESS to kernel and try again.
(In reply to Lev A. Serebryakov from comment #7) Do you mean NETDUMP? The system crashes when you configure netdump on the router, or after? You might try setting debug.debugger_on_panic=1 to see if anything useful is printed before the reboot.
(In reply to Mark Johnston from comment #8) I mean that the new kernel with the NETDUMP option enabled silently reboots without a "panic" message, debugger prompt, crashdump or anything like that. It simply reboots at some point during testing, and that's all. So NETDUMP itself doesn't make the system crash. It works as usual until I start testing. But it doesn't panic under load anymore, it just reboots. I'll try to add some cooling and build a kernel with all debugging options on Saturday.
(In reply to Lev A. Serebryakov from comment #9) You can still add some USB stick and configure kernel to use its /dev/da0s1b partition as crashdump target. The kernel is somewhat picky when it decides if it should write to a device or not, so make sure you correctly create and label traditional "swap" partition.
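As a sketch of the USB-stick suggestion, the relevant rc.conf(5) settings might look like this. The device name is the one from the comment above; the dump directory is an assumption and, on a NanoBSD image without persistent storage, would have to point at something writable.

```shell
# /etc/rc.conf fragment (sh syntax)
# Swap-type partition on the USB stick, used as the kernel dump device.
dumpdev="/dev/da0s1b"
# Where savecore(8) stores vmcore.N on the next boot (assumed path).
dumpdir="/var/crash"
```

The same thing can be done by hand with "dumpon /dev/da0s1b" before the test and "savecore /var/crash /dev/da0s1b" after the reboot.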
(In reply to Eugene Grosbein from comment #10) I understand it. Yes, I'll try this tomorrow. And I'll try to add some BIG FAN to this box (it is a passively cooled Mini-ITX box), as I'm starting to suspect that it may be overheating. The output of "sysctl dev.cpu.0.temperature" doesn't show anything suspicious, values are around 51-55°C, but maybe something else overheats?
Ok. Now I have a kernel with INVARIANTS and WITNESS, and I have space for crashdumps. I've got 3 crashdumps with exactly the same panic message:

Assertion (staterr & E1000_RXD_STAT_DD) != 0 failed at /data/src/sys/dev/e1000/em_txrx.c:698

Here is the stack trace:

#0 doadump (textdump=1) at pcpu.h:230
#1 0xffffffff80565a70 in kern_reboot (howto=260) at /data/src/sys/kern/kern_shutdown.c:446
#2 0xffffffff80565ec3 in vpanic (fmt=<value optimized out>, ap=<value optimized out>) at /data/src/sys/kern/kern_shutdown.c:872
#3 0xffffffff80565c23 in panic (fmt=<value optimized out>) at /data/src/sys/kern/kern_shutdown.c:799
#4 0xffffffff803f1de4 in em_isc_rxd_pkt_get (arg=<value optimized out>, ri=<value optimized out>) at /data/src/sys/dev/e1000/em_txrx.c:698
#5 0xffffffff806688d8 in iflib_rxeof (rxq=0xfffff80002295800, budget=<value optimized out>) at /data/src/sys/net/iflib.c:2684
#6 0xffffffff80664d19 in _task_fn_rx (context=0xfffff80002295800) at /data/src/sys/net/iflib.c:3820
#7 0xffffffff805a5e49 in gtaskqueue_run_locked (queue=0xfffff800021dc400) at /data/src/sys/kern/subr_gtaskqueue.c:332
#8 0xffffffff805a5c08 in gtaskqueue_thread_loop (arg=<value optimized out>) at /data/src/sys/kern/subr_gtaskqueue.c:507
#9 0xffffffff8052f5c4 in fork_exit (callout=0xffffffff805a5b80 <gtaskqueue_thread_loop>, arg=0xfffffe00017f8008, frame=0xfffffe000043ac00) at /data/src/sys/kern/kern_fork.c:1057
#10 0xffffffff8081cbfe in fork_trampoline () at /data/src/sys/amd64/amd64/exception.S:993
#11 0x0000000000000000 in ?? ()
Without debug options in the kernel the crashes are all different, but it is always a GPE. Looks like memory corruption. Please note that without SAD/SPD everything works. And with a "null" SAD everything works for hours. Even sending works with IPsec and aes-256-gcm/aes-256-cbc. Only the combination of IPsec with true encryption and receiving data leads to a crash.
One crash without debug options Fatal trap 9: general protection fault while in kernel mode cpuid = 1; apic id = 01 instruction pointer = 0x20:0xffffffff806585ea cpuid = 0; apic id = 00 instruction pointer = 0x20:0xffffffff806dab77 stack pointer = 0x28:0xfffffe0025d10a10 frame pointer = 0x28:0xfffffe0025d10b40 stack pointer = 0x28:0xfffffe0000470a30 code segment = base rx0, limit 0xfffff, type 0x1b = DPL 0, pres 1, long 1, def32 0, gran 1 processor eflags = interrupt enabled, resume, IOPL = 0 current process = 11 (swi1: netisr 0) trap number = 9 panic: general protection fault #0 doadump (textdump=1) at pcpu.h:230 #1 0xffffffff8056008b in kern_reboot (howto=260) at /data/src/sys/kern/kern_shutdown.c:446 #2 0xffffffff805604c3 in vpanic (fmt=<value optimized out>, ap=<value optimized out>) at /data/src/sys/kern/kern_shutdown.c:872 #3 0xffffffff805602b3 in panic (fmt=<value optimized out>) at /data/src/sys/kern/kern_shutdown.c:799 #4 0xffffffff8081fd2f in trap_fatal (frame=0xfffffe0025d10950, eva=0) at /data/src/sys/amd64/amd64/trap.c:929 #5 0xffffffff8081f22d in trap (frame=0xfffffe0025d10950) at counter.h:87 #6 0xffffffff807ff367 in calltrap () at /data/src/sys/amd64/amd64/exception.S:232 #7 0xffffffff806dab77 in ip_output (m=<value optimized out>, opt=<value optimized out>, ro=<value optimized out>, flags=<value optimized out>, imo=0x0, inp=0x0) at /data/src/sys/netinet/ip_output.c:659 #8 0xffffffff807462b8 in ipsec_process_done (m=0xfffff8009947b700, sp=0x0, sav=0xfffff800058ef200, idx=1) at /data/src/sys/netipsec/ipsec_output.c:796 #9 0xffffffff8075b55b in esp_output_cb (crp=0xfffff800b20bf000) at /data/src/sys/netipsec/xform_esp.c:951 #10 0xffffffff80784007 in swcr_process (dev=<value optimized out>, crp=<value optimized out>, hint=<value optimized out>) at /data/src/sys/opencrypto/cryptosoft.c:1222 #11 0xffffffff807804eb in crypto_dispatch (crp=0xfffff800b20bf000) at /data/src/sys/opencrypto/crypto.c:1001 #12 0xffffffff8075af80 in esp_output 
(m=0xfffff8009947b700, sp=0xfffff8000541eb00, sav=<value optimized out>, idx=0, skip=20, protoff=9) at /data/src/sys/netipsec/xform_esp.c:869 #13 0xffffffff8074581f in ipsec4_perform_request (m=<value optimized out>, sp=<value optimized out>, inp=0xfffff800058fc3d0, idx=0) at /data/src/sys/netipsec/ipsec_output.c:275 #14 0xffffffff80745916 in ipsec4_output (m=0xfffff8009947b700, inp=0xfffff800058fc3d0) at /data/src/sys/netipsec/ipsec_output.c:292 #15 0xffffffff806da49e in ip_output (m=<value optimized out>, opt=<value optimized out>, ro=<value optimized out>, flags=<value optimized out>, imo=0x0, inp=0xfffff800058fc3d0) at /data/src/sys/netinet/ip_output.c:549 #16 0xffffffff806e84b5 in tcp_output (tp=0xfffff80099d72000) at /data/src/sys/netinet/tcp_output.c:1409 #17 0xffffffff806e4bc3 in tcp_do_segment (m=0xfffff800b2473400, th=<value optimized out>, so=0xfffff80005e67000, tp=0xfffff80099d72000, drop_hdrlen=52, tlen=<value optimized out>, iptos=0 '\0') at atomic.h:221 #18 0xffffffff806e1a51 in tcp_input (mp=<value optimized out>, offp=<value optimized out>, proto=<value optimized out>) at /data/src/sys/netinet/tcp_input.c:1392 #19 0xffffffff806d2078 in ip_input (m=0x0) at /data/src/sys/netinet/ip_input.c:827 #20 0xffffffff8065c343 in swi_net (arg=<value optimized out>) at /data/src/sys/net/netisr.c:901 #21 0xffffffff8052ffc5 in ithread_loop (arg=<value optimized out>) at /data/src/sys/kern/kern_intr.c:1043 #22 0xffffffff8052d0a6 in fork_exit (callout=0xffffffff8052fe60 <ithread_loop>, arg=0xfffff8000204e540, frame=0xfffffe0025d11c00) at /data/src/sys/kern/kern_fork.c:1057 #23 0xffffffff8080034e in fork_trampoline () at /data/src/sys/amd64/amd64/exception.S:993
Other crash without debug options Fatal trap 9: general protection fault while in kernel mode cpuid = 0; cpuid = 1; apic id = 01 instruction pointer = 0x20:0xffffffff806585ea stack pointer = 0x28:0xfffffe0000470a30 frame pointer = 0x28:0xfffffe0000470b00 apic id = 00 instruction pointer = 0x20:0xffffffff806dab77 code segment = base rx0, limit 0xfffff, type 0x1b stack pointer = 0x28:0xfffffe0025d10a10 = DPL 0, pres 1, long 1, def32 0, gran 1 processor eflags = frame pointer = 0x28:0xfffffe0025d10b40 interrupt enabled, resume, IOPL = 0 current process = 0 (if_io_tqg_1) trap number = 9 panic: general protection fault cpuid = 1 #0 doadump (textdump=1) at pcpu.h:230 #1 0xffffffff8056008b in kern_reboot (howto=260) at /data/src/sys/kern/kern_shutdown.c:446 #2 0xffffffff805604c3 in vpanic (fmt=<value optimized out>, ap=<value optimized out>) at /data/src/sys/kern/kern_shutdown.c:872 #3 0xffffffff805602b3 in panic (fmt=<value optimized out>) at /data/src/sys/kern/kern_shutdown.c:799 #4 0xffffffff8081fd2f in trap_fatal (frame=0xfffffe0000470970, eva=0) at /data/src/sys/amd64/amd64/trap.c:929 #5 0xffffffff8081f22d in trap (frame=0xfffffe0000470970) at counter.h:87 #6 0xffffffff807ff367 in calltrap () at /data/src/sys/amd64/amd64/exception.S:232 #7 0xffffffff806585ea in iflib_rxeof (rxq=<value optimized out>, budget=<value optimized out>) at /data/src/sys/net/iflib.c:2770 #8 0xffffffff80654970 in _task_fn_rx (context=0xfffff80002292ac0) at /data/src/sys/net/iflib.c:3820 #9 0xffffffff8059fa63 in gtaskqueue_run_locked (queue=0xfffff800021da500) at /data/src/sys/kern/subr_gtaskqueue.c:332 #10 0xffffffff8059f7e8 in gtaskqueue_thread_loop (arg=<value optimized out>) at /data/src/sys/kern/subr_gtaskqueue.c:507 #11 0xffffffff8052d0a6 in fork_exit (callout=0xffffffff8059f760 <gtaskqueue_thread_loop>, arg=0xfffffe00017fa020, frame=0xfffffe0000470c00) at /data/src/sys/kern/kern_fork.c:1057 #12 0xffffffff8080034e in fork_trampoline () at /data/src/sys/amd64/amd64/exception.S:993 #13 
0x0000000000000000 in ?? ()
And third non-debug crash Fatal trap 9: general protection fault while in kernel mode cpuid = 0; apic id = 00 instruction pointer = 0x20:0xffffffff806dab77 stack pointer = 0x28:0xfffffe0025d10a10 frame pointer = 0x28:0xfffffe0025d10b40 cpuid = 1; apic id = 01 instruction pointer = 0x20:0xffffffff806585ea code segment = base rx0, limit 0xfffff, type 0x1b stack pointer = 0x28:0xfffffe0000470a30 = DPL 0, pres 1, long 1, def32 0, gran 1 frame pointer = 0x28:0xfffffe0000470b00 processor eflags = interrupt enabled, code segment = base rx0, limit 0xfffff, type 0x1b resume, IOPL = 0 current process = 11 (swi1: netisr 0) trap number = 9 = DPL 0, pres 1, long 1, def32 0, gran 1 #0 doadump (textdump=1) at pcpu.h:230 #1 0xffffffff8056008b in kern_reboot (howto=260) at /data/src/sys/kern/kern_shutdown.c:446 #2 0xffffffff805604c3 in vpanic (fmt=<value optimized out>, ap=<value optimized out>) at /data/src/sys/kern/kern_shutdown.c:872 #3 0xffffffff805602b3 in panic (fmt=<value optimized out>) at /data/src/sys/kern/kern_shutdown.c:799 #4 0xffffffff8081fd2f in trap_fatal (frame=0xfffffe0025d10950, eva=0) at /data/src/sys/amd64/amd64/trap.c:929 #5 0xffffffff8081f22d in trap (frame=0xfffffe0025d10950) at counter.h:87 #6 0xffffffff807ff367 in calltrap () at /data/src/sys/amd64/amd64/exception.S:232 #7 0xffffffff806dab77 in ip_output (m=<value optimized out>, opt=<value optimized out>, ro=<value optimized out>, flags=<value optimized out>, imo=0x0, inp=0x0) at /data/src/sys/netinet/ip_output.c:659 #8 0xffffffff807462b8 in ipsec_process_done (m=0xfffff80005782400, sp=0x0, sav=0xfffff80005402100, idx=1) at /data/src/sys/netipsec/ipsec_output.c:796 #9 0xffffffff8075b55b in esp_output_cb (crp=0xfffff8008c5e1080) at /data/src/sys/netipsec/xform_esp.c:951 #10 0xffffffff80784007 in swcr_process (dev=<value optimized out>, crp=<value optimized out>, hint=<value optimized out>) at /data/src/sys/opencrypto/cryptosoft.c:1222 #11 0xffffffff807804eb in crypto_dispatch (crp=0xfffff8008c5e1080) at 
/data/src/sys/opencrypto/crypto.c:1001 #12 0xffffffff8075af80 in esp_output (m=0xfffff80005782400, sp=0xfffff8008c371100, sav=<value optimized out>, idx=0, skip=20, protoff=9) at /data/src/sys/netipsec/xform_esp.c:869 #13 0xffffffff8074581f in ipsec4_perform_request (m=<value optimized out>, sp=<value optimized out>, inp=0xfffff80005a611e8, idx=0) at /data/src/sys/netipsec/ipsec_output.c:275 #14 0xffffffff80745916 in ipsec4_output (m=0xfffff80005782400, inp=0xfffff80005a611e8) at /data/src/sys/netipsec/ipsec_output.c:292 #15 0xffffffff806da49e in ip_output (m=<value optimized out>, opt=<value optimized out>, ro=<value optimized out>, flags=<value optimized out>, imo=0x0, inp=0xfffff80005a611e8) at /data/src/sys/netinet/ip_output.c:549 #16 0xffffffff806e84b5 in tcp_output (tp=0xfffff80005a63760) at /data/src/sys/netinet/tcp_output.c:1409 #17 0xffffffff806e4bc3 in tcp_do_segment (m=0xfffff80005da8700, th=<value optimized out>, so=0xfffff80005b5ea38, tp=0xfffff80005a63760, drop_hdrlen=52, tlen=<value optimized out>, iptos=0 '\0') at atomic.h:221 #18 0xffffffff806e1a51 in tcp_input (mp=<value optimized out>, offp=<value optimized out>, proto=<value optimized out>) at /data/src/sys/netinet/tcp_input.c:1392 #19 0xffffffff806d2078 in ip_input (m=0x0) at /data/src/sys/netinet/ip_input.c:827 #20 0xffffffff8065c343 in swi_net (arg=<value optimized out>) at /data/src/sys/net/netisr.c:901 #21 0xffffffff8052ffc5 in ithread_loop (arg=<value optimized out>) at /data/src/sys/kern/kern_intr.c:1043 #22 0xffffffff8052d0a6 in fork_exit (callout=0xffffffff8052fe60 <ithread_loop>, arg=0xfffff8000204e540, frame=0xfffffe0025d11c00) at /data/src/sys/kern/kern_fork.c:1057 #23 0xffffffff8080034e in fork_trampoline () at /data/src/sys/amd64/amd64/exception.S:993 #24 0x0000000000000000 in ?? ()
I have all crashdumps saved, and I have corresponding kernel.full saved, too, so I could provide any additional information which could be extracted with "kgdb" from these.
Do I need to provide additional information?
Lev, what versions are you testing on? Given the stacks in the last few comments, it appears you're using cryptosoft driver. If you're not already on r338953, please try updating to that revision and retesting. Another thing you could do without rebuilding kernel is load the 'aesni' driver (which probably provides better performance too).
(In reply to Conrad Meyer from comment #19) It is r339021. aesni is useless here, as this hardware doesn't have support for it.
Ok, I think it's probably not an OCF bug then. Plenty of room for a NIC or IPsec bug, though.
(In reply to Conrad Meyer from comment #21) It gives me idea to test on AES-NI capable hardware with other NICs (igb instead of em) but without AES-NI loaded, to force it use soft crypto. Also, I could test VM installation with vtnet (and disable AES-NI for it). Hardware with igb and AES-NI works without any problems.
Ok, let me try and understand what has been tested. Please correct me if I am mistaken:

- igb + ??? + bpf = crash (initial description)?
- igb + AESNI + no bpf = no crash
- em + !AESNI + no bpf = crash

Have you tried, or can you try:

- igb + !AESNI (unload aesni.ko module)
- em + AESNI (is it possible to move em NIC to the CPU that supports AESNI, or is it soldered to the board?)

Additionally, would it be possible for me to access your kernel binaries and core dump(s) from comments 14-16?

One other question -- what revision is the 11-STABLE machine on?

Thanks.
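For the "igb + !AESNI" case, unloading the module before the test run could be sketched as follows. The function name force_soft_crypto is made up for illustration, and the sketch assumes aesni was loaded as a module (aesni.ko) rather than compiled into the kernel.

```shell
#!/bin/sh
# Unload aesni(4) so that IPsec falls back to the software crypto driver.
# Run as root on the FreeBSD host under test.
# force_soft_crypto is a hypothetical helper name.
force_soft_crypto() {
    # kldunload(8) fails if the module is not loaded or is built in.
    if kldunload aesni 2>/dev/null; then
        echo "aesni unloaded"
    else
        echo "aesni was not unloaded (not loaded, or compiled into the kernel)"
    fi
}
```

After the test, "kldload aesni" restores the accelerated driver.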
CCing ae@ who is more familiar with ipsec than me :-).
(In reply to Conrad Meyer from comment #23) Now I have:

igb + AESNI + bpf: one crash on an older revision, no crashes in several hours of testing on newer revisions. That is the very first stack trace, and I can not reproduce it anymore (with or without bpf). Looks like we could skip this one, as it is not reproducible.

em + !AESNI: crash with or without bpf. All other stack traces (with INVARIANTS and without) are for this configuration. Looks like bpf is not involved at all.

Unfortunately, I can not swap NICs or CPUs, as it is embedded-like hardware with everything soldered to the board.

I'll try igb + !AESNI tonight by unloading aesni.ko in the first place.

I'll send you a URL for kernels + dumps via e-mail, as I'm not sure they don't contain sensitive data (they should not, but still).

The 11-STABLE which serves as the "other end" of this test setup is r338960.
I can report that:

vtnet0 + AESNI + INVARIANTS: no crash.
vtnet0 + !AESNI + INVARIANTS: no crash.
vtnet0 + AESNI + !INVARIANTS: no crash.
vtnet0 + !AESNI + !INVARIANTS: no crash.

I still need to perform the tests for igb + !AESNI.
igb0 + !AESNI + !INVARIANTS — crash. But I can not provide dumps or stacks yet :-( Looks like it is combination of Intel NICs and soft crypto.
It is still an issue for ALPHA8, r339259.
netdump(8) via same igb0 as test itself leads to loop of panics :-)
(In reply to Lev A. Serebryakov from comment #29) Are you able to grab a backtrace or at least a panic message in this case? netdump does have a lot of failure modes when used on a busy interface. It works best when configured on a management interface.
Ok, system with igb0 CAN NOT crash dumps in automatic mode at all, even if it is local dump. It goes to panic loop and print on console this again and again and again, forever: Fatal trap 9: general protection fault while in kernel mode cpuid = 2; apic id = 04 instruction pointer = 0x20:0xffffffff8057e206 stack pointer = 0x28:0xfffffe00255a3c20 frame pointer = 0x28:0xfffffe00255a3c20 code segment = base rx0, limit 0xfffff, type 0x1b = DPL 0, pres 1, long 1, def32 0, gran 1 processor eflags = resume, IOPL = 0 current process = 0 (if_io_tqg_2) trap number = 9 panic: general protection fault cpuid = 2 time = 1539126183 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00255a3930 vpanic() at vpanic+0x1a3/frame 0xfffffe00255a3990 panic() at panic+0x43/frame 0xfffffe00255a39f0 trap_fatal() at trap_fatal+0x35f/frame 0xfffffe00255a3a40 trap() at trap+0x6d/frame 0xfffffe00255a3b50 calltrap() at calltrap+0x8/frame 0xfffffe00255a3b50 --- trap 0x9, rip = 0xffffffff8057e206, rsp = 0xfffffe00255a3c20, rbp = 0xfffffe00255a3c20 --- strcmp() at strcmp+0x6/frame 0xfffffe00255a3c20 eventhandler_find_list() at eventhandler_find_list+0x4b/frame 0xfffffe00255a3c50 kern_reboot() at kern_reboot+0x103/frame 0xfffffe00255a3ca0 vpanic() at vpanic+0x203/frame 0xfffffe00255a3d00 panic() at panic+0x43/frame 0xfffffe00255a3d60 trap_fatal() at trap_fatal+0x35f/frame 0xfffffe00255a3db0 trap() at trap+0x6d/frame 0xfffffe00255a3ec0 calltrap() at calltrap+0x8/frame 0xfffffe00255a3ec0 --- trap 0x9, rip = 0xffffffff8057e206, rsp = 0xfffffe00255a3f90, rbp = 0xfffffe00255a3f90 --- strcmp() at strcmp+0x6/frame 0xfffffe00255a3f90 eventhandler_find_list() at eventhandler_find_list+0x4b/frame 0xfffffe00255a3fc0 kern_reboot() at kern_reboot+0x103/frame 0xfffffe00255a4010 vpanic() at vpanic+0x203/frame 0xfffffe00255a4070 panic() at panic+0x43/frame 0xfffffe00255a40d0 trap_fatal() at trap_fatal+0x35f/frame 0xfffffe00255a4120 trap() at trap+0x6d/frame 0xfffffe00255a4230 
calltrap() at calltrap+0x8/frame 0xfffffe00255a4230 --- trap 0x9, rip = 0xffffffff8057e206, rsp = 0xfffffe00255a4300, rbp = 0xfffffe00255a4300 --- strcmp() at strcmp+0x6/frame 0xfffffe00255a4300 eventhandler_find_list() at eventhandler_find_list+0x4b/frame 0xfffffe00255a4330 kern_reboot() at kern_reboot+0x103/frame 0xfffffe00255a4380 vpanic() at vpanic+0x203/frame 0xfffffe00255a43e0 panic() at panic+0x43/frame 0xfffffe00255a4440 trap_fatal() at trap_fatal+0x35f/frame 0xfffffe00255a4490 trap() at trap+0x6d/frame 0xfffffe00255a45a0 calltrap() at calltrap+0x8/frame 0xfffffe00255a45a0 --- trap 0x9, rip = 0xffffffff8057e206, rsp = 0xfffffe00255a4670, rbp = 0xfffffe00255a4670 --- strcmp() at strcmp+0x6/frame 0xfffffe00255a4670 eventhandler_find_list() at eventhandler_find_list+0x4b/frame 0xfffffe00255a46a0 kern_reboot() at kern_reboot+0x103/frame 0xfffffe00255a46f0 vpanic() at vpanic+0x203/frame 0xfffffe00255a4750 panic() at panic+0x43/frame 0xfffffe00255a47b0 trap_fatal() at trap_fatal+0x35f/frame 0xfffffe00255a4800 trap() at trap+0x6d/frame 0xfffffe00255a4910 calltrap() at calltrap+0x8/frame 0xfffffe00255a4910 --- trap 0x9, rip = 0xffffffff8057e206, rsp = 0xfffffe00255a49e0, rbp = 0xfffffe00255a49e0 --- strcmp() at strcmp+0x6/frame 0xfffffe00255a49e0 eventhandler_find_list() at eventhandler_find_list+0x4b/frame 0xfffffe00255a4a10 kern_reboot() at kern_reboot+0x103/frame 0xfffffe00255a4a60 vpanic() at vpanic+0x203/frame 0xfffffe00255a4ac0 panic() at panic+0x43/frame 0xfffffe00255a4b20 trap_fatal() at trap_fatal+0x35f/frame 0xfffffe00255a4b70 trap() at trap+0x6d/frame 0xfffffe00255a4c80 calltrap() at calltrap+0x8/frame 0xfffffe00255a4c80 --- trap 0x9, rip = 0xffffffff8057e206, rsp = 0xfffffe00255a4d50, rbp = 0xfffffe00255a4d50 --- strcmp() at strcmp+0x6/frame 0xfffffe00255a4d50 eventhandler_find_list() at eventhandler_find_list+0x4b/frame 0xfffffe00255a4d80 kern_reboot() at kern_reboot+0x103/frame 0xfffffe00255a4dd0 vpanic() at vpanic+0x203/frame 0xfffffe00255a4e30 
panic() at panic+0x43/frame 0xfffffe00255a4e90 trap_fatal() at trap_fatal+0x35f/frame 0xfffffe00255a4ee0 trap() at trap+0x6d/frame 0xfffffe00255a4ff0 calltrap() at calltrap+0x8/frame 0xfffffe00255a4ff0 --- trap 0x9, rip = 0xffffffff8057e206, rsp = 0xfffffe00255a50c0, rbp = 0xfffffe00255a50c0 --- strcmp() at strcmp+0x6/frame 0xfffffe00255a50c0 eventhandler_find_list() at eventhandler_find_list+0x4b/frame 0xfffffe00255a50f0 kern_reboot() at kern_reboot+0x103/frame 0xfffffe00255a5140 vpanic() at vpanic+0x203/frame 0xfffffe00255a51a0 panic() at panic+0x43/frame 0xfffffe00255a5200 trap_fatal() at trap_fatal+0x35f/frame 0xfffffe00255a5250 trap() at trap+0x6d/frame 0xfffffe00255a5360 calltrap() at calltrap+0x8/frame 0xfffffe00255a5360 --- trap 0x9, rip = 0xffffffff8057e206, rsp = 0xfffffe00255a5430, rbp = 0xfffffe00255a5430 --- strcmp() at strcmp+0x6/frame 0xfffffe00255a5430 eventhandler_find_list() at eventhandler_find_list+0x4b/frame 0xfffffe00255a5460 kern_reboot() at kern_reboot+0x103/frame 0xfffffe00255a54b0 vpanic() at vpanic+0x203/frame 0xfffffe00255a5510 panic() at panic+0x43/frame 0xfffffe00255a5570 trap_fatal() at trap_fatal+0x35f/frame 0xfffffe00255a55c0 trap() at trap+0x6d/frame 0xfffffe00255a56d0 calltrap() at calltrap+0x8/frame 0xfffffe00255a56d0 --- trap 0x9, rip = 0xffffffff8057e206, rsp = 0xfffffe00255a57a0, rbp = 0xfffffe00255a57a0 --- strcmp() at strcmp+0x6/frame 0xfffffe00255a57a0 eventhandler_find_list() at eventhandler_find_list+0x4b/frame 0xfffffe00255a57d0 kern_reboot() at kern_reboot+0x103/frame 0xfffffe00255a5820 vpanic() at vpanic+0x203/frame 0xfffffe00255a5880 panic() at panic+0x43/frame 0xfffffe00255a58e0 trap_fatal() at trap_fatal+0x35f/frame 0xfffffe00255a5930 trap() at trap+0x6d/frame 0xfffffe00255a5a40 calltrap() at calltrap+0x8/frame 0xfffffe00255a5a40 --- trap 0x9, rip = 0xffffffff8057e206, rsp = 0xfffffe00255a5b10, rbp = 0xfffffe00255a5b10 --- strcmp() at strcmp+0x6/frame 0xfffffe00255a5b10 eventhandler_find_list() at 
eventhandler_find_list+0x4b/frame 0xfffffe00255a5b40 kern_reboot() at kern_reboot+0x103/frame 0xfffffe00255a5b90 vpanic() at vpanic+0x203/frame 0xfffffe00255a5bf0 panic() at panic+0x43/frame 0xfffffe00255a5c50 trap_fatal() at trap_fatal+0x35f/frame 0xfffffe00255a5ca0 trap() at trap+0x6d/frame 0xfffffe00255a5db0 calltrap() at calltrap+0x8/frame 0xfffffe00255a5db0 --- trap 0x9, rip = 0xffffffff8057e206, rsp = 0xfffffe00255a5e80, rbp = 0xfffffe00255a5e80 --- strcmp() at strcmp+0x6/frame 0xfffffe00255a5e80 eventhandler_find_list() at eventhandler_find_list+0x4b/frame 0xfffffe00255a5eb0 kern_reboot() at kern_reboot+0x103/frame 0xfffffe00255a5f00 vpanic() at vpanic+0x203/frame 0xfffffe00255a5f60 panic() at panic+0x43/frame 0xfffffe00255a5fc0 trap_fatal() at trap_fatal+0x35f/frame 0xfffffe00255a6010 trap() at trap+0x6d/frame 0xfffffe00255a6120 calltrap() at calltrap+0x8/frame 0xfffffe00255a6120 --- trap 0x9, rip = 0xffffffff8048600b, rsp = 0xfffffe00255a61f0, rbp = 0xfffffe00255a6230 --- intr_event_handle() at intr_event_handle+0xbb/frame 0xfffffe00255a6230 intr_execute_handlers() at intr_execute_handlers+0x58/frame 0xfffffe00255a6260 lapic_handle_intr() at lapic_handle_intr+0x44/frame 0xfffffe00255a6280 Xapic_isr1() at Xapic_isr1+0xd9/frame 0xfffffe00255a6280 --- interrupt, rip = 0xffffffff804f8b65, rsp = 0xfffffe00255a6350, rbp = 0xfffffe00255a6350 --- lock_delay() at lock_delay+0x35/frame 0xfffffe00255a6350 _mtx_lock_spin_cookie() at _mtx_lock_spin_cookie+0xb1/frame 0xfffffe00255a63b0 cnputs() at cnputs+0xb8/frame 0xfffffe00255a63d0 putchar() at putchar+0x14e/frame 0xfffffe00255a6450 kvprintf() at kvprintf+0x106/frame 0xfffffe00255a6570 vprintf() at vprintf+0x84/frame 0xfffffe00255a6640 printf() at printf+0x43/frame 0xfffffe00255a66a0 trap_fatal() at trap_fatal+0x1a0/frame 0xfffffe00255a66f0 trap() at trap+0x6d/frame 0xfffffe00255a6800 calltrap() at calltrap+0x8/frame 0xfffffe00255a6800 --- trap 0x9, rip = 0xffffffff805a3699, rsp = 0xfffffe00255a68d0, rbp = 
0xfffffe00255a6920 --- vlan_input() at vlan_input+0x199/frame 0xfffffe00255a6920 ether_demux() at ether_demux+0x129/frame 0xfffffe00255a6950 ether_nh_input() at ether_nh_input+0x30c/frame 0xfffffe00255a69a0 netisr_dispatch_src() at netisr_dispatch_src+0xa1/frame 0xfffffe00255a6a00 ether_input() at ether_input+0x26/frame 0xfffffe00255a6a20 iflib_rxeof() at iflib_rxeof+0x880/frame 0xfffffe00255a6b00 _task_fn_rx() at _task_fn_rx+0x40/frame 0xfffffe00255a6b30 gtaskqueue_run_locked() at gtaskqueue_run_locked+0xe3/frame 0xfffffe00255a6b80 gtaskqueue_thread_loop() at gtaskqueue_thread_loop+0x88/frame 0xfffffe00255a6bb0 fork_exit() at fork_exit+0x76/frame 0xfffffe00255a6bf0 fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00255a6bf0
I've removed KDB_UNATTENDED from the kernel and got a dump! It looks like kernel memory is a complete mess at this point, as it panics on all 4 cores!

Unread portion of the kernel message buffer:
kernel trap 9 with interrupts disabled
kernel trap 9 with interrupts disabled
kernel trap 9 with interrupts disabled
Fatal trap 9: general protection fault while in kernel mode
cpuid = 0;
Fatal trap 12: page fault while in kernel mode
Fatal trap 9: general protection fault while in kernel mode
cpuid = 1; apic id = 02
apic id = 00
instruction pointer = 0x20:0xffffffff8048600b
Fatal trap 9: general protection fault while in kernel mode
cpuid = 2; apic id = 04
cpuid = 3; apic id = 06
instruction pointer = 0x20:0xffffffff807864a2
stack pointer = 0x28:0xfffffe00255ab800
fault virtual address = 0xfffff800c1401524
stack pointer = 0x28:0xfffffe00004511b0
fault code = supervisor write data, page not present
instruction pointer = 0x20:0xffffffff80338877
stack pointer = 0x28:0xfffffe00255a1ad0
kernel trap 9 with interrupts disabled
frame pointer = 0x28:0xfffffe00004511f0
Fatal trap 9: general protection fault while in kernel mode
cpuid = 1; apic id = 02
instruction pointer = 0x20:0xffffffff8048600b
code segment = base rx0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
frame pointer = 0x28:0xfffffe00255ab820
code segment = base rx0, limit 0xfffff, type 0x1b
instruction pointer = 0x20:0xffffffff807864a2
= DPL 0, pres 1, long 1, def32 0, gran 1
stack pointer = 0x28:0xfffffe00255a1390
frame pointer = 0x28:0xfffffe00255a13d0
code segment = base rx0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = resume, IOPL = 0
current process = 0 (if_io_tqg_3)

#0 doadump (textdump=0) at pcpu.h:230
230 pcpu.h: No such file or directory.
in pcpu.h
(kgdb) #0 doadump (textdump=0) at pcpu.h:230
#1 0xffffffff8031436b in db_dump (dummy=<value optimized out>, dummy2=<value optimized out>, dummy3=<value optimized out>, dummy4=<value optimized out>) at /data/src/sys/ddb/db_command.c:574
#2 0xffffffff80314139 in db_command (cmd_table=<value optimized out>) at /data/src/sys/ddb/db_command.c:481
#3 0xffffffff80313eb4 in db_command_loop () at /data/src/sys/ddb/db_command.c:534
#4 0xffffffff8031715f in db_trap (type=<value optimized out>, code=<value optimized out>) at /data/src/sys/ddb/db_main.c:252
#5 0xffffffff804f7e33 in kdb_trap (type=9, code=0, tf=<value optimized out>) at /data/src/sys/kern/subr_kdb.c:693
#6 0xffffffff8073e561 in trap_fatal (frame=0xfffffe00255ab740, eva=0) at /data/src/sys/amd64/amd64/trap.c:921
#7 0xffffffff8073db0d in trap (frame=0xfffffe00255ab740) at counter.h:87
#8 0xffffffff8071db37 in calltrap () at /data/src/sys/amd64/amd64/exception.S:232
#9 0xffffffff807864a2 in intr_execute_handlers (isrc=0xfffff80002429180, frame=0xfffffe00255ab850) at /data/src/sys/x86/x86/intr_machdep.c:341
#10 0xffffffff8078c154 in lapic_handle_intr (vector=<value optimized out>, frame=<value optimized out>) at /data/src/sys/x86/x86/local_apic.c:1293
#11 0xffffffff8071ecc9 in Xapic_isr1 () at apic_vector.S:118
#12 0xffffffff806e9db0 in uma_zalloc_arg (zone=<value optimized out>, udata=0x20, flags=-512) at /data/src/sys/vm/uma_core.c:2571
#13 0xffffffff805b021d in _iflib_fl_refill (ctx=0xfffff8000241dc00, fl=0xfffff8000241e000, count=<value optimized out>) at mbuf.h:790
#14 0xffffffff805afcc8 in iflib_rxeof (rxq=<value optimized out>, budget=<value optimized out>) at /data/src/sys/net/iflib.c:2072
#15 0xffffffff805ac250 in _task_fn_rx (context=0xfffff80002424840) at /data/src/sys/net/iflib.c:3820
#16 0xffffffff804f6073 in gtaskqueue_run_locked (queue=0xfffff8000222b500) at /data/src/sys/kern/subr_gtaskqueue.c:332
#17 0xffffffff804f5df8 in gtaskqueue_thread_loop (arg=<value optimized out>) at /data/src/sys/kern/subr_gtaskqueue.c:507
#18 0xffffffff80483716 in fork_exit (callout=0xffffffff804f5d70 <gtaskqueue_thread_loop>, arg=0xfffffe0000221050, frame=0xfffffe00255abc00) at /data/src/sys/kern/kern_fork.c:1057
#19 0xffffffff8071eb1e in fork_trampoline () at /data/src/sys/amd64/amd64/exception.S:993
#20 0x0000000000000000 in ?? ()
Current language: auto; currently minimal
(kgdb)
And igb + !AESNI + INVARIANTS: looks very similar to em!

panic: Assertion (staterr & E1000_RXD_STAT_DD) != 0 failed at /data/src/sys/dev/e1000/igb_txrx.c:451
cpuid = 2
time = 1539129723
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00255a68f0
vpanic() at vpanic+0x1a3/frame 0xfffffe00255a6950
panic() at panic+0x43/frame 0xfffffe00255a69b0
igb_isc_rxd_pkt_get() at igb_isc_rxd_pkt_get+0x264/frame 0xfffffe00255a6a10
iflib_rxeof() at iflib_rxeof+0x128/frame 0xfffffe00255a6b00
_task_fn_rx() at _task_fn_rx+0x49/frame 0xfffffe00255a6b30
gtaskqueue_run_locked() at gtaskqueue_run_locked+0xf9/frame 0xfffffe00255a6b80
gtaskqueue_thread_loop() at gtaskqueue_thread_loop+0x88/frame 0xfffffe00255a6bb0
fork_exit() at fork_exit+0x84/frame 0xfffffe00255a6bf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00255a6bf0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
KDB: enter: panic

#0 doadump (textdump=0) at pcpu.h:230
230 pcpu.h: No such file or directory.
in pcpu.h
(kgdb) #0 doadump (textdump=0) at pcpu.h:230
#1 0xffffffff8031544b in db_dump (dummy=<value optimized out>, dummy2=<value optimized out>, dummy3=<value optimized out>, dummy4=<value optimized out>) at /data/src/sys/ddb/db_command.c:574
#2 0xffffffff80315219 in db_command (cmd_table=<value optimized out>) at /data/src/sys/ddb/db_command.c:481
#3 0xffffffff80314f94 in db_command_loop () at /data/src/sys/ddb/db_command.c:534
#4 0xffffffff803181af in db_trap (type=<value optimized out>, code=<value optimized out>) at /data/src/sys/ddb/db_main.c:252
#5 0xffffffff804fe7e3 in kdb_trap (type=3, code=0, tf=<value optimized out>) at /data/src/sys/kern/subr_kdb.c:693
#6 0xffffffff8075b922 in trap (frame=0xfffffe00255a6820) at /data/src/sys/amd64/amd64/trap.c:619
#7 0xffffffff80739cc7 in calltrap () at /data/src/sys/amd64/amd64/exception.S:232
#8 0xffffffff804fdeab in kdb_enter (why=0xffffffff80821b4a "panic", msg=<value optimized out>) at cpufunc.h:65
#9 0xffffffff804bc9c0 in vpanic (fmt=<value optimized out>, ap=0xfffffe00255a6990) at /data/src/sys/kern/kern_shutdown.c:861
#10 0xffffffff804bc763 in panic (fmt=<value optimized out>) at /data/src/sys/kern/kern_shutdown.c:799
#11 0xffffffff8033db94 in igb_isc_rxd_pkt_get (arg=<value optimized out>, ri=<value optimized out>) at /data/src/sys/dev/e1000/igb_txrx.c:451
#12 0xffffffff805c0448 in iflib_rxeof (rxq=0xfffff8000245b580, budget=<value optimized out>) at /data/src/sys/net/iflib.c:2684
#13 0xffffffff805bc8b9 in _task_fn_rx (context=0xfffff8000245b580) at /data/src/sys/net/iflib.c:3820
#14 0xffffffff804fc959 in gtaskqueue_run_locked (queue=0xfffff8000222d600) at /data/src/sys/kern/subr_gtaskqueue.c:332
#15 0xffffffff804fc718 in gtaskqueue_thread_loop (arg=<value optimized out>) at /data/src/sys/kern/subr_gtaskqueue.c:507
#16 0xffffffff80486134 in fork_exit (callout=0xffffffff804fc690 <gtaskqueue_thread_loop>, arg=0xfffffe0000221038, frame=0xfffffe00255a6c00) at /data/src/sys/kern/kern_fork.c:1057
#17 0xffffffff8073acae in fork_trampoline () at /data/src/sys/amd64/amd64/exception.S:993
#18 0x0000000000000000 in ?? ()
OK, I have new data. Soft crypto or IPsec is only a symptom, not the cause. The cause is the igb/em driver (different files, logically the same place). I can reproduce the driver KASSERT on a kernel with INVARIANTS without any crypto at all.

The conditions are: low-power hardware, high load, receiving data as fast as possible.

On Celeron J3160 + igb(8), triggering the bug requires loading the system with IPsec with soft crypto. I was not able to trigger it without crypto or with AESNI.

On Atom D2500 + em(8), it requires either soft crypto (easy!) or a multitude of plain connections without crypto. For example, 32 iperf3 streams for 2+ minutes is enough. With IPsec, it triggers with 1 stream within 5 seconds.

So, I can reproduce this on Atom D2500 + em(8) with a simple "iperf3 -c <server> -R -t 3600 --nstreams 32".

Without INVARIANTS, it is very hard to catch this bug without IPsec. I think this is because the memory corruption is hard to notice without additional traffic processing. I think IPsec is only a way to discover that memory is corrupted, not a way to corrupt memory.

Here is the stack trace with INVARIANTS and without any crypto. It is virtually the same as with crypto.

As usual, I can provide the kernel file and a full crash dump, and can re-run tests with any patches and settings.

I'm sure now that it is a bug in the Intel driver. A race condition, maybe?
panic: Assertion (staterr & E1000_RXD_STAT_DD) != 0 failed at /data/src/sys/dev/e1000/em_txrx.c:698
cpuid = 1
time = 1539169364
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe000043f900
vpanic() at vpanic+0x1a3/frame 0xfffffe000043f960
panic() at panic+0x43/frame 0xfffffe000043f9c0
em_isc_rxd_pkt_get() at em_isc_rxd_pkt_get+0x1d4/frame 0xfffffe000043fa10
iflib_rxeof() at iflib_rxeof+0x128/frame 0xfffffe000043fb00
_task_fn_rx() at _task_fn_rx+0x49/frame 0xfffffe000043fb30
gtaskqueue_run_locked() at gtaskqueue_run_locked+0xf9/frame 0xfffffe000043fb80
gtaskqueue_thread_loop() at gtaskqueue_thread_loop+0x88/frame 0xfffffe000043fbb0
fork_exit() at fork_exit+0x84/frame 0xfffffe000043fbf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe000043fbf0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
Uptime: 17m24s
Dumping 477 out of 4060 MB:..4%..11%..21%..31%..41%..51%..61%..71%..81%..91%

#0 doadump (textdump=1) at pcpu.h:230
230 pcpu.h: No such file or directory.
in pcpu.h
(kgdb) #0 doadump (textdump=1) at pcpu.h:230
#1 0xffffffff80565c60 in kern_reboot (howto=260) at /data/src/sys/kern/kern_shutdown.c:446
#2 0xffffffff805660b3 in vpanic (fmt=<value optimized out>, ap=<value optimized out>) at /data/src/sys/kern/kern_shutdown.c:872
#3 0xffffffff80565e13 in panic (fmt=<value optimized out>) at /data/src/sys/kern/kern_shutdown.c:799
#4 0xffffffff803f1d94 in em_isc_rxd_pkt_get (arg=<value optimized out>, ri=<value optimized out>) at /data/src/sys/dev/e1000/em_txrx.c:698
#5 0xffffffff80668b28 in iflib_rxeof (rxq=0xfffff80002295ac0, budget=<value optimized out>) at /data/src/sys/net/iflib.c:2684
#6 0xffffffff80664f69 in _task_fn_rx (context=0xfffff80002295ac0) at /data/src/sys/net/iflib.c:3820
#7 0xffffffff805a6039 in gtaskqueue_run_locked (queue=0xfffff800021dc500) at /data/src/sys/kern/subr_gtaskqueue.c:332
#8 0xffffffff805a5df8 in gtaskqueue_thread_loop (arg=<value optimized out>) at /data/src/sys/kern/subr_gtaskqueue.c:507
#9 0xffffffff8052f7e4 in fork_exit (callout=0xffffffff805a5d70 <gtaskqueue_thread_loop>, arg=0xfffffe00017f8020, frame=0xfffffe000043fc00) at /data/src/sys/kern/kern_fork.c:1057
#10 0xffffffff8081ce2e in fork_trampoline () at /data/src/sys/amd64/amd64/exception.S:993
#11 0x0000000000000000 in ?? ()
Current language: auto; currently minimal
(kgdb)
(In reply to Lev A. Serebryakov from comment #34) Correction: the command is "iperf3 -c <server> -R -t 3600 -P 32". It should be "-P 32", not "--nstreams 32", since this is TCP, not SCTP.
Maybe you're encountering something similar to what was fixed here? https://github.com/freebsd/freebsd/commit/e2a6991d7175b5ba9b6832b1d8770e58fa57e998 That was causing us to hit the MPASS() in rxd_pkt_get in ixl(4).
(In reply to Eric Joyner from comment #36) Maybe. I'm using mtu 9000 in my tests… I could try to reproduce it with standard mtu (1500).
(In reply to Eric Joyner from comment #36) I cannot reproduce it with mtu=1500 on both ends in 20 minutes of testing. I'm now trying to comment out the "budget == 1" case for em(8).
(In reply to Eric Joyner from comment #36) Nope, commenting out the "budget == 1" section in em_txrx.c (lines 556-560) doesn't help. The same assertion was triggered.
One additional data point: with mtu=1500 on both ends, everything works for tens of minutes, but the sending side (11.2-STABLE based) shows bursts of "resends", which do not occur with mtu=9000 before the crash.
(In reply to Lev A. Serebryakov from comment #39) OOPS! Looks like I patched only the "lem" function, not the "em" one! Let's try patching "em" too...
(In reply to Eric Joyner from comment #36) Looks like it helps. Simple traffic with INVARIANTS works; now I'm testing the IPsec configuration.
(In reply to Eric Joyner from comment #36) Yes! It helps em0 pass all my torture tests (when I comment out this "optimization" twice, for lem and em). I cannot test on igb right now, but I believe it will help there too. Please commit this fix :-)
BTW, if_ix contains the SAME problem. I cannot reproduce it (I have one ix link), but the code is the same, and I'm sure it has the same problem.
I'll work on porting that change (and maybe another) to em/igb/ix.
A commit references this bug:

Author: erj
Date: Sun Oct 14 05:09:44 UTC 2018
New revision: 339354
URL: https://svnweb.freebsd.org/changeset/base/339354

Log:
  em/igb/ix(4): Port two Tx/Rx fixes made to ixl in r339338

  - Fix assert/panic on receive when Jumbo Frames are enabled.

  From the commit I made to ixl:
  "It turns out that *_isc_rxd_available is supposed to return how many packets are available to be cleaned on the rx ring. This patch removes a section of code where if the budget argument is 1, the function would return one if there was a descriptor available, not necessarily a packet. This is okay in regular mtu 1500 traffic since the max frame size is less than the configured receive buffer size (2048), but this doesn't work when received packets can span more than one descriptor, as is the case when the mtu is 9000 and the receive buffer size is 4096."

  - Fix possible Tx hang because *_isc_txd_credits_update returns incorrect result

  From the commit by Krzysztof Galazka to ixl:
  "Function isc_txd_update_credits called with clear set to false should return 1 if there are TX descriptors already handled by HW. It was always returning 0 causing troubles with UDP TX traffic."

  PR: 231659
  Reported by: lev@
  Approved by: re (gjb@)
  Sponsored by: Intel Corporation

Changes:
  head/sys/dev/e1000/em_txrx.c
  head/sys/dev/e1000/igb_txrx.c
  head/sys/dev/ixgbe/ix_txrx.c
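
To illustrate the receive-side fix described in the commit log, here is a minimal standalone model in C. It is not the driver code: the ring structure, the sk_* names and the status bit values are all hypothetical, chosen only to mimic "descriptor done" (DD) and "end of packet" (EOP) flags. It contrasts a "budget == 1" shortcut that reports one available item as soon as a single descriptor is done with a version that counts only complete packets, which is what the commit log says *_isc_rxd_available is supposed to return.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical status bits modelling "descriptor done" and "end of packet". */
#define SK_STAT_DD  0x01u
#define SK_STAT_EOP 0x02u

/* Hypothetical receive ring: one status byte per descriptor. */
struct sk_rx_ring {
	const uint8_t *status;
	size_t ndesc;
};

/*
 * Count complete packets starting at descriptor idx, up to budget.
 * A packet is counted only when every descriptor up to and including
 * its EOP descriptor has the DD bit set.
 */
static int
sk_rxd_available(const struct sk_rx_ring *r, size_t idx, int budget)
{
	int cnt = 0;
	size_t i = idx;
	size_t scanned;

	assert(budget > 0);
	for (scanned = 0; scanned < r->ndesc && cnt < budget; scanned++) {
		uint8_t st = r->status[i];

		if ((st & SK_STAT_DD) == 0)
			break;		/* hardware has not written this one yet */
		if (st & SK_STAT_EOP)
			cnt++;		/* last descriptor of a packet */
		if (++i == r->ndesc)
			i = 0;		/* wrap around the ring */
	}
	return (cnt);
}

/*
 * Model of the removed shortcut: with budget == 1 it reports 1 as soon
 * as the first descriptor is done, even if the packet spans more
 * descriptors that are not done yet.
 */
static int
sk_rxd_available_shortcut(const struct sk_rx_ring *r, size_t idx, int budget)
{
	assert(budget > 0);
	if (budget == 1)
		return ((r->status[idx] & SK_STAT_DD) ? 1 : 0);
	return (sk_rxd_available(r, idx, budget));
}
```

In this model, a jumbo frame whose first descriptor is done but whose remaining descriptors are not makes the shortcut report one packet while the counting version reports zero. A consumer that then walks the frame descriptor by descriptor would hit a descriptor without DD, which is consistent with the (staterr & E1000_RXD_STAT_DD) != 0 assertion seen in the traces, assuming the driver's packet-get routine checks DD on each descriptor.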
(In reply to Lev A. Serebryakov from comment #44) Lev, Did my commit fix your issue? Every Intel driver should be fixed now.
(In reply to Eric Joyner from comment #47) Yep, I cannot reproduce the crash anymore. Thank you! (I'm not sure who should close the ticket, me or you.)
(In reply to Lev A. Serebryakov from comment #48) I'll close it.