231659 – [em][igb] 12-ALPHA8 r339259 crashes on receive under load

Status:	Closed FIXED

Alias:	None

Product:	Base System
Classification:	Unclassified
Component:	kern
Version:	CURRENT
Hardware:	amd64 Any

Importance:	--- Affects Only Me
Assignee:	Eric Joyner

Keywords:	IntelNetworking, crash, regression

Reported:	2018-09-24 12:17 UTC by Lev A. Serebryakov
Modified:	2019-09-26 14:33 UTC
CC List:	13 users

Description Lev A. Serebryakov freebsd_committer

2018-09-24 12:17:11 UTC

I'm running network benchmarks with IPsec in transport mode. It is a simple "iperf3" run after setting up IPsec with "setkey".

The network adapter is igb (I210).

Here is the backtrace:

Fatal trap 9: general protection fault while in kernel mode
cpuid = 3; apic id = 06
instruction pointer     = 0x20:0xffffffff8077ae62
stack pointer           = 0x28:0xfffffe00004de280
frame pointer           = 0x28:0xfffffe00004de2a0
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = resume, IOPL = 0
current process         = 0 (if_io_tqg_3)
[ thread pid 0 tid 100068 ]
Stopped at      intr_execute_handlers+0x12:     addq    $0x1,(%rax)
db> bt
Tracing pid 0 tid 100068 td 0xfffff80002970000
intr_execute_handlers() at intr_execute_handlers+0x12/frame 0xfffffe00004de2a0
lapic_handle_intr() at lapic_handle_intr+0x44/frame 0xfffffe00004de2c0
Xapic_isr1() at Xapic_isr1+0xd9/frame 0xfffffe00004de2c0
--- interrupt, rip = 0xffffffff80721d1a, rsp = 0xfffffe00004de390, rbp = 0xfffffe00004de3a0 ---
spinlock_exit() at spinlock_exit+0x3a/frame 0xfffffe00004de3a0
putchar() at putchar+0x14e/frame 0xfffffe00004de420
kvprintf() at kvprintf+0x106/frame 0xfffffe00004de540
vprintf() at vprintf+0x84/frame 0xfffffe00004de610
printf() at printf+0x43/frame 0xfffffe00004de670
trap_fatal() at trap_fatal+0x9d/frame 0xfffffe00004de6c0
trap() at trap+0x6d/frame 0xfffffe00004de7d0
calltrap() at calltrap+0x8/frame 0xfffffe00004de7d0
--- trap 0x9, rip = 0xffffffff804b13e9, rsp = 0xfffffe00004de8a0, rbp = 0xfffffe00004de8f0 ---
__rw_rlock_hard() at __rw_rlock_hard+0xb9/frame 0xfffffe00004de8f0
bpf_mtap() at bpf_mtap+0x46/frame 0xfffffe00004de970
ether_nh_input() at ether_nh_input+0xca/frame 0xfffffe00004de9c0
netisr_dispatch_src() at netisr_dispatch_src+0xa1/frame 0xfffffe00004dea20
ether_input() at ether_input+0x26/frame 0xfffffe00004dea40
_task_fn_rx() at _task_fn_rx+0x7ea/frame 0xfffffe00004deb30
gtaskqueue_run_locked() at gtaskqueue_run_locked+0xe3/frame 0xfffffe00004deb80
gtaskqueue_thread_loop() at gtaskqueue_thread_loop+0x88/frame 0xfffffe00004debb0
fork_exit() at fork_exit+0x76/frame 0xfffffe00004debf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00004debf0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---

Comment 1 Lev A. Serebryakov freebsd_committer

2018-09-25 11:31:29 UTC

Steps to reproduce for me:

(1) Two hosts:
  192.168.134.1, 12-ALPHA7, slow, without AES-NI
  192.168.134.2, 11-STABLE, fast, with AES-NI

(2) Setup IPsec transport for TCP port 5201 (iperf3 part):
 (a) on 192.168.134.2
setkey -c<<__END
flush;
spdflush;
add 192.168.134.1 192.168.134.2 esp 0x10001 -E rijndael-cbc "0123456789abcdef";
add 192.168.134.2 192.168.134.1 esp 0x10002 -E rijndael-cbc "0123456789abcdef";
spdadd 192.168.134.2/32[5201] 192.168.134.1/32 tcp -P out ipsec esp/transport//require;
spdadd 192.168.134.1/32 192.168.134.2/32[5201] tcp -P in  ipsec esp/transport//require;
__END
 (b) on 192.168.134.1
setkey -c <<__END
flush;
spdflush;
add 192.168.134.1 192.168.134.2 esp 0x10001 -E rijndael-cbc "0123456789abcdef";
add 192.168.134.2 192.168.134.1 esp 0x10002 -E rijndael-cbc "0123456789abcdef";
spdadd 192.168.134.1/32 192.168.134.2/32[5201] tcp -P out ipsec esp/transport//require;
spdadd 192.168.134.2/32[5201] 192.168.134.1/32 tcp -P in  ipsec esp/transport//require;
__END

(3) run "iperf3 -s" on 192.168.134.2
(4) run "iperf3 -c 192.168.134.2 -R" on 192.168.134.1
(5) Almost instant crash on 192.168.134.1.

It looks like it has something to do with timing: in the same setup with the slow 192.168.134.1 replaced by a much faster, AES-NI-capable system (same FreeBSD version), the crash is much harder to reproduce. I got only one crash in 6 hours of testing with the fast system.
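
The two setkey inputs in step (2) share the same SAs and differ only in the direction of the two policies. A minimal sketch that prints either side from one template is shown below. The function name gen_setkey_conf and the "client"/"server" role names are hypothetical and not part of the original report; the addresses, SPIs, port and key are the ones from the steps above.

```sh
#!/bin/sh
# Hypothetical helper (not from the report): print setkey(8) input for one
# end of the transport-mode test. The SAs are identical on both hosts; only
# the "in"/"out" direction of the two policies is swapped.
# usage: gen_setkey_conf client|server
gen_setkey_conf() {
    role=$1
    client=192.168.134.1    # iperf3 client (the host that crashes)
    server=192.168.134.2    # iperf3 server, listening on TCP port 5201
    key='"0123456789abcdef"'

    case "$role" in
        client) to_server=out; from_server=in ;;
        server) to_server=in; from_server=out ;;
        *) echo "usage: gen_setkey_conf client|server" >&2; return 1 ;;
    esac

    # Unquoted heredoc: variables expand, brackets and quotes stay literal.
    cat <<EOF
flush;
spdflush;
add $client $server esp 0x10001 -E rijndael-cbc $key;
add $server $client esp 0x10002 -E rijndael-cbc $key;
spdadd $client/32 $server/32[5201] tcp -P $to_server ipsec esp/transport//require;
spdadd $server/32[5201] $client/32 tcp -P $from_server ipsec esp/transport//require;
EOF
}

# Print the configuration for the crashing host.
gen_setkey_conf client
```

On each host the output would be piped into "setkey -c", for example "gen_setkey_conf server | setkey -c" on 192.168.134.2.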

Comment 2 Lev A. Serebryakov freebsd_committer

2018-09-27 14:25:37 UTC

Stopping all bpf consumers (dhcp client, dhcp server) doesn't help.

Comment 3 Eugene Grosbein freebsd_committer

2018-09-27 15:17:44 UTC

Do you have kernel.debug and a crashdump handy to make a kgdb backtrace?

Comment 4 Lev A. Serebryakov freebsd_committer

2018-09-27 15:31:09 UTC

(In reply to Eugene Grosbein from comment #3)
Unfortunately, not right now. It is a NanoBSD installation on my home router, so it doesn't have permanent writable storage and it auto-reboots.

I could reproduce the crash at the weekend (Saturday/Sunday) with a serial console attached and kgdb & kernel.debug installed.

Comment 5 Mark Johnston freebsd_committer

2018-09-27 15:36:47 UTC

(In reply to Lev A. Serebryakov from comment #4)
If your kernel has "options NETDUMP" configured, you can try configuring your router to dump to a different host on the local network.  See netdump(4) and the netdumpd port.
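
As a rough sketch of that suggestion, a netdump client could be configured through rc.conf along the following lines. The choice of igb0 and of the two test addresses is an assumption for illustration (a host running netdumpd at 192.168.134.2 is not something this report confirms); check dumpon(8) and netdump(4) on the installed revision before relying on the flags.

```sh
# /etc/rc.conf fragment (sketch): netdump(4) client settings.
# Assumption: the router (192.168.134.1) dumps over igb0 to a host
# running netdumpd from ports at 192.168.134.2. Adjust to your network.
dumpdev="igb0"                                      # interface used for the dump
dumpon_flags="-c 192.168.134.1 -s 192.168.134.2"    # -c client address, -s server address
```

The same flags can be passed to dumpon by hand on a running system, with the interface name as the last argument.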

Comment 6 Lev A. Serebryakov freebsd_committer

2018-09-27 15:45:40 UTC

(In reply to Mark Johnston from comment #5)
Ooooh, thank you, I had missed this feature. I'll rebuild the kernel with it! Great!

Comment 7 Lev A. Serebryakov freebsd_committer

2018-09-27 21:43:39 UTC

Adding NETBOOT doesn't help. I've checked on the console, and the system simply reboots now, without a panic, crash report, debugger prompt or anything like that.

Ok. I'll add INVARIANTS and WITNESS to the kernel and try again.

Comment 8 Mark Johnston freebsd_committer

2018-09-27 21:49:33 UTC

(In reply to Lev A. Serebryakov from comment #7)
Do you mean NETDUMP?  The system crashes when you configure netdump on the router, or after?  You might try setting debug.debugger_on_panic=1 to see if anything useful is printed before the reboot.

Comment 9 Lev A. Serebryakov freebsd_committer

2018-09-28 10:37:50 UTC

(In reply to Mark Johnston from comment #8)
I mean that the new kernel with the NETDUMP option enabled silently reboots without a "panic" message, debugger prompt, crashdump or anything like that. It simply reboots at some point during testing, and that's all. So NETDUMP itself doesn't make the system crash: it works as usual until I start testing. But it doesn't panic under load anymore, it just reboots.

I'll try to add some cooling and build a kernel with all debugging options on Saturday.

Comment 10 Eugene Grosbein freebsd_committer

2018-09-28 13:29:21 UTC

(In reply to Lev A. Serebryakov from comment #9)

You can still add a USB stick and configure the kernel to use its /dev/da0s1b partition as the crashdump target. The kernel is somewhat picky when it decides whether it should write to a device, so make sure you correctly create and label a traditional "swap" partition.
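
A minimal rc.conf sketch of this setup might look as follows. It assumes the stick really appears as da0 with a swap-typed da0s1b partition, and the /var/crash directory is only an illustrative choice: on a NanoBSD image it would need to point at writable storage with enough room for the core.

```sh
# /etc/rc.conf fragment (sketch): dump to a swap partition on a USB stick.
# Assumption: the stick shows up as da0 with a swap-typed da0s1b partition,
# and /var/crash (hypothetical choice) has room for the saved core.
dumpdev="/dev/da0s1b"   # dump target configured by dumpon(8) at boot
dumpdir="/var/crash"    # where savecore(8) writes the dump after reboot
```

On a running system the same target can be set by hand with "dumpon /dev/da0s1b".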

Comment 11 Lev A. Serebryakov freebsd_committer

2018-09-28 13:36:33 UTC

(In reply to Eugene Grosbein from comment #10)
I understand. Yes, I'll try this tomorrow.

And I'll try to add a BIG FAN to this box (it is a passively cooled Mini-ITX box), as I'm starting to suspect that it may be overheating.

The output of "sysctl dev.cpu.0.temperature" doesn't show anything suspicious, values are around 51-55°C, but maybe something else overheats?

Comment 12 Lev A. Serebryakov freebsd_committer

2018-09-30 13:18:29 UTC

Ok. Now I have a kernel with INVARIANTS and WITNESS, and I have space for crashdumps.
I've got 3 crashdumps with exactly the same panic message:

Assertion (staterr & E1000_RXD_STAT_DD) != 0 failed at /data/src/sys/dev/e1000/em_txrx.c:698

Here is the stack trace:

#0  doadump (textdump=1) at pcpu.h:230
#1  0xffffffff80565a70 in kern_reboot (howto=260) at /data/src/sys/kern/kern_shutdown.c:446
#2  0xffffffff80565ec3 in vpanic (fmt=<value optimized out>, ap=<value optimized out>) at /data/src/sys/kern/kern_shutdown.c:872
#3  0xffffffff80565c23 in panic (fmt=<value optimized out>) at /data/src/sys/kern/kern_shutdown.c:799
#4  0xffffffff803f1de4 in em_isc_rxd_pkt_get (arg=<value optimized out>, ri=<value optimized out>) at /data/src/sys/dev/e1000/em_txrx.c:698
#5  0xffffffff806688d8 in iflib_rxeof (rxq=0xfffff80002295800, budget=<value optimized out>) at /data/src/sys/net/iflib.c:2684
#6  0xffffffff80664d19 in _task_fn_rx (context=0xfffff80002295800) at /data/src/sys/net/iflib.c:3820
#7  0xffffffff805a5e49 in gtaskqueue_run_locked (queue=0xfffff800021dc400) at /data/src/sys/kern/subr_gtaskqueue.c:332
#8  0xffffffff805a5c08 in gtaskqueue_thread_loop (arg=<value optimized out>) at /data/src/sys/kern/subr_gtaskqueue.c:507
#9  0xffffffff8052f5c4 in fork_exit (callout=0xffffffff805a5b80 <gtaskqueue_thread_loop>, arg=0xfffffe00017f8008, frame=0xfffffe000043ac00) at /data/src/sys/kern/kern_fork.c:1057
#10 0xffffffff8081cbfe in fork_trampoline () at /data/src/sys/amd64/amd64/exception.S:993
#11 0x0000000000000000 in ?? ()

Comment 13 Lev A. Serebryakov freebsd_committer

2018-09-30 14:03:00 UTC

Without debug options in the kernel the crashes are all different, but it is always a general protection fault. It looks like memory corruption.

Please note that without SAD/SPD entries everything works. And with a "null" SAD everything works for hours.
Even sending works with IPsec and aes-256-gcm/aes-256-cbc. Only the combination of IPsec with real encryption and receiving data leads to a crash.

Comment 14 Lev A. Serebryakov freebsd_committer

2018-09-30 14:03:32 UTC

One crash without debug options

Fatal trap 9: general protection fault while in kernel mode
cpuid = 1; apic id = 01
instruction pointer	= 0x20:0xffffffff806585ea
cpuid = 0; apic id = 00
instruction pointer	= 0x20:0xffffffff806dab77
stack pointer	        = 0x28:0xfffffe0025d10a10
frame pointer	        = 0x28:0xfffffe0025d10b40
stack pointer	        = 0x28:0xfffffe0000470a30
code segment		= base rx0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 11 (swi1: netisr 0)
trap number		= 9
panic: general protection fault

#0  doadump (textdump=1) at pcpu.h:230
#1  0xffffffff8056008b in kern_reboot (howto=260) at /data/src/sys/kern/kern_shutdown.c:446
#2  0xffffffff805604c3 in vpanic (fmt=<value optimized out>, ap=<value optimized out>) at /data/src/sys/kern/kern_shutdown.c:872
#3  0xffffffff805602b3 in panic (fmt=<value optimized out>) at /data/src/sys/kern/kern_shutdown.c:799
#4  0xffffffff8081fd2f in trap_fatal (frame=0xfffffe0025d10950, eva=0) at /data/src/sys/amd64/amd64/trap.c:929
#5  0xffffffff8081f22d in trap (frame=0xfffffe0025d10950) at counter.h:87
#6  0xffffffff807ff367 in calltrap () at /data/src/sys/amd64/amd64/exception.S:232
#7  0xffffffff806dab77 in ip_output (m=<value optimized out>, opt=<value optimized out>, ro=<value optimized out>, flags=<value optimized out>, imo=0x0, inp=0x0) at /data/src/sys/netinet/ip_output.c:659
#8  0xffffffff807462b8 in ipsec_process_done (m=0xfffff8009947b700, sp=0x0, sav=0xfffff800058ef200, idx=1) at /data/src/sys/netipsec/ipsec_output.c:796
#9  0xffffffff8075b55b in esp_output_cb (crp=0xfffff800b20bf000) at /data/src/sys/netipsec/xform_esp.c:951
#10 0xffffffff80784007 in swcr_process (dev=<value optimized out>, crp=<value optimized out>, hint=<value optimized out>) at /data/src/sys/opencrypto/cryptosoft.c:1222
#11 0xffffffff807804eb in crypto_dispatch (crp=0xfffff800b20bf000) at /data/src/sys/opencrypto/crypto.c:1001
#12 0xffffffff8075af80 in esp_output (m=0xfffff8009947b700, sp=0xfffff8000541eb00, sav=<value optimized out>, idx=0, skip=20, protoff=9) at /data/src/sys/netipsec/xform_esp.c:869
#13 0xffffffff8074581f in ipsec4_perform_request (m=<value optimized out>, sp=<value optimized out>, inp=0xfffff800058fc3d0, idx=0) at /data/src/sys/netipsec/ipsec_output.c:275
#14 0xffffffff80745916 in ipsec4_output (m=0xfffff8009947b700, inp=0xfffff800058fc3d0) at /data/src/sys/netipsec/ipsec_output.c:292
#15 0xffffffff806da49e in ip_output (m=<value optimized out>, opt=<value optimized out>, ro=<value optimized out>, flags=<value optimized out>, imo=0x0, inp=0xfffff800058fc3d0) at /data/src/sys/netinet/ip_output.c:549
#16 0xffffffff806e84b5 in tcp_output (tp=0xfffff80099d72000) at /data/src/sys/netinet/tcp_output.c:1409
#17 0xffffffff806e4bc3 in tcp_do_segment (m=0xfffff800b2473400, th=<value optimized out>, so=0xfffff80005e67000, tp=0xfffff80099d72000, drop_hdrlen=52, tlen=<value optimized out>, iptos=0 '\0') at atomic.h:221
#18 0xffffffff806e1a51 in tcp_input (mp=<value optimized out>, offp=<value optimized out>, proto=<value optimized out>) at /data/src/sys/netinet/tcp_input.c:1392
#19 0xffffffff806d2078 in ip_input (m=0x0) at /data/src/sys/netinet/ip_input.c:827
#20 0xffffffff8065c343 in swi_net (arg=<value optimized out>) at /data/src/sys/net/netisr.c:901
#21 0xffffffff8052ffc5 in ithread_loop (arg=<value optimized out>) at /data/src/sys/kern/kern_intr.c:1043
#22 0xffffffff8052d0a6 in fork_exit (callout=0xffffffff8052fe60 <ithread_loop>, arg=0xfffff8000204e540, frame=0xfffffe0025d11c00) at /data/src/sys/kern/kern_fork.c:1057
#23 0xffffffff8080034e in fork_trampoline () at /data/src/sys/amd64/amd64/exception.S:993

Comment 15 Lev A. Serebryakov freebsd_committer

2018-09-30 14:04:02 UTC

Another crash without debug options

Fatal trap 9: general protection fault while in kernel mode
cpuid = 0; cpuid = 1; apic id = 01
instruction pointer	= 0x20:0xffffffff806585ea
stack pointer	        = 0x28:0xfffffe0000470a30
frame pointer	        = 0x28:0xfffffe0000470b00
apic id = 00
instruction pointer	= 0x20:0xffffffff806dab77
code segment		= base rx0, limit 0xfffff, type 0x1b
stack pointer	        = 0x28:0xfffffe0025d10a10
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= frame pointer	        = 0x28:0xfffffe0025d10b40
interrupt enabled, resume, IOPL = 0
current process		= 0 (if_io_tqg_1)
trap number		= 9
panic: general protection fault
cpuid = 1

#0  doadump (textdump=1) at pcpu.h:230
#1  0xffffffff8056008b in kern_reboot (howto=260) at /data/src/sys/kern/kern_shutdown.c:446
#2  0xffffffff805604c3 in vpanic (fmt=<value optimized out>, ap=<value optimized out>) at /data/src/sys/kern/kern_shutdown.c:872
#3  0xffffffff805602b3 in panic (fmt=<value optimized out>) at /data/src/sys/kern/kern_shutdown.c:799
#4  0xffffffff8081fd2f in trap_fatal (frame=0xfffffe0000470970, eva=0) at /data/src/sys/amd64/amd64/trap.c:929
#5  0xffffffff8081f22d in trap (frame=0xfffffe0000470970) at counter.h:87
#6  0xffffffff807ff367 in calltrap () at /data/src/sys/amd64/amd64/exception.S:232
#7  0xffffffff806585ea in iflib_rxeof (rxq=<value optimized out>, budget=<value optimized out>) at /data/src/sys/net/iflib.c:2770
#8  0xffffffff80654970 in _task_fn_rx (context=0xfffff80002292ac0) at /data/src/sys/net/iflib.c:3820
#9  0xffffffff8059fa63 in gtaskqueue_run_locked (queue=0xfffff800021da500) at /data/src/sys/kern/subr_gtaskqueue.c:332
#10 0xffffffff8059f7e8 in gtaskqueue_thread_loop (arg=<value optimized out>) at /data/src/sys/kern/subr_gtaskqueue.c:507
#11 0xffffffff8052d0a6 in fork_exit (callout=0xffffffff8059f760 <gtaskqueue_thread_loop>, arg=0xfffffe00017fa020, frame=0xfffffe0000470c00) at /data/src/sys/kern/kern_fork.c:1057
#12 0xffffffff8080034e in fork_trampoline () at /data/src/sys/amd64/amd64/exception.S:993
#13 0x0000000000000000 in ?? ()

Comment 16 Lev A. Serebryakov freebsd_committer

2018-09-30 14:09:50 UTC

And a third non-debug crash

Fatal trap 9: general protection fault while in kernel mode
cpuid = 0; apic id = 00
instruction pointer	= 0x20:0xffffffff806dab77
stack pointer	        = 0x28:0xfffffe0025d10a10
frame pointer	        = 0x28:0xfffffe0025d10b40
cpuid = 1; apic id = 01
instruction pointer	= 0x20:0xffffffff806585ea
code segment		= base rx0, limit 0xfffff, type 0x1b
stack pointer	        = 0x28:0xfffffe0000470a30
			= DPL 0, pres 1, long 1, def32 0, gran 1
frame pointer	        = 0x28:0xfffffe0000470b00
processor eflags	= interrupt enabled, code segment		= base rx0, limit 0xfffff, type 0x1b
resume, IOPL = 0
current process		= 11 (swi1: netisr 0)
trap number		= 9
			= DPL 0, pres 1, long 1, def32 0, gran 1

#0  doadump (textdump=1) at pcpu.h:230
#1  0xffffffff8056008b in kern_reboot (howto=260) at /data/src/sys/kern/kern_shutdown.c:446
#2  0xffffffff805604c3 in vpanic (fmt=<value optimized out>, ap=<value optimized out>) at /data/src/sys/kern/kern_shutdown.c:872
#3  0xffffffff805602b3 in panic (fmt=<value optimized out>) at /data/src/sys/kern/kern_shutdown.c:799
#4  0xffffffff8081fd2f in trap_fatal (frame=0xfffffe0025d10950, eva=0) at /data/src/sys/amd64/amd64/trap.c:929
#5  0xffffffff8081f22d in trap (frame=0xfffffe0025d10950) at counter.h:87
#6  0xffffffff807ff367 in calltrap () at /data/src/sys/amd64/amd64/exception.S:232
#7  0xffffffff806dab77 in ip_output (m=<value optimized out>, opt=<value optimized out>, ro=<value optimized out>, flags=<value optimized out>, imo=0x0, inp=0x0) at /data/src/sys/netinet/ip_output.c:659
#8  0xffffffff807462b8 in ipsec_process_done (m=0xfffff80005782400, sp=0x0, sav=0xfffff80005402100, idx=1) at /data/src/sys/netipsec/ipsec_output.c:796
#9  0xffffffff8075b55b in esp_output_cb (crp=0xfffff8008c5e1080) at /data/src/sys/netipsec/xform_esp.c:951
#10 0xffffffff80784007 in swcr_process (dev=<value optimized out>, crp=<value optimized out>, hint=<value optimized out>) at /data/src/sys/opencrypto/cryptosoft.c:1222
#11 0xffffffff807804eb in crypto_dispatch (crp=0xfffff8008c5e1080) at /data/src/sys/opencrypto/crypto.c:1001
#12 0xffffffff8075af80 in esp_output (m=0xfffff80005782400, sp=0xfffff8008c371100, sav=<value optimized out>, idx=0, skip=20, protoff=9) at /data/src/sys/netipsec/xform_esp.c:869
#13 0xffffffff8074581f in ipsec4_perform_request (m=<value optimized out>, sp=<value optimized out>, inp=0xfffff80005a611e8, idx=0) at /data/src/sys/netipsec/ipsec_output.c:275
#14 0xffffffff80745916 in ipsec4_output (m=0xfffff80005782400, inp=0xfffff80005a611e8) at /data/src/sys/netipsec/ipsec_output.c:292
#15 0xffffffff806da49e in ip_output (m=<value optimized out>, opt=<value optimized out>, ro=<value optimized out>, flags=<value optimized out>, imo=0x0, inp=0xfffff80005a611e8) at /data/src/sys/netinet/ip_output.c:549
#16 0xffffffff806e84b5 in tcp_output (tp=0xfffff80005a63760) at /data/src/sys/netinet/tcp_output.c:1409
#17 0xffffffff806e4bc3 in tcp_do_segment (m=0xfffff80005da8700, th=<value optimized out>, so=0xfffff80005b5ea38, tp=0xfffff80005a63760, drop_hdrlen=52, tlen=<value optimized out>, iptos=0 '\0') at atomic.h:221
#18 0xffffffff806e1a51 in tcp_input (mp=<value optimized out>, offp=<value optimized out>, proto=<value optimized out>) at /data/src/sys/netinet/tcp_input.c:1392
#19 0xffffffff806d2078 in ip_input (m=0x0) at /data/src/sys/netinet/ip_input.c:827
#20 0xffffffff8065c343 in swi_net (arg=<value optimized out>) at /data/src/sys/net/netisr.c:901
#21 0xffffffff8052ffc5 in ithread_loop (arg=<value optimized out>) at /data/src/sys/kern/kern_intr.c:1043
#22 0xffffffff8052d0a6 in fork_exit (callout=0xffffffff8052fe60 <ithread_loop>, arg=0xfffff8000204e540, frame=0xfffffe0025d11c00) at /data/src/sys/kern/kern_fork.c:1057
#23 0xffffffff8080034e in fork_trampoline () at /data/src/sys/amd64/amd64/exception.S:993
#24 0x0000000000000000 in ?? ()

Comment 17 Lev A. Serebryakov freebsd_committer

2018-09-30 14:10:46 UTC

I have all the crashdumps saved, and I have the corresponding kernel.full saved too, so I can provide any additional information that can be extracted from them with "kgdb".

Comment 18 Lev A. Serebryakov freebsd_committer

2018-10-05 12:15:02 UTC

Do I need to provide additional information?

Comment 19 Conrad Meyer freebsd_committer

2018-10-09 04:55:42 UTC

Lev, what versions are you testing on?  Given the stacks in the last few comments, it appears you're using the cryptosoft driver.  If you're not already on r338953, please try updating to that revision and retesting.

Another thing you could do without rebuilding the kernel is load the 'aesni' driver (which probably provides better performance too).

Comment 20 Lev A. Serebryakov freebsd_committer

2018-10-09 12:19:18 UTC

(In reply to Conrad Meyer from comment #19)
It is r339021.

aesni is useless here, as this hardware doesn't have support for it.

Comment 21 Conrad Meyer freebsd_committer

2018-10-09 16:47:31 UTC

Ok, I think it's probably not an OCF bug then.  Plenty of room for a NIC or IPsec bug, though.

Comment 22 Lev A. Serebryakov freebsd_committer

2018-10-09 17:11:40 UTC

(In reply to Conrad Meyer from comment #21)
That gives me the idea to test on AES-NI-capable hardware with other NICs (igb instead of em) but without the aesni driver loaded, to force it to use soft crypto.

Also, I could test a VM installation with vtnet (and disable AES-NI for it).

Hardware with igb and AES-NI works without any problems.

Comment 23 Conrad Meyer freebsd_committer

2018-10-09 17:38:09 UTC

Ok, let me try and understand what has been tested.  Please correct me if I am mistaken:

- igb + ??? + bpf = crash (initial description)?
- igb + AESNI + no bpf = no crash
- em + !AESNI + no bpf = crash

Have you tried, or can you try:

- igb + !AESNI (unload aesni.ko module)
- em + AESNI (is it possible to move em NIC to the CPU that supports AESNI, or is it soldered to the board?)

Additionally, would it be possible for me to access your kernel binaries and core dump(s) from comments 14-16?

One other question -- what revision is the 11-STABLE machine on?

Thanks.

Comment 24 Conrad Meyer freebsd_committer

2018-10-09 17:38:50 UTC

CCing ae@ who is more familiar with ipsec than me :-).

Comment 25 Lev A. Serebryakov freebsd_committer

2018-10-09 17:51:32 UTC

(In reply to Conrad Meyer from comment #23)

Now I have:

igb + AESNI + bpf: one crash on an older revision, no crashes in several hours of testing on newer revisions. That is the very first stack trace, and I can not reproduce it anymore (with or without bpf). It looks like we can skip this one, as it is not reproducible.

em + !AESNI: crash with or without bpf. All other stack traces (with INVARIANTS and without) are for this configuration. It looks like bpf is not involved at all.

Unfortunately, I can not swap NICs or CPUs, as it is embedded-like hardware with everything soldered to the board.

I'll try igb + !AESNI tonight by unloading aesni.ko first.

I'll send you a URL for the kernels + dumps via e-mail, as I'm not sure they don't contain sensitive data (they should not, but still).

The 11-STABLE host which serves as the "other end" of this test setup is at r338960.

Comment 26 Lev A. Serebryakov freebsd_committer

2018-10-09 19:01:50 UTC

I can report that:

vtnet0 +  AESNI +  INVARIANTS: no crash.
vtnet0 + !AESNI +  INVARIANTS: no crash.
vtnet0 +  AESNI + !INVARIANTS: no crash.
vtnet0 + !AESNI + !INVARIANTS: no crash.

I still need to perform tests for igb + !AESNI.

Comment 27 Lev A. Serebryakov freebsd_committer

2018-10-09 19:56:24 UTC

igb0 + !AESNI + !INVARIANTS: crash.

But I can not provide dumps or stacks yet :-(

It looks like it is the combination of Intel NICs and soft crypto.

Comment 28 Lev A. Serebryakov freebsd_committer

2018-10-09 20:02:23 UTC

It is still an issue on ALPHA8, r339259.

Comment 29 Lev A. Serebryakov freebsd_committer

2018-10-09 23:09:23 UTC

netdump(4) via the same igb0 as the test itself leads to a loop of panics :-)

Comment 30 Mark Johnston freebsd_committer

2018-10-09 23:12:42 UTC

(In reply to Lev A. Serebryakov from comment #29)
Are you able to grab a backtrace or at least a panic message in this case?

netdump does have a lot of failure modes when used on a busy interface.  It works best when configured on a management interface.

Comment 31 Lev A. Serebryakov freebsd_committer

2018-10-09 23:36:50 UTC

Ok, the system with igb0 CAN NOT write crash dumps in automatic mode at all, even if it is a local dump. It goes into a panic loop and prints this on the console again and again, forever:

Fatal trap 9: general protection fault while in kernel mode
cpuid = 2; apic id = 04
instruction pointer     = 0x20:0xffffffff8057e206
stack pointer           = 0x28:0xfffffe00255a3c20
frame pointer           = 0x28:0xfffffe00255a3c20
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = resume, IOPL = 0
current process         = 0 (if_io_tqg_2)
trap number             = 9
panic: general protection fault
cpuid = 2
time = 1539126183
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00255a3930
vpanic() at vpanic+0x1a3/frame 0xfffffe00255a3990
panic() at panic+0x43/frame 0xfffffe00255a39f0
trap_fatal() at trap_fatal+0x35f/frame 0xfffffe00255a3a40
trap() at trap+0x6d/frame 0xfffffe00255a3b50
calltrap() at calltrap+0x8/frame 0xfffffe00255a3b50
--- trap 0x9, rip = 0xffffffff8057e206, rsp = 0xfffffe00255a3c20, rbp = 0xfffffe00255a3c20 ---
strcmp() at strcmp+0x6/frame 0xfffffe00255a3c20
eventhandler_find_list() at eventhandler_find_list+0x4b/frame 0xfffffe00255a3c50
kern_reboot() at kern_reboot+0x103/frame 0xfffffe00255a3ca0
vpanic() at vpanic+0x203/frame 0xfffffe00255a3d00
panic() at panic+0x43/frame 0xfffffe00255a3d60
trap_fatal() at trap_fatal+0x35f/frame 0xfffffe00255a3db0
trap() at trap+0x6d/frame 0xfffffe00255a3ec0
calltrap() at calltrap+0x8/frame 0xfffffe00255a3ec0
--- trap 0x9, rip = 0xffffffff8057e206, rsp = 0xfffffe00255a3f90, rbp = 0xfffffe00255a3f90 ---
strcmp() at strcmp+0x6/frame 0xfffffe00255a3f90
eventhandler_find_list() at eventhandler_find_list+0x4b/frame 0xfffffe00255a3fc0
kern_reboot() at kern_reboot+0x103/frame 0xfffffe00255a4010
vpanic() at vpanic+0x203/frame 0xfffffe00255a4070
panic() at panic+0x43/frame 0xfffffe00255a40d0
trap_fatal() at trap_fatal+0x35f/frame 0xfffffe00255a4120
trap() at trap+0x6d/frame 0xfffffe00255a4230
calltrap() at calltrap+0x8/frame 0xfffffe00255a4230
--- trap 0x9, rip = 0xffffffff8057e206, rsp = 0xfffffe00255a4300, rbp = 0xfffffe00255a4300 ---
strcmp() at strcmp+0x6/frame 0xfffffe00255a4300
eventhandler_find_list() at eventhandler_find_list+0x4b/frame 0xfffffe00255a4330
kern_reboot() at kern_reboot+0x103/frame 0xfffffe00255a4380
vpanic() at vpanic+0x203/frame 0xfffffe00255a43e0
panic() at panic+0x43/frame 0xfffffe00255a4440
trap_fatal() at trap_fatal+0x35f/frame 0xfffffe00255a4490
trap() at trap+0x6d/frame 0xfffffe00255a45a0
calltrap() at calltrap+0x8/frame 0xfffffe00255a45a0
--- trap 0x9, rip = 0xffffffff8057e206, rsp = 0xfffffe00255a4670, rbp = 0xfffffe00255a4670 ---
strcmp() at strcmp+0x6/frame 0xfffffe00255a4670
eventhandler_find_list() at eventhandler_find_list+0x4b/frame 0xfffffe00255a46a0
kern_reboot() at kern_reboot+0x103/frame 0xfffffe00255a46f0
vpanic() at vpanic+0x203/frame 0xfffffe00255a4750
panic() at panic+0x43/frame 0xfffffe00255a47b0
trap_fatal() at trap_fatal+0x35f/frame 0xfffffe00255a4800
trap() at trap+0x6d/frame 0xfffffe00255a4910
calltrap() at calltrap+0x8/frame 0xfffffe00255a4910
--- trap 0x9, rip = 0xffffffff8057e206, rsp = 0xfffffe00255a49e0, rbp = 0xfffffe00255a49e0 ---
strcmp() at strcmp+0x6/frame 0xfffffe00255a49e0
eventhandler_find_list() at eventhandler_find_list+0x4b/frame 0xfffffe00255a4a10
kern_reboot() at kern_reboot+0x103/frame 0xfffffe00255a4a60
vpanic() at vpanic+0x203/frame 0xfffffe00255a4ac0
panic() at panic+0x43/frame 0xfffffe00255a4b20
trap_fatal() at trap_fatal+0x35f/frame 0xfffffe00255a4b70
trap() at trap+0x6d/frame 0xfffffe00255a4c80
calltrap() at calltrap+0x8/frame 0xfffffe00255a4c80
--- trap 0x9, rip = 0xffffffff8057e206, rsp = 0xfffffe00255a4d50, rbp = 0xfffffe00255a4d50 ---
strcmp() at strcmp+0x6/frame 0xfffffe00255a4d50
eventhandler_find_list() at eventhandler_find_list+0x4b/frame 0xfffffe00255a4d80
kern_reboot() at kern_reboot+0x103/frame 0xfffffe00255a4dd0
vpanic() at vpanic+0x203/frame 0xfffffe00255a4e30
panic() at panic+0x43/frame 0xfffffe00255a4e90
trap_fatal() at trap_fatal+0x35f/frame 0xfffffe00255a4ee0
trap() at trap+0x6d/frame 0xfffffe00255a4ff0
calltrap() at calltrap+0x8/frame 0xfffffe00255a4ff0
--- trap 0x9, rip = 0xffffffff8057e206, rsp = 0xfffffe00255a50c0, rbp = 0xfffffe00255a50c0 ---
strcmp() at strcmp+0x6/frame 0xfffffe00255a50c0
eventhandler_find_list() at eventhandler_find_list+0x4b/frame 0xfffffe00255a50f0
kern_reboot() at kern_reboot+0x103/frame 0xfffffe00255a5140
vpanic() at vpanic+0x203/frame 0xfffffe00255a51a0
panic() at panic+0x43/frame 0xfffffe00255a5200
trap_fatal() at trap_fatal+0x35f/frame 0xfffffe00255a5250
trap() at trap+0x6d/frame 0xfffffe00255a5360
calltrap() at calltrap+0x8/frame 0xfffffe00255a5360
--- trap 0x9, rip = 0xffffffff8057e206, rsp = 0xfffffe00255a5430, rbp = 0xfffffe00255a5430 ---
strcmp() at strcmp+0x6/frame 0xfffffe00255a5430
eventhandler_find_list() at eventhandler_find_list+0x4b/frame 0xfffffe00255a5460
kern_reboot() at kern_reboot+0x103/frame 0xfffffe00255a54b0
vpanic() at vpanic+0x203/frame 0xfffffe00255a5510
panic() at panic+0x43/frame 0xfffffe00255a5570
trap_fatal() at trap_fatal+0x35f/frame 0xfffffe00255a55c0
trap() at trap+0x6d/frame 0xfffffe00255a56d0
calltrap() at calltrap+0x8/frame 0xfffffe00255a56d0
--- trap 0x9, rip = 0xffffffff8057e206, rsp = 0xfffffe00255a57a0, rbp = 0xfffffe00255a57a0 ---
strcmp() at strcmp+0x6/frame 0xfffffe00255a57a0
eventhandler_find_list() at eventhandler_find_list+0x4b/frame 0xfffffe00255a57d0
kern_reboot() at kern_reboot+0x103/frame 0xfffffe00255a5820
vpanic() at vpanic+0x203/frame 0xfffffe00255a5880
panic() at panic+0x43/frame 0xfffffe00255a58e0
trap_fatal() at trap_fatal+0x35f/frame 0xfffffe00255a5930
trap() at trap+0x6d/frame 0xfffffe00255a5a40
calltrap() at calltrap+0x8/frame 0xfffffe00255a5a40
--- trap 0x9, rip = 0xffffffff8057e206, rsp = 0xfffffe00255a5b10, rbp = 0xfffffe00255a5b10 ---
strcmp() at strcmp+0x6/frame 0xfffffe00255a5b10
eventhandler_find_list() at eventhandler_find_list+0x4b/frame 0xfffffe00255a5b40
kern_reboot() at kern_reboot+0x103/frame 0xfffffe00255a5b90
vpanic() at vpanic+0x203/frame 0xfffffe00255a5bf0
panic() at panic+0x43/frame 0xfffffe00255a5c50
trap_fatal() at trap_fatal+0x35f/frame 0xfffffe00255a5ca0
trap() at trap+0x6d/frame 0xfffffe00255a5db0
calltrap() at calltrap+0x8/frame 0xfffffe00255a5db0
--- trap 0x9, rip = 0xffffffff8057e206, rsp = 0xfffffe00255a5e80, rbp = 0xfffffe00255a5e80 ---
strcmp() at strcmp+0x6/frame 0xfffffe00255a5e80
eventhandler_find_list() at eventhandler_find_list+0x4b/frame 0xfffffe00255a5eb0
kern_reboot() at kern_reboot+0x103/frame 0xfffffe00255a5f00
vpanic() at vpanic+0x203/frame 0xfffffe00255a5f60
panic() at panic+0x43/frame 0xfffffe00255a5fc0
trap_fatal() at trap_fatal+0x35f/frame 0xfffffe00255a6010
trap() at trap+0x6d/frame 0xfffffe00255a6120
calltrap() at calltrap+0x8/frame 0xfffffe00255a6120
--- trap 0x9, rip = 0xffffffff8048600b, rsp = 0xfffffe00255a61f0, rbp = 0xfffffe00255a6230 ---
intr_event_handle() at intr_event_handle+0xbb/frame 0xfffffe00255a6230
intr_execute_handlers() at intr_execute_handlers+0x58/frame 0xfffffe00255a6260
lapic_handle_intr() at lapic_handle_intr+0x44/frame 0xfffffe00255a6280
Xapic_isr1() at Xapic_isr1+0xd9/frame 0xfffffe00255a6280
--- interrupt, rip = 0xffffffff804f8b65, rsp = 0xfffffe00255a6350, rbp = 0xfffffe00255a6350 ---
lock_delay() at lock_delay+0x35/frame 0xfffffe00255a6350
_mtx_lock_spin_cookie() at _mtx_lock_spin_cookie+0xb1/frame 0xfffffe00255a63b0
cnputs() at cnputs+0xb8/frame 0xfffffe00255a63d0
putchar() at putchar+0x14e/frame 0xfffffe00255a6450
kvprintf() at kvprintf+0x106/frame 0xfffffe00255a6570
vprintf() at vprintf+0x84/frame 0xfffffe00255a6640
printf() at printf+0x43/frame 0xfffffe00255a66a0
trap_fatal() at trap_fatal+0x1a0/frame 0xfffffe00255a66f0
trap() at trap+0x6d/frame 0xfffffe00255a6800
calltrap() at calltrap+0x8/frame 0xfffffe00255a6800
--- trap 0x9, rip = 0xffffffff805a3699, rsp = 0xfffffe00255a68d0, rbp = 0xfffffe00255a6920 ---
vlan_input() at vlan_input+0x199/frame 0xfffffe00255a6920
ether_demux() at ether_demux+0x129/frame 0xfffffe00255a6950
ether_nh_input() at ether_nh_input+0x30c/frame 0xfffffe00255a69a0
netisr_dispatch_src() at netisr_dispatch_src+0xa1/frame 0xfffffe00255a6a00
ether_input() at ether_input+0x26/frame 0xfffffe00255a6a20
iflib_rxeof() at iflib_rxeof+0x880/frame 0xfffffe00255a6b00
_task_fn_rx() at _task_fn_rx+0x40/frame 0xfffffe00255a6b30
gtaskqueue_run_locked() at gtaskqueue_run_locked+0xe3/frame 0xfffffe00255a6b80
gtaskqueue_thread_loop() at gtaskqueue_thread_loop+0x88/frame 0xfffffe00255a6bb0
fork_exit() at fork_exit+0x76/frame 0xfffffe00255a6bf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00255a6bf0

Comment 32 Lev A. Serebryakov freebsd_committer

2018-10-09 23:40:10 UTC

I've removed KDB_UNATTENDED from the kernel and got a dump! It looks like kernel memory is a complete mess at this point, as it panics on all 4 cores!



Unread portion of the kernel message buffer:
kernel trap 9 with interrupts disabled
kernel trap 9 with interrupts disabled
kernel trap 9 with interrupts disabled


Fatal trap 9: general protection fault while in kernel mode


cpuid = 0; 
Fatal trap 12: page fault while in kernel mode


Fatal trap 9: general protection fault while in kernel mode
cpuid = 1; apic id = 02
apic id = 00
instruction pointer	= 0x20:0xffffffff8048600b

Fatal trap 9: general protection fault while in kernel mode
cpuid = 2; apic id = 04
cpuid = 3; apic id = 06
instruction pointer	= 0x20:0xffffffff807864a2
stack pointer	        = 0x28:0xfffffe00255ab800
fault virtual address	= 0xfffff800c1401524
stack pointer	        = 0x28:0xfffffe00004511b0
fault code		= supervisor write data, page not present
instruction pointer	= 0x20:0xffffffff80338877
stack pointer	        = 0x28:0xfffffe00255a1ad0
kernel trap 9 with interrupts disabled
frame pointer	        = 0x28:0xfffffe00004511f0


Fatal trap 9: general protection fault while in kernel mode
cpuid = 1; apic id = 02
instruction pointer	= 0x20:0xffffffff8048600b
code segment		= base rx0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
frame pointer	        = 0x28:0xfffffe00255ab820
code segment		= base rx0, limit 0xfffff, type 0x1b
instruction pointer	= 0x20:0xffffffff807864a2
			= DPL 0, pres 1, long 1, def32 0, gran 1
stack pointer	        = 0x28:0xfffffe00255a1390
frame pointer	        = 0x28:0xfffffe00255a13d0
code segment		= base rx0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= resume, IOPL = 0
current process		= 0 (if_io_tqg_3)

#0  doadump (textdump=0) at pcpu.h:230
230	pcpu.h: No such file or directory.
	in pcpu.h
(kgdb) #0  doadump (textdump=0) at pcpu.h:230
#1  0xffffffff8031436b in db_dump (dummy=<value optimized out>, dummy2=<value optimized out>, dummy3=<value optimized out>, dummy4=<value optimized out>) at /data/src/sys/ddb/db_command.c:574
#2  0xffffffff80314139 in db_command (cmd_table=<value optimized out>) at /data/src/sys/ddb/db_command.c:481
#3  0xffffffff80313eb4 in db_command_loop () at /data/src/sys/ddb/db_command.c:534
#4  0xffffffff8031715f in db_trap (type=<value optimized out>, code=<value optimized out>) at /data/src/sys/ddb/db_main.c:252
#5  0xffffffff804f7e33 in kdb_trap (type=9, code=0, tf=<value optimized out>) at /data/src/sys/kern/subr_kdb.c:693
#6  0xffffffff8073e561 in trap_fatal (frame=0xfffffe00255ab740, eva=0) at /data/src/sys/amd64/amd64/trap.c:921
#7  0xffffffff8073db0d in trap (frame=0xfffffe00255ab740) at counter.h:87
#8  0xffffffff8071db37 in calltrap () at /data/src/sys/amd64/amd64/exception.S:232
#9  0xffffffff807864a2 in intr_execute_handlers (isrc=0xfffff80002429180, frame=0xfffffe00255ab850) at /data/src/sys/x86/x86/intr_machdep.c:341
#10 0xffffffff8078c154 in lapic_handle_intr (vector=<value optimized out>, frame=<value optimized out>) at /data/src/sys/x86/x86/local_apic.c:1293
#11 0xffffffff8071ecc9 in Xapic_isr1 () at apic_vector.S:118
#12 0xffffffff806e9db0 in uma_zalloc_arg (zone=<value optimized out>, udata=0x20, flags=-512) at /data/src/sys/vm/uma_core.c:2571
#13 0xffffffff805b021d in _iflib_fl_refill (ctx=0xfffff8000241dc00, fl=0xfffff8000241e000, count=<value optimized out>) at mbuf.h:790
#14 0xffffffff805afcc8 in iflib_rxeof (rxq=<value optimized out>, budget=<value optimized out>) at /data/src/sys/net/iflib.c:2072
#15 0xffffffff805ac250 in _task_fn_rx (context=0xfffff80002424840) at /data/src/sys/net/iflib.c:3820
#16 0xffffffff804f6073 in gtaskqueue_run_locked (queue=0xfffff8000222b500) at /data/src/sys/kern/subr_gtaskqueue.c:332
#17 0xffffffff804f5df8 in gtaskqueue_thread_loop (arg=<value optimized out>) at /data/src/sys/kern/subr_gtaskqueue.c:507
#18 0xffffffff80483716 in fork_exit (callout=0xffffffff804f5d70 <gtaskqueue_thread_loop>, arg=0xfffffe0000221050, frame=0xfffffe00255abc00) at /data/src/sys/kern/kern_fork.c:1057
#19 0xffffffff8071eb1e in fork_trampoline () at /data/src/sys/amd64/amd64/exception.S:993
#20 0x0000000000000000 in ?? ()
Current language:  auto; currently minimal
(kgdb)

Comment 33 Lev A. Serebryakov freebsd_committer

2018-10-10 00:07:54 UTC

And igb + !AESNI + INVARIANTS — looks very similar to em!

panic: Assertion (staterr & E1000_RXD_STAT_DD) != 0 failed at /data/src/sys/dev/e1000/igb_txrx.c:451
cpuid = 2
time = 1539129723
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00255a68f0
vpanic() at vpanic+0x1a3/frame 0xfffffe00255a6950
panic() at panic+0x43/frame 0xfffffe00255a69b0
igb_isc_rxd_pkt_get() at igb_isc_rxd_pkt_get+0x264/frame 0xfffffe00255a6a10
iflib_rxeof() at iflib_rxeof+0x128/frame 0xfffffe00255a6b00
_task_fn_rx() at _task_fn_rx+0x49/frame 0xfffffe00255a6b30
gtaskqueue_run_locked() at gtaskqueue_run_locked+0xf9/frame 0xfffffe00255a6b80
gtaskqueue_thread_loop() at gtaskqueue_thread_loop+0x88/frame 0xfffffe00255a6bb0
fork_exit() at fork_exit+0x84/frame 0xfffffe00255a6bf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00255a6bf0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
KDB: enter: panic

#0  doadump (textdump=0) at pcpu.h:230
230	pcpu.h: No such file or directory.
	in pcpu.h
(kgdb)
#0  doadump (textdump=0) at pcpu.h:230
#1  0xffffffff8031544b in db_dump (dummy=<value optimized out>, dummy2=<value optimized out>, dummy3=<value optimized out>, dummy4=<value optimized out>) at /data/src/sys/ddb/db_command.c:574
#2  0xffffffff80315219 in db_command (cmd_table=<value optimized out>) at /data/src/sys/ddb/db_command.c:481
#3  0xffffffff80314f94 in db_command_loop () at /data/src/sys/ddb/db_command.c:534
#4  0xffffffff803181af in db_trap (type=<value optimized out>, code=<value optimized out>) at /data/src/sys/ddb/db_main.c:252
#5  0xffffffff804fe7e3 in kdb_trap (type=3, code=0, tf=<value optimized out>) at /data/src/sys/kern/subr_kdb.c:693
#6  0xffffffff8075b922 in trap (frame=0xfffffe00255a6820) at /data/src/sys/amd64/amd64/trap.c:619
#7  0xffffffff80739cc7 in calltrap () at /data/src/sys/amd64/amd64/exception.S:232
#8  0xffffffff804fdeab in kdb_enter (why=0xffffffff80821b4a "panic", msg=<value optimized out>) at cpufunc.h:65
#9  0xffffffff804bc9c0 in vpanic (fmt=<value optimized out>, ap=0xfffffe00255a6990) at /data/src/sys/kern/kern_shutdown.c:861
#10 0xffffffff804bc763 in panic (fmt=<value optimized out>) at /data/src/sys/kern/kern_shutdown.c:799
#11 0xffffffff8033db94 in igb_isc_rxd_pkt_get (arg=<value optimized out>, ri=<value optimized out>) at /data/src/sys/dev/e1000/igb_txrx.c:451
#12 0xffffffff805c0448 in iflib_rxeof (rxq=0xfffff8000245b580, budget=<value optimized out>) at /data/src/sys/net/iflib.c:2684
#13 0xffffffff805bc8b9 in _task_fn_rx (context=0xfffff8000245b580) at /data/src/sys/net/iflib.c:3820
#14 0xffffffff804fc959 in gtaskqueue_run_locked (queue=0xfffff8000222d600) at /data/src/sys/kern/subr_gtaskqueue.c:332
#15 0xffffffff804fc718 in gtaskqueue_thread_loop (arg=<value optimized out>) at /data/src/sys/kern/subr_gtaskqueue.c:507
#16 0xffffffff80486134 in fork_exit (callout=0xffffffff804fc690 <gtaskqueue_thread_loop>, arg=0xfffffe0000221038, frame=0xfffffe00255a6c00) at /data/src/sys/kern/kern_fork.c:1057
#17 0xffffffff8073acae in fork_trampoline () at /data/src/sys/amd64/amd64/exception.S:993
#18 0x0000000000000000 in ?? ()

Comment 34 Lev A. Serebryakov freebsd_committer

2018-10-10 12:32:30 UTC

Ok, I have new data.

Softcrypto or IPsec is only a symptom, not the cause.

The cause is in the igb/em driver (different files, logically the same place).

I can reproduce the driver KASSERT on a kernel with INVARIANTS without any crypto at all.

Conditions are: low-power hardware, high load, receive data as fast as possible.

On Celeron J3160 + igb(8), triggering the bug requires loading the system with IPsec with soft crypto. I was not able to trigger it either without crypto or with AESNI.

On Atom D2500 + em(8), it requires either soft crypto (easy!) or a multitude of plain connections without crypto. For example, 32 iperf3 streams for 2+ minutes are enough. With IPsec, it triggers with 1 stream in 5 seconds.

So, I can reproduce this on Atom D2500 + em(8) with a simple "iperf3 -c <server> -R -t 3600 --nstreams 32".

Without INVARIANTS, it is very hard to catch this bug without IPsec. I think this is because the memory corruption is hard to notice without additional traffic processing. I think IPsec is only a way to discover that memory is corrupted, not a way to corrupt memory.

Here is the stack trace with INVARIANTS and without any crypto. It is virtually the same as with crypto. As usual, I can provide the kernel file and a full crash dump, and can re-run tests with any patches and settings.

I'm sure now that it is a bug in the Intel driver. A race condition, maybe?

panic: Assertion (staterr & E1000_RXD_STAT_DD) != 0 failed at /data/src/sys/dev/e1000/em_txrx.c:698
cpuid = 1
time = 1539169364
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe000043f900
vpanic() at vpanic+0x1a3/frame 0xfffffe000043f960
panic() at panic+0x43/frame 0xfffffe000043f9c0
em_isc_rxd_pkt_get() at em_isc_rxd_pkt_get+0x1d4/frame 0xfffffe000043fa10
iflib_rxeof() at iflib_rxeof+0x128/frame 0xfffffe000043fb00
_task_fn_rx() at _task_fn_rx+0x49/frame 0xfffffe000043fb30
gtaskqueue_run_locked() at gtaskqueue_run_locked+0xf9/frame 0xfffffe000043fb80
gtaskqueue_thread_loop() at gtaskqueue_thread_loop+0x88/frame 0xfffffe000043fbb0
fork_exit() at fork_exit+0x84/frame 0xfffffe000043fbf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe000043fbf0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
Uptime: 17m24s
Dumping 477 out of 4060 MB:..4%..11%..21%..31%..41%..51%..61%..71%..81%..91%

#0  doadump (textdump=1) at pcpu.h:230
230	pcpu.h: No such file or directory.
	in pcpu.h
(kgdb)
#0  doadump (textdump=1) at pcpu.h:230
#1  0xffffffff80565c60 in kern_reboot (howto=260) at /data/src/sys/kern/kern_shutdown.c:446
#2  0xffffffff805660b3 in vpanic (fmt=<value optimized out>, ap=<value optimized out>) at /data/src/sys/kern/kern_shutdown.c:872
#3  0xffffffff80565e13 in panic (fmt=<value optimized out>) at /data/src/sys/kern/kern_shutdown.c:799
#4  0xffffffff803f1d94 in em_isc_rxd_pkt_get (arg=<value optimized out>, ri=<value optimized out>) at /data/src/sys/dev/e1000/em_txrx.c:698
#5  0xffffffff80668b28 in iflib_rxeof (rxq=0xfffff80002295ac0, budget=<value optimized out>) at /data/src/sys/net/iflib.c:2684
#6  0xffffffff80664f69 in _task_fn_rx (context=0xfffff80002295ac0) at /data/src/sys/net/iflib.c:3820
#7  0xffffffff805a6039 in gtaskqueue_run_locked (queue=0xfffff800021dc500) at /data/src/sys/kern/subr_gtaskqueue.c:332
#8  0xffffffff805a5df8 in gtaskqueue_thread_loop (arg=<value optimized out>) at /data/src/sys/kern/subr_gtaskqueue.c:507
#9  0xffffffff8052f7e4 in fork_exit (callout=0xffffffff805a5d70 <gtaskqueue_thread_loop>, arg=0xfffffe00017f8020, frame=0xfffffe000043fc00) at /data/src/sys/kern/kern_fork.c:1057
#10 0xffffffff8081ce2e in fork_trampoline () at /data/src/sys/amd64/amd64/exception.S:993
#11 0x0000000000000000 in ?? ()
Current language:  auto; currently minimal
(kgdb)

Comment 35 Lev A. Serebryakov freebsd_committer

2018-10-12 15:46:06 UTC

(In reply to Lev A. Serebryakov from comment #34)

iperf3 -c <server> -R -t 3600 -P 32

"-P 32", not "--nstreams 32", as we speak TCP, not SCTP here.

Comment 36 Eric Joyner freebsd_committer

2018-10-12 17:46:03 UTC

Maybe you're encountering something similar to what was fixed here? https://github.com/freebsd/freebsd/commit/e2a6991d7175b5ba9b6832b1d8770e58fa57e998

That was causing us to hit the MPASS() in rxd_pkt_get in ixl(4).

Comment 37 Lev A. Serebryakov freebsd_committer

2018-10-12 17:52:05 UTC

(In reply to Eric Joyner from comment #36)
Maybe. I'm using mtu 9000 in my tests…
I could try to reproduce it with standard mtu (1500).

Comment 38 Lev A. Serebryakov freebsd_committer

2018-10-12 18:32:51 UTC

(In reply to Eric Joyner from comment #36)
I cannot reproduce it with mtu=1500 on both ends after 20 minutes.

I'm trying to comment out the "budget == 1" case for em(8).

Comment 39 Lev A. Serebryakov freebsd_committer

2018-10-12 18:44:28 UTC

(In reply to Eric Joyner from comment #36)
Nope, commenting out the "budget == 1" section in em_txrx.c (lines 556-560) doesn't help.

The same assertion was triggered.

Comment 40 Lev A. Serebryakov freebsd_committer

2018-10-12 19:22:29 UTC

One additional data point: with mtu=1500 on both ends, everything works for tens of minutes, but the sending side (11.2-STABLE based) shows bursts of "resends", which do not occur with mtu=9000 before the crash.

Comment 41 Lev A. Serebryakov freebsd_committer

2018-10-12 19:33:21 UTC

(In reply to Lev A. Serebryakov from comment #39)
OOPS! Looks like I patched only "lem" but not "em" function!
Let's try to patch "em" too...

Comment 42 Lev A. Serebryakov freebsd_committer

2018-10-12 22:14:44 UTC

(In reply to Eric Joyner from comment #36)
Looks like it helps.
Simple traffic with INVARIANTS works, now I'm testing IPsec configuration.

Comment 43 Lev A. Serebryakov freebsd_committer

2018-10-12 23:51:05 UTC

(In reply to Eric Joyner from comment #36)
Yess! 

It helps em0 pass all my torture tests (when I comment out this "optimization" twice, for lem and em). I cannot test on igb now, but I believe it will help there too.

Please, commit this fix :-)

Comment 44 Lev A. Serebryakov freebsd_committer

2018-10-13 09:20:47 UTC

BTW, if_ix contains the SAME problem. I cannot reproduce it (I have only one ix link), but the code is the same and I'm sure it has the same problem.

Comment 45 Eric Joyner freebsd_committer

2018-10-13 20:12:54 UTC

I'll work on porting that change (and maybe another) to em/igb/ix.

Comment 46 commit-hook freebsd_committer

2018-10-14 05:10:04 UTC

A commit references this bug:

Author: erj
Date: Sun Oct 14 05:09:44 UTC 2018
New revision: 339354
URL: https://svnweb.freebsd.org/changeset/base/339354

Log:
  em/igb/ix(4): Port two Tx/Rx fixes made to ixl in r339338

  - Fix assert/panic on receive when Jumbo Frames are enabled.

  From the commit I made to ixl:
  "It turns out that *_isc_rxd_available is supposed to return how many
  packets are available to be cleaned on the rx ring. This patch removes
  a section of code where if the budget argument is 1, the function would return
  one if there was a descriptor available, not necessarily a packet.

  This is okay in regular mtu 1500 traffic since the max frame size is less
  than the configured receive buffer size (2048), but this doesn't work when
  received packets can span more than one  descriptor, as is the case when the
  mtu is 9000 and the receive buffer size is 4096."

  - Fix possible Tx hang because *_isc_txd_credits_update returns incorrect result

  From the commit by Krzysztof Galazka to ixl: "Function isc_txd_update_credits
  called with clear set to false should return 1 if there are TX descriptors
  already handled by HW. It was always returning 0 causing troubles with UDP TX
  traffic."

  PR:             231659
  Reported by:    lev@
  Approved by:	re (gjb@)
  Sponsored by:   Intel Corporation

Changes:
  head/sys/dev/e1000/em_txrx.c
  head/sys/dev/e1000/igb_txrx.c
  head/sys/dev/ixgbe/ix_txrx.c
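
The two fixes described in the log above can be pictured with a small user-space model. This is only a sketch under stated assumptions, not the driver code: every `sk_`/`SK_` name, the status bit values, and the ring layout are hypothetical and chosen for illustration. It shows (a) why a "budget == 1" shortcut that reports a done descriptor as a packet is wrong once a frame spans several descriptors, and (b) a credits-update function that returns 1 when called with clear == false and completed Tx descriptors exist.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative status bits; names and values are assumptions of this sketch. */
#define SK_RXD_STAT_DD  0x01u  /* hardware has written the descriptor back */
#define SK_RXD_STAT_EOP 0x02u  /* descriptor holds the end of a packet */

/* Descriptors needed for one frame, given the receive buffer size. */
static int
sk_descs_per_frame(int frame_len, int rx_buf_size)
{
    assert(frame_len > 0 && rx_buf_size > 0);
    return ((frame_len + rx_buf_size - 1) / rx_buf_size);
}

/* Count complete packets: done descriptors up to and including EOP. */
static int
sk_rxd_available(const uint32_t *status, int nrxd, int idx, int budget)
{
    int cnt = 0;

    assert(nrxd > 0 && idx >= 0 && idx < nrxd);
    for (int i = 0; i < nrxd && cnt < budget; i++) {
        uint32_t st = status[(idx + i) % nrxd];

        if ((st & SK_RXD_STAT_DD) == 0)
            break;      /* hardware has not finished this descriptor */
        if ((st & SK_RXD_STAT_EOP) != 0)
            cnt++;      /* only a finished frame counts */
    }
    return (cnt);
}

/* The removed shortcut: with budget == 1, a done descriptor counted as a packet. */
static int
sk_rxd_available_shortcut(const uint32_t *status, int nrxd, int idx, int budget)
{
    assert(nrxd > 0 && idx >= 0 && idx < nrxd);
    if (budget == 1)
        return ((status[idx] & SK_RXD_STAT_DD) != 0);
    return (sk_rxd_available(status, nrxd, idx, budget));
}

/* Hypothetical Tx ring state for the second fix. */
struct sk_tx_ring {
    int ntxd;     /* ring size */
    int cidx;     /* next descriptor software will reclaim */
    int hw_done;  /* first descriptor hardware has not completed */
};

/* Return 0 if nothing is done, 1 if only asked to peek, else reclaim and count. */
static int
sk_txd_credits_update(struct sk_tx_ring *r, bool clear)
{
    int done;

    assert(r->ntxd > 0);
    done = (r->hw_done - r->cidx + r->ntxd) % r->ntxd;
    if (done == 0)
        return (0);
    if (!clear)
        return (1);     /* report pending work without reclaiming */
    r->cidx = r->hw_done;
    return (done);
}
```

In this model a 9018-byte frame with 4096-byte buffers needs three descriptors. If only the first two have been written back, the shortcut reports one available packet while the packet-counting version reports none. A caller that trusts the shortcut would then walk descriptors up to EOP and reach one without the DD bit, which is presumably the situation the assertion in the traces above catches.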

Comment 47 Eric Joyner freebsd_committer

2018-10-17 15:28:46 UTC

(In reply to Lev A. Serebryakov from comment #44)

Lev,

Did my commit fix your issue? Every Intel driver should be fixed now.

Comment 48 Lev A. Serebryakov freebsd_committer

2018-10-18 10:59:04 UTC

(In reply to Eric Joyner from comment #47)
Yep, I can not reproduce crash anymore.

Thank you!

(I'm not sure who should close the ticket, me or you.)

Comment 49 Eric Joyner freebsd_committer

2018-10-18 15:49:53 UTC

(In reply to Lev A. Serebryakov from comment #48)

I'll close it.