Summary:   KTLS with Intel QAT may trigger kernel panics
Product:   Base System
Component: kern
Reporter:  ss3bsd <3226388001>
Assignee:  freebsd-net (Nobody) <net>
Status:    Open
Severity:  Affects Only Me
Priority:  ---
Keywords:  crash
CC:        lexi, markj
Version:   14.2-RELEASE
Hardware:  Any
OS:        Any
For the second crash, could you please print *crp from frame 14?

Hello. Is this what you need? I'm not familiar with kernel debugging.

#14 0xffffffff80c83fb0 in crypto_dispatch_one (crp=crp@entry=0xfffffe014b081c38, hint=0) at /usr/src/sys/opencrypto/crypto.c:1432
(kgdb) p *0xfffffe014b081c38
$1 = 0

(In reply to ss3bsd from comment #2)
Please try running the following commands:

(kgdb) frame 14
(kgdb) p/x *crp

and share the output. Also, is the second panic reproducible? Does it happen as soon as a KTLS session starts, or does it take some time to trigger the crash?

(kgdb) frame 14
#14 0xffffffff80c83fb0 in crypto_dispatch_one (crp=crp@entry=0xfffffe014b081c38, hint=0) at /usr/src/sys/opencrypto/crypto.c:1432
1432            result = crypto_invoke(cap, crp, hint);
(kgdb) p/x *crp
$1 = {crp_next = {tqe_next = 0x0, tqe_prev = 0x0}, crp_task = {ta_link = {stqe_next = 0x0}, ta_pending = 0x0, ta_priority = 0x0, ta_flags = 0x0, ta_func = 0x0, ta_context = 0x0}, crp_session = 0xfffff8079d037c80, crp_olen = 0x0, crp_etype = 0x0, crp_flags = 0x10, crp_op = 0x2, crp_buf = {{{cb_buf = 0xfffffe014b081be0, cb_buf_len = 0x0}, cb_mbuf = 0xfffffe014b081be0, {cb_vm_page = 0xfffffe014b081be0, cb_vm_page_len = 0x0, cb_vm_page_offset = 0x0}, cb_uio = 0xfffffe014b081be0}, cb_type = 0x2}, crp_obuf = {{{cb_buf = 0x0, cb_buf_len = 0x0}, cb_mbuf = 0x0, {cb_vm_page = 0x0, cb_vm_page_len = 0x0, cb_vm_page_offset = 0x0}, cb_uio = 0x0}, cb_type = 0x0}, crp_aad = 0x0, crp_aad_start = 0x0, crp_aad_length = 0x0, crp_esn = {0x0, 0x0, 0x0, 0x0}, crp_iv_start = 0x0, crp_payload_start = 0x0, crp_payload_output_start = 0x0, crp_payload_length = 0x501, crp_digest_start = 0x501, crp_iv = {0x0 <repeats 64 times>}, crp_cipher_key = 0x0, crp_auth_key = 0x0, crp_opaque = 0xfffffe014b081b98, crp_callback = 0xffffffff80c8c4d0, crp_tstamp = {sec = 0x0, frac = 0x0}, crp_seq = 0x0, crp_retw_id = 0x3}

> Also, is the second panic reproducible?
> Does it happen as soon as a KTLS session starts, or does it take some time to trigger the crash?
I upgraded from 13.3 to 14.1 on 6/19, and (perhaps) this panic has occurred twice since then.
There is no dump for the first one; I enabled kernel dumps after the first occurrence.
And yes, "it takes some time to trigger the crash."
(The machine is my multi-purpose home router/server/nas etc.)
I'm now running the same machine with
kern.ipc.tls.cbc_enable=0
to see if the stability changes.
(In reply to ss3bsd from comment #4)
Thank you. Could we also see the output of

(kgdb) frame 14
(kgdb) p/x *crp->crp_buf.cb_uio

?

(kgdb) frame 14
#14 0xffffffff80c83fb0 in crypto_dispatch_one (crp=crp@entry=0xfffffe014b081c38, hint=0) at /usr/src/sys/opencrypto/crypto.c:1432
1432            result = crypto_invoke(cap, crp, hint);
(kgdb) p/x *crp->crp_buf.cb_uio
$1 = {uio_iov = 0xfffff8079cb21c80, uio_iovcnt = 0x3, uio_offset = 0x0, uio_resid = 0x53d, uio_segflg = 0x1, uio_rw = 0x0, uio_td = 0xfffff8010459d000}
(kgdb)

Created attachment 251776 [details]
kgdb log #3
Another panic log attached.
#3
Tried changing
  sysctl dev.qat_ocf.0.enable: 1 -> 0
and then the machine panicked.
Created attachment 251786 [details]
kgdb #3 with clean kernel
I'm reuploading the #3 log because the previous one was acquired on a dirty kernel carrying a private patch that force-enables TCP LRO in a forwarding environment.
The new log shows a similar trace without the tcp_lro part.
To be clear:

Panic #1: reproducible with a clean kernel. I was aware that "kldunload qat" might trigger a panic on 13.x as well.

Panic #2: the log was acquired on a dirty kernel. It is intermittent and so far difficult to reproduce intentionally, so I may not be able to offer another log from a clean kernel. Therefore, at this moment I cannot be sure that my patch is not a factor. What I can say is that 13.x ran stably with KTLS and QAT, and I had not experienced this before upgrading to 14.1, even with the same kind of private patch.

Panic #3: reproducible with a clean kernel. I did not try switching dev.qat_ocf.0.enable=0 while the host was running 13.x, so I cannot tell whether it is common to 14.1 and 13.x (maybe it does not exist on 13.x?).

I do not know if it is related to the panic, but I found this on the console:

qat_ocf0: waiting for session un-busy failed
qat_ocf0: unable to remove encrypt session
qat_ocf0: waiting for session un-busy failed
qat_ocf0: unable to remove encrypt session
> I'm now running the same machine with
> kern.ipc.tls.cbc_enable=0
> to see if the stability changes.
The machine ran for two weeks without a panic.
A couple of days ago I changed kern.ipc.tls.cbc_enable back from 0 to 1.
Today the machine panicked again, in a CBC-related part of the code (ktls_ocf_tls_cbc_decrypt).
So the cause might be something specific to CBC. (Not strong evidence, though.)
--
Unread portion of the kernel message buffer:
trap number = 12
panic: page fault
cpuid = 1
time = 1721297633
KDB: stack backtrace:
#0 0xffffffff809d2b5d at kdb_backtrace+0x5d
#1 0xffffffff809858c1 at vpanic+0x131
#2 0xffffffff80985783 at panic+0x43
#3 0xffffffff80e5f91b at trap_fatal+0x40b
#4 0xffffffff80e5f966 at trap_pfault+0x46
#5 0xffffffff80e36288 at calltrap+0x8
#6 0xffffffff80e21cc8 at bounce_bus_dmamap_load_buffer+0x178
#7 0xffffffff809c8a07 at bus_dmamap_load_crp_buffer+0x237
#8 0xffffffff82ddb7ad at qat_ocf_process+0x40d
#9 0xffffffff80c83fd0 at crypto_dispatch+0x60
#10 0xffffffff80c8c3cd at ktls_ocf_dispatch+0x5d
#11 0xffffffff80c8d3f4 at ktls_ocf_tls_cbc_decrypt+0x344
#12 0xffffffff80a1a484 at ktls_work_thread+0x664
#13 0xffffffff8093fc7f at fork_exit+0x7f
#14 0xffffffff80e372ee at fork_trampoline+0xe
Uptime: 2d17h13m52s
By the way, is there an easy way to disable only the CBC acceleration of QAT while keeping KTLS offload enabled? I want to try that next if there is.

Another (reproducible) panic under high TX load has been found. Is something (such as an infinite loop) happening in this path?

	/*
	 * Caller marked the request to be processed immediately; dispatch it
	 * directly to the driver unless the driver is currently blocked, in
	 * which case it is queued for deferred dispatch.
	 */
	cap = crp->crp_session->cap;
	if (!atomic_load_int(&cap->cc_qblocked)) {
		result = crypto_invoke(cap, crp, hint);
		if (result != ERESTART)
			return (result);
		/*
		 * The driver ran out of resources, put the request on the
		 * queue.
		 */
	}

--
Now on FreeBSD 14.2

(kgdb) backtrace
#0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:57
#1  doadump (textdump=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:405
#2  0xffffffff809908e7 in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:523
#3  0xffffffff80990dbe in vpanic (fmt=0xffffffff8100a112 "double fault", ap=ap@entry=0xfffffe003a415e40) at /usr/src/sys/kern/kern_shutdown.c:967
#4  0xffffffff80990c13 in panic (fmt=<unavailable>) at /usr/src/sys/kern/kern_shutdown.c:891
#5  0xffffffff80e87f6e in dblfault_handler (frame=<optimized out>) at /usr/src/sys/amd64/amd64/trap.c:1012
#6  <signal handler called>
#7  0xffffffff80e49460 in bounce_bus_dmamap_load_buffer (dmat=0xfffff801ceb14700, map=0xffffffff81a27d90 <nobounce_dmamap>, buf=0xfffff807af688ad8, buflen=5, pmap=0xffffffff817514d0 <kernel_pmap_store>, flags=1, segs=0x0, segp=0xfffffe014e2c90d4) at /usr/src/sys/x86/x86/busdma_bounce.c:741
#8  0xffffffff809d4699 in _bus_dmamap_load_buffer (dmat=0xfffff801ceb14700, map=0x0, buf=0xfffff807af688ad8, buflen=18446744071589494160, flags=1, segs=0x0, segp=0xfffffe014e2c90d4, pmap=<optimized out>) at /usr/src/sys/x86/include/bus_dma.h:166
#9  _bus_dmamap_load_mbuf_epg (dmat=dmat@entry=0xfffff801ceb14700, map=map@entry=0x0, m=0xfffff807af688a00, segs=segs@entry=0x0, nsegs=nsegs@entry=0xfffffe014e2c90d4, flags=1) at /usr/src/sys/kern/subr_bus_dma.c:176
#10 0xffffffff809d4431 in _bus_dmamap_load_single_mbuf (dmat=0xfffff801ceb14700, map=0x0, m=0xfffff807af688ad8, segs=0x0, nsegs=0xfffffe014e2c90d4, flags=1) at /usr/src/sys/kern/subr_bus_dma.c:220
#11 bus_dmamap_load_crp_buffer (dmat=0xfffff801ceb14700, map=0x0, cb=<optimized out>, callback=0xffffffff83000840 <qat_ocf_crypto_load_buf_cb>, callback_arg=0xfffffe014e2c9138, flags=1) at /usr/src/sys/kern/subr_bus_dma.c:554
#12 0xffffffff82fff7ad in qat_ocf_process () from /boot/kernel/qat.ko
#13 0xffffffff80c99700 in crypto_dispatch_one (crp=0xfffff807a8af10b8, hint=0) at /usr/src/sys/opencrypto/crypto.c:1432
#14 crypto_dispatch (crp=0xfffff807a8af10b8) at /usr/src/sys/opencrypto/crypto.c:1448
#15 0xffffffff80ca1cf7 in ktls_ocf_dispatch_async_cb (crp=0xfffff801ceb14700) at /usr/src/sys/opencrypto/ktls_ocf.c:246
#16 0xffffffff82fff801 in qat_ocf_process () from /boot/kernel/qat.ko
#17 0xffffffff80c99700 in crypto_dispatch_one (crp=0xfffff807a8af10b8, hint=0) at /usr/src/sys/opencrypto/crypto.c:1432
#18 crypto_dispatch (crp=0xfffff807a8af10b8) at /usr/src/sys/opencrypto/crypto.c:1448
#19 0xffffffff80ca1cf7 in ktls_ocf_dispatch_async_cb (crp=0xfffff801ceb14700) at /usr/src/sys/opencrypto/ktls_ocf.c:246
#20 0xffffffff82fff801 in qat_ocf_process () from /boot/kernel/qat.ko
#21 0xffffffff80c99700 in crypto_dispatch_one (crp=0xfffff807a8af10b8, hint=0) at /usr/src/sys/opencrypto/crypto.c:1432
#22 crypto_dispatch (crp=0xfffff807a8af10b8) at /usr/src/sys/opencrypto/crypto.c:1448
: omit :
#245 0xffffffff80c99700 in crypto_dispatch_one (crp=0xfffff807a8af10b8, hint=0) at /usr/src/sys/opencrypto/crypto.c:1432
#246 crypto_dispatch (crp=0xfffff807a8af10b8) at /usr/src/sys/opencrypto/crypto.c:1448
#247 0xffffffff80ca1cf7 in ktls_ocf_dispatch_async_cb (crp=0xfffff801ceb14700) at /usr/src/sys/opencrypto/ktls_ocf.c:246
#248 0xffffffff82fff801 in qat_ocf_process () from /boot/kernel/qat.ko
#249 0xffffffff80c99700 in crypto_dispatch_one (crp=crp@entry=0xfffff807a8af10b8, hint=0) at /usr/src/sys/opencrypto/crypto.c:1432
#250 crypto_dispatch (crp=crp@entry=0xfffff807a8af10b8) at /usr/src/sys/opencrypto/crypto.c:1448
#251 0xffffffff80ca16bd in ktls_ocf_dispatch_async (state=0xfffff801ceaf8420, crp=0xfffff807a8af10b8) at /usr/src/sys/opencrypto/ktls_ocf.c:268
#252 ktls_ocf_tls13_aead_encrypt (state=0xfffff801ceaf8420, tls=0xfffff805c1fec108, m=0xfffffe014c5687c0, outiov=0xfffff801ceaf8000, outiovcnt=0) at /usr/src/sys/opencrypto/ktls_ocf.c:856
#253 0xffffffff80a2807e in ktls_encrypt_async (wq=wq@entry=0xfffff804cc22c680, top=top@entry=0xfffff807af688a00) at /usr/src/sys/kern/uipc_ktls.c:3108
#254 0xffffffff80a27a36 in ktls_work_thread (ctx=ctx@entry=0xfffff804cc22c680) at /usr/src/sys/kern/uipc_ktls.c:3254
#255 0xffffffff8094a75f in fork_exit (callout=0xffffffff80a26a50 <ktls_work_thread>, arg=0xfffff804cc22c680, frame=0xfffffe014e2ccf40) at /usr/src/sys/kern/kern_fork.c:1164
#256 <signal handler called>

(In reply to ss3bsd from comment #14)
My reproducing procedure for the panic above is:

1. Modify wget in ports to use KTLS and build it.

work/wget-1.25.0/src/openssl.c:

 ssl_init (void)
 {
   SSL_METHOD const *meth = NULL;
-  long ssl_options = SSL_OP_ENABLE_KTLS;
+  long ssl_options = 0;
   char *ciphers_string = NULL;
 #if !defined(LIBRESSL_VERSION_NUMBER) && (OPENSSL_VERSION_NUMBER >= 0x10100000L)
   int ssl_proto_version = 0;

2. Run several concurrent POST instances:

/usr/ports/ftp/wget/work/stage/usr/local/bin/wget --post-file=VERY_LARGE_FILE --no-check-certificate -O /dev/null https://A_SERVER_OVER_10GbEther/ &
/usr/ports/ftp/wget/work/stage/usr/local/bin/wget --post-file=VERY_LARGE_FILE --no-check-certificate -O /dev/null https://A_SERVER_OVER_10GbEther/ &
/usr/ports/ftp/wget/work/stage/usr/local/bin/wget --post-file=VERY_LARGE_FILE --no-check-certificate -O /dev/null https://A_SERVER_OVER_10GbEther/ &
/usr/ports/ftp/wget/work/stage/usr/local/bin/wget --post-file=VERY_LARGE_FILE --no-check-certificate -O /dev/null https://A_SERVER_OVER_10GbEther/ &
/usr/ports/ftp/wget/work/stage/usr/local/bin/wget --post-file=VERY_LARGE_FILE --no-check-certificate -O /dev/null https://A_SERVER_OVER_10GbEther/ &

Created attachment 256331 [details]
draft patch that addresses comments #11 and #14
I have tried to create a draft patch (attached) for comments #11 (resource leak) and #14 (panic).

With the high-TX-load test plus "kldunload qat":

Before applying this patch (for #11):
qat_ocf0: unable to remove encrypt session
qat_ocf0: waiting for session un-busy failed
qat_ocf0: unable to remove encrypt session
qat_ocf0: waiting for session un-busy failed
qat_ocf0: unable to remove encrypt session
qat_ocf0: waiting for session un-busy failed
qat_ocf0: unable to remove encrypt session
qat_ocf0: waiting for session un-busy failed
qat_ocf0: unable to remove encrypt session
qat_ocf0: waiting for session un-busy failed
qat_ocf0: unable to remove encrypt session
qat_ocf0: waiting for session un-busy failed
qat_ocf0: unable to remove encrypt session
qat_ocf0: waiting for session un-busy failed
qat_ocf0: unable to remove encrypt session
qat_ocf0: detached
Warning: memory type qat_ocf leaked memory on destroy (250 allocations, 1024000 bytes leaked).
qat0: qat_dev0 stopped 6 acceleration engines
qat0: Resetting device qat_dev0
qat0: detached
After applying it (EAGAIN log messages were added for debugging this time):
% dmesg | tail
qat_ocf0: EAGAIN cpaCySymDpEnqueueOp
qat_ocf0: EAGAIN cpaCySymDpEnqueueOp
ktls_ocf_dispatch_async_cb EAGAIN DELAY
ktls_ocf_dispatch_async_cb EAGAIN DELAY
qat_ocf0: EAGAIN cpaCySymDpEnqueueOp
ktls_ocf_dispatch_async_cb EAGAIN DELAY
qat_ocf0: EAGAIN cpaCySymDpEnqueueOp
qat_ocf0: EAGAIN cpaCySymDpEnqueueOp
ktls_ocf_dispatch_async_cb EAGAIN DELAY
ktls_ocf_dispatch_async_cb EAGAIN DELAY
qat_ocf0: detached
qat0: qat_dev0 stopped 6 acceleration engines
qat0: Resetting device qat_dev0
qat0: detached
Created attachment 256343 [details]
experimental patch that utilizes extra IRQ instances of the QAT driver

I have created an experimental patch that utilizes the extra IRQ instances (with a queue for each) of the QAT driver, which might improve performance under the high load where driver stalls (EAGAIN) occur.

The patch introduces a sysctl entry to switch the behavior. With the option enabled (dev.qat.0.use_extra_instances=1):
- I see fewer of the driver stalls in question.
- I could not actually confirm a throughput improvement, because 10GbE is saturated even with the option disabled.

IRQ statistics shown in sysctl after some traffic:

<dev.qat.0.use_extra_instances: 0>
% sysctl -a | grep qat | grep irq
irq105: qat0:b0:217 @cpu0(domain0): 315445
irq106: qat0:b1:219 @cpu0(domain0): 197384
irq107: qat0:b2:221 @cpu0(domain0): 0
irq108: qat0:b3:223 @cpu0(domain0): 0
irq109: qat0:b4:225 @cpu0(domain0): 0
irq110: qat0:b5:227 @cpu0(domain0): 0
irq111: qat0:b6:229 @cpu0(domain0): 0
irq112: qat0:b7:231 @cpu0(domain0): 0
irq113: qat0:b8:233 @cpu0(domain0): 0
irq114: qat0:b9:235 @cpu0(domain0): 0
irq115: qat0:b10:237 @cpu0(domain0): 0
irq116: qat0:b11:239 @cpu0(domain0): 0
irq117: qat0:b12:241 @cpu0(domain0): 0
irq118: qat0:b13:243 @cpu0(domain0): 0
irq119: qat0:b14:245 @cpu0(domain0): 0
irq120: qat0:b15:247 @cpu0(domain0): 0
irq121: qat0:ae:249 @cpu0(domain0): 0

<dev.qat.0.use_extra_instances: 1>
% sysctl -a | grep qat | grep irq
irq105: qat0:b0:217 @cpu0(domain0): 117087
irq106: qat0:b1:219 @cpu1(domain0): 303001
irq107: qat0:b2:221 @cpu2(domain0): 631575
irq108: qat0:b3:223 @cpu3(domain0): 536653
irq109: qat0:b4:225 @cpu4(domain0): 534724
irq110: qat0:b5:227 @cpu5(domain0): 697166
irq111: qat0:b6:229 @cpu6(domain0): 708141
irq112: qat0:b7:231 @cpu7(domain0): 726079
irq113: qat0:b8:233 @cpu0(domain0): 0
irq114: qat0:b9:235 @cpu1(domain0): 0
irq115: qat0:b10:237 @cpu2(domain0): 0
irq116: qat0:b11:239 @cpu3(domain0): 0
irq117: qat0:b12:241 @cpu4(domain0): 0
irq118: qat0:b13:243 @cpu5(domain0): 0
irq119: qat0:b14:245 @cpu6(domain0): 0
irq120: qat0:b15:247 @cpu7(domain0): 0
irq121: qat0:ae:249 @cpu0(domain0): 0

Related topic found:
https://forums.freebsd.org/threads/qat-engines-and-cryptodev-usage-underutilised-card.95415/
Note that the number of engines used (discussed in that topic) is perhaps unrelated to the number of IRQs used.

By the way, the current QAT driver seems to accept more cfg_services values than the manual mentions.

--
static const char *const cfg_serv[] = {
	"sym;asym", "sym", "asym", "dc", "sym;dc", "asym;dc", "cy", "cy;dc"
};
--

man qat:
     cfg_services
             Override the device services enabled: symmetric, asymmetric,
             data compression.  Possible values: "sym", "asym", "dc",
             "sym;dc", "asym;dc", "sym;asym".  Default services configured
             are "sym;asym" for even and "dc" for odd devices.

"cy;dc" is actually accepted:

% sysctl -a | grep cfg_services
dev.qat.0.cfg_services: cy;dc
Created attachment 251744 [details]
kgdb logs (#1 and #2)

There seems to be some instability around KTLS with QAT (Atom C3000). Two kgdb logs attached.

#1
Relatively easy to reproduce in my environment: run "kldunload qat" while active KTLS sessions exist.

#2
The specific trigger is unknown. I had not experienced this panic before recently upgrading from 13.x to 14.1.