Bug 280037

Summary:   KTLS with Intel QAT may trigger kernel panics
Product:   Base System
Component: kern
Reporter:  ss3bsd <3226388001>
Assignee:  freebsd-net (Nobody) <net>
Status:    Open
Severity:  Affects Only Me
Priority:  ---
Keywords:  crash
Version:   14.2-RELEASE
Hardware:  Any
OS:        Any
CC:        lexi, markj

Attachments:
- kgdb logs (#1 and #2)
- kgdb log #3
- kgdb #3 with clean kernel
- draft patch that addresses comments 11 and 14
- experimental patch that utilizes extra IRQ instances of the QAT driver

Description ss3bsd 2024-06-28 13:44:03 UTC
Created attachment 251744 [details]
kgdb logs (#1 and #2)

There seems to be an instability issue around KTLS with QAT (Atom C3000).

Two kgdb logs attached.

 #1 

   Relatively easy to reproduce in my environment: run
     # kldunload qat
   while active KTLS sessions exist.

 #2 

   The specific trigger is unknown.
   I had not experienced this panic before recently upgrading from 13.x to 14.1.
Comment 1 Mark Johnston freebsd_committer freebsd_triage 2024-06-28 13:50:42 UTC
For the second crash, could you please print *crp from frame 14?
Comment 2 ss3bsd 2024-06-28 14:24:58 UTC
Hello.

Is this what you need?
I'm not familiar with kernel debugging.


#14 0xffffffff80c83fb0 in crypto_dispatch_one (crp=crp@entry=0xfffffe014b081c38, hint=0) at /usr/src/sys/opencrypto/crypto.c:1432

(kgdb) p *0xfffffe014b081c38
$1 = 0
Comment 3 Mark Johnston freebsd_committer freebsd_triage 2024-06-28 14:26:41 UTC
(In reply to ss3bsd from comment #2)
Please try running the following commands:

(kgdb) frame 14
(kgdb) p/x *crp

and share the output.

Also, is the second panic reproducible?  Does it happen as soon as a KTLS session starts, or does it take some time to trigger the crash?
Comment 4 ss3bsd 2024-06-28 14:30:21 UTC
(kgdb) frame 14
#14 0xffffffff80c83fb0 in crypto_dispatch_one (crp=crp@entry=0xfffffe014b081c38, hint=0) at /usr/src/sys/opencrypto/crypto.c:1432
1432                    result = crypto_invoke(cap, crp, hint);
(kgdb) p/x *crp
$1 = {crp_next = {tqe_next = 0x0, tqe_prev = 0x0}, crp_task = {ta_link = {stqe_next = 0x0}, ta_pending = 0x0, ta_priority = 0x0, ta_flags = 0x0, ta_func = 0x0,
    ta_context = 0x0}, crp_session = 0xfffff8079d037c80, crp_olen = 0x0, crp_etype = 0x0, crp_flags = 0x10, crp_op = 0x2, crp_buf = {{{cb_buf = 0xfffffe014b081be0,
        cb_buf_len = 0x0}, cb_mbuf = 0xfffffe014b081be0, {cb_vm_page = 0xfffffe014b081be0, cb_vm_page_len = 0x0, cb_vm_page_offset = 0x0}, cb_uio = 0xfffffe014b081be0},
    cb_type = 0x2}, crp_obuf = {{{cb_buf = 0x0, cb_buf_len = 0x0}, cb_mbuf = 0x0, {cb_vm_page = 0x0, cb_vm_page_len = 0x0, cb_vm_page_offset = 0x0}, cb_uio = 0x0},
    cb_type = 0x0}, crp_aad = 0x0, crp_aad_start = 0x0, crp_aad_length = 0x0, crp_esn = {0x0, 0x0, 0x0, 0x0}, crp_iv_start = 0x0, crp_payload_start = 0x0,
  crp_payload_output_start = 0x0, crp_payload_length = 0x501, crp_digest_start = 0x501, crp_iv = {0x0 <repeats 64 times>}, crp_cipher_key = 0x0, crp_auth_key = 0x0,
  crp_opaque = 0xfffffe014b081b98, crp_callback = 0xffffffff80c8c4d0, crp_tstamp = {sec = 0x0, frac = 0x0}, crp_seq = 0x0, crp_retw_id = 0x3}
Comment 5 ss3bsd 2024-06-28 14:38:45 UTC
> Also, is the second panic reproducible?  Does it happen as soon as a KTLS session starts, or does it take some time to trigger the crash?


I upgraded from 13.3 to 14.1 on 6/19, and this has (perhaps) occurred twice since.

There is no dump for the first one; I enabled kernel dumps only after the first occurrence.

And, "it takes some time to trigger the crash."
(The machine is my multi-purpose home router/server/nas etc.)

I'm now running the same machine with 
 kern.ipc.tls.cbc_enable=0 
to see if the stability changes.
Comment 6 Mark Johnston freebsd_committer freebsd_triage 2024-06-28 14:47:13 UTC
(In reply to ss3bsd from comment #4)
Thank you.  Could we also see

(kgdb) frame 14
(kgdb) p/x *crp->crp_buf.cb_uio

output?
Comment 7 ss3bsd 2024-06-28 14:50:04 UTC
(kgdb) frame 14
#14 0xffffffff80c83fb0 in crypto_dispatch_one (crp=crp@entry=0xfffffe014b081c38, hint=0) at /usr/src/sys/opencrypto/crypto.c:1432
1432                    result = crypto_invoke(cap, crp, hint);
(kgdb) p/x *crp->crp_buf.cb_uio
$1 = {uio_iov = 0xfffff8079cb21c80, uio_iovcnt = 0x3, uio_offset = 0x0, uio_resid = 0x53d, uio_segflg = 0x1, uio_rw = 0x0, uio_td = 0xfffff8010459d000}
(kgdb)
Comment 8 ss3bsd 2024-06-29 15:51:56 UTC
Created attachment 251776 [details]
kgdb log #3

Another panic log attached.

 #3

   Tried to change
    sysctl dev.qat_ocf.0.enable: 1 -> 0
   then, panic.
Comment 9 ss3bsd 2024-06-30 04:41:36 UTC
Created attachment 251786 [details]
kgdb #3 with clean kernel

I'm reuploading the #3 log because the previous one was acquired on a dirty kernel with a private patch that force-enables TCP LRO in a forwarding environment.

The new log shows a similar trace without the tcp_lro part.
Comment 10 ss3bsd 2024-06-30 05:38:07 UTC
To be clear,

 Panic #1:
   Reproducible with a clean kernel.
   I was aware that "kldunload qat" might trigger a panic on 13.x too.

 Panic #2:
   The log was acquired on a dirty kernel.
   It is intermittent and so far difficult to reproduce intentionally; I may not be able to offer another log from a clean kernel.

   Therefore, at this moment I cannot be sure that my patch is not a factor. What I can tell is that 13.x ran stably with KTLS and QAT, and I had not experienced this before upgrading to 14.1, even with the same kind of private patch.

 Panic #3:
   Reproducible with a clean kernel.
   I did not try switching dev.qat_ocf.0.enable to 0 when the host was running 13.x, so I cannot tell whether the issue is common to 14.1 and 13.x. (The sysctl may not even exist in 13.x?)
Comment 11 ss3bsd 2024-06-30 09:12:18 UTC
I do not know whether it is related to the panic, but I found this on the console:

 qat_ocf0: waiting for session un-busy failed
 qat_ocf0: unable to remove encrypt session
 qat_ocf0: waiting for session un-busy failed
 qat_ocf0: unable to remove encrypt session
Comment 12 ss3bsd 2024-07-18 10:38:05 UTC
> I'm now running the same machine with 
> kern.ipc.tls.cbc_enable=0 
> to see if the stability changes.

The machine ran for 2 weeks without a panic.

A couple of days ago I changed kern.ipc.tls.cbc_enable back from 0 to 1.

Today the machine panicked again in a CBC-related part of the code (ktls_ocf_tls_cbc_decrypt).

So the cause might be something specific to CBC. (Not strong evidence, though.)


--

Unread portion of the kernel message buffer:
trap number             = 12
panic: page fault
cpuid = 1
time = 1721297633
KDB: stack backtrace:
#0 0xffffffff809d2b5d at kdb_backtrace+0x5d
#1 0xffffffff809858c1 at vpanic+0x131
#2 0xffffffff80985783 at panic+0x43
#3 0xffffffff80e5f91b at trap_fatal+0x40b
#4 0xffffffff80e5f966 at trap_pfault+0x46
#5 0xffffffff80e36288 at calltrap+0x8
#6 0xffffffff80e21cc8 at bounce_bus_dmamap_load_buffer+0x178
#7 0xffffffff809c8a07 at bus_dmamap_load_crp_buffer+0x237
#8 0xffffffff82ddb7ad at qat_ocf_process+0x40d
#9 0xffffffff80c83fd0 at crypto_dispatch+0x60
#10 0xffffffff80c8c3cd at ktls_ocf_dispatch+0x5d
#11 0xffffffff80c8d3f4 at ktls_ocf_tls_cbc_decrypt+0x344
#12 0xffffffff80a1a484 at ktls_work_thread+0x664
#13 0xffffffff8093fc7f at fork_exit+0x7f
#14 0xffffffff80e372ee at fork_trampoline+0xe
Uptime: 2d17h13m52s
Comment 13 ss3bsd 2024-07-18 10:48:24 UTC
By the way, is there an easy way to disable only the CBC acceleration of QAT while keeping KTLS offload enabled?

I want to try that next if there is.
Comment 14 ss3bsd 2024-12-31 10:38:15 UTC
Another (reproducible) panic under high TX load was found.

Is something (such as an infinite loop) happening in this path?

***
	/*
	 * Caller marked the request to be processed immediately; dispatch it
	 * directly to the driver unless the driver is currently blocked, in
	 * which case it is queued for deferred dispatch.
	 */
	cap = crp->crp_session->cap;
	if (!atomic_load_int(&cap->cc_qblocked)) {
		result = crypto_invoke(cap, crp, hint);
		if (result != ERESTART)
			return (result);

		/*
		 * The driver ran out of resources, put the request on the
		 * queue.
		 */
	}
**

--

Now on FreeBSD 14.2


(kgdb) backtrace
#0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:57
#1  doadump (textdump=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:405
#2  0xffffffff809908e7 in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:523
#3  0xffffffff80990dbe in vpanic (fmt=0xffffffff8100a112 "double fault", ap=ap@entry=0xfffffe003a415e40) at /usr/src/sys/kern/kern_shutdown.c:967
#4  0xffffffff80990c13 in panic (fmt=<unavailable>) at /usr/src/sys/kern/kern_shutdown.c:891
#5  0xffffffff80e87f6e in dblfault_handler (frame=<optimized out>) at /usr/src/sys/amd64/amd64/trap.c:1012
#6  <signal handler called>
#7  0xffffffff80e49460 in bounce_bus_dmamap_load_buffer (dmat=0xfffff801ceb14700, map=0xffffffff81a27d90 <nobounce_dmamap>, buf=0xfffff807af688ad8, buflen=5,
    pmap=0xffffffff817514d0 <kernel_pmap_store>, flags=1, segs=0x0, segp=0xfffffe014e2c90d4) at /usr/src/sys/x86/x86/busdma_bounce.c:741
#8  0xffffffff809d4699 in _bus_dmamap_load_buffer (dmat=0xfffff801ceb14700, map=0x0, buf=0xfffff807af688ad8, buflen=18446744071589494160, flags=1, segs=0x0,
    segp=0xfffffe014e2c90d4, pmap=<optimized out>) at /usr/src/sys/x86/include/bus_dma.h:166
#9  _bus_dmamap_load_mbuf_epg (dmat=dmat@entry=0xfffff801ceb14700, map=map@entry=0x0, m=0xfffff807af688a00, segs=segs@entry=0x0, nsegs=nsegs@entry=0xfffffe014e2c90d4,
    flags=1) at /usr/src/sys/kern/subr_bus_dma.c:176
#10 0xffffffff809d4431 in _bus_dmamap_load_single_mbuf (dmat=0xfffff801ceb14700, map=0x0, m=0xfffff807af688ad8, segs=0x0, nsegs=0xfffffe014e2c90d4, flags=1)
    at /usr/src/sys/kern/subr_bus_dma.c:220
#11 bus_dmamap_load_crp_buffer (dmat=0xfffff801ceb14700, map=0x0, cb=<optimized out>, callback=0xffffffff83000840 <qat_ocf_crypto_load_buf_cb>,
    callback_arg=0xfffffe014e2c9138, flags=1) at /usr/src/sys/kern/subr_bus_dma.c:554
#12 0xffffffff82fff7ad in qat_ocf_process () from /boot/kernel/qat.ko
#13 0xffffffff80c99700 in crypto_dispatch_one (crp=0xfffff807a8af10b8, hint=0) at /usr/src/sys/opencrypto/crypto.c:1432
#14 crypto_dispatch (crp=0xfffff807a8af10b8) at /usr/src/sys/opencrypto/crypto.c:1448
#15 0xffffffff80ca1cf7 in ktls_ocf_dispatch_async_cb (crp=0xfffff801ceb14700) at /usr/src/sys/opencrypto/ktls_ocf.c:246
#16 0xffffffff82fff801 in qat_ocf_process () from /boot/kernel/qat.ko
#17 0xffffffff80c99700 in crypto_dispatch_one (crp=0xfffff807a8af10b8, hint=0) at /usr/src/sys/opencrypto/crypto.c:1432
#18 crypto_dispatch (crp=0xfffff807a8af10b8) at /usr/src/sys/opencrypto/crypto.c:1448
#19 0xffffffff80ca1cf7 in ktls_ocf_dispatch_async_cb (crp=0xfffff801ceb14700) at /usr/src/sys/opencrypto/ktls_ocf.c:246
#20 0xffffffff82fff801 in qat_ocf_process () from /boot/kernel/qat.ko
#21 0xffffffff80c99700 in crypto_dispatch_one (crp=0xfffff807a8af10b8, hint=0) at /usr/src/sys/opencrypto/crypto.c:1432
#22 crypto_dispatch (crp=0xfffff807a8af10b8) at /usr/src/sys/opencrypto/crypto.c:1448

: frames omitted; the same cycle repeats :

#245 0xffffffff80c99700 in crypto_dispatch_one (crp=0xfffff807a8af10b8, hint=0) at /usr/src/sys/opencrypto/crypto.c:1432
#246 crypto_dispatch (crp=0xfffff807a8af10b8) at /usr/src/sys/opencrypto/crypto.c:1448
#247 0xffffffff80ca1cf7 in ktls_ocf_dispatch_async_cb (crp=0xfffff801ceb14700) at /usr/src/sys/opencrypto/ktls_ocf.c:246
#248 0xffffffff82fff801 in qat_ocf_process () from /boot/kernel/qat.ko
#249 0xffffffff80c99700 in crypto_dispatch_one (crp=crp@entry=0xfffff807a8af10b8, hint=0) at /usr/src/sys/opencrypto/crypto.c:1432
#250 crypto_dispatch (crp=crp@entry=0xfffff807a8af10b8) at /usr/src/sys/opencrypto/crypto.c:1448
#251 0xffffffff80ca16bd in ktls_ocf_dispatch_async (state=0xfffff801ceaf8420, crp=0xfffff807a8af10b8) at /usr/src/sys/opencrypto/ktls_ocf.c:268
#252 ktls_ocf_tls13_aead_encrypt (state=0xfffff801ceaf8420, tls=0xfffff805c1fec108, m=0xfffffe014c5687c0, outiov=0xfffff801ceaf8000, outiovcnt=0)
    at /usr/src/sys/opencrypto/ktls_ocf.c:856
#253 0xffffffff80a2807e in ktls_encrypt_async (wq=wq@entry=0xfffff804cc22c680, top=top@entry=0xfffff807af688a00) at /usr/src/sys/kern/uipc_ktls.c:3108
#254 0xffffffff80a27a36 in ktls_work_thread (ctx=ctx@entry=0xfffff804cc22c680) at /usr/src/sys/kern/uipc_ktls.c:3254
#255 0xffffffff8094a75f in fork_exit (callout=0xffffffff80a26a50 <ktls_work_thread>, arg=0xfffff804cc22c680, frame=0xfffffe014e2ccf40) at /usr/src/sys/kern/kern_fork.c:1164
#256 <signal handler called>
Comment 15 ss3bsd 2024-12-31 11:29:37 UTC
(In reply to ss3bsd from comment #14)

My procedure for reproducing the panic above:

1. Modify wget from ports to use KTLS and build it.

work/wget-1.25.0/src/openssl.c:
**
 ssl_init (void)
 {
   SSL_METHOD const *meth = NULL;
-  long ssl_options = 0;
+  long ssl_options = SSL_OP_ENABLE_KTLS;
   char *ciphers_string = NULL;
 #if !defined(LIBRESSL_VERSION_NUMBER) && (OPENSSL_VERSION_NUMBER >= 0x10100000L)
   int ssl_proto_version = 0;
**


2. Run several POST instances:

 /usr/ports/ftp/wget/work/stage/usr/local/bin/wget --post-file=VERY_LARGE_FILE --no-check-certificate -O /dev/null https://A_SERVER_OVER_10GbEther/ & 
 /usr/ports/ftp/wget/work/stage/usr/local/bin/wget --post-file=VERY_LARGE_FILE --no-check-certificate -O /dev/null https://A_SERVER_OVER_10GbEther/ & 
 /usr/ports/ftp/wget/work/stage/usr/local/bin/wget --post-file=VERY_LARGE_FILE --no-check-certificate -O /dev/null https://A_SERVER_OVER_10GbEther/ & 
 /usr/ports/ftp/wget/work/stage/usr/local/bin/wget --post-file=VERY_LARGE_FILE --no-check-certificate -O /dev/null https://A_SERVER_OVER_10GbEther/ & 
 /usr/ports/ftp/wget/work/stage/usr/local/bin/wget --post-file=VERY_LARGE_FILE --no-check-certificate -O /dev/null https://A_SERVER_OVER_10GbEther/ &
Comment 16 ss3bsd 2025-01-01 23:51:37 UTC
Created attachment 256331 [details]
draft patch that addresses comments 11 and 14

I have tried to create a draft patch (attached) for comments 11 (resource leak) and 14 (panic).


With the high TX load test and kldunload qat:

Before applying this patch (for comment 11):

qat_ocf0: unable to remove encrypt session                                                          
qat_ocf0: waiting for session un-busy failed                                                        
qat_ocf0: unable to remove encrypt session                                                          
qat_ocf0: waiting for session un-busy failed                                                        
qat_ocf0: unable to remove encrypt session                                                          
qat_ocf0: waiting for session un-busy failed                                                        
qat_ocf0: unable to remove encrypt session                                                          
qat_ocf0: waiting for session un-busy failed                                                        
qat_ocf0: unable to remove encrypt session                                                          
qat_ocf0: waiting for session un-busy failed                                                        
qat_ocf0: unable to remove encrypt session                                                          
qat_ocf0: waiting for session un-busy failed                                                        
qat_ocf0: unable to remove encrypt session                                                          
qat_ocf0: waiting for session un-busy failed                                                        
qat_ocf0: unable to remove encrypt session                                                          
qat_ocf0: detached                                                                                  
Warning: memory type qat_ocf leaked memory on destroy (250 allocations, 1024000 bytes leaked).      
qat0: qat_dev0 stopped 6 acceleration engines                                                       
qat0: Resetting device qat_dev0                                                                     
qat0: detached                                                                                     
                                                                                                    


After (EAGAIN log messages are added this time for debugging):

% dmesg | tail 
qat_ocf0: EAGAIN cpaCySymDpEnqueueOp
qat_ocf0: EAGAIN cpaCySymDpEnqueueOp
ktls_ocf_dispatch_async_cb EAGAIN DELAY
ktls_ocf_dispatch_async_cb EAGAIN DELAY
qat_ocf0: EAGAIN cpaCySymDpEnqueueOp
ktls_ocf_dispatch_async_cb EAGAIN DELAY
qat_ocf0: EAGAIN cpaCySymDpEnqueueOp
qat_ocf0: EAGAIN cpaCySymDpEnqueueOp
ktls_ocf_dispatch_async_cb EAGAIN DELAY
ktls_ocf_dispatch_async_cb EAGAIN DELAY
qat_ocf0: detached
qat0: qat_dev0 stopped 6 acceleration engines
qat0: Resetting device qat_dev0
qat0: detached
Comment 17 ss3bsd 2025-01-02 11:29:54 UTC
Created attachment 256343 [details]
experimental patch that utilizes extra IRQ instances of QAT driver

I have created an experimental patch that utilizes extra IRQ instances (each with its own queue) of the QAT driver, which might improve performance under high load, where driver stalls (EAGAIN) occur.

This patch introduces a sysctl entry to switch the behavior.

With the option enabled (dev.qat.0.use_extra_instances=1):
 - I see fewer of the driver stalls in question.
 - I could not confirm an actual throughput improvement, because 10GbE is saturated even with the option disabled.


IRQ statistics shown in sysctl after some traffic:

<dev.qat.0.use_extra_instances: 0>

% sysctl -a | grep qat | grep irq

irq105: qat0:b0:217 @cpu0(domain0): 315445
irq106: qat0:b1:219 @cpu0(domain0): 197384
irq107: qat0:b2:221 @cpu0(domain0): 0
irq108: qat0:b3:223 @cpu0(domain0): 0
irq109: qat0:b4:225 @cpu0(domain0): 0
irq110: qat0:b5:227 @cpu0(domain0): 0
irq111: qat0:b6:229 @cpu0(domain0): 0
irq112: qat0:b7:231 @cpu0(domain0): 0
irq113: qat0:b8:233 @cpu0(domain0): 0
irq114: qat0:b9:235 @cpu0(domain0): 0
irq115: qat0:b10:237 @cpu0(domain0): 0
irq116: qat0:b11:239 @cpu0(domain0): 0
irq117: qat0:b12:241 @cpu0(domain0): 0
irq118: qat0:b13:243 @cpu0(domain0): 0
irq119: qat0:b14:245 @cpu0(domain0): 0
irq120: qat0:b15:247 @cpu0(domain0): 0
irq121: qat0:ae:249 @cpu0(domain0): 0


<dev.qat.0.use_extra_instances: 1>

% sysctl -a | grep qat | grep irq
irq105: qat0:b0:217 @cpu0(domain0): 117087
irq106: qat0:b1:219 @cpu1(domain0): 303001
irq107: qat0:b2:221 @cpu2(domain0): 631575
irq108: qat0:b3:223 @cpu3(domain0): 536653
irq109: qat0:b4:225 @cpu4(domain0): 534724
irq110: qat0:b5:227 @cpu5(domain0): 697166
irq111: qat0:b6:229 @cpu6(domain0): 708141
irq112: qat0:b7:231 @cpu7(domain0): 726079
irq113: qat0:b8:233 @cpu0(domain0): 0
irq114: qat0:b9:235 @cpu1(domain0): 0
irq115: qat0:b10:237 @cpu2(domain0): 0
irq116: qat0:b11:239 @cpu3(domain0): 0
irq117: qat0:b12:241 @cpu4(domain0): 0
irq118: qat0:b13:243 @cpu5(domain0): 0
irq119: qat0:b14:245 @cpu6(domain0): 0
irq120: qat0:b15:247 @cpu7(domain0): 0
irq121: qat0:ae:249 @cpu0(domain0): 0


Related topic found: https://forums.freebsd.org/threads/qat-engines-and-cryptodev-usage-underutilised-card.95415/

Note that the number of engines used (discussed in that topic) is probably unrelated to the number of IRQs used.
Comment 18 ss3bsd 2025-01-02 12:12:42 UTC
By the way, the current QAT driver seems to accept more cfg_services values than the manual mentions.


--
static const char *const cfg_serv[] =
    { "sym;asym", "sym", "asym", "dc", "sym;dc", "asym;dc", "cy", "cy;dc" };
--
man qat:
     cfg_services
             Override the device services enabled: symmetric, asymmetric, data
             compression.  Possible values: "sym", "asym", "dc", "sym;dc",
             "asym;dc", "sym;asym".  Default services configured are
             "sym;asym" for even and "dc" for odd devices.
--


"cy;dc" is actually accepted:

 % sysctl -a | grep cfg_services
 dev.qat.0.cfg_services: cy;dc
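The cfg_services sysctl appears to be read-only at runtime, so on my understanding it has to be set as a boot-time tunable, e.g. in /boot/loader.conf (a sketch — check qat(4) on your release before relying on this):

```
# /boot/loader.conf -- assumption: cfg_services is a per-device loader tunable.
# "cy" covers crypto (sym+asym), "dc" is data compression; "cy;dc" is
# accepted by the driver even though the manual does not list it.
dev.qat.0.cfg_services="cy;dc"
```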