Bug 244792

Summary: [iscsi] ctladm islist leads to kernel panic if target ctl(4) port is disabled
Product: Base System Reporter: Aleksandr Fedorov <afedorov>
Component: kernAssignee: freebsd-scsi (Nobody) <scsi>
Status: New ---    
Severity: Affects Some People CC: afedorov, kib, mav, trasz
Priority: ---    
Version: CURRENT   
Hardware: amd64   
OS: Any   
Attachments:
Description Flags
iscsi_ioctl_list panic + debug info none

Description Aleksandr Fedorov freebsd_committer freebsd_triage 2020-03-13 12:28:50 UTC
Created attachment 212383 [details]
iscsi_ioctl_list panic + debug info

I found an issue which leads to kernel panic.

Test setup:

Machine 1 - ISCSI target.
Machine 2 - ISCSI initiator.

Disable ctl(4) port on target:
Machine 1# ctladm port -o off -p 3
Front End Ports disabled

After that, initiator trying to reconnect:

Machine 2# dmesg
...
(da23:iscsi4:0:0:5): Periph destroyed
(da22:iscsi4:0:0:4): Periph destroyed
(da19:iscsi4:0:0:3): Periph destroyed
(da17:iscsi4:0:0:2): Periph destroyed
WARNING: 192.168.101.1 (iqn.2018-11.com.vstack:target1): connection error; reconnecting
WARNING: 192.168.101.1 (iqn.2018-11.com.vstack:target1): connection error; reconnecting
...

If I try to list iscsi sessions on target side - kernel panics.

Machine 1# ctladm islist

Fatal trap 12: page fault while in kernel mode
cpuid = 11; apic id = 11
fault virtual address   = 0x17c
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff831bb8c3
stack pointer           = 0x28:0xfffffe01c358f780
frame pointer           = 0x28:0xfffffe01c358f810
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 27739 (ctladm)
trap number             = 12
panic: page fault
cpuid = 11
time = 1583839216
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe01c358f3e0
vpanic() at vpanic+0x185/frame 0xfffffe01c358f440
panic() at panic+0x43/frame 0xfffffe01c358f4a0
trap_fatal() at trap_fatal+0x386/frame 0xfffffe01c358f500
trap_pfault() at trap_pfault+0x99/frame 0xfffffe01c358f580
trap() at trap+0x2a7/frame 0xfffffe01c358f6b0
calltrap() at calltrap+0x8/frame 0xfffffe01c358f6b0
--- trap 0xc, rip = 0xffffffff831bb8c3, rsp = 0xfffffe01c358f780, rbp = 0xfffffe01c358f810 ---
cfiscsi_ioctl() at cfiscsi_ioctl+0x753/frame 0xfffffe01c358f810
devfs_ioctl() at devfs_ioctl+0xcc/frame 0xfffffe01c358f860
vn_ioctl() at vn_ioctl+0x132/frame 0xfffffe01c358f970
devfs_ioctl_f() at devfs_ioctl_f+0x1e/frame 0xfffffe01c358f990
kern_ioctl() at kern_ioctl+0x295/frame 0xfffffe01c358f9f0
sys_ioctl() at sys_ioctl+0x15c/frame 0xfffffe01c358fac0
amd64_syscall() at amd64_syscall+0x168/frame 0xfffffe01c358fbf0
fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe01c358fbf0
--- syscall (54, FreeBSD ELF64, sys_ioctl), rip = 0x8004c19ba, rsp = 0x7fffffffe448, rbp = 0x7fffffffeab0 ---
KDB: enter: panic

You can see full output with debug in attachment.

The panic occurs in function cfiscsi_ioctl_list() https://svnweb.freebsd.org/base/head/sys/cam/ctl/ctl_frontend_iscsi.c?revision=358333&view=markup#l1718

Due the fact that cs->cs_target pointer is NULL (see attachment).
I add some checks to prevent panic:

Machine 1# ctladm islist
  ID Portal           Initiator name                       Target name                         
   1 192.168.101.5    iqn.1994-09.org.freebsd:q1u005.z.vstack.com iqn.2018-11.com.vstack:target4      
   3 192.168.101.4    iqn.1994-09.org.freebsd:q1u004.z.vstack.com iqn.2018-11.com.vstack:target3      
   4 192.168.101.3    iqn.1994-09.org.freebsd:q1u003.z.vstack.com iqn.2018-11.com.vstack:target2      
  74 192.168.101.2    iqn.1994-09.org.freebsd:q1u002.z.vstack.com none                                
 106 192.168.101.2    iqn.1994-09.org.freebsd:q1u002.z.vstack.com none                                
 124 192.168.101.2    iqn.1994-09.org.freebsd:q1u002.z.vstack.com none                                
 130 192.168.101.2    iqn.1994-09.org.freebsd:q1u002.z.vstack.com none                                
 147 192.168.101.2    iqn.1994-09.org.freebsd:q1u002.z.vstack.com none                                
 259 192.168.101.2    iqn.1994-09.org.freebsd:q1u002.z.vstack.com none                                
 330 192.168.101.2    iqn.1994-09.org.freebsd:q1u002.z.vstack.com none

root@q1u001:~ # ps -l -p 0 -HSwww | grep cfiscsimt                                                                                                                                                
  0   0    0   0 -16  0   0 12656 cfiscsi    DLs   -      0:00.00 [kernel/cfiscsimt]                                                                                                                
  0   0    0   0 -16  0   0 12656 cfiscsi    DLs   -      0:00.00 [kernel/cfiscsimt]                                                                                                                
  0   0    0   0 -16  0   0 12656 cfiscsi    DLs   -      0:00.00 [kernel/cfiscsimt]                                                                                                                
  0   0    0   0 -16  0   0 12656 cfiscsi    DLs   -      0:00.00 [kernel/cfiscsimt]                                                                                                                
  0   0    0   0 -16  0   0 12656 cfiscsi    DLs   -      0:00.00 [kernel/cfiscsimt]                                                                                                                
  0   0    0   0 -16  0   0 12656 cfiscsi    DLs   -      0:00.00 [kernel/cfiscsimt]                                                                                                                
  0   0    0   0 -16  0   0 12656 cfiscsi    DLs   -      0:00.00 [kernel/cfiscsimt]                                                                                                                
  0   0    0   0 -16  0   0 12656 cfiscsi    DLs   -      0:00.00 [kernel/cfiscsimt]                                                                                                                
  0   0    0   0 -16  0   0 12656 cfiscsi    DLs   -      0:00.00 [kernel/cfiscsimt] 

As you can see, there are many partially initialized session and corresponding maintenance threads: https://svnweb.freebsd.org/base/head/sys/cam/ctl/ctl_frontend_iscsi.c?revision=358333&view=markup#l1161.

After some investigation I found that cs->cs_target == NULL because session doesn't terminated correctly.

A new session is created in the cfiscsi_ioctl_handoff() function: https://svnweb.freebsd.org/base/head/sys/cam/ctl/ctl_frontend_iscsi.c?revision=358333&view=markup#l1490

Find target:
...
1505 	        ct = cfiscsi_target_find(softc, cihp->target_name,
1506 	            cihp->portal_group_tag);
1507 	        if (ct == NULL) {
1508 	                ci->status = CTL_ISCSI_ERROR;
1509 	                snprintf(ci->error_str, sizeof(ci->error_str),
1510 	                    "%s: target not found", __func__);
1511 	                return;
1512 	        }
...

Create new session: allocate struct 'cs', start manteinance thread.

1539 	                cs = cfiscsi_session_new(softc, cihp->offload);
1540 	                if (cs == NULL) {
1541 	                        ci->status = CTL_ISCSI_ERROR;
1542 	                        snprintf(ci->error_str, sizeof(ci->error_str),
1543 	                            "%s: cfiscsi_session_new failed", __func__);
1544 	                        cfiscsi_target_release(ct);
1545 	                        return;
1546 	                }
...

Check if target port is online. In our case target port is offline (ct->ct_online == 0).

1583 	        if (ct->ct_online == 0) {
1584 	                mtx_unlock(&softc->lock);
1585 	                cs->cs_handoff_in_progress = false;

Terminate session: Send cv_signal() to mantainance thread, deallocate struct, etc.

1586 	                cfiscsi_session_terminate(cs);
1587 	                cfiscsi_target_release(ct);
1588 	                ci->status = CTL_ISCSI_ERROR;
1589 	                snprintf(ci->error_str, sizeof(ci->error_str),
1590 	                    "%s: port offline", __func__);
1591 	                return;
1592 	        }

The main problem is that mantainance thread not always receive cv_signal() and stuck in cv_wait().
So we have many partially initilized sessions and mantanance threads.

I see the following problems:

1. Flags cs->cs_handoff_in_progress and cs->cs_terminating which used by mantanance thread is changed without the lock.

2. As I understand cv_signal() must be called under lock:

390 /*
391  * Signal a condition variable, wakes up one waiting thread.  Will also wakeup
392  * the swapper if the process is not in memory, so that it can bring the
393  * sleeping process in.  Note that this may also result in additional threads
394  * being made runnable.  Should be called with the same mutex as was passed to
395  * cv_wait held.
396  */
397 void
398 cv_signal(struct cv *cvp)

With next patch I can't reproduce the panic:

diff --git a/sys/cam/ctl/ctl_frontend_iscsi.c b/sys/cam/ctl/ctl_frontend_iscsi.c
index d5be20c2a215..1b7837aa8355 100644
--- a/sys/cam/ctl/ctl_frontend_iscsi.c
+++ b/sys/cam/ctl/ctl_frontend_iscsi.c
@@ -1582,8 +1582,10 @@ cfiscsi_ioctl_handoff(struct ctl_iscsi *ci)
        mtx_lock(&softc->lock);
        if (ct->ct_online == 0) {
                mtx_unlock(&softc->lock);
+               CFISCSI_SESSION_LOCK(cs);
                cs->cs_handoff_in_progress = false;
                cfiscsi_session_terminate(cs);
+               CFISCSI_SESSION_UNLOCK(cs);
                cfiscsi_target_release(ct);
                ci->status = CTL_ISCSI_ERROR;
                snprintf(ci->error_str, sizeof(ci->error_str),

3. Why wee need at all to start the mantanance thread and than if ct->ct_online == 0 immidiatelly destroy it.
Can we check ct->ct_online early?