Bug 199117 - Kernel panic when iSCSI drops
Summary: Kernel panic when iSCSI drops
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: 10.1-RELEASE
Hardware: amd64 Any
: --- Affects Only Me
Assignee: Edward Tomasz Napierala
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-04-02 16:22 UTC by Randy Westlund
Modified: 2017-11-04 13:20 UTC (History)
2 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Randy Westlund 2015-04-02 16:22:53 UTC
I'm using iSCSI to connect to a remote drive over an SSH tunnel.  When the connection drops, I get a kernel panic and reboot.  This has occurred three times in the past three days.

dmesg:

WARNING: localhost (iqn.cryo): no ping reply (NOP-In) after 5 seconds; reconnecting
WARNING: localhost (iqn.cryo): no ping reply (NOP-In) after 5 seconds; reconnecting
Fatal trap 9: general protection fault while in kernel mode
cpuid = 0; apic id = 00
instruction pointer     = 0x20:0xffffffff802e1b6e
stack pointer           = 0x28:0xfffffe0119c68ae0
frame pointer           = 0x28:0xfffffe0119c68b10
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 0 (iscsimt)
trap number             = 9
panic: general protection fault
cpuid = 0
KDB: stack backtrace:
#0 0xffffffff80963000 at kdb_backtrace+0x60
#1 0xffffffff80928125 at panic+0x155
#2 0xffffffff80d24f0f at trap_fatal+0x38f
#3 0xffffffff80d24b6c at trap+0x75c
#4 0xffffffff80d0a772 at calltrap+0x8
#5 0xffffffff81e43f47 at iscsi_session_terminate_task+0x57
#6 0xffffffff81e43ed0 at iscsi_session_cleanup+0x1c0
#7 0xffffffff81e43a78 at iscsi_maintenance_thread+0x58
#8 0xffffffff808f8b6a at fork_exit+0x9a
#9 0xffffffff80d0acae at fork_trampoline+0xe
Uptime: 15h46m39s
(da0:iscsi1:0:0:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 
(da0:iscsi1:0:0:0): CAM status: Command timeout
(da0:iscsi1:0:0:0): Error 5, Retries exhausted
(da0:iscsi1:0:0:0): Synchronize cache failed
Automatic reboot in 15 seconds - press a key on the console to abort
Rebooting...
Copyright (c) 1992-2014 The FreeBSD Project.



System info:

GENERIC kernel

uname -a

FreeBSD legion.textplain.net 10.1-RELEASE-p6 FreeBSD 10.1-RELEASE-p6 #0: Tue Feb 24 19:00:21 UTC 2015     root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC  amd64
Comment 1 Edward Tomasz Napierala freebsd_committer 2015-04-04 09:59:06 UTC
Can you obtain a crashdump and do "where" in kgdb, as documented below?  Thanks!

https://www.freebsd.org/doc/en/books/developers-handbook/kerneldebug-gdb.html
Comment 2 Randy Westlund 2015-04-04 15:45:41 UTC
(In reply to Edward Tomasz Napierala from comment #1)

I don't have any core dumps under /var/crash.  dumpdev="AUTO" in /etc/rc.conf.  I have a ZFS mirror of two drives, each with swap space.

Here's some more info:

root@legion:/var/crash # swapinfo
Device          1K-blocks     Used    Avail Capacity
/dev/ada0p2.eli   2097152        0  2097152     0%
/dev/ada1p2.eli   2097152        0  2097152     0%
Total             4194304        0  4194304     0%


root@legion:/var/crash # savecore -C -vv
unable to open bounds file, using 0
checking for kernel dump on device /dev/ada0p2.eli
mediasize = 2147483648
sectorsize = 4096
savecore: error reading last dump header at offset 2147479552 in /dev/ada0p2.eli: Invalid argument
unable to open bounds file, using 0
checking for kernel dump on device /dev/ada1p2.eli
mediasize = 2147483648
sectorsize = 4096
savecore: error reading last dump header at offset 2147479552 in /dev/ada1p2.eli: Invalid argument
No dump exists

I tried using -f, but got the same output.  I'm unable to find the solution online.  If you can help me get crashdumps working, I'll try to trigger the crash again.
Comment 3 Edward Tomasz Napierala freebsd_committer 2015-04-10 13:17:43 UTC
Could you try to hook an USB flash drive and use that for crashdump?
Comment 4 Randy Westlund 2015-04-13 17:35:01 UTC
Dumping to the flash drive worked.  I can reliably trigger the panic and I have a dump.

I use freebsd-update, so I don't have a kernel.debug and I'm not sure how to build one that matches my 10.1-RELEASE-p6 GENERIC kernel.  So here's the dump:

https://cumulus.textplain.net/index.php/s/rr8a7OQJybNjl5L


dmesg:

WARNING: localhost (iqn.cryo): no ping reply (NOP-In) after 5 seconds; reconnecting
WARNING: localhost (iqn.cryo): no ping reply (NOP-In) after 5 seconds; reconnecting
WARNING: localhost (iqn.cryo): login timed out after 61 seconds; reconnecting
WARNING: localhost (iqn.cryo): login timed out after 61 seconds; reconnecting
WARNING: localhost (iqn.cryo): login timed out after 61 seconds; reconnecting
WARNING: localhost (iqn.cryo): login timed out after 61 seconds; reconnecting
WARNING: localhost (iqn.cryo): login timed out after 61 seconds; reconnecting
WARNING: localhost (iqn.cryo): login timed out after 61 seconds; reconnecting
WARNING: localhost (iqn.cryo): login timed out after 61 seconds; reconnecting
WARNING: localhost (iqn.cryo): login timed out after 61 seconds; reconnecting
WARNING: localhost (iqn.cryo): login timed out after 61 seconds; reconnecting
WARNING: localhost (iqn.cryo): login timed out after 61 seconds; reconnecting
WARNING: localhost (iqn.cryo): login timed out after 61 seconds; reconnecting
WARNING: localhost (iqn.cryo): login timed out after 61 seconds; reconnecting
WARNING: localhost (iqn.cryo): login timed out after 61 seconds; reconnecting
WARNING: localhost (iqn.cryo): login timed out after 61 seconds; reconnecting
WARNING: localhost (iqn.cryo): login timed out after 61 seconds; reconnecting
WARNING: localhost (iqn.cryo): login timed out after 61 seconds; reconnecting
WARNING: localhost (iqn.cryo): login timed out after 61 seconds; reconnecting
WARNING: localhost (iqn.cryo): login timed out after 61 seconds; reconnecting
WARNING: localhost (iqn.cryo): login timed out after 61 seconds; reconnecting
WARNING: localhost (iqn.cryo): login timed out after 61 seconds; reconnecting
WARNING: localhost (iqn.cryo): login timed out after 61 seconds; reconnecting
WARNING: localhost (iqn.cryo): login timed out after 61 seconds; reconnecting
WARNING: localhost (iqn.cryo): login timed out after 61 seconds; reconnecting
WARNING: localhost (iqn.cryo): login timed out after 61 seconds; reconnecting
WARNING: localhost (iqn.cryo): login timed out after 61 seconds; reconnecting
WARNING: localhost (iqn.cryo): login timed out after 61 seconds; reconnecting
WARNING: localhost (iqn.cryo): login timed out after 61 seconds; reconnecting
WARNING: localhost (iqn.cryo): login timed out after 61 seconds; reconnecting
WARNING: localhost (iqn.cryo): login timed out after 61 seconds; reconnecting
WARNING: localhost (iqn.cryo): login timed out after 61 seconds; reconnecting
WARNING: localhost (iqn.cryo): login timed out after 61 seconds; reconnecting
WARNING: localhost (iqn.cryo): login timed out after 61 seconds; reconnecting
WARNING: localhost (iqn.cryo): login timed out after 61 seconds; reconnecting
WARNING: localhost (iqn.cryo): login timed out after 61 seconds; reconnecting
WARNING: localhost (iqn.cryo): login timed out after 61 seconds; reconnecting
WARNING: localhost (iqn.cryo): login timed out after 61 seconds; reconnecting
WARNING: localhost (iqn.cryo): login timed out after 61 seconds; reconnecting
WARNING: localhost (iqn.cryo): login timed out after 61 seconds; reconnecting
WARNING: localhost (iqn.cryo): login timed out after 61 seconds; reconnecting
WARNING: localhost (iqn.cryo): login timed out after 61 seconds; reconnecting
WARNING: localhost (iqn.cryo): login timed out after 61 seconds; reconnecting
WARNING: localhost (iqn.cryo): login timed out after 61 seconds; reconnecting
WARNING: localhost (iqn.cryo): login timed out after 61 seconds; reconnecting
WARNING: localhost (iqn.cryo): login timed out after 61 seconds; reconnecting
WARNING: localhost (iqn.cryo): login timed out after 61 seconds; reconnecting
WARNING: localhost (iqn.cryo): login timed out after 61 seconds; reconnecting
WARNING: localhost (iqn.cryo): login timed out after 61 seconds; reconnecting
WARNING: localhost (iqn.cryo): login timed out after 61 seconds; reconnecting
(da1:iscsi1:0:0:0): WRITE(10). CDB: 2a 00 08 20 ff 3a 00 00 1e 00 
(da1:iscsi1:0:0:0): CAM status: SCSI Status Error
(da1:iscsi1:0:0:0): SCSI status: Check Condition
(da1:iscsi1:0:0:0): SCSI sense: UNIT ATTENTION asc:29,7 (I_T nexus loss occurred)
(da1:iscsi1:0:0:0): Retrying command (per sense data)
WARNING: localhost (iqn.cryo): no ping reply (NOP-In) after 5 seconds; reconnecting
WARNING: localhost (iqn.cryo): login timed out after 61 seconds; reconnecting
WARNING: localhost (iqn.cryo): login timed out after 61 seconds; reconnecting
(da1:iscsi1:0:0:0): WRITE(10). CDB: 2a 00 08 20 d4 47 00 00 20 00 
(da1:iscsi1:0:0:0): CAM status: SCSI Status Error
(da1:iscsi1:0:0:0): SCSI status: Check Condition
(da1:iscsi1:0:0:0): SCSI sense: UNIT ATTENTION asc:29,7 (I_T nexus loss occurred)
(da1:iscsi1:0:0:0): Retrying command (per sense data)
WARNING: icl_pdu_append_data: failed to allocate mbuf for 131072 bytes
(da1:iscsi1:0:0:0): WRITE(10). CDB: 2a 00 0c 20 4d 48 00 00 20 00 
(da1:iscsi1:0:0:0): CAM status: Resource Unavailable
(da1:iscsi1:0:0:0): Retrying command
WARNING: icl_pdu_append_data: failed to allocate mbuf for 126976 bytes
(da1:iscsi1:0:0:0): WRITE(10). CDB: 2a 00 08 27 a3 73 00 00 1f 00 
(da1:iscsi1:0:0:0): CAM status: Resource Unavailable
(da1:iscsi1:0:0:0): Retrying command
WARNING: icl_pdu_append_data: failed to allocate mbuf for 126976 bytes
(da1:iscsi1:0:0:0): WRITE(10). CDB: 2a 00 08 27 a3 92 00 00 1f 00 
(da1:iscsi1:0:0:0): CAM status: Resource Unavailable
(da1:iscsi1:0:0:0): Retrying command
WARNING: localhost (iqn.cryo): no ping reply (NOP-In) after 5 seconds; reconnecting
WARNING: localhost (iqn.cryo): no ping reply (NOP-In) after 5 seconds; reconnecting
WARNING: localhost (iqn.cryo): no ping reply (NOP-In) after 5 seconds; reconnecting
Fatal trap 9: general protection fault while in kernel mode
cpuid = 0; apic id = 00
instruction pointer     = 0x20:0xffffffff802e1b6e
stack pointer           = 0x28:0xfffffe011a1c1ae0
frame pointer           = 0x28:0xfffffe011a1c1b10
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 0 (iscsimt)
trap number             = 9
panic: general protection fault
cpuid = 0
KDB: stack backtrace:
#0 0xffffffff80963000 at kdb_backtrace+0x60
#1 0xffffffff80928125 at panic+0x155
#2 0xffffffff80d24f0f at trap_fatal+0x38f
#3 0xffffffff80d24b6c at trap+0x75c
#4 0xffffffff80d0a772 at calltrap+0x8
#5 0xffffffff81e43f47 at iscsi_session_terminate_task+0x57
#6 0xffffffff81e43ed0 at iscsi_session_cleanup+0x1c0
#7 0xffffffff81e43a78 at iscsi_maintenance_thread+0x58
#8 0xffffffff808f8b6a at fork_exit+0x9a
#9 0xffffffff80d0acae at fork_trampoline+0xe
Uptime: 1d6h58m29s
(da1:iscsi1:0:0:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 
(da1:iscsi1:0:0:0): CAM status: Command timeout
(da1:iscsi1:0:0:0): Error 5, Retries exhausted
(da1:iscsi1:0:0:0): Synchronize cache failed
Dumping 954 out of 3930 MB:..2%..11%..21%..31%..41%..51%..61%..71%..81%..91%
Dump complete
Automatic reboot in 15 seconds - press a key on the console to abort



Steps to reproduce:

1) Forward iSCSI traffic over SSH

ssh -L 3260:localhost:3260 example.com


2) Connect to iSCSI device over SSH tunnel

iscsictl -A -p localhost -t iqn.cryo


3) Decrypt drive

geli attach /dev/da1p1


4) Import zpool

zpool import cryo


5) Start zfs send from local pool to iSCSI pool (I use zxfer)


6) Router at remote site crashes and reboots (unrelated problem)


7) After about 10 minutes, kernel panic on local machine


Sometimes simply killing the ssh tunnel is enough to cause the panic, but sometimes it causes zfs commands to hang until the connection is reestablished.  When the connection is lost due to the remote router crashing, the panic happens every time.

If there's anything else I can do to help, just let me know.
Comment 5 Randy Westlund 2015-05-13 03:01:29 UTC
This is still an issue on 10.1-RELEASE-p9.  Here's a stack trace.

(kgdb) bt
#0  doadump (textdump=<value optimized out>) at pcpu.h:219
#1  0xffffffff80927da2 in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:452
#2  0xffffffff80928164 in panic (fmt=<value optimized out>) at /usr/src/sys/kern/kern_shutdown.c:759
#3  0xffffffff80d24f5f in trap_fatal (frame=<value optimized out>, eva=<value optimized out>) at /usr/src/sys/amd64/amd64/trap.c:865
#4  0xffffffff80d25278 in trap_pfault (frame=0xfffffe011a263a30, usermode=<value optimized out>) at /usr/src/sys/amd64/amd64/trap.c:676
#5  0xffffffff80d248da in trap (frame=0xfffffe011a263a30) at /usr/src/sys/amd64/amd64/trap.c:440
#6  0xffffffff80d0a7c2 in calltrap () at /usr/src/sys/amd64/amd64/exception.S:232
#7  0xffffffff802e1b6e in xpt_freeze_devq (path=0x0, count=1) at /usr/src/sys/cam/cam_xpt.c:4300
#8  0xffffffff81e43f47 in iscsi_session_terminate_task (is=0xfffff800c06ca000, io=0xfffff800b53f7398, requeue=<value optimized out>) at /usr/src/sys/modules/iscsi/../../dev/iscsi/iscsi.c:297
#9  0xffffffff81e43ed0 in iscsi_session_cleanup (is=0xfffff800c06ca000, destroy_sim=<value optimized out>) at /usr/src/sys/modules/iscsi/../../dev/iscsi/iscsi.c:313
#10 0xffffffff81e43a78 in iscsi_maintenance_thread (arg=0xfffff800c06ca000) at /usr/src/sys/modules/iscsi/../../dev/iscsi/iscsi.c:396
#11 0xffffffff808f8b6a in fork_exit (callout=0xffffffff81e43a20 <iscsi_maintenance_thread>, arg=0xfffff800c06ca000, frame=0xfffffe011a263c00) at /usr/src/sys/kern/kern_fork.c:996
#12 0xffffffff80d0acfe in fork_trampoline () at /usr/src/sys/amd64/amd64/exception.S:606
#13 0x0000000000000000 in ?? ()
Comment 6 Edward Tomasz Napierala freebsd_committer 2015-05-19 10:25:03 UTC
In kgdb, can you go up to iscsi_session_terminate_task() and do "p *io" and "p *io->io_ccb"?  I didn't notice the NULL path there; looks suspicious.
Comment 7 Randy Westlund 2015-05-19 16:09:50 UTC
(In reply to Edward Tomasz Napierala from comment #6)

(kgdb) 
#8  0xffffffff81e43f47 in iscsi_session_terminate_task (is=0xfffff800c06ca000, io=0xfffff800b53f7398, requeue=<value optimized out>) at /usr/src/sys/modules/iscsi/../../dev/iscsi/iscsi.c:297
warning: Source file is more recent than executable.
297                             xpt_freeze_devq(io->io_ccb->ccb_h.path, 1);
(kgdb) p *io
$1 = {io_next = {tqe_next = 0xfffff8005b47df78, tqe_prev = 0xfffff800c06ca180}, io_ccb = 0xfffff8002e0c1800, io_received = 0, io_initiator_task_tag = 74078, io_datasn = 0}
(kgdb) p *io->io_ccb
Cannot access memory at address 0xfffff8002e0c1800
Comment 8 Edward Tomasz Napierala freebsd_committer 2016-03-17 12:45:24 UTC
Hi.  I've tried to reproduce this under 11-CURRENT and failed.  It's either that I'm doing something differently, or it was fixed in the meantime.

Could you see if this still happens under 11-CURRENT, or at least 10.3-RC2?  Thanks!
Comment 9 Edward Tomasz Napierala freebsd_committer 2016-03-18 13:41:49 UTC
Damn, I was looking in the wrong place all along.  Could you try the patch at https://reviews.freebsd.org/D5673?  Thanks!
Comment 10 commit-hook freebsd_committer 2016-03-25 16:02:24 UTC
A commit references this bug:

Author: trasz
Date: Fri Mar 25 16:01:40 UTC 2016
New revision: 297268
URL: https://svnweb.freebsd.org/changeset/base/297268

Log:
  Fix iSCSI initiator crash that could happen with out-of-memory
  conditions with in-flight IO and subsequent reconnection.

  PR:		199117
  MFC after:	1 month
  Sponsored by:	The FreeBSD Foundation
  Differential Revision:	https://reviews.freebsd.org/D5673

Changes:
  head/sys/dev/iscsi/iscsi.c