Bug 259440 - iSCSI initiator: Review for robustness (was: panic: isc->receive_pdu != NULL)
Summary: iSCSI initiator: Review for robustness (was: panic: isc->receive_pdu != NULL)
Status: In Progress
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: CURRENT
Hardware: Any Any
: --- Affects Only Me
Assignee: Ed Maste
URL:
Keywords: needs-qa
Depends on:
Blocks: 259152
  Show dependency treegraph
 
Reported: 2021-10-25 15:55 UTC by Ed Maste
Modified: 2021-12-04 08:52 UTC (History)
2 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Ed Maste freebsd_committer 2021-10-25 15:55:00 UTC
Split out from PR259152 comment #9. After we drop the connection (due to PR259355) we end up in an inconsistent state, and panic:

WARNING: icl_conn_receive_pdu: received data segment length 16384 is larger than negotiated; dropping connection

WARNING: 169.254.2.2:3260 (iqn.2015-12.com.oracleiaas:f896d637-0a99-4fc7-9aed-ecfc8c2166f2): connection error; reconnecting
panic: isc->receive_pdu != NULL
cpuid = 1
time = 1634815845
KDB: stack backtrace:
db_trace_self() at db_trace_self
db_trace_self_wrapper() at db_trace_self_wrapper+0x30
vpanic() at vpanic+0x184
panic() at panic+0x44
icl_receive_thread() at icl_receive_thread+0x734
fork_exit() at fork_exit+0x74
fork_trampoline() at fork_trampoline+0x14
KDB: enter: panic
[ thread pid 0 tid 100279 ]
Stopped at      kdb_enter+0x44: undefined       f901811f
db>
Comment 1 Ed Maste freebsd_committer 2021-10-25 21:40:22 UTC
For clarity, this issue is not reproducible with PR259355 fixed (commit fc79cf4fea72). It should be reproducible by reverting fc79cf4fea72 and testing against the Oracle Cloud iSCSI target, or possibly against a modified ctld target that's modified to avoid sending MaxRecvDataSegmentLength.

That said, this PR is probably best taken as a request to review and make the initiator more robust.
Comment 2 Kubilay Kocak freebsd_committer freebsd_triage 2021-10-26 00:46:30 UTC
^Triage: Update summary to reflect request and resolution

@Ed Are there people from which review can be requested? Please add them to maintainer-feedback ? <email>, of which their can be multiple (comma separate)

If you're otherwise going to be coordinating something out of band, feel free to self-assign
Comment 3 John Baldwin freebsd_committer freebsd_triage 2021-10-26 18:26:11 UTC
Note that in my testing (and in Chelsio's internal QA) the initiator has generally been found to be fairly robust for many failure modes.  I do think one panic I've seen is that there might be some cases (bad header CRC?) where it can try to abort the connection without freeing the current WIP PDU but due to data already in flight on the socket it tries to process another PDU and trips over the WIP PDU.  In fact, I think that's the panic you got here.  I think at one point I made it just bail here locally rather than the assertion failure for this case.  That's the only panic I have run into when using the software target or initiator though.

Something like this:

diff --git a/sys/dev/iscsi/icl_soft.c b/sys/dev/iscsi/icl_soft.c
index 45d2e53d8db0..3566aa07c7c7 100644
--- a/sys/dev/iscsi/icl_soft.c
+++ b/sys/dev/iscsi/icl_soft.c
@@ -577,8 +577,12 @@ icl_conn_receive_pdu(struct icl_soft_conn *isc, struct mbuf **r, size_t *rs)
 	bool more_needed;
 
 	if (isc->receive_state == ICL_CONN_STATE_BHS) {
-		KASSERT(isc->receive_pdu == NULL,
-		    ("isc->receive_pdu != NULL"));
+		if (isc->receive_pdu != NULL) {
+			ICL_DEBUG("dropping PDU received after error");
+			icl_conn_fail(ic);
+			return (NULL);
+		}
+			
 		request = icl_soft_conn_new_pdu(ic, M_NOWAIT);
 		if (request == NULL) {
 			ICL_DEBUG("failed to allocate PDU; "
Comment 4 Ed Maste freebsd_committer 2021-10-26 18:31:15 UTC
(In reply to John Baldwin from comment #3)
> I do think one panic I've seen is that there might be some cases (bad header
> CRC?) where it can try to abort the connection without freeing the current WIP
> PDU but due to data already in flight on the socket it tries to process another
> PDU and trips over the WIP PDU.  In fact, I think that's the panic you got
> here.

Yes this is the panic we encountered and I believe your proposed patch will solve it. Since you're confident in the initiator's robustness via review or testing I think there's nothing to do for this PR beyond the patch.
Comment 5 John Baldwin freebsd_committer freebsd_triage 2021-10-26 21:50:35 UTC
I'm a bit less confident in my patch after looking this more.  While I did use something like this as a hack when testing a bug I had in cxgbei, I'm less certain how we got into this case in the first place.  It would seem that icl_conn_start() has run while receive_pdu was still != NULL, but that shouldn't happen as during a reconnect the old socket should first be torn down via icl_conn_stop() which will free receive_pdu and clear it to NULL first.
Comment 6 Dave Cottlehuber freebsd_committer 2021-12-04 08:52:56 UTC
I think we could close this now, & re-open if necessary?