| Summary: | iSCSI initiator: Review for robustness (was: panic: isc->receive_pdu != NULL) | ||
|---|---|---|---|
| Product: | Base System | Reporter: | Ed Maste <emaste> |
| Component: | kern | Assignee: | Ed Maste <emaste> |
| Status: | Closed FIXED | ||
| Severity: | Affects Only Me | CC: | dch, jhb |
| Priority: | --- | Keywords: | needs-qa |
| Version: | CURRENT | ||
| Hardware: | Any | ||
| OS: | Any | ||
| See Also: | https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=259152 | ||
| Bug Depends on: | |||
| Bug Blocks: | 259152 | ||
|
Description
Ed Maste
2021-10-25 15:55:00 UTC
For clarity, this issue is not reproducible with PR259355 fixed (commit fc79cf4fea72). It should be reproducible by reverting fc79cf4fea72 and testing against the Oracle Cloud iSCSI target, or possibly against a modified ctld target that's modified to avoid sending MaxRecvDataSegmentLength. That said, this PR is probably best taken as a request to review and make the initiator more robust. ^Triage: Update summary to reflect request and resolution @Ed Are there people from which review can be requested? Please add them to maintainer-feedback ? <email>, of which their can be multiple (comma separate) If you're otherwise going to be coordinating something out of band, feel free to self-assign Note that in my testing (and in Chelsio's internal QA) the initiator has generally been found to be fairly robust for many failure modes. I do think one panic I've seen is that there might be some cases (bad header CRC?) where it can try to abort the connection without freeing the current WIP PDU but due to data already in flight on the socket it tries to process another PDU and trips over the WIP PDU. In fact, I think that's the panic you got here. I think at one point I made it just bail here locally rather than the assertion failure for this case. That's the only panic I have run into when using the software target or initiator though.
Something like this:
diff --git a/sys/dev/iscsi/icl_soft.c b/sys/dev/iscsi/icl_soft.c
index 45d2e53d8db0..3566aa07c7c7 100644
--- a/sys/dev/iscsi/icl_soft.c
+++ b/sys/dev/iscsi/icl_soft.c
@@ -577,8 +577,12 @@ icl_conn_receive_pdu(struct icl_soft_conn *isc, struct mbuf **r, size_t *rs)
bool more_needed;
if (isc->receive_state == ICL_CONN_STATE_BHS) {
- KASSERT(isc->receive_pdu == NULL,
- ("isc->receive_pdu != NULL"));
+ if (isc->receive_pdu != NULL) {
+ ICL_DEBUG("dropping PDU received after error");
+ icl_conn_fail(ic);
+ return (NULL);
+ }
+
request = icl_soft_conn_new_pdu(ic, M_NOWAIT);
if (request == NULL) {
ICL_DEBUG("failed to allocate PDU; "
(In reply to John Baldwin from comment #3) > I do think one panic I've seen is that there might be some cases (bad header > CRC?) where it can try to abort the connection without freeing the current WIP > PDU but due to data already in flight on the socket it tries to process another > PDU and trips over the WIP PDU. In fact, I think that's the panic you got > here. Yes this is the panic we encountered and I believe your proposed patch will solve it. Since you're confident in the initiator's robustness via review or testing I think there's nothing to do for this PR beyond the patch. I'm a bit less confident in my patch after looking this more. While I did use something like this as a hack when testing a bug I had in cxgbei, I'm less certain how we got into this case in the first place. It would seem that icl_conn_start() has run while receive_pdu was still != NULL, but that shouldn't happen as during a reconnect the old socket should first be torn down via icl_conn_stop() which will free receive_pdu and clear it to NULL first. I think we could close this now, & re-open if necessary? |