Created attachment 174985 [details] mps panic core I have done mps reinit and that is causing a panic. I am getting log "reboot after panic: mps_iocfacts_allocate failed to get IOC Facts with error 6" in the message file file. attaching a snap for the core.
Created attachment 174986 [details] fix for the issue attaching a fix for this issue. I tried this issue several times and almost 17% of time it was coming .
Comment on attachment 174986 [details] fix for the issue I triggered reinit case almost 100 times and got this issue about 17 times. After putting this fix I have triggered reinitialization over 1000 times and it is working without crashing. I was using LSI SAS2308 card.
I'm not a big fan of using msleep() to work around hardware, but can you try and put an msleep in your loop so that it pauses a bit before banging again? Thank you for your diagnosis and patch!
Are you running the latest firmware? My group was running into similar gnarly problems on ^/stable/11 with the old firmware and a new driver, but after we upgraded the firmware, everything was groovy.
Created attachment 175173 [details] Adding some delay in the previous fix. Added a delay of 1 millisecond before retry for next time in case of mps_request_sync failure. I have tested this delay thing by changing return value of mps_request_sync to failure so get go into the retry path.
(In reply to Sean Bruno from comment #3) Hi Sean, I have put an DELAY of 1 millisecond in the loop. I am not very sure about the delay timing. Can you tell that 1 ms is fine or not? (In reply to Ngie Cooper from comment #4) Hi Ngie, Yes it can be a HBA firmware issue but I think it is good only if we have fix for those situations also.(In reply to Sean Bruno from comment #3)
(In reply to prateek sethi from comment #6) Yeah, this looks like it will work. I don't have a problem with an error handler here. I'll wait a day or two for "strong objections" if there are any.
Hi Prateek, I'm not sure this is really the right thing to do to fix this because it just seems that we might be covering up a larger problem where the driver is doing something wrong. Can you attach the message file that shows all of the output from the driver prior to the panic? I can take a look and see if it gives me a clue as to what's going on. There are certain timing restrictions that need to be followed when resetting the controller - maybe those restrictions aren't being followed. Some cards are more sensitive to these restrictions than others and, if I remember correctly, the 2308 is one of them. In any case, I think we should dig a little deeper first before committing this change.
Also, it might help to have the debug_level set to 0x07 before your test. If that's too much, and you can't reproduce, then you can try 0x05, and if that's too much then the default of 0x04 will have to do. Looking at the code again, the reset timing is probably OK, but I'd still like to see exactly where this is failing. Thanks.
(In reply to Stephen McConnell from comment #9) Hi Stephen, I could not reproduce this issue now. system got panic after the mps reinit got triggered. The following are the only logs I could find (before and after the crash). Apr 27 13:34:51 Node1 kernel: mps0: Reinitializing controller Apr 27 14:52:21 Node1 syslogd: kernel boot file is /boot/kernel/kernel Apr 27 14:52:21 Node1 savecore: reboot after panic: mps_iocfacts_allocate failed to get IOC Facts with error 6 I got the message "mps0: Doorbell failed to activate" from the core analysis, which will help to find exact point of the issue. Hope these will be helpful for the further debugging. Would you please mind throwing some light on the timing restrictions, that you mentioned, to be followed when resetting the controller?
The reset timing in the driver looks fine to me. There is a requirement that the host wait a certain amount of time when it first accesses the controller during a reset, and then a certain time to wait on checking registers, etc. But, it looks fine. What doesn't make sense is that you're waiting some arbitrary amount of time after the initial failure and then it works. This time that your waiting is after the reset completes and then after some calls to other functions. After all of that, some access to the DOORBELL fails. Then, waiting 2 mSecs fixes it. That's strange. There are two ways that this will fail in Step 4 of mps_request_sync(). The first is when reading the Interrupt Status REG. If this Register does not show an interrupt within 5 seconds, it fails (that's a really long time). The second is when reading the DOORBELL REG. If the DOORBELL_USED bit is not set, it fails. I can't tell which one of these fails. But, because it fails your fix will just wait 2 mSecs and then retry, then it's successful (at least within 10 mSecs - 5 retries). What I'm wondering is, does it really matter that you have a delay between mps_request_sync() calls? To me, it looks like something is messed up in FW and just doing a retry fixes it. Now, with all of that said, I'm not sure there really is a better fix except that the delay may not need to be there. Having the delay there would make someone think that we're just not waiting long enough, which really is not the case and looks a little scary, meaning someone could think the driver timing for this is very fragile, when it's really not. Sean, let me know what you think about removing the delay. If you want the delay, I would at least say to add a comment that explains the delay and retry, since none of this is really supposed to happen and I think it's some FW or HW workaround.
A commit in branch main references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=c0e0e530ced057502f51d7a6086857305e08fae0 commit c0e0e530ced057502f51d7a6086857305e08fae0 Author: prateek sethi <prateekrootkey@gmail.com> AuthorDate: 2024-10-13 18:38:54 +0000 Commit: Warner Losh <imp@FreeBSD.org> CommitDate: 2024-10-13 21:38:01 +0000 mps/mpr: Add workaround for firmware not responding to IOC_FACTS or IOC_INIT Sometimes, especially with older firmware, mps(4) would have trouble initializing the card in one of these two steps. Add in a retry after a short delay. Sean Bruno and Stephen McConnell thought this was OK in the bug discussions, but never committed it. Steve indicated the delay might not be necessary, but the OP clearly needed to make it longer to make things work. I've kept the delay, and added the suggested comment. Ported the iocfacts part to mpr as well, since we see similar errors about once every month or two over a few thousand controllers at work. We've not seen it with IOC_INIT as far back as I can query the error log database, so I didn't port that forward. We'll see if this helps, but won't know for sure until next year (so I'm committing it now since it won't hurt and might help). We usually see this failure in connection with complicated recovery operations with a drive that's failing, though, at least in the last year's worth of failures. It's not clear this is the same as OP or not. PR: 212841 Sponsored by: Netflix Co-authored-by: imp sys/dev/mpr/mpr.c | 23 ++++++++++++++++++----- sys/dev/mps/mps.c | 38 ++++++++++++++++++++++++++++++-------- 2 files changed, 48 insertions(+), 13 deletions(-)
I neglected to flag this as MFC After, so tagged in bug.
A commit in branch stable/14 references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=b57229a80465b390a31fb30b3127c4cb93d8987f commit b57229a80465b390a31fb30b3127c4cb93d8987f Author: prateek sethi <prateekrootkey@gmail.com> AuthorDate: 2024-10-13 18:38:54 +0000 Commit: Warner Losh <imp@FreeBSD.org> CommitDate: 2024-10-16 14:19:21 +0000 mps/mpr: Add workaround for firmware not responding to IOC_FACTS or IOC_INIT Sometimes, especially with older firmware, mps(4) would have trouble initializing the card in one of these two steps. Add in a retry after a short delay. Sean Bruno and Stephen McConnell thought this was OK in the bug discussions, but never committed it. Steve indicated the delay might not be necessary, but the OP clearly needed to make it longer to make things work. I've kept the delay, and added the suggested comment. Ported the iocfacts part to mpr as well, since we see similar errors about once every month or two over a few thousand controllers at work. We've not seen it with IOC_INIT as far back as I can query the error log database, so I didn't port that forward. We'll see if this helps, but won't know for sure until next year (so I'm committing it now since it won't hurt and might help). We usually see this failure in connection with complicated recovery operations with a drive that's failing, though, at least in the last year's worth of failures. It's not clear this is the same as OP or not. PR: 212841 Sponsored by: Netflix Co-authored-by: imp (cherry picked from commit c0e0e530ced057502f51d7a6086857305e08fae0) sys/dev/mpr/mpr.c | 23 ++++++++++++++++++----- sys/dev/mps/mps.c | 38 ++++++++++++++++++++++++++++++-------- 2 files changed, 48 insertions(+), 13 deletions(-)
A commit in branch stable/13 references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=e360f8c8fecc78d8f2aa2aee46940ec5eca86c87 commit e360f8c8fecc78d8f2aa2aee46940ec5eca86c87 Author: prateek sethi <prateekrootkey@gmail.com> AuthorDate: 2024-10-13 18:38:54 +0000 Commit: Warner Losh <imp@FreeBSD.org> CommitDate: 2024-10-16 14:33:33 +0000 mps/mpr: Add workaround for firmware not responding to IOC_FACTS or IOC_INIT Sometimes, especially with older firmware, mps(4) would have trouble initializing the card in one of these two steps. Add in a retry after a short delay. Sean Bruno and Stephen McConnell thought this was OK in the bug discussions, but never committed it. Steve indicated the delay might not be necessary, but the OP clearly needed to make it longer to make things work. I've kept the delay, and added the suggested comment. Ported the iocfacts part to mpr as well, since we see similar errors about once every month or two over a few thousand controllers at work. We've not seen it with IOC_INIT as far back as I can query the error log database, so I didn't port that forward. We'll see if this helps, but won't know for sure until next year (so I'm committing it now since it won't hurt and might help). We usually see this failure in connection with complicated recovery operations with a drive that's failing, though, at least in the last year's worth of failures. It's not clear this is the same as OP or not. PR: 212841 Sponsored by: Netflix Co-authored-by: imp (cherry picked from commit c0e0e530ced057502f51d7a6086857305e08fae0) sys/dev/mpr/mpr.c | 23 ++++++++++++++++++----- sys/dev/mps/mps.c | 38 ++++++++++++++++++++++++++++++-------- 2 files changed, 48 insertions(+), 13 deletions(-)