Bug 212841

Summary:

getting panic during mps reinitialization.

Product:

Base System

Reporter:

prateek sethi <prateekrootkey>

Component:

kern

Assignee:

Warner Losh <imp>

Status:

Closed FIXED

Severity:

Affects Only Me

CC:

imp, ngie, prateekrootkey, sbruno, scsi, slm

Priority:

---

Keywords:

crash

Version:

9.2-RELEASE

Flags:

imp: mfc-stable14+
imp: mfc-stable13+

Hardware:

amd64

OS:

Any

Attachments:

Description	Flags
mps panic core	none
fix for the issue	none
Adding some delay in the previous fix.	none

Description prateek sethi 2016-09-20 06:52:54 UTC

Created attachment 174985 [details]
mps panic core

I have done mps reinit and that is causing a panic.
I am getting log "reboot after panic: mps_iocfacts_allocate failed to get IOC Facts with error 6" in the message file file.

attaching a snap for the core.

Comment 1 prateek sethi 2016-09-20 06:56:40 UTC

Created attachment 174986 [details]
fix for the issue

attaching a fix for this issue. I tried this issue several times and almost 17% of time it was coming .

Comment 2 prateek sethi 2016-09-20 07:07:22 UTC

Comment on attachment 174986 [details]
fix for the issue

I triggered reinit case almost 100 times and got this issue about 17 times. After putting this fix I have triggered reinitialization over 1000 times and it is working without crashing.

I was using LSI SAS2308 card.

Comment 3 Sean Bruno freebsd_committer

2016-09-20 14:08:19 UTC

I'm not a big fan of using msleep() to work around hardware, but can you try and put an msleep in your loop so that it pauses a bit before banging again?  Thank you for your diagnosis and patch!

Comment 4 Enji Cooper freebsd_committer

2016-09-20 16:09:49 UTC

Are you running the latest firmware? My group was running into similar gnarly problems on ^/stable/11 with the old firmware and a new driver, but after we upgraded the firmware, everything was groovy.

Comment 5 prateek sethi 2016-09-26 08:53:48 UTC

Created attachment 175173 [details]
Adding some delay in the previous fix.

Added a delay of 1 millisecond before retry for next time in case of mps_request_sync failure.

I have tested this delay thing by changing  return value of mps_request_sync to failure so get go into the retry path.

Comment 6 prateek sethi 2016-09-26 09:02:20 UTC

(In reply to Sean Bruno from comment #3)

Hi Sean,
I have put an DELAY of 1 millisecond in the loop. I am not very sure about the delay timing. Can you tell that 1 ms is fine or not?

(In reply to Ngie Cooper from comment #4)

Hi Ngie,
Yes it can be a HBA firmware issue but I think it is good only if we have fix for those situations also.(In reply to Sean Bruno from comment #3)

Comment 7 Sean Bruno freebsd_committer

2016-09-26 14:33:46 UTC

(In reply to prateek sethi from comment #6)
Yeah, this looks like it will work.  I don't have a problem with an error handler here.  I'll wait a day or two for "strong objections" if there are any.

Comment 8 Stephen McConnell freebsd_committer

2016-09-26 20:25:38 UTC

Hi Prateek, I'm not sure this is really the right thing to do to fix this because it just seems that we might be covering up a larger problem where the driver is doing something wrong. Can you attach the message file that shows all of the output from the driver prior to the panic? I can take a look and see if it gives me a clue as to what's going on. There are certain timing restrictions that need to be followed when resetting the controller - maybe those restrictions aren't being followed. Some cards are more sensitive to these restrictions than others and, if I remember correctly, the 2308 is one of them. In any case, I think we should dig a little deeper first before committing this change.

Comment 9 Stephen McConnell freebsd_committer

2016-09-26 20:52:10 UTC

Also, it might help to have the debug_level set to 0x07 before your test. If that's too much, and you can't reproduce, then you can try 0x05, and if that's too much then the default of 0x04 will have to do. Looking at the code again, the reset timing is probably OK, but I'd still like to see exactly where this is failing. Thanks.

Comment 10 prateek sethi 2016-09-27 09:19:52 UTC

(In reply to Stephen McConnell from comment #9)

Hi Stephen,

I could not reproduce this issue now. system got panic after the mps reinit got triggered. The following are the only logs I could find (before and after the crash). 

Apr 27 13:34:51 Node1 kernel: mps0: Reinitializing controller
Apr 27 14:52:21 Node1 syslogd: kernel boot file is /boot/kernel/kernel
Apr 27 14:52:21 Node1 savecore: reboot after panic: mps_iocfacts_allocate failed to get IOC Facts with error 6

I got the message "mps0: Doorbell failed to activate" from the core analysis, which will help to find exact point of the issue. Hope these will be helpful for the further debugging. 

Would you please mind throwing some light on the timing restrictions, that you mentioned, to be followed when resetting the controller?

Comment 11 Stephen McConnell freebsd_committer

2016-09-27 19:18:29 UTC

The reset timing in the driver looks fine to me. There is a requirement that the host wait a certain amount of time when it first accesses the controller during a reset, and then a certain time to wait on checking registers, etc. But, it looks fine.

What doesn't make sense is that you're waiting some arbitrary amount of time after the initial failure and then it works. This time that your waiting is after the reset completes and then after some calls to other functions. After all of that, some access to the DOORBELL fails. Then, waiting 2 mSecs fixes it. That's strange.

There are two ways that this will fail in Step 4 of mps_request_sync(). The first is when reading the Interrupt Status REG. If this Register does not show an interrupt within 5 seconds, it fails (that's a really long time). The second is when reading the DOORBELL REG. If the DOORBELL_USED bit is not set, it fails. I can't tell which one of these fails. But, because it fails your fix will just wait 2 mSecs and then retry, then it's successful (at least within 10 mSecs - 5 retries).

What I'm wondering is, does it really matter that you have a delay between mps_request_sync() calls? To me, it looks like something is messed up in FW and just doing a retry fixes it.

Now, with all of that said, I'm not sure there really is a better fix except that the delay may not need to be there. Having the delay there would make someone think that we're just not waiting long enough, which really is not the case and looks a little scary, meaning someone could think the driver timing for this is very fragile, when it's really not.

Sean, let me know what you think about removing the delay. If you want the delay, I would at least say to add a comment that explains the delay and retry, since none of this is really supposed to happen and I think it's some FW or HW workaround.

Comment 12 commit-hook freebsd_committer

2024-10-13 21:38:43 UTC

A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=c0e0e530ced057502f51d7a6086857305e08fae0

commit c0e0e530ced057502f51d7a6086857305e08fae0
Author:     prateek sethi <prateekrootkey@gmail.com>
AuthorDate: 2024-10-13 18:38:54 +0000
Commit:     Warner Losh <imp@FreeBSD.org>
CommitDate: 2024-10-13 21:38:01 +0000

    mps/mpr: Add workaround for firmware not responding to IOC_FACTS or IOC_INIT

    Sometimes, especially with older firmware, mps(4) would have trouble
    initializing the card in one of these two steps. Add in a retry after a
    short delay. Sean Bruno and Stephen McConnell thought this was OK in the
    bug discussions, but never committed it.  Steve indicated the delay
    might not be necessary, but the OP clearly needed to make it longer to
    make things work. I've kept the delay, and added the suggested comment.

    Ported the iocfacts part to mpr as well, since we see similar errors
    about once every month or two over a few thousand controllers at
    work. We've not seen it with IOC_INIT as far back as I can query the
    error log database, so I didn't port that forward. We'll see if this
    helps, but won't know for sure until next year (so I'm committing it now
    since it won't hurt and might help). We usually see this failure in
    connection with complicated recovery operations with a drive that's
    failing, though, at least in the last year's worth of failures. It's
    not clear this is the same as OP or not.

    PR: 212841
    Sponsored by: Netflix
    Co-authored-by: imp

 sys/dev/mpr/mpr.c | 23 ++++++++++++++++++-----
 sys/dev/mps/mps.c | 38 ++++++++++++++++++++++++++++++--------
 2 files changed, 48 insertions(+), 13 deletions(-)

Comment 13 Warner Losh freebsd_committer

2024-10-13 22:35:57 UTC

I neglected to flag this as MFC After, so tagged in bug.

Comment 14 commit-hook freebsd_committer

2024-10-16 14:32:37 UTC

A commit in branch stable/14 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=b57229a80465b390a31fb30b3127c4cb93d8987f

commit b57229a80465b390a31fb30b3127c4cb93d8987f
Author:     prateek sethi <prateekrootkey@gmail.com>
AuthorDate: 2024-10-13 18:38:54 +0000
Commit:     Warner Losh <imp@FreeBSD.org>
CommitDate: 2024-10-16 14:19:21 +0000

    mps/mpr: Add workaround for firmware not responding to IOC_FACTS or IOC_INIT

    Sometimes, especially with older firmware, mps(4) would have trouble
    initializing the card in one of these two steps. Add in a retry after a
    short delay. Sean Bruno and Stephen McConnell thought this was OK in the
    bug discussions, but never committed it.  Steve indicated the delay
    might not be necessary, but the OP clearly needed to make it longer to
    make things work. I've kept the delay, and added the suggested comment.

    Ported the iocfacts part to mpr as well, since we see similar errors
    about once every month or two over a few thousand controllers at
    work. We've not seen it with IOC_INIT as far back as I can query the
    error log database, so I didn't port that forward. We'll see if this
    helps, but won't know for sure until next year (so I'm committing it now
    since it won't hurt and might help). We usually see this failure in
    connection with complicated recovery operations with a drive that's
    failing, though, at least in the last year's worth of failures. It's
    not clear this is the same as OP or not.

    PR: 212841
    Sponsored by: Netflix
    Co-authored-by: imp

    (cherry picked from commit c0e0e530ced057502f51d7a6086857305e08fae0)

 sys/dev/mpr/mpr.c | 23 ++++++++++++++++++-----
 sys/dev/mps/mps.c | 38 ++++++++++++++++++++++++++++++--------
 2 files changed, 48 insertions(+), 13 deletions(-)

Comment 15 commit-hook freebsd_committer

2024-10-16 14:44:39 UTC

A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=e360f8c8fecc78d8f2aa2aee46940ec5eca86c87

commit e360f8c8fecc78d8f2aa2aee46940ec5eca86c87
Author:     prateek sethi <prateekrootkey@gmail.com>
AuthorDate: 2024-10-13 18:38:54 +0000
Commit:     Warner Losh <imp@FreeBSD.org>
CommitDate: 2024-10-16 14:33:33 +0000

    mps/mpr: Add workaround for firmware not responding to IOC_FACTS or IOC_INIT

    Sometimes, especially with older firmware, mps(4) would have trouble
    initializing the card in one of these two steps. Add in a retry after a
    short delay. Sean Bruno and Stephen McConnell thought this was OK in the
    bug discussions, but never committed it.  Steve indicated the delay
    might not be necessary, but the OP clearly needed to make it longer to
    make things work. I've kept the delay, and added the suggested comment.

    Ported the iocfacts part to mpr as well, since we see similar errors
    about once every month or two over a few thousand controllers at
    work. We've not seen it with IOC_INIT as far back as I can query the
    error log database, so I didn't port that forward. We'll see if this
    helps, but won't know for sure until next year (so I'm committing it now
    since it won't hurt and might help). We usually see this failure in
    connection with complicated recovery operations with a drive that's
    failing, though, at least in the last year's worth of failures. It's
    not clear this is the same as OP or not.

    PR: 212841
    Sponsored by: Netflix
    Co-authored-by: imp

    (cherry picked from commit c0e0e530ced057502f51d7a6086857305e08fae0)

 sys/dev/mpr/mpr.c | 23 ++++++++++++++++++-----
 sys/dev/mps/mps.c | 38 ++++++++++++++++++++++++++++++--------
 2 files changed, 48 insertions(+), 13 deletions(-)