Bug 212841

Summary: getting panic during mps reinitialization.
Product: Base System Reporter: prateek sethi <prateekrootkey>
Component: kernAssignee: freebsd-bugs (Nobody) <bugs>
Status: New ---    
Severity: Affects Only Me CC: ngie, prateekrootkey, sbruno, scsi, slm
Priority: --- Keywords: patch
Version: 9.2-RELEASE   
Hardware: amd64   
OS: Any   
Description Flags
mps panic core
fix for the issue
Adding some delay in the previous fix. none

Description prateek sethi 2016-09-20 06:52:54 UTC
Created attachment 174985 [details]
mps panic core

I have done mps reinit and that is causing a panic.
I am getting log "reboot after panic: mps_iocfacts_allocate failed to get IOC Facts with error 6" in the message file file.

attaching a snap for the core.
Comment 1 prateek sethi 2016-09-20 06:56:40 UTC
Created attachment 174986 [details]
fix for the issue

attaching a fix for this issue. I tried this issue several times and almost 17% of time it was coming .
Comment 2 prateek sethi 2016-09-20 07:07:22 UTC
Comment on attachment 174986 [details]
fix for the issue

I triggered reinit case almost 100 times and got this issue about 17 times. After putting this fix I have triggered reinitialization over 1000 times and it is working without crashing.

I was using LSI SAS2308 card.
Comment 3 Sean Bruno freebsd_committer 2016-09-20 14:08:19 UTC
I'm not a big fan of using msleep() to work around hardware, but can you try and put an msleep in your loop so that it pauses a bit before banging again?  Thank you for your diagnosis and patch!
Comment 4 Enji Cooper freebsd_committer 2016-09-20 16:09:49 UTC
Are you running the latest firmware? My group was running into similar gnarly problems on ^/stable/11 with the old firmware and a new driver, but after we upgraded the firmware, everything was groovy.
Comment 5 prateek sethi 2016-09-26 08:53:48 UTC
Created attachment 175173 [details]
Adding some delay in the previous fix.

Added a delay of 1 millisecond before retry for next time in case of mps_request_sync failure.

I have tested this delay thing by changing  return value of mps_request_sync to failure so get go into the retry path.
Comment 6 prateek sethi 2016-09-26 09:02:20 UTC
(In reply to Sean Bruno from comment #3)

Hi Sean,
I have put an DELAY of 1 millisecond in the loop. I am not very sure about the delay timing. Can you tell that 1 ms is fine or not?

(In reply to Ngie Cooper from comment #4)

Hi Ngie,
Yes it can be a HBA firmware issue but I think it is good only if we have fix for those situations also.(In reply to Sean Bruno from comment #3)
Comment 7 Sean Bruno freebsd_committer 2016-09-26 14:33:46 UTC
(In reply to prateek sethi from comment #6)
Yeah, this looks like it will work.  I don't have a problem with an error handler here.  I'll wait a day or two for "strong objections" if there are any.
Comment 8 Stephen McConnell freebsd_committer 2016-09-26 20:25:38 UTC
Hi Prateek, I'm not sure this is really the right thing to do to fix this because it just seems that we might be covering up a larger problem where the driver is doing something wrong. Can you attach the message file that shows all of the output from the driver prior to the panic? I can take a look and see if it gives me a clue as to what's going on. There are certain timing restrictions that need to be followed when resetting the controller - maybe those restrictions aren't being followed. Some cards are more sensitive to these restrictions than others and, if I remember correctly, the 2308 is one of them. In any case, I think we should dig a little deeper first before committing this change.
Comment 9 Stephen McConnell freebsd_committer 2016-09-26 20:52:10 UTC
Also, it might help to have the debug_level set to 0x07 before your test. If that's too much, and you can't reproduce, then you can try 0x05, and if that's too much then the default of 0x04 will have to do. Looking at the code again, the reset timing is probably OK, but I'd still like to see exactly where this is failing. Thanks.
Comment 10 prateek sethi 2016-09-27 09:19:52 UTC
(In reply to Stephen McConnell from comment #9)

Hi Stephen,

I could not reproduce this issue now. system got panic after the mps reinit got triggered. The following are the only logs I could find (before and after the crash). 

Apr 27 13:34:51 Node1 kernel: mps0: Reinitializing controller
Apr 27 14:52:21 Node1 syslogd: kernel boot file is /boot/kernel/kernel
Apr 27 14:52:21 Node1 savecore: reboot after panic: mps_iocfacts_allocate failed to get IOC Facts with error 6

I got the message "mps0: Doorbell failed to activate" from the core analysis, which will help to find exact point of the issue. Hope these will be helpful for the further debugging. 

Would you please mind throwing some light on the timing restrictions, that you mentioned, to be followed when resetting the controller?
Comment 11 Stephen McConnell freebsd_committer 2016-09-27 19:18:29 UTC
The reset timing in the driver looks fine to me. There is a requirement that the host wait a certain amount of time when it first accesses the controller during a reset, and then a certain time to wait on checking registers, etc. But, it looks fine.

What doesn't make sense is that you're waiting some arbitrary amount of time after the initial failure and then it works. This time that your waiting is after the reset completes and then after some calls to other functions. After all of that, some access to the DOORBELL fails. Then, waiting 2 mSecs fixes it. That's strange.

There are two ways that this will fail in Step 4 of mps_request_sync(). The first is when reading the Interrupt Status REG. If this Register does not show an interrupt within 5 seconds, it fails (that's a really long time). The second is when reading the DOORBELL REG. If the DOORBELL_USED bit is not set, it fails. I can't tell which one of these fails. But, because it fails your fix will just wait 2 mSecs and then retry, then it's successful (at least within 10 mSecs - 5 retries).

What I'm wondering is, does it really matter that you have a delay between mps_request_sync() calls? To me, it looks like something is messed up in FW and just doing a retry fixes it.

Now, with all of that said, I'm not sure there really is a better fix except that the delay may not need to be there. Having the delay there would make someone think that we're just not waiting long enough, which really is not the case and looks a little scary, meaning someone could think the driver timing for this is very fragile, when it's really not.

Sean, let me know what you think about removing the delay. If you want the delay, I would at least say to add a comment that explains the delay and retry, since none of this is really supposed to happen and I think it's some FW or HW workaround.