Bug 279737 - mpi3mr running out of chain in 14.1-RELEASE
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: misc
Version: 14.1-RELEASE
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: freebsd-bugs (Nobody)
Reported: 2024-06-14 14:28 UTC by Sarder Kamal
Modified: 2024-06-15 11:49 UTC (History)

Description Sarder Kamal 2024-06-14 14:28:55 UTC
hardware: Dell PowerEdge R760xd2 
Storage Controller: Dell HBA355i Adp

extract from pciconf -lv

mpr0@pci0:80:0:0:       class=0x010700 rev=0x00 hdr=0x00 vendor=0x1000 device=0x00e6 subvendor=0x1028 subdevice=0x200b
    vendor     = 'Broadcom / LSI'
    device     = 'Fusion-MPT 12GSAS/PCIe Secure SAS38xx'
    class      = mass storage
    subclass   = SAS


# cat /boot/loader.conf
zfs_load="YES"
mpi3mr_load="YES"
hw.mpr.max_chains="16384"
dev.mpr.0.max_chains="16384"

When booting 14.0-RELEASE-p6, the system boots without any issue.

When booting 14.1-RELEASE, it fails to boot.

How to reproduce:
1. Boot from a fresh 14.1-RELEASE installer. The boot fails to detect the drives, and the error message on the display is:

mpr0: Out of chain frames, consider increasing hw.mpr.max_chains

I also dropped to the loader prompt and entered:
set hw.mpr.max_chains=16384
set dev.mpr.0.max_chains=16384
boot

That fails too.

2. Install a system with 14.0-RELEASE and put the values in /boot/loader.conf. The system boots without any problem or glitch.

Then try to upgrade to 14.1-RELEASE. After the first phase (kernel update), the system fails to boot.

I tried going to the loader prompt and setting the variables as above; it still fails.

However, if you select the old kernel, it boots without any issue.


If further information or details are required, please ask.

Thanks and regards
SK
Comment 1 Warner Losh freebsd_committer freebsd_triage 2024-06-14 15:34:09 UTC
Something doesn't match. mpi3mr and mpr are two different drivers. mpi3mr is super new and kinda weird, and I don't think it ever worked on 14.0.

mpr is mature and should work fine. Without the exact errors, though, it's hard to say for sure.

Can you clarify which one it is and provide the error messages?
Comment 2 Warner Losh freebsd_committer freebsd_triage 2024-06-14 15:44:01 UTC
Assuming it is mpr, you might look to back out:

commit 504e85ade103b0c2cafefb2d9dea86e94aef779f
Author: Kenneth D. Merry <ken@FreeBSD.org>
Date:   Thu Dec 14 15:05:17 2023 -0500

    mpr, mps:  Establish busdma boundaries for memory pools

    Most all of the memory used by the cards in the mpr(4) and mps(4)
    drivers is required, according to the specs and Broadcom developers,
    to be within a 4GB segment of memory.

to see if that "solves" the problem for you. I put "solves" in quotes because it reintroduces other bugs.

That's the only real change to mpr between releng/14.0 and releng/14.1 that looks to be any kind of change in this area.
Comment 3 Kenneth D. Merry freebsd_committer freebsd_triage 2024-06-14 19:11:16 UTC
Running out of chain frames is generally just a symptom of having a lot of I/O going in parallel.  In and of itself, that isn't a bad thing.

The difference between 14.0 and 14.1 is probably just how hard the system is pushing the controller.

You set the maximum number of chain frames to 16384, but that is actually the default value in FreeBSD/head.  Assuming it is the same in 14, that means nothing actually changed with the loader tunable.

Try going to 32768 and see if that fixes it.  It probably will, but if not, double it again and see if that works.
Comment 4 Sarder Kamal 2024-06-14 23:21:12 UTC
For some reason I was under the impression that most of the newer Dell HBA cards require the newer drivers, hence the mpi3mr choice. I can confirm there is no difference even if I use mpr_load="YES" in my loader.conf.

So far, I have tried the following after changing to mpr_load in loader.conf:

a. Boot with the 14.0 kernel (the original on the disk): it boots.

b. Boot with the 14.1 kernel (the upgraded one; only the kernel, I have not done anything to userland yet): it boots.

c. Boot from the 14.1-RELEASE installer disk, go to the loader prompt, load mpr.ko, increase hw.mpr.max_chains gradually all the way up to 65536, and then try to boot. Every time it reports the same chain error and fails to boot.

So, I now have a system where the kernel has been upgraded, but I am not convinced I should upgrade the userland yet, since I can still revert to the older kernel, which appears to be working fine.

What I would like to understand is why the older (14.0-p6) kernel works fine while the 14.1 installer is unable to boot and throws errors, which does not give me any confidence to move from 14.0.

If you need me to make more tests, I can do so on Monday once I return to that place. Please let me know what tests you'd like me to run.

Thanks again everyone for looking into this.

Thanks and regards
SK
Comment 5 Muhammad Moinur Rahman freebsd_committer freebsd_triage 2024-06-15 00:13:44 UTC
(In reply to Sarder Kamal from comment #4)
Have you tried these variables with 32768 instead of 16384?

hw.mpr.max_chains="16384"
dev.mpr.0.max_chains="16384"

If not, please try with 32768 and 65536. I am working on getting an installer image without the commit mentioned in comment #2.
Comment 6 Sarder Kamal 2024-06-15 08:58:49 UTC
Thanks everyone for looking into it and for your prompt response.

I can confirm that I am able to boot from the installer by setting hw.mpr.max_chains=32768 at the loader prompt. I have also tried setting dev.mpr.0.max_chains=32768 together with hw.mpr.max_chains (to see if setting both makes any difference; it did not), and that worked too.

I have tried to reproduce the "installer" failure in various ways (using the value 32768), following the same path that I tried earlier, but could not reproduce it. This indicates that increasing the value actually did the trick.

Next I tried 65536. So far it has failed with every method I could follow, with mpr0 running out of chain frames. It tries to reinitialize, but that did not work for as long as I let it try.

So, I am happy to confirm that in my case (at least) setting the value to 32768 worked and allowed me to continue. Setting the value to 65536 did not work; it failed every time.

Thanks everyone for looking into it and for helping find a practical way forward. I am still a little perplexed as to why the value 16384 worked with 14.0 and not with 14.1, though; hopefully someone can help me understand at some point.

Thanks again, much appreciated.
SK
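
For anyone landing here with the same symptom, the setting confirmed above could be made persistent roughly as follows. This is only a sketch based on the values reported in this bug: 32768 worked on this particular Dell HBA355i and 65536 did not, so treat the number as a starting point rather than a recommendation. The sysctl names in the comments come from mpr(4) and were not exercised in this report.

```
# /boot/loader.conf (fragment)
# Raise the number of DMA chain frames for all mpr(4) adapters.
# 16384 is reported above as the default; 32768 booted, 65536 did not.
hw.mpr.max_chains="32768"

# After boot, the remaining headroom can be inspected with:
#   sysctl dev.mpr.0.chain_free
#   sysctl dev.mpr.0.chain_free_lowwater
```

The per-device dev.mpr.0.max_chains tunable is left out because, per the test above, setting it in addition to hw.mpr.max_chains made no difference.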
Comment 7 Muhammad Moinur Rahman freebsd_committer freebsd_triage 2024-06-15 11:04:32 UTC
(In reply to Sarder Kamal from comment #6)
Someone can definitely help you understand it, but it's like a Pandora's box. DMA (Direct Memory Access) lets a device access memory independently, so that the CPU is not tied up or slowed down doing the transfer itself. When making these memory accesses, the device uses chained requests rather than sequential requests, which are slower than hell. The problem is that, as the CPU market grows, OS designers have to keep up with the number of chained DMA requests the OS can handle, and FreeBSD also keeps pushing these limits higher for increased performance. The more chained DMA requests a given piece of hardware can complete, the more data it can copy between memory and the device in each cycle without tying up the CPU. The underlying problem is not within the driver, though. With 32768 you may now even see somewhat better performance from your card, although I am not sure we can benchmark it.
Comment 8 Sarder Kamal 2024-06-15 11:49:55 UTC
(In reply to Muhammad Moinur Rahman from comment #7)

Thank you for the explanation, much appreciated.

Best regards
SK