Summary: | smartpqi - and/or ZFS - related kernel panics | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | Base System | Reporter: | rainer | ||||||
Component: | kern | Assignee: | Mark Johnston <markj> | ||||||
Status: | Closed FIXED | ||||||||
Severity: | Affects Only Me | CC: | charles.orbello, deepak.ukey, duerrd561, jgitlin+freebsd, markj, pi, sigsys | ||||||
Priority: | --- | Keywords: | crash | ||||||
Version: | 11.2-RELEASE | Flags: | koobs: mfc-stable12+, koobs: mfc-stable11+ | ||||||
Hardware: | amd64 | ||||||||
OS: | Any | ||||||||
See Also: | https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=231794 | ||||||||
Attachments: |
|
Just for the record (I have no idea whether it's related or whether there's a relevant firmware update from HPE): the 1.34 firmware you seem to be running was unstable for me as well with a Microsemi HBA 1100-8i; updating to 1.60 from the Microsemi site solved it.

You are right, there is an update on HPE's website. Unfortunately, it's not yet part of an SPP, so I'll have to figure out a way to install it. Thanks a lot.

After updating the firmware, I still get panics. The handbook should be clearer about the fact that you can't get a crash dump from ZFS. After adding an additional swap partition on a USB drive, I got this crash dump:
(ewserv-log03-prod </root>) 0 # kgdb /boot/kernel/kernel /var/crash/vmcore.1
GNU gdb 6.1.1 [FreeBSD] Copyright 2004 Free Software Foundation, Inc. GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions. Type "show copying" to see the conditions. There is absolutely no warranty for GDB. Type "show warranty" for details. This GDB was configured as "amd64-marcel-freebsd"...
Unread portion of the kernel message buffer: Fatal trap 12: page fault while in kernel mode cpuid = 3; apic id = 03 fault virtual address = 0x5a fault code = supervisor read data, page not present instruction pointer = 0x20:0xffffffff80dff90d stack pointer = 0x28:0xfffffe084ed93f00 frame pointer = 0x28:0xfffffe084ed93f40 code segment = base rx0, limit 0xfffff, type 0x1b = DPL 0, pres 1, long 1, def32 0, gran 1 processor eflags = interrupt enabled, resume, IOPL = 0 current process = 0 (zio_write_issue_10) trap number = 12 panic: page fault cpuid = 3 KDB: stack backtrace: #0 0xffffffff80b3d567 at kdb_backtrace+0x67 #1 0xffffffff80af6b07 at vpanic+0x177 #2 0xffffffff80af6983 at panic+0x43 #3 0xffffffff80f77fcf at trap_fatal+0x35f #4 0xffffffff80f78029 at trap_pfault+0x49 #5 0xffffffff80f777f7 at trap+0x2c7 #6 0xffffffff80f57dac at calltrap+0x8 #7 0xffffffff80dee7e2 at kmem_back+0xf2 #8 0xffffffff80dee6c0 at kmem_malloc+0x60 #9 0xffffffff80de6172 at keg_alloc_slab+0xe2 #10 0xffffffff80de8b7e at keg_fetch_slab+0x14e #11 0xffffffff80de83b4 at zone_fetch_slab+0x64 #12 0xffffffff80de848f at zone_import+0x3f #13 0xffffffff80de4b99 at uma_zalloc_arg+0x3d9 #14 0xffffffff82351ab2 at zio_write_compress+0x1e2 #15 0xffffffff8235074c at zio_execute+0xac #16 0xffffffff80b4ed74 at taskqueue_run_locked+0x154 #17 0xffffffff80b4fed8 at taskqueue_thread_loop+0x98 Uptime: 40m34s Dumping 5489 out of 32379 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91% Reading symbols from /boot/kernel/geom_mirror.ko...Reading symbols from /usr/lib/debug//boot/kernel/geom_mirror.ko.debug...done. done. Loaded symbols for /boot/kernel/geom_mirror.ko Reading symbols from /boot/kernel/zfs.ko...Reading symbols from /usr/lib/debug//boot/kernel/zfs.ko.debug...done. done. Loaded symbols for /boot/kernel/zfs.ko Reading symbols from /boot/kernel/opensolaris.ko...Reading symbols from /usr/lib/debug//boot/kernel/opensolaris.ko.debug...done. done. 
Loaded symbols for /boot/kernel/opensolaris.ko Reading symbols from /boot/kernel/accf_data.ko...Reading symbols from /usr/lib/debug//boot/kernel/accf_data.ko.debug...done. done. Loaded symbols for /boot/kernel/accf_data.ko Reading symbols from /boot/kernel/accf_http.ko...Reading symbols from /usr/lib/debug//boot/kernel/accf_http.ko.debug...done. done. Loaded symbols for /boot/kernel/accf_http.ko Reading symbols from /boot/kernel/cc_htcp.ko...Reading symbols from /usr/lib/debug//boot/kernel/cc_htcp.ko.debug...done. done. Loaded symbols for /boot/kernel/cc_htcp.ko Reading symbols from /boot/kernel/ums.ko...Reading symbols from /usr/lib/debug//boot/kernel/ums.ko.debug...done. done. Loaded symbols for /boot/kernel/ums.ko Reading symbols from /boot/kernel/tmpfs.ko...Reading symbols from /usr/lib/debug//boot/kernel/tmpfs.ko.debug...done. done. Loaded symbols for /boot/kernel/tmpfs.ko #0 0xffffffff80af68fb in doadump (textdump=0) at /usr/src/sys/kern/kern_shutdown.c:309 309 if (dumping) (kgdb) bt #0 0xffffffff80af68fb in doadump (textdump=0) at /usr/src/sys/kern/kern_shutdown.c:309 #1 0xffffffff80af6925 in doadump (textdump=<value optimized out>) at /usr/src/sys/kern/kern_shutdown.c:315 #2 0xffffffff80af671b in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:382 #3 0xffffffff80af6b41 in vpanic (fmt=<value optimized out>, ap=0xfffffe084ed93c50) at /usr/src/sys/kern/kern_shutdown.c:769 #4 0xffffffff80af6983 in panic (fmt=0x0) at /usr/src/sys/kern/kern_shutdown.c:706 #5 0xffffffff80f77fcf in trap_fatal (frame=0xfffffe084ed93e40, eva=90) at /usr/src/sys/amd64/amd64/trap.c:875 #6 0xffffffff80f78029 in trap_pfault (frame=0xfffffe084ed93e40, usermode=0) at /usr/src/sys/amd64/amd64/trap.c:712 #7 0xffffffff80f777f7 in trap (frame=0xfffffe084ed93e40) at /usr/src/sys/amd64/amd64/trap.c:514 #8 0xffffffff80f57dac in Xtss_pti () at /usr/src/sys/amd64/amd64/exception.S:159 #9 0xffffffff80dff90d in vm_page_rename (m=0x3ff, new_object=0xfffff80018d8d000, new_pindex=<value 
optimized out>) at /usr/src/sys/vm/vm_page.c:1342 #10 0xffffffff80dee7e2 in kmem_suballoc (parent=0x262, min=0x14000, max=0xffffffff81ebc558, size=874980, superpage_align=<value optimized out>) at /usr/src/sys/vm/vm_kern.c:290 #11 0xffffffff80dee6c0 in kmem_alloc_contig (vmem=0xfffffe00d59d0000, size=18446744071594296576, flags=<value optimized out>, low=18446735303990395200, high=257, alignment=18446735278033391616, boundary=18446735278033391616, memattr=-16 '\360') at /usr/src/sys/vm/vm_kern.c:254 #12 0xffffffff80de6172 in uma_prealloc (zone=0x0, items=1322860228) at /usr/src/sys/vm/uma_core.c:3150 #13 0xfffff806240140f0 in ?? () #14 0xfffffe00c51f357e in ?? () #15 0xfffffe00d59b0000 in ?? () #16 0xfffff8000d460498 in ?? () #17 0xfffff80624014140 in ?? () #18 0x02fffe00c520c000 in ?? () #19 0xfffff8000d460480 in ?? () #20 0xfffff8000d4641c0 in ?? () #21 0x0000000000000000 in ?? () Current language: auto; currently minimal

Can anyone make sense of that?
BTW: I ran memtest86 v7.5 (the free edition of the commercial version, which supports UEFI) on this machine for 8h and it showed no errors.

I have experienced nearly the same issue and asked the freebsd-fs list for help, as I thought it might be related to a kernel change or a misconfiguration (even though the config we were using had not changed). See: https://lists.freebsd.org/pipermail/freebsd-fs/2018-September/026725.html
The panic stack trace we saw was exactly the same, and it happened under ZFS load (but not unusually high load, no higher than we've seen in production before).

OK. This is a setting that I have in my sysctl.conf.local, but commented out by default (because not all hosts use ZFS, and I somehow thought that it's only needed on hosts that do other stuff). I stumbled upon this PR[1], too, a while ago, and I have adjusted it on my ZFS hosts. Just not on this one, because this one isn't supposed to run much else; other hosts run mysql and/or apache+php+nginx etc.
[1] https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=229764

Or rather, I took my settings from this one: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=163461
At least it ran through the night.

Hi,
Can you please tell me how to reproduce the issue, or which steps cause this panic? Also, can you please tell me which firmware version you are using for the E208i-p SR Gen10 / P408i-a SR Gen10 cards, so that I can try reproducing this on my setup and help you resolve it? Thanks.

Hi,
Firmware revision is 1.60 (from the HPE website). But it seems to be an ARC problem that just did not materialize on my other servers because ARC was already limited there, and that is actually pretty widespread. Also, one of the first panics we got had the driver name somewhere in the backtrace, but that was on the old firmware.

I was notified of this PR privately: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=231794 which seems to describe a similar problem.

(In reply to rainer from comment #9)
The problem might be similar, but it is certainly different. In the other bug they are getting a panic (unfortunately the panic message is not shown), while you are getting a fatal trap / page fault. Also, in your case there are no ARC calls in the stack trace: it goes straight from the ZIO code to the VM code. So it's quite possible that ARC contributes to the problem (e.g., by creating memory pressure or some such), but there is a bug in kmem_back / kmem_malloc. Finally, in comment #3 the stack trace recorded by ddb and the stack trace shown by kgdb do not match. I suspect that is because you passed the wrong kernel to kgdb, or /usr/lib/debug/boot/kernel does not match /boot/kernel.

Hi,
I compiled a kernel myself with
make buildkernel && make installkernel
I had thought the debug kernel lived next to the kernel in /boot/kernel...
(ewserv-log03-prod </root>) 0 # uname -a
FreeBSD ewserv-log03-prod.everyware.zone 11.2-RELEASE-p4 FreeBSD 11.2-RELEASE-p4 #0: Fri Sep 28 16:37:02 CEST 2018 root@ewserv-log03-prod.everyware.zone:/usr/obj/usr/src/sys/GENERIC amd64
(ewserv-log03-prod </root>) 0 # ll /usr/lib/debug/boot/kernel/kernel.debug
-r-xr-xr-x 1 root wheel 86179448 Sep 28 16:37 /usr/lib/debug/boot/kernel/kernel.debug
(ewserv-log03-prod </root>) 0 # ll /boot/kernel/kernel
-r-xr-xr-x 1 root wheel 27781528 Sep 28 16:37 /boot/kernel/kernel
(I built it myself because I wasn't sure whether the default kernel package contains a kernel with debug symbols.) What is the correct way to get a kernel with debug symbols?
I can reboot and run my tests again without the ARC reduction, to make sure this is the kernel that is producing the crash dump. It needed less than an hour to lock up.
We would like to get this server back into production, but for now I can do whatever is necessary to solve this problem (apart from allowing direct logins; I'd have to wipe it).

(In reply to Andriy Gapon from comment #10)
> it's quite possible that ARC contributes to the problem but
> there is a bug in kmem_back / kmem_malloc.
This is what I felt as well when reading the source. I didn't see any specific out-of-memory error, but rather a page fault, which (to my untrained eye) looked like the kernel trying to access a KVA page that did not exist. But I was very unsure of my theory that it was a bug as opposed to a misconfiguration. What I found odd was that we had crashes on production systems where the config in place hadn't changed in years...

Is this one of the issues mentioned here: https://www.freebsdfoundation.org/news-and-events/newsletter/freebsd-foundation-update-september-2018/
"Of late I have also been investigating reports of issues with ZFS, particularly, those reported on FreeBSD 11.2.
A number of regressions, including a kernel memory leak and issues with ARC reclamation, have already been fixed for 12.0;"

(In reply to rainer from comment #13)
Not yet. A few days ago someone reported a similar panic on the lists. I asked for some further info and didn't get any reply. If you or anyone else can reproduce this problem, I'd at least like to see a backtrace with line numbers (the stack in comment 3 is bogus). Ideally, I'd be able to take a look at a vmcore. For this, I also need a copy of /boot/kernel and /usr/lib/debug/boot/kernel corresponding to the running kernel.

OK. I've "fixed" this by adding more memory and reducing ARC, so that a smaller percentage of memory is wired. I will see what I can do.

OK, as soon as some load is on the system (in this case, 10 Mbit UDP syslog traffic), it panics again: supervisor write data, page not present.
How do I install and activate the debug kernel from the kernel-dbg.txz archive?

(In reply to rainer from comment #16)
kernel-dbg just contains kernel symbols. If you crash a kernel and get a vmcore in /var/crash, then the symbols can be used to examine the vmcore. The kernel-dbg set needs to match the kernel from the corresponding kernel.txz archive.

log02-prod </root>) 0 # kgdb /boot/kernel/kernel /var/crash/vmcore.last
GNU gdb 6.1.1 [FreeBSD] Copyright 2004 Free Software Foundation, Inc. GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions. Type "show copying" to see the conditions. There is absolutely no warranty for GDB. Type "show warranty" for details. This GDB was configured as "amd64-marcel-freebsd"...
Unread portion of the kernel message buffer: Fatal trap 12: page fault while in kernel mode cpuid = 2; apic id = 02 fault virtual address = 0x0 fault code = supervisor write data, page not present instruction pointer = 0x20:0xffffffff80aacfae stack pointer = 0x28:0xfffffe103c5e8f60 frame pointer = 0x28:0xfffffe103c5e9130 code segment = base rx0, limit 0xfffff, type 0x1b = DPL 0, pres 1, long 1, def32 0, gran 1 processor eflags = interrupt enabled, resume, IOPL = 0 current process = 87490 (syslog-ng) trap number = 12 panic: page fault cpuid = 2 KDB: stack backtrace: #0 0xffffffff80b3d587 at kdb_backtrace+0x67 #1 0xffffffff80af6b27 at vpanic+0x177 #2 0xffffffff80af69a3 at panic+0x43 #3 0xffffffff80f77fdf at trap_fatal+0x35f #4 0xffffffff80f78039 at trap_pfault+0x49 #5 0xffffffff80f77807 at trap+0x2c7 #6 0xffffffff80f57fbc at calltrap+0x8 #7 0xffffffff80aac905 at kern_kevent+0xb5 #8 0xffffffff80aac794 at sys_kevent+0xa4 #9 0xffffffff80f79068 at amd64_syscall+0xa38 #10 0xffffffff80f5880d at fast_syscall_common+0x101 Uptime: 45m56s Dumping 2959 out of 65147 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91% Reading symbols from /boot/kernel/geom_mirror.ko...Reading symbols from /usr/lib/debug//boot/kernel/geom_mirror.ko.debug...done. done. Loaded symbols for /boot/kernel/geom_mirror.ko Reading symbols from /boot/kernel/zfs.ko...Reading symbols from /usr/lib/debug//boot/kernel/zfs.ko.debug...done. done. Loaded symbols for /boot/kernel/zfs.ko Reading symbols from /boot/kernel/opensolaris.ko...Reading symbols from /usr/lib/debug//boot/kernel/opensolaris.ko.debug...done. done. Loaded symbols for /boot/kernel/opensolaris.ko Reading symbols from /boot/kernel/accf_data.ko...Reading symbols from /usr/lib/debug//boot/kernel/accf_data.ko.debug...done. done. Loaded symbols for /boot/kernel/accf_data.ko Reading symbols from /boot/kernel/accf_http.ko...Reading symbols from /usr/lib/debug//boot/kernel/accf_http.ko.debug...done. done. 
Loaded symbols for /boot/kernel/accf_http.ko Reading symbols from /boot/kernel/cc_htcp.ko...Reading symbols from /usr/lib/debug//boot/kernel/cc_htcp.ko.debug...done. done. Loaded symbols for /boot/kernel/cc_htcp.ko Reading symbols from /boot/kernel/tmpfs.ko...Reading symbols from /usr/lib/debug//boot/kernel/tmpfs.ko.debug...done. done. Loaded symbols for /boot/kernel/tmpfs.ko #0 doadump (textdump=<value optimized out>) at pcpu.h:229 229 pcpu.h: No such file or directory. in pcpu.h (kgdb) bt #0 doadump (textdump=<value optimized out>) at pcpu.h:229 #1 0xffffffff80af673b in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:383 #2 0xffffffff80af6b61 in vpanic (fmt=<value optimized out>, ap=<value optimized out>) at /usr/src/sys/kern/kern_shutdown.c:776 #3 0xffffffff80af69a3 in panic (fmt=<value optimized out>) at /usr/src/sys/kern/kern_shutdown.c:707 #4 0xffffffff80f77fdf in trap_fatal (frame=0xfffffe103c5e8ea0, eva=0) at /usr/src/sys/amd64/amd64/trap.c:875 #5 0xffffffff80f78039 in trap_pfault (frame=0xfffffe103c5e8ea0, usermode=0) at pcpu.h:229 #6 0xffffffff80f77807 in trap (frame=0xfffffe103c5e8ea0) at /usr/src/sys/amd64/amd64/trap.c:415 #7 0xffffffff80f57fbc in calltrap () at /usr/src/sys/amd64/amd64/exception.S:231 #8 0xffffffff80aacfae in kqueue_kevent (kq=0xfffff801cb5a4a00, td=0xfffff801e3b34000, nchanges=<value optimized out>, nevents=1, k_ops=0xfffffe103c5e91c0, timeout=0xfffffe1000000000) at /usr/src/sys/kern/kern_event.c:1623 #9 0xffffffff80aac905 in kern_kevent (td=0xfffff801e3b34000, fd=<value optimized out>, nchanges=0, nevents=1, k_ops=0xfffffe103c5e91c0, timeout=0xfffffe103c5e91b0) at /usr/src/sys/kern/kern_event.c:1023 #10 0xffffffff80aac794 in sys_kevent (td=0xfffff801e3b34000, uap=0xfffff801e3b34538) at /usr/src/sys/kern/kern_event.c:901 #11 0xffffffff80f79068 in amd64_syscall (td=0xfffff801e3b34000, traced=0) at subr_syscall.c:132 #12 0xffffffff80f5880d in fast_syscall_common () at /usr/src/sys/amd64/amd64/exception.S:479 #13 
0x000000080222b81a in ?? () Previous frame inner to this frame (corrupt stack?) Current language: auto; currently minimal (kgdb)

It had only been up for a couple of minutes at that point. I had rm -rf'ed /var/db/freebsd-update/* in order to update the debug symbols to the current kernel level. And bang, it panicked again.
This is really weird. But it shows that most of my systems apparently aren't really stressed at all.

Hmm, this seems to be unrelated to ZFS: it's a panic in kevent(). Would you be willing to share the vmcore with me so that I can try to find the bug? Please mail me if so.

(In reply to Mark Johnston from comment #20)
I believe the kevent bug was fixed in r340734. I think I know what's causing the kmem_back() panic as well. Is the system using a Skylake CPU? Please try the patch here: https://people.freebsd.org/~markj/patches/blacklist_freecnt.diff

It's an HP Gen10, using Xeon Scalable Processors. So yes, it's Skylake (Xeon variant).
x86info v1.31pre
/dev/cpuctl0: No such file or directory
Found 16 identical CPUs
Extended Family: 0 Extended Model: 5 Family: 6 Model: 85 Stepping: 4
Type: 0 (Original OEM)
CPU Model (x86info's best guess): Core i7 (Skylake-X)
Processor name string (BIOS programmed): Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz
Total processor threads: 16
This system has 1 eight-core processor with hyper-threading (2 threads per core) running at an estimated 2.10GHz

I've applied your first patch, but have not yet been able to find the time to put the system under load. Should I apply your second patch without testing the first patch?

We experienced the kmem_back issues as well, and we're running a Skylake: Intel® Xeon® Bronze 3106 Processor
We haven't seen the issue since reducing ARC size, but the same config had been stable for years prior to that.
(We did change hardware, however.) Backtrace was:
trap number = 12
panic: page fault
cpuid = 6
KDB: stack backtrace:
#0 0xffffffff80b3d567 at kdb_backtrace+0x67
#1 0xffffffff80af6b07 at vpanic+0x177
#2 0xffffffff80af6983 at panic+0x43
#3 0xffffffff80f77fcf at trap_fatal+0x35f
#4 0xffffffff80f78029 at trap_pfault+0x49
#5 0xffffffff80f777f7 at trap+0x2c7
#6 0xffffffff80f57dac at calltrap+0x8
#7 0xffffffff80dee7e2 at kmem_back+0xf2
#8 0xffffffff80dee6c0 at kmem_malloc+0x60
#9 0xffffffff80de6172 at keg_alloc_slab+0xe2
#10 0xffffffff80de8b7e at keg_fetch_slab+0x14e
#11 0xffffffff80de8364 at zone_fetch_slab+0x64
#12 0xffffffff80de848f at zone_import+0x3f
#13 0xffffffff80de4b99 at uma_zalloc_arg+0x3d9
#14 0xffffffff826e6ab2 at zio_write_compress+0x1e2
#15 0xffffffff826e574c at zio_execute+0xac
#16 0xffffffff80b4ed74 at taskqueue_run_locked+0x154
#17 0xffffffff80b4fed8 at taskqueue_thread_loop+0x98
Uptime: 18d18h31m6s
mpr0: Sending StopUnit: path (xpt0:mpr0:0:10:ffffffff): handle 10
mpr0: Incrementing SSU count
mpr0: Sending StopUnit: path (xpt0:mpr0:0:13:ffffffff): handle 13
mpr0: Incrementing SSU count
mpr0: Sending StopUnit: path (xpt0:mpr0:0:16:ffffffff): handle 16
mpr0: Incrementing SSU count

(In reply to rainer from comment #22)
I think it'd be fine to just test both patches. They address completely unrelated issues.

(In reply to Josh Gitlin from comment #23)
Reducing the ARC size most likely just hid the bug again. It is triggered when the system is completely out of free pages, but if you're restricting the ARC it'll be harder to hit that condition.

(In reply to Mark Johnston from comment #24)
That makes complete sense.
The servers we saw this on are iSCSI storage servers in production, so unfortunately we can't test a patch on them (and I'd prefer they _not_ be out of free pages anyway).

A commit references this bug:

Author: markj
Date: Thu Nov 29 16:31:02 UTC 2018
New revision: 341247
URL: https://svnweb.freebsd.org/changeset/base/341247

Log:
Update the free page count when blacklisting pages.
Otherwise the free page count will not accurately reflect the physical page allocator's state. On 11 this can trigger panics in vm_page_alloc() since the allocator state and free page count are updated atomically and we expect them to stay in sync. On 12 the bug would manifest as threads looping in vm_page_alloc().
PR: 231296
Reported by: mav, wollman, Rainer Duffner, Josh Gitlin
Reviewed by: alc, kib, mav
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18374

Changes:
head/sys/vm/vm_page.c

A commit references this bug:

Author: markj
Date: Thu Nov 29 17:35:37 UTC 2018
New revision: 341249
URL: https://svnweb.freebsd.org/changeset/base/341249

Log:
MFC r341247:
Update the free page count when blacklisting pages.
PR: 231296
Approved by: re (gjb, insta-MFC)

Changes:
_U stable/12/
stable/12/sys/vm/vm_page.c

Hi,
I assume this will not make it into 12.0-RELEASE?
I don't have my own infrastructure for releases and I try to only run releases, so I take the binaries from the download site plus FreeBSD binary updates.
It looks like I'll either have to wait for 11.3 or 12.1, or patch those in and re-patch every time there is an update to 11.2-RELEASE or 12.0-RELEASE.
Or could an errata patch for 11.2 be released (as well as for 12.0)? Technically, I'd have to patch all my 11.2 hosts for this otherwise. Or at least those with ZFS, so as not to hit this at some point in the future during high load.

(In reply to rainer from comment #28)
No, the issue is fixed in 12.0. I'm planning to ask for an EN for 11.2 as well.

Thanks!
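For hosts that only track binary releases, as described above, one way to audit which machines still need the erratum is to compare each kernel's patch level against a threshold. The helper below is a minimal POSIX sh sketch under stated assumptions: `patch_level` and `needs_update` are hypothetical names, the version string is assumed to have the `11.2-RELEASE-p4` form shown by `uname -a` earlier in this report, and the minimum patch level is a parameter that has to come from the published erratum notice, since this thread does not state it. It compares patch levels only, so it is meaningful only among hosts on the same release.

```shell
#!/bin/sh
# Hypothetical helpers, not part of FreeBSD.

# patch_level VERSION
# Prints the numeric patch level of a version string such as
# "11.2-RELEASE-p4". Prints 0 when there is no "-pN" suffix.
patch_level() {
    case "$1" in
        *-p[0-9]*) printf '%s\n' "${1##*-p}" ;;
        *)         printf '0\n' ;;
    esac
}

# needs_update VERSION MIN_PATCH
# Succeeds (exit 0) when the patch level of VERSION is below MIN_PATCH.
needs_update() {
    [ "$(patch_level "$1")" -lt "$2" ]
}

# Usage on a FreeBSD host (MIN_PATCH is an assumption; take the real
# value from the erratum notice):
#   if needs_update "$(freebsd-version -k)" "$MIN_PATCH"; then
#       echo "kernel still lacks the fix"
#   fi
```

`freebsd-version -k` reports the installed kernel's version, which is the relevant one here because the fix is in the kernel rather than in userland.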
A commit references this bug:

Author: markj
Date: Sun Dec 2 18:08:27 UTC 2018
New revision: 341401
URL: https://svnweb.freebsd.org/changeset/base/341401

Log:
MFC r341247:
Update the free page count when blacklisting pages.
PR: 231296

Changes:
_U stable/11/
stable/11/sys/vm/vm_page.c

A commit references this bug:

Author: emaste
Date: Wed Dec 19 18:00:59 UTC 2018
New revision: 342225
URL: https://svnweb.freebsd.org/changeset/base/342225

Log:
MFS11 r341401: Update the free page count when blacklisting pages.
PR: 231296
Submitted by: markj
Approved by: so
Security: FreeBSD-EN-18:17.vm
Sponsored by: The FreeBSD Foundation

Changes:
_U releng/11.2/
releng/11.2/sys/vm/vm_page.c

Hello,
It's possible that the problem is not fully resolved.
http://download.adaptec.com/pdfs/readme/microsemi_smarthba_smartraid_relnotes_1_2019.pdf
Page 6: "Fixed an issue where controller hangs when running I/O with pending DDR cache retries."
Driver: https://storage.microsemi.com/en-us/speed/raid/aac/unix/smartpqi_freebsd_v1.0.3-2035_tgz.php
Firmware (not available in HPE's SPP): https://storage.microsemi.com/en-us/speed/raid/asr/fw_bios/microsemi_smartfwx100_v1_98_b0_zip.php

On my side, I can reproduce the crash on 11.2 and 12.0 (r342545) with fio on a raidz2 (8 devices) on the same hardware, an HP DL380 Gen10:
[benchmark1]
name=benchmark1
rw=write
numjobs=10
size=100G
The core.txt is attached.

Created attachment 201312 [details]
core_txt_smartpqi
(In reply to Orbello Charles from comment #33)
This is distinct from the other panics reported in this PR, which were not specific to smartpqi(4). Please open a new PR.

The latest SPP (2019.03.01) does contain that firmware fix. |
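Several comments in this report turn on getting a usable crash dump from a host whose only storage is ZFS: the reporter added a swap partition on a USB drive as the dump target, and kgdb needs a kernel and kernel-dbg set that match. The rc.conf fragment below is a minimal sketch of that setup under assumptions: `/dev/da10p1` is a hypothetical partition name (da10 is the USB stick in the reporter's device list, but the thread never names the partition), and the partition has to be large enough for the dump (the panics above wrote roughly 3 to 5.5 GB).

```shell
# /etc/rc.conf fragment (sketch; the device name is hypothetical)

# Partition that receives the kernel dump on panic. It must not be on ZFS.
dumpdev="/dev/da10p1"

# Directory where savecore(8) stores the recovered vmcore.N at boot.
dumpdir="/var/crash"

# Related manual steps, shown as comments because they need root on the
# affected host:
#   dumpon /dev/da10p1                 # arm the dump device without rebooting
#   savecore /var/crash /dev/da10p1    # recover a dump by hand after a panic
#   kgdb /boot/kernel/kernel /var/crash/vmcore.N   # N = number savecore assigned
```

As noted in the thread, the symbols under /usr/lib/debug/boot/kernel have to belong to the same build as /boot/kernel; otherwise kgdb produces a backtrace that does not match the one printed at panic time.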
Created attachment 197020 [details]
pic of kernel panic

Hi,
this is an HPE DL380 Gen10 system.
smartpqi0: <E208i-p SR Gen10> port 0x4000-0x40ff mem 0xe2800000-0xe2807fff at device 0.0 numa-domain 0 on pci4
smartpqi0: using MSI-X interrupts (16 vectors)
smartpqi1: <P408i-a SR Gen10> port 0xc000-0xc0ff mem 0xf3800000-0xf3807fff at device 0.0 numa-domain 0 on pci9
smartpqi1: using MSI-X interrupts (16 vectors)

(server </root>) 0 # camcontrol devlist
<HP EG000300JWFVB HPD2> at scbus0 target 64 lun 0 (pass0,da0)
<HP EG000300JWFVB HPD2> at scbus0 target 66 lun 0 (pass1,da1)
<HPE Smart Adapter 1.34> at scbus0 target 187 lun 0 (pass2,ses0)
<HPE E208i-p SR Gen10 1.34> at scbus0 target 1088 lun 0 (pass3)
<HP EG001200JWFVA HPD3> at scbus1 target 64 lun 0 (pass4,da2)
<HP EG001200JWFVA HPD3> at scbus1 target 65 lun 0 (pass5,da3)
<HP EG001200JWFVA HPD3> at scbus1 target 66 lun 0 (pass6,da4)
<HP EG001200JWFVA HPD3> at scbus1 target 67 lun 0 (pass7,da5)
<HP EG001200JWFVA HPD3> at scbus1 target 68 lun 0 (pass8,da6)
<HP EG001200JWFVA HPD3> at scbus1 target 69 lun 0 (pass9,da7)
<HP EG001200JWFVA HPD3> at scbus1 target 70 lun 0 (pass10,da8)
<HP EG001200JWFVA HPD3> at scbus1 target 71 lun 0 (pass11,da9)
<HPE Smart Adapter 1.34> at scbus1 target 187 lun 0 (pass12,ses1)
<HPE P408i-a SR Gen10 1.34> at scbus1 target 1088 lun 0 (pass13)
<JetFlash Transcend 8GB 1.00> at scbus2 target 0 lun 0 (da10,pass14)
<Generic- SD/MMC CRW 1.00> at scbus3 target 0 lun 0 (da11,pass15)

We get very frequent kernel panics. The server is receiving syslogs via syslog-ng314-3.14.1_1
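Because the controller firmware revision (1.34 here, later 1.60) comes up repeatedly in this report, it can help to pull that field out of `camcontrol devlist` in a form that is easy to compare across hosts. The function below is a minimal sketch: `devlist_revision` is a hypothetical name, and it assumes the one-device-per-line output format shown above, where the revision is the last whitespace-separated token inside the angle brackets.

```shell
#!/bin/sh
# Hypothetical helper, not part of FreeBSD.

# devlist_revision PATTERN
# Reads `camcontrol devlist` output on stdin. For every line whose
# <vendor product revision> field matches PATTERN (a basic regular
# expression without "/"), prints the last token inside the brackets.
devlist_revision() {
    sed -n "/<[^>]*$1[^>]*>/s/^[^<]*<[^>]* \([^ >]*\)>.*/\1/p"
}

# Usage on a FreeBSD host:
#   camcontrol devlist | devlist_revision 'SR Gen10'
```

Run against the listing above with the pattern `SR Gen10`, this would print one line per controller, each carrying that controller's firmware revision.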