Bug 231296 - smartpqi - and/or ZFS - related kernel panics
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 11.2-RELEASE
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: Mark Johnston
URL:
Keywords: crash
Depends on:
Blocks:
 
Reported: 2018-09-11 04:31 UTC by rainer
Modified: 2022-11-01 23:53 UTC
CC List: 7 users

See Also:
koobs: mfc-stable12+
koobs: mfc-stable11+


Attachments
pic of kernel panic (124.78 KB, image/png)
2018-09-11 04:31 UTC, rainer
core_txt_smartpqi (5.45 KB, text/plain)
2019-01-21 16:27 UTC, Orbello Charles

Description rainer 2018-09-11 04:31:03 UTC
Created attachment 197020
pic of kernel panic

Hi,

this is an HPE DL380 Gen10 system.

smartpqi0: <E208i-p SR Gen10> port 0x4000-0x40ff mem 0xe2800000-0xe2807fff at device 0.0 numa-domain 0 on pci4
smartpqi0: using MSI-X interrupts (16 vectors)
smartpqi1: <P408i-a SR Gen10> port 0xc000-0xc0ff mem 0xf3800000-0xf3807fff at device 0.0 numa-domain 0 on pci9
smartpqi1: using MSI-X interrupts (16 vectors)


(server </root>) 0 # camcontrol devlist
<HP EG000300JWFVB HPD2>            at scbus0 target 64 lun 0 (pass0,da0)
<HP EG000300JWFVB HPD2>            at scbus0 target 66 lun 0 (pass1,da1)
<HPE Smart Adapter 1.34>           at scbus0 target 187 lun 0 (pass2,ses0)
<HPE E208i-p SR Gen10 1.34>        at scbus0 target 1088 lun 0 (pass3)
<HP EG001200JWFVA HPD3>            at scbus1 target 64 lun 0 (pass4,da2)
<HP EG001200JWFVA HPD3>            at scbus1 target 65 lun 0 (pass5,da3)
<HP EG001200JWFVA HPD3>            at scbus1 target 66 lun 0 (pass6,da4)
<HP EG001200JWFVA HPD3>            at scbus1 target 67 lun 0 (pass7,da5)
<HP EG001200JWFVA HPD3>            at scbus1 target 68 lun 0 (pass8,da6)
<HP EG001200JWFVA HPD3>            at scbus1 target 69 lun 0 (pass9,da7)
<HP EG001200JWFVA HPD3>            at scbus1 target 70 lun 0 (pass10,da8)
<HP EG001200JWFVA HPD3>            at scbus1 target 71 lun 0 (pass11,da9)
<HPE Smart Adapter 1.34>           at scbus1 target 187 lun 0 (pass12,ses1)
<HPE P408i-a SR Gen10 1.34>        at scbus1 target 1088 lun 0 (pass13)
<JetFlash Transcend 8GB 1.00>      at scbus2 target 0 lun 0 (da10,pass14)
<Generic- SD/MMC CRW 1.00>         at scbus3 target 0 lun 0 (da11,pass15)


We get very frequent kernel panics.


The server is receiving syslogs via syslog-ng314-3.14.1_1
Comment 1 Yuri Pankov 2018-09-11 04:41:31 UTC
Just for the record (I have no idea whether it's related or whether there is a relevant firmware update from HPE): the 1.34 firmware you seem to be running was unstable for me as well with a Microsemi HBA 1100-8i. Updating to 1.60 from the Microsemi site solved it.
Comment 2 rainer 2018-09-11 08:54:03 UTC
You are right, there is an update on HPE's website.

Unfortunately, it's not yet part of an SPP.
So I'll have to figure out a way to install it.

Thanks a lot.
Comment 3 rainer 2018-10-01 08:30:35 UTC
After updating the firmware, I still get panics.

The handbook should be clearer about the fact that you can't get a crashdump from ZFS.
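
For context, a minimal sketch of the rc.conf settings involved when the dump device has to live outside ZFS, for example on a swap partition on a separate drive. The device name /dev/da10p1 is a hypothetical placeholder; the actual partition is not given in this report.

```sh
# /etc/rc.conf fragment (sketch). The device name is an assumption.
dumpdev="/dev/da10p1"   # dump to a plain swap partition, not to ZFS
dumpdir="/var/crash"    # where savecore(8) writes vmcore.N after the next boot
```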

After adding an additional swap partition on a USB drive, I got this crash dump:

(ewserv-log03-prod </root>) 0 # kgdb /boot/kernel/kernel /var/crash/vmcore.1
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...

Unread portion of the kernel message buffer:


Fatal trap 12: page fault while in kernel mode
cpuid = 3; apic id = 03
fault virtual address	= 0x5a
fault code		= supervisor read data, page not present
instruction pointer	= 0x20:0xffffffff80dff90d
stack pointer	        = 0x28:0xfffffe084ed93f00
frame pointer	        = 0x28:0xfffffe084ed93f40
code segment		= base rx0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 0 (zio_write_issue_10)
trap number		= 12
panic: page fault
cpuid = 3
KDB: stack backtrace:
#0 0xffffffff80b3d567 at kdb_backtrace+0x67
#1 0xffffffff80af6b07 at vpanic+0x177
#2 0xffffffff80af6983 at panic+0x43
#3 0xffffffff80f77fcf at trap_fatal+0x35f
#4 0xffffffff80f78029 at trap_pfault+0x49
#5 0xffffffff80f777f7 at trap+0x2c7
#6 0xffffffff80f57dac at calltrap+0x8
#7 0xffffffff80dee7e2 at kmem_back+0xf2
#8 0xffffffff80dee6c0 at kmem_malloc+0x60
#9 0xffffffff80de6172 at keg_alloc_slab+0xe2
#10 0xffffffff80de8b7e at keg_fetch_slab+0x14e
#11 0xffffffff80de83b4 at zone_fetch_slab+0x64
#12 0xffffffff80de848f at zone_import+0x3f
#13 0xffffffff80de4b99 at uma_zalloc_arg+0x3d9
#14 0xffffffff82351ab2 at zio_write_compress+0x1e2
#15 0xffffffff8235074c at zio_execute+0xac
#16 0xffffffff80b4ed74 at taskqueue_run_locked+0x154
#17 0xffffffff80b4fed8 at taskqueue_thread_loop+0x98
Uptime: 40m34s
Dumping 5489 out of 32379 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%

Reading symbols from /boot/kernel/geom_mirror.ko...Reading symbols from /usr/lib/debug//boot/kernel/geom_mirror.ko.debug...done.
done.
Loaded symbols for /boot/kernel/geom_mirror.ko
Reading symbols from /boot/kernel/zfs.ko...Reading symbols from /usr/lib/debug//boot/kernel/zfs.ko.debug...done.
done.
Loaded symbols for /boot/kernel/zfs.ko
Reading symbols from /boot/kernel/opensolaris.ko...Reading symbols from /usr/lib/debug//boot/kernel/opensolaris.ko.debug...done.
done.
Loaded symbols for /boot/kernel/opensolaris.ko
Reading symbols from /boot/kernel/accf_data.ko...Reading symbols from /usr/lib/debug//boot/kernel/accf_data.ko.debug...done.
done.
Loaded symbols for /boot/kernel/accf_data.ko
Reading symbols from /boot/kernel/accf_http.ko...Reading symbols from /usr/lib/debug//boot/kernel/accf_http.ko.debug...done.
done.
Loaded symbols for /boot/kernel/accf_http.ko
Reading symbols from /boot/kernel/cc_htcp.ko...Reading symbols from /usr/lib/debug//boot/kernel/cc_htcp.ko.debug...done.
done.
Loaded symbols for /boot/kernel/cc_htcp.ko
Reading symbols from /boot/kernel/ums.ko...Reading symbols from /usr/lib/debug//boot/kernel/ums.ko.debug...done.
done.
Loaded symbols for /boot/kernel/ums.ko
Reading symbols from /boot/kernel/tmpfs.ko...Reading symbols from /usr/lib/debug//boot/kernel/tmpfs.ko.debug...done.
done.
Loaded symbols for /boot/kernel/tmpfs.ko
#0  0xffffffff80af68fb in doadump (textdump=0) at /usr/src/sys/kern/kern_shutdown.c:309
309		if (dumping)
(kgdb) bt
#0  0xffffffff80af68fb in doadump (textdump=0) at /usr/src/sys/kern/kern_shutdown.c:309
#1  0xffffffff80af6925 in doadump (textdump=<value optimized out>) at /usr/src/sys/kern/kern_shutdown.c:315
#2  0xffffffff80af671b in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:382
#3  0xffffffff80af6b41 in vpanic (fmt=<value optimized out>, ap=0xfffffe084ed93c50) at /usr/src/sys/kern/kern_shutdown.c:769
#4  0xffffffff80af6983 in panic (fmt=0x0) at /usr/src/sys/kern/kern_shutdown.c:706
#5  0xffffffff80f77fcf in trap_fatal (frame=0xfffffe084ed93e40, eva=90) at /usr/src/sys/amd64/amd64/trap.c:875
#6  0xffffffff80f78029 in trap_pfault (frame=0xfffffe084ed93e40, usermode=0) at /usr/src/sys/amd64/amd64/trap.c:712
#7  0xffffffff80f777f7 in trap (frame=0xfffffe084ed93e40) at /usr/src/sys/amd64/amd64/trap.c:514
#8  0xffffffff80f57dac in Xtss_pti () at /usr/src/sys/amd64/amd64/exception.S:159
#9  0xffffffff80dff90d in vm_page_rename (m=0x3ff, new_object=0xfffff80018d8d000, new_pindex=<value optimized out>) at /usr/src/sys/vm/vm_page.c:1342
#10 0xffffffff80dee7e2 in kmem_suballoc (parent=0x262, min=0x14000, max=0xffffffff81ebc558, size=874980, superpage_align=<value optimized out>) at /usr/src/sys/vm/vm_kern.c:290
#11 0xffffffff80dee6c0 in kmem_alloc_contig (vmem=0xfffffe00d59d0000, size=18446744071594296576, flags=<value optimized out>, low=18446735303990395200, high=257, alignment=18446735278033391616, 
    boundary=18446735278033391616, memattr=-16 '\360') at /usr/src/sys/vm/vm_kern.c:254
#12 0xffffffff80de6172 in uma_prealloc (zone=0x0, items=1322860228) at /usr/src/sys/vm/uma_core.c:3150
#13 0xfffff806240140f0 in ?? ()
#14 0xfffffe00c51f357e in ?? ()
#15 0xfffffe00d59b0000 in ?? ()
#16 0xfffff8000d460498 in ?? ()
#17 0xfffff80624014140 in ?? ()
#18 0x02fffe00c520c000 in ?? ()
#19 0xfffff8000d460480 in ?? ()
#20 0xfffff8000d4641c0 in ?? ()
#21 0x0000000000000000 in ?? ()
Current language:  auto; currently minimal


Can anyone make sense of that?
Comment 4 rainer 2018-10-02 19:11:59 UTC
BTW: I've been running memtest86 v7.5 (the free edition of the commercial version that does UEFI) on this machine for 8 hours and it showed no errors.
Comment 5 Josh Gitlin 2018-10-03 14:38:54 UTC
I have experienced nearly the same issue, and requested help from the freebsd-fs list as I thought it might have been related to a kernel change or misconfiguration (even though the config we were using had not changed)

See: https://lists.freebsd.org/pipermail/freebsd-fs/2018-September/026725.html

The panic stack trace we saw was exactly the same, and it happened under ZFS load (but not unusually high load, not higher than we've seen in production before).
Comment 6 rainer 2018-10-04 06:55:56 UTC
OK.
This is a setting that I have in my sysctl.conf.local but commented out by default (because not all hosts use ZFS and I somehow thought that it's only needed on hosts that do other stuff).

I stumbled upon this PR [1], too, a while ago and adjusted the setting on my ZFS hosts.

Just not on this one, because this one isn't supposed to run much else. Other hosts run mysql and/or apache+php+nginx, etc.


[1]
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=229764
or rather, I took my settings from this one:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=163461
Comment 7 rainer 2018-10-04 06:57:48 UTC
At least, it ran through the night.
Comment 8 Deepak Ukey 2018-10-04 11:05:26 UTC
Hi,

Can you please tell me how to reproduce the issue, or what steps cause this panic?

Also, can you please tell me which firmware version you are using for the E208i-p SR Gen10 / P408i-a SR Gen10 cards, so that I can try to reproduce this on my setup and help you resolve it?

Thanks.
Comment 9 rainer 2018-10-04 12:18:40 UTC
Hi,

Firmware revision is 1.60 (from HPE website).

But it seems it is an ARC problem that just did not materialize on my other servers because ARC was limited there already, but is actually pretty widespread.

Also, one of the first panics we got had the driver-name in the backtrace somewhere - but that was on the old firmware.

I was notified of this PR privately:

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=231794

which seems to describe a similar problem.
Comment 10 Andriy Gapon freebsd_committer freebsd_triage 2018-10-04 13:25:37 UTC
(In reply to rainer from comment #9)
The problem might be similar but it is certainly different.
In the other bug they are getting a panic (unfortunately the panic message is not shown), while you are getting a fatal trap / page fault.

Also, in your case there are no ARC calls in the stack trace.  It's straight from the ZIO code to the VM code.  So, it's quite possible that ARC contributes to the problem (e.g., by creating memory pressure or some such), but there is a bug in kmem_back / kmem_malloc.

Finally, in comment #3 the stack trace recorded by ddb and the stack trace shown by kgdb do not match.  I suspect that that is because you passed a wrong kernel to kgdb or /usr/lib/debug/boot/kernel does not match /boot/kernel.
Comment 11 rainer 2018-10-04 14:46:52 UTC
Hi,

I compiled a kernel myself with 
make buildkernel && make installkernel

I had thought the debug kernel lived next to the kernel in /boot/kernel...


(ewserv-log03-prod </root>) 0 # uname -a
FreeBSD ewserv-log03-prod.everyware.zone 11.2-RELEASE-p4 FreeBSD 11.2-RELEASE-p4 #0: Fri Sep 28 16:37:02 CEST 2018     root@ewserv-log03-prod.everyware.zone:/usr/obj/usr/src/sys/GENERIC  amd64
(ewserv-log03-prod </root>) 0 # ll /usr/lib/debug/boot/kernel/kernel.debug 
-r-xr-xr-x  1 root  wheel  86179448 Sep 28 16:37 /usr/lib/debug/boot/kernel/kernel.debug
(ewserv-log03-prod </root>) 0 # ll /boot/kernel/kernel 
-r-xr-xr-x  1 root  wheel  27781528 Sep 28 16:37 /boot/kernel/kernel

because I wasn't sure if the default kernel package contains a kernel with debug-symbols.


What is the correct way to get a kernel with debug-symbols?

I can reboot and run my tests again without the ARC reduction, to make sure this is the kernel that is producing the crashdump. It needed less than an hour to lock up.

We would like to get this server back into production, but for now I can do whatever is necessary to solve this problem (apart from allowing direct logins - I'd have to wipe it)
Comment 12 Josh Gitlin 2018-10-04 14:56:08 UTC
(In reply to Andriy Gapon from comment #10)

> it's quite possible that ARC contributes to the problem but
> there is a bug in kmem_back / kmem_malloc.

This is what I felt as well when reading the source. I didn't see any specific out of memory error, but rather a page fault which (to my untrained eye) looked like the kernel trying to access a KVA page that did not exist. But I was very unsure of my theory that it was a bug as opposed to a misconfiguration.

What I found odd was that we had crashes on production systems where the config in place hadn't changed in years...
Comment 13 rainer 2018-10-23 15:59:24 UTC
Is this one of the issues that is mentioned here:

https://www.freebsdfoundation.org/news-and-events/newsletter/freebsd-foundation-update-september-2018/

"Of late I have also been investigating reports of issues with ZFS, particularly, those reported on FreeBSD 11.2.  A number of regressions, including a kernel memory leak and issues with ARC reclamation, have already been fixed for 12.0; "
Comment 14 Mark Johnston freebsd_committer freebsd_triage 2018-10-23 16:10:05 UTC
(In reply to rainer from comment #13)
Not yet.  A few days ago someone reported a similar panic on the lists.  I asked for some further info and didn't get any reply.

If you or anyone else can reproduce this problem, I'd at least like to see a backtrace with line numbers (the stack in comment 3 is bogus).  Ideally, I'd be able to take a look at a vmcore.  For this, I also need a copy of /boot/kernel and /usr/lib/debug/boot/kernel corresponding to the running kernel.
Comment 15 rainer 2018-10-29 11:37:44 UTC
OK.

I've "fixed" this by adding more memory and reducing ARC, so that I have less % wired.

I will see what I can do.
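
A hedged sketch of what "reducing ARC" can look like in practice: the helper below turns a physical memory size and a percentage into a byte value for the `vfs.zfs.arc_max` tunable. The function name `arc_cap_bytes` and the 25% figure are illustrative assumptions, not values taken from this report.

```sh
# arc_cap_bytes PHYSMEM_BYTES PERCENT
# Prints PERCENT percent of PHYSMEM_BYTES as a whole number of bytes.
arc_cap_bytes() {
    echo $(( $1 * $2 / 100 ))
}

# Example: cap the ARC at 25% of a machine with 64 GiB of RAM.
# On a live FreeBSD host the first argument could come from
# `sysctl -n hw.physmem` instead of a literal.
cap=$(arc_cap_bytes 68719476736 25)
echo "vfs.zfs.arc_max=\"${cap}\""    # candidate line for /boot/loader.conf
```

As noted later in this thread, a cap like this only makes the out-of-free-pages condition harder to hit. It does not fix the underlying bug.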
Comment 16 rainer 2018-11-07 00:21:11 UTC
OK, as soon as some load is on the system (in this case, 10 Mbit of UDP syslog traffic), it panics again.

supervisor write data, page not present.

How do I install and activate the debug-kernel from the kernel-dbg.txz archive?
Comment 17 Mark Johnston freebsd_committer freebsd_triage 2018-11-07 00:27:48 UTC
(In reply to rainer from comment #16)
kernel-dbg just contains kernel symbols.  If you crash a kernel and get a vmcore in /var/crash, then the symbols can be used to examine the vmcore.  The kernel-dbg set needs to match the kernel from the corresponding kernel.txz archive.
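
As a sketch of the step described above: the symbol archive is unpacked over the root of the filesystem, so that the .debug files end up under /usr/lib/debug (the path seen in the symbol-loading messages earlier in this report). This assumes the archive stores paths relative to /. The function name `install_kernel_dbg` and the paths in the example are hypothetical.

```sh
# install_kernel_dbg ARCHIVE [DESTDIR]
# Unpacks a kernel-dbg archive under DESTDIR (default: /).
# The archive must come from the same build as the running kernel.
install_kernel_dbg() {
    tar -xpf "$1" -C "${2:-/}"
}

# Example (paths are hypothetical):
#   install_kernel_dbg /tmp/kernel-dbg.txz
#   kgdb /boot/kernel/kernel /var/crash/vmcore.last
```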
Comment 18 rainer 2018-11-07 00:45:12 UTC
log02-prod </root>) 0 # kgdb /boot/kernel/kernel /var/crash/vmcore.last
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...

Unread portion of the kernel message buffer:


Fatal trap 12: page fault while in kernel mode
cpuid = 2; apic id = 02
fault virtual address	= 0x0
fault code		= supervisor write data, page not present
instruction pointer	= 0x20:0xffffffff80aacfae
stack pointer	        = 0x28:0xfffffe103c5e8f60
frame pointer	        = 0x28:0xfffffe103c5e9130
code segment		= base rx0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 87490 (syslog-ng)
trap number		= 12
panic: page fault
cpuid = 2
KDB: stack backtrace:
#0 0xffffffff80b3d587 at kdb_backtrace+0x67
#1 0xffffffff80af6b27 at vpanic+0x177
#2 0xffffffff80af69a3 at panic+0x43
#3 0xffffffff80f77fdf at trap_fatal+0x35f
#4 0xffffffff80f78039 at trap_pfault+0x49
#5 0xffffffff80f77807 at trap+0x2c7
#6 0xffffffff80f57fbc at calltrap+0x8
#7 0xffffffff80aac905 at kern_kevent+0xb5
#8 0xffffffff80aac794 at sys_kevent+0xa4
#9 0xffffffff80f79068 at amd64_syscall+0xa38
#10 0xffffffff80f5880d at fast_syscall_common+0x101
Uptime: 45m56s
Dumping 2959 out of 65147 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%

Reading symbols from /boot/kernel/geom_mirror.ko...Reading symbols from /usr/lib/debug//boot/kernel/geom_mirror.ko.debug...done.
done.
Loaded symbols for /boot/kernel/geom_mirror.ko
Reading symbols from /boot/kernel/zfs.ko...Reading symbols from /usr/lib/debug//boot/kernel/zfs.ko.debug...done.
done.
Loaded symbols for /boot/kernel/zfs.ko
Reading symbols from /boot/kernel/opensolaris.ko...Reading symbols from /usr/lib/debug//boot/kernel/opensolaris.ko.debug...done.
done.
Loaded symbols for /boot/kernel/opensolaris.ko
Reading symbols from /boot/kernel/accf_data.ko...Reading symbols from /usr/lib/debug//boot/kernel/accf_data.ko.debug...done.
done.
Loaded symbols for /boot/kernel/accf_data.ko
Reading symbols from /boot/kernel/accf_http.ko...Reading symbols from /usr/lib/debug//boot/kernel/accf_http.ko.debug...done.
done.
Loaded symbols for /boot/kernel/accf_http.ko
Reading symbols from /boot/kernel/cc_htcp.ko...Reading symbols from /usr/lib/debug//boot/kernel/cc_htcp.ko.debug...done.
done.
Loaded symbols for /boot/kernel/cc_htcp.ko
Reading symbols from /boot/kernel/tmpfs.ko...Reading symbols from /usr/lib/debug//boot/kernel/tmpfs.ko.debug...done.
done.
Loaded symbols for /boot/kernel/tmpfs.ko
#0  doadump (textdump=<value optimized out>) at pcpu.h:229
229	pcpu.h: No such file or directory.
	in pcpu.h
(kgdb) bt
#0  doadump (textdump=<value optimized out>) at pcpu.h:229
#1  0xffffffff80af673b in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:383
#2  0xffffffff80af6b61 in vpanic (fmt=<value optimized out>, ap=<value optimized out>) at /usr/src/sys/kern/kern_shutdown.c:776
#3  0xffffffff80af69a3 in panic (fmt=<value optimized out>) at /usr/src/sys/kern/kern_shutdown.c:707
#4  0xffffffff80f77fdf in trap_fatal (frame=0xfffffe103c5e8ea0, eva=0) at /usr/src/sys/amd64/amd64/trap.c:875
#5  0xffffffff80f78039 in trap_pfault (frame=0xfffffe103c5e8ea0, usermode=0) at pcpu.h:229
#6  0xffffffff80f77807 in trap (frame=0xfffffe103c5e8ea0) at /usr/src/sys/amd64/amd64/trap.c:415
#7  0xffffffff80f57fbc in calltrap () at /usr/src/sys/amd64/amd64/exception.S:231
#8  0xffffffff80aacfae in kqueue_kevent (kq=0xfffff801cb5a4a00, td=0xfffff801e3b34000, nchanges=<value optimized out>, nevents=1, k_ops=0xfffffe103c5e91c0, timeout=0xfffffe1000000000) at /usr/src/sys/kern/kern_event.c:1623
#9  0xffffffff80aac905 in kern_kevent (td=0xfffff801e3b34000, fd=<value optimized out>, nchanges=0, nevents=1, k_ops=0xfffffe103c5e91c0, timeout=0xfffffe103c5e91b0) at /usr/src/sys/kern/kern_event.c:1023
#10 0xffffffff80aac794 in sys_kevent (td=0xfffff801e3b34000, uap=0xfffff801e3b34538) at /usr/src/sys/kern/kern_event.c:901
#11 0xffffffff80f79068 in amd64_syscall (td=0xfffff801e3b34000, traced=0) at subr_syscall.c:132
#12 0xffffffff80f5880d in fast_syscall_common () at /usr/src/sys/amd64/amd64/exception.S:479
#13 0x000000080222b81a in ?? ()
Previous frame inner to this frame (corrupt stack?)
Current language:  auto; currently minimal
(kgdb)
Comment 19 rainer 2018-11-07 00:59:59 UTC
It had only been up for a couple of minutes at that point.
I had rm -rf'ed /var/db/freebsd-update/*
in order to update the debug-symbols to the current kernel-level.

And bang, it panic'ed again.

This is really weird.
But it shows that most of my systems apparently aren't really stressed at all.
Comment 20 Mark Johnston freebsd_committer freebsd_triage 2018-11-07 01:10:23 UTC
Hmm, this seems to be unrelated to ZFS - it's a panic in kevent().  Would you be willing to share the vmcore with me so that I can try and find the bug?  Please mail me if so.
Comment 21 Mark Johnston freebsd_committer freebsd_triage 2018-11-29 00:03:28 UTC
(In reply to Mark Johnston from comment #20)
I believe the kevent bug was fixed in r340734.

I think I know what's causing the kmem_back() panic as well.  Is the system using a Skylake CPU?  Please try the patch here:

https://people.freebsd.org/~markj/patches/blacklist_freecnt.diff
Comment 22 rainer 2018-11-29 08:29:59 UTC
It's a HP Gen10, using Xeon Scalable Processors.
So, yes, it's Skylake (Xeon variant).

x86info v1.31pre
/dev/cpuctl0: No such file or directory
Found 16 identical CPUs
Extended Family: 0 Extended Model: 5 Family: 6 Model: 85 Stepping: 4
Type: 0 (Original OEM)
CPU Model (x86info's best guess): Core i7 (Skylake-X)
Processor name string (BIOS programmed): Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz

Total processor threads: 16
This system has 1 eight-core processor with hyper-threading (2 threads per core) running at an estimated 2.10GHz


I've applied your first patch, but have not yet been able to find the time to put the system under load.
Should I apply your second patch without testing the first patch?
Comment 23 Josh Gitlin 2018-11-29 15:09:46 UTC
We experienced the kmem_back issues as well, and we're running a Skylake: Intel® Xeon® Bronze 3106 Processor

We haven't seen the issue since reducing ARC size, but the same config had been stable for years prior to that. (We did change hardware however)

Backtrace was: 

trap number = 12
panic: page fault
cpuid = 6
KDB: stack backtrace:
#0 0xffffffff80b3d567 at kdb_backtrace+0x67
#1 0xffffffff80af6b07 at vpanic+0x177
#2 0xffffffff80af6983 at panic+0x43
#3 0xffffffff80f77fcf at trap_fatal+0x35f
#4 0xffffffff80f78029 at trap_pfault+0x49
#5 0xffffffff80f777f7 at trap+0x2c7
#6 0xffffffff80f57dac at calltrap+0x8
#7 0xffffffff80dee7e2 at kmem_back+0xf2
#8 0xffffffff80dee6c0 at kmem_malloc+0x60
#9 0xffffffff80de6172 at keg_alloc_slab+0xe2
#10 0xffffffff80de8b7e at keg_fetch_slab+0x14e
#11 0xffffffff80de8364 at zone_fetch_slab+0x64
#12 0xffffffff80de848f at zone_import+0x3f
#13 0xffffffff80de4b99 at uma_zalloc_arg+0x3d9
#14 0xffffffff826e6ab2 at zio_write_compress+0x1e2
#15 0xffffffff826e574c at zio_execute+0xac
#16 0xffffffff80b4ed74 at taskqueue_run_locked+0x154
#17 0xffffffff80b4fed8 at taskqueue_thread_loop+0x98
Uptime: 18d18h31m6s
mpr0: Sending StopUnit: path (xpt0:mpr0:0:10:ffffffff): handle 10 
mpr0: Incrementing SSU count
mpr0: Sending StopUnit: path (xpt0:mpr0:0:13:ffffffff): handle 13 
mpr0: Incrementing SSU count
mpr0: Sending StopUnit: path (xpt0:mpr0:0:16:ffffffff): handle 16 
mpr0: Incrementing SSU count
Comment 24 Mark Johnston freebsd_committer freebsd_triage 2018-11-29 15:19:40 UTC
(In reply to rainer from comment #22)
I think it'd be fine to just test both patches.  They address completely unrelated issues.

(In reply to Josh Gitlin from comment #23)
Reducing the ARC size most likely just hid the bug again.  It is triggered when the system is completely out of free pages, but if you're restricting the ARC it'll be harder to hit that condition.
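
Since the trigger is the system running out of free pages, a rough way to see how close a host gets is to convert the VM counters into megabytes. The sketch below assumes the `vm.stats.vm.v_free_count` and `hw.pagesize` sysctls as inputs; the helper name `free_mb` is illustrative.

```sh
# free_mb FREE_PAGES PAGE_SIZE_BYTES
# Prints the amount of free memory in MiB, rounded down.
free_mb() {
    echo $(( $1 * $2 / 1048576 ))
}

# On a live FreeBSD host (not executed here):
#   free_mb "$(sysctl -n vm.stats.vm.v_free_count)" "$(sysctl -n hw.pagesize)"
free_mb 25600 4096    # 25600 pages of 4 KiB each; prints 100
```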
Comment 25 Josh Gitlin 2018-11-29 15:22:39 UTC
(In reply to Mark Johnston from comment #24)

That makes complete sense. The servers we saw this on are iSCSI storage servers in production so we can't test a patch on them unfortunately (and I'd prefer they _not_ be out of free pages anyway)
Comment 26 commit-hook freebsd_committer freebsd_triage 2018-11-29 16:31:47 UTC
A commit references this bug:

Author: markj
Date: Thu Nov 29 16:31:02 UTC 2018
New revision: 341247
URL: https://svnweb.freebsd.org/changeset/base/341247

Log:
  Update the free page count when blacklisting pages.

  Otherwise the free page count will not accurately reflect the physical
  page allocator's state.  On 11 this can trigger panics in
  vm_page_alloc() since the allocator state and free page count are
  updated atomically and we expect them to stay in sync.  On 12 the
  bug would manifest as threads looping in vm_page_alloc().

  PR:		231296
  Reported by:	mav, wollman, Rainer Duffner, Josh Gitlin
  Reviewed by:	alc, kib, mav
  MFC after:	3 days
  Sponsored by:	The FreeBSD Foundation
  Differential Revision:	https://reviews.freebsd.org/D18374

Changes:
  head/sys/vm/vm_page.c
Comment 27 commit-hook freebsd_committer freebsd_triage 2018-11-29 17:36:40 UTC
A commit references this bug:

Author: markj
Date: Thu Nov 29 17:35:37 UTC 2018
New revision: 341249
URL: https://svnweb.freebsd.org/changeset/base/341249

Log:
  MFC r341247:
  Update the free page count when blacklisting pages.

  PR:		231296
  Approved by:	re (gjb, insta-MFC)

Changes:
_U  stable/12/
  stable/12/sys/vm/vm_page.c
Comment 28 rainer 2018-12-01 17:33:59 UTC
Hi,

I assume this will not make it into 12.0-RELEASE?

I don't have my own infrastructure for releases and I try to only run releases, thus I take the binaries from the download-site + freebsd binary-updates.

It looks like I'll either have to wait for 11.3 or 12.1 or patch those in and re-patch every time there is an update to 11.2-RELEASE or 12.0-RELEASE.

Or could an errata-patch for 11.2 be released (as well as for 12.0)?

Technically, I'd have to patch all my 11.2 hosts for this otherwise. Or at least those with ZFS, so as not to hit this at some point in the future during high load.
Comment 29 Mark Johnston freebsd_committer freebsd_triage 2018-12-01 17:37:42 UTC
(In reply to rainer from comment #28)
No, the issue is fixed in 12.0.  I'm planning to ask for an EN for 11.2 as well.
Comment 30 rainer 2018-12-01 18:20:32 UTC
Thanks!
Comment 31 commit-hook freebsd_committer freebsd_triage 2018-12-02 18:08:59 UTC
A commit references this bug:

Author: markj
Date: Sun Dec  2 18:08:27 UTC 2018
New revision: 341401
URL: https://svnweb.freebsd.org/changeset/base/341401

Log:
  MFC r341247:
  Update the free page count when blacklisting pages.

  PR:	231296

Changes:
_U  stable/11/
  stable/11/sys/vm/vm_page.c
Comment 32 commit-hook freebsd_committer freebsd_triage 2018-12-19 18:02:04 UTC
A commit references this bug:

Author: emaste
Date: Wed Dec 19 18:00:59 UTC 2018
New revision: 342225
URL: https://svnweb.freebsd.org/changeset/base/342225

Log:
  MFS11 r341401: Update the free page count when blacklisting pages.

  PR:		231296
  Submitted by:	markj
  Approved by:	so
  Security:	FreeBSD-EN-18:17.vm
  Sponsored by:	The FreeBSD Foundation

Changes:
_U  releng/11.2/
  releng/11.2/sys/vm/vm_page.c
Comment 33 Orbello Charles 2019-01-21 16:26:43 UTC
Hello

It's possible that the problem is not completely resolved.
http://download.adaptec.com/pdfs/readme/microsemi_smarthba_smartraid_relnotes_1_2019.pdf

Page 6
"Fixed an issue where controller hangs when running I/O with pending DDR cache retries."

Driver :
https://storage.microsemi.com/en-us/speed/raid/aac/unix/smartpqi_freebsd_v1.0.3-2035_tgz.php

Firmware (not available in the SPP of HPE) 
https://storage.microsemi.com/en-us/speed/raid/asr/fw_bios/microsemi_smartfwx100_v1_98_b0_zip.php

On my side, I can reproduce the crash on 11.2 and 12.0 (r342545) with a fio run on a raidz2 (8 devices) on the same hardware, an HP DL380 Gen10.

[benchmark1]
name=benchmark1
rw=write
numjobs=10
size=100G

The core.txt is attached.
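
For anyone trying to reproduce this, a sketch of running the job quoted above from a file. The job options are the ones from this comment; the `directory` value (a dataset mounted at /tank/fio) and the helper name `write_fio_job` are assumptions, since the comment does not say where the files were written.

```sh
# write_fio_job JOBFILE TARGET_DIR
# Writes the job quoted above to JOBFILE, adding only a target directory.
write_fio_job() {
    cat > "$1" <<EOF
[benchmark1]
name=benchmark1
rw=write
numjobs=10
size=100G
directory=$2
EOF
}

# Example (hypothetical paths):
#   write_fio_job /tmp/benchmark1.fio /tank/fio
#   fio /tmp/benchmark1.fio
```

With numjobs=10 and size=100G, each job should get its own 100G file, so the target would need about 1 TB of free space.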
Comment 34 Orbello Charles 2019-01-21 16:27:20 UTC
Created attachment 201312
core_txt_smartpqi
Comment 35 Mark Johnston freebsd_committer freebsd_triage 2019-01-21 17:04:36 UTC
(In reply to Orbello Charles from comment #33)
This is distinct from the other panics reported in this PR, which were not specific to smartpqi(4).  Please open a new PR.
Comment 36 rainer 2019-09-03 09:13:41 UTC
Latest SPP (2019.03.01) does contain that firmware-fix.