Bug 258049 - vm_reserv_depopulate panic.
Summary: vm_reserv_depopulate panic.
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: CURRENT
Hardware: Any
OS: Any
Importance: --- Affects Some People
Assignee: freebsd-bugs (Nobody)
Reported: 2021-08-25 20:50 UTC by dgilbert
Modified: 2021-08-30 16:12 UTC
CC List: 2 users

Description dgilbert 2021-08-25 20:50:37 UTC
I experienced this crash on riscv (while doing a makeworld -j8), but I'm told it has been observed on amd64, too.  This is the panic:

panic: vm_reserv_depopulate: reserv 0xffffffd3e672c560's popmap[208] is clear
cpuid = 1
time = 1629858147
KDB: stack backtrace:
db_trace_self() at db_trace_self
db_trace_self_wrapper() at db_trace_self_wrapper+0x38
kdb_backtrace() at kdb_backtrace+0x2c
vpanic() at vpanic+0x148
panic() at panic+0x2a
vm_reserv_free_page() at vm_reserv_free_page+0x37c
vm_page_free_prep() at vm_page_free_prep+0x168
vm_page_free_toq() at vm_page_free_toq+0x18
vm_page_free() at vm_page_free+0x18
vm_object_terminate() at vm_object_terminate+0xec
vm_object_deallocate() at vm_object_deallocate+0x2a6
vm_map_process_deferred() at vm_map_process_deferred+0x9c
vm_map_remove() at vm_map_remove+0xd2
vmspace_exit() at vmspace_exit+0x10e
exit1() at exit1+0x4aa
sys_sys_exit() at sys_sys_exit+0x10
do_trap_user() at do_trap_user+0x208
cpu_exception_handler_user() at cpu_exception_handler_user+0x72

The dump files can be fetched at https://nextcloud.towernet.ca/s/HiJ654HLWnbT6jD
Comment 1 Jessica Clarke 2021-08-25 21:10:29 UTC
Re observed on amd64: Google has 5 hits for `"vm_reserv_depopulate" "is clear"` (I assume this bug report will shortly appear too). All but one of the hits are just for the code itself; the final one is this 5-year-old GitHub Gist, https://gist.github.com/nomadlogic/ba58e8fd01267fbf7a2fa4fcee29e2f7, which was for FreeBSD 12.0-CURRENT on amd64 using the old freebsd-base-graphics tree.

Looking at vm_reserv itself: whilst a caller can do stupid things and potentially cause crashes, my initial reading is that this KASSERT should be impossible to trigger no matter what the caller is doing.
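
For context, a minimal sketch of the invariant that KASSERT enforces, paraphrasing (not quoting) vm_reserv_depopulate() in sys/vm/vm_reserv.c; the names below are illustrative only. A reservation covers 512 base pages and tracks which of them are populated in a bitmap alongside a running count; freeing a page whose bit is already clear fires the assertion:

#include <assert.h>
#include <stdint.h>

#define NPOPMAP 8                 /* 8 * 64 bits = 512 pages per reservation */

struct reserv_sketch {
    uint64_t popmap[NPOPMAP];     /* one bit per populated page */
    int      popcnt;              /* must equal the number of set bits */
};

/* Free page `index' of the reservation; the assert mirrors the KASSERT. */
void
depopulate_sketch(struct reserv_sketch *rv, int index)
{
    uint64_t mask = 1ULL << (index % 64);

    /* The kernel panics with "popmap[%d] is clear" when this fails. */
    assert((rv->popmap[index / 64] & mask) != 0);
    rv->popmap[index / 64] &= ~mask;
    rv->popcnt--;
}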
Comment 2 Mark Johnston 2021-08-26 12:38:07 UTC
The vmcore file on its own isn't useful without a copy of the corresponding kernel (/boot/kernel) and debug files (/usr/lib/debug/boot/kernel).  It would be useful to see a dump of the vm_page and reservation in question:

(kgdb) frame 14
(kgdb) p/x *m
(kgdb) p/x *(vm_reserv_t)0xffffffd3e672c560
Comment 3 dgilbert 2021-08-26 14:12:17 UTC
I will get on this ... but it might be tomorrow.

I will run those commands _and_ I will upload those files.
Comment 4 dgilbert 2021-08-26 14:24:33 UTC
(kgdb) frame 14
#14 0xffffffc0005a8da0 in vm_page_free_prep (m=0xffffffd3f1012168) at /usr/src/sys/vm/vm_page.c:3842
warning: Source file is more recent than executable.
3842            if ((m->flags & PG_PCPU_CACHE) == 0 && vm_reserv_free_page(m))
(kgdb) p/x *m
$1 = {plinks = {q = {tqe_next = 0xffffffd3f10121d0, tqe_prev = 0xffffffd3f1012100}, s = {ss = {
        sle_next = 0xffffffd3f10121d0}}, memguard = {p = 0xffffffd3f10121d0, v = 0xffffffd3f1012100}, uma = {
      slab = 0xffffffd3f10121d0, zone = 0xffffffd3f1012100}}, listq = {tqe_next = 0xffffffd3f10121d0,
    tqe_prev = 0xffffffd3f1012110}, object = 0x0, pindex = 0x2d0, phys_addr = 0x21f2d0000, md = {pv_list = {
      tqh_first = 0x0, tqh_last = 0xffffffd3f10121a0}, pv_gen = 0xf, pv_memattr = 0x2}, ref_count = 0x0,
  busy_lock = 0xfffffffe, a = {{flags = 0x18, queue = 0x1, act_count = 0x5}, _bits = 0x5010018}, order = 0xc,
  pool = 0x0, flags = 0x0, oflags = 0x0, psind = 0x0, segind = 0x1, valid = 0x0, dirty = 0x0}
(kgdb) p/x *(vm_reserv_t)0xffffffd3e672c560
$2 = {lock = {lock_object = {lo_name = 0xffffffc00066006c, lo_flags = 0x1030000, lo_data = 0x0,
      lo_witness = 0xffffffd3ffd8e180}, mtx_lock = 0xffffffc2227f7100}, partpopq = {tqe_next = 0xffffffd3e6756fe0,
    tqe_prev = 0xffffffd3e679c240}, objq = {le_next = 0xffffffd3e67b04a0, le_prev = 0xffffffd0a46be0c0},
  object = 0xffffffd0a46be000, pindex = 0x200, pages = 0xffffffd3f100cce8, popcnt = 0xef, domain = 0x0,
  inpartpopq = 0x1, lasttick = 0xa9d59012, popmap = {0x0, 0x0, 0x0, 0xffffffffff000000, 0xfffffffffffe040f,
    0x24fc0ffffc925927, 0xffffff0847fc9249, 0x1fffffffffffffff}}
Comment 5 dgilbert 2021-08-26 14:25:50 UTC
https://termbin.com/q8g9

for that last bit. Heh... my standard terminal window is 120 columns these days.
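
(A cross-check of the values dumped in comment 4, assuming the popmap index is the page's offset within the 512-page reservation, i.e. m->pindex - rv->pindex; that computation is an assumption for illustration, not quoted source:)

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    uint64_t m_pindex = 0x2d0, rv_pindex = 0x200;   /* from the dump */
    uint64_t popmap3 = 0xffffffffff000000ULL;       /* popmap word 3 */
    int index = (int)(m_pindex - rv_pindex);        /* 0xd0 = 208 */

    /* 208 / 64 = 3 and 208 % 64 = 16: bit 16 of word 3 is clear,
     * matching the "popmap[208] is clear" panic message. */
    printf("index %d: bit is %s\n", index,
        ((popmap3 >> (index % 64)) & 1) ? "set" : "clear");
    return (0);
}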
Comment 6 dgilbert 2021-08-26 14:30:09 UTC
I tarred /boot/kernel and /usr/lib/debug/boot/kernel into the nextcloud directory.  You can fetch them from the same place (https://nextcloud.towernet.ca/s/wPpj7zgxgDBAZ6q)
Comment 7 Mark Johnston 2021-08-26 16:09:08 UTC
(In reply to dgilbert from comment #6)
Thank you.  Is the panic reproducible at all?

(In reply to dgilbert from comment #4)
I don't see anything obviously inconsistent, except:

popcnt(0xffffffffff000000) +
popcnt(0xfffffffffffe040f) +
popcnt(0x24fc0ffffc925927) +
popcnt(0xffffff0847fc9249) +
popcnt(0x1fffffffffffffff) = 231

and rv->popcnt = 0xef = 239, which exceeds the number of set bits by 8...
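
That arithmetic can be double-checked mechanically (a quick userland sketch; __builtin_popcountll is the GCC/Clang builtin, and the words are copied from the dump in comment 4):

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    const uint64_t popmap[8] = {
        0x0, 0x0, 0x0,
        0xffffffffff000000ULL, 0xfffffffffffe040fULL,
        0x24fc0ffffc925927ULL, 0xffffff0847fc9249ULL,
        0x1fffffffffffffffULL
    };
    int total = 0;

    for (int i = 0; i < 8; i++)
        total += __builtin_popcountll(popmap[i]);
    printf("bits set = %d, rv->popcnt = %d\n", total, 0xef);  /* 231 vs. 239 */
    return (0);
}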
Comment 8 dgilbert 2021-08-26 19:17:03 UTC
make -j8 buildworld produced that ... but it's the only time it happened to me. make -j4 subsequently passed. There are 4 processors on the box. I can run a few more make -j8 builds on it.

Question, though: can I upgrade to the security patches, or should I continue to test on this week-or-two-old version?
Comment 9 Mark Johnston 2021-08-26 19:34:04 UTC
(In reply to dgilbert from comment #8)
I don't see any problem with updating first.
Comment 10 dgilbert 2021-08-29 21:02:33 UTC
Here's what I've found so far. If I make -j8 with ccache full of answers, we're fine. If I make -j4, we're fine. If I make -j8 with ccache empty (but being filled), then of 4 buildworld attempts so far, 3 have ended in a random crash and one in the panic you're looking at.

The code does compile (at -j4). The system does have ZFS running on an NVMe drive. I'm going to keep trying to trigger the panic, but my feeling is that this panic is only one of several possible outcomes of the underlying error.
Comment 11 Mark Johnston 2021-08-30 15:13:22 UTC
(In reply to dgilbert from comment #10)
It would be useful to see at least the panic message and stack trace from the other panics you've hit.
Comment 12 dgilbert 2021-08-30 16:12:19 UTC
To be precise: the count is one panic and 3 other crashes out of 4 total attempts at -j8.