273438 – PANIC: memory modified after free on relatively idle riscv -CURRENT

Status:	Open

Alias:	None

Product:	Base System
Classification:	Unclassified
Component:	kern
Version:	CURRENT
Hardware:	riscv Any

Importance:	--- Affects Many People
Assignee:	freebsd-riscv (Team)

URL:
Keywords:	crash

Depends on:
Blocks:

Reported:	2023-08-30 06:01 UTC by dgilbert
Modified:	2024-02-08 16:47 UTC
CC List:	2 users

See Also:

Attachments
core.txt file for crash (210.70 KB, text/plain) 2023-08-30 06:01 UTC, dgilbert
info file for crash (590 bytes, text/plain) 2023-08-30 06:02 UTC, dgilbert
Another crash on another day: the core.txt file. (215.73 KB, text/plain) 2023-09-08 18:46 UTC, dgilbert
another kernel core with a different signature. (198.32 KB, text/plain) 2024-01-27 20:44 UTC, dgilbert
Another core.txt file from the problem. (240.02 KB, text/plain) 2024-01-31 19:26 UTC, dgilbert
Core.txt from recent kernel. (282.39 KB, text/plain) 2024-02-08 16:47 UTC, dgilbert

Description dgilbert 2023-08-30 06:01:56 UTC

Created attachment 244458
core.txt file for crash

I have an Unmatched RISC-V board.  It has public v4 and v6 addresses and answers ssh and web requests, but was largely idle when this happened.  Here's the stack trace.  I will attach the core.txt.

FreeBSD 14.0-CURRENT riscv64 1400093 #0 main-n264297-b03012d0b600-dirty: Wed Jul 26 04:04:24 EDT 2023

cpuid = 2
time = 1693192346
KDB: stack backtrace:
db_trace_self() at db_trace_self
db_trace_self_wrapper() at db_trace_self_wrapper+0x36
kdb_backtrace() at kdb_backtrace+0x2c
vpanic() at vpanic+0x116
panic() at panic+0x26
trash_ctor() at trash_ctor+0x4a
item_ctor() at item_ctor+0xb8
uma_zalloc_arg() at uma_zalloc_arg+0xbc
.Lpcrel_hi1350() at .Lpcrel_hi1350+0xe
dbuf_create_bonus() at dbuf_create_bonus+0x44
.Lpcrel_hi44() at .Lpcrel_hi44+0x26
.Lpcrel_hi51() at .Lpcrel_hi51+0x26
.Lpcrel_hi284() at .Lpcrel_hi284+0x28
.Lpcrel_hi2() at .Lpcrel_hi2+0x76
.Lpcrel_hi10() at .Lpcrel_hi10+0x56
.Lpcrel_hi783() at .Lpcrel_hi783+0x3e
.Lpcrel_hi383() at .Lpcrel_hi383+0x4a
VOP_CACHEDLOOKUP_APV() at VOP_CACHEDLOOKUP_APV+0x32
vfs_cache_lookup() at vfs_cache_lookup+0xa4
VOP_LOOKUP_APV() at VOP_LOOKUP_APV+0x32
cache_fplookup_noentry() at cache_fplookup_noentry+0x1d4
cache_fplookup() at cache_fplookup+0x4e2
namei() at namei+0x144
kern_statat() at kern_statat+0xd6
sys_fstatat() at sys_fstatat+0x1c
do_trap_user() at do_trap_user+0x236
cpu_exception_handler_user() at cpu_exception_handler_user+0x72
--- syscall (552, FreeBSD ELF64, fstatat)
KDB: enter: panic
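
Several frames in the trace above resolve to names such as .Lpcrel_hi1350 rather than to functions. These look like compiler-generated local labels (the kind emitted for PC-relative address pairs on RISC-V), so the real function names for those frames are hidden. A minimal sketch, assuming the "name() at name+0xNN" line format shown above, that parses the frames and lists the ones resolved only to such labels (parse_frames and unresolved are illustrative helper names, not part of any FreeBSD tool):

```python
import re

# Matches KDB backtrace lines of the form "name() at name+0xNN";
# the "+0xNN" part is optional (e.g. "db_trace_self() at db_trace_self").
FRAME_RE = re.compile(
    r"^(?P<func>\S+)\(\) at (?P<sym>\S+?)(?:\+0x(?P<off>[0-9a-fA-F]+))?$"
)

def parse_frames(text):
    """Return a list of (symbol, offset) tuples, one per backtrace frame."""
    frames = []
    for line in text.splitlines():
        m = FRAME_RE.match(line.strip())
        if m:
            off = int(m.group("off"), 16) if m.group("off") else 0
            frames.append((m.group("sym"), off))
    return frames

def unresolved(frames):
    """Frames whose nearest symbol is a local .Lpcrel_hi label."""
    return [f for f in frames if f[0].startswith(".Lpcrel_hi")]

sample = """\
uma_zalloc_arg() at uma_zalloc_arg+0xbc
.Lpcrel_hi1350() at .Lpcrel_hi1350+0xe
dbuf_create_bonus() at dbuf_create_bonus+0x44
db_trace_self() at db_trace_self
"""
frames = parse_frames(sample)
print(frames)
print(unresolved(frames))
```

Frames flagged this way would need to be resolved against the debug kernel (for example with gdb) to recover the actual function names.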

Comment 1 dgilbert 2023-08-30 06:02:56 UTC

Created attachment 244459
info file for crash

I have the vmcore and debug kernel files available on request.

Comment 2 Mark Linimon freebsd_committer 2023-08-30 15:48:35 UTC

^Triage: assign properly.

Comment 3 dgilbert 2023-09-08 18:46:18 UTC

Created attachment 244713
Another crash on another day: the core.txt file.

I have another largely idle crash.  It shares almost all the stack frames with the first one submitted.

As before, the dump and a debug kernel are available on request.

Comment 4 Ruslan Bukin 2023-09-09 11:00:12 UTC

Is it ZFS related? Did you try disabling it?

Comment 5 dgilbert 2023-09-29 17:22:15 UTC

(In reply to Ruslan Bukin from comment #4)

This is ZFS-root, so I can't easily disable ZFS.  Another crash with an almost identical signature just occurred.

Comment 6 dgilbert 2024-01-27 20:44:11 UTC

Created attachment 248018
another kernel core with a different signature.

Here is another (fairly deep stack) core from the same machine.

Comment 7 Mitchell Horne freebsd_committer 2024-01-29 18:37:25 UTC

I looked at the backtraces provided, and the allocation appears to belong to the 'dbuf_kmem_cache' UMA zone, in dbuf_create(). Unfortunately, the core.txt does not provide the context of the thread responsible for the store-after-free, so there is not enough here to work out what might have happened, or whether this is an OpenZFS bug, an OpenZFS/riscv bug, or a FreeBSD/riscv bug.
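
To illustrate why the panicking thread is not the one that did the bad store, here is a toy model of junk-fill checking, assuming the allocator fills freed items with a 0xdeadc0de pattern and verifies it on the next allocation. ToyZone, its method names, and the 400-byte item size are hypothetical and are not the UMA implementation:

```python
# 0xdeadc0de as little-endian bytes (assumed junk pattern).
JUNK = bytes.fromhex("dec0adde")

class ToyZone:
    """Toy single-item allocator that junk-fills on free and checks on alloc."""

    def __init__(self, size):
        assert size % 4 == 0
        self.mem = bytearray(size)
        self.free_item()

    def free_item(self):
        # Fill the whole item with the junk pattern, as a debug free path would.
        self.mem[:] = JUNK * (len(self.mem) // 4)

    def alloc_item(self):
        # Check every 32-bit word; return the offset of the first damaged
        # word, or None if the freed item is still intact.
        for off in range(0, len(self.mem), 4):
            if self.mem[off:off + 4] != JUNK:
                return off
        return None

zone = ToyZone(400)          # hypothetical item size
zone.mem[392 + 2] = 0x00     # some stale pointer stores one byte after free
bad = zone.alloc_item()      # the damage is only noticed here, by a later caller
print(bad)
```

The check fires in whichever thread allocates the item next, which is why the backtrace shows the victim (here a stat lookup going through ZFS) and says nothing about the writer.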

Commit a03c23931eec (Nov. 2023) adds additional information to the panic message, which would help in confirming some details of the allocation in question, including the offset of the store-after-free. If you update past this point it would aid in further diagnosis.

Also, inclusion of 'alltrace' ddb command output after the panic _might_ help.

https://cgit.freebsd.org/src/commit/?id=a03c23931eec567b0957c2a0b1102dba8d538d98

Comment 8 Mitchell Horne freebsd_committer 2024-01-30 20:30:23 UTC

(In reply to Mitchell Horne from comment #7)

Okay I spoke too soon re: the offset; it is trivially calculated as 0x908 - 0x7f8 = 0x188 = 392. According to gdb, the struct dmu_buf_impl member at offset 392, for your revision, is db_user, an 8 byte pointer. 
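
The same two steps (subtract the item base from the modified address, then find the struct member covering that offset) can be sketched as follows. The addresses are made up so that the difference is 0x188, and the layout table is hypothetical apart from db_user at offset 392, which is the only entry taken from the comment above:

```python
from bisect import bisect_right

def store_offset(item_base, bad_addr):
    """Offset of the modified word from the start of the allocated item."""
    return bad_addr - item_base

def member_at(layout, offset):
    """Return the name of the member containing `offset`, or None.

    `layout` is a list of (offset, size, name) tuples sorted by offset.
    """
    starts = [m[0] for m in layout]
    i = bisect_right(starts, offset) - 1
    if i < 0:
        return None
    start, size, name = layout[i]
    return name if offset < start + size else None

# Hypothetical addresses, chosen only so that the difference is 0x188.
ITEM_BASE = 0x1000
BAD_ADDR = 0x1188

# Hypothetical partial layout; only db_user at offset 392 comes from the thread.
LAYOUT = [
    (384, 8, "hypothetical_prev_member"),
    (392, 8, "db_user"),
    (400, 8, "hypothetical_next_member"),
]

off = store_offset(ITEM_BASE, BAD_ADDR)
print(hex(off), off, member_at(LAYOUT, off))
```

On a real system the layout would come from the debug kernel for the exact revision, since member offsets can change between builds.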

The expected contents of uninitialized memory is 0xdeadc0dedeadc0de, but your reports consistently show the contents of the affected address as 0x00000000de00c0de. So it is only partially overwritten, and therefore not an abuse of the db_user field in a dmu_buf_impl_t object.
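
A quick way to see what "partially overwritten" means is to compare the observed word with the junk pattern byte by byte. This sketch assumes the fill pattern is the 32-bit word 0xdeadc0de and that the low 32 bits of the reported value are the word that failed the check; both are assumptions, and differing_bytes is an illustrative helper:

```python
def differing_bytes(expected, observed, width=4):
    """Return little-endian byte indices where two `width`-byte words differ."""
    e = expected.to_bytes(width, "little")
    o = observed.to_bytes(width, "little")
    return [i for i in range(width) if e[i] != o[i]]

JUNK_WORD = 0xDEADC0DE   # assumed 32-bit junk pattern
OBSERVED = 0xDE00C0DE    # low 32 bits of the value quoted above

changed = differing_bytes(JUNK_WORD, OBSERVED)
print(changed)
```

Under those assumptions only one byte of the word differs (0xad became 0x00), which looks more like a stray single-byte store than a pointer-sized write to db_user.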

So, I'm thinking this allocation missed the zone's cache (which was empty), and the memory could have belonged to anything before that, meaning the use-after-free could be anywhere...

I'll have to see what other tips I can learn to help identify this. On other platforms we could use KASAN, but for riscv it is not implemented yet.

Comment 9 dgilbert 2024-01-30 21:29:32 UTC

Certainly this isn't an unusual configuration for an AMD64 in my world.  16G ram, 2T nvme, ethernet.  My home router isn't dissimilar, but it doesn't exhibit this crash.

So I'm proceeding with the theory that it's RISC-V related.  I've started the few-day-long compile to bring it up to this week's sources.  (I like my machines to be self-hosted, which, BTW, is one bug of RISC-V: it needs a patch to be self-hosted.)  It would be faster, but I also zapped the ccache out of an abundance of caution.

I should be booted on the new load before the weekend.  On average, we should know in two weeks or so... maybe four.  I might be able to bring it on with a poudriere build.

Comment 10 dgilbert 2024-01-31 19:26:26 UTC

Created attachment 248101
Another core.txt file from the problem.

This file does _not_ reflect an upgrade to post November.  It happened while making world.

Comment 11 dgilbert 2024-02-08 16:47:20 UTC

Created attachment 248266
Core.txt from recent kernel.

Uploading a core.txt from a recent kernel, as requested.  As always, I have the core dump and the kernel binaries should someone need to dive in.

[1:5:305]root@ump:/var/crash> uname -a
FreeBSD ump.daveg.ca 15.0-CURRENT FreeBSD 15.0-CURRENT #0 main-n267962-d56a6f0516a7-dirty: Mon Feb  5 20:54:27 EST 2024     root@ump.daveg.ca:/usr/obj/usr/src/riscv.riscv64/sys/GENERIC riscv

... I caused this dump rather deliberately by getting the machine busy with a poudriere run.  Other core.txt's are from a relatively idle machine.  Should the machine make an idle dump, I will post it here.