Bug 250934

Summary: kernel panic in zfs
Product: Base System
Reporter: John Kennedy <warlock>
Component: kern
Assignee: Mariusz Zaborski <oshogbo>
Status: Closed FIXED    
Severity: Affects Only Me
CC: afedorov, lwhsu, oshogbo
Priority: ---    
Version: CURRENT   
Hardware: amd64   
OS: Any   
Attachments:
Description: kernel dump text (Flags: none)

Description John Kennedy 2020-11-07 22:41:45 UTC
Created attachment 219438
kernel dump text

I'm trying to grub-bhyve the OEL8 boot disk.  I initiate it as:

grub-bhyve -m /home/warlock/pit/cfg/root/oel8-device.map -r cd0 -M 4096M oel8

The device map file has this:

(hd0) /dev/zvol/zaux/oel8
(cd0) /zroot/stash/iso/oel8.2_x64_boot.iso

Both obviously use ZFS resources.  To trigger the panic, I just type this:

grub> ls

The ISO is Oracle's version of Red Hat 8.  If it turns out to be the bad actor, you can probably grab your own copy:

SHA1 (oel8.2_x64_boot.iso) = 54b1094367a80893167ad8cec37e9be638503917

That is a renamed V996905-01.iso.  Hopefully you won't need to grab it yourself (it's free, but they make you jump through hoops).  The zvol was just created like this:

zfs create -V64G -o volmode=dev zaux/oel8

I've attached the crash dump text, but the short version is below.  The "dirty" part is the r367433 pre-patch on top of r367430.

FreeBSD ouroboros.phouka.net 13.0-CURRENT FreeBSD 13.0-CURRENT #226 r367430+999604acfd94-c272718(master)-dirty: Fri Nov  6 12:56:43 PST 2020     warlock@ouroboros.phouka.net:/usr/obj/usr/src/amd64.amd64/sys/GENERIC  amd64
...
Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 02
fault virtual address   = 0x28
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff82895808
stack pointer           = 0x28:0xfffffe00e0927550
frame pointer           = 0x28:0xfffffe00e09275b0
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 3307 (grub-bhyve)
trap number             = 12
panic: page fault
cpuid = 1
time = 1604786502
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00e0927200
vpanic() at vpanic+0x182/frame 0xfffffe00e0927250
panic() at panic+0x43/frame 0xfffffe00e09272b0
trap_fatal() at trap_fatal+0x387/frame 0xfffffe00e0927310
trap_pfault() at trap_pfault+0x97/frame 0xfffffe00e0927370
trap() at trap+0x2ab/frame 0xfffffe00e0927480
calltrap() at calltrap+0x8/frame 0xfffffe00e0927480
--- trap 0xc, rip = 0xffffffff82895808, rsp = 0xfffffe00e0927550, rbp = 0xfffffe00e09275b0 ---
zil_async_to_sync() at zil_async_to_sync+0x18/frame 0xfffffe00e09275b0
zvol_cdev_open() at zvol_cdev_open+0x322/frame 0xfffffe00e09275f0
devfs_open() at devfs_open+0x12f/frame 0xfffffe00e0927660
VOP_OPEN_APV() at VOP_OPEN_APV+0x35/frame 0xfffffe00e0927680
vn_open_vnode() at vn_open_vnode+0x19a/frame 0xfffffe00e0927720
vn_open_cred() at vn_open_cred+0x3d5/frame 0xfffffe00e0927870
kern_openat() at kern_openat+0x263/frame 0xfffffe00e09279c0
amd64_syscall() at amd64_syscall+0x131/frame 0xfffffe00e0927af0
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe00e0927af0
--- syscall (499, FreeBSD ELF64, sys_openat), rip = 0x8009ead4a, rsp = 0x7fffffffdea8, rbp = 0x7fffffffdf20 ---
KDB: enter: panic
Uptime: 16h31m34s
Dumping 4205 out of 32633 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%
Comment 1 Aleksandr Fedorov freebsd_committer freebsd_triage 2020-11-08 08:33:20 UTC
I don't think this is a virtualization problem.
It looks like a ZFS issue.

https://github.com/openzfs/zfs/pull/11152

The same trace appears here: https://drive.google.com/file/d/1-dvj8eoUNqRWL5mcVtH5Px4NbrhLfzML/view

https://github.com/openzfs/zfs/commit/ae37ceadaa2a8cf09fbf1a9baafaa6dc6e24318a
Comment 2 commit-hook freebsd_committer freebsd_triage 2020-11-08 14:08:34 UTC
A commit references this bug:

Author: oshogbo
Date: Sun Nov  8 14:08:01 UTC 2020
New revision: 367487
URL: https://svnweb.freebsd.org/changeset/base/367487

Log:
  Check if the ZVOL has been written before calling zil_async_to_sync.
  The ZIL will be opened on the first write, not earlier.

  Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
  Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
  Signed-off-by: Mariusz Zaborski <oshogbo@vexillium.org>
  OpenZFS Pull Request: https://github.com/openzfs/zfs/pull/11152
  PR:		250934

Changes:
  head/sys/contrib/openzfs/module/os/freebsd/zfs/zvol_os.c
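
The check described in the log can be modelled in user space. The sketch below is not the committed change: `zvol_model`, `zilog_model`, `written_to`, and the `model_*` functions are hypothetical stand-ins, and the real code lives in zvol_os.c. It only illustrates why an open that asks for synchronous semantics should skip the ZIL conversion while the log pointer is still NULL, which would be consistent with the small fault address (0x28) in the trace above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the intent log; not the real OpenZFS zilog_t. */
struct zilog_model {
	int sync_conversions;	/* how many times the conversion ran */
};

/* Hypothetical stand-in for a zvol; not the real zvol_state_t. */
struct zvol_model {
	struct zilog_model *zilog;	/* NULL until the first write opens the ZIL */
	bool written_to;		/* set by the first write */
};

/* Models zil_async_to_sync(): it dereferences the log unconditionally. */
static void
model_async_to_sync(struct zilog_model *zl)
{
	zl->sync_conversions++;
}

/* Models the first write, which is the point where the ZIL gets opened. */
static void
model_first_write(struct zvol_model *zv, struct zilog_model *zl)
{
	assert(zl != NULL);
	zv->zilog = zl;
	zv->written_to = true;
}

/*
 * Models an open with sync flags. The guard on written_to is the point of
 * the sketch: without it, a freshly created volume would pass a NULL log
 * to model_async_to_sync() and fault.
 * Returns true when the conversion was actually performed.
 */
static bool
model_open_sync(struct zvol_model *zv)
{
	if (zv->written_to) {
		model_async_to_sync(zv->zilog);
		return (true);
	}
	return (false);
}
```

Calling `model_open_sync()` on a zero-initialized `zvol_model` returns false and touches nothing; after `model_first_write()` the same call runs the conversion once. That mirrors the reported case, where the volume had just been created and never written before grub-bhyve opened it.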
Comment 3 Mariusz Zaborski freebsd_committer freebsd_triage 2020-11-08 14:16:32 UTC
Could you please verify whether the commit resolved your issue?
Comment 4 John Kennedy 2020-11-08 15:34:17 UTC
Crunching now.

I usually do a full buildworld/buildkernel, which isn't the fastest way to compile the kernel.  Do you guys (kernel developers) find that faster compile options (like -DKERNFAST, no-clean options, etc.) give adequate results?

I'm happy wasting my computer's time, but don't want to waste YOUR time.  :D
Comment 5 Aleksandr Fedorov freebsd_committer freebsd_triage 2020-11-08 15:36:40 UTC
Almost always, -DNO_CLEAN works well.
Comment 6 John Kennedy 2020-11-08 19:58:59 UTC
The patch works!

grub> ls
(hd0) (cd0) (cd0,msdos2) (host)
Comment 7 Mariusz Zaborski freebsd_committer freebsd_triage 2020-11-08 22:56:08 UTC
I'm pleased that this helps.