Bug 272325 - Page fault, possible ZFS related
Summary: Page fault, possible ZFS related
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 13.2-RELEASE
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: freebsd-fs (Nobody)
URL:
Keywords: crash, needs-qa
Depends on:
Blocks:
 
Reported: 2023-07-02 08:54 UTC by bsd_orsolic
Modified: 2023-12-14 15:23 UTC
CC List: 5 users

See Also:


Attachments
backtrace (5.91 KB, text/plain), 2023-07-02 08:54 UTC, bsd_orsolic
backtrace (5.91 KB, text/plain), 2023-07-02 08:54 UTC, bsd_orsolic
backtrace (5.91 KB, text/plain), 2023-07-02 08:55 UTC, bsd_orsolic
backtrace (5.91 KB, text/plain), 2023-07-02 08:55 UTC, bsd_orsolic
backtrace (5.91 KB, text/plain), 2023-07-21 05:37 UTC, bsd_orsolic
crash 7 (6.05 KB, text/plain), 2023-08-14 06:46 UTC, bsd_orsolic
crash 8 (5.91 KB, text/plain), 2023-09-01 06:50 UTC, bsd_orsolic
crash 9 (6.02 KB, text/plain), 2023-09-13 19:01 UTC, bsd_orsolic
smart-nvme (3.03 KB, text/plain), 2023-12-04 22:40 UTC, bsd_orsolic
smart-hdd0 (5.10 KB, text/plain), 2023-12-04 22:41 UTC, bsd_orsolic
smart-hdd1 (5.10 KB, text/plain), 2023-12-04 22:42 UTC, bsd_orsolic

Description bsd_orsolic 2023-07-02 08:54:25 UTC
Created attachment 243120 [details]
backtrace

I am experiencing crashes on 13.2-RELEASE every few weeks, usually around 3:05 AM.
grep 2023$ | sort
./core.txt.1:Mon Apr 10 03:05:10 CEST 2023
./core.txt.2:Thu Apr 27 03:04:54 CEST 2023
./core.txt.3:Fri May 26 03:05:20 CEST 2023
./core.txt.4:Mon Jun 12 03:05:18 CEST 2023
./core.txt.5:Sun Jul  2 03:05:14 CEST 2023
They could be ZFS related.

panic: page fault
Fatal trap 12: page fault while in kernel mode
cpuid = 6; apic id = 06
fault virtual address	= 0x0
fault code		= supervisor read data, page not present
instruction pointer	= 0x20:0xffffffff8286b65c
stack pointer	        = 0x28:0xfffffe0353890870
frame pointer	        = 0x28:0xfffffe0353890930
code segment		= base rx0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 55948 (zfs)
trap number		= 12
panic: page fault
cpuid = 6
time = 1688259833
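
The "time =" field in the panic message is a Unix timestamp, so it can be lined up against the syslog entries. A minimal sketch, assuming the machine's local time on that date was CEST (UTC+2), as the core.txt dates suggest; the helper name panic_local_time is made up for illustration:

```python
from datetime import datetime, timedelta, timezone

# Assumption: local time was CEST (UTC+2), matching the core.txt dates in this report.
CEST = timezone(timedelta(hours=2), "CEST")

def panic_local_time(epoch_seconds: int) -> str:
    """Format the panic message's `time =` value as local wall-clock time."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=CEST)
    return moment.strftime("%Y-%m-%d %H:%M:%S %Z")

print(panic_local_time(1688259833))  # 2023-07-02 03:03:53 CEST
```

Under that assumption the panic happened a little under three minutes after the 03:01:00 daily cron run started, and roughly a minute before the first post-reboot syslogd line on Jul 2.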

Lines in syslog when crashes occur:

Apr 10 03:01:00 zen-pobro root[5666]: [cron] daily
Apr 10 03:05:04 zen-pobro syslogd: kernel boot file is /boot/kernel/kernel

Apr 27 03:01:00 zen-pobro root[9491]: [cron] daily
Apr 27 03:04:48 zen-pobro syslogd: kernel boot file is /boot/kernel/kernel

May 26 03:01:00 zen-pobro root[70274]: [cron] daily
May 26 03:05:14 zen-pobro syslogd: kernel boot file is /boot/kernel/kernel

Jun 12 03:01:00 zen-pobro root[75378]: [cron] daily
Jun 12 03:05:12 zen-pobro syslogd: kernel boot file is /boot/kernel/kernel

Jul  2 03:01:00 zen-pobro root[55797]: [cron] daily
Jul  2 03:05:08 zen-pobro syslogd: kernel boot file is /boot/kernel/kernel
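
To make the pattern in these lines explicit, the gap between the start of the daily cron run and the first syslogd line after the reboot can be computed. This is only a sketch: the timestamps are copied from the lines above, the year 2023 is assumed because syslog omits it, and the names EVENTS, parse and gaps_seconds are illustrative.

```python
from datetime import datetime

# (cron daily start, first syslogd line after reboot), copied from the syslog lines.
# Assumption: every entry is from 2023, since syslog timestamps carry no year.
EVENTS = [
    ("Apr 10 03:01:00", "Apr 10 03:05:04"),
    ("Apr 27 03:01:00", "Apr 27 03:04:48"),
    ("May 26 03:01:00", "May 26 03:05:14"),
    ("Jun 12 03:01:00", "Jun 12 03:05:12"),
    ("Jul  2 03:01:00", "Jul  2 03:05:08"),
]

def parse(stamp: str, year: int = 2023) -> datetime:
    """Parse a syslog-style timestamp, supplying the missing year."""
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")

def gaps_seconds(events):
    """Seconds between the cron start and the first post-reboot log line."""
    return [int((parse(boot) - parse(cron)).total_seconds()) for cron, boot in events]

print(gaps_seconds(EVENTS))  # [244, 228, 254, 252, 248]
```

All five gaps fall within about half a minute of each other, which suggests the crash is triggered at the same point of the daily periodic run each time. The gap includes the reboot itself, so it overstates the time from cron start to panic.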

I have modified /etc/periodic to run 310.locate daily:
lrwxr-xr-x  1 root  wheel    31B Jun 10 12:01 /etc/periodic/daily/310.locate -> /etc/periodic/weekly/310.locate

The box is an amd64 PC used as a home workstation, ZFS and VM server.
It has ECC RAM (in a desktop motherboard), 2 HDDs in a ZFS mirror and a single NVMe SSD.

Similar PR from a decade ago: #174372
Backtraces attached.
Comment 1 bsd_orsolic 2023-07-02 08:54:56 UTC
Created attachment 243121 [details]
backtrace
Comment 2 bsd_orsolic 2023-07-02 08:55:30 UTC
Created attachment 243122 [details]
backtrace
Comment 3 bsd_orsolic 2023-07-02 08:55:49 UTC
Created attachment 243123 [details]
backtrace
Comment 4 bsd_orsolic 2023-07-21 05:36:51 UTC
Another reproduction:
Jul 21 03:01:00 zen-pobro root[98948]: [cron] daily
Jul 21 03:04:40 zen-pobro syslogd: kernel boot file is /boot/kernel/kernel
Jul 21 03:04:40 zen-pobro kernel: ---<<BOOT>>---
Jul 21 03:04:40 zen-pobro kernel: Copyright (c) 1992-2021 The FreeBSD Project.
Jul 21 03:04:40 zen-pobro kernel: Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
Jul 21 03:04:40 zen-pobro kernel:       The Regents of the University of California. All rights reserved.
Jul 21 03:04:40 zen-pobro kernel: FreeBSD is a registered trademark of The FreeBSD Foundation.

Fatal trap 12: page fault while in kernel mode
cpuid = 6; apic id = 06
fault virtual address   = 0x0
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff8286a65c
stack pointer           = 0x28:0xfffffe024f13a870
frame pointer           = 0x28:0xfffffe024f13a930
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 99113 (zfs)
trap number             = 12
panic: page fault
cpuid = 6
time = 1689901415
Comment 5 bsd_orsolic 2023-07-21 05:37:13 UTC
Created attachment 243515 [details]
backtrace
Comment 6 bsd_orsolic 2023-08-14 06:46:50 UTC
Created attachment 244082 [details]
crash 7

Another crash

Aug 14 03:01:00 zen-pobro root[4964]: [cron] daily
Aug 14 03:05:15 zen-pobro syslogd: kernel boot file is /boot/kernel/kernel
Aug 14 03:05:15 zen-pobro kernel: ---<<BOOT>>---
Aug 14 03:05:15 zen-pobro kernel: Copyright (c) 1992-2021 The FreeBSD Project.

Fatal trap 12: page fault while in kernel mode
cpuid = 6; apic id = 06
fault virtual address	= 0x0
fault code		= supervisor read data, page not present
instruction pointer	= 0x20:0xffffffff8286a65c
stack pointer	        = 0x28:0xfffffe01f8416870
frame pointer	        = 0x28:0xfffffe01f8416930
code segment		= base rx0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 5110 (zfs)
trap number		= 12
panic: page fault
cpuid = 6
time = 1691975038
Comment 7 bsd_orsolic 2023-09-01 06:50:22 UTC
Created attachment 244545 [details]
crash 8

Yet another reproduction

Sep  1 03:01:00 zen-pobro root[98040]: [cron] daily
Sep  1 03:05:21 zen-pobro syslogd: kernel boot file is /boot/kernel/kernel
Sep  1 03:05:21 zen-pobro kernel: ---<<BOOT>>---
Sep  1 03:05:21 zen-pobro kernel: Copyright (c) 1992-2021 The FreeBSD Project.

Fatal trap 12: page fault while in kernel mode
cpuid = 3; apic id = 03
fault virtual address	= 0x0
fault code		= supervisor read data, page not present
instruction pointer	= 0x20:0xffffffff822e865c
stack pointer	        = 0x28:0xfffffe035b2c7870
frame pointer	        = 0x28:0xfffffe035b2c7930
code segment		= base rx0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 98180 (zfs)
trap number		= 12
panic: page fault
cpuid = 3
time = 1693530254
KDB: stack backtrace:
#0 0xffffffff80c55825 at kdb_backtrace+0x65
#1 0xffffffff80c081a1 at vpanic+0x151
#2 0xffffffff80c08043 at panic+0x43
#3 0xffffffff810b2fa7 at trap_fatal+0x387
#4 0xffffffff810b2fff at trap_pfault+0x4f
#5 0xffffffff8108a8b8 at calltrap+0x8
#6 0xffffffff822e9828 at zap_lookup_norm+0x68
#7 0xffffffff822e97b1 at zap_lookup+0x11
#8 0xffffffff8218083c at zfs_get_zplprop+0x9c
#9 0xffffffff8230026d at zfs_ioc_objset_zplprops+0x8d
#10 0xffffffff822f973a at zfsdev_ioctl_common+0x58a
#11 0xffffffff8216d826 at zfsdev_ioctl+0x116
#12 0xffffffff80a9f116 at devfs_ioctl+0xc6
#13 0xffffffff80cfabb4 at vn_ioctl+0x1a4
#14 0xffffffff80a9f7ce at devfs_ioctl_f+0x1e
#15 0xffffffff80c762dd at kern_ioctl+0x26d
#16 0xffffffff80c75fc0 at sys_ioctl+0x100
#17 0xffffffff810b389c at amd64_syscall+0x10c
Uptime: 15d9h20m31s
Comment 8 bsd_orsolic 2023-09-13 19:01:14 UTC
Created attachment 244815 [details]
crash 9

The crash is reproduced once more.

Sep 12 03:01:00 zen-pobro root[87648]: [cron] daily
Sep 12 03:04:40 zen-pobro syslogd: kernel boot file is /boot/kernel/kernel
Sep 12 03:04:40 zen-pobro kernel: ---<<BOOT>>---
Sep 12 03:04:40 zen-pobro kernel: Copyright (c) 1992-2021 The FreeBSD Project.

panic: page fault

Fatal trap 12: page fault while in kernel mode
cpuid = 5; apic id = 05
fault virtual address	= 0x0
fault code		= supervisor read data, page not present
instruction pointer	= 0x20:0xffffffff8287565c
stack pointer	        = 0x28:0xfffffe022151f870
frame pointer	        = 0x28:0xfffffe022151f930
code segment		= base rx0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 87839 (zfs)
trap number		= 12
panic: page fault
cpuid = 5
time = 1694480616
KDB: stack backtrace:
#0 0xffffffff80c55825 at kdb_backtrace+0x65
#1 0xffffffff80c081a1 at vpanic+0x151
#2 0xffffffff80c08043 at panic+0x43
#3 0xffffffff810b2fa7 at trap_fatal+0x387
#4 0xffffffff810b2fff at trap_pfault+0x4f
#5 0xffffffff8108a8b8 at calltrap+0x8
#6 0xffffffff82876828 at zap_lookup_norm+0x68
#7 0xffffffff828767b1 at zap_lookup+0x11
#8 0xffffffff8270d83c at zfs_get_zplprop+0x9c
#9 0xffffffff8288d26d at zfs_ioc_objset_zplprops+0x8d
#10 0xffffffff8288673a at zfsdev_ioctl_common+0x58a
#11 0xffffffff826fa826 at zfsdev_ioctl+0x116
#12 0xffffffff80a9f116 at devfs_ioctl+0xc6
#13 0xffffffff80cfabb4 at vn_ioctl+0x1a4
#14 0xffffffff80a9f7ce at devfs_ioctl_f+0x1e
#15 0xffffffff80c762dd at kern_ioctl+0x26d
#16 0xffffffff80c75fc0 at sys_ioctl+0x100
#17 0xffffffff810b389c at amd64_syscall+0x10c
Uptime: 10d23h58m39s
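
The absolute addresses differ between crash 8 and crash 9, but the two traces can still be compared by looking at offsets. The sketch below uses only values copied from the panic output in this report. It assumes the 0xffffffff82...... addresses belong to a kernel module (presumably zfs.ko) that is loaded at a different base address on each boot. The helper name is illustrative.

```python
# Values copied from the "crash 8" and "crash 9" panic output in this report.
CRASHES = {
    "crash 8": {"ip": 0xFFFFFFFF822E865C, "zap_lookup_norm+0x68": 0xFFFFFFFF822E9828},
    "crash 9": {"ip": 0xFFFFFFFF8287565C, "zap_lookup_norm+0x68": 0xFFFFFFFF82876828},
}

def ip_offset_from_zap_lookup_norm(crash: dict) -> int:
    """Distance from the faulting instruction up to the start of zap_lookup_norm."""
    func_start = crash["zap_lookup_norm+0x68"] - 0x68
    return func_start - crash["ip"]

offsets = {name: hex(ip_offset_from_zap_lookup_norm(c)) for name, c in CRASHES.items()}
print(offsets)  # {'crash 8': '0x1164', 'crash 9': '0x1164'}

# Instruction pointers from all five panic messages quoted in this report.
ALL_IPS = [0xFFFFFFFF8286B65C, 0xFFFFFFFF8286A65C, 0xFFFFFFFF8286A65C,
           0xFFFFFFFF822E865C, 0xFFFFFFFF8287565C]
print({hex(ip & 0xFFF) for ip in ALL_IPS})  # {'0x65c'}
```

In both traces the faulting instruction sits 0x1164 bytes before zap_lookup_norm, and every reported instruction pointer has the same low 12 bits. That is consistent with the same instruction faulting each time, although it does not prove it, and it does not identify the function without symbol information.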
Comment 9 Graham Perrin 2023-10-03 19:18:28 UTC
Do all pools scrub without error? 

How is S.M.A.R.T. status for each of the hard disk drives?

(In reply to porsolic from comment #0)

> Similar PR from a decade ago: #174372

From bug 174372 comment 4 (the recent closure): 

> ⋯ I'm sure the other issue that was linked is unrelated to this.
Comment 10 Graham Perrin 2023-10-03 19:21:09 UTC
(Sorry, the repeat addition of 174372 was a slip of the fingers. I'll repeat the removal. Apologies for the noise.)
Comment 11 bsd_orsolic 2023-12-04 22:40:56 UTC
Created attachment 246786 [details]
smart-nvme
Comment 12 bsd_orsolic 2023-12-04 22:41:52 UTC
Created attachment 246787 [details]
smart-hdd0
Comment 13 bsd_orsolic 2023-12-04 22:42:14 UTC
Created attachment 246788 [details]
smart-hdd1
Comment 14 bsd_orsolic 2023-12-04 22:43:33 UTC
(In reply to Graham Perrin from comment #9)

A zpool scrub of the mirrored mechanical disks finishes without error.
A zpool scrub of the single NVMe disk finishes with a (semi?) error:
  scan: scrub repaired 0B in 00:20:58 with 0 errors on Thu Nov 30 14:56:10 2023
errors: 3 data errors, use '-v' for a list

But "zpool status -v" shows more than 3 errors (189 to be exact), all of them contained in my home partition's snapshots, like:
        pool/encrypted/home:<0x1>
        pool/encrypted/home@auto_daily-2023-12-03_02.01.00--1w:<0x1>
        pool/encrypted/home@auto_hourly-2023-12-02_22.00.00--2d:<0x1>
        pool/encrypted/home@auto_hourly-2023-12-02_20.00.00--2d:<0x1>
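
A long error list like this is easier to read when grouped by dataset and object rather than by snapshot. A rough sketch, assuming every entry has the form dataset[@snapshot]:<object> as in the four lines above. Entries in any other form, such as plain file paths, are skipped. The names SAMPLE, ENTRY and summarize are illustrative.

```python
import re
from collections import Counter

# Sample entries copied from the error list quoted in this comment.
SAMPLE = """\
        pool/encrypted/home:<0x1>
        pool/encrypted/home@auto_daily-2023-12-03_02.01.00--1w:<0x1>
        pool/encrypted/home@auto_hourly-2023-12-02_22.00.00--2d:<0x1>
        pool/encrypted/home@auto_hourly-2023-12-02_20.00.00--2d:<0x1>
"""

# Assumed entry shape: dataset, optional @snapshot, then :<0x...> object number.
ENTRY = re.compile(
    r"^\s*(?P<dataset>[^@:\s]+)(?:@(?P<snapshot>[^:\s]+))?:<(?P<obj>0x[0-9a-fA-F]+)>\s*$"
)

def summarize(text: str) -> Counter:
    """Count error entries per (dataset, object), ignoring the snapshot name."""
    counts = Counter()
    for line in text.splitlines():
        match = ENTRY.match(line)
        if match:
            counts[(match["dataset"], match["obj"])] += 1
    return counts

print(summarize(SAMPLE))  # Counter({('pool/encrypted/home', '0x1'): 4})
```

For the four sample lines everything collapses to object 0x1 of pool/encrypted/home. Whether the full list of 189 entries does the same, and how that relates to the "3 data errors" summary, is not something this sketch can answer.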

All of those errors can be fixed with "clear" and "scrub":
  scan: scrub repaired 0B in 00:13:29 with 0 errors on Mon Dec  4 23:37:59 2023

smartctl output from the disks is attached.

Regarding the related PR: that's a fascinating find and reply for a decade-old PR!
I also have PCI cards (all of which are passed through to VMs): Intel 4xGbit, Intel Wifi, integrated 2.5Gbit, and a cheap Asmedia USB controller.

That said, I have not experienced any crashes at 3 AM since I upgraded to the 14.0 branch.