272325 – Page fault, possible ZFS related

Summary: Page fault, possible ZFS related

Status:	Open

Alias:	None

Product:	Base System
Classification:	Unclassified
Component:	kern
Version:	13.2-RELEASE
Hardware:	amd64 Any

Importance:	--- Affects Only Me
Assignee:	freebsd-fs (Nobody)

URL:
Keywords:	crash, needs-qa

Depends on:
Blocks:

Reported:	2023-07-02 08:54 UTC by bsd_orsolic
Modified:	2023-12-14 15:23 UTC
CC List:	5 users

See Also:

Attachments
backtrace (5.91 KB, text/plain) 2023-07-02 08:54 UTC, bsd_orsolic	no flags
backtrace (5.91 KB, text/plain) 2023-07-02 08:54 UTC, bsd_orsolic	no flags
backtrace (5.91 KB, text/plain) 2023-07-02 08:55 UTC, bsd_orsolic	no flags
backtrace (5.91 KB, text/plain) 2023-07-02 08:55 UTC, bsd_orsolic	no flags
backtrace (5.91 KB, text/plain) 2023-07-21 05:37 UTC, bsd_orsolic	no flags
crash 7 (6.05 KB, text/plain) 2023-08-14 06:46 UTC, bsd_orsolic	no flags
crash 8 (5.91 KB, text/plain) 2023-09-01 06:50 UTC, bsd_orsolic	no flags
crash 9 (6.02 KB, text/plain) 2023-09-13 19:01 UTC, bsd_orsolic	no flags
smart-nvme (3.03 KB, text/plain) 2023-12-04 22:40 UTC, bsd_orsolic	no flags
smart-hdd0 (5.10 KB, text/plain) 2023-12-04 22:41 UTC, bsd_orsolic	no flags
smart-hdd1 (5.10 KB, text/plain) 2023-12-04 22:42 UTC, bsd_orsolic	no flags

Description bsd_orsolic 2023-07-02 08:54:25 UTC

Created attachment 243120 [details]
backtrace

I am experiencing crashes on 13.2-RELEASE every few weeks, usually around 3:05 AM.
grep 2023$ | sort
./core.txt.1:Mon Apr 10 03:05:10 CEST 2023
./core.txt.2:Thu Apr 27 03:04:54 CEST 2023
./core.txt.3:Fri May 26 03:05:20 CEST 2023
./core.txt.4:Mon Jun 12 03:05:18 CEST 2023
./core.txt.5:Sun Jul  2 03:05:14 CEST 2023
They could be ZFS related.
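
The dates above can be turned into intervals to see how irregular the crashes are. What follows is a minimal Python sketch, assuming the five core.txt lines are exactly as listed and that the timezone token can be ignored because every entry is CEST. The names `MONTHS`, `CORE_DATES` and `parse_core_line` are made up for this illustration.

```python
from datetime import datetime

# Locale-independent month lookup (avoids strptime's %b).
MONTHS = {name: number for number, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}

# Copied verbatim from the report.
CORE_DATES = """\
./core.txt.1:Mon Apr 10 03:05:10 CEST 2023
./core.txt.2:Thu Apr 27 03:04:54 CEST 2023
./core.txt.3:Fri May 26 03:05:20 CEST 2023
./core.txt.4:Mon Jun 12 03:05:18 CEST 2023
./core.txt.5:Sun Jul  2 03:05:14 CEST 2023
"""


def parse_core_line(line):
    """Return a naive local datetime for one 'file:date' line."""
    _, stamp = line.split(":", 1)
    _weekday, month, day, clock, _tz, year = stamp.split()
    hour, minute, second = (int(part) for part in clock.split(":"))
    return datetime(int(year), MONTHS[month], int(day), hour, minute, second)


crashes = [parse_core_line(line) for line in CORE_DATES.splitlines() if line.strip()]

# Whole days between consecutive crash dumps.
gaps_days = [(later.date() - earlier.date()).days
             for earlier, later in zip(crashes, crashes[1:])]
print(gaps_days)
```

For these five dumps the gaps are 17, 29, 17 and 20 days, and every dump is stamped within the 03:00 hour.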

panic: page fault
Fatal trap 12: page fault while in kernel mode
cpuid = 6; apic id = 06
fault virtual address	= 0x0
fault code		= supervisor read data, page not present
instruction pointer	= 0x20:0xffffffff8286b65c
stack pointer	        = 0x28:0xfffffe0353890870
frame pointer	        = 0x28:0xfffffe0353890930
code segment		= base rx0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 55948 (zfs)
trap number		= 12
panic: page fault
cpuid = 6
time = 1688259833
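
The `time =` value in the panic message can be checked against the crash window. This is a small Python sketch, assuming the value is seconds since the Unix epoch and that local time is a fixed UTC+2 (CEST, as in the core.txt dates). `CEST` and `PANIC_EPOCH` are illustrative names.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed UTC+2 offset, matching the CEST stamps in the report.
CEST = timezone(timedelta(hours=2), "CEST")

PANIC_EPOCH = 1688259833  # "time = ..." from the panic message above

panic_local = datetime.fromtimestamp(PANIC_EPOCH, tz=CEST)
print(panic_local.isoformat())
```

Under those assumptions the panic falls at 03:03:53 local time on 2 July 2023, shortly before the 03:05:14 stamp of core.txt.5.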

Lines in syslog when crashes occur:

Apr 10 03:01:00 zen-pobro root[5666]: [cron] daily
Apr 10 03:05:04 zen-pobro syslogd: kernel boot file is /boot/kernel/kernel

Apr 27 03:01:00 zen-pobro root[9491]: [cron] daily
Apr 27 03:04:48 zen-pobro syslogd: kernel boot file is /boot/kernel/kernel

May 26 03:01:00 zen-pobro root[70274]: [cron] daily
May 26 03:05:14 zen-pobro syslogd: kernel boot file is /boot/kernel/kernel

Jun 12 03:01:00 zen-pobro root[75378]: [cron] daily
Jun 12 03:05:12 zen-pobro syslogd: kernel boot file is /boot/kernel/kernel

Jul  2 03:01:00 zen-pobro root[55797]: [cron] daily
Jul  2 03:05:08 zen-pobro syslogd: kernel boot file is /boot/kernel/kernel
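
The delay between the start of the daily cron run and the first post-reboot syslog line can be computed from the pairs above. This is a rough Python sketch, assuming each "[cron] daily" line is followed by its matching "kernel boot file" line on the same day. `SYSLOG` and `clock_seconds` are names invented for the example.

```python
# Copied from the syslog excerpts in the report.
SYSLOG = """\
Apr 10 03:01:00 zen-pobro root[5666]: [cron] daily
Apr 10 03:05:04 zen-pobro syslogd: kernel boot file is /boot/kernel/kernel
Apr 27 03:01:00 zen-pobro root[9491]: [cron] daily
Apr 27 03:04:48 zen-pobro syslogd: kernel boot file is /boot/kernel/kernel
May 26 03:01:00 zen-pobro root[70274]: [cron] daily
May 26 03:05:14 zen-pobro syslogd: kernel boot file is /boot/kernel/kernel
Jun 12 03:01:00 zen-pobro root[75378]: [cron] daily
Jun 12 03:05:12 zen-pobro syslogd: kernel boot file is /boot/kernel/kernel
Jul  2 03:01:00 zen-pobro root[55797]: [cron] daily
Jul  2 03:05:08 zen-pobro syslogd: kernel boot file is /boot/kernel/kernel
"""


def clock_seconds(line):
    """Seconds since midnight, taken from the HH:MM:SS field of a syslog line."""
    hours, minutes, seconds = (int(part) for part in line.split()[2].split(":"))
    return hours * 3600 + minutes * 60 + seconds


gaps = []
cron_start = None
for line in SYSLOG.splitlines():
    if "[cron] daily" in line:
        cron_start = clock_seconds(line)
    elif "kernel boot file" in line and cron_start is not None:
        gaps.append(clock_seconds(line) - cron_start)
        cron_start = None

print(gaps)  # seconds from cron start to the first line after reboot
```

The reboot line is logged only after the machine comes back up, so the panic itself happens somewhat earlier than these figures suggest.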

I have modified /etc/periodic to run 310.locate daily:
lrwxr-xr-x  1 root  wheel    31B Jun 10 12:01 /etc/periodic/daily/310.locate -> /etc/periodic/weekly/310.locate

The box is an amd64 PC used as a home workstation, ZFS server and VM server.
It has ECC RAM (in a desktop motherboard), two HDDs in a ZFS mirror and a single NVMe SSD.

Similar PR from a decade ago: #174372
Backtraces attached.

Comment 1 bsd_orsolic 2023-07-02 08:54:56 UTC

Created attachment 243121 [details]
backtrace

Comment 2 bsd_orsolic 2023-07-02 08:55:30 UTC

Created attachment 243122 [details]
backtrace

Comment 3 bsd_orsolic 2023-07-02 08:55:49 UTC

Created attachment 243123 [details]
backtrace

Comment 4 bsd_orsolic 2023-07-21 05:36:51 UTC

Another reproduction:
Jul 21 03:01:00 zen-pobro root[98948]: [cron] daily
Jul 21 03:04:40 zen-pobro syslogd: kernel boot file is /boot/kernel/kernel
Jul 21 03:04:40 zen-pobro kernel: ---<<BOOT>>---
Jul 21 03:04:40 zen-pobro kernel: Copyright (c) 1992-2021 The FreeBSD Project.
Jul 21 03:04:40 zen-pobro kernel: Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
Jul 21 03:04:40 zen-pobro kernel:       The Regents of the University of California. All rights reserved.
Jul 21 03:04:40 zen-pobro kernel: FreeBSD is a registered trademark of The FreeBSD Foundation.

Fatal trap 12: page fault while in kernel mode
cpuid = 6; apic id = 06
fault virtual address   = 0x0
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff8286a65c
stack pointer           = 0x28:0xfffffe024f13a870
frame pointer           = 0x28:0xfffffe024f13a930
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 99113 (zfs)
trap number             = 12
panic: page fault
cpuid = 6
time = 1689901415

Comment 5 bsd_orsolic 2023-07-21 05:37:13 UTC

Created attachment 243515 [details]
backtrace

Comment 6 bsd_orsolic 2023-08-14 06:46:50 UTC

Created attachment 244082 [details]
crash 7

Another crash

Aug 14 03:01:00 zen-pobro root[4964]: [cron] daily
Aug 14 03:05:15 zen-pobro syslogd: kernel boot file is /boot/kernel/kernel
Aug 14 03:05:15 zen-pobro kernel: ---<<BOOT>>---
Aug 14 03:05:15 zen-pobro kernel: Copyright (c) 1992-2021 The FreeBSD Project.

Fatal trap 12: page fault while in kernel mode
cpuid = 6; apic id = 06
fault virtual address	= 0x0
fault code		= supervisor read data, page not present
instruction pointer	= 0x20:0xffffffff8286a65c
stack pointer	        = 0x28:0xfffffe01f8416870
frame pointer	        = 0x28:0xfffffe01f8416930
code segment		= base rx0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 5110 (zfs)
trap number		= 12
panic: page fault
cpuid = 6
time = 1691975038

Comment 7 bsd_orsolic 2023-09-01 06:50:22 UTC

Created attachment 244545 [details]
crash 8

Yet another reproduction

Sep  1 03:01:00 zen-pobro root[98040]: [cron] daily
Sep  1 03:05:21 zen-pobro syslogd: kernel boot file is /boot/kernel/kernel
Sep  1 03:05:21 zen-pobro kernel: ---<<BOOT>>---
Sep  1 03:05:21 zen-pobro kernel: Copyright (c) 1992-2021 The FreeBSD Project.

Fatal trap 12: page fault while in kernel mode
cpuid = 3; apic id = 03
fault virtual address	= 0x0
fault code		= supervisor read data, page not present
instruction pointer	= 0x20:0xffffffff822e865c
stack pointer	        = 0x28:0xfffffe035b2c7870
frame pointer	        = 0x28:0xfffffe035b2c7930
code segment		= base rx0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 98180 (zfs)
trap number		= 12
panic: page fault
cpuid = 3
time = 1693530254
KDB: stack backtrace:
#0 0xffffffff80c55825 at kdb_backtrace+0x65
#1 0xffffffff80c081a1 at vpanic+0x151
#2 0xffffffff80c08043 at panic+0x43
#3 0xffffffff810b2fa7 at trap_fatal+0x387
#4 0xffffffff810b2fff at trap_pfault+0x4f
#5 0xffffffff8108a8b8 at calltrap+0x8
#6 0xffffffff822e9828 at zap_lookup_norm+0x68
#7 0xffffffff822e97b1 at zap_lookup+0x11
#8 0xffffffff8218083c at zfs_get_zplprop+0x9c
#9 0xffffffff8230026d at zfs_ioc_objset_zplprops+0x8d
#10 0xffffffff822f973a at zfsdev_ioctl_common+0x58a
#11 0xffffffff8216d826 at zfsdev_ioctl+0x116
#12 0xffffffff80a9f116 at devfs_ioctl+0xc6
#13 0xffffffff80cfabb4 at vn_ioctl+0x1a4
#14 0xffffffff80a9f7ce at devfs_ioctl_f+0x1e
#15 0xffffffff80c762dd at kern_ioctl+0x26d
#16 0xffffffff80c75fc0 at sys_ioctl+0x100
#17 0xffffffff810b389c at amd64_syscall+0x10c
Uptime: 15d9h20m31s

Comment 8 bsd_orsolic 2023-09-13 19:01:14 UTC

Created attachment 244815 [details]
crash 9

The crash is reproduced once more.

Sep 12 03:01:00 zen-pobro root[87648]: [cron] daily
Sep 12 03:04:40 zen-pobro syslogd: kernel boot file is /boot/kernel/kernel
Sep 12 03:04:40 zen-pobro kernel: ---<<BOOT>>---
Sep 12 03:04:40 zen-pobro kernel: Copyright (c) 1992-2021 The FreeBSD Project.

panic: page fault

Fatal trap 12: page fault while in kernel mode
cpuid = 5; apic id = 05
fault virtual address	= 0x0
fault code		= supervisor read data, page not present
instruction pointer	= 0x20:0xffffffff8287565c
stack pointer	        = 0x28:0xfffffe022151f870
frame pointer	        = 0x28:0xfffffe022151f930
code segment		= base rx0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 87839 (zfs)
trap number		= 12
panic: page fault
cpuid = 5
time = 1694480616
KDB: stack backtrace:
#0 0xffffffff80c55825 at kdb_backtrace+0x65
#1 0xffffffff80c081a1 at vpanic+0x151
#2 0xffffffff80c08043 at panic+0x43
#3 0xffffffff810b2fa7 at trap_fatal+0x387
#4 0xffffffff810b2fff at trap_pfault+0x4f
#5 0xffffffff8108a8b8 at calltrap+0x8
#6 0xffffffff82876828 at zap_lookup_norm+0x68
#7 0xffffffff828767b1 at zap_lookup+0x11
#8 0xffffffff8270d83c at zfs_get_zplprop+0x9c
#9 0xffffffff8288d26d at zfs_ioc_objset_zplprops+0x8d
#10 0xffffffff8288673a at zfsdev_ioctl_common+0x58a
#11 0xffffffff826fa826 at zfsdev_ioctl+0x116
#12 0xffffffff80a9f116 at devfs_ioctl+0xc6
#13 0xffffffff80cfabb4 at vn_ioctl+0x1a4
#14 0xffffffff80a9f7ce at devfs_ioctl_f+0x1e
#15 0xffffffff80c762dd at kern_ioctl+0x26d
#16 0xffffffff80c75fc0 at sys_ioctl+0x100
#17 0xffffffff810b389c at amd64_syscall+0x10c
Uptime: 10d23h58m39s
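
The absolute addresses differ between crashes, so it helps to compare them relative to a known frame. The Python sketch below uses only values quoted in this report: the five instruction pointers and the `zap_lookup_norm+0x68` addresses from the two KDB backtraces. It assumes the address differences come from the ZFS code being loaded at a different base each boot. The dictionary and variable names are illustrative.

```python
# Faulting instruction pointers, in the order they appear in the report.
FAULT_IPS = {
    "description": 0xffffffff8286b65c,
    "comment 4": 0xffffffff8286a65c,
    "comment 6": 0xffffffff8286a65c,
    "comment 7": 0xffffffff822e865c,
    "comment 8": 0xffffffff8287565c,
}

# Address of frame #6 (zap_lookup_norm+0x68) where a backtrace was posted.
ZAP_LOOKUP_NORM_FRAME = {
    "comment 7": 0xffffffff822e9828,
    "comment 8": 0xffffffff82876828,
}

# Distance from the faulting instruction up to frame #6.
distances = {name: frame - FAULT_IPS[name]
             for name, frame in ZAP_LOOKUP_NORM_FRAME.items()}

# Offset within a 4 KiB page; unchanged if code moves by whole pages.
page_offsets = {ip & 0xfff for ip in FAULT_IPS.values()}

for name, distance in distances.items():
    print(f"{name}: fault is {distance:#x} bytes below zap_lookup_norm+0x68")
print("page offsets:", sorted(hex(offset) for offset in page_offsets))
```

Both backtraces give a distance of 0x11cc, and all five instruction pointers share the page offset 0x65c. That is consistent with the same faulting instruction each time, though it is not proof.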

Comment 9 Graham Perrin 2023-10-03 19:18:28 UTC

Do all pools scrub without error? 

How is S.M.A.R.T. status for each of the hard disk drives?

(In reply to porsolic from comment #0)

> Similar PR from a decade ago: #174372

From bug 174372 comment 4 (the recent closure): 

> ⋯ I'm sure the other issue that was linked is unrelated to this.

Comment 10 Graham Perrin 2023-10-03 19:21:09 UTC

(Sorry, the repeat addition of 174372 was a slip of the fingers. I'll repeat the removal. Apologies for the noise.)

Comment 11 bsd_orsolic 2023-12-04 22:40:56 UTC

Created attachment 246786 [details]
smart-nvme

Comment 12 bsd_orsolic 2023-12-04 22:41:52 UTC

Created attachment 246787 [details]
smart-hdd0

Comment 13 bsd_orsolic 2023-12-04 22:42:14 UTC

Created attachment 246788 [details]
smart-hdd1

Comment 14 bsd_orsolic 2023-12-04 22:43:33 UTC

(In reply to Graham Perrin from comment #9)

zpool scrub of the mirrored mechanical disks finishes without error.
zpool scrub of the single NVMe disk finishes with a (semi?) error:
  scan: scrub repaired 0B in 00:20:58 with 0 errors on Thu Nov 30 14:56:10 2023
errors: 3 data errors, use '-v' for a list

But "zpool status -v" shows more than 3 errors (189 to be exact), all of them contained in my home partition's snapshots, like:
        pool/encrypted/home:<0x1>
        pool/encrypted/home@auto_daily-2023-12-03_02.01.00--1w:<0x1>
        pool/encrypted/home@auto_hourly-2023-12-02_22.00.00--2d:<0x1>
        pool/encrypted/home@auto_hourly-2023-12-02_20.00.00--2d:<0x1>

All of these errors can be cleared with "zpool clear" followed by "zpool scrub":
  scan: scrub repaired 0B in 00:13:29 with 0 errors on Mon Dec  4 23:37:59 2023

smartctl output for the disks is attached.

Regarding the related PR: that's a fascinating find and reply for a decade-old PR!
I also have PCI cards (all of which are passed through to VMs): Intel 4xGbit, Intel Wi-Fi, integrated 2.5Gbit and a cheap ASMedia USB controller.

However, I have not experienced any crashes at 3 AM since I upgraded to the 14.0 branch.