Bug 250784 - integer divide fault in zfs_rmnode
Summary: integer divide fault in zfs_rmnode
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: 12.2-RELEASE
Hardware: amd64 Any
: --- Affects Only Me
Assignee: freebsd-fs (Nobody)
URL:
Keywords: crash, regression
Depends on:
Blocks:
 
Reported: 2020-11-01 18:32 UTC by Steven Schlansker
Modified: 2022-10-12 00:49 UTC (History)
3 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Steven Schlansker 2020-11-01 18:32:44 UTC
I recently upgraded to 
FreeBSD 12.2-RELEASE r366954 GENERIC amd64
and run plexmediaserver-plexpass-1.20.3.3483
Now I consistently receive a kernel panic soon after boot:

Fatal trap 18: integer divide fault while in kernel mode
cpuid = 11; apic id = 0b
instruction pointer	= 0x20:0xffffffff82963725
stack pointer	        = 0x28:0xfffffe009f78d480
frame pointer	        = 0x28:0xfffffe009f78d510
code segment		= base rx0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 2247 (Plex Media Server)
trap number		= 18
panic: integer divide fault
cpuid = 11
time = 1604252527
KDB: stack backtrace:
#0 0xffffffff80c0a8f5 at kdb_backtrace+0x65
#1 0xffffffff80bbeb1b at vpanic+0x17b
#2 0xffffffff80bbe993 at panic+0x43
#3 0xffffffff8108f911 at trap_fatal+0x391
#4 0xffffffff8108ed97 at trap+0x67
#5 0xffffffff81066f38 at calltrap+0x8
#6 0xffffffff82a148ce at zfs_rmnode+0x5e
#7 0xffffffff82a35612 at zfs_freebsd_reclaim+0x42
#8 0xffffffff812482db at VOP_RECLAIM_APV+0x7b
#9 0xffffffff80c8e376 at vgonel+0x216
#10 0xffffffff80c8e9c5 at vrecycle+0x45
#11 0xffffffff82a355bd at zfs_freebsd_inactive+0xd
#12 0xffffffff812481ab at VOP_INACTIVE_APV+0x7b
#13 0xffffffff80c8d0b0 at vinactive+0xf0
#14 0xffffffff80c8d5c8 at vputx+0x2b8
#15 0xffffffff80c9722c at kern_unlinkat+0x20c
#16 0xffffffff810904c7 at amd64_syscall+0x387
#17 0xffffffff8106785e at fast_syscall_common+0xf8
Uptime: 1m27s

I am not clear exactly which file is unlinked.  Thank you for any assistance in resolving this situation, and let me know if I can provide additional information.
Comment 1 Andriy Gapon freebsd_committer freebsd_triage 2020-11-02 07:51:52 UTC
Do you have a crash dump?
Comment 2 Steven Schlansker 2020-11-02 19:38:18 UTC
Hi Andriy, yes, I did capture a crash dump - the panic trace above is from `dmesg -M`

I tried to follow the FAQ entries about kgdb debugging, but despite running a GENERIC kernel which should have symbols, kgdb does not seem to be symbolicating my bt:

% /usr/libexec/kgdb /boot/kernel/kernel vmcore.0
GNU gdb 6.1.1 [FreeBSD]
This GDB was configured as "amd64-marcel-freebsd"...(no debugging symbols found)...
Attempt to extract a component of a value that is not a structure pointer.
Attempt to extract a component of a value that is not a structure pointer.
#0  0xffffffff80bbe90a in doadump ()
(kgdb) bt
#0  0xffffffff80bbe90a in doadump ()
#1  0xffffffff80bbe735 in kern_reboot ()
#2  0xffffffff80bbeb73 in vpanic ()
#3  0xffffffff80bbe993 in panic ()
#4  0xffffffff8108f911 in trap_fatal ()
#5  0xffffffff8108ed97 in trap ()
#6  0xffffffff81066f38 in calltrap ()
#7  0xffffffff82963725 in ?? ()
#8  0xfffff80432466800 in ?? ()
#9  0x0000000002000000 in ?? ()
#10 0xfffff804325045c0 in ?? ()
#11 0x000000000a379546 in ?? ()
#12 0xffffffffffffffff in ?? ()
#13 0xfffff8000521bc00 in ?? ()
#14 0xfffff80003a9f988 in ?? ()
#15 0x0000000000000000 in ?? ()
Comment 3 Steven Schlansker 2020-11-09 23:19:19 UTC
Is there any additional information I can provide here?  Thanks
Comment 4 Mark Johnston freebsd_committer freebsd_triage 2020-11-12 20:03:37 UTC
Try using kgdb from ports (comes with the gdb package), the one in the base system is ancient and slated to be removed.

Though, the lack of symbols I think implies that your /usr/lib/debug/boot/kernel is not populated.  Kernel debug symbols are shipped as a separate set, for 12.2 they'll be in https://ftp.freebsd.org/pub/FreeBSD/releases/amd64/12.2-RELEASE/kernel-dbg.txz
Comment 5 Steven Schlansker 2020-11-14 20:41:22 UTC
That works much better.  I tried to briefly dig in and tried to identify the division causing the panic, which does look like an integer division of zero (minimum) by zero (iblkrange):

Reading symbols from /boot/kernel/kernel...
Reading symbols from /usr/lib/debug//boot/kernel/kernel.debug...

Unread portion of the kernel message buffer:


Fatal trap 18: integer divide fault while in kernel mode
cpuid = 11; apic id = 0b
instruction pointer     = 0x20:0xffffffff82963725
stack pointer           = 0x28:0xfffffe009f78d480
frame pointer           = 0x28:0xfffffe009f78d510
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 2247 (Plex Media Server)
trap number             = 18
panic: integer divide fault
cpuid = 11
time = 1604252527
KDB: stack backtrace:
#0 0xffffffff80c0a8f5 at kdb_backtrace+0x65
#1 0xffffffff80bbeb1b at vpanic+0x17b
#2 0xffffffff80bbe993 at panic+0x43
#3 0xffffffff8108f911 at trap_fatal+0x391
#4 0xffffffff8108ed97 at trap+0x67
#5 0xffffffff81066f38 at calltrap+0x8
#6 0xffffffff82a148ce at zfs_rmnode+0x5e
#7 0xffffffff82a35612 at zfs_freebsd_reclaim+0x42
#8 0xffffffff812482db at VOP_RECLAIM_APV+0x7b
#9 0xffffffff80c8e376 at vgonel+0x216
#10 0xffffffff80c8e9c5 at vrecycle+0x45
#11 0xffffffff82a355bd at zfs_freebsd_inactive+0xd
#12 0xffffffff812481ab at VOP_INACTIVE_APV+0x7b
#13 0xffffffff80c8d0b0 at vinactive+0xf0
#14 0xffffffff80c8d5c8 at vputx+0x2b8
#15 0xffffffff80c9722c at kern_unlinkat+0x20c
#16 0xffffffff810904c7 at amd64_syscall+0x387
#17 0xffffffff8106785e at fast_syscall_common+0xf8
Uptime: 1m27s
(da0:isci0:0:0:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 
(da0:isci0:0:0:0): CAM status: Command timeout
(da0:isci0:0:0:0): Error 5, Retries exhausted
(da0:isci0:0:0:0): Synchronize cache failed
Dumping 1431 out of 32694 MB:..2%..11%..21%..31%..41%..51%..61%..71%..81%..91%

__curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
warning: Source file is more recent than executable.
55              __asm("movq %%gs:%P1,%0" : "=r" (td) : "n" (offsetof(struct pcpu,
(kgdb) 
(kgdb) bt
#0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
#1  doadump (textdump=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:371
#2  0xffffffff80bbe735 in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:451
#3  0xffffffff80bbeb73 in vpanic (fmt=<optimized out>, ap=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:880
#4  0xffffffff80bbe993 in panic (fmt=<unavailable>) at /usr/src/sys/kern/kern_shutdown.c:807
#5  0xffffffff8108f911 in trap_fatal (frame=0xfffffe009f78d3c0, eva=0) at /usr/src/sys/amd64/amd64/trap.c:921
#6  0xffffffff8108ed97 in trap (frame=0xfffffe009f78d3c0) at /usr/src/sys/amd64/amd64/trap.c:214
#7  <signal handler called>
#8  0xffffffff82963725 in get_next_chunk (dn=0xfffff804325045c0, start=<optimized out>, minimum=0, l1blks=<optimized out>)
    at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu.c:721
#9  dmu_free_long_range_impl (os=<optimized out>, dn=0xfffff804325045c0, offset=0, length=<optimized out>)
    at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu.c:811
#10 dmu_free_long_range (os=<optimized out>, object=<optimized out>, offset=<optimized out>, length=<optimized out>)
    at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu.c:887
#11 0xffffffff82a148ce in zfs_rmnode (zp=0xfffff803fc50a900)
    at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_dir.c:452
#12 0xffffffff82a35612 in zfs_freebsd_reclaim (ap=<optimized out>)
    at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c:5416
#13 0xffffffff812482db in VOP_RECLAIM_APV (vop=0xffffffff82adf1a0 <zfs_vnodeops>, a=0xfffffe009f78d720) at vnode_if.c:2021
#14 0xffffffff80c8e376 in VOP_RECLAIM (vp=0xfffff80414bde780, td=0xfffff80003b5d000) at ./vnode_if.h:830
#15 vgonel (vp=0xfffff80414bde780) at /usr/src/sys/kern/vfs_subr.c:3501
#16 0xffffffff80c8e9c5 in vrecyclel (vp=0xfffff80414bde780) at /usr/src/sys/kern/vfs_subr.c:3342
#17 vrecycle (vp=0xfffff80414bde780) at /usr/src/sys/kern/vfs_subr.c:3323
#18 0xffffffff82a355bd in zfs_freebsd_inactive (ap=<optimized out>)
    at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c:5387
#19 0xffffffff812481ab in VOP_INACTIVE_APV (vop=0xffffffff82adf1a0 <zfs_vnodeops>, a=0xfffffe009f78d7f8) at vnode_if.c:1955
#20 0xffffffff80c8d0b0 in VOP_INACTIVE (vp=0xfffff80414bde780, td=0xfffff80003b5d000) at ./vnode_if.h:807
#21 vinactive (vp=0xfffff80414bde780, td=0xfffff80003b5d000) at /usr/src/sys/kern/vfs_subr.c:3166
#22 0xffffffff80c8d5c8 in vputx (vp=0xfffff80414bde780, func=2) at /usr/src/sys/kern/vfs_subr.c:2894
#23 0xffffffff80c9722c in kern_unlinkat (td=0xfffff80003b5d000, fd=-100, 
    path=0x810904000 <error: Cannot access memory at address 0x810904000>, pathseg=UIO_USERSPACE, oldinum=<optimized out>)
    at /usr/src/sys/kern/vfs_syscalls.c:1840
#24 0xffffffff810904c7 in syscallenter (td=0xfffff80003b5d000) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:144
#25 amd64_syscall (td=0xfffff80003b5d000, traced=0) at /usr/src/sys/amd64/amd64/trap.c:1163
#26 <signal handler called>
#27 0x0000000809f805aa in ?? ()
Backtrace stopped: Cannot access memory at address 0x7fffffffe238



/sys/amd64(kgdb) frame 7
#7  <signal handler called>
(kgdb) up
#8  0xffffffff82963725 in get_next_chunk (dn=0xfffff804325045c0, start=<optimized out>, minimum=0, l1blks=<optimized out>)
    at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu.c:721
warning: Source file is more recent than executable.
721                 (roundup(*start, iblkrange) - (minimum / iblkrange * iblkrange)) /
(kgdb) list
716              * L1 blocks in this range have data. If we can, we use this
717              * worst case value as an estimate so we can avoid having to look
718              * at the object's actual data.
719              */
720             uint64_t total_l1blks =
721                 (roundup(*start, iblkrange) - (minimum / iblkrange * iblkrange)) /
722                 iblkrange;
723             if (total_l1blks <= maxblks) {
724                     *l1blks = total_l1blks;
725                     *start = minimum;
(kgdb) print iblkrange
$1 = 0
(kgdb) print minimum
$2 = 0
(kgdb) print start
$3 = <optimized out>
Comment 6 Steven Schlansker 2020-12-06 22:37:39 UTC
Still affects 12.2-RELEASE-p1
We've identified the offending files and quarantined them, preventing unintentional crashes for the moment...
Comment 7 Steven Schlansker 2021-01-05 18:19:25 UTC
Unfortunately, the quarantine is not working.  We seem to still suffer crashes when deleting files.  I have tried to make new ZFS Filesystems, and still observe the crash there, so it must be at some lower level in the zpool.

ZFS has no fsck utility.  How can we recover our pool to a working state? :(
Comment 8 Mark Johnston freebsd_committer freebsd_triage 2021-01-08 17:45:29 UTC
Sorry, I have no idea what the issue is here.  I'd suggest asking on the freebsd-fs mailing list for suggestions.
Comment 9 Xin LI freebsd_committer freebsd_triage 2021-01-27 23:23:35 UTC
Looks like https://github.com/openzfs/zfs/issues/8778 ?
Comment 10 Peter Eriksson 2021-01-28 11:44:14 UTC
Definitely looks like the same bug considering that the simple fix from OpenZFS is not applied to the FreeBSD zfs code - so that bug is definitely there.
Comment 11 Steven Schlansker 2021-01-29 07:26:24 UTC
I can confirm, the patch that Xin supplied on freebsd-fs clears this issue up entirely for us.  Thank you!