Bug 216134 - linux emulation: panic in linux_file.c#419, getdents, when running (memory hungry) backup app using NFS [kernel core dump exists]
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 11.0-RELEASE
Hardware: Any Any
Importance: --- Affects Only Me
Assignee: Dmitry Chagin
 
Reported: 2017-01-16 10:00 UTC by Palle Girgensohn
Modified: 2017-05-01 13:17 UTC

Description Palle Girgensohn freebsd_committer 2017-01-16 10:00:52 UTC
Hi,

We run a FreeBSD 11.0-RELEASE-p6 system with the GENERIC kernel,
running in VMware with 32 GB of memory and 4 cores.

open-vm-tools-nox11-1280544_16,1 
linux_base-c6_64-6.8

The application is IBM's Tivoli backup, TSM. It normally runs with no problems, but over the last month it has triggered a series of panics. We believe this is due to too many files to back up. It tries to back up a snapshot of a file system mounted over NFS, so suddenly there were four times the expected number of files. TSM does not handle many files very well. Still, you would not expect the kernel to panic. This does not look like a resource problem: the memory graphs are flat until the crash, with no warnings.

Since the release binary version has no debug symbols (a separate bug?) I built a kernel to get a core dump consistent with the debug symbols. Seems the stack is corrupted though? Anything more I can do with this? Tips?

# kgdb kernel.debug /var/crash/vmcore.3
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...

Unread portion of the kernel message buffer:
panic: vm_fault: fault on nofault entry, addr: fffffe0004aa9000
cpuid = 1
KDB: stack backtrace:
#0 0xffffffff80b24077 at kdb_backtrace+0x67
#1 0xffffffff80ad93e2 at vpanic+0x182
#2 0xffffffff80ad9253 at panic+0x43
#3 0xffffffff80e12601 at vm_fault_hold+0x2721
#4 0xffffffff80e0fe98 at vm_fault+0x78
#5 0xffffffff80fa0e59 at trap_pfault+0xf9
#6 0xffffffff80fa04ec at trap+0x26c
#7 0xffffffff80f84141 at calltrap+0x8
#8 0xffffffff80fa16ae at amd64_syscall+0x4ce
#9 0xffffffff80f8442b at Xfast_syscall+0xfb
Uptime: 11d0h40m52s
Dumping 2657 out of 32735 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%

Reading symbols from /boot/kernel/accf_data.ko...Reading symbols from /usr/lib/debug//boot/kernel/accf_data.ko.debug...done.
done.
Loaded symbols for /boot/kernel/accf_data.ko
Reading symbols from /boot/kernel/accf_http.ko...Reading symbols from /usr/lib/debug//boot/kernel/accf_http.ko.debug...done.
done.
Loaded symbols for /boot/kernel/accf_http.ko
Reading symbols from /boot/kernel/nullfs.ko...Reading symbols from /usr/lib/debug//boot/kernel/nullfs.ko.debug...done.
done.
Loaded symbols for /boot/kernel/nullfs.ko
Reading symbols from /boot/kernel/linprocfs.ko...Reading symbols from /usr/lib/debug//boot/kernel/linprocfs.ko.debug...done.
done.
Loaded symbols for /boot/kernel/linprocfs.ko
Reading symbols from /boot/kernel/linux_common.ko...Reading symbols from /usr/lib/debug//boot/kernel/linux_common.ko.debug...done.
done.
Loaded symbols for /boot/kernel/linux_common.ko
Reading symbols from /usr/local/lib/vmware-tools/modules/drivers/vmmemctl.ko...done.
Loaded symbols for /usr/local/lib/vmware-tools/modules/drivers/vmmemctl.ko
Reading symbols from /usr/local/lib/vmware-tools/modules/drivers/vmxnet.ko...done.
Loaded symbols for /usr/local/lib/vmware-tools/modules/drivers/vmxnet.ko
Reading symbols from /usr/local/lib/vmware-tools/modules/drivers/vmblock.ko...done.
Loaded symbols for /usr/local/lib/vmware-tools/modules/drivers/vmblock.ko
Reading symbols from /usr/local/lib/vmware-tools/modules/drivers/vmhgfs.ko...done.
Loaded symbols for /usr/local/lib/vmware-tools/modules/drivers/vmhgfs.ko
Reading symbols from /boot/kernel/linux.ko...Reading symbols from /usr/lib/debug//boot/kernel/linux.ko.debug...done.
done.
Loaded symbols for /boot/kernel/linux.ko
Reading symbols from /boot/kernel/linux64.ko...Reading symbols from /usr/lib/debug//boot/kernel/linux64.ko.debug...done.
done.
Loaded symbols for /boot/kernel/linux64.ko
#0  doadump (textdump=<value optimized out>) at pcpu.h:221
221		__asm("movq %%gs:%1,%0" : "=r" (td)
(kgdb) bt
#0  doadump (textdump=<value optimized out>) at pcpu.h:221
#1  0xffffffff80ad8e69 in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:366
#2  0xffffffff80ad941b in vpanic (fmt=<value optimized out>, ap=<value optimized out>) at /usr/src/sys/kern/kern_shutdown.c:759
#3  0xffffffff80ad9253 in panic (fmt=0x0) at /usr/src/sys/kern/kern_shutdown.c:690
#4  0xffffffff80e12601 in vm_fault_hold (map=<value optimized out>, vaddr=<value optimized out>, fault_type=<value optimized out>, fault_flags=<value optimized out>, m_hold=<value optimized out>) at /usr/src/sys/vm/vm_fault.c:330
#5  0xffffffff80e0fe98 in vm_fault (map=0xfffff80003000000, vaddr=<value optimized out>, fault_type=1 '\001', fault_flags=<value optimized out>) at /usr/src/sys/vm/vm_fault.c:273
#6  0xffffffff80fa0e59 in trap_pfault (frame=0xfffffe085e0638d0, usermode=0) at /usr/src/sys/amd64/amd64/trap.c:741
#7  0xffffffff80fa04ec in trap (frame=0xfffffe085e0638d0) at /usr/src/sys/amd64/amd64/trap.c:442
#8  0xffffffff80f84141 in calltrap () at /usr/src/sys/amd64/amd64/exception.S:236
#9  0xffffffff8248785a in getdents_common (td=<value optimized out>, args=<value optimized out>, is64bit=<value optimized out>) at /usr/src/sys/modules/linux64/../../compat/linux/linux_file.c:416
#10 0xffffffff80fa16ae in amd64_syscall (td=<value optimized out>, traced=0) at subr_syscall.c:135
#11 0xffffffff80f8442b in Xfast_syscall () at /usr/src/sys/amd64/amd64/exception.S:396
#12 0x00000008034a8fc5 in ?? ()
Previous frame inner to this frame (corrupt stack?)
Current language:  auto; currently minimal
(kgdb)
Comment 1 Dmitry Chagin freebsd_committer 2017-01-16 17:42:53 UTC
Please install the gdb712 package and use /usr/local/bin/kgdb instead of the vanilla kgdb.
Comment 2 Dmitry Chagin freebsd_committer 2017-01-16 17:43:13 UTC
grab
Comment 3 Palle Girgensohn freebsd_committer 2017-01-16 23:48:32 UTC
# /usr/local/bin/kgdb /boot/kernel/kernel /var/crash/vmcore.3
GNU gdb (GDB) 7.12 [GDB v7.12 for FreeBSD]
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-portbld-freebsd11.0".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /boot/kernel/kernel...Reading symbols from /usr/lib/debug//boot/kernel/kernel.debug...done.
done.

Unread portion of the kernel message buffer:
panic: vm_fault: fault on nofault entry, addr: fffffe0004aa9000
cpuid = 1
KDB: stack backtrace:
#0 0xffffffff80b24077 at kdb_backtrace+0x67
#1 0xffffffff80ad93e2 at vpanic+0x182
#2 0xffffffff80ad9253 at panic+0x43
#3 0xffffffff80e12601 at vm_fault_hold+0x2721
#4 0xffffffff80e0fe98 at vm_fault+0x78
#5 0xffffffff80fa0e59 at trap_pfault+0xf9
#6 0xffffffff80fa04ec at trap+0x26c
#7 0xffffffff80f84141 at calltrap+0x8
#8 0xffffffff80fa16ae at amd64_syscall+0x4ce
#9 0xffffffff80f8442b at Xfast_syscall+0xfb
Uptime: 11d0h40m52s
Dumping 2657 out of 32735 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%

Reading symbols from /boot/kernel/accf_data.ko...Reading symbols from /usr/lib/debug//boot/kernel/accf_data.ko.debug...done.
done.
Reading symbols from /boot/kernel/accf_http.ko...Reading symbols from /usr/lib/debug//boot/kernel/accf_http.ko.debug...done.
done.
Reading symbols from /boot/kernel/nullfs.ko...Reading symbols from /usr/lib/debug//boot/kernel/nullfs.ko.debug...done.
done.
Reading symbols from /boot/kernel/linprocfs.ko...Reading symbols from /usr/lib/debug//boot/kernel/linprocfs.ko.debug...done.
done.
Reading symbols from /boot/kernel/linux_common.ko...Reading symbols from /usr/lib/debug//boot/kernel/linux_common.ko.debug...done.
done.
Reading symbols from /usr/local/lib/vmware-tools/modules/drivers/vmmemctl.ko...(no debugging symbols found)...done.
Reading symbols from /usr/local/lib/vmware-tools/modules/drivers/vmxnet.ko...(no debugging symbols found)...done.
Reading symbols from /usr/local/lib/vmware-tools/modules/drivers/vmblock.ko...(no debugging symbols found)...done.
Reading symbols from /usr/local/lib/vmware-tools/modules/drivers/vmhgfs.ko...(no debugging symbols found)...done.
Reading symbols from /boot/kernel/linux.ko...Reading symbols from /usr/lib/debug//boot/kernel/linux.ko.debug...done.
done.
Reading symbols from /boot/kernel/linux64.ko...Reading symbols from /usr/lib/debug//boot/kernel/linux64.ko.debug...done.
done.
__curthread () at ./machine/pcpu.h:221
221		__asm("movq %%gs:%1,%0" : "=r" (td)
(kgdb) bt
#0  __curthread () at ./machine/pcpu.h:221
#1  doadump (textdump=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:298
#2  0xffffffff80ad8e69 in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:366
#3  0xffffffff80ad941b in vpanic (fmt=<optimized out>, ap=0xfffffe085e0634c0) at /usr/src/sys/kern/kern_shutdown.c:759
#4  0xffffffff80ad9253 in panic (fmt=<unavailable>) at /usr/src/sys/kern/kern_shutdown.c:690
#5  0xffffffff80e12601 in vm_fault_hold (map=<optimized out>, vaddr=<optimized out>, fault_type=<optimized out>, fault_flags=0, m_hold=0x0) at /usr/src/sys/vm/vm_fault.c:330
#6  0xffffffff80e0fe98 in vm_fault (map=0xfffff80003000000, vaddr=<optimized out>, fault_type=1 '\001', fault_flags=<optimized out>) at /usr/src/sys/vm/vm_fault.c:273
#7  0xffffffff80fa0e59 in trap_pfault (frame=0xfffffe085e0638d0, usermode=0) at /usr/src/sys/amd64/amd64/trap.c:741
#8  0xffffffff80fa04ec in trap (frame=0xfffffe085e0638d0) at /usr/src/sys/amd64/amd64/trap.c:442
#9  <signal handler called>
#10 getdents_common (td=<optimized out>, args=<optimized out>, is64bit=<optimized out>) at /usr/src/sys/modules/linux64/../../compat/linux/linux_file.c:419
#11 0xffffffff80fa16ae in syscallenter (td=<optimized out>, sa=<optimized out>) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:135
#12 amd64_syscall (td=<optimized out>, traced=0) at /usr/src/sys/amd64/amd64/trap.c:942
#13 <signal handler called>
#14 0x00000008034a8fc5 in ?? ()
Backtrace stopped: Cannot access memory at address 0x877feb7c0
Comment 4 Palle Girgensohn freebsd_committer 2017-02-02 10:26:26 UTC
Hi,

We still get crashes in getdents, at a rate of a few per week.

Any ideas, anything we can test?

Palle

# /usr/local/bin/kgdb /boot/kernel/kernel /var/crash/vmcore.3
GNU gdb (GDB) 7.12 [GDB v7.12 for FreeBSD]
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-portbld-freebsd11.0".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /boot/kernel/kernel...Reading symbols from /usr/lib/debug//boot/kernel/kernel.debug...done.
done.

Unread portion of the kernel message buffer:
panic: vm_fault: fault on nofault entry, addr: fffffe0004aa9000
cpuid = 1
KDB: stack backtrace:
#0 0xffffffff80b24077 at kdb_backtrace+0x67
#1 0xffffffff80ad93e2 at vpanic+0x182
#2 0xffffffff80ad9253 at panic+0x43
#3 0xffffffff80e12601 at vm_fault_hold+0x2721
#4 0xffffffff80e0fe98 at vm_fault+0x78
#5 0xffffffff80fa0e59 at trap_pfault+0xf9
#6 0xffffffff80fa04ec at trap+0x26c
#7 0xffffffff80f84141 at calltrap+0x8
#8 0xffffffff80fa16ae at amd64_syscall+0x4ce
#9 0xffffffff80f8442b at Xfast_syscall+0xfb
Uptime: 11d0h40m52s
Dumping 2657 out of 32735 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%

Reading symbols from /boot/kernel/accf_data.ko...Reading symbols from /usr/lib/debug//boot/kernel/accf_data.ko.debug...done.
done.
Reading symbols from /boot/kernel/accf_http.ko...Reading symbols from /usr/lib/debug//boot/kernel/accf_http.ko.debug...done.
done.
Reading symbols from /boot/kernel/nullfs.ko...Reading symbols from /usr/lib/debug//boot/kernel/nullfs.ko.debug...done.
done.
Reading symbols from /boot/kernel/linprocfs.ko...Reading symbols from /usr/lib/debug//boot/kernel/linprocfs.ko.debug...done.
done.
Reading symbols from /boot/kernel/linux_common.ko...Reading symbols from /usr/lib/debug//boot/kernel/linux_common.ko.debug...done.
done.
Reading symbols from /usr/local/lib/vmware-tools/modules/drivers/vmmemctl.ko...(no debugging symbols found)...done.
Reading symbols from /usr/local/lib/vmware-tools/modules/drivers/vmxnet.ko...(no debugging symbols found)...done.
Reading symbols from /usr/local/lib/vmware-tools/modules/drivers/vmblock.ko...(no debugging symbols found)...done.
Reading symbols from /usr/local/lib/vmware-tools/modules/drivers/vmhgfs.ko...(no debugging symbols found)...done.
Reading symbols from /boot/kernel/linux.ko...Reading symbols from /usr/lib/debug//boot/kernel/linux.ko.debug...done.
done.
Reading symbols from /boot/kernel/linux64.ko...Reading symbols from /usr/lib/debug//boot/kernel/linux64.ko.debug...done.
done.
__curthread () at ./machine/pcpu.h:221
221	./machine/pcpu.h: No such file or directory.
(kgdb) bt
#0  __curthread () at ./machine/pcpu.h:221
#1  doadump (textdump=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:298
#2  0xffffffff80ad8e69 in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:366
#3  0xffffffff80ad941b in vpanic (fmt=<optimized out>, ap=0xfffffe085e0634c0) at /usr/src/sys/kern/kern_shutdown.c:759
#4  0xffffffff80ad9253 in panic (fmt=<unavailable>) at /usr/src/sys/kern/kern_shutdown.c:690
#5  0xffffffff80e12601 in vm_fault_hold (map=<optimized out>, vaddr=<optimized out>, fault_type=<optimized out>, fault_flags=0, m_hold=0x0) at /usr/src/sys/vm/vm_fault.c:330
#6  0xffffffff80e0fe98 in vm_fault (map=0xfffff80003000000, vaddr=<optimized out>, fault_type=1 '\001', fault_flags=<optimized out>) at /usr/src/sys/vm/vm_fault.c:273
#7  0xffffffff80fa0e59 in trap_pfault (frame=0xfffffe085e0638d0, usermode=0) at /usr/src/sys/amd64/amd64/trap.c:741
#8  0xffffffff80fa04ec in trap (frame=0xfffffe085e0638d0) at /usr/src/sys/amd64/amd64/trap.c:442
#9  <signal handler called>
#10 getdents_common (td=<optimized out>, args=<optimized out>, is64bit=<optimized out>) at /usr/src/sys/modules/linux64/../../compat/linux/linux_file.c:419
#11 0xffffffff80fa16ae in syscallenter (td=<optimized out>, sa=<optimized out>) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:135
#12 amd64_syscall (td=<optimized out>, traced=0) at /usr/src/sys/amd64/amd64/trap.c:942
#13 <signal handler called>
#14 0x00000008034a8fc5 in ?? ()
Backtrace stopped: Cannot access memory at address 0x877feb7c0
Comment 5 Konstantin Belousov freebsd_committer 2017-02-02 17:20:00 UTC
Try to find out which directory was being listed when the panic occurred.

I doubt that your panic is a linuxolator bug; much more likely it is an issue in the underlying filesystem's VOP_READDIR(). That said, the list of loaded modules includes some quite suspicious names like *fs, which might be the culprit. E.g., if you do not use such a module, is the panic still reproducible?
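
One way to narrow this down, offered as a rough sketch rather than a tested procedure: walk the same tree with a native (non-Linux) process that records every directory before reading it. If the machine panics during the walk, the last line of the log should name the directory that was being listed. The function name `walk_logged` and the paths in the usage note are hypothetical, and Python is used only for brevity.

```python
import os
import sys


def walk_logged(root, log):
    """Walk `root` depth-first, recording each directory in `log` before listing it.

    Returns the number of directory entries seen.
    """
    seen = 0
    stack = [root]
    while stack:
        path = stack.pop()
        # Record the directory *before* reading it and force the line to disk,
        # so the last line in the log names the directory that was in flight.
        log.write(path + "\n")
        log.flush()
        os.fsync(log.fileno())
        try:
            with os.scandir(path) as entries:
                for entry in entries:
                    seen += 1
                    if entry.is_dir(follow_symlinks=False):
                        stack.append(entry.path)
        except OSError as err:
            # An error abandons the rest of this directory; the walk continues.
            print(f"skipping {path}: {err}", file=sys.stderr)
    return seen


if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[2], "w") as log_file:
        print(walk_logged(sys.argv[1], log_file))
```

For example, run it natively as `python3 walk_logged.py /backup_opt/nfs /var/tmp/walk.log` (both paths are placeholders). The fsync per directory is slow over millions of directories, but without it the last lines may not survive the panic.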
Comment 6 Palle Girgensohn freebsd_committer 2017-02-02 17:32:41 UTC
(In reply to Konstantin Belousov from comment #5)

By modules named *fs, you mean these:
$ kldstat|grep fs
Id Refs Address            Size     Name
 4    1 0xffffffff82419000 665d     nullfs.ko
 5    1 0xffffffff82420000 a9f1     linprocfs.ko
10    1 0xffffffff8243c000 8eed     vmhgfs.ko

vmhgfs.ko is for VMware; it is a VMware client module.

I'll try without vmhgfs.ko.

As stated above, the files are on NFS. I cannot see exactly which one was being read, though. I am not sure it is the same one each time: sometimes it dies after around 2M files, sometimes after around 5M files. The last line this morning was:

2017-02-02 07.10.38 ANS1898I ***** Processed 5 708 500 files *****

# cat /var/crash/info.7
Dump header from device: /dev/da0p3
  Architecture: amd64
  Architecture Version: 2
  Dump Length: 2562928640
  Blocksize: 512
  Dumptime: Thu Feb  2 07:11:11 2017
  Hostname: XXX
  Magic: FreeBSD Kernel Dump
  Version String: FreeBSD 11.0-RELEASE-p6 #0: Mon Jan  2 22:53:23 CET 2017
    girgen@XXX:/usr/obj/usr/src/sys/GENERIC
  Panic String: vm_fault: fault on nofault entry, addr: fffffe00045c6000
  Dump Parity: 664679908
  Bounds: 7
  Dump Status: good
Comment 7 Palle Girgensohn freebsd_committer 2017-02-02 17:38:57 UTC
(In reply to Palle Girgensohn from comment #6)

I forgot to mention that the backup process that seems to kill the machine is also running through some nullfs-mounted volumes. The reason for the nullfs mounts is to make sure the linuxulator does not back up its own files instead of the ones available to the base system (since it checks /compat/linux first when looking for files).

$ df
Filesystem                             1K-blocks      Used     Avail Capacity  Mounted on
/dev/da0p2                              30450268  23108028   4906220    82%    /
devfs                                          1         1         0   100%    /dev
/dev/da1p1                             203113076  84053648 102810384    45%    /opt
/opt                                   203113076  84053648 102810384    45%    /backup_opt
/etc                                    30450268  23108028   4906220    82%    /backup_etc
/var                                    30450268  23108028   4906220    82%    /backup_var
procfs                                         4         4         0   100%    /proc
linprocfs                                      4         4         0   100%    /compat/linux/proc
nfsserver:/volume 1603534848 849065344 754469504    53%    /opt/nfs
nfsserver:/volume 1603534848 849065344 754469504    53%    /backup_opt/nfs

$ mount
/dev/da0p2 on / (ufs, local, journaled soft-updates)
devfs on /dev (devfs, local, multilabel)
/dev/da1p1 on /opt (ufs, local, journaled soft-updates)
/opt on /backup_opt (nullfs, local)
/etc on /backup_etc (nullfs, local)
/var on /backup_var (nullfs, local)
procfs on /proc (procfs, local)
linprocfs on /compat/linux/proc (linprocfs, local)
nfsserver:/volume on /opt/nfs (nfs)
nfsserver:/volume on /backup_opt/nfs (nfs)
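
To make the reason for the nullfs mounts concrete, here is a simplified model of the lookup order described above. It is only a sketch: the real logic lives in the kernel, `linux_lookup` is a made-up name, and the `exists` parameter is there only so the model can be tried without a real /compat/linux tree.

```python
import os


def linux_lookup(path, emul_root="/compat/linux", exists=os.path.exists):
    """Return the path a Linux binary would end up opening for an absolute `path`.

    Simplified model: prefer the copy under the emulation root and fall back
    to the native path when no such copy exists.
    """
    shadow = emul_root + path
    return shadow if exists(shadow) else path
```

Under this model, a file under /etc is shadowed whenever a copy of it exists under /compat/linux/etc, while the same file reached through /backup_etc has no counterpart under /compat/linux and so resolves to the base system's copy.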
Comment 8 Dmitry Chagin freebsd_committer 2017-05-01 13:17:17 UTC
I agree with Konstantin. Can you try without nullfs?