I recently installed 10.1 on a server that had been running 8.4. Since installing 10.1 it panics and reboots when under load. It is rare for it to go a couple of days without panicking. Here is the info from the latest crash.
FreeBSD name.replaced 10.1-RELEASE-p6 FreeBSD 10.1-RELEASE-p6 #0: Fri Feb 27 11:17:46 CST 2015 root@name.replaced:/usr/obj/usr/src/sys/TEST amd64
(The TEST kernel is just GENERIC and options IPFIREWALL_DEFAULT_TO_ACCEPT)
current process = 1341 (amavisd-milter)
trap number = 12
panic: page fault
cpuid = 3
KDB: stack backtrace:
#0 0xffffffff809638e0 at kdb_backtrace+0x60
#1 0xffffffff80928a05 at panic+0x155
#2 0xffffffff80d305ff at trap_fatal+0x38f
#3 0xffffffff80d30918 at trap_pfault+0x308
#4 0xffffffff80d2ff7a at trap+0x47a
#5 0xffffffff80d15e62 at calltrap+0x8
#6 0xffffffff809c66a5 at vfs_busy+0x45
#7 0xffffffff809bebae at lookup+0x75e
#8 0xffffffff809be154 at namei+0x4d4
#9 0xffffffff809d244e at kern_statat_vnhook+0xae
#10 0xffffffff809d22dd at sys_stat+0x2d
#11 0xffffffff80d30f31 at amd64_syscall+0x351
#12 0xffffffff80d1614b at Xfast_syscall+0xfb
Uptime: 2d21h52m53s
Dumping 1202 out of 16347 MB:..2%..11%..22%..31%..42%..51%..62%..71%..82%..91%
Reading symbols from /boot/kernel/ipl.ko.symbols...done.
Loaded symbols for /boot/kernel/ipl.ko.symbols
Reading symbols from /boot/kernel/ums.ko.symbols...done.
Loaded symbols for /boot/kernel/ums.ko.symbols
#0 doadump (textdump=<value optimized out>) at pcpu.h:219
219 pcpu.h: No such file or directory.
in pcpu.h
(kgdb) #0 doadump (textdump=<value optimized out>) at pcpu.h:219
#1 0xffffffff80928682 in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:452
#2 0xffffffff80928a44 in panic (fmt=<value optimized out>) at /usr/src/sys/kern/kern_shutdown.c:759
#3 0xffffffff80d305ff in trap_fatal (frame=<value optimized out>, eva=<value optimized out>) at /usr/src/sys/amd64/amd64/trap.c:865
#4 0xffffffff80d30918 in trap_pfault (frame=0xfffffe0469de95d0, usermode=<value optimized out>) at /usr/src/sys/amd64/amd64/trap.c:676
#5 0xffffffff80d2ff7a in trap (frame=0xfffffe0469de95d0) at /usr/src/sys/amd64/amd64/trap.c:440
#6 0xffffffff80d15e62 in calltrap () at /usr/src/sys/amd64/amd64/exception.S:232
#7 0xffffffff809152c1 in __mtx_lock_sleep (c=0xfffff80012a82348, tid=18446735277953190176, opts=0, file=0x0, line=0) at /usr/src/sys/kern/kern_mutex.c:433
#8 0xffffffff809c66a5 in vfs_busy (mp=0xfffff80012a82330, flags=0) at /usr/src/sys/kern/vfs_subr.c:411
#9 0xffffffff809bebae in lookup (ndp=0xfffffe0469de9948) at /usr/src/sys/kern/vfs_lookup.c:783
#10 0xffffffff809be154 in namei (ndp=0xfffffe0469de9948) at /usr/src/sys/kern/vfs_lookup.c:302
#11 0xffffffff809d244e in kern_statat_vnhook (td=0xfffff80014110920, flag=<value optimized out>, fd=-100, path=0x803c11850 <Address 0x803c11850 out of bounds>, pathseg=UIO_USERSPACE, sbp=0xfffffe0469de9a60, hook=0x100d08) at /usr/src/sys/kern/vfs_syscalls.c:2284
#12 0xffffffff809d22dd in sys_stat (td=0xfffff80014110920, uap=0xfffffe0469de9b80) at /usr/src/sys/kern/vfs_syscalls.c:2264
#13 0xffffffff80d30f31 in amd64_syscall (td=0xfffff80014110920, traced=0) at subr_syscall.c:134
#14 0xffffffff80d1614b in Xfast_syscall () at /usr/src/sys/amd64/amd64/exception.S:391
#15 0x0000000800f9239a in ?? ()
Previous frame inner to this frame (corrupt stack?)
Current language: auto; currently minimal
I put the GENERIC kernel back, and it panicked in about an hour.
FreeBSD name.replaced 10.1-RELEASE-p6 FreeBSD 10.1-RELEASE-p6 #0: Tue Feb 24 19:00:21 UTC 2015 root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC amd64
processor eflags =
panic: page fault
cpuid = 18
KDB: stack backtrace:
#0 0xffffffff80963000 at kdb_backtrace+0x60
#1 0xffffffff80928125 at panic+0x155
#2 0xffffffff80d24f0f at trap_fatal+0x38f
#3 0xffffffff80d25228 at trap_pfault+0x308
#4 0xffffffff80d2488a at trap+0x47a
#5 0xffffffff80d0a772 at calltrap+0x8
#6 0xffffffff809c12fe at vfs_ref+0x3e
#7 0xffffffff809b8d91 at vop_stdgetwritemount+0x21
#8 0xffffffff80e43f07 at VOP_GETWRITEMOUNT_APV+0xa7
#9 0xffffffff809d67db at vn_start_write+0x3b
#10 0xffffffff809d9550 at vn_write+0xb0
#11 0xffffffff809d9932 at vn_io_fault_doio+0x22
#12 0xffffffff809d750c at vn_io_fault1+0x7c
#13 0xffffffff809d5a0b at vn_io_fault+0x18b
#14 0xffffffff8097a437 at dofilewrite+0x87
#15 0xffffffff8097a168 at kern_writev+0x68
#16 0xffffffff8097a0f3 at sys_write+0x63
#17 0xffffffff80d25841 at amd64_syscall+0x351
Uptime: 1h18m16s
Dumping 796 out of 16347 MB:..3%..11%..21%..31%..41%..51%..61%..71%..81%..91%
Reading symbols from /boot/kernel/ipl.ko.symbols...done.
Loaded symbols for /boot/kernel/ipl.ko.symbols
Reading symbols from /boot/kernel/ums.ko.symbols...done.
Loaded symbols for /boot/kernel/ums.ko.symbols
Reading symbols from /boot/kernel/ipfw.ko.symbols...done.
Loaded symbols for /boot/kernel/ipfw.ko.symbols
#0 doadump (textdump=<value optimized out>) at pcpu.h:219
219 pcpu.h: No such file or directory.
in pcpu.h
(kgdb) #0 doadump (textdump=<value optimized out>) at pcpu.h:219
#1 0xffffffff80927da2 in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:452
#2 0xffffffff80928164 in panic (fmt=<value optimized out>) at /usr/src/sys/kern/kern_shutdown.c:759
#3 0xffffffff80d24f0f in trap_fatal (frame=<value optimized out>, eva=<value optimized out>) at /usr/src/sys/amd64/amd64/trap.c:865
#4 0xffffffff80d25228 in trap_pfault (frame=0xfffffe04695c1560, usermode=<value optimized out>) at /usr/src/sys/amd64/amd64/trap.c:676
#5 0xffffffff80d2488a in trap (frame=0xfffffe04695c1560) at /usr/src/sys/amd64/amd64/trap.c:440
#6 0xffffffff80d0a772 in calltrap () at /usr/src/sys/amd64/amd64/exception.S:232
#7 0xffffffff809149e1 in __mtx_lock_sleep (c=0xfffff80012c2c348, tid=18446735277931354400, opts=0, file=0x0, line=0) at /usr/src/sys/kern/kern_mutex.c:433
#8 0xffffffff809c12fe in vfs_ref (mp=0xfffff80012c2c330) at /usr/src/sys/kern/vfs_mount.c:433
#9 0xffffffff809b8d91 in vop_stdgetwritemount (ap=0xfffffe04695c1728) at /usr/src/sys/kern/vfs_default.c:591
#10 0xffffffff80e43f07 in VOP_GETWRITEMOUNT_APV (vop=<value optimized out>, a=<value optimized out>) at vnode_if.c:2341
#11 0xffffffff809d67db in vn_start_write (vp=0xfffff802bb5af1d8, mpp=0xfffffe04695c1788, flags=257) at vnode_if.h:963
#12 0xffffffff809d9550 in vn_write (fp=0xfffff801b3fe4230, uio=0xfffffe04695c1ab0, active_cred=0xfffff800128ce700, flags=0, td=0x0) at /usr/src/sys/kern/vfs_vnops.c:854
#13 0xffffffff809d9932 in vn_io_fault_doio (args=<value optimized out>, uio=0xfffff80012c3d920, td=0x0) at /usr/src/sys/kern/vfs_vnops.c:986
#14 0xffffffff809d750c in vn_io_fault1 () at /usr/src/sys/kern/vfs_vnops.c:1042
#15 0xffffffff809d5a0b in vn_io_fault (fp=0xfffff801b3fe4230, uio=0xfffffe04695c1ab0, active_cred=<value optimized out>, flags=0, td=0xfffff80012c3d920) at /usr/src/sys/kern/vfs_vnops.c:1147
#16 0xffffffff8097a437 in dofilewrite (td=0xfffff80012c3d920, fd=31, fp=0xfffff801b3fe4230, auio=0xfffffe04695c1ab0, offset=<value optimized out>, flags=0) at file.h:304
#17 0xffffffff8097a168 in kern_writev (td=0xfffff80012c3d920, fd=31, auio=0xfffffe04695c1ab0) at /usr/src/sys/kern/sys_generic.c:467
#18 0xffffffff8097a0f3 in sys_write (td=<value optimized out>, uap=<value optimized out>) at /usr/src/sys/kern/sys_generic.c:382
#19 0xffffffff80d25841 in amd64_syscall (td=0xfffff80012c3d920, traced=0) at subr_syscall.c:134
#20 0xffffffff80d0aa5b in Xfast_syscall () at /usr/src/sys/amd64/amd64/exception.S:391
#21 0x0000000800fa3bda in ?? ()
Previous frame inner to this frame (corrupt stack?)
Current language: auto; currently minimal
Can you provide a copy of /var/run/dmesg.boot? What type of hardware is that? Have you tried to run a fsck on the filesystems in single-user mode? What does mount -v show?
Created attachment 154795: dmesg.boot
The requested dmesg.boot
The hardware is a Dell R710. I have not tried fsck by hand. It has been doing background fsck after the crashes since I disabled the journal. I was under the impression that fsck was not really necessary with FFS and soft updates. I can do a fsck later today.
# mount -v
/dev/mfid0s1a on / (ufs, NFS exported, local, soft-updates, writes: sync 0 async 290424, reads: sync 14443 async 748, fsid a527ac54915e0c04)
devfs on /dev (devfs, local, multilabel, fsid 00ff007171000000)
procfs on /proc (procfs, local, fsid 01ff000202000000)
/dev/mfid0s1d on /home (ufs, local, soft-updates, writes: sync 38 async 148, reads: sync 80 async 35, fsid a627ac54c8732378)
/dev/mfid0s1e on /var (ufs, local, soft-updates, writes: sync 734986 async 1149177, reads: sync 8517 async 4503, fsid a627ac548e8fd108)
/dev/mfid0s1f on /opt (ufs, local, soft-updates, writes: sync 161 async 1649, reads: sync 574 async 68, fsid a627ac54c4f566ef)
/dev/mfid0s1g on /space (ufs, local, soft-updates, writes: sync 2033 async 17366, reads: sync 90145 async 30234, fsid 13bf2a50d8123d8a)
/dev/md0 on /var/tmp/vscan (ufs, local, soft-updates, writes: sync 393738 async 251122, reads: sync 18 async 0, fsid bdc90f5599002a1e)
From my experience, background fsck does not always work, so I always configure the following in /etc/rc.conf:
background_fsck="NO"
fsck_y_enable="YES"
I booted to single user mode and did an fsck. No errors were reported.
(In reply to Dan from comment #6) Please build a debug kernel. The following (somewhat spurious) options added to your kernel config should do the trick:
# Debugging support. Always need this:
options KDB # Enable kernel debugger support.
options KDB_TRACE # Print a stack trace for a panic.
# For full debugger support use (turn off in stable branch):
options DDB # Support DDB.
options GDB # Support remote GDB.
options DEADLKRES # Enable the deadlock resolver
options INVARIANTS # Enable calls of extra sanity checking
options INVARIANT_SUPPORT # Extra sanity checks of internal structures, required by INVARIANTS
options WITNESS # Enable checks to detect deadlocks and cycles
options WITNESS_SKIPSPIN # Don't run witness on spinlocks for speed
options MALLOC_DEBUG_MAXZONES=8 # Separate malloc(9) zones
options DEBUG_VFS_LOCKS
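One possible way to apply the options listed above, as a sketch rather than a tested procedure: put them in a small config file that includes GENERIC, then build and install that kernel. The config name DEBUG and the CONF_DIR variable are made up for illustration. CONF_DIR defaults to a scratch directory so the file can be inspected before it is copied to /usr/src/sys/amd64/conf. Any option that GENERIC already sets on a given branch may need to be dropped from the list.

```shell
#!/bin/sh
# Sketch only: DEBUG and CONF_DIR are illustrative names, not from this report.
# CONF_DIR defaults to a scratch directory; on the real machine the file
# would live in /usr/src/sys/amd64/conf.
CONF_DIR="${CONF_DIR:-$(mktemp -d /tmp/kernconf.XXXXXX)}"

# Write a kernel config that inherits GENERIC and adds the debug options
# suggested in the comment above.
cat > "$CONF_DIR/DEBUG" <<'EOF'
include GENERIC
ident   DEBUG

options KDB                      # Enable kernel debugger support.
options KDB_TRACE                # Print a stack trace for a panic.
options DDB                      # Support DDB.
options GDB                      # Support remote GDB.
options DEADLKRES                # Enable the deadlock resolver
options INVARIANTS               # Enable calls of extra sanity checking
options INVARIANT_SUPPORT        # Required by INVARIANTS
options WITNESS                  # Detect deadlocks and lock order cycles
options WITNESS_SKIPSPIN         # Don't run witness on spinlocks for speed
options MALLOC_DEBUG_MAXZONES=8  # Separate malloc(9) zones
options DEBUG_VFS_LOCKS
EOF

echo "wrote $CONF_DIR/DEBUG"

# Build and install (left commented out so the sketch has no side effects):
#   cd /usr/src
#   make buildkernel KERNCONF=DEBUG
#   make installkernel KERNCONF=DEBUG
```

Using `include GENERIC` keeps the debug config short and makes it obvious which options differ from the stock kernel.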
I found one host with mfi RAID, but not with 10.1. What does mfiutil version show? What does mfiutil show firmware say?
Debug kernel up and running. Do you want the entire core.txt when it crashes again, or just the summary I provided before?
mfiutil version 1.0.15
mfi0 Firmware Package Version: 6.3.1-0003
mfi0 Firmware Images:
Name  Version            Date         Time      Status
APP   1.22.32-1371       Sep 12 2011  19:29:24  active
BIOS  2.04.00                                   active
BCON  1.1-46-e_15-Rel    Mar  2 2008  14:06:08  active
CTLR  1.02-015B          Jan 27 2009  12:02:58  active
PCLI  01.00-022:#%00005  Sep 18 2008  20:27:14  active
BTBL  1.00.00.01-0011    Nov 27 2007  18:29:20  active
The kernel panicked and dumped some output to the console. It is now sitting at the db> prompt. Are there commands you would like me to run?
Fatal trap 9: general protection fault while in kernel mode
cpuid = 13; apic id = 01
instruction pointer = 0x20:0xffffffff80d434b8
stack pointer = 0x28:0xfffffe00002645f0
frame pointer = 0x28:0xfffffe00002646d0
code segment = base rx0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 12 (irq256: bce0)
lock order reversal: (Giant after non-sleepable)
1st 0xffffffff81c577e8 ipf frag rwlock (ipf frag rwlock) @ /usr/src/sys/modules/ipfilter/../../contrib/ipfilter/netinet/ip_frag.c:691
2nd 0xffffffff81840f38 Giant (Giant) @ /usr/src/sys/dev/usb/input/ukbd.c:1984
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0000264070
kdb_backtrace() at kdb_backtrace+0x39/frame 0xfffffe0000264120
witness_checkorder() at witness_checkorder+0xdc2/frame 0xfffffe00002641b0
__mtx_lock_flags() at __mtx_lock_flags+0xa8/frame 0xfffffe0000264200
ukbd_poll() at ukbd_poll+0x2c/frame 0xfffffe0000264230
kbdmux_poll() at kbdmux_poll+0x5b/frame 0xfffffe0000264250
cngrab() at cngrab+0x35/frame 0xfffffe0000264270
kdb_trap() at kdb_trap+0x11f/frame 0xfffffe00002642c0
trap_fatal() at trap_fatal+0x34c/frame 0xfffffe0000264320
trap() at trap+0x818/frame 0xfffffe0000264530
calltrap() at calltrap+0x8/frame 0xfffffe0000264530
--- trap 0x9, rip = 0xffffffff80d434b8, rsp = 0xfffffe00002645f0, rbp = 0xfffffe00002646d0 ---
bcmp() at bcmp+0x8/frame 0xfffffe00002646d0
ipf_frag_known() at ipf_frag_known+0x51/frame 0xfffffe0000264700
ipf_check() at ipf_check+0x308/frame 0xfffffe0000264880
pfil_run_hooks() at pfil_run_hooks+0x96/frame 0xfffffe0000264910
ip_input() at ip_input+0x32c/frame 0xfffffe0000264960
netisr_dispatch_src() at netisr_dispatch_src+0x86/frame 0xfffffe00002649d0
ether_demux() at ether_demux+0x144/frame 0xfffffe0000264a00
ether_nh_input() at ether_nh_input+0x32a/frame 0xfffffe0000264a30
netisr_dispatch_src() at netisr_dispatch_src+0x86/frame 0xfffffe0000264aa0
bce_intr() at bce_intr+0x6f9/frame 0xfffffe0000264b30
intr_event_execute_handlers() at intr_event_execute_handlers+0x93/frame 0xfffffe0000264b70
ithread_loop() at ithread_loop+0xa6/frame 0xfffffe0000264bb0
fork_exit() at fork_exit+0x84/frame 0xfffffe0000264bf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0000264bf0
--- trap 0, rip = 0, rsp = 0xfffffe0000264cb0, rbp = 0 ---
[ thread pid 12 tid 100072 ]
Stopped at bcmp+0x8: repe cmpsq (%rsi),%es:(%rdi)
db>
@Dan can you retry with the module aio.ko loaded? Does it panic then?
It died many times in the ipf_frag_known function. I removed 'keep frags' from my ipfilter rules and it has not crashed in 7 days, which is a record since installing FreeBSD 10. I am going to let it run a few more days and then put the 'keep frags' back to see if it crashes. How does the aio.ko module fit in?
I have observed similar behaviour when running ctld, samba4, and nfs along with an ezjail running apache24, php5 and mariadb55-server. The filesystem is ZFS. When I run a script which 1) stops all these services and the ezjail, 2) makes ZFS snapshots and 3) restarts the services along with the ezjail, it throws a kernel panic and reboots. The culprit seems to be __mtx_lock_sleep, but I cannot find out which service is calling it. When the aio module is loaded the kernel never panics. Hardware is an X9SCM/E3-1220Lv2, 8GB ECC-RAM and an IBM ServerRAID M1015. Vanilla GENERIC kernel.
With "keep frags" removed from my ipfilter rules the machine ran for 14 days with no crash. I put the "keep frags" back and it crashed in 1 day 21 hours. It crashed in ipf_frag_known again. There appears to be a bug in the ipfilter code related to "keep frags." As a test, I am loading the aio kernel module as suggested to see if that makes a difference. I am leaving "keep frags" in for this test.
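The "keep frags" test described above amounts to toggling one option in the rule set. As a rough sketch, assuming the rules live in a single file, the option can be stripped with sed and the result reloaded with ipf. The sample rule, the scratch file names, and the choice of bce0 and port 25 are illustrative assumptions, not the reporter's actual configuration.

```shell
#!/bin/sh
# Sketch only: the rules below are a made-up example, not taken from this report.
WORK_DIR="$(mktemp -d /tmp/ipfrules.XXXXXX)"
RULES="$WORK_DIR/ipf.rules"
NOFRAGS="$WORK_DIR/ipf.rules.nofrags"

# A hypothetical rule set with one stateful rule that also tracks fragments.
cat > "$RULES" <<'EOF'
pass in quick on bce0 proto tcp from any to any port = 25 flags S keep state keep frags
block in all
EOF

# Drop every "keep frags" option while leaving "keep state" untouched.
sed 's/[[:space:]]*keep frags//g' "$RULES" > "$NOFRAGS"

cat "$NOFRAGS"

# To flush the active rules and load the edited file on the real system
# (left commented out so the sketch has no side effects):
#   ipf -Fa -f "$NOFRAGS"
```

Keeping both versions of the file side by side makes it easy to switch back and forth while checking whether the panic follows the option.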
I enabled loading of the aio module on reboot and rebooted. When the server came back up, one of my interfaces was not working. I tried to reboot the server again, and it hung during the shutdown and then panicked. The subsequent reboot froze up during configuration of one of the interfaces. I had to power cycle, and the same thing happened again. I had to go into single user mode and remove the aio module. After that the machine booted fine and all interfaces were working. I do not know what the aio module is supposed to do, but it does NOT work on my server at all.
Removing all frag rules kept the box stable for a while, but now it is crashing again. I am going to have to go to 9.3. 10.1 is not usable for a production system. Here is an excerpt from core.txt:
Fatal trap 12: page fault while in kernel mode
cpuid = 11; apic id = 35
fault virtual address = 0x18
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff80485761
stack pointer = 0x28:0xfffffe0467cbfa90
frame pointer = 0x28:0xfffffe0467cbfb20
code segment = base rx0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 12 (irq257: bce1)
trap number = 12
panic: page fault
cpuid = 11
KDB: stack backtrace:
#0 0xffffffff80963000 at kdb_backtrace+0x60
#1 0xffffffff80928125 at panic+0x155
#2 0xffffffff80d258df at trap_fatal+0x38f
#3 0xffffffff80d25bf8 at trap_pfault+0x308
#4 0xffffffff80d2525a at trap+0x47a
#5 0xffffffff80d0b142 at calltrap+0x8
#6 0xffffffff808faf4b at intr_event_execute_handlers+0xab
#7 0xffffffff808fb396 at ithread_loop+0x96
#8 0xffffffff808f8b6a at fork_exit+0x9a
#9 0xffffffff80d0b67e at fork_trampoline+0xe
Uptime: 3m3s
I just put another FreeBSD 10.1 system in production and it crashed in just over a day. Why isn't the FreeBSD community complaining more about this? Even under a moderate load, FreeBSD 10.1 crashes. I'm not doing anything fancy on this new box, it is just a DNS server. If this doesn't get fixed I will be forced by my company to switch to Linux.
Unread portion of the kernel message buffer:
panic: duplicate worklist: 0xfffff8000e1abe00
cpuid = 0
KDB: stack backtrace:
#0 0xffffffff80963010 at kdb_backtrace+0x60
#1 0xffffffff80928135 at panic+0x155
#2 0xffffffff80b61bb2 at softdep_disk_write_complete+0x1992
#3 0xffffffff809b2ba3 at bufdone_finish+0x33
#4 0xffffffff809b2a07 at bufdone+0x77
#5 0xffffffff808907c0 at g_io_deliver+0x240
#6 0xffffffff808907c0 at g_io_deliver+0x240
#7 0xffffffff8088e4bb at g_disk_done+0xfb
#8 0xffffffff80308a4c at dadone+0xd4c
#9 0xffffffff802e52a1 at xpt_done_process+0x5b1
#10 0xffffffff802e86a6 at xpt_done_td+0x126
#11 0xffffffff808f8b7a at fork_exit+0x9a
#12 0xffffffff80d0b6ce at fork_trampoline+0xe
Uptime: 1d8h21m50s
Dumping 134 out of 999 MB:..12%..24%..36%..48%..60%..72%..84%..96%
Reading symbols from /boot/modules/vmxnet3.ko...done.
Loaded symbols for /boot/modules/vmxnet3.ko
Reading symbols from /boot/kernel/ipl.ko.symbols...done.
Loaded symbols for /boot/kernel/ipl.ko.symbols
Reading symbols from /boot/kernel/ipfw.ko.symbols...done.
Loaded symbols for /boot/kernel/ipfw.ko.symbols
Reading symbols from /boot/modules/vmmemctl.ko...done.
Loaded symbols for /boot/modules/vmmemctl.ko
#0 doadump (textdump=<value optimized out>) at pcpu.h:219
219 pcpu.h: No such file or directory.
in pcpu.h
(kgdb) #0 doadump (textdump=<value optimized out>) at pcpu.h:219
#1 0xffffffff80927db2 in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:452
#2 0xffffffff80928174 in panic (fmt=<value optimized out>) at /usr/src/sys/kern/kern_shutdown.c:759
#3 0xffffffff80b61bb2 in softdep_disk_write_complete (bp=<value optimized out>) at /usr/src/sys/ufs/ffs/ffs_softdep.c:10974
#4 0xffffffff809b2ba3 in bufdone_finish (bp=0xfffffe003d27ca88) at buf.h:420
#5 0xffffffff809b2a07 in bufdone (bp=<value optimized out>) at /usr/src/sys/kern/vfs_bio.c:3762
#6 0xffffffff808907c0 in g_io_deliver (bp=0xfffff800314a92e8, error=<value optimized out>) at /usr/src/sys/geom/geom_io.c:669
#7 0xffffffff808907c0 in g_io_deliver (bp=0xfffff8000e6189b0, error=<value optimized out>) at /usr/src/sys/geom/geom_io.c:669
#8 0xffffffff8088e4bb in g_disk_done (bp=0xfffff8000dc2e4d8) at /usr/src/sys/geom/geom_disk.c:252
#9 0xffffffff80308a4c in dadone (periph=<value optimized out>, done_ccb=<value optimized out>) at /usr/src/sys/cam/scsi/scsi_da.c:3017
#10 0xffffffff802e52a1 in xpt_done_process (ccb_h=0xfffff80002d0c000) at /usr/src/sys/cam/cam_xpt.c:5249
#11 0xffffffff802e86a6 in xpt_done_td (arg=0xffffffff814b6c80) at /usr/src/sys/cam/cam_xpt.c:5276
#12 0xffffffff808f8b7a in fork_exit (callout=0xffffffff802e8580 <xpt_done_td>, arg=0xffffffff814b6c80, frame=0xfffffe004e66ec00) at /usr/src/sys/kern/kern_fork.c:996
#13 0xffffffff80d0b6ce in fork_trampoline () at /usr/src/sys/amd64/amd64/exception.S:606
#14 0x0000000000000000 in ?? ()
Current language: auto; currently minimal
(kgdb)
This was fixed by r316809.