When upgrading from 10.1-RELEASE to 10.2-RELEASE we have noticed processes randomly terminating with signal 11 (segmentation fault). This has been an ongoing issue and we cannot track it down to any particular bug. With -p7 covering FreeBSD-EN-15:20.vm, we thought this might be the issue we were seeing; however, -p7 did not fix the problem.

Our environment is several FreeBSD 10.2 amd64 VMs running a multi-mail server with Exim and Dovecot. The hypervisor is VMware ESXi 5.1. Several times a day (or sometimes not at all) Exim and/or Dovecot will exit on signal 11 with similar backtraces, usually from a function in libc or libthr. A couple of example backtraces are below. Occasionally (although rarely) we've seen other processes crash with similar "cannot access memory" seg faults. We've not yet seen this since -p7, though we are monitoring closely.

Exim backtrace:

#0  0x000000080119e4b6 in pthread_suspend_all_np () from /lib/libthr.so.3
[New Thread 803006400 (LWP 100098/<unknown>)]
(gdb) bt
#0  0x000000080119e4b6 in pthread_suspend_all_np () from /lib/libthr.so.3
#1  0x00000008011a126a in pthread_getspecific () from /lib/libthr.so.3
#2  0x00000008011a5c96 in __pthread_cxa_finalize () from /lib/libthr.so.3
#3  0x0000000000423536 in daemon_go ()
#4  0x0000000000438ee9 in main ()

Dovecot backtrace:

#0  0x000000080061b6bc in _rtld_is_dlopened () from /libexec/ld-elf.so.1
(gdb) bt
#0  0x000000080061b6bc in _rtld_is_dlopened () from /libexec/ld-elf.so.1
#1  0x000000080061b2ab in _rtld_is_dlopened () from /libexec/ld-elf.so.1
#2  0x0000000800614c8d in _r_debug_postinit () from /libexec/ld-elf.so.1
#3  0x000000080061246d in .text () from /libexec/ld-elf.so.1
#4  0x000000000040abf8 in service_process_create ()
#5  0x000000000040a38a in services_monitor_reap_children ()
#6  0x00000008008c2bd3 in io_loop_call_io () from /usr/local/lib/dovecot/libdovecot.so.0
#7  0x00000008008c46ef in io_loop_handler_run_internal () from /usr/local/lib/dovecot/libdovecot.so.0
#8  0x00000008008c30d4 in io_loop_handler_run () from /usr/local/lib/dovecot/libdovecot.so.0
#9  0x00000008008c2eb8 in io_loop_run () from /usr/local/lib/dovecot/libdovecot.so.0
#10 0x000000080085f1d8 in master_service_run () from /usr/local/lib/dovecot/libdovecot.so.0
#11 0x0000000000406512 in main ()

After checking with the Exim and Dovecot communities, all indications are that this is library-based in some way. We have tried fresh installs of the base OS/VMs from a fresh SVN checkout and a complete world rebuild; however, the problem still persists.

Sysctl vars:

kern.corefile=/var/tmp/%N.core
kern.timecounter.hardware=ACPI-fast
#
kern.ipc.maxsockbuf=2097152
kern.ipc.somaxconn=2048
kern.maxfiles=65536
kern.maxfilesperproc=16384
net.inet.tcp.sendspace=131072
net.inet.tcp.recvspace=131072
net.inet.udp.recvspace=131072
net.inet.udp.maxdgram=16384
#
net.inet.tcp.msl=7500
net.inet.tcp.fast_finwait2_recycle=1
net.inet.icmp.log_redirect=0
net.inet.icmp.drop_redirect=1
net.inet.tcp.delayed_ack=0
net.inet.ip.redirect=0
net.inet6.ip6.redirect=0
net.link.ether.inet.log_arp_wrong_iface=0
kern.sugid_coredump=1
#
net.inet.tcp.keepidle=60000
net.inet.tcp.keepintvl=10000
Recent Exim sig 11 with debug symbols:

#0  0x00000008011f84b6 in pthread_suspend_all_np () from /lib/libthr.so.3
[New Thread 803006400 (LWP 100090/<unknown>)]
(gdb) bt
#0  0x00000008011f84b6 in pthread_suspend_all_np () from /lib/libthr.so.3
#1  0x00000008011fb26a in pthread_getspecific () from /lib/libthr.so.3
#2  0x00000008011ffc96 in __pthread_cxa_finalize () from /lib/libthr.so.3
#3  0x000000000042879c in handle_smtp_call (listen_sockets=0x803065140, listen_socket_count=4, accept_socket=7, accepted=0x7fffffffcbb0) at daemon.c:355
#4  0x000000000042777b in daemon_go () at daemon.c:2040
#5  0x0000000000449d13 in main (argc=3, cargv=0x7fffffffebb8) at exim.c:4719
Still seeing "cannot access memory" seg-faults on -p7 FreeBSD 10.2-RELEASE-p7 FreeBSD 10.2-RELEASE-p7 #0 r290449M amd64 Core was generated by `exim'. Program terminated with signal 11, Segmentation fault. Cannot access memory at address 0x800978ac0 #0 0x00000008023a3937 in ?? () (gdb) bt #0 0x00000008023a3937 in ?? () Cannot access memory at address 0x7fffffffc310
All similar crashes, still. Keep in mind, these didn't appear at all in 10.1 (at least we've never seen them); this started after upgrading to RELENG_10_2.

Core was generated by `dovecot'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /usr/local/lib/dovecot/libdovecot.so.0...done.
Loaded symbols for /usr/local/lib/dovecot/libdovecot.so.0
Reading symbols from /lib/libc.so.7...done.
Loaded symbols for /lib/libc.so.7
Reading symbols from /libexec/ld-elf.so.1...done.
Loaded symbols for /libexec/ld-elf.so.1
#0  0x000000000040db60 in service_process_create (service=Cannot access memory at address 0x7fffffffb0b0
) at service-process.c:311
311     service-process.c: No such file or directory.
        in service-process.c
(gdb) bt
#0  0x000000000040db60 in service_process_create (service=Cannot access memory at address 0x7fffffffb0b0
) at service-process.c:311
Cannot access memory at address 0x7fffffffb0c8
Current language:  auto; currently minimal
(gdb)
Still seeing programs (various) randomly crash with "cannot access memory" type core dumps. As previously mentioned, we never experienced this on 10.1-RELEASE; only after upgrading the VMs to 10.2 did we start noticing it.

This GDB was configured as "amd64-marcel-freebsd"...
Core was generated by `exim'.
Program terminated with signal 11, Segmentation fault.
#0  0x00000000004b9f6b in store_reset_3 (ptr=Cannot access memory at address 0x7fffffffc838
) at store.c:360
360     store.c: No such file or directory.
        in store.c
(gdb)
Can you provide some more information from one of your cores? Specifically, can you provide the output of 'x/i $rip' and 'info registers' as a start? The gdb messages aren't really telling us why it crashed.
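For reference, a minimal gdb session to collect that (the binary and core paths here are illustrative, not from your setup):

# substitute your actual binary and core file
gdb /usr/local/sbin/exim /var/tmp/exim.core
(gdb) bt                # backtrace at the point of the fault
(gdb) x/i $rip          # the instruction that faulted
(gdb) info registers    # register state at the time of the crash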
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...
Core was generated by `dovecot'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /usr/local/lib/dovecot/libdovecot.so.0...done.
Loaded symbols for /usr/local/lib/dovecot/libdovecot.so.0
Reading symbols from /lib/libc.so.7...done.
Loaded symbols for /lib/libc.so.7
Reading symbols from /libexec/ld-elf.so.1...done.
Loaded symbols for /libexec/ld-elf.so.1
#0  0x000000000040db76 in service_process_create (service=Cannot access memory at address 0x7fffffffb0c0
) at service-process.c:313
313     service-process.c: No such file or directory.
        in service-process.c
(gdb) bt full
#0  0x000000000040db76 in service_process_create (service=Cannot access memory at address 0x7fffffffb0c0
) at service-process.c:313
        uid_counter = 73470
        process = (struct service_process *) Cannot access memory at address 0x7fffffffb0b8
Current language:  auto; currently minimal
(gdb) x/i $rip
0x40db76 <service_process_create+310>:  mov    %eax,0x20(%rcx)
(gdb) info registers
rax            0x11efe          73470
rbx            0x0              0
rcx            0x8014b1040      34381434944
rdx            0x0              0
rsi            0x40d5e0         4249056
rdi            0x0              0
rbp            0x7fffffffb0d0   0x7fffffffb0d0
rsp            0x7fffffffb090   0x7fffffffb090
r8             0x2              2
r9             0x3              3
r10            0x2710           10000
r11            0x202            514
r12            0x7fffffffb328   140737488335656
r13            0x7fffffffb350   140737488335696
r14            0x7fffffffb330   140737488335664
r15            0x3              3
rip            0x40db76         0x40db76 <service_process_create+310>
eflags         0x10202          66050
cs             0x43             67
ss             0x3b             59
ds             0x0              0
es             0x0              0
fs             0x0              0
gs             0x0              0
(gdb)
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...
Core was generated by `dovecot'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /usr/local/lib/dovecot/libdovecot.so.0...done.
Loaded symbols for /usr/local/lib/dovecot/libdovecot.so.0
Reading symbols from /lib/libc.so.7...done.
Loaded symbols for /lib/libc.so.7
Reading symbols from /libexec/ld-elf.so.1...done.
Loaded symbols for /libexec/ld-elf.so.1
#0  0x00000008006226bc in _rtld_is_dlopened () from /libexec/ld-elf.so.1
(gdb) x/i $rip
0x8006226bc <_rtld_is_dlopened+15468>:  mov    %edi,0x213236(%rip)        # 0x8008358f8 <_rtld_atfork_post+2174936>
(gdb) info registers
rax            0x0              0
rbx            0x800835b50      34368346960
rcx            0x8014b1310      34381435664
rdx            0x8014b13d0      34381435856
rsi            0x7fffffffaf08   140737488334600
rdi            0x1              1
rbp            0x7fffffffaed0   0x7fffffffaed0
rsp            0x7fffffffaed0   0x7fffffffaed0
r8             0x103            259
r9             0x8080808080808080       -9187201950435737472
r10            0x1              1
r11            0x801406e30      34380738096
r12            0x7fffffffb328   140737488335656
r13            0x7fffffffb350   140737488335696
r14            0x7fffffffaf08   140737488334600
r15            0x800834850      34368342096
rip            0x8006226bc      0x8006226bc <_rtld_is_dlopened+15468>
eflags         0x10202          66050
cs             0x43             67
ss             0x3b             59
ds             0x0              0
es             0x0              0
fs             0x0              0
gs             0x0              0
(gdb)
If it makes any difference at all, these are "diskless" booted VM servers with a read-only mounted NFS root. The only sysctl knobs tuned are below. This is still happening even after a fresh build of 10.2-p10.

kern.corefile=/var/spool/tmp/%N.core
kern.timecounter.hardware=ACPI-fast
kern.ipc.maxsockbuf=2097152
kern.ipc.somaxconn=2048
kern.maxfiles=65536
kern.maxfilesperproc=16384
net.inet.tcp.sendspace=131072
net.inet.tcp.recvspace=131072
net.inet.udp.recvspace=131072
net.inet.udp.maxdgram=16384
net.inet.tcp.msl=7500
net.inet.tcp.fast_finwait2_recycle=1
net.inet.icmp.log_redirect=0
net.inet.icmp.drop_redirect=1
net.inet.tcp.delayed_ack=0
net.inet.ip.redirect=0
net.inet6.ip6.redirect=0
net.link.ether.inet.log_arp_wrong_iface=0
kern.sugid_coredump=1
net.inet.tcp.keepidle=60000
net.inet.tcp.keepintvl=10000
And a recent Exim crash under 10.2 as well.

Core was generated by `exim'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /lib/libcrypt.so.5...(no debugging symbols found)...done.
Loaded symbols for /lib/libcrypt.so.5
Reading symbols from /lib/libm.so.5...(no debugging symbols found)...done.
Loaded symbols for /lib/libm.so.5
Reading symbols from /lib/libutil.so.9...(no debugging symbols found)...done.
Loaded symbols for /lib/libutil.so.9
Reading symbols from /usr/local/lib/libspf2.so.2...(no debugging symbols found)...done.
Loaded symbols for /usr/local/lib/libspf2.so.2
Reading symbols from /lib/libthr.so.3...(no debugging symbols found)...done.
Loaded symbols for /lib/libthr.so.3
Reading symbols from /usr/local/lib/libpq.so.5...(no debugging symbols found)...done.
Loaded symbols for /usr/local/lib/libpq.so.5
Reading symbols from /usr/local/lib/perl5/5.22/mach/CORE/libperl.so.5.22...(no debugging symbols found)...done.
Loaded symbols for /usr/local/lib/perl5/5.22/mach/CORE/libperl.so.5.22
Reading symbols from /usr/lib/libssl.so.7...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/libssl.so.7
Reading symbols from /lib/libcrypto.so.7...(no debugging symbols found)...done.
Loaded symbols for /lib/libcrypto.so.7
Reading symbols from /usr/local/lib/libpcre.so.1...(no debugging symbols found)...done.
Loaded symbols for /usr/local/lib/libpcre.so.1
Reading symbols from /lib/libc.so.7...(no debugging symbols found)...done.
Loaded symbols for /lib/libc.so.7
Reading symbols from /usr/local/lib/libintl.so.8...(no debugging symbols found)...done.
Loaded symbols for /usr/local/lib/libintl.so.8
Reading symbols from /libexec/ld-elf.so.1...(no debugging symbols found)...done.
Loaded symbols for /libexec/ld-elf.so.1
#0  0x000000080119e4b6 in pthread_suspend_all_np () from /lib/libthr.so.3
[New Thread 803006400 (LWP 100084/<unknown>)]
(gdb) bt full
#0  0x000000080119e4b6 in pthread_suspend_all_np () from /lib/libthr.so.3
No symbol table info available.
#1  0x00000008011a126a in pthread_getspecific () from /lib/libthr.so.3
No symbol table info available.
#2  0x00000008011a5c96 in __pthread_cxa_finalize () from /lib/libthr.so.3
No symbol table info available.
#3  0x0000000000423536 in daemon_go ()
No symbol table info available.
#4  0x0000000000438ee9 in main ()
No symbol table info available.
(gdb) x/i $rip
0x80119e4b6 <pthread_suspend_all_np+2502>:      lock cmpxchg %ecx,(%rdi)
(gdb) info registers
rax            0x1              1
rbx            0x0              0
rcx            0x0              0
rdx            0x1              1
rsi            0x8011a833c      34378253116
rdi            0x8013b4480      34380399744
rbp            0x7fffffffd440   0x7fffffffd440
rsp            0x7fffffffd440   0x7fffffffd440
r8             0x2              2
r9             0x3              3
r10            0x2710           10000
r11            0x246            582
r12            0x803006400      34410095616
r13            0x709b50         7379792
r14            0x8013b4460      34380399712
r15            0x0              0
rip            0x80119e4b6      0x80119e4b6 <pthread_suspend_all_np+2502>
eflags         0x10246          66118
cs             0x43             67
ss             0x3b             59
ds             0x0              0
es             0x0              0
fs             0x0              0
gs             0x0              0
(gdb)
So far I don't see anything unusual to suggest that these are anything other than crashes due to application bugs. Are these the exact same exim / dovecot binaries you were using on 10.1, or did you also recompile those against 10.2 (and/or upgrade to newer versions)?

One thing that might help with debugging these is building the user libraries with debug symbols. If you are building your own worlds, just add 'WITH_DEBUG_FILES=yes' to /etc/src.conf and build a new world. Note that this will install debug symbols for all binaries and libraries in the base system to /usr/lib/debug/.
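That is, a minimal /etc/src.conf with just that one knob (everything else left at defaults):

# /etc/src.conf
# install standalone debug files for base binaries/libs under /usr/lib/debug/
WITH_DEBUG_FILES=yes

Then rebuild and reinstall world as usual (make buildworld && make installworld).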
Since we're using a FreeBSD diskless boot and NFS root, I normally buildworld (and kernel) from sources and install into a new NFS root path, then boot the diskless servers with the new root path. Before booting, however, I also force-rebuild (portupgrade -af) everything against the new version of FreeBSD (kernel/userland, libs, etc.). So the binaries that are crashing *are* being rebuilt under FreeBSD 10.2.

We never experienced this in 10.1; in fact, we even rolled back and booted one of the diskless servers using an old NFS root path with 10.1 (same versions of Exim and Dovecot), and those have not crashed. Usually I will experience these types of seg faults on every VM server at least once a week. (Currently these servers are lightly loaded because of this issue.)

Normally I would have no problem believing this is an application bug, but the fact that it does not appear to happen on 10.1 and only on 10.2, coupled with the fact that I'm seeing similar crashes in two different applications that were otherwise running fine, is very suspect. The applications are compiled with debugging symbols, but the crash always happens in a system library. I've contacted both the Dovecot and Exim teams about these crashes and both agreed that it seemed to be some type of system library issue. I'm not ruling out that it may be something with our install, but it's a fairly straightforward FreeBSD diskless install. Running out of places to look.
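For the record, the upgrade workflow described above looks roughly like this (the NFS root path and KERNCONF here are illustrative, not our exact values):

cd /usr/src
make buildworld buildkernel KERNCONF=GENERIC
make installkernel KERNCONF=GENERIC DESTDIR=/vol/bsdroot/v0g
make installworld DESTDIR=/vol/bsdroot/v0g
make distribution DESTDIR=/vol/bsdroot/v0g   # populate /etc in the fresh root
portupgrade -af                              # force-rebuild every port against the new base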
Adding WITH_DEBUG_FILES so we get debug symbols in the system libraries would still be useful. Is it possible to grab the binaries from a 10.1 VM and run them in a 10.2 VM? (I realize that might be a bit hacky, but if it's feasible it might help narrow down what change is causing the regression.)
I am building world with debug symbols now and will stage a new root and boot a few VMs into that, as well as make sure debug symbols are compiled into both Exim and Dovecot. Once that's done, hopefully we'll get a bit more out of the debugger.

As for the binaries, that might be possible with Exim, but Dovecot has a slew of associated libraries and other binary dependencies. Let's start with the debug symbols and see if they reveal more of what's going on.
(In reply to rblayzor from comment #13)

Don't forget you can run with a new kernel and the old system to help isolate it. Also remember that you can replace libraries one at a time or in groups using /etc/libmap.conf (man libmap.conf), so for example you can make the whole system use the new libraries but have just one application use the old ones. You can even make just one instance of the app use different libraries by starting it from a different binary or giving a different path to it; see the sketch below.
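A minimal libmap.conf sketch of that idea (the old-library path is purely illustrative; point it wherever you stage the 10.1 libs):

# /etc/libmap.conf
# Make only this one executable resolve against the old 10.1 libraries.
[/usr/local/sbin/exim]
libthr.so.3   /old/10.1/lib/libthr.so.3
libc.so.7     /old/10.1/lib/libc.so.7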
After installing a new world and booting to the new NFS root, I've seen one crash (it usually takes time). However, this one is a bit odd.

Kernel message:

Failed to write core file for process exim (error 14)
pid 58267 (exim), uid 26: exited on signal 11

I do get a core file, but this is all:

GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...
Core was generated by `exim'.
Program terminated with signal 11, Segmentation fault.
Cannot access memory at address 0x800978ac0
#0  0x00000008023a2990 in ?? ()
(gdb) bt full
#0  0x00000008023a2990 in ?? ()
No symbol table info available.
Cannot access memory at address 0x7fffffff4f70
(gdb) bt info
No symbol "info" in current context.
(gdb) x/i $rip
0x8023a2990:    Cannot access memory at address 0x8023a2990
(gdb) info registers
rax            0xffffffffffffffd0       -48
rbx            0x2010           8208
rcx            0x800788330      34367636272
rdx            0x0              0
rsi            0x3ff3           16371
rdi            0x8030033c1      34410083265
rbp            0x7fffffff5420   0x7fffffff5420
rsp            0x7fffffff4f70   0x7fffffff4f70
r8             0x7fffffff4670   140737488307824
r9             0x80269f6e0      34400237280
r10            0xfffffffffffff000       -4096
r11            0x8030033c0      34410083264
r12            0x7fffffffed40   140737488350528
r13            0x803229000      34412335104
r14            0x2010           8208
r15            0x3000           12288
rip            0x8023a2990      0x8023a2990
eflags         0x10206          66054
cs             0x43             67
ss             0x3b             59
ds             0x0              0
es             0x0              0
fs             0x0              0
gs             0x0              0
Hmm, the EFAULT error (14) is quite odd. I wonder if you tripped over the bug that is fixed in HEAD by r287442, r287537, and r288944. That bug was present in 10.1 though (and 10.0), not unique to 10.2. However, a corrupted core dump is why you can't do a full backtrace. :(
I'm not claiming that the bug I've reported here does not exist in 10.0 or 10.1; I'm just saying that we've seen this bug during our entire release cycle of 10.1 (which was almost a good year). We only started noticing this issue with process seg faults after updating to 10.2. We were running the same versions of Dovecot and Exim on 10.1 as when we started noticing it in 10.2. That's when we force-rebuilt all of the binaries under 10.2, to no avail.

Dovecot and Exim are NOT the only processes that we've seen core out; they are just the processes we see most often. I've seen cron do it, as well as ntpd and some other port binaries. I remember looking at their core files and seeing the same "cannot access memory" errors in the backtrace. It was only then that I started paying closer attention and began gathering more and more debug information on this strange issue.
(In reply to rblayzor from comment #17) Should read "I'm just saying that we've NEVER seen this bug during our entire release cycle of 10.1."
Oh, I should be clear that the revisions I mentioned only address the issue of core dumps being corrupt; they do not address the root cause of the seg faults you are seeing. However, having proper core dumps is needed to debug the root cause. You can try backporting the changes I mentioned previously to your 10.2 kernel (you can get by with just the first two for testing) to see if we can get more usable core dumps to use in tracking down the root cause. If you aren't used to backporting changes, I will try to do it in the next day or so (those fixes should be in 10.3 anyway). Once I've done the backport I can give you a patch to apply to your 10.2 kernel.
These broken cores do not happen all the time; I'm actually waiting for some crashes to report, and I'm pretty sure I should see a couple in the coming days. If I keep getting broken cores with no useful output, I have already checked out 10-STABLE and have it lined up to possibly build a new NFS root system (though that may take some time). I would rather wait for now, while I have the latest 10.2 running, to see if I can at least grab some useful debug output.
After a couple of days, like clockwork, I got another crash. I thought this one looked like a normal Dovecot bug and reported it to the list, to which Timo responded:

"There's no way it can crash here, unless something's badly broken. Either a buggy compiler or maybe memory corruption earlier in the process. Can you reproduce it? Can you run it via valgrind?"

Core was generated by `dovecot'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /usr/local/lib/dovecot/libdovecot.so.0...done.
Loaded symbols for /usr/local/lib/dovecot/libdovecot.so.0
Reading symbols from /lib/libc.so.7...Reading symbols from /usr/lib/debug//lib/libc.so.7.debug...done.
done.
Loaded symbols for /lib/libc.so.7
Reading symbols from /libexec/ld-elf.so.1...Reading symbols from /usr/lib/debug//libexec/ld-elf.so.1.debug...done.
done.
Loaded symbols for /libexec/ld-elf.so.1
#0  0x0000000800910c74 in io_loop_time_refresh () at ioloop.c:635
635     ioloop.c: No such file or directory.
        in ioloop.c
(gdb) bt full
#0  0x0000000800910c74 in io_loop_time_refresh () at ioloop.c:635
No locals.
#1  0x0000000800910961 in io_loop_timeouts_start_new (ioloop=0x801421080) at ioloop.c:436
        to_idx = (struct timeout * const *) 0x801419880
#2  0x0000000800910915 in io_loop_handler_run (ioloop=0x801421080) at ioloop.c:606
No locals.
#3  0x000000080091073f in io_loop_run (ioloop=0x801421080) at ioloop.c:583
No locals.
#4  0x000000080086aaab in master_service_run (service=0x80141b140, callback=0) at master-service.c:640
No locals.
#5  0x0000000000406509 in main (argc=3, argv=0x7fffffffb4c0) at main.c:888
        set = (struct master_settings *) 0x801429110
        error = 0x800caa2c3 "H\203Ä ]Ã\017\037\200"
        doveconf_arg = 0x0
        orig_info_callback = (failure_callback_t *) 0x8008f0220 <i_syslog_error_handler>
        orig_debug_callback = (failure_callback_t *) 0x8008f0220 <i_syslog_error_handler>
        foreground = false
        ask_key_pass = false
        i = 3
        c = -1
        doubleopts = <value optimized out>
Current language:  auto; currently minimal
(gdb)
When I see processes core dump along with the log message 'Failed to write core file for process XXX (error 14)', that is generally where I will see an incomplete core file with a "cannot access memory". So that may be related to the bug you mentioned. I do get occasional good core dumps, as noted above.

At this time I cannot run stable to patch the core dump issue. I did, however, install clang 3.6 and have rebuilt the ports we are seeing problems with against it, just to rule the compiler out (though that is a very unlikely cause). I will continue to gather more dumps if they produce valid cores. If there is a patch I can rebuild against that fixes the "error 14" issue, I'll give it a whirl.
(In reply to rblayzor from comment #22)

This smells like a pager error. You mentioned somewhere that the images for the programs which failed are located on an NFS volume? What is in dmesg? I believe that the pager somehow fails to page in, and this causes both the access errors in the live processes (SIGSEGV) and the EFAULT when the kernel reads process mappings to dump them into core files.
(In reply to Konstantin Belousov from comment #23)

These VMs boot diskless and mount their root file system RO:

Trying to mount root from nfs: []...
NFS ROOT: 10.0.0.211:/vol/bsdroot/v0f

Where "v0f" is nothing more than a subdirectory for a new rootfs; "bsdroot" is just a NetApp volume mounted RO from these hosts.

There is a VMDK (virtual disk) mounted at boot time: a thin-provisioned disk 10GB in size, with a 2GB swap partition and an 8GB scratch disk for things like /var or /var/tmp.

da0 at mpt0 bus 0 scbus0 target 0 lun 0
da0: <VMware Virtual disk 1.0> Fixed Direct Access SCSI-2 device
bootpc_init: wired to interface 'vmx0'
da0: 320.000MB/s transfers (160.000MHz, offset 127, 16bit)
da0: Command Queueing enabled
da0: 10240MB (20971520 512 byte sectors: 255H 63S/T 1305C)
da0: quirks=0x40<RETRY_BUSY>

The port binaries are built on another FreeBSD 10.2 platform (same VM config) and then we rsync /usr/local to be part of the rootfs. We always keep the port builds and the base system fresh and in sync. Typically when we want to roll out a new version of FreeBSD we'll upgrade the build VM first, then create a new subdirectory for the rootfs and install there, followed by a complete port rebuild and resync. This has always worked for us since FreeBSD 8.x just fine. Only recently have we seen these random, strange, unexplainable core dumps.
(In reply to rblayzor from comment #24) Of course, I mean dmesg after the mysterious seg fault.
(In reply to Konstantin Belousov from comment #25)

I think there may be two different issues going on here. My original bug report should have been more specific: applications/processes that apparently ran fine prior to upgrading to 10.2-RELEASE suddenly seem to be core dumping for some reason. When these processes core dump, sometimes they leave good core files and sometimes (perhaps most of the time) they do not.

I'm seeing something in common with the bug John brought up. Usually when I see something like:

Failed to write core file for process exim (error 14)
pid 8298 (exim), uid 26: exited on signal 11

the core file is incomplete, and some of the last messages I see in the core file are "cannot access memory", or something along those lines. We'll chalk those up to the bug he mentioned with the system not dumping these core files. I am telling the system to drop all cores to a specific directory that has plenty of space and is r/w.

The main problem, however, is one I have to collect more core dumps on (if they can drop core successfully). I posted one the other day that Timo Sirainen said should not be possible, given the output from the backtrace. I'm still waiting for more seg faults and will post those backtraces when they are available.
Just had one of the VMs sig11 on Dovecot last night. This one is giving me the "cannot access memory" error...

This GDB was configured as "amd64-marcel-freebsd"...
Core was generated by `dovecot'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /usr/local/lib/dovecot/libdovecot.so.0...done.
Loaded symbols for /usr/local/lib/dovecot/libdovecot.so.0
Reading symbols from /lib/libc.so.7...Reading symbols from /usr/lib/debug//lib/libc.so.7.debug...done.
done.
Loaded symbols for /lib/libc.so.7
Reading symbols from /libexec/ld-elf.so.1...Reading symbols from /usr/lib/debug//libexec/ld-elf.so.1.debug...done.
done.
Loaded symbols for /libexec/ld-elf.so.1
#0  memcpy () at bcopy.S:65
65      bcopy.S: No such file or directory.
        in bcopy.S
(gdb) bt full
#0  memcpy () at bcopy.S:65
No locals.
#1  0x0000000800915484 in sig_handler (signo=20, si=0x7fffffffa9a0, context=0x7fffffffa630) at lib-signals.c:132
        h = (struct signal_handler *) 0x8014198e0
        saved_errno = 14
        c = 0 '\0'
        tmp_si = {si_signo = 0, si_errno = 0, si_code = 0, si_pid = 0, si_uid = 0, si_status = 0, si_addr = 0x0, si_value = {sival_int = 0, sival_ptr = 0x0, sigval_int = 0, sigval_ptr = 0x0}, _reason = {_fault = {_trapno = 0}, _timer = {_timerid = 0, _overrun = 0}, _mesgq = {_mqd = 0}, _poll = {_band = 0}, __spare__ = {__spare1__ = 0, __spare2__ = 0x7fffffffa5c8}}}
#2  <signal handler called>
No symbol table info available.
#3  0x0000000000411fc8 in _fini ()
No symbol table info available.
#4  0x000000080061c9a8 in objlist_call_fini () at /.amd_mnt/juno/pub/FreeBSD/10.2-RELEASE/libexec/rtld-elf/rtld.c:2359
        list = (Objlist *) Cannot access memory at address 0x0
Current language:  auto; currently asm
(gdb) x/i $rip
0x800cd1a76 <memcpy+22>:        rep movsq %ds:(%rsi),%es:(%rdi)
(gdb) info registers
rax            0x800b893f0      34371834864
rbx            0x7fffffffaef8   140737488334584
rcx            0xa              10
rdx            0x50             80
rsi            0x7fffffffa9a0   140737488333216
rdi            0x800b893f0      34371834864
rbp            0x7fffffffa610   0x7fffffffa610
rsp            0x7fffffffa578   0x7fffffffa578
r8             0xffff800800b8ea50       -140703116498352
r9             0xfffff8000a811000       -8795916791808
r10            0x800636200      34366251520
r11            0x246            582
r12            0x800636200      34366251520
r13            0x800834830      34368342064
r14            0x7fffffffaef8   140737488334584
r15            0x0              0
rip            0x800cd1a76      0x800cd1a76 <memcpy+22>
eflags         0x10206          66054
cs             0x43             67
ss             0x3b             59
ds             0x0              0
es             0x0              0
fs             0x0              0
gs             0x0              0
(gdb)
On that last crash, upon starting Dovecot this message appeared. It's the first time we've seen it, but maybe it's due to all the debug symbols and misc debugging enabled:

Starting dovecot.
Last died with error (see error log for more information): gettimeofday(): Bad address
That would imply gettimeofday() is failing with EFAULT, which is a really odd error: it should only fail if the address passed to gettimeofday() is bad. Maybe try setting kern.timecounter.fast_gettime=0 as a test? Some of your faults have addresses in the shared page that holds the fast timecounter stuff. It's a bit of a hail mary, but if it works then it narrows down what we need to look at to find the real bug.
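That is, something like this on one of the affected VMs (and in /etc/sysctl.conf if you want it to survive a reboot):

sysctl kern.timecounter.fast_gettime=0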
(In reply to John Baldwin from comment #29)

Fast gettimeofday cannot fail with EFAULT; there is no such return path. An access to the wrong memory from the fast gettimeofday code would result in a real fault.
Probably not entirely helpful. This appears to be an unsuccessful core file again:

Failed to write core file for process dovecot (error 14)
pid 718 (dovecot), uid 0: exited on signal 11

#0  service_process_create (service=Cannot access memory at address 0x7fffffffb250
) at service-process.c:311
        uid_counter = 232627
        process = (struct service_process *) Cannot access memory at address 0x7fffffffb248
Current language:  auto; currently minimal
(gdb) x/i $rip
0x40dd08 <service_process_create+296>:  mov    %eax,-0x2c(%rbp)
(gdb) info registers
rax            0x4a78           19064
rbx            0x0              0
rcx            0x800c6625a      34372739674
rdx            0x0              0
rsi            0x40d760         4249440
rdi            0x0              0
rbp            0x7fffffffb260   0x7fffffffb260
rsp            0x7fffffffb220   0x7fffffffb220
r8             0x2              2
r9             0x1              1
r10            0x2710           10000
r11            0x202            514
r12            0x7fffffffb4b8   140737488336056
r13            0x7fffffffb4e0   140737488336096
r14            0x7fffffffb4c0   140737488336064
r15            0x3              3
rip            0x40dd08         0x40dd08 <service_process_create+296>
eflags         0x10202          66050
cs             0x43             67
ss             0x3b             59
ds             0x0              0
es             0x0              0
fs             0x0              0
gs             0x0              0
(gdb)
Can you get 'procstat -v /path/to/foo.core' output from that last core?
PID START END PRT RES PRES REF SHD FL TP PATH
718 0x400000 0x417000 r-x 19 19 1 0 CN-- vn
718 0x617000 0x618000 rw- 1 1 1 0 CN-- df
718 0x800617000 0x800635000 r-x 30 30 34 0 CN-- vn /libexec/ld-elf.so.1
718 0x800635000 0x800663000 rw- 36 36 1 0 CN-- df
718 0x800834000 0x800836000 rw- 2 2 1 0 CN-- df
718 0x800836000 0x800983000 r-x 216 222 38 19 CN-- vn /usr/local/lib/dovecot/libdovecot.so.0.0.0
718 0x800983000 0x800b82000 --- 0 0 1 0 CN-- df
718 0x800b82000 0x800b89000 rw- 7 0 1 0 CN-- vn /usr/local/lib/dovecot/libdovecot.so.0.0.0
718 0x800b89000 0x800b8b000 rw- 2 2 1 0 CN-- df
718 0x800b8b000 0x800d02000 r-x 375 386 67 33 CN-- vn /lib/libc.so.7
718 0x800d02000 0x800f01000 --- 0 0 1 0 CN-- df
718 0x800f01000 0x800f0d000 rw- 12 0 1 0 CN-- vn /lib/libc.so.7
718 0x800f0d000 0x800f37000 rw- 13 13 1 0 CN-- df
718 0x801000000 0x801800000 rw- 94 94 1 0 CN-- df
718 0x7ffffffdf000 0x7ffffffff000 rw- 8 0 1 0 C--D df
718 0x7ffffffff000 0x800000000000 r-x 1 1 36 0 ---- ph
https://svnweb.freebsd.org/changeset/base/295454 is the backport of the coredump fixes I mentioned previously to stable/10. I haven't tried to apply those to 10.2, but I think there's a decent chance they will apply cleanly.
I was not able to roll out a 10-STABLE NFS root due to some other issues restricting that. If there is a patch file against 10.2, I can certainly try that.
The patch from the commit applies cleanly to a 10.2 checkout:

> pwd
/usr/home/john/work/freebsd/releng/10.2
> svn merge -c 295454 ^/stable/10
--- Merging r295454 into '.':
U    lib/libprocstat/libprocstat.c
U    lib/libutil/kinfo_getfile.c
U    lib/libutil/kinfo_getvmmap.c
U    share/man/man5/core.5
U    sys/kern/imgact_elf.c
U    sys/kern/kern_descrip.c
U    sys/kern/kern_exec.c
U    sys/kern/kern_proc.c
U    sys/sys/exec.h
U    sys/sys/user.h
 G   .
--- Recording mergeinfo for merge of r295454 into '.':
 G   .

It might be simplest to apply it to your tree using something like:

% cd /path/to/10.2/tree
% svn diff -c 295454 https://svn.freebsd.org/stable/10 > foo.patch
% svn patch foo.patch
Is building the kernel enough for this patch or does it require a complete buildworld?
You can get by with a new kernel only. If the bug this protects against occurs, procstat from an old world might not be able to read file (-f) or VM info (-v) from a core dump correctly, but gdb, etc. should work fine.
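So a kernel-only rebuild along these lines should do (KERNCONF here is an assumption; use whatever config you already build):

cd /usr/src
make buildkernel KERNCONF=GENERIC
make installkernel KERNCONF=GENERIC   # add DESTDIR=... when installing into an NFS root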
I did get the patch and it applied cleanly. When I booted the new kernel, I noticed I got the same error as previously, and a partial core file. I've decided to stop beating a dead horse on 10.2 and checked out 10.3-BETA1 (FreeBSD 10.3-BETA1 #1 r295491M). I've built the VMs and a new NFS root with this; however, my binaries/ports are still built against 10.2 (not sure if that's a problem). If I still see the problem happen on 10.3, I'll grab what I can and then attempt to rebuild the ports against 10.3.
Just experienced the same issue on 10.3-BETA1 (FreeBSD 10.3-BETA1 #1 r295491M):

Failed to write core file for process dovecot (error 14)
pid 716 (dovecot), uid 0: exited on signal 11

ls -l /var/spool/tmp/dovecot.core
-rw-------  1 root  wheel  8925184 Feb 11 19:47 /var/spool/tmp/dovecot.core

#0  service_process_create (service=Cannot access memory at address 0x7fffffffb250
) at service-process.c:311
311     service-process.c: No such file or directory.
        in service-process.c
(gdb) bt full
#0  service_process_create (service=Cannot access memory at address 0x7fffffffb250
) at service-process.c:311
        uid_counter = 7828
        process = (struct service_process *) Cannot access memory at address 0x7fffffffb248
Current language:  auto; currently minimal
(gdb) x/i $rip
0x40dd08 <service_process_create+296>:  mov    %eax,-0x2c(%rbp)
(gdb) info registers
rax            0x4a41           19009
rbx            0x0              0
rcx            0x800c66f0a      34372742922
rdx            0x0              0
rsi            0x40d760         4249440
rdi            0x0              0
rbp            0x7fffffffb260   0x7fffffffb260
rsp            0x7fffffffb220   0x7fffffffb220
r8             0x2              2
r9             0x1              1
r10            0x2710           10000
r11            0x202            514
r12            0x7fffffffb4b8   140737488336056
r13            0x7fffffffb4e0   140737488336096
r14            0x7fffffffb4c0   140737488336064
r15            0x3              3
rip            0x40dd08         0x40dd08 <service_process_create+296>
eflags         0x10202          66050
cs             0x43             67
ss             0x3b             59
ds             0x0              0
es             0x0              0
fs             0x0              0
gs             0x0              0

procstat -v /var/spool/tmp/dovecot.core
PID START END PRT RES PRES REF SHD FL TP PATH
716 0x400000 0x417000 r-x 22 23 1 0 CN-- vn
716 0x617000 0x618000 rw- 1 1 1 0 CN-- df
716 0x800617000 0x800635000 r-x 30 30 34 0 CN-- vn /libexec/ld-elf.so.1
716 0x800635000 0x800663000 rw- 36 36 1 0 CN-- df
716 0x800834000 0x800836000 rw- 2 2 1 0 CN-- df
716 0x800836000 0x800983000 r-x 226 232 38 19 CN-- vn /usr/local/lib/dovecot/libdovecot.so.0.0.0
716 0x800983000 0x800b82000 --- 0 0 1 0 CN-- df
716 0x800b82000 0x800b89000 rw- 7 0 1 0 CN-- vn /usr/local/lib/dovecot/libdovecot.so.0.0.0
716 0x800b89000 0x800b8b000 rw- 2 2 1 0 CN-- df
716 0x800b8b000 0x800d03000 r-x 376 387 66 32 CN-- vn /lib/libc.so.7
716 0x800d03000 0x800f02000 --- 0 0 1 0 CN-- df
716 0x800f02000 0x800f0e000 rw- 12 0 1 0 CN-- vn /lib/libc.so.7
716 0x800f0e000 0x800f38000 rw- 13 13 1 0 CN-- df
716 0x801000000 0x801800000 rw- 94 94 1 0 CN-- df
716 0x7ffffffdf000 0x7ffffffff000 rw- 8 0 1 0 C--D df
716 0x7ffffffff000 0x800000000000 r-x 1 1 36 0 ---- ph
And here is a recent core (one that actually dumped in full) for exim:

#0  _thr_signal_postfork_child () at /.amd_mnt/juno/pub/FreeBSD/10-STABLE/lib/libthr/thread/thr_sig.c:508
508     /.amd_mnt/juno/pub/FreeBSD/10-STABLE/lib/libthr/thread/thr_sig.c: No such file or directory.
        in /.amd_mnt/juno/pub/FreeBSD/10-STABLE/lib/libthr/thread/thr_sig.c
[New Thread 803006400 (LWP 100045/<unknown>)]
(gdb) bt full
#0  _thr_signal_postfork_child () at /.amd_mnt/juno/pub/FreeBSD/10-STABLE/lib/libthr/thread/thr_sig.c:508
No locals.
#1  0x000000080120706d in __thr_fork () at /.amd_mnt/juno/pub/FreeBSD/10-STABLE/lib/libthr/thread/thr_fork.c:203
        rtld_locks = {-4800, 32767, -4760, 32767, -4792, 32767, 3, 0}
        curthread = (struct pthread *) 0x803006400
        cancelsave = 0
        ret = 0
        errsave = 4
#2  0x0000000000428c54 in handle_smtp_call (listen_sockets=0x803065140, listen_socket_count=4, accept_socket=7, accepted=0x7fffffffcd38) at daemon.c:355
        pid = 32767
        interface_sockaddr = {v4 = {sin_len = 28 '\034', sin_family = 28 '\034', sin_port = 6400, sin_addr = {s_addr = 0}, sin_zero = 0x7fffffffc9a0 "&\aðX\001\020"}, v6 = {sin6_len = 28 '\034', sin6_family = 28 '\034', sin6_port = 6400, sin6_flowinfo = 0, sin6_addr = {__u6_addr = {__u6_addr8 = 0x7fffffffc9a0 "&\aðX\001\020", __u6_addr16 = 0x7fffffffc9a0, __u6_addr32 = 0x7fffffffc9a0}}, sin6_scope_id = 0}, v0 = {sa_len = 28 '\034', sa_family = 28 '\034', sa_data = 0x7fffffffc99a ""}}
        ifsize = 28
        dup_accept_socket = 9
        max_for_this_host = 10
        wfsize = 100
        wfptr = 22
        use_log_write_selector = 1980
        whofrom = (uschar *) 0x803065e30 "[2607:f058:110:2::f:0]"
        reset_point = (void *) 0x803065e00
#3  0x0000000000427c55 in daemon_go () at daemon.c:2040
        accept_socket = 7
        sk = 0
        max_socket = 6
        lcount = 0
        select_errno = 4
        select_failed = 0
        select_listen = {__fds_bits = 0x7fffffffcc40}
        accepted = {sin6_len = 28 '\034', sin6_family = 28 '\034', sin6_port = 24059, sin6_flowinfo = 0, sin6_addr = {__u6_addr = {__u6_addr8 = 0x7fffffffcd40 "&\aðX\001\020", __u6_addr16 = 0x7fffffffcd40, __u6_addr32 = 0x7fffffffcd40}}, sin6_scope_id = 0}
        len = 28
        pid = 56724
        pw = (struct passwd *) 0x802698328
        listen_sockets = (int *) 0x803065140
        listen_socket_count = 4
        addresses = (ip_address_item *) 0x803065040
        last_connection_time = 1455363261
#4  0x000000000044af25 in main (argc=3, cargv=0x7fffffffed48) at exim.c:4719
        rsopts = 0x761b00
        argv = (uschar **) 0x7fffffffed48
        arg_receive_timeout = -1
        arg_smtp_receive_timeout = -1
        arg_error_handling = 0
        filter_sfd = -1
        filter_ufd = -1
        group_count = 1
        i = 0
---Type <return> to continue, or q <return> to quit---
        rv = 0
        list_queue_option = 0
        msg_action = 0
        msg_action_arg = -1
        namelen = 20
        queue_only_reason = 0
        perl_start_option = 0
        recipients_arg = 3
        sender_address_domain = 0
        test_retry_arg = -1
        test_rewrite_arg = -1
        arg_queue_only = 0
        bi_option = 0
        checking = 0
        count_queue = 0
        expansion_test = 0
        extract_recipients = 0
        flag_G = 0
        flag_n = 0
        forced_delivery = 0
        f_end_dot = 0
        deliver_give_up = 0
        list_queue = 0
        list_options = 0
        local_queue_only = 7868416
        more = 1
        one_msg_action = 0
        queue_only_set = 0
        receiving_message = 0
        sender_ident_set = 0
        session_local_queue_only = -5288
        unprivileged = 0
        removed_privilege = 0
        usage_wanted = 0
        verify_address_mode = 0
        verify_as_sender = 0
        version_printed = 0
        alias_arg = (uschar *) 0x0
        called_as = (uschar *) 0x554616 ""
        cmdline_syslog_name = (uschar *) 0x0
        start_queue_run_id = (uschar *) 0x0
        stop_queue_run_id = (uschar *) 0x0
        expansion_test_message = (uschar *) 0x0
        ftest_domain = (uschar *) 0x0
        ftest_localpart = (uschar *) 0x0
        ftest_prefix = (uschar *) 0x0
        ftest_suffix = (uschar *) 0x0
        log_oneline = (uschar *) 0x0
        malware_test_file = (uschar *) 0x0
---Type <return> to continue, or q <return> to quit---
        real_sender_address = (uschar *) 0x7fffffffeb58 " \005v"
        originator_home = (uschar *) 0x803065018 "/root"
        sz = 140737488350168
        reset_point = (void *) 0x80231a310
        pw = (struct passwd *) 0x802698328
        statbuf = {st_dev = 1895890688, st_ino = 5, st_mode = 8576, st_nlink = 1, st_uid = 0, st_gid = 0, st_rdev = 5, st_atim = {tv_sec = 1455201710, tv_nsec = 90063000}, st_mtim = {tv_sec = 1455201749, tv_nsec = 0}, st_ctim = {tv_sec = 1455201749, tv_nsec = 0}, st_size = 0, st_blocks = 0, st_blksize = 4096, st_flags = 0, st_gen = 0, st_lspare = 0, st_birthtim = {tv_sec = -1, tv_nsec = 0}}
        passed_qr_pid = 0
        passed_qr_pipe = -1
        group_list = 0x7fffffffd8f0
        info_flag = CMDINFO_NONE
        info_stdout = 0
Current language:  auto; currently minimal
Caught exim and dovecot full core dumps, same VM, a couple of hours apart...

#0  handle_smtp_call (listen_sockets=0x803065140, listen_socket_count=4, accept_socket=7, accepted=0x7fffffffcd38) at daemon.c:375
375     daemon.c: No such file or directory.
        in daemon.c
[New Thread 803006400 (LWP 100155/<unknown>)]
(gdb) bt full
#0  handle_smtp_call (listen_sockets=0x803065140, listen_socket_count=4, accept_socket=7, accepted=0x7fffffffcd38) at daemon.c:375
        i = 32767
        queue_only_reason = 0
        old_pool = 0
        save_debug_selector = 0
        local_queue_only = 32767
        session_local_queue_only = -4800
        act = {__sigaction_u = {__sa_handler = 0, __sa_sigaction = 0}, sa_flags = 0, sa_mask = {__bits = 0x7fffffffc91c}}
        pid = 0
        interface_sockaddr = {v4 = {sin_len = 28 '\034', sin_family = 28 '\034', sin_port = 6400, sin_addr = {s_addr = 0}, sin_zero = 0x7fffffffc9a0 "&\aðX\001\020"}, v6 = {sin6_len = 28 '\034', sin6_family = 28 '\034', sin6_port = 6400, sin6_flowinfo = 0, sin6_addr = {__u6_addr = {__u6_addr8 = 0x7fffffffc9a0 "&\aðX\001\020", __u6_addr16 = 0x7fffffffc9a0, __u6_addr32 = 0x7fffffffc9a0}}, sin6_scope_id = 0}, v0 = {sa_len = 28 '\034', sa_family = 28 '\034', sa_data = 0x7fffffffc99a ""}}
        ifsize = 28
        dup_accept_socket = 9
        max_for_this_host = 10
        wfsize = 100
        wfptr = 22
        use_log_write_selector = 1980
        whofrom = (uschar *) 0x803065e30 "[2607:f058:110:2::f:0]"
        reset_point = (void *) 0x803065e00
#1  0x0000000000427c55 in daemon_go () at daemon.c:2040
        accept_socket = 7
        sk = 0
        max_socket = 6
        lcount = 0
        select_errno = 4
        select_failed = 0
        select_listen = {__fds_bits = 0x7fffffffcc40}
        accepted = {sin6_len = 28 '\034', sin6_family = 28 '\034', sin6_port = 46531, sin6_flowinfo = 0, sin6_addr = {__u6_addr = {__u6_addr8 = 0x7fffffffcd40 "&\aðX\001\020", __u6_addr16 = 0x7fffffffcd40, __u6_addr32 = 0x7fffffffcd40}}, sin6_scope_id = 0}
        len = 28
        pid = 53159
        pw = (struct passwd *) 0x802698328
        listen_sockets = (int *) 0x803065140
        listen_socket_count = 4
        addresses = (ip_address_item *) 0x803065040
        last_connection_time = 1455463157
#2  0x000000000044af25 in main (argc=3, cargv=0x7fffffffed48) at exim.c:4719
        rsopts = 0x761b00
        argv = (uschar **) 0x7fffffffed48
        arg_receive_timeout = -1
        arg_smtp_receive_timeout = -1
        arg_error_handling = 0
---Type <return> to continue, or q <return> to quit---
        filter_sfd = -1
        filter_ufd = -1
        group_count = 1
        i = 0
        rv = 0
        list_queue_option = 0
        msg_action = 0
        msg_action_arg = -1
        namelen = 20
        queue_only_reason = 0
        perl_start_option = 0
        recipients_arg = 3
        sender_address_domain = 0
        test_retry_arg = -1
        test_rewrite_arg = -1
        arg_queue_only = 0
        bi_option = 0
        checking = 0
        count_queue = 0
        expansion_test = 0
        extract_recipients = 0
        flag_G = 0
        flag_n = 0
        forced_delivery = 0
        f_end_dot = 0
        deliver_give_up = 0
        list_queue = 0
        list_options = 0
        local_queue_only = 7868416
        more = 1
        one_msg_action = 0
        queue_only_set = 0
        receiving_message = 0
        sender_ident_set = 0
        session_local_queue_only = -5288
        unprivileged = 0
        removed_privilege = 0
        usage_wanted = 0
        verify_address_mode = 0
        verify_as_sender = 0
        version_printed = 0
        alias_arg = (uschar *) 0x0
        called_as = (uschar *) 0x554616 ""
        cmdline_syslog_name = (uschar *) 0x0
        start_queue_run_id = (uschar *) 0x0
        stop_queue_run_id = (uschar *) 0x0
        expansion_test_message = (uschar *) 0x0
        ftest_domain = (uschar *) 0x0
        ftest_localpart = (uschar *) 0x0
---Type <return> to continue, or q <return> to quit---
        ftest_prefix = (uschar *) 0x0
        ftest_suffix = (uschar *) 0x0
        log_oneline = (uschar *) 0x0
        malware_test_file = (uschar *) 0x0
        real_sender_address = (uschar *) 0x7fffffffeb58 " \005v"
        originator_home = (uschar *) 0x803065018 "/root"
        sz = 140737488350168
        reset_point = (void *) 0x80231a310
        pw = (struct passwd *) 0x802698328
        statbuf = {st_dev = 1895890688, st_ino = 5, st_mode = 8576, st_nlink = 1, st_uid = 0, st_gid = 0, st_rdev = 5, st_atim = {tv_sec = 1455201710, tv_nsec = 90063000}, st_mtim = {tv_sec = 1455201749, tv_nsec = 0}, st_ctim = {tv_sec = 1455201749, tv_nsec = 0}, st_size = 0, st_blocks = 0, st_blksize = 4096, st_flags = 0, st_gen = 0, st_lspare = 0, st_birthtim = {tv_sec = -1, tv_nsec = 0}}
        passed_qr_pid = 0
        passed_qr_pipe = -1
        group_list = 0x7fffffffd8f0
        info_flag = CMDINFO_NONE
        info_stdout = 0
Current language:  auto; currently minimal
(gdb) x/i $rip
0x428c9e <handle_smtp_call+2222>:       mov    %eax,0x764b34
(gdb) info registers
rax            0x7bc            1980
rbx            0x0              0
rcx            0x4              4
rdx            0x1              1
rsi            0x8030064e8      34410095848
rdi            0x3              3
rbp            0x7fffffffc9d0   0x7fffffffc9d0
rsp            0x7fffffffc830   0x7fffffffc830
r8             0x0              0
r9             0xfffff8003b2a1000       -8795100409856
r10            0x1              1
r11            0x246            582
r12            0x7fffffffed40   140737488350528
r13            0x7fffffffed68   140737488350568
r14            0x7fffffffed48   140737488350536
r15            0x3              3
rip            0x428c9e         0x428c9e <handle_smtp_call+2222>
eflags         0x10202          66050
cs             0x43             67
ss             0x3b             59
ds             0x0              0
es             0x0              0
fs             0x0              0
gs             0x0              0

procstat -v exim.core
PID START END PRT RES PRES REF SHD FL TP PATH
53547 0x400000 0x560000 r-x 324 637 3 1 CN-- vn /usr/local/sbin/exim
53547 0x760000 0x76a000 rw- 10 0 1 0 C--- vn /usr/local/sbin/exim
53547 0x76a000 0x779000 rw- 15 0 1 0 C--- df
53547 0x800760000 0x80077e000 r-x 30 32 35 0 CN-- vn /libexec/ld-elf.so.1
53547 0x80077e000 0x800785000 rw- 7 7 2 0 CN-- df
53547 0x800787000 0x8007b7000 rw- 48 48 2 1 CN-- df
53547 0x80097d000 0x80097f000 rw- 2 2 2 0 CN-- df
53547 0x80097f000 0x80098d000 r-x 14 16 7 3 CN-- vn
53547 0x80098d000 0x800b8d000 --- 0 0 2 0 CN-- df
53547 0x800b8d000 0x800b8e000 rw- 1 0 2 0 CN-- vn
53547 0x800b8e000 0x800b9f000 rw- 0 0 0 0 ---- --
53547 0x800b9f000 0x800bc8000 r-x 41 46 9 4 CN-- vn
53547 0x800bc8000 0x800dc7000 --- 0 0 2 0 CN-- df
53547 0x800dc7000 0x800dc8000 rw- 1 0 2 0 CN-- vn
53547 0x800dc8000 0x800dd8000 r-x 16 18 19 9 CN-- vn
53547 0x800dd8000 0x800fd8000 --- 0 0 2 0 CN-- df
53547 0x800fd8000 0x800fd9000 rw- 1 0 2 0 CN-- vn
53547 0x800fd9000 0x800fdb000 rw- 0 0 0 0 ---- --
53547 0x800fdb000 0x800ff3000 r-x 20 23 3 1 CN-- vn
53547 0x800ff3000 0x8011f3000 --- 0 0 2 0 CN-- df
53547 0x8011f3000 0x8011f5000 rw- 2 0 2 0 CN-- vn
53547 0x8011f5000 0x80120d000 r-x 24 26 20 9 CN-- vn /lib/libthr.so.3
53547 0x80120d000 0x80140d000 --- 0 0 2 0 CN-- df
53547 0x80140d000 0x80140e000 rw- 1 0 1 0 C--- vn /lib/libthr.so.3
53547 0x80140e000 0x80141a000 rw- 12 0 1 0 C--- df
53547 0x80141a000 0x801445000 r-x 41 45 5 2 CN-- vn
53547 0x801445000 0x801645000 --- 0 0 2 0 CN-- df
53547 0x801645000 0x801648000 rw- 3 0 2 0 CN-- vn
53547 0x801648000 0x80181d000 r-x 425 434 3 1 CN-- vn
53547 0x80181d000 0x801a1c000 --- 0 0 2 0 CN-- df
53547 0x801a1c000 0x801a25000 rw- 9 0 2 0 CN-- vn
53547 0x801a25000 0x801a88000 r-x 87 98 27 13 CN-- vn /usr/lib/libssl.so.7
53547 0x801a88000 0x801c88000 --- 0 0 2 0 CN-- df
53547 0x801c88000 0x801c92000 rw- 10 0 2 0 CN-- vn /usr/lib/libssl.so.7
53547 0x801c92000 0x801e5f000 r-x 461 501 31 15 CN-- vn /lib/libcrypto.so.7
53547 0x801e5f000 0x80205f000 --- 0 0 2 0 CN-- df
53547 0x80205f000 0x802087000 rw- 40 0 2 0 CN-- vn /lib/libcrypto.so.7
53547 0x802087000 0x802089000 rw- 2 2 2 0 CN-- df
53547 0x802089000 0x8020fe000 r-x 45 46 8 3 CN-- vn
53547 0x8020fe000 0x8022fd000 --- 0 0 2 0 CN-- df
53547 0x8022fd000 0x8022fe000 rw- 1 0 2 0 CN-- vn
53547 0x8022fe000 0x802476000 r-x 376 390 68 33 CN-- vn /lib/libc.so.7
53547 0x802476000 0x802675000 --- 0 0 2 0 CN-- df
53547 0x802675000 0x802681000 rw- 12 0 1 0 C--- vn /lib/libc.so.7
53547 0x802681000 0x8026ab000 rw- 42 0 1 0 C--- df
53547 0x8026ab000 0x8026b5000 r-x 10 14 10 4 CN-- vn /usr/local/lib/libintl.so.8.1.4
53547 0x8026b5000 0x8028b5000 --- 0 0 2 0 CN-- df
53547 0x8028b5000 0x8028b6000 rw- 1 0 2 0 CN-- vn /usr/local/lib/libintl.so.8.1.4
53547 0x802c00000 0x803400000 rw- 512 0 1 0 C--- df
53547 0x7fffdfffe000 0x7fffdffff000 --- 0 0 0 0 ---- --
53547 0x7ffffffdf000 0x7ffffffff000 rw- 10 0 1 0 C--D df
53547 0x7ffffffff000 0x800000000000 r-x 1 1 37 0 ---- ph
#0  t_pop () at data-stack.c:275
275     data-stack.c: No such file or directory.
        in data-stack.c
(gdb) bt full
#0  t_pop () at data-stack.c:275
        frame_block = (struct stack_frame_block *) 0x8014b1378
#1  0x000000080090e584 in io_loop_call_io (io=0x80144e4c0) at ioloop.c:560
        ioloop = (struct ioloop *) 0x801421080
        t_id = 2
#2  0x0000000800911497 in io_loop_handler_run_internal (ioloop=0x801421080) at ioloop-kqueue.c:151
        ctx = (struct ioloop_handler_context *) 0x801419880
        events = (struct kevent *) 0x8014b9000
        event = (const struct kevent *) 0x8014b9000
        tv = {tv_sec = 43499, tv_usec = 883196}
        ts = {tv_sec = 43499, tv_nsec = 883196000}
        io = (struct io_file *) 0x80144e4c0
        events_count = 67
        ret = 1
        i = 0
#3  0x000000080090eb0e in io_loop_handler_run (ioloop=0x801421080) at ioloop.c:607
No locals.
#4  0x000000080090e92f in io_loop_run (ioloop=0x801421080) at ioloop.c:583
No locals.
#5  0x000000080086a8bb in master_service_run (service=0x80141b140, callback=0) at master-service.c:640
No locals.
#6  0x0000000000406558 in main (argc=3, argv=0x7fffffffb4c0) at main.c:888
        set = (struct master_settings *) 0x801429110
        error = 0x800ca9283 "H\203Ä ]Ã\017\037\200"
        doveconf_arg = 0x0
        orig_info_callback = (failure_callback_t *) 0x8008ee880 <i_syslog_error_handler>
        orig_debug_callback = (failure_callback_t *) 0x8008ee880 <i_syslog_error_handler>
        foreground = false
        ask_key_pass = false
        i = 3
        c = -1
        doubleopts = 0x7fffffffb3a0
Current language:  auto; currently minimal
(gdb) x/i $rip
0x8008eb306 <t_pop+86>: mov    %rax,0x29d523(%rip)        # 0x800b88830 <current_block>
(gdb) info registers
rax            0x801406000      34380734464
rbx            0x0              0
rcx            0x80140f280      34380771968
rdx            0x0              0
rsi            0x7fffffffb198   140737488335256
rdi            0x0              0
rbp            0x7fffffffb280   0x7fffffffb280
rsp            0x7fffffffb260   0x7fffffffb260
r8             0x0              0
r9             0xfffffe007c476718       -2196938201320
r10            0x0              0
r11            0x246            582
r12            0x7fffffffb4b8   140737488336056
r13            0x7fffffffb4e0   140737488336096
r14            0x7fffffffb4c0   140737488336064
r15            0x3              3
rip            0x8008eb306      0x8008eb306 <t_pop+86>
eflags         0x10246          66118
cs             0x43             67
ss             0x3b             59
ds             0x0              0
es             0x0              0
fs             0x0              0
gs             0x0              0

procstat -v dovecot.core
PID START END PRT RES PRES REF SHD FL TP PATH
716 0x400000 0x417000 r-x 19 19 1 0 CN-- vn
716 0x617000 0x618000 rw- 1 1 1 0 CN-- df
716 0x800617000 0x800635000 r-x 30 32 32 0 CN-- vn /libexec/ld-elf.so.1
716 0x800635000 0x800663000 rw- 36 36 1 0 C--- df
716 0x800834000 0x800836000 rw- 2 2 1 0 CN-- df
716 0x800836000 0x800983000 r-x 213 219 34 17 CN-- vn /usr/local/lib/dovecot/libdovecot.so.0.0.0
716 0x800983000 0x800b82000 --- 0 0 1 0 CN-- df
716 0x800b82000 0x800b89000 rw- 7 0 1 0 C--- vn /usr/local/lib/dovecot/libdovecot.so.0.0.0
716 0x800b89000 0x800b8b000 rw- 2 2 1 0 CN-- df
716 0x800b8b000 0x800d03000 r-x 376 390 63 31 CN-- vn /lib/libc.so.7
716 0x800d03000 0x800f02000 --- 0 0 1 0 CN-- df
716 0x800f02000 0x800f0e000 rw- 12 0 1 0 CN-- vn /lib/libc.so.7
716 0x800f0e000 0x800f38000 rw- 13 13 1 0 CN-- df
716 0x801000000 0x801800000 rw- 94 94 1 0 C--- df
716 0x7ffffffdf000 0x7ffffffff000 rw- 8 8 1 0 C--D df
716 0x7ffffffff000 0x800000000000 r-x 1 1 34 0 ---- ph
FreeBSD 10.3-BETA1 #1 r295491M:

Failed to write core file for process exim (error 14)
pid 743 (exim), uid 26: exited on signal 11

#0  __thr_fork () at /.amd_mnt/juno/pub/FreeBSD/10-STABLE/lib/libthr/thread/thr_fork.c:186
186     /.amd_mnt/juno/pub/FreeBSD/10-STABLE/lib/libthr/thread/thr_fork.c: No such file or directory.
        in /.amd_mnt/juno/pub/FreeBSD/10-STABLE/lib/libthr/thread/thr_fork.c
[New Thread 803006400 (LWP 100046/<unknown>)]
(gdb) bt full
#0  __thr_fork () at /.amd_mnt/juno/pub/FreeBSD/10-STABLE/lib/libthr/thread/thr_fork.c:186
        rtld_locks = Cannot access memory at address 0x7fffffffc7d0
Current language:  auto; currently minimal
(gdb) x/i $rip
0x801206f3e <__thr_fork+190>:   callq  0x8012090e0 <__error_threaded>
(gdb) info registers
rax            0x17e44          97860
rbx            0x0              0
rcx            0x80238f70a      34397026058
rdx            0x0              0
rsi            0x801209658      34378651224
rdi            0x2              2
rbp            0x7fffffffc820   0x7fffffffc820
rsp            0x7fffffffc7c0   0x7fffffffc7c0
r8             0x2              2
r9             0x2              2
r10            0x2710           10000
r11            0x246            582
r12            0x803006400      34410095616
r13            0x76b5b0         7779760
r14            0x17e44          97860
r15            0x0              0
rip            0x801206f3e      0x801206f3e <__thr_fork+190>
eflags         0x10246          66118
cs             0x43             67
ss             0x3b             59
ds             0x0              0
es             0x0              0
fs             0x0              0
gs             0x0              0

procstat -f exim.core
PID COMM FD T V FLAGS REF OFFSET PRO NAME
743 exim text v r r------- - - - /usr/local/sbin/exim
743 exim cwd v d r------- - - - /var/spool/exim
743 exim root v d r------- - - - /
743 exim 0 v c rw------ 3 0 - /dev/null
743 exim 1 v c rw------ 3 0 - /dev/null
743 exim 2 v c rw------ 3 0 - /dev/null
743 exim 3 s - rw------ 1 0 TCP ::.25 ::.0
743 exim 4 s - rw------ 1 0 TCP 0.0.0.0:25 0.0.0.0:0
743 exim 5 s - rw------ 2 0 TCP 2607:f058:110:1::1:2.25 2607:f058:110:1::f:1.38009
743 exim 6 s - rw---n-- 1 0 UDP 0.0.0.0:863 0.0.0.0:0
743 exim 7 s - rw------ 2 0 TCP 2607:f058:110:1::1:2.25 2607:f058:110:1::f:1.38009
743 exim 8 s - rw------ 1 0 UDD /var/run/log
#0  0x000000080120716f in __thr_fork () at /.amd_mnt/juno/pub/FreeBSD/10-STABLE/lib/libthr/thread/thr_fork.c:257
257     /.amd_mnt/juno/pub/FreeBSD/10-STABLE/lib/libthr/thread/thr_fork.c: No such file or directory.
        in /.amd_mnt/juno/pub/FreeBSD/10-STABLE/lib/libthr/thread/thr_fork.c
Cannot find new threads: generic error
(gdb) bt full
#0  0x000000080120716f in __thr_fork () at /.amd_mnt/juno/pub/FreeBSD/10-STABLE/lib/libthr/thread/thr_fork.c:257
        rtld_locks = Cannot access memory at address 0x7fffffffc7d0
Current language:  auto; currently minimal
(gdb) x/i $rip
0x80120716f <__thr_fork+751>:   mov    %ecx,(%rax)
(gdb) info registers
rax            0x8026a9358      34400277336
rbx            0x0              0
rcx            0x4              4
rdx            0x1              1
rsi            0x8030064e8      34410095848
rdi            0x3              3
rbp            0x7fffffffc820   0x7fffffffc820
rsp            0x7fffffffc7c0   0x7fffffffc7c0
r8             0x0              0
r9             0xfffff800484e49e0       -8794879931936
r10            0x1              1
r11            0x246            582
r12            0x803006400      34410095616
r13            0x76b5b0         7779760
r14            0x6f3f           28479
r15            0x0              0
rip            0x80120716f      0x80120716f <__thr_fork+751>
eflags         0x10202          66050
cs             0x43             67
ss             0x3b             59
ds             0x0              0
es             0x0              0
fs             0x0              0
gs             0x0              0
(gdb)

procstat -v exim.core
PID START END PRT RES PRES REF SHD FL TP PATH
99063 0x400000 0x560000 r-x 313 626 2 1 CN-- vn /usr/local/sbin/exim
99063 0x760000 0x76a000 rw- 10 0 1 0 CN-- vn /usr/local/sbin/exim
99063 0x76a000 0x779000 rw- 5 5 1 0 CN-- df
99063 0x800760000 0x80077e000 r-x 30 32 31 0 CN-- vn /libexec/ld-elf.so.1
99063 0x80077e000 0x800785000 rw- 7 7 1 0 CN-- df
99063 0x800787000 0x8007b7000 rw- 40 40 1 0 CN-- df
99063 0x80097d000 0x80097f000 rw- 2 2 1 0 CN-- df
99063 0x80097f000 0x80098d000 r-x 14 16 11 3 CN-- vn
99063 0x80098d000 0x800b8d000 --- 0 0 1 0 CN-- df
99063 0x800b8d000 0x800b8e000 rw- 1 0 1 0 CN-- vn
99063 0x800b8e000 0x800b9f000 rw- 0 0 0 0 ---- --
99063 0x800b9f000 0x800bc8000 r-x 41 46 18 6 CN-- vn
99063 0x800bc8000 0x800dc7000 --- 0 0 1 0 CN-- df
99063 0x800dc7000 0x800dc8000 rw- 1 0 1 0 CN-- vn
99063 0x800dc8000 0x800dd8000 r-x 16 18 25 10 CN-- vn
99063 0x800dd8000 0x800fd8000 --- 0 0 1 0 CN-- df
99063 0x800fd8000 0x800fd9000 rw- 1 0 1 0 CN-- vn
99063 0x800fd9000 0x800fdb000 rw- 0 0 0 0 ---- --
99063 0x800fdb000 0x800ff3000 r-x 24 27 2 1 CN-- vn
99063 0x800ff3000 0x8011f3000 --- 0 0 1 0 CN-- df
99063 0x8011f3000 0x8011f5000 rw- 2 0 1 0 CN-- vn
99063 0x8011f5000 0x80120d000 r-x 24 26 23 8 CN-- vn /lib/libthr.so.3
99063 0x80120d000 0x80140d000 --- 0 0 1 0 CN-- df
99063 0x80140d000 0x80140e000 rw- 1 0 1 0 CN-- vn /lib/libthr.so.3
99063 0x80140e000 0x80141a000 rw- 12 12 1 0 C--- df
99063 0x80141a000 0x801445000 r-x 39 43 2 1 CN-- vn
99063 0x801445000 0x801645000 --- 0 0 1 0 CN-- df
99063 0x801645000 0x801648000 rw- 3 0 1 0 CN-- vn
99063 0x801648000 0x80181d000 r-x 429 438 9 2 CN-- vn /usr/local/lib/perl5/5.22/mach/CORE/libperl.so.5.22.1
99063 0x80181d000 0x801a1c000 --- 0 0 1 0 CN-- df
99063 0x801a1c000 0x801a25000 rw- 9 0 1 0 CN-- vn /usr/local/lib/perl5/5.22/mach/CORE/libperl.so.5.22.1
99063 0x801a25000 0x801a88000 r-x 81 92 8 4 CN-- vn /usr/lib/libssl.so.7
99063 0x801a88000 0x801c88000 --- 0 0 1 0 CN-- df
99063 0x801c88000 0x801c92000 rw- 10 0 1 0 CN-- vn /usr/lib/libssl.so.7
99063 0x801c92000 0x801e5f000 r-x 461 501 12 6 CN-- vn /lib/libcrypto.so.7
99063 0x801e5f000 0x80205f000 --- 0 0 1 0 CN-- df
99063 0x80205f000 0x802087000 rw- 40 0 1 0 CN-- vn /lib/libcrypto.so.7
99063 0x802087000 0x802089000 rw- 1 1 1 0 CN-- df
99063 0x802089000 0x8020fe000 r-x 43 44 7 3 CN-- vn
99063 0x8020fe000 0x8022fd000 --- 0 0 1 0 CN-- df
99063 0x8022fd000 0x8022fe000 rw- 1 0 1 0 CN-- vn
99063 0x8022fe000 0x802476000 r-x 376 390 55 24 CN-- vn /lib/libc.so.7
99063 0x802476000 0x802675000 --- 0 0 1 0 CN-- df
99063 0x802675000 0x802681000 rw- 12 0 1 0 CN-- vn /lib/libc.so.7
99063 0x802681000 0x8026ab000 rw- 11 0 1 0 C--- df
99063 0x8026ab000 0x8026b5000 r-x 10 14 7 3 CN-- vn /usr/local/lib/libintl.so.8.1.4
99063 0x8026b5000 0x8028b5000 --- 0 0 1 0 CN-- df
99063 0x8028b5000 0x8028b6000 rw- 1 0 1 0 CN-- vn /usr/local/lib/libintl.so.8.1.4
99063 0x802c00000 0x803400000 rw- 222 222 1 0 C--- df
99063 0x7fffdfffe000 0x7fffdffff000 --- 0 0 0 0 ---- --
99063 0x7ffffffdf000 0x7ffffffff000 rw- 10 10 1 0 C--D df
99063 0x7ffffffff000 0x800000000000 r-x 1 1 33 0 ---- ph
(In reply to Robert Blayzor from comment #43)

So are your binaries and libs executed from an NFS mount? Could you try mounting them statically instead of using the automounter? Even better, try with a local filesystem.

Another thing to do is to set the sysctl machdep.uprintf_signal to 1. I would need the matching output on the failure and procstat -v output from the (corrupted) core.

Do you run ntpd? Do failures continue to appear if you stop both ntpd and the automounter?
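That is, a one-liner on the affected VM (add it to /etc/sysctl.conf if you want it to persist across reboots):

sysctl machdep.uprintf_signal=1   # print trap/signal details for the faulting process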
I'm not using the automounter on the VMs I'm seeing this failure on. They do, however, have the root file system mounted over NFS, which is done at boot (NFS_ROOT, diskless boot). The entire root file system is mounted read-only and includes the entire base system (including /usr/local). The only other things created are memory disks for /var and /etc, plus any NFS file systems used for mail. I do run ntpd (from ports) on each VM. I can try stopping ntpd and see if that makes a difference, but knowing VM hosts, the clocks may drift quite a bit... Will disable ntpd and report back.
(In reply to Robert Blayzor from comment #47)
uprintf_signal is more interesting. Wait, a VM? Which hypervisor? Is it possible to reproduce the issue on bare metal?
VMware ESXi 5.1u2. I have *NOT* seen this behavior on other VMs created on the same set of ESXi hosts. The common thread (difference) seems to be that these are set up as "diskless" boot VMs whose root filesystem is mounted read-only via NFS (NetApp). I have bare-metal hosts running 10.2 with no problem. At the same time, I have VM hosts with a local disk (vmdk) that are fine, and they're on the same physical ESXi hosts. I'll set the sysctl machdep.uprintf_signal as well.
Different process, similar crash... the core file did dump. Note: the "amd" mount you see is just the machine I compiled the binaries (buildworld) on. No automounter is used on the diskless boot VMs.

pid 63755 (managesieve-login), uid 0: exited on signal 11 (core dumped)

#0  memset () at /.amd_mnt/juno/pub/FreeBSD/10-STABLE/lib/libc/amd64/string/memset.S:51
51      /.amd_mnt/juno/pub/FreeBSD/10-STABLE/lib/libc/amd64/string/memset.S: No such file or directory.
        in /.amd_mnt/juno/pub/FreeBSD/10-STABLE/lib/libc/amd64/string/memset.S
(gdb) bt full
#0  memset () at /.amd_mnt/juno/pub/FreeBSD/10-STABLE/lib/libc/amd64/string/memset.S:51
No locals.
#1  0x00000000106189e0 in map_object (fd=9, path=0x1062a260 "/lib/libc.so.7", sb=0x7fffffffd920) at /.amd_mnt/juno/pub/FreeBSD/10-STABLE/libexec/rtld-elf/map_object.c:242
        phsize = <value optimized out>
        stack_flags = Cannot access memory at address 0x7
Current language:  auto; currently asm
(gdb) x/i $rip
0x1061c524 <memset+68>: rep stos %rax,%es:(%rdi)
(gdb) info registers
rax            0x0      0
rbx            0x1      1
rcx            0x4a     74
rdx            0x0      0
rsi            0x0      0
rdi            0x11787db0       293109168
rbp            0x7fffffffd8f0   0x7fffffffd8f0
rsp            0x7fffffffd7a8   0x7fffffffd7a8
r8             0x250    592
r9             0xfffffe00945fd6b8       -2196533946696
r10            0x0      0
r11            0x11787db0       293109168
r12            0x9      9
r13            0x1063a000       274964480
r14            0x1062a260       274899552
r15            0x7fffffffd920   140737488345376
rip            0x1061c524       0x1061c524 <memset+68>
eflags         0x10202  66050
cs             0x43     67
ss             0x3b     59
ds             0x0      0
es             0x0      0
fs             0x0      0
gs             0x0      0

procstat -v managesieve-login.core
  PID              START                END PRT  RES PRES REF SHD FL TP PATH
63755 0x400000 0x40b000 r-x 7 8 1 0 CN-- vn /usr/local/libexec/dovecot/managesieve-login
63755 0x60b000 0x60c000 rw- 1 1 1 0 ---- df
63755 0x1060b000 0x10629000 r-x 30 30 32 0 CN-- vn /libexec/ld-elf.so.1
63755 0x10629000 0x1063a000 rw- 11 11 1 0 ---- df
63755 0x1063a000 0x1063b000 r-- 1 387 64 31 CN-- vn /lib/libc.so.7
63755 0x10828000 0x1082a000 rw- 2 2 1 0 ---- df
63755 0x1082a000 0x1084b000 r-x 27 28 18 9 CN-- vn /usr/local/lib/dovecot/libdovecot-login.so.0.0.0
63755 0x1084b000 0x10a4a000 --- 0 0 1 0 ---- df
63755 0x10a4a000 0x10a4c000 rw- 2 0 1 0 C--- vn /usr/local/lib/dovecot/libdovecot-login.so.0.0.0
63755 0x10a4c000 0x10aaf000 r-x 80 90 28 14 CN-- vn /usr/lib/libssl.so.7
63755 0x10aaf000 0x10caf000 --- 0 0 1 0 ---- df
63755 0x10caf000 0x10cb9000 rw- 10 0 1 0 C--- vn /usr/lib/libssl.so.7
63755 0x10cb9000 0x10e86000 r-x 461 501 30 15 CN-- vn /lib/libcrypto.so.7
63755 0x10e86000 0x11086000 --- 0 0 1 0 ---- df
63755 0x11086000 0x110ae000 rw- 40 0 1 0 C--- vn /lib/libcrypto.so.7
63755 0x110ae000 0x110b0000 rw- 0 0 0 0 ---- --
63755 0x110b0000 0x111fd000 r-x 213 219 36 18 CN-- vn /usr/local/lib/dovecot/libdovecot.so.0.0.0
63755 0x111fd000 0x113fc000 --- 0 0 1 0 ---- df
63755 0x113fc000 0x11403000 rw- 7 0 1 0 C--- vn /usr/local/lib/dovecot/libdovecot.so.0.0.0
63755 0x11403000 0x11405000 rw- 0 0 0 0 ---- --
63755 0x11405000 0x1157d000 r-x 376 387 64 31 CN-- vn /lib/libc.so.7
63755 0x1157d000 0x1177c000 --- 0 0 2 0 ---- df
63755 0x1177c000 0x11788000 rw- 12 0 1 0 C--- vn /lib/libc.so.7
63755 0x11788000 0x117b2000 --- 0 0 2 0 ---- df
63755 0x7ffffffdf000 0x7ffffffff000 rw- 3 3 1 0 ---D df
63755 0x7ffffffff000 0x800000000000 r-x 1 1 34 0 ---- ph
(In reply to Robert Blayzor from comment #50)
This is the n-th copy of the same data. I need to see the message from the kernel on the segfault; it is either printed on the controlling terminal of the process, or in the log if no controlling terminal was active.
(In reply to Konstantin Belousov from comment #51)
Hm, no, we apparently do not fall back to printf() if no controlling terminal is assigned. Please do the following: in sys/amd64/amd64/trap.c, near the end of the function trap(), in the fragment

	if (uprintf_signal) {
		uprintf("pid %d comm %s: signal %d err %lx code %d type %d "

replace uprintf() with printf(). The messages should then appear in dmesg.
I have made the changes and installed the new kernel, plus a reboot. I will report what dmesg gives on the next go-around. I was merely supplying the earlier data to show that I have multiple different processes crashing under what looks like the same condition. More to come...
pid 852 comm exim: signal 11 err 7 code 2 type 12 addr 0x7fffffffc7b8 rsp 0x7fffffffc7c0 rip 0x801206f3e <e8 9d 21 00 00 45 85 f6>
Failed to write core file for process exim (error 14)
pid 852 (exim), uid 26: exited on signal 11

procstat -v /var/spool/tmp/exim.core
  PID              START                END PRT  RES PRES REF SHD FL TP PATH
852 0x400000 0x560000 r-x 299 310 2 1 CN-- vn /usr/local/sbin/exim
852 0x760000 0x76a000 rw- 10 0 1 0 CN-- vn /usr/local/sbin/exim
852 0x76a000 0x779000 rw- 5 5 1 0 CN-- df
852 0x800760000 0x80077e000 r-x 30 30 34 0 CN-- vn /libexec/ld-elf.so.1
852 0x80077e000 0x800785000 rw- 7 7 1 0 CN-- df
852 0x800787000 0x8007b7000 rw- 40 40 1 0 CN-- df
852 0x80097d000 0x80097f000 rw- 2 2 1 0 CN-- df
852 0x80097f000 0x80098d000 r-x 8 9 6 3 CN-- vn
852 0x80098d000 0x800b8d000 --- 0 0 1 0 CN-- df
852 0x800b8d000 0x800b8e000 rw- 1 0 1 0 CN-- vn
852 0x800b8e000 0x800b9f000 rw- 0 0 0 0 ---- --
852 0x800b9f000 0x800bc8000 r-x 41 41 8 4 CN-- vn
852 0x800bc8000 0x800dc7000 --- 0 0 1 0 CN-- df
852 0x800dc7000 0x800dc8000 rw- 1 0 1 0 CN-- vn
852 0x800dc8000 0x800dd8000 r-x 13 14 18 9 CN-- vn
852 0x800dd8000 0x800fd8000 --- 0 0 1 0 CN-- df
852 0x800fd8000 0x800fd9000 rw- 1 0 1 0 CN-- vn
852 0x800fd9000 0x800fdb000 rw- 0 0 0 0 ---- --
852 0x800fdb000 0x800ff3000 r-x 15 17 2 1 CN-- vn
852 0x800ff3000 0x8011f3000 --- 0 0 1 0 CN-- df
852 0x8011f3000 0x8011f5000 rw- 2 0 1 0 CN-- vn
852 0x8011f5000 0x80120d000 r-x 24 25 25 12 CN-- vn /lib/libthr.so.3
852 0x80120d000 0x80140d000 --- 0 0 1 0 CN-- df
852 0x80140d000 0x80140e000 rw- 1 0 1 0 CN-- vn /lib/libthr.so.3
852 0x80140e000 0x80141a000 rw- 12 12 1 0 CN-- df
852 0x80141a000 0x801445000 r-x 35 38 4 2 CN-- vn
852 0x801445000 0x801645000 --- 0 0 1 0 CN-- df
852 0x801645000 0x801648000 rw- 3 0 1 0 CN-- vn
852 0x801648000 0x80181d000 r-x 368 376 2 1 CN-- vn
852 0x80181d000 0x801a1c000 --- 0 0 1 0 CN-- df
852 0x801a1c000 0x801a25000 rw- 9 0 1 0 CN-- vn
852 0x801a25000 0x801a88000 r-x 80 90 26 13 CN-- vn /usr/lib/libssl.so.7
852 0x801a88000 0x801c88000 --- 0 0 1 0 CN-- df
852 0x801c88000 0x801c92000 rw- 10 0 1 0 CN-- vn /usr/lib/libssl.so.7
852 0x801c92000 0x801e5f000 r-x 461 501 30 15 CN-- vn /lib/libcrypto.so.7
852 0x801e5f000 0x80205f000 --- 0 0 1 0 CN-- df
852 0x80205f000 0x802087000 rw- 40 0 1 0 CN-- vn /lib/libcrypto.so.7
852 0x802087000 0x802089000 rw- 1 1 1 0 CN-- df
852 0x802089000 0x8020fe000 r-x 43 43 7 3 CN-- vn
852 0x8020fe000 0x8022fd000 --- 0 0 1 0 CN-- df
852 0x8022fd000 0x8022fe000 rw- 1 0 1 0 CN-- vn
852 0x8022fe000 0x802476000 r-x 376 387 67 33 CN-- vn /lib/libc.so.7
852 0x802476000 0x802675000 --- 0 0 1 0 CN-- df
852 0x802675000 0x802681000 rw- 12 0 1 0 CN-- vn /lib/libc.so.7
852 0x802681000 0x8026ab000 rw- 11 11 1 0 CN-- df
852 0x8026ab000 0x8026b5000 r-x 10 11 9 4 CN-- vn
852 0x8026b5000 0x8028b5000 --- 0 0 1 0 CN-- df
852 0x8028b5000 0x8028b6000 rw- 1 0 1 0 CN-- vn
852 0x802c00000 0x803400000 rw- 218 218 1 0 CN-- df
852 0x7fffdfffe000 0x7fffdffff000 --- 0 0 0 0 ---- --
852 0x7ffffffdf000 0x7ffffffff000 rw- 10 0 1 0 C--D df
852 0x7ffffffff000 0x800000000000 r-x 1 1 36 0 ---- ph
bt from the last crash:

(gdb) bt full
#0  __thr_fork () at /.amd_mnt/juno/pub/FreeBSD/10-STABLE/lib/libthr/thread/thr_fork.c:186
        rtld_locks = Cannot access memory at address 0x7fffffffc7d0
Current language:  auto; currently minimal
(gdb)
(In reply to Robert Blayzor from comment #55)
The uprintf_signal data was useful, thanks. It is quite clear that the issue occurred in the child after a multithreaded fork. The evidence is in the 'CN' flags (copy-on-write and need-copy) on all map entries of the process except the stack. The number of stacks mapped also suggests that the parent had only one thread executing during the fork. This is confirmed by the backtrace from the core as well, but for different reasons I trust the core less. The fault occurred on a stack access, as indicated by the fault address. What is very interesting is the error code 7, which is hardware data indicating that this was a user-mode write to a page mapped read-only. That is again consistent with the state after fork. What I do not understand is why the page fault handler did not perform COW and change the page permission to writable, as it should have, given the rw permission on the map entry. Could you please apply the attached debugging patch on top of the source you are using, and provide the same data as now, i.e. the procstat -v output plus the kernel messages from both the patch and uprintf_signal.
Created attachment 167570 [details] Debugging patch to track source of KERN_PROTECTION_FAILURE
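To make the suspected failure mode concrete, the pattern under discussion looks like the sketch below (an illustration only, not a known reproducer; the file name forkcow.c and the loop structure are invented for the example). A libthr-linked process forks, every map entry in the child becomes copy-on-write, and the child's first write to its own stack must be resolved by COW in the fault handler:

/*
 * forkcow.c -- illustrative sketch.  Linking with -lpthread routes
 * fork() through libthr's __thr_fork(), as in the backtraces above.
 * Build: cc -o forkcow forkcow.c -lpthread
 */
#include <sys/wait.h>
#include <stdlib.h>
#include <unistd.h>

int
main(void)
{
	for (;;) {
		volatile char buf[64];	/* lives on the COW-mapped stack */
		pid_t pid = fork();

		if (pid == -1)
			exit(1);
		if (pid == 0) {
			/*
			 * The kind of user-mode write that reported err 7
			 * when the COW fault was mishandled.
			 */
			buf[0] = 1;
			_exit(0);
		}
		waitpid(pid, NULL, 0);
	}
}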
After both patches...

pid 716 (dovecot), uid 0: exited on signal 11 (core dumped)

(gdb) bt full
#0  memcpy () at bcopy.S:65
No locals.
#1  0x0000000800915484 in sig_handler (signo=20, si=0x7fffffffa9a0, context=0x7fffffffa630) at lib-signals.c:132
        h = (struct signal_handler *) 0x8014198e0
        saved_errno = 14
        c = 0 '\0'
        tmp_si = {si_signo = 0, si_errno = 0, si_code = 0, si_pid = 0, si_uid = 0, si_status = 0, si_addr = 0x0, si_value = {sival_int = 0, sival_ptr = 0x0, sigval_int = 0, sigval_ptr = 0x0}, _reason = {_fault = {_trapno = 0}, _timer = {_timerid = 0, _overrun = 0}, _mesgq = {_mqd = 0}, _poll = {_band = 0}, __spare__ = {__spare1__ = 0, __spare2__ = 0x7fffffffa5c8}}}
#2  <signal handler called>
No symbol table info available.
#3  0x0000000000411fc8 in _fini ()
No symbol table info available.
#4  0x000000080061c9a8 in objlist_call_fini () at /.amd_mnt/juno/pub/FreeBSD/10-STABLE/libexec/rtld-elf/rtld.c:2354
        list = (Objlist *) Cannot access memory at address 0x0
Current language:  auto; currently asm
(gdb) q

sudo procstat -v /var/spool/tmp/dovecot.core
  PID              START                END PRT  RES PRES REF SHD FL TP PATH
716 0x400000 0x417000 r-x 20 20 1 0 CN-- vn
716 0x617000 0x618000 rw- 1 1 1 0 C--- df
716 0x800617000 0x800635000 r-x 30 30 32 0 CN-- vn /libexec/ld-elf.so.1
716 0x800635000 0x800663000 rw- 36 36 1 0 C--- df
716 0x800834000 0x800836000 rw- 2 2 1 0 C--- df
716 0x800836000 0x800983000 r-x 225 231 34 17 CN-- vn /usr/local/lib/dovecot/libdovecot.so.0.0.0
716 0x800983000 0x800b82000 --- 0 0 1 0 CN-- df
716 0x800b82000 0x800b89000 rw- 7 0 1 0 C--- vn /usr/local/lib/dovecot/libdovecot.so.0.0.0
716 0x800b89000 0x800b8b000 rw- 2 0 1 0 C--- df
716 0x800b8b000 0x800d03000 r-x 376 387 63 31 CN-- vn /lib/libc.so.7
716 0x800d03000 0x800f02000 --- 0 0 1 0 CN-- df
716 0x800f02000 0x800f0e000 rw- 12 0 1 0 C--- vn /lib/libc.so.7
716 0x800f0e000 0x800f38000 rw- 15 15 1 0 C--- df
716 0x801000000 0x801800000 rw- 94 94 1 0 C--- df
716 0x7ffffffdf000 0x7ffffffff000 rw- 8 8 1 0 C--D df
716 0x7ffffffff000 0x800000000000 r-x 1 1 34 0 ---- ph
Thinking about this a little, and going out on a limb: the only thing that differs on these servers compared to the other diskless boot servers we have (which use the same read-only NFS root) is that they have larger memory disks for /var, created from /etc/rc.initdiskless.

/dev/md0 8.8M 3.8M 4.3M 47% /etc
/dev/md1 457M 516K 420M 0% /var

Because these are mail servers, /var may be a bit busier than on, say, DNS and web servers (/var/tmp is there, along with temporary file writes from files being scanned for viruses). I've never seen any "out of memory" errors, or that mount come anywhere close to full, but I'm sure there is a lot of reading and writing going on. I know it's a stretch, but it's the only thing I can think of that differs from the other diskless servers, which are set up the exact same way. This appears to be happening only on our mail servers, which also happen to be our busiest.
(In reply to Robert Blayzor from comment #58)
And where is the uprintf_signal output, as well as the printfs from the debugging patch?
I will make sure the patches didn't get overwritten somewhere, but I believe they are in there. One thing different about this crash is that Dovecot actually managed to log something before the crash that I've not seen before:

dovecot: master: Fatal: gettimeofday(): Bad address

That was in regard to the last backtrace.
Confirmed that the two patches are in the kernel I built and installed... from trap.c, around line 601:

	if (uprintf_signal) {
		printf("pid %d comm %s: signal %d err %lx code %d type %d "
		    "addr 0x%lx rsp 0x%lx rip 0x%lx "
		    "<%02x %02x %02x %02x %02x %02x %02x %02x>\n",
		    ...
I have rechecked and recompiled the patches against the latest 10.3. Now, under normal conditions, my dmesg just floods with thousands of messages like these:

pid 3142 kpf 1
pid 3142 kpf 1
pid 3142 kpf 1
pid 3143 kpf 1
pid 3143 kpf 1
pid 3143 kpf 1
pid 3144 kpf 1
pid 3144 kpf 1
pid 3145 kpf 1
pid 3145 kpf 1
pid 3146 kpf 1
pid 3146 kpf 1
pid 3146 kpf 1
pid 3147 kpf 1
pid 3147 kpf 1
pid 3147 kpf 1
pid 3149 kpf 1
pid 3149 kpf 1
pid 3149 kpf 1
pid 3147 kpf 1
pid 3147 kpf 1
pid 3147 kpf 1
pid 3147 kpf 1
pid 3147 kpf 1
pid 3150 kpf 1
pid 3150 kpf 1
pid 3147 kpf 1
pid 3151 kpf 1
pid 3151 kpf 1
pid 3151 kpf 1
pid 3151 kpf 1
pid 3151 kpf 1
pid 3151 kpf 1
...
(In reply to Robert Blayzor from comment #63)
Sorry for the delay in replying. For debugging purposes, please set debug.vn_io_fault_enable to 0; this should stop the message flood.
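Concretely, that is:

sysctl debug.vn_io_fault_enable=0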
Here is a recent crash...

Mar 15 11:58:17.019 mta1.alb kernel: pid 69942 kpf 2
Mar 15 11:58:17.019 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.019 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.019 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.019 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.019 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.019 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.019 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.019 mta1.alb kernel: pid 69942 kpf 1
[... more than a hundred identical "pid 69942 kpf 1" messages, Mar 15 11:58:17.221 through 11:58:18.238, omitted ...]
Mar 15 11:58:18.238 mta1.alb kernel: <6>pid 69942 (exim), uid 26: exited on signal 11 (core dumped)

procstat -v exim.core
  PID              START                END PRT  RES PRES REF SHD FL TP PATH
69942 0x400000 0x560000 r-x 314 325 3 1 CN-- vn /usr/local/sbin/exim
69942 0x760000 0x76a000 rw- 10 0 2 1 CN-- vn /usr/local/sbin/exim
69942 0x76a000 0x779000 rw- 5 5 2 1 CN-- df
69942 0x800760000 0x80077e000 r-x 30 30 33 0 CN-- vn /libexec/ld-elf.so.1
69942 0x80077e000 0x800785000 rw- 7 7 2 0 CN-- df
69942 0x800787000 0x8007b7000 rw- 40 40 2 1 CN-- df
69942 0x80097d000 0x80097f000 rw- 2 2 2 0 CN-- df
69942 0x80097f000 0x80098d000 r-x 14 15 7 3 CN-- vn
69942 0x80098d000 0x800b8d000 --- 0 0 2 0 CN-- df
69942 0x800b8d000 0x800b8e000 rw- 1 0 2 0 CN-- vn
69942 0x800b8e000 0x800b9f000 rw- 0 0 0 0 ---- --
69942 0x800b9f000 0x800bc8000 r-x 41 41 9 4 CN-- vn
69942 0x800bc8000 0x800dc7000 --- 0 0 2 0 CN-- df
69942 0x800dc7000 0x800dc8000 rw- 1 0 2 0 CN-- vn
69942 0x800dc8000 0x800dd8000 r-x 16 17 19 9 CN-- vn
69942 0x800dd8000 0x800fd8000 --- 0 0 2 0 CN-- df
69942 0x800fd8000 0x800fd9000 rw- 1 0 2 0 CN-- vn
69942 0x800fd9000 0x800fdb000 rw- 0 0 0 0 ---- --
69942 0x800fdb000 0x800ff3000 r-x 18 20 3 1 CN-- vn
69942 0x800ff3000 0x8011f3000 --- 0 0 2 0 CN-- df
69942 0x8011f3000 0x8011f5000 rw- 2 0 2 0 CN-- vn
69942 0x8011f5000 0x80120d000 r-x 24 25 18 8 CN-- vn
69942 0x80120d000 0x80140d000 --- 0 0 2 0 CN-- df
69942 0x80140d000 0x80140e000 rw- 1 0 2 0 CN-- vn
69942 0x80140e000 0x80141a000 rw- 12 0 1 0 C--- df
69942 0x80141a000 0x801445000 r-x 38 41 5 2 CN-- vn
69942 0x801445000 0x801645000 --- 0 0 2 0 CN-- df
69942 0x801645000 0x801648000 rw- 3 0 2 0 CN-- vn
69942 0x801648000 0x80181d000 r-x 404 412 3 1 CN-- vn /usr/local/lib/perl5/5.22/mach/CORE/libperl.so.5.22.1
69942 0x80181d000 0x801a1c000 --- 0 0 2 0 CN-- df
69942 0x801a1c000 0x801a25000 rw- 9 0 2 0 CN-- vn /usr/local/lib/perl5/5.22/mach/CORE/libperl.so.5.22.1
69942 0x801a25000 0x801a88000 r-x 84 94 27 13 CN-- vn /usr/lib/libssl.so.7
69942 0x801a88000 0x801c88000 --- 0 0 2 0 CN-- df
69942 0x801c88000 0x801c92000 rw- 10 0 2 0 CN-- vn /usr/lib/libssl.so.7
69942 0x801c92000 0x801e5f000 r-x 461 501 31 15 CN-- vn /lib/libcrypto.so.7
69942 0x801e5f000 0x80205f000 --- 0 0 2 0 CN-- df
69942 0x80205f000 0x802087000 rw- 40 0 2 0 CN-- vn /lib/libcrypto.so.7
69942 0x802087000 0x802089000 rw- 1 1 2 0 CN-- df
69942 0x802089000 0x8020fe000 r-x 45 45 8 3 CN-- vn
69942 0x8020fe000 0x8022fd000 --- 0 0 2 0 CN-- df
69942 0x8022fd000 0x8022fe000 rw- 1 0 2 0 CN-- vn
69942 0x8022fe000 0x802476000 r-x 376 387 64 31 CN-- vn /lib/libc.so.7
69942 0x802476000 0x802675000 --- 0 0 2 0 CN-- df
69942 0x802675000 0x802681000 rw- 12 0 2 0 CN-- vn /lib/libc.so.7
69942 0x802681000 0x8026ab000 rw- 11 11 2 1 CN-- df
69942 0x8026ab000 0x8026b5000 r-x 10 11 10 4 CN-- vn /usr/local/lib/libintl.so.8.1.4
69942 0x8026b5000 0x8028b5000 --- 0 0 2 0 CN-- df
69942 0x8028b5000 0x8028b6000 rw- 1 0 2 0 CN-- vn /usr/local/lib/libintl.so.8.1.4
69942 0x802c00000 0x803400000 rw- 218 0 1 0 C--- df
69942 0x7fffdfffe000 0x7fffdffff000 --- 0 0 0 0 ---- --
69942 0x7ffffffdf000 0x7ffffffff000 rw- 10 0 1 0 C--D df
69942 0x7ffffffff000 0x800000000000 r-x 1 1 35 0 ---- ph
backtrace from the previous crash:

#1  0x000000080120706d in __thr_fork () at /.amd_mnt/juno/pub/FreeBSD/10-STABLE/lib/libthr/thread/thr_fork.c:203
        rtld_locks = {-4800, 32767, -4760, 32767, -4792, 32767, 3, 0}
        curthread = (struct pthread *) 0x803006400
        cancelsave = 0
        ret = 0
        errsave = 4
#2  0x0000000000428c54 in handle_smtp_call (listen_sockets=0x803065140, listen_socket_count=4, accept_socket=7, accepted=0x7fffffffcd38) at daemon.c:355
        pid = 32767
        interface_sockaddr = {v4 = {sin_len = 28 '\034', sin_family = 28 '\034', sin_port = 6400, sin_addr = {s_addr = 0}, sin_zero = 0x7fffffffc9a0 "&\aðX\001\020"}, v6 = {sin6_len = 28 '\034', sin6_family = 28 '\034', sin6_port = 6400, sin6_flowinfo = 0, sin6_addr = {__u6_addr = {__u6_addr8 = 0x7fffffffc9a0 "&\aðX\001\020", __u6_addr16 = 0x7fffffffc9a0, __u6_addr32 = 0x7fffffffc9a0}}, sin6_scope_id = 0}, v0 = {sa_len = 28 '\034', sa_family = 28 '\034', sa_data = 0x7fffffffc99a ""}}
        ifsize = 28
        dup_accept_socket = 9
        max_for_this_host = 10
        wfsize = 100
        wfptr = 22
        use_log_write_selector = 1980
        whofrom = (uschar *) 0x803065e30 "[2607:f058:110:2::f:0]"
        reset_point = (void *) 0x803065e00
Sorry for another possible Hail Mary, but I've noticed that many of your faults are in the thread library. There were several changes to libthr between 10.1 and 10.2. They should all be bug fixes (in particular, fixes to allow applications to dlopen() an object that uses libthr). If these changes to libthr are relevant, then there are a few things to investigate and/or try:

1) Can you run ldd on your dovecot and exim binaries to see if they are directly linked against libthr?

2) Do you know if your binaries are loading any "plugins" (e.g. SASL auth mechanisms, or milters; I know postfix and sendmail can both use milters for things like greylisting)? If so, are any of your plugins linked against libthr?

If the answer to 1) is no, then the first thing to try is to set LD_PRELOAD=libthr.so.3 in the environment of the dovecot and exim binaries when you start them up. (If you are using rc.d scripts, you can do this via foo_env variables in rc.conf; see the sketch below.)
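A hypothetical rc.conf fragment for that LD_PRELOAD test (the dovecot_/exim_ variable prefixes are assumptions; check the names of the rc.d scripts the ports actually install):

# Test only: force libthr to be loaded first for these services.
dovecot_env="LD_PRELOAD=/lib/libthr.so.3"
exim_env="LD_PRELOAD=/lib/libthr.so.3"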
Exim most certainly is linked directly...

ldd /usr/local/sbin/exim | grep thr
	libthr.so.3 => /lib/libthr.so.3 (0x8011f5000)

Also, many of the Dovecot libs are linked against libthr as well...

ldd /usr/local/lib/dovecot/libdovecot* | grep thr
	libthr.so.3 => /lib/libthr.so.3 (0x801fb0000)
	libthr.so.3 => /lib/libthr.so.3 (0x801fb0000)
	libthr.so.3 => /lib/libthr.so.3 (0x801fb0000)
	libthr.so.3 => /lib/libthr.so.3 (0x801b91000)
	libthr.so.3 => /lib/libthr.so.3 (0x801b91000)
	libthr.so.3 => /lib/libthr.so.3 (0x801b91000)
(In reply to Robert Blayzor from comment #68)
Ok. Some of the changes in 10.2 fixed problems related to loading libthr.so.3 at runtime after the binary had started (which is why I asked my question). I think my suggestion of using LD_PRELOAD would not be a useful test given your reply. It might still be useful to try putting a 10.1 libthr in your 10.2 image to see whether just swapping out libthr fixes the issue. If it does, that at least helps narrow down where the problem lies.
(In reply to Robert Blayzor from comment #65)
If vn_io_fault_enable is set to 0, I do not see how the 'kpf 1' messages can occur. Could you please double-check that the setting is right? Also, do you have any third-party modules loaded? Did you try upgrading the hypervisor?
I will double-check the sysctl and set it if that is not the case; it may not have persisted across the reboot into the last kernel. We want to avoid upgrading major revisions of the hypervisor, as we have other VMs without an NFS diskless root on it that have no problems. The only difference between these hosts and the other hosts on the same hypervisor is that these boot with no local (virtual) disk for the root FS, only tmp/swap. We have no third-party modules loaded.
With the recent patches I still get occasional unexpected crashes. Here is all I got on the last one:

Mar 29 05:26:18.617 mxs0.alb kernel: pid 97296 kpf 2
Mar 29 05:26:18.617 mxs0.alb kernel: pid 97296 kpf 2
Mar 29 05:26:18.617 mxs0.alb kernel: pid 97296 kpf 2
Mar 29 05:26:18.617 mxs0.alb kernel: pid 97296 kpf 2
Mar 29 05:26:18.825 mxs0.alb kernel: <4>Failed to write core file for process exim (error 14)
Mar 29 05:26:18.825 mxs0.alb kernel: <6>pid 97296 (exim), uid 26: exited on signal 11

(gdb) bt full
#0  __thr_fork () at /.amd_mnt/juno/pub/FreeBSD/10-STABLE/lib/libthr/thread/thr_fork.c:186
        rtld_locks = Cannot access memory at address 0x7fffffffc7d0
Current language:  auto; currently minimal

So this still looks like a libthr issue. I will see if I can grab a /lib/libthr.so.3 from a FreeBSD 10.1 install I have locally and see whether I still get these problems.
I copied /lib/libthr.so.3 from a 10.1 physical server I have and rebooted the diskless VMs in question. That did not go over well. The servers booted OK, but processes that use the library would start and then just hang. I'm not sure what the proper procedure would be here, but if copying /lib/libthr.so.3 into place and rebooting should have worked, it did not. Needless to say, I had to roll that change back.
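A less invasive way to run that experiment, suggested here only as an untested alternative, is rtld's libmap.conf(5), which remaps a library name at load time instead of overwriting the system copy. The replacement name below is hypothetical; the file it names must sit in the normal library search path:

# /etc/libmap.conf -- redirect libthr.so.3 to a copy taken from 10.1.
libthr.so.3 libthr-10.1.so.3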
(In reply to Robert Blayzor from comment #72)
If the only reports during the failure were the 'kpf 2' lines, please try the new debugging patch. It still requires setting debug.vn_io_fault_enable to zero. As before, I need the dmesg output of the failing episode. The patch is against HEAD, and it would be extra good if you tried this on a HEAD kernel; there is no need to update usermode.
Created attachment 168808 [details] Debugging patch to track the origin of OBJ_DEAD
Can this be hand-patched against 10-STABLE? That's what our SVN checkout and install are based on. Or is there some other stuff in HEAD that is required? We are pre-production on this, so I'm trying to limit the pain of installing bleeding-edge changes.
(In reply to Robert Blayzor from comment #76)
There are some changes in HEAD which might be relevant to the area I am trying to drill into, i.e. it might be beneficial for you to try. Note that only the kernel is needed; I already mentioned that you do not need to update usermode. Anyway, I moved the patch to stable/10, but I cannot even compile-test it right now, so if any compilation issues are there, please fix them.
Created attachment 168832 [details] OBJ_DEAD tracker for stable/10
Here you go...

9718 vm_fault OBJ_DEAD 0xfffff8000afb4330
pid 9718 kpf 2
9718 vm_fault OBJ_DEAD 0xfffff8000afb4330
pid 9718 kpf 2
9718 vm_fault OBJ_DEAD 0xfffff8000afb4330
pid 9718 kpf 2
9718 vm_fault OBJ_DEAD 0xfffff8000afb4330
pid 9718 kpf 2
Failed to write core file for process dovecot (error 14)
pid 9718 (dovecot), uid 0: exited on signal 11

(gdb) bt full
#0  0x000000000040a85a in service_process_create ()
No symbol table info available.
Cannot access memory at address 0x7fffffffae18

procstat -v dovecot.core
  PID              START                END PRT  RES PRES REF SHD FL TP PATH
9718 0x400000 0x411000 r-x 17 17 2 0 CN-- vn
9718 0x610000 0x611000 rw- 1 1 2 1 CN-- df
9718 0x800610000 0x80062e000 r-x 30 30 33 0 CN-- vn /libexec/ld-elf.so.1
9718 0x80062e000 0x80065c000 rw- 36 36 2 1 CN-- df
9718 0x80082d000 0x80082f000 rw- 2 2 2 0 CN-- df
9718 0x80082f000 0x800922000 r-x 171 177 35 17 CN-- vn /usr/local/lib/dovecot/libdovecot.so.0.0.0
9718 0x800922000 0x800b22000 --- 0 0 2 0 CN-- df
9718 0x800b22000 0x800b28000 rw- 6 0 2 1 CN-- vn /usr/local/lib/dovecot/libdovecot.so.0.0.0
9718 0x800b28000 0x800b2a000 rw- 2 2 2 1 CN-- df
9718 0x800b2a000 0x800ca2000 r-x 376 387 64 31 CN-- vn /lib/libc.so.7
9718 0x800ca2000 0x800ea1000 --- 0 0 2 0 CN-- df
9718 0x800ea1000 0x800ead000 rw- 12 0 2 0 CN-- vn /lib/libc.so.7
9718 0x800ead000 0x800ed7000 rw- 13 13 2 0 CN-- df
9718 0x801000000 0x801800000 rw- 94 0 1 0 C--- df
9718 0x7ffffffdf000 0x7ffffffff000 rw- 8 0 1 0 C--D df
9718 0x7ffffffff000 0x800000000000 r-x 1 1 35 0 ---- ph
(In reply to Robert Blayzor from comment #79)
The page fault was not handled because some object in the shadow chain which backs the faulted region has the OBJ_DEAD flag set. I do not see why this state would be valid for the object in question. There is another PR, 204764, where the same flag is set on a vnode object, but the manifestation of the problem is different: instead of vm_fault, the object was found by vnode_create_vobject(), which sleeps forever waiting for the object termination to finish. Since the flag was not set on the normal termination path, termination does not happen and the lookup is stuck forever. Right now I have no idea why this happens. Either we have a bug in the VM spuriously setting the flag (but code reading does not support this possibility), or some random memory access happens and corrupts the vm object memory. I do not know. I attached some additions to the debugging patch, which both assert that the object is write-locked when object->flags is modified and slightly change the layout of struct vm_object. So if the issue is a VM bug, most likely unlocked modifications, it could be caught. Or, if the problem is memory corruption, it should migrate to another place. Still, this is only speculation. Please make sure that you have INVARIANTS and WITNESS enabled in your kernel config.
Created attachment 168870 [details] OBJ_DEAD tracker for stable/10, v2: assert that the object is locked, change layout
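For context, the helper being instrumented has roughly the following shape; the patch's addition is the locking assertion (a sketch reconstructed from the description above, not the exact patched source):

/*
 * Sketch: every modification of object->flags must be done with the
 * object's write lock held; the v2 patch asserts this on each update.
 */
static __inline void
vm_object_set_flag(vm_object_t object, u_short bits)
{

	VM_OBJECT_ASSERT_WLOCKED(object);
	object->flags |= bits;
}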
I'm getting a compile error with the latest patch:

In file included from /.amd_mnt/juno/pub/FreeBSD/10-STABLE/sys/compat/freebsd32/freebsd32_misc.c:95:
/.amd_mnt/juno/pub/FreeBSD/10-STABLE/sys/vm/vm_object.h:257:3: error: implicit declaration of function 'rw_assert' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
        VM_OBJECT_ASSERT_WLOCKED(object);
        ^
/.amd_mnt/juno/pub/FreeBSD/10-STABLE/sys/vm/vm_object.h:229:2: note: expanded from macro 'VM_OBJECT_ASSERT_WLOCKED'
        rw_assert(&(object)->lock, RA_WLOCKED)
        ^
/.amd_mnt/juno/pub/FreeBSD/10-STABLE/sys/vm/vm_object.h:257:3: error: use of undeclared identifier 'RA_WLOCKED'
/.amd_mnt/juno/pub/FreeBSD/10-STABLE/sys/vm/vm_object.h:229:29: note: expanded from macro 'VM_OBJECT_ASSERT_WLOCKED'
        rw_assert(&(object)->lock, RA_WLOCKED)
Created attachment 168872 [details] OBJ_DEAD tracker for stable/10, v2.1 This one builds, but might not boot. Ensure that you have WITNESS and INVARIANTS.
(In reply to Konstantin Belousov from comment #80)
I wonder if this is related to the MFC I did in r284100 of the changes to keep VM objects on the VM list. It doesn't alter handling of the DEAD flag, but it does affect the object type. It was merged to 10 between 10.1 and 10.2, so it would fit the time window.
This builds without the following:

options INVARIANTS
options WITNESS

But when I add those two build options, building the kernel fails:

sys_machdep.o:/.amd_mnt/juno/pub/FreeBSD/10-STABLE/sys/amd64/amd64/sys_machdep.c:514: more undefined references to `__mtx_assert' follow
intel_idpgtbl.o: In function `ctx_idmap_nextlvl':
/.amd_mnt/juno/pub/FreeBSD/10-STABLE/sys/x86/iommu/intel_idpgtbl.c:117: undefined reference to `__rw_assert'
intel_idpgtbl.o: In function `ctx_map_buf_locked':
/.amd_mnt/juno/pub/FreeBSD/10-STABLE/sys/x86/iommu/intel_idpgtbl.c:424: undefined reference to `__rw_assert'
intel_idpgtbl.o: In function `ctx_unmap_buf_locked':
/.amd_mnt/juno/pub/FreeBSD/10-STABLE/sys/x86/iommu/intel_idpgtbl.c:604: undefined reference to `__rw_assert'
intel_idpgtbl.o: In function `ctx_free_pgtbl':
/.amd_mnt/juno/pub/FreeBSD/10-STABLE/sys/x86/iommu/intel_idpgtbl.c:708: undefined reference to `__rw_assert'
/.amd_mnt/juno/pub/FreeBSD/10-STABLE/sys/x86/iommu/intel_idpgtbl.c:718: undefined reference to `__rw_assert'
intel_idpgtbl.o:/.amd_mnt/juno/pub/FreeBSD/10-STABLE/sys/x86/iommu/intel_idpgtbl.c:356: more undefined references to `__rw_assert' follow
intel_qi.o: In function `dmar_qi_invalidate_locked':
/.amd_mnt/juno/pub/FreeBSD/10-STABLE/sys/x86/iommu/intel_qi.c:222: undefined reference to `__mtx_assert'
intel_qi.o: In function `dmar_qi_advance_tail':
/.amd_mnt/juno/pub/FreeBSD/10-STABLE/sys/x86/iommu/intel_qi.c:99: undefined reference to `__mtx_assert'
intel_qi.o: In function `dmar_qi_ensure':
/.amd_mnt/juno/pub/FreeBSD/10-STABLE/sys/x86/iommu/intel_qi.c:109: undefined reference to `__mtx_assert'
intel_qi.o: In function `dmar_qi_advance_tail':
/.amd_mnt/juno/pub/FreeBSD/10-STABLE/sys/x86/iommu/intel_qi.c:99: undefined reference to `__mtx_assert'
intel_qi.o: In function `dmar_qi_emit':
/.amd_mnt/juno/pub/FreeBSD/10-STABLE/sys/x86/iommu/intel_qi.c:142: undefined reference to `__mtx_assert'
intel_qi.o:/.amd_mnt/juno/pub/FreeBSD/10-STABLE/sys/x86/iommu/intel_qi.c:178: more undefined references to `__mtx_assert' follow
*** Error code 1

Stop.
make[2]: stopped in /opt/obj/.amd_mnt/juno/pub/FreeBSD/10-STABLE/sys/ESXPXE
*** Error code 1

Stop.
make[1]: stopped in /.amd_mnt/juno/pub/FreeBSD/10-STABLE
*** Error code 1

Stop.
make: stopped in /.amd_mnt/juno/pub/FreeBSD/10-STABLE
(In reply to Robert Blayzor from comment #85)
You also need options INVARIANT_SUPPORT. I assumed this was common knowledge, sorry.
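For reference, the full set of debugging options in the kernel config file then reads:

options INVARIANTS
options INVARIANT_SUPPORT
options WITNESS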
(In reply to John Baldwin from comment #84)
Thank you for the pointer; it is something for me to keep in mind. But I do not see how it would be directly related: OBJT_DEAD and OBJ_DEAD are two different things. In the other PR's debugging data, the OBJ_DEAD flag is set, but the object itself is fine and not obliterated. The previous debugging patch's results make me confident that the OBJ_DEAD flag was not set through vm_object_set_flag(), while the source code only sets OBJ_DEAD via calls to that helper.
Just a follow-up on this: the last patch has been running for a few days and I've not had an unexpected crash yet. I'm not saying it won't happen; it's not unusual for us not to see this for several days, and then suddenly we may start getting crashes at a rate of one or more per day. Just following up as we continue to monitor this...
Not sure if this is related or part of the problem, but this is the first time I've seen the following in the kernel messages on any of the VMs...

lock order reversal:
 1st 0xfffff80009d9e240 ufs (ufs) @ /.amd_mnt/juno/pub/FreeBSD/10-STABLE/sys/kern/vfs_subr.c:2253
 2nd 0xfffffe007c5bef50 bufwait (bufwait) @ /.amd_mnt/juno/pub/FreeBSD/10-STABLE/sys/ufs/ffs/ffs_vnops.c:262
 3rd 0xfffff800117edd50 ufs (ufs) @ /.amd_mnt/juno/pub/FreeBSD/10-STABLE/sys/kern/vfs_subr.c:2253
KDB: stack backtrace:
#0 0xffffffff80484260 at kdb_backtrace+0x60
#1 0xffffffff804a30d4 at witness_checkorder+0xe24
#2 0xffffffff8042b98a at __lockmgr_args+0x9ea
#3 0xffffffff806818f4 at ffs_lock+0x84
#4 0xffffffff8073dc2c at VOP_LOCK1_APV+0xfc
#5 0xffffffff805057da at _vn_lock+0xaa
#6 0xffffffff804f6de7 at vget+0x67
#7 0xffffffff804ea601 at vfs_hash_get+0xe1
#8 0xffffffff8067cf00 at ffs_vgetf+0x40
#9 0xffffffff80674bf0 at softdep_sync_buf+0xac0
#10 0xffffffff80682656 at ffs_syncvnode+0x286
#11 0xffffffff80659a5e at ffs_truncate+0x6ae
#12 0xffffffff8068987a at ufs_direnter+0x81a
#13 0xffffffff806926a5 at ufs_makeinode+0x575
#14 0xffffffff8068e58d at ufs_create+0x2d
#15 0xffffffff8073b5c1 at VOP_CREATE_APV+0xf1
#16 0xffffffff80504e93 at vn_open_cred+0x2c3
#17 0xffffffff804fdf8f at kern_openat+0x26f
In 12 days of running since the last debug patch, I've seen only the below on a couple of the VMs...

lock order reversal:
 1st 0xfffff80009e5c418 ufs (ufs) @ /.amd_mnt/juno/pub/FreeBSD/10-STABLE/sys/kern/vfs_subr.c:2253
 2nd 0xfffffe007c5c37f8 bufwait (bufwait) @ /.amd_mnt/juno/pub/FreeBSD/10-STABLE/sys/ufs/ffs/ffs_vnops.c:262
 3rd 0xfffff80064499418 ufs (ufs) @ /.amd_mnt/juno/pub/FreeBSD/10-STABLE/sys/kern/vfs_subr.c:2253
KDB: stack backtrace:
#0 0xffffffff80484260 at kdb_backtrace+0x60
#1 0xffffffff804a30d4 at witness_checkorder+0xe24
#2 0xffffffff8042b98a at __lockmgr_args+0x9ea
#3 0xffffffff806818f4 at ffs_lock+0x84
#4 0xffffffff8073dc2c at VOP_LOCK1_APV+0xfc
#5 0xffffffff805057da at _vn_lock+0xaa
#6 0xffffffff804f6de7 at vget+0x67
#7 0xffffffff804ea601 at vfs_hash_get+0xe1
#8 0xffffffff8067cf00 at ffs_vgetf+0x40
#9 0xffffffff80674bf0 at softdep_sync_buf+0xac0
#10 0xffffffff80682656 at ffs_syncvnode+0x286
#11 0xffffffff80659a5e at ffs_truncate+0x6ae
#12 0xffffffff8068987a at ufs_direnter+0x81a
#13 0xffffffff806926a5 at ufs_makeinode+0x575
#14 0xffffffff8068e58d at ufs_create+0x2d
#15 0xffffffff8073b5c1 at VOP_CREATE_APV+0xf1
#16 0xffffffff80504e93 at vn_open_cred+0x2c3
#17 0xffffffff804fdf8f at kern_openat+0x26f
dmesg:

14157 vm_fault OBJ_DEAD 0xfffff8004d9b74c8
pid 14157 kpf 2
pid 14157 comm exim: signal 11 err 7 code 2 type 12 addr 0x764b34 rsp 0x7fffffffc830 rip 0x428c9e <89 04 25 34 4b 76 00 c7>
pid 14157 (exim), uid 26: exited on signal 11 (core dumped)

procstat -v exim.core
  PID              START                END PRT  RES PRES REF SHD FL TP PATH
14157 0x400000 0x560000 r-x 316 327 3 1 CN-- vn /usr/local/sbin/exim
14157 0x760000 0x76a000 rw- 10 0 1 0 C--- vn /usr/local/sbin/exim
14157 0x76a000 0x779000 rw- 5 0 1 0 C--- df
14157 0x800760000 0x80077e000 r-x 30 30 32 0 CN-- vn /libexec/ld-elf.so.1
14157 0x80077e000 0x800785000 rw- 7 7 2 0 CN-- df
14157 0x800787000 0x8007b7000 rw- 40 40 2 1 CN-- df
14157 0x80097d000 0x80097f000 rw- 2 2 2 0 CN-- df
14157 0x80097f000 0x80098d000 r-x 14 15 12 3 CN-- vn
14157 0x80098d000 0x800b8d000 --- 0 0 2 0 CN-- df
14157 0x800b8d000 0x800b8e000 rw- 1 0 2 0 CN-- vn
14157 0x800b8e000 0x800b9f000 rw- 0 0 0 0 ---- --
14157 0x800b9f000 0x800bc8000 r-x 41 41 19 6 CN-- vn
14157 0x800bc8000 0x800dc7000 --- 0 0 2 0 CN-- df
14157 0x800dc7000 0x800dc8000 rw- 1 0 2 0 CN-- vn
14157 0x800dc8000 0x800dd8000 r-x 16 17 26 10 CN-- vn
14157 0x800dd8000 0x800fd8000 --- 0 0 2 0 CN-- df
14157 0x800fd8000 0x800fd9000 rw- 1 0 2 0 CN-- vn
14157 0x800fd9000 0x800fdb000 rw- 0 0 0 0 ---- --
14157 0x800fdb000 0x800ff3000 r-x 24 26 3 1 CN-- vn
14157 0x800ff3000 0x8011f3000 --- 0 0 2 0 CN-- df
14157 0x8011f3000 0x8011f5000 rw- 2 0 2 0 CN-- vn
14157 0x8011f5000 0x80120d000 r-x 24 25 24 8 CN-- vn /lib/libthr.so.3
14157 0x80120d000 0x80140d000 --- 0 0 2 0 CN-- df
14157 0x80140d000 0x80140e000 rw- 1 0 1 0 C--- vn /lib/libthr.so.3
14157 0x80140e000 0x80141a000 rw- 12 0 1 0 C--- df
14157 0x80141a000 0x801445000 r-x 38 41 3 1 CN-- vn
14157 0x801445000 0x801645000 --- 0 0 2 0 CN-- df
14157 0x801645000 0x801648000 rw- 3 0 2 0 CN-- vn
14157 0x801648000 0x80181d000 r-x 424 432 10 2 CN-- vn
14157 0x80181d000 0x801a1c000 --- 0 0 2 0 CN-- df
14157 0x801a1c000 0x801a25000 rw- 9 0 2 0 CN-- vn
14157 0x801a25000 0x801a88000 r-x 84 94 9 4 CN-- vn /usr/lib/libssl.so.7
14157 0x801a88000 0x801c88000 --- 0 0 2 0 CN-- df
14157 0x801c88000 0x801c92000 rw- 10 0 2 0 CN-- vn /usr/lib/libssl.so.7
14157 0x801c92000 0x801e5f000 r-x 461 501 13 6 CN-- vn /lib/libcrypto.so.7
14157 0x801e5f000 0x80205f000 --- 0 0 2 0 CN-- df
14157 0x80205f000 0x802087000 rw- 40 0 2 0 CN-- vn /lib/libcrypto.so.7
14157 0x802087000 0x802089000 rw- 1 1 2 0 CN-- df
14157 0x802089000 0x8020fe000 r-x 45 45 8 3 CN-- vn
14157 0x8020fe000 0x8022fd000 --- 0 0 2 0 CN-- df
14157 0x8022fd000 0x8022fe000 rw- 1 0 2 0 CN-- vn
14157 0x8022fe000 0x802476000 r-x 376 387 56 24 CN-- vn /lib/libc.so.7
14157 0x802476000 0x802675000 --- 0 0 2 0 CN-- df
14157 0x802675000 0x802681000 rw- 12 0 1 0 C--- vn /lib/libc.so.7
14157 0x802681000 0x8026ab000 rw- 11 0 1 0 C--- df
14157 0x8026ab000 0x8026b5000 r-x 10 11 8 3 CN-- vn /usr/local/lib/libintl.so.8.1.4
14157 0x8026b5000 0x8028b5000 --- 0 0 2 0 CN-- df
14157 0x8028b5000 0x8028b6000 rw- 1 0 2 0 CN-- vn /usr/local/lib/libintl.so.8.1.4
14157 0x802c00000 0x803400000 rw- 222 0 1 0 C--- df
14157 0x7fffdfffe000 0x7fffdffff000 --- 0 0 0 0 ---- --
14157 0x7ffffffdf000 0x7ffffffff000 rw- 10 0 1 0 C--D df
14157 0x7ffffffff000 0x800000000000 r-x 1 1 34 0 ---- ph

(gdb) bt full
#0  handle_smtp_call (listen_sockets=0x8030650b8, listen_socket_count=2, accept_socket=5, accepted=0x7fffffffcd38) at daemon.c:375
        i = 32767
        queue_only_reason = 0
        old_pool = 0
        save_debug_selector = 0
        local_queue_only = 32767
        session_local_queue_only = -4800
        act = {__sigaction_u = {__sa_handler = 0, __sa_sigaction = 0}, sa_flags = 0, sa_mask = {__bits = 0x7fffffffc91c}}
        pid = 0
        interface_sockaddr = {v4 = {sin_len = 28 '\034', sin_family = 28 '\034', sin_port = 6400, sin_addr = {s_addr = 0}, sin_zero = 0x7fffffffc9a0 "&\aðX\001\020"}, v6 = {sin6_len = 28 '\034', sin6_family = 28 '\034', sin6_port = 6400, sin6_flowinfo = 0, sin6_addr = {__u6_addr = {__u6_addr8 = 0x7fffffffc9a0 "&\aðX\001\020", __u6_addr16 = 0x7fffffffc9a0, __u6_addr32 = 0x7fffffffc9a0}}, sin6_scope_id = 0}, v0 = {sa_len = 28 '\034', sa_family = 28 '\034', sa_data = 0x7fffffffc99a ""}}
        ifsize = 28
        dup_accept_socket = 7
        max_for_this_host = 10
        wfsize = 100
        wfptr = 22
        use_log_write_selector = 1980
        whofrom = (uschar *) 0x80308a060 "[2607:f058:110:1::f:0]"
        reset_point = (void *) 0x80308a030
#1  0x0000000000427c55 in daemon_go () at daemon.c:2040
        accept_socket = 5
        sk = 0
        max_socket = 4
        lcount = 0
        select_errno = 4
        select_failed = 0
        select_listen = {__fds_bits = 0x7fffffffcc40}
        accepted = {sin6_len = 28 '\034', sin6_family = 28 '\034', sin6_port = 37545, sin6_flowinfo = 0, sin6_addr = {__u6_addr = {__u6_addr8 = 0x7fffffffcd40 "&\aðX\001\020", __u6_addr16 = 0x7fffffffcd40, __u6_addr32 = 0x7fffffffcd40}}, sin6_scope_id = 0}
        len = 28
        pid = 14053
        pw = (struct passwd *) 0x802698328
        listen_sockets = (int *) 0x8030650b8
        listen_socket_count = 2
        addresses = (ip_address_item *) 0x803065038
        last_connection_time = 1460982877
#2  0x000000000044af25 in main (argc=3, cargv=0x7fffffffed48) at exim.c:4719
        rsopts = 0x761b00
        argv = (uschar **) 0x7fffffffed48
        arg_receive_timeout = -1
        arg_smtp_receive_timeout = -1
        arg_error_handling = 0
        filter_sfd = -1
        filter_ufd = -1
        group_count = 2
        i = 0
        rv = 0
        list_queue_option = 0
        msg_action = 0
        msg_action_arg = -1
        namelen = 20
        queue_only_reason = 0
        perl_start_option = 0
        recipients_arg = 3
        sender_address_domain = 0
        test_retry_arg = -1
        test_rewrite_arg = -1
        arg_queue_only = 0
        bi_option = 0
        checking = 0
        count_queue = 0
        expansion_test = 0
        extract_recipients = 0
        flag_G = 0
        flag_n = 0
        forced_delivery = 0
        f_end_dot = 0
        deliver_give_up = 0
        list_queue = 0
        list_options = 0
        local_queue_only = 7868416
        more = 1
        one_msg_action = 0
        queue_only_set = 0
        receiving_message = 0
        sender_ident_set = 0
        session_local_queue_only = -5288
        unprivileged = 0
        removed_privilege = 0
        usage_wanted = 0
        verify_address_mode = 0
        verify_as_sender = 0
        version_printed = 0
        alias_arg = (uschar *) 0x0
        called_as = (uschar *) 0x554616 ""
        cmdline_syslog_name = (uschar *) 0x0
        start_queue_run_id = (uschar *) 0x0
        stop_queue_run_id = (uschar *) 0x0
        expansion_test_message = (uschar *) 0x0
        ftest_domain = (uschar *) 0x0
        ftest_localpart = (uschar *) 0x0
        ftest_prefix = (uschar *) 0x0
        ftest_suffix = (uschar *) 0x0
        log_oneline = (uschar *) 0x0
        malware_test_file = (uschar *) 0x0
        real_sender_address = (uschar *) 0x7fffffffeb58 " \005v"
        originator_home = (uschar *) 0x803065018 "/root"
        sz = 140737488350168
        reset_point = (void *) 0x80231a310
        pw = (struct passwd *) 0x802698328
        statbuf = {st_dev = 1895890688, st_ino = 66, st_mode = 8592, st_nlink = 1, st_uid = 1001, st_gid = 4, st_rdev = 66, st_atim = {tv_sec = 1460632918, tv_nsec = 0}, st_mtim = {tv_sec = 1460632918, tv_nsec = 0}, st_ctim = {tv_sec = 1460632918, tv_nsec = 0}, st_size = 0, st_blocks = 0, st_blksize = 4096, st_flags = 0, st_gen = 0, st_lspare = 0, st_birthtim = {tv_sec = -1, tv_nsec = 0}}
        passed_qr_pid = 0
        passed_qr_pipe = -1
        group_list = 0x7fffffffd8f0
        info_flag = CMDINFO_NONE
        info_stdout = 0
(In reply to Robert Blayzor from comment #91)
This happens with the _latest_ debugging patch applied, am I right? Please run 'kgdb kernel.debug /dev/mem' on the machine, and then at the kgdb prompt do:

p *(struct vm_object *)0xfffff8004d9b74c8
kgdb kernel.debug /dev/mem
This GDB was configured as "amd64-marcel-freebsd"...
kernel.debug: No such file or directory.
Can't open a vmcore without a kernel

I assume I need to have a kernel compiled with something else?
(In reply to Robert Blayzor from comment #93)
You need the kernel and kernel.debug files from the kernel installation you are running. Erm, wait: in your case the issue is the SEGFAULT, not the hang, right? The object is probably already destroyed by then.
The processes are actually dying with SIG11 (core dump). They are not hanging.
Similar segfault today...

844 vm_fault OBJ_DEAD 0xfffff8000cf704c8
pid 844 kpf 2
pid 844 comm exim: signal 11 err 7 code 2 type 12 addr 0x803006434 rsp 0x7fffffffc7b0 rip 0x801201a67 <ff 48 34 74 02 5d c3 48>
844 vm_fault OBJ_DEAD 0xfffff8000cf704c8
pid 844 kpf 2
Failed to write core file for process exim (error 14)
pid 844 (exim), uid 26: exited on signal 11

(gdb) bt full
#0  _thr_signal_unblock (curthread=0x803006400) at /.amd_mnt/juno/pub/FreeBSD/10-STABLE/lib/libthr/thread/thr_sig.c:107
No locals.
Cannot access memory at address 0x7fffffffc7b8
Current language:  auto; currently minimal

procstat -v exim.core
  PID              START                END PRT  RES PRES REF SHD FL TP PATH
844 0x400000 0x560000 r-x 316 327 3 1 CN-- vn /usr/local/sbin/exim
844 0x760000 0x76a000 rw- 10 0 2 1 CN-- vn /usr/local/sbin/exim
844 0x76a000 0x779000 rw- 5 5 2 1 CN-- df
844 0x800760000 0x80077e000 r-x 30 30 34 0 CN-- vn /libexec/ld-elf.so.1
844 0x80077e000 0x800785000 rw- 7 7 2 0 CN-- df
844 0x800787000 0x8007b7000 rw- 40 40 2 1 CN-- df
844 0x80097d000 0x80097f000 rw- 2 2 2 0 CN-- df
844 0x80097f000 0x80098d000 r-x 14 15 14 4 CN-- vn /lib/libcrypt.so.5
844 0x80098d000 0x800b8d000 --- 0 0 2 0 CN-- df
844 0x800b8d000 0x800b8e000 rw- 1 0 2 0 CN-- vn /lib/libcrypt.so.5
844 0x800b8e000 0x800b9f000 rw- 0 0 0 0 ---- --
844 0x800b9f000 0x800bc8000 r-x 41 41 19 6 CN-- vn /lib/libm.so.5
844 0x800bc8000 0x800dc7000 --- 0 0 2 0 CN-- df
844 0x800dc7000 0x800dc8000 rw- 1 0 2 0 CN-- vn /lib/libm.so.5
844 0x800dc8000 0x800dd8000 r-x 16 17 26 10 CN-- vn /lib/libutil.so.9
844 0x800dd8000 0x800fd8000 --- 0 0 2 0 CN-- df
844 0x800fd8000 0x800fd9000 rw- 1 0 2 0 CN-- vn /lib/libutil.so.9
844 0x800fd9000 0x800fdb000 rw- 0 0 0 0 ---- --
844 0x800fdb000 0x800ff3000 r-x 24 26 3 1 CN-- vn /usr/local/lib/libspf2.so.2.1.0
844 0x800ff3000 0x8011f3000 --- 0 0 2 0 CN-- df
844 0x8011f3000 0x8011f5000 rw- 2 0 2 1 CN-- vn /usr/local/lib/libspf2.so.2.1.0
844 0x8011f5000 0x80120d000 r-x 24 25 28 10 CN-- vn /lib/libthr.so.3
844 0x80120d000 0x80140d000 --- 0 0 2 0 CN-- df
844 0x80140d000 0x80140e000 rw- 1 0 2 1 CN-- vn /lib/libthr.so.3
844 0x80140e000 0x80141a000 rw- 12 0 1 0 C--- df
844 0x80141a000 0x801445000 r-x 38 41 5 2 CN-- vn /usr/local/lib/libpq.so.5
844 0x801445000 0x801645000 --- 0 0 2 0 CN-- df
844 0x801645000 0x801648000 rw- 3 0 2 0 CN-- vn /usr/local/lib/libpq.so.5
844 0x801648000 0x80181d000 r-x 424 432 10 2 CN-- vn /usr/local/lib/perl5/5.22/mach/CORE/libperl.so.5.22.1
844 0x80181d000 0x801a1c000 --- 0 0 2 0 CN-- df
844 0x801a1c000 0x801a25000 rw- 9 0 2 0 CN-- vn /usr/local/lib/perl5/5.22/mach/CORE/libperl.so.5.22.1
844 0x801a25000 0x801a88000 r-x 86 96 11 5 CN-- vn /usr/lib/libssl.so.7
844 0x801a88000 0x801c88000 --- 0 0 2 0 CN-- df
844 0x801c88000 0x801c92000 rw- 10 0 2 0 CN-- vn /usr/lib/libssl.so.7
844 0x801c92000 0x801e5f000 r-x 461 501 15 7 CN-- vn /lib/libcrypto.so.7
844 0x801e5f000 0x80205f000 --- 0 0 2 0 CN-- df
844 0x80205f000 0x802087000 rw- 40 0 2 0 CN-- vn /lib/libcrypto.so.7
844 0x802087000 0x802089000 rw- 1 1 2 0 CN-- df
844 0x802089000 0x8020fe000 r-x 45 45 8 3 CN-- vn /usr/local/lib/libpcre.so.1.2.5
844 0x8020fe000 0x8022fd000 --- 0 0 2 0 CN-- df
844 0x8022fd000 0x8022fe000 rw- 1 0 2 0 CN-- vn /usr/local/lib/libpcre.so.1.2.5
844 0x8022fe000 0x802476000 r-x 376 387 60 26 CN-- vn /lib/libc.so.7
844 0x802476000 0x802675000 --- 0 0 2 0 CN-- df
844 0x802675000 0x802681000 rw- 12 0 2 1 CN-- vn /lib/libc.so.7
844 0x802681000 0x8026ab000 rw- 12 12 2 1 CN-- df
844 0x8026ab000 0x8026b5000 r-x 10 11 10 4 CN-- vn /usr/local/lib/libintl.so.8.1.4
844 0x8026b5000 0x8028b5000 --- 0 0 2 0 CN-- df
844 0x8028b5000 0x8028b6000 rw- 1 0 2 0 CN-- vn /usr/local/lib/libintl.so.8.1.4
844 0x802c00000 0x803400000 rw- 191 0 1 0 C--- df
844 0x7fffdfffe000 0x7fffdffff000 --- 0 0 0 0 ---- --
844 0x7ffffffdf000 0x7ffffffff000 rw- 10 0 1 0 C--D df
844 0x7ffffffff000 0x800000000000 r-x 1 1 36 0 ---- ph
https://reviews.freebsd.org/D6085
Sounds related for sure. Is there an official PR for that issue that this one could be marked a duplicate of, or is this it? Or are we still looking at two different problems?
(In reply to Robert Blayzor from comment #98)
I have no idea why your debugging data does not contain backtraces of the OBJ_DEAD acquisition points. Had it, I would have written the patch two months earlier. I am highly confident that the issues fixed in the patch from the review are the same as your SIGSEGV.
It certainly seems like it. I'll work on patching this in and giving it a go. What's interesting is how this issue creeps up, and why. I've only ever had this problem with diskless servers that mount / read-only over NFS.
In the review, which release is this patch for? I tried patching it against 10-STABLE and it failed.
I assume this is from CURRENT. Is there any way this can be MFC'd to 10-STABLE?
Created attachment 169679 [details] Patch from D6085, adopted for stable/10.
My first attempt at this patch was against 10-STABLE. I have a 10.3-RELEASE userland and built a 10-STABLE kernel with this patch. When I boot the kernel, it just hangs after the normal kernel boot messages, right after:

NFS ROOT: 10.1.60.211:/vol/v10

I will do a fresh checkout of 10-STABLE, build a kernel against that, and attempt to reboot to see whether it's the patch or the STABLE kernel. 10.3-RELEASE still builds fine.
I have tried the patch for stable/10, and it causes the system to hang at boot, at what appears to be the start of init (but I'm not sure). The last message I get is the kernel message about root being mounted from NFS, and my vmx0 boot interface comes up. The system is pingable but hangs at this point. If I back out the patch, it works fine. I've tried the patch against both stable/10 and releng/10.3, both of which seem to patch OK.
(In reply to Robert Blayzor from comment #105)
Break into ddb and do 'ps', then 'alltrace', for the hung boot.
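(Getting into ddb at that point assumes a kernel built with options KDB and DDB; from a responsive system one can enter the debugger with 'sysctl debug.kdb.enter=1', or via the console break-to-debugger sequence if that is enabled. Whether either is reachable during this particular hang is uncertain.)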
I did manage to get the patch to work against a pristine checkout of RELENG_10, so I now have several VMs running with the patch in place. I'm not sure what was in stable/10 that was hanging up the boot; perhaps something from a previous patch. Will monitor and update.
It's been longer than average now and I have not run into the processes terminating abnormally with the last patch installed against 10.3. HOWEVER, I have noticed a new issue with the network stack that seems to happen at roughly the same interval. I'm not sure if the two are related or if fixing one problem surfaced another. Basically, the servers are now filling up with TCP connections stuck in the CLOSED state. We end up with thousands of them until connections to the processes simply time out. Sometimes we see kernel messages:

sonewconn: pcb 0xfffff80001b36930: Listen queue overflow: 767 already in queue awaiting acceptance (46 occurrences)
sonewconn: pcb 0xfffff80001b36930: Listen queue overflow: 767 already in queue awaiting acceptance (46 occurrences)
sonewconn: pcb 0xfffff80001b36930: Listen queue overflow: 767 already in queue awaiting acceptance (50 occurrences)
sonewconn: pcb 0xfffff80001b36930: Listen queue overflow: 767 already in queue awaiting acceptance (42 occurrences)
sonewconn: pcb 0xfffff80001b36930: Listen queue overflow: 767 already in queue awaiting acceptance (44 occurrences)
sonewconn: pcb 0xfffff80001b36930: Listen queue overflow: 767 already in queue awaiting acceptance (38 occurrences)
sonewconn: pcb 0xfffff80001b36930: Listen queue overflow: 767 already in queue awaiting acceptance (40 occurrences)
[...]

But not always... Currently I have a server in this state, and I see:

tcp6       0      0 mta1.imap              mta-slb-1.alb1.i.20511 CLOSED
tcp6       0      0 mta1.pop3              mta-slb-1.alb1.i.43879 CLOSED
tcp6       0      0 mta1.imap              mta-slb-0.alb1.i.5259  CLOSED
tcp6       0      0 mta1.pop3              mta-slb-0.alb1.i.12519 CLOSED
tcp6       0      0 mta1.imap              mta-slb-1.alb1.i.1316  CLOSED
tcp6       0      0 mta1.pop3              mta-slb-1.alb1.i.65343 CLOSED
tcp6       0      0 mta1.imap              mta-slb-0.alb1.i.16289 CLOSED
tcp6       0      0 mta1.pop3              mta-slb-0.alb1.i.19215 CLOSED
tcp6      32      0 mta1.sieve             mta-slb-0.alb1.i.19549 CLOSED
tcp6       0      0 mta1.imap              mta-slb-1.alb1.i.49287 CLOSED
tcp6      32      0 mta1.sieve             mta-slb-1.alb1.i.54187 CLOSED
tcp6       0      0 mta1.pop3              mta-slb-1.alb1.i.39767 CLOSED
tcp6       0      0 mta1.imap              mta-slb-0.alb1.i.54366 CLOSED
tcp6       0      0 mta1.pop3              mta-slb-0.alb1.i.47579 CLOSED
tcp6       0      0 mta1.imap              mta-slb-1.alb1.i.48798 CLOSED
tcp6       0      0 mta1.pop3              mta-slb-1.alb1.i.40190 CLOSED
[ 1000's of lines truncated ]

It's not just Dovecot either; we also see several stuck in CLOSED from Exim. So it doesn't look like an application issue. In fact, sockstat shows these stuck sockets no longer associated with any process, i.e.:

?        ?          ?     ?  tcp6   2607:f058:110:2::1:1:143    2607:f058:110:2::f:1:56602
?        ?          ?     ?  tcp6   2607:f058:110:2::1:1:4190   2607:f058:110:2::f:0:32558
?        ?          ?     ?  tcp6   2607:f058:110:2::1:1:110    2607:f058:110:2::f:1:53931
?        ?          ?     ?  tcp6   2607:f058:110:2::1:1:110    2607:f058:110:2::f:0:58671
?        ?          ?     ?  tcp6   2607:f058:110:2::1:1:143    2607:f058:110:2::f:1:58788
?        ?          ?     ?  tcp6   2607:f058:110:2::1:1:143    2607:f058:110:2::f:0:30523
?        ?          ?     ?  tcp6   2607:f058:110:2::1:1:110    2607:f058:10::10:32375
?        ?          ?     ?  tcp6   2607:f058:110:2::1:1:143    2607:f058:110:2::f:0:46131
?        ?          ?     ?  tcp6   2607:f058:110:2::1:1:110    2607:f058:110:2::f:1:50671
?        ?          ?     ?  tcp6   2607:f058:110:2::1:1:143    2607:f058:110:2::f:1:4223
?        ?          ?     ?  tcp6   2607:f058:110:2::1:1:143    2607:f058:110:2::f:1:15773
?        ?          ?     ?  tcp6   2607:f058:110:2::1:1:110    2607:f058:110:2::f:0:26610
?        ?          ?     ?  tcp6   2607:f058:110:2::1:1:4190   2607:f058:110:2::f:0:38765
?        ?          ?     ?  tcp6   2607:f058:110:2::1:1:143    2607:f058:110:2::f:0:42310
?        ?          ?     ?  tcp6   2607:f058:110:2::1:1:143    2607:f058:110:2::f:1:5643
?        ?          ?     ?  tcp6   2607:f058:110:2::1:1:143    2607:f058:110:2::f:0:37143
?        ?          ?     ?  tcp6   2607:f058:110:2::1:1:4190   2607:f058:110:2::f:0:24906
[...] (again, thousands of lines truncated)

netstat -ans -p tcp
tcp:
        8497364 packets sent
                3825626 data packets (484438498 bytes)
                12 data packets (5560 bytes) retransmitted
                1 data packet unnecessarily retransmitted
                0 resends initiated by MTU discovery
                4106401 ack-only packets (0 delayed)
                0 URG only packets
                0 window probe packets
                313890 window update packets
                251435 control packets
        5525333 packets received
                4126773 acks (for 483181992 bytes)
                108333 duplicate acks
                0 acks for unsent data
                3787497 packets (1012422242 bytes) received in-sequence
                151 completely duplicate packets (0 bytes)
                0 old duplicate packets
                0 packets with some dup. data (0 bytes duped)
                0 out-of-order packets (0 bytes)
                0 packets (0 bytes) of data after window
                0 window probes
                7067 window update packets
                6376 packets received after close
                0 discarded for bad checksums
                0 discarded for bad header offset fields
                0 discarded because packet too short
                0 discarded due to memory problems
        7546 connection requests
        315652 connection accepts
        0 bad connection attempts
        0 listen queue overflows
        73753 ignored RSTs in the windows
        323196 connections established (including accepts)
        323132 connections closed (including 1414 drops)
                121869 connections updated cached RTT on close
                121869 connections updated cached RTT variance on close
                0 connections updated cached ssthresh on close
        2 embryonic connections dropped
        4126615 segments updated rtt (of 3726607 attempts)
        12 retransmit timeouts
                0 connections dropped by rexmit timeout
        0 persist timeouts
                0 connections dropped by persist timeout
        14 Connections (fin_wait_2) dropped because of timeout
        19343 keepalive timeouts
                19298 keepalive probes sent
                45 connections dropped by keepalive
        240267 correct ACK header predictions
        792401 correct data packet header predictions
        315653 syncache entries added
                0 retransmitted
                0 dupsyn
                0 dropped
                315652 completed
                0 bucket overflow
                0 cache overflow
                1 reset
                0 stale
                0 aborted
                0 badack
                0 unreach
                0 zone failures
        315653 cookies sent
        0 cookies received
        164 hostcache entries added
                0 bucket overflow
        0 SACK recovery episodes
        0 segment rexmits in SACK recovery episodes
        0 byte rexmits in SACK recovery episodes
        0 SACK options (SACK blocks) received
        0 SACK options (SACK blocks) sent
        0 SACK scoreboard overflow
        0 packets with ECN CE bit set
        0 packets with ECN ECT(0) bit set
        0 packets with ECN ECT(1) bit set
        0 successful ECN handshakes
        0 times ECN reduced the congestion window
        0 packets with valid tcp-md5 signature received
        0 packets with invalid tcp-md5 signature received
        0 packets with tcp-md5 signature mismatch
        0 packets with unexpected tcp-md5 signature received
        0 packets without expected tcp-md5 signature received

If I kill and restart the processes, sometimes that works; sometimes it doesn't and I end up having to reboot the server.
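As background on the sonewconn message: the kernel logs it when a listening socket's accept queue is full, i.e. when the owning process has stopped calling accept(2) faster than connections arrive. A minimal, hypothetical repro sketch follows (a standalone test program, not anything from this thread; the port number and backlog are arbitrary):

#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <err.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	struct sockaddr_in sin;
	int s;

	if ((s = socket(AF_INET, SOCK_STREAM, 0)) == -1)
		err(1, "socket");
	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_len = sizeof(sin);
	sin.sin_port = htons(8080);	/* arbitrary test port */
	sin.sin_addr.s_addr = htonl(INADDR_ANY);
	if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) == -1)
		err(1, "bind");
	if (listen(s, 5) == -1)		/* small backlog on purpose */
		err(1, "listen");
	for (;;)
		pause();		/* never accept(); let the queue fill */
}

Flooding port 8080 of this program with connections should produce the same console message, which suggests the wedged daemons above had simply stopped accepting new connections.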
(In reply to Robert Blayzor from comment #108)
Thank you for the testing. I am confident that your SIGSEGV issue is resolved. I am also sure that the network issue you described is unrelated; please track it in a separate bug report.
(In reply to Konstantin Belousov from comment #109)
> I am also sure that the network issue you described is unrelated;
> please track it in a separate bug report.

Are you referring to "Bug 204764 - Filesystem deadlock, process in vodead state"?
Opened a new PR on the excessive CLOSED socket issue I'm seeing: PR 209471
Bug 209471 was opened for the socket issue. Interested parties may want to look at that; it's starting to look as if these two problems are related.
I have posted an update to Bug 209471. The latest news is that when the problem reported in that bug appears, the process is stuck in "vmf_de", the wait channel introduced by the patch suggested here.
Even with the suggested patch, we now occasionally see processes hang in this state:

dev@mta0 [~] procstat -ta | grep 7893
 7893 100155 dovecot          -                 0  120 sleep   vmf_de
dev@mta0 [~] procstat -ta | grep 7893
 7893 100155 dovecot          -                 1  120 sleep   vmf_de
dev@mta0 [~] procstat -ta | grep 7893
 7893 100155 dovecot          -                 2  120 sleep   vmf_de
dev@mta0 [~] procstat -ta | grep 7893
 7893 100155 dovecot          -                 2  120 sleep   vmf_de
dev@mta0 [~] procstat -ta | grep 7893
 7893 100155 dovecot          -                 3  120 sleep   vmf_de
dev@mta0 [~] procstat -ta | grep 7893
 7893 100155 dovecot          -                 2  120 sleep   vmf_de
dev@mta0 [~] procstat -ta | grep 7893
 7893 100155 dovecot          -                 2  120 sleep   vmf_de

When this happens we usually end up with a process that leaves a lot of detached sockets stuck in the CLOSED state. The process normally becomes unkillable, and only a reboot clears the problem.
Robert, check out https://svnweb.freebsd.org/changeset/base/302063 from Bug 204764 and let us know if it helps any.
Should I leave the patch from comment #103 in place while using the patch/diff from PR 204764?
(In reply to Robert Blayzor from comment #116)
They are complementary. The committed change should fix the vmf_de problem for you.
Tried building a new kernel with the recommended changes; compilation fails with an error:

/sys/vm/vm_meter.c:145:13: error: use of undeclared identifier 'vm_cnt'
            &vm_cnt.v_free_count)
             ^
1 error generated.
*** Error code 1
(In reply to Robert Blayzor from comment #118) vm_cnt was called just cnt in previous branches. Use &cnt.v_free_count.
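As an aside, one hypothetical way to keep a single diff buildable on both branches is to alias the identifier. This shim is only a sketch (the __FreeBSD_version cutoff is approximate, not an exact check), and simply editing the line as suggested works just as well:

#include <sys/param.h>
#include <sys/vmmeter.h>

/*
 * head renamed the global struct vmmeter from "cnt" to "vm_cnt"
 * during 11-CURRENT development, which is why a patch written
 * against head trips the compile error above on stable/10.
 */
#if __FreeBSD_version < 1100000
#define	vm_cnt	cnt
#endif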
A commit references this bug:

Author: kib
Date: Mon Jun 27 21:54:20 UTC 2016
New revision: 302236
URL: https://svnweb.freebsd.org/changeset/base/302236

Log:
  If the vm_fault() handler raced with the vm_object_collapse()
  sleepable scan, iteration over the shadow chain looking for a page
  could find an OBJ_DEAD object. Such state of the mapping is only
  transient, the dead object will be terminated and removed from the
  chain shortly. We must not return KERN_PROTECTION_FAILURE unless the
  object type is changed to OBJT_DEAD in the chain, indicating that
  paging on this address is really impossible. Returning
  KERN_PROTECTION_FAILURE prematurely causes spurious SIGSEGV delivered
  to processes, or kernel accesses to UVA spuriously failing with
  EFAULT.

  If the object with OBJ_DEAD flag is found, only return
  KERN_PROTECTION_FAILURE when object type is already OBJT_DEAD.
  Otherwise, sleep a tick and retry the fault handling.

  Ideally, we would wait until the OBJ_DEAD flag is resolved, e.g. by
  waiting until the paging on this object is finished. But to do so, we
  need to reference the dead object, while vm_object_collapse() insists
  on owning the final reference on the collapsed object. This could be
  fixed by e.g. changing the assert to shared reference release between
  vm_fault() and vm_object_collapse(), but it seems to be too much
  complications for rare boundary condition.

  PR:		204426
  Tested by:	pho
  Reviewed by:	alc
  Sponsored by:	The FreeBSD Foundation
  X-Differential revision: https://reviews.freebsd.org/D6085
  MFC after:	2 weeks
  Approved by:	re (gjb)

Changes:
  head/sys/vm/vm_fault.c
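In code terms, the behavior the log message describes looks roughly like the fragment below, shaped like vm_fault()'s object-chain walk (an illustrative paraphrase, not the literal committed hunk; fs, dead, unlock_and_deallocate() and the RetryFault label follow the conventions of sys/vm/vm_fault.c):

	if ((fs.object->flags & OBJ_DEAD) != 0) {
		/* Only a final OBJT_DEAD type makes the fault hopeless. */
		dead = fs.object->type == OBJT_DEAD;
		unlock_and_deallocate(&fs);
		if (dead)
			return (KERN_PROTECTION_FAILURE);
		/*
		 * Otherwise the object is mid-collapse: sleep a tick
		 * and retry the fault instead of delivering a spurious
		 * SIGSEGV.
		 */
		pause("vmf_de", 1);
		goto RetryFault;
	}

The "vmf_de" wait channel is the same wchan the earlier procstat output showed the wedged dovecot threads sleeping on.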
I've been running with the last set of patches, and so far so good. Then again, it's not unusual for us to go several days before the problem appears. Will report back in one week.
We are still seeing a problem with servers getting hosed up with hundreds and hundreds of sockets in the CLOSED state. I've now experienced this on two different Apache 2.4 servers, and both times it was an almost identical issue.

dmesg:

sonewconn: pcb 0xfffff80001b34000: Listen queue overflow: 767 already in queue awaiting acceptance (34 occurrences)
sonewconn: pcb 0xfffff80001b34000: Listen queue overflow: 767 already in queue awaiting acceptance (39 occurrences)
sonewconn: pcb 0xfffff80001b34000: Listen queue overflow: 767 already in queue awaiting acceptance (36 occurrences)
sonewconn: pcb 0xfffff80001b34000: Listen queue overflow: 767 already in queue awaiting acceptance (37 occurrences)
sonewconn: pcb 0xfffff80001b34000: Listen queue overflow: 767 already in queue awaiting acceptance (42 occurrences)
sonewconn: pcb 0xfffff80001b34000: Listen queue overflow: 767 already in queue awaiting acceptance (38 occurrences)
sonewconn: pcb 0xfffff80001b34000: Listen queue overflow: 767 already in queue awaiting acceptance (39 occurrences)
sonewconn: pcb 0xfffff80001b34000: Listen queue overflow: 767 already in queue awaiting acceptance (36 occurrences)
...

webmail3 [~] netstat -an | grep " CLOSED"
tcp4     113      0 127.0.0.1.80           127.0.0.1.14511        CLOSED
tcp4     113      0 127.0.0.1.80           127.0.0.1.26064        CLOSED
tcp4     113      0 127.0.0.1.80           127.0.0.1.57807        CLOSED
tcp4     113      0 127.0.0.1.80           127.0.0.1.39385        CLOSED
tcp4     113      0 127.0.0.1.80           127.0.0.1.22623        CLOSED
tcp4     113      0 127.0.0.1.80           127.0.0.1.65280        CLOSED
tcp4     113      0 127.0.0.1.80           127.0.0.1.27494        CLOSED
tcp4     113      0 127.0.0.1.80           127.0.0.1.12996        CLOSED
tcp4     113      0 127.0.0.1.80           127.0.0.1.59834        CLOSED
tcp4     113      0 127.0.0.1.80           127.0.0.1.32036        CLOSED
tcp4     113      0 127.0.0.1.80           127.0.0.1.34364        CLOSED
tcp6       0      0 2607:f058:110:3:.80    2607:f058:110:3:.35467 CLOSED
tcp6       0      0 2607:f058:110:3:.80    2607:f058:110:3:.28605 CLOSED
tcp6       0      0 2607:f058:110:3:.80    2607:f058:110:3:.21232 CLOSED
tcp6       0      0 2607:f058:110:3:.80    2607:f058:110:3:.48099 CLOSED
tcp6       0      0 2607:f058:110:3:.80    2607:f058:110:3:.65026 CLOSED
tcp6       0      0 2607:f058:110:3:.80    2607:f058:110:3:.54254 CLOSED
tcp6       0      0 2607:f058:110:3:.80    2607:f058:110:3:.35803 CLOSED
tcp6       0      0 2607:f058:110:3:.80    2607:f058:110:3:.11435 CLOSED
tcp6       0      0 2607:f058:110:3:.80    2607:f058:110:3:.46650 CLOSED
tcp6       0      0 2607:f058:110:3:.80    2607:f058:110:3:.7447  CLOSED
...

webmail3 [~] netstat -an | grep " CLOSED" | wc -l
     980

webmail3 [~] sudo procstat -at | grep httpd
  699 100076 httpd            -                 0  120 sleep   select
  755 100093 httpd            -                 3  120 sleep   pipewr
 4220 100124 httpd            -                 3  120 sleep   pipewr
 6812 100448 httpd            -                 2  120 sleep   pipewr
18402 100112 httpd            -                 0  120 sleep   pipewr
24419 100122 httpd            -                 3  120 sleep   pipewr
24590 100142 httpd            -                 2  120 sleep   pipewr
24591 100431 httpd            -                 3  120 sleep   pipewr
24592 100095 httpd            -                 2  120 sleep   pipewr
24593 100432 httpd            -                 0  120 sleep   pipewr
24594 100134 httpd            -                 2  120 sleep   pipewr
24595 100052 httpd            -                 3  120 sleep   pipewr
24596 100128 httpd            -                 2  120 sleep   pipewr
24597 100062 httpd            -                 2  120 sleep   pipewr
24598 100047 httpd            -                 3  120 sleep   pipewr
24599 100073 httpd            -                 2  120 sleep   pipewr
24600 100129 httpd            -                 0  120 sleep   pipewr
...

webmail3 [~] sudo procstat -at | grep httpd | wc -l
     201
After several weeks with the patches from this PR installed, we can report that we haven't hit either issue in some time. Will these patches be MFCed, and if possible to 10-STABLE?
(In reply to Robert Blayzor from comment #123)
All the relevant patches were merged to stable/10 around a month ago.
There is a commit referencing this PR, but it is still open and has been inactive for some time. Closing the PR as fixed; feel free to re-open it if the issue hasn't been completely resolved. Thanks.