Bug 204426 - Processes terminating cannot access memory
Summary: Processes terminating cannot access memory
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 10.3-RELEASE
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: freebsd-threads (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-11-10 12:46 UTC by Robert Blayzor
Modified: 2019-01-21 18:00 UTC
CC List: 11 users

See Also:


Attachments
Debugging patch to track source of KERN_PROTECTION_FAILURE (2.23 KB, patch)
2016-02-29 15:16 UTC, Konstantin Belousov
Debugging patch to track the origin of OBJ_DEAD (1.36 KB, patch)
2016-03-31 08:13 UTC, Konstantin Belousov
OBJ_DEAD tracker for stable/10 (1.35 KB, patch)
2016-03-31 14:51 UTC, Konstantin Belousov
OBJ_DEAD tracker for stable/10, v2: assert that the object is locked, change layout (2.01 KB, patch)
2016-04-01 14:51 UTC, Konstantin Belousov
OBJ_DEAD tracker for stable/10, v2.1 (2.80 KB, patch)
2016-04-01 16:14 UTC, Konstantin Belousov
Patch from D6085, adopted for stable/10. (5.29 KB, patch)
2016-04-25 15:13 UTC, Konstantin Belousov

Description Robert Blayzor 2015-11-10 12:46:42 UTC
After upgrading from 10.1-RELEASE to 10.2-RELEASE we have noticed processes randomly terminating with signal 11 (segmentation fault).

This has been an ongoing issue and we cannot track it down to any particular bug. With -p7 covering FreeBSD-EN-15:20.vm, we thought that might be the issue we were seeing; however, -p7 did not fix the problem.

Our environment is several FreeBSD 10.2 amd64 VMs running mail services with Exim and Dovecot. The hypervisor is VMware ESXi 5.1.

Several times a day (or sometimes not at all) Exim and/or Dovecot will exit with signal 11 with similar backtraces, usually from a function in libc or libthr. A couple of example backtraces are below.

Occasionally (although rarely) we've seen other processes crash with similar "cannot access memory" seg faults. We have not yet seen this since -p7, though we are monitoring closely.



Exim backtrace:

#0  0x000000080119e4b6 in pthread_suspend_all_np () from /lib/libthr.so.3
[New Thread 803006400 (LWP 100098/<unknown>)]
(gdb) bt
#0  0x000000080119e4b6 in pthread_suspend_all_np () from /lib/libthr.so.3
#1  0x00000008011a126a in pthread_getspecific () from /lib/libthr.so.3
#2  0x00000008011a5c96 in __pthread_cxa_finalize () from /lib/libthr.so.3
#3  0x0000000000423536 in daemon_go ()
#4  0x0000000000438ee9 in main ()



Dovecot backtrace:

#0  0x000000080061b6bc in _rtld_is_dlopened () from /libexec/ld-elf.so.1
(gdb) bt
#0  0x000000080061b6bc in _rtld_is_dlopened () from /libexec/ld-elf.so.1
#1  0x000000080061b2ab in _rtld_is_dlopened () from /libexec/ld-elf.so.1
#2  0x0000000800614c8d in _r_debug_postinit () from /libexec/ld-elf.so.1
#3  0x000000080061246d in .text () from /libexec/ld-elf.so.1
#4  0x000000000040abf8 in service_process_create ()
#5  0x000000000040a38a in services_monitor_reap_children ()
#6  0x00000008008c2bd3 in io_loop_call_io () from /usr/local/lib/dovecot/libdovecot.so.0
#7  0x00000008008c46ef in io_loop_handler_run_internal ()
   from /usr/local/lib/dovecot/libdovecot.so.0
#8  0x00000008008c30d4 in io_loop_handler_run () from /usr/local/lib/dovecot/libdovecot.so.0
#9  0x00000008008c2eb8 in io_loop_run () from /usr/local/lib/dovecot/libdovecot.so.0
#10 0x000000080085f1d8 in master_service_run () from /usr/local/lib/dovecot/libdovecot.so.0
#11 0x0000000000406512 in main ()



After checking with the Exim and Dovecot communities, all indications are that this is library-related in some way.

We have tried fresh installs of the base OS/VMs from a fresh SVN checkout and a complete world rebuild; however, the problem still persists.



Sysctl vars:

kern.corefile=/var/tmp/%N.core
kern.timecounter.hardware=ACPI-fast
#
kern.ipc.maxsockbuf=2097152
kern.ipc.somaxconn=2048
kern.maxfiles=65536
kern.maxfilesperproc=16384
net.inet.tcp.sendspace=131072
net.inet.tcp.recvspace=131072
net.inet.udp.recvspace=131072
net.inet.udp.maxdgram=16384
#
net.inet.tcp.msl=7500
net.inet.tcp.fast_finwait2_recycle=1
net.inet.icmp.log_redirect=0
net.inet.icmp.drop_redirect=1
net.inet.tcp.delayed_ack=0
net.inet.ip.redirect=0
net.inet6.ip6.redirect=0
net.link.ether.inet.log_arp_wrong_iface=0
kern.sugid_coredump=1
#
net.inet.tcp.keepidle=60000
net.inet.tcp.keepintvl=10000
Comment 1 Robert Blayzor 2015-11-16 13:03:07 UTC
Recent Exim sig 11 with debug symbols:


#0  0x00000008011f84b6 in pthread_suspend_all_np () from /lib/libthr.so.3
[New Thread 803006400 (LWP 100090/<unknown>)]
(gdb) bt
#0  0x00000008011f84b6 in pthread_suspend_all_np () from /lib/libthr.so.3
#1  0x00000008011fb26a in pthread_getspecific () from /lib/libthr.so.3
#2  0x00000008011ffc96 in __pthread_cxa_finalize () from /lib/libthr.so.3
#3  0x000000000042879c in handle_smtp_call (listen_sockets=0x803065140, listen_socket_count=4,
    accept_socket=7, accepted=0x7fffffffcbb0) at daemon.c:355
#4  0x000000000042777b in daemon_go () at daemon.c:2040
#5  0x0000000000449d13 in main (argc=3, cargv=0x7fffffffebb8) at exim.c:4719
Comment 2 Robert Blayzor 2015-11-19 00:17:55 UTC
Still seeing "cannot access memory" seg-faults on -p7

FreeBSD 10.2-RELEASE-p7 FreeBSD 10.2-RELEASE-p7 #0 r290449M amd64


Core was generated by `exim'.
Program terminated with signal 11, Segmentation fault.
Cannot access memory at address 0x800978ac0
#0  0x00000008023a3937 in ?? ()
(gdb) bt
#0  0x00000008023a3937 in ?? ()
Cannot access memory at address 0x7fffffffc310
Comment 3 Robert Blayzor 2015-11-19 13:13:05 UTC
Still seeing similar crashes. Keep in mind, these didn't appear at all in 10.1 (at least we never saw them); this started only after upgrading to RELENG_10_2.


Core was generated by `dovecot'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /usr/local/lib/dovecot/libdovecot.so.0...done.
Loaded symbols for /usr/local/lib/dovecot/libdovecot.so.0
Reading symbols from /lib/libc.so.7...done.
Loaded symbols for /lib/libc.so.7
Reading symbols from /libexec/ld-elf.so.1...done.
Loaded symbols for /libexec/ld-elf.so.1
#0  0x000000000040db60 in service_process_create (service=Cannot access memory at address 0x7fffffffb0b0
) at service-process.c:311
311	service-process.c: No such file or directory.
	in service-process.c
(gdb) bt
#0  0x000000000040db60 in service_process_create (service=Cannot access memory at address 0x7fffffffb0b0
) at service-process.c:311
Cannot access memory at address 0x7fffffffb0c8
Current language:  auto; currently minimal
(gdb)
Comment 4 Robert Blayzor 2015-12-03 11:33:01 UTC
Still seeing various programs randomly crash with "cannot access memory" type core dumps. As previously mentioned, we never experienced this on 10.1-RELEASE; only after upgrading the VMs to 10.2 did we start noticing it.


This GDB was configured as "amd64-marcel-freebsd"...
Core was generated by `exim'.
Program terminated with signal 11, Segmentation fault.
#0  0x00000000004b9f6b in store_reset_3 (ptr=Cannot access memory at address 0x7fffffffc838
) at store.c:360
360	store.c: No such file or directory.
	in store.c
(gdb)
Comment 5 John Baldwin freebsd_committer freebsd_triage 2015-12-04 19:54:25 UTC
Can you provide some more information from one of your cores?  Specifically, can you provide the output of 'x/i $rip' and 'info registers' as a start?  The gdb messages aren't really telling us why it crashed.
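
For reference, a minimal example of pulling that information out of a core with the base-system gdb (the binary and core paths here are just illustrative):

% gdb /usr/local/sbin/exim /var/tmp/exim.core
(gdb) bt
(gdb) x/i $rip
(gdb) info registers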
Comment 6 Robert Blayzor 2016-01-25 12:00:58 UTC
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...
Core was generated by `dovecot'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /usr/local/lib/dovecot/libdovecot.so.0...done.
Loaded symbols for /usr/local/lib/dovecot/libdovecot.so.0
Reading symbols from /lib/libc.so.7...done.
Loaded symbols for /lib/libc.so.7
Reading symbols from /libexec/ld-elf.so.1...done.
Loaded symbols for /libexec/ld-elf.so.1
#0  0x000000000040db76 in service_process_create (service=Cannot access memory at address 0x7fffffffb0c0
) at service-process.c:313
313	service-process.c: No such file or directory.
	in service-process.c
(gdb) bt full
#0  0x000000000040db76 in service_process_create (service=Cannot access memory at address 0x7fffffffb0c0
) at service-process.c:313
	uid_counter = 73470
	process = (struct service_process *) Cannot access memory at address 0x7fffffffb0b8
Current language:  auto; currently minimal
(gdb) x/i $rip
0x40db76 <service_process_create+310>:	mov    %eax,0x20(%rcx)
(gdb) info registers
rax            0x11efe	73470
rbx            0x0	0
rcx            0x8014b1040	34381434944
rdx            0x0	0
rsi            0x40d5e0	4249056
rdi            0x0	0
rbp            0x7fffffffb0d0	0x7fffffffb0d0
rsp            0x7fffffffb090	0x7fffffffb090
r8             0x2	2
r9             0x3	3
r10            0x2710	10000
r11            0x202	514
r12            0x7fffffffb328	140737488335656
r13            0x7fffffffb350	140737488335696
r14            0x7fffffffb330	140737488335664
r15            0x3	3
rip            0x40db76	0x40db76 <service_process_create+310>
eflags         0x10202	66050
cs             0x43	67
ss             0x3b	59
ds             0x0	0
es             0x0	0
fs             0x0	0
gs             0x0	0
(gdb)
Comment 7 Robert Blayzor 2016-01-25 12:06:14 UTC
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...
Core was generated by `dovecot'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /usr/local/lib/dovecot/libdovecot.so.0...done.
Loaded symbols for /usr/local/lib/dovecot/libdovecot.so.0
Reading symbols from /lib/libc.so.7...done.
Loaded symbols for /lib/libc.so.7
Reading symbols from /libexec/ld-elf.so.1...done.
Loaded symbols for /libexec/ld-elf.so.1
#0  0x00000008006226bc in _rtld_is_dlopened () from /libexec/ld-elf.so.1
(gdb) x/i $rip
0x8006226bc <_rtld_is_dlopened+15468>:
    mov    %edi,0x213236(%rip)        # 0x8008358f8 <_rtld_atfork_post+2174936>
(gdb) info registers
rax            0x0	0
rbx            0x800835b50	34368346960
rcx            0x8014b1310	34381435664
rdx            0x8014b13d0	34381435856
rsi            0x7fffffffaf08	140737488334600
rdi            0x1	1
rbp            0x7fffffffaed0	0x7fffffffaed0
rsp            0x7fffffffaed0	0x7fffffffaed0
r8             0x103	259
r9             0x8080808080808080	-9187201950435737472
r10            0x1	1
r11            0x801406e30	34380738096
r12            0x7fffffffb328	140737488335656
r13            0x7fffffffb350	140737488335696
r14            0x7fffffffaf08	140737488334600
r15            0x800834850	34368342096
rip            0x8006226bc	0x8006226bc <_rtld_is_dlopened+15468>
eflags         0x10202	66050
cs             0x43	67
ss             0x3b	59
ds             0x0	0
es             0x0	0
fs             0x0	0
gs             0x0	0
(gdb)
Comment 8 Robert Blayzor 2016-01-25 12:09:49 UTC
If it makes any difference at all, these are "diskless"-booted VM servers with a read-only mounted NFS root. The only sysctl knobs turned are below. This is still happening even after a fresh build of 10.2-p10.

kern.corefile=/var/spool/tmp/%N.core
kern.timecounter.hardware=ACPI-fast
kern.ipc.maxsockbuf=2097152
kern.ipc.somaxconn=2048
kern.maxfiles=65536
kern.maxfilesperproc=16384
net.inet.tcp.sendspace=131072
net.inet.tcp.recvspace=131072
net.inet.udp.recvspace=131072
net.inet.udp.maxdgram=16384
net.inet.tcp.msl=7500
net.inet.tcp.fast_finwait2_recycle=1
net.inet.icmp.log_redirect=0
net.inet.icmp.drop_redirect=1
net.inet.tcp.delayed_ack=0
net.inet.ip.redirect=0
net.inet6.ip6.redirect=0
net.link.ether.inet.log_arp_wrong_iface=0
kern.sugid_coredump=1
net.inet.tcp.keepidle=60000
net.inet.tcp.keepintvl=10000
Comment 9 Robert Blayzor 2016-01-26 12:52:05 UTC
And a recent Exim crash under 10.2 as well:


Core was generated by `exim'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /lib/libcrypt.so.5...(no debugging symbols found)...done.
Loaded symbols for /lib/libcrypt.so.5
Reading symbols from /lib/libm.so.5...(no debugging symbols found)...done.
Loaded symbols for /lib/libm.so.5
Reading symbols from /lib/libutil.so.9...(no debugging symbols found)...done.
Loaded symbols for /lib/libutil.so.9
Reading symbols from /usr/local/lib/libspf2.so.2...(no debugging symbols found)...done.
Loaded symbols for /usr/local/lib/libspf2.so.2
Reading symbols from /lib/libthr.so.3...(no debugging symbols found)...done.
Loaded symbols for /lib/libthr.so.3
Reading symbols from /usr/local/lib/libpq.so.5...(no debugging symbols found)...done.
Loaded symbols for /usr/local/lib/libpq.so.5
Reading symbols from /usr/local/lib/perl5/5.22/mach/CORE/libperl.so.5.22...(no debugging symbols found)...done.
Loaded symbols for /usr/local/lib/perl5/5.22/mach/CORE/libperl.so.5.22
Reading symbols from /usr/lib/libssl.so.7...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/libssl.so.7
Reading symbols from /lib/libcrypto.so.7...(no debugging symbols found)...done.
Loaded symbols for /lib/libcrypto.so.7
Reading symbols from /usr/local/lib/libpcre.so.1...(no debugging symbols found)...done.
Loaded symbols for /usr/local/lib/libpcre.so.1
Reading symbols from /lib/libc.so.7...(no debugging symbols found)...done.
Loaded symbols for /lib/libc.so.7
Reading symbols from /usr/local/lib/libintl.so.8...(no debugging symbols found)...done.
Loaded symbols for /usr/local/lib/libintl.so.8
Reading symbols from /libexec/ld-elf.so.1...(no debugging symbols found)...done.
Loaded symbols for /libexec/ld-elf.so.1
#0  0x000000080119e4b6 in pthread_suspend_all_np () from /lib/libthr.so.3
[New Thread 803006400 (LWP 100084/<unknown>)]
(gdb) bt full
#0  0x000000080119e4b6 in pthread_suspend_all_np () from /lib/libthr.so.3
No symbol table info available.
#1  0x00000008011a126a in pthread_getspecific () from /lib/libthr.so.3
No symbol table info available.
#2  0x00000008011a5c96 in __pthread_cxa_finalize () from /lib/libthr.so.3
No symbol table info available.
#3  0x0000000000423536 in daemon_go ()
No symbol table info available.
#4  0x0000000000438ee9 in main ()
No symbol table info available.
(gdb) x/i $rip
0x80119e4b6 <pthread_suspend_all_np+2502>:	lock cmpxchg %ecx,(%rdi)
(gdb) info registers
rax            0x1	1
rbx            0x0	0
rcx            0x0	0
rdx            0x1	1
rsi            0x8011a833c	34378253116
rdi            0x8013b4480	34380399744
rbp            0x7fffffffd440	0x7fffffffd440
rsp            0x7fffffffd440	0x7fffffffd440
r8             0x2	2
r9             0x3	3
r10            0x2710	10000
r11            0x246	582
r12            0x803006400	34410095616
r13            0x709b50	7379792
r14            0x8013b4460	34380399712
r15            0x0	0
rip            0x80119e4b6	0x80119e4b6 <pthread_suspend_all_np+2502>
eflags         0x10246	66118
cs             0x43	67
ss             0x3b	59
ds             0x0	0
es             0x0	0
fs             0x0	0
gs             0x0	0
(gdb)
Comment 10 John Baldwin freebsd_committer freebsd_triage 2016-01-26 18:40:53 UTC
So far I don't see anything unusual to suggest that these are anything other than crashes due to application bugs.  Are these the exact same exim/dovecot binaries you were using on 10.1, or did you recompile them against 10.2 (and/or upgrade to newer versions)?

One thing that might help with debugging these is building the user libraries with debug symbols.  If you are building your own worlds, just add 'WITH_DEBUG_FILES=yes' to /etc/src.conf and build a new world.  Note that this will install debug symbols for all binaries and libraries in the base system to /usr/lib/debug/.
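
For example, a minimal sketch assuming you build from /usr/src in the usual way:

% echo 'WITH_DEBUG_FILES=yes' >> /etc/src.conf
% cd /usr/src
% make buildworld
% make installworld

The separate debug files then land under /usr/lib/debug/, where gdb will pick them up automatically.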
Comment 11 Robert Blayzor 2016-01-26 18:51:59 UTC
Since we're using a FreeBSD diskless boot and NFS_ROOT, I normally buildworld (and kernel) from sources and install into a new NFS_ROOT path, then boot the diskless servers with the new root path.

Before booting, however, I also force-rebuild (portupgrade -af) everything against the new version of FreeBSD (kernel/userland, libs, etc.), so the binaries that are crashing *are* being rebuilt under FreeBSD 10.2.

We never experienced this in 10.1. In fact, we even rolled back and booted one of the diskless servers using an old NFS_ROOT path with 10.1 (same versions of Exim and Dovecot) and those have not crashed. Usually I see these types of seg faults on every VM server at least once a week (currently these servers are lightly loaded because of this issue).

Normally I would have no problem believing this is an application bug, but the fact that it does not appear to happen on 10.1 and only on 10.2, coupled with similar crashes in two different applications that were otherwise running fine, is very suspect.

The applications are compiled with debugging symbols, but the crash always happens in a system library. I've contacted both the Dovecot and Exim teams about these crashes and both agreed that it seemed to be some type of system library issue.

I'm not ruling out that it may be something with our install, but it's a fairly straightforward FreeBSD diskless install. We're running out of places to look.
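
For reference, a rough sketch of the buildworld/staging workflow described above (the DESTDIR path is hypothetical):

% cd /usr/src
% make buildworld buildkernel
% make installkernel DESTDIR=/vol/bsdroot/v0g
% make installworld DESTDIR=/vol/bsdroot/v0g
% make distribution DESTDIR=/vol/bsdroot/v0g

followed by portupgrade -af on the build host and an rsync of /usr/local into the new root.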
Comment 12 John Baldwin freebsd_committer freebsd_triage 2016-01-26 19:56:57 UTC
Adding WITH_DEBUG_FILES so we get debugging symbols in the system libraries would still be useful.  Is it possible to grab the binaries from a 10.1 VM and run them in a 10.2 VM (I realize that might be a bit hacky, but if it's feasible it might help narrow down what change is causing the regression)?
Comment 13 Robert Blayzor 2016-01-26 20:06:53 UTC
I am building world with debug symbols now and will stage a new root and boot a few VMs into that, as well as make sure debug symbols are compiled into both Exim and Dovecot. Once that's done, hopefully we'll get a bit more out of the debugger.

As for the binaries, that might be possible with Exim, but Dovecot has a slew of associated libraries and other binary dependencies.

Let's start with the debug symbols and see if that reveals more of what's going on.
Comment 14 Julian Elischer freebsd_committer freebsd_triage 2016-01-26 21:36:27 UTC
(In reply to rblayzor from comment #13)

Don't forget you can run with a new kernel and an old userland to help isolate it, and also don't forget that you can replace libraries one at a time or in groups using /etc/libmap.conf
 (see libmap.conf(5)).

So, for example, you can make the whole system use the new libraries but have just one application use the old ones, or you can even make just one instance of the app use different libraries by starting it from a different binary or giving a different path to it.
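
For example, a hypothetical /etc/libmap.conf entry that makes only Exim use a saved copy of the old libthr while the rest of the system uses the new one (the path to the old library is illustrative):

[/usr/local/sbin/exim]
libthr.so.3  /usr/lib/compat-10.1/libthr.so.3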
Comment 15 Robert Blayzor 2016-01-28 23:31:54 UTC
After installing a new world and booting into the new NFS root, I've seen one crash (it usually takes time). However, this one is a bit odd. Kernel message:

Failed to write core file for process exim (error 14)
pid 58267 (exim), uid 26: exited on signal 11


I do get a core file, but this is all it contains:

GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...
Core was generated by `exim'.
Program terminated with signal 11, Segmentation fault.
Cannot access memory at address 0x800978ac0
#0  0x00000008023a2990 in ?? ()
(gdb) bt full
#0  0x00000008023a2990 in ?? ()
No symbol table info available.
Cannot access memory at address 0x7fffffff4f70
(gdb) bt info
No symbol "info" in current context.
(gdb) x/i $rip
0x8023a2990:	Cannot access memory at address 0x8023a2990
(gdb) info registers
rax            0xffffffffffffffd0	-48
rbx            0x2010	8208
rcx            0x800788330	34367636272
rdx            0x0	0
rsi            0x3ff3	16371
rdi            0x8030033c1	34410083265
rbp            0x7fffffff5420	0x7fffffff5420
rsp            0x7fffffff4f70	0x7fffffff4f70
r8             0x7fffffff4670	140737488307824
r9             0x80269f6e0	34400237280
r10            0xfffffffffffff000	-4096
r11            0x8030033c0	34410083264
r12            0x7fffffffed40	140737488350528
r13            0x803229000	34412335104
r14            0x2010	8208
r15            0x3000	12288
rip            0x8023a2990	0x8023a2990
eflags         0x10206	66054
cs             0x43	67
ss             0x3b	59
ds             0x0	0
es             0x0	0
fs             0x0	0
gs             0x0	0
Comment 16 John Baldwin freebsd_committer freebsd_triage 2016-01-29 18:44:13 UTC
Hmm, the EFAULT error (14) is quite odd.  I wonder if you tripped over the bug that is fixed in HEAD by r287442, r287537, and r288944.

That bug was present in 10.1 though (and 10.0), not unique to 10.2.  However, a corrupted core dump is why you can't do a full backtrace. :(
Comment 17 Robert Blayzor 2016-01-29 19:15:29 UTC
I'm not claiming that the bug I've reported here does not exist in 10.0 or 10.1; I'm just saying that we've seen this bug during our entire release cycle of 10.1 (which was almost a good year).

We only started noticing these process seg faults after updating to 10.2. We were running the same versions of Dovecot and Exim on 10.1 as when we started noticing it on 10.2; that's when we force-rebuilt all of the binaries under 10.2, to no avail.

Dovecot and Exim are NOT the only processes that we've seen core out; they are just the ones we see most often. I've seen cron, ntpd, and some other port binaries do it. I remember looking at their core files and seeing the same "cannot access memory" errors in the backtrace. It was only then that I started paying closer attention and began gathering more and more debug information on this strange issue.
Comment 18 Robert Blayzor 2016-01-29 19:16:32 UTC
(In reply to rblayzor from comment #17)

Should read "I'm just saying that we've NEVER seen this bug during our entire release cycle of 10.1."
Comment 19 John Baldwin freebsd_committer freebsd_triage 2016-01-29 19:50:02 UTC
Oh, I should be clear that the revisions I mentioned only address the issue of core dumps being corrupt; they do not address the root cause of the seg faults you are seeing.  However, having proper core dumps is needed to debug the root cause.

You can try backporting the changes I mentioned previously to your 10.2 kernel (you can get by with just the first two for testing) to see if we can get more usable core dumps to use in tracking down the root cause.  If you aren't used to backporting changes, I will try to do it in the next day or so (those fixes should be in 10.3 anyway).  Once I've done the backport I can give you a patch to apply to your 10.2 kernel.
Comment 20 Robert Blayzor 2016-01-29 20:27:07 UTC
These broken cores do not happen all the time; I'm actually waiting for some crashes to report. I'm pretty sure I should see a couple in the coming days.

In case I keep getting broken cores with no useful output, I have already checked out 10-STABLE and have it lined up to try building a new NFS_ROOT system (though that may take some time).

For now I would rather wait, while I have the latest 10.2 running, to see if I can at least grab some useful debug output.
Comment 21 Robert Blayzor 2016-01-31 11:17:40 UTC
After a couple of days, like clockwork, I got another crash. I thought this one looked like a normal Dovecot bug and reported it to the list, to which Timo responded:

"There's no way it can crash here, unless something's badly broken. Either a buggy compiler or maybe memory corruption earlier in the process. Can you reproduce it? Can you run it via valgrind?"




Core was generated by `dovecot'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /usr/local/lib/dovecot/libdovecot.so.0...done.
Loaded symbols for /usr/local/lib/dovecot/libdovecot.so.0
Reading symbols from /lib/libc.so.7...Reading symbols from /usr/lib/debug//lib/libc.so.7.debug...done.
done.
Loaded symbols for /lib/libc.so.7
Reading symbols from /libexec/ld-elf.so.1...Reading symbols from /usr/lib/debug//libexec/ld-elf.so.1.debug...done.
done.
Loaded symbols for /libexec/ld-elf.so.1
#0  0x0000000800910c74 in io_loop_time_refresh () at ioloop.c:635
635	ioloop.c: No such file or directory.
	in ioloop.c
(gdb) bt full
#0  0x0000000800910c74 in io_loop_time_refresh () at ioloop.c:635
No locals.
#1  0x0000000800910961 in io_loop_timeouts_start_new (ioloop=0x801421080) at ioloop.c:436
	to_idx = (struct timeout * const *) 0x801419880
#2  0x0000000800910915 in io_loop_handler_run (ioloop=0x801421080) at ioloop.c:606
No locals.
#3  0x000000080091073f in io_loop_run (ioloop=0x801421080) at ioloop.c:583
No locals.
#4  0x000000080086aaab in master_service_run (service=0x80141b140, callback=0) at master-service.c:640
No locals.
#5  0x0000000000406509 in main (argc=3, argv=0x7fffffffb4c0) at main.c:888
	set = (struct master_settings *) 0x801429110
	error = 0x800caa2c3 "H\203Ä ]Ã\017\037\200"
	doveconf_arg = 0x0
	orig_info_callback = (failure_callback_t *) 0x8008f0220 <i_syslog_error_handler>
	orig_debug_callback = (failure_callback_t *) 0x8008f0220 <i_syslog_error_handler>
	foreground = false
	ask_key_pass = false
	i = 3
	c = -1
	doubleopts = <value optimized out>
Current language:  auto; currently minimal
(gdb)
Comment 22 Robert Blayzor 2016-02-01 12:51:32 UTC
When I see a process core dump along with the message "Failed to write core file for process XXX (error 14)", that is generally where I will see an incomplete core file with a "cannot access memory" error, so that may be related to the bug you mentioned.

I do get occasional good core dumps, as noted above. At this time I cannot run 10-STABLE to patch the core dump issue.

I did, however, install clang 3.6 and have rebuilt the ports we are seeing problems with against it, just to rule the compiler out. That is, however, very unlikely to be the cause.

I will continue to gather more dumps if they produce valid cores. If there is a patch I can rebuild against that fixes the "error 14" issue, I'll give it a whirl.
Comment 23 Konstantin Belousov freebsd_committer freebsd_triage 2016-02-01 13:06:10 UTC
(In reply to rblayzor from comment #22)
This smells like a pager error.  You mentioned somewhere that the images for the programs which failed are located on an NFS volume?  What is in dmesg?

I believe that the pager somehow fails to page in, and this causes both access errors in the live processes (SIGSEGV) and EFAULT when the kernel reads the process mappings to dump them into core files.
Comment 24 Robert Blayzor 2016-02-01 13:34:33 UTC
(In reply to Konstantin Belousov from comment #23)

These VMs boot diskless and mount their root file system read-only.

Trying to mount root from nfs: []...
NFS ROOT: 10.0.0.211:/vol/bsdroot/v0f


Where "v0f" is nothing more than subdir for a new rootfs.  "bsdroot" is just a NetApp volume RO volume from these hosts.


There is a VMDK (virtual disk) that is mounted at boot time; which is a thin provision disk 10GB in size.  2GB a 2Gbps swap partition and a 8Gb scratch disk for things like /var or /var/tmp


da0 at mpt0 bus 0 scbus0 target 0 lun 0
da0: <VMware Virtual disk 1.0> Fixed Direct Access SCSI-2 device
bootpc_init: wired to interface 'vmx0'
da0: 320.000MB/s transfers (160.000MHz, offset 127, 16bit)
da0: Command Queueing enabled
da0: 10240MB (20971520 512 byte sectors: 255H 63S/T 1305C)
da0: quirks=0x40<RETRY_BUSY>


The port binaries are built on another FreeBSD 10.2 platform (same VM config) and then we rsync /usr/local to be part of the rootfs.

We always keep the port builds and the base system fresh and in sync. Typically, when we want to roll out a new version of FreeBSD, we'll upgrade the build VM first, then create a new rootfs subdirectory and install there, followed by a complete port rebuild and resync. This has always worked fine for us since FreeBSD 8.x.


Only recently have we started seeing these random, strange, unexplainable core dumps.
Comment 25 Konstantin Belousov freebsd_committer freebsd_triage 2016-02-01 17:17:10 UTC
(In reply to rblayzor from comment #24)
Of course, I mean dmesg after the mysterious seg fault.
Comment 26 Robert Blayzor 2016-02-01 21:05:37 UTC
(In reply to Konstantin Belousov from comment #25)

I think there may be two different issues going on here. My original bug report should have been more specific: applications/processes that apparently ran fine prior to upgrading to 10.2-RELEASE suddenly seem to be core dumping for some reason.

When these processes core dump, sometimes they leave good core files and sometimes they do not (perhaps most of the time). I'm seeing something in common with the bug John brought up. Usually when I see something like:

Failed to write core file for process exim (error 14)
pid 8298 (exim), uid 26: exited on signal 11


The core file is incomplete, and one of the last messages I see in the backtrace is "cannot access memory", or something along those lines. We'll chalk these up to the bug he mentioned with the system not writing these core files correctly. I am telling the system to drop all cores into a specific directory that has plenty of space and is read/write.


The main problem, however, is that I need to collect more core dumps (when they drop core successfully). I previously posted one from the other day that Timo Sirainen said should not be possible, given the output from the backtrace.

I'm still waiting for more segfaults and will post those backtraces when they are available.
Comment 27 Robert Blayzor 2016-02-05 12:34:59 UTC
Just had one of the VMs sig11 on Dovecot last night. This one is giving me the "cannot access memory" error...


This GDB was configured as "amd64-marcel-freebsd"...
Core was generated by `dovecot'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /usr/local/lib/dovecot/libdovecot.so.0...done.
Loaded symbols for /usr/local/lib/dovecot/libdovecot.so.0
Reading symbols from /lib/libc.so.7...Reading symbols from /usr/lib/debug//lib/libc.so.7.debug...done.
done.
Loaded symbols for /lib/libc.so.7
Reading symbols from /libexec/ld-elf.so.1...Reading symbols from /usr/lib/debug//libexec/ld-elf.so.1.debug...done.
done.
Loaded symbols for /libexec/ld-elf.so.1
#0  memcpy () at bcopy.S:65
65	bcopy.S: No such file or directory.
	in bcopy.S
(gdb) bt full
#0  memcpy () at bcopy.S:65
No locals.
#1  0x0000000800915484 in sig_handler (signo=20, si=0x7fffffffa9a0, context=0x7fffffffa630)
    at lib-signals.c:132
	h = (struct signal_handler *) 0x8014198e0
	saved_errno = 14
	c = 0 '\0'
	tmp_si = {si_signo = 0, si_errno = 0, si_code = 0, si_pid = 0, si_uid = 0, si_status = 0,
  si_addr = 0x0, si_value = {sival_int = 0, sival_ptr = 0x0, sigval_int = 0, sigval_ptr = 0x0},
  _reason = {_fault = {_trapno = 0}, _timer = {_timerid = 0, _overrun = 0}, _mesgq = {_mqd = 0},
    _poll = {_band = 0}, __spare__ = {__spare1__ = 0, __spare2__ = 0x7fffffffa5c8}}}
#2  <signal handler called>
No symbol table info available.
#3  0x0000000000411fc8 in _fini ()
No symbol table info available.
#4  0x000000080061c9a8 in objlist_call_fini ()
    at /.amd_mnt/juno/pub/FreeBSD/10.2-RELEASE/libexec/rtld-elf/rtld.c:2359
	list = (Objlist *) Cannot access memory at address 0x0
Current language:  auto; currently asm
(gdb) x/i $rip
0x800cd1a76 <memcpy+22>:	rep movsq %ds:(%rsi),%es:(%rdi)
(gdb) info registers
rax            0x800b893f0	34371834864
rbx            0x7fffffffaef8	140737488334584
rcx            0xa	10
rdx            0x50	80
rsi            0x7fffffffa9a0	140737488333216
rdi            0x800b893f0	34371834864
rbp            0x7fffffffa610	0x7fffffffa610
rsp            0x7fffffffa578	0x7fffffffa578
r8             0xffff800800b8ea50	-140703116498352
r9             0xfffff8000a811000	-8795916791808
r10            0x800636200	34366251520
r11            0x246	582
r12            0x800636200	34366251520
r13            0x800834830	34368342064
r14            0x7fffffffaef8	140737488334584
r15            0x0	0
rip            0x800cd1a76	0x800cd1a76 <memcpy+22>
eflags         0x10206	66054
cs             0x43	67
ss             0x3b	59
ds             0x0	0
es             0x0	0
fs             0x0	0
gs             0x0	0
(gdb)
Comment 28 Robert Blayzor 2016-02-05 13:18:31 UTC
On that last crash, upon restarting Dovecot this message appeared. It's the first time we've seen it, but it may be due to all the debug symbols and misc debugging enabled:


Starting dovecot.
Last died with error (see error log for more information): gettimeofday(): Bad address
Comment 29 John Baldwin freebsd_committer freebsd_triage 2016-02-05 20:15:41 UTC
That would imply gettimeofday() is failing with EFAULT, which is a really odd error.  It should only fail if the address passed to gettimeofday() is bad, which seems weird.

Maybe try setting kern.timecounter.fast_gettime=0 as a test?  Some of your faults have addresses in the shared page that holds the fast timecounter stuff.  It's a bit of a hail mary, but if it works then it narrows down what we need to look at to find the real bug.
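
For example, to flip it at runtime (it can also be persisted via /etc/sysctl.conf):

% sysctl kern.timecounter.fast_gettime=0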
Comment 30 Konstantin Belousov freebsd_committer freebsd_triage 2016-02-05 20:26:10 UTC
(In reply to John Baldwin from comment #29)
Fast gettimeofday cannot fail with EFAULT; there is no such return path. An access to the wrong memory from the fast gettimeofday code would result in a real fault.
Comment 31 Robert Blayzor 2016-02-07 02:50:35 UTC
Probably not entirely helpful. This appears to be an unsuccessful core file again:


Failed to write core file for process dovecot (error 14)
pid 718 (dovecot), uid 0: exited on signal 11



#0  service_process_create (service=Cannot access memory at address 0x7fffffffb250
) at service-process.c:311
	uid_counter = 232627
	process = (struct service_process *) Cannot access memory at address 0x7fffffffb248
Current language:  auto; currently minimal
(gdb) x/i $rip
0x40dd08 <service_process_create+296>:	mov    %eax,-0x2c(%rbp)
(gdb) info registers
rax            0x4a78	19064
rbx            0x0	0
rcx            0x800c6625a	34372739674
rdx            0x0	0
rsi            0x40d760	4249440
rdi            0x0	0
rbp            0x7fffffffb260	0x7fffffffb260
rsp            0x7fffffffb220	0x7fffffffb220
r8             0x2	2
r9             0x1	1
r10            0x2710	10000
r11            0x202	514
r12            0x7fffffffb4b8	140737488336056
r13            0x7fffffffb4e0	140737488336096
r14            0x7fffffffb4c0	140737488336064
r15            0x3	3
rip            0x40dd08	0x40dd08 <service_process_create+296>
eflags         0x10202	66050
cs             0x43	67
ss             0x3b	59
ds             0x0	0
es             0x0	0
fs             0x0	0
gs             0x0	0
(gdb)
Comment 32 John Baldwin freebsd_committer freebsd_triage 2016-02-08 21:35:08 UTC
Can you get 'procstat -v /path/to/foo.core' output from that last core?
Comment 33 Robert Blayzor 2016-02-08 21:55:28 UTC
  PID              START                END PRT  RES PRES REF SHD   FL TP PATH
  718           0x400000           0x417000 r-x   19   19   1   0 CN-- vn
  718           0x617000           0x618000 rw-    1    1   1   0 CN-- df
  718        0x800617000        0x800635000 r-x   30   30  34   0 CN-- vn /libexec/ld-elf.so.1
  718        0x800635000        0x800663000 rw-   36   36   1   0 CN-- df
  718        0x800834000        0x800836000 rw-    2    2   1   0 CN-- df
  718        0x800836000        0x800983000 r-x  216  222  38  19 CN-- vn /usr/local/lib/dovecot/libdovecot.so.0.0.0
  718        0x800983000        0x800b82000 ---    0    0   1   0 CN-- df
  718        0x800b82000        0x800b89000 rw-    7    0   1   0 CN-- vn /usr/local/lib/dovecot/libdovecot.so.0.0.0
  718        0x800b89000        0x800b8b000 rw-    2    2   1   0 CN-- df
  718        0x800b8b000        0x800d02000 r-x  375  386  67  33 CN-- vn /lib/libc.so.7
  718        0x800d02000        0x800f01000 ---    0    0   1   0 CN-- df
  718        0x800f01000        0x800f0d000 rw-   12    0   1   0 CN-- vn /lib/libc.so.7
  718        0x800f0d000        0x800f37000 rw-   13   13   1   0 CN-- df
  718        0x801000000        0x801800000 rw-   94   94   1   0 CN-- df
  718     0x7ffffffdf000     0x7ffffffff000 rw-    8    0   1   0 C--D df
  718     0x7ffffffff000     0x800000000000 r-x    1    1  36   0 ---- ph
Comment 34 John Baldwin freebsd_committer freebsd_triage 2016-02-10 00:10:14 UTC
https://svnweb.freebsd.org/changeset/base/295454

Is the backport of the coredump fixes I mentioned previously to stable/10.  I haven't tried to apply those to 10.2, but I think there's a decent chance that will apply cleanly.
Comment 35 Robert Blayzor 2016-02-10 12:11:35 UTC
I was not able to roll out a 10-STABLE NFS root due to some other issues restricting that. If there is a patch file against 10.2 I can certainly try that.
Comment 36 John Baldwin freebsd_committer freebsd_triage 2016-02-10 18:50:17 UTC
The patch from the commit applies cleanly to a 10.2 checkout:

> pwd
/usr/home/john/work/freebsd/releng/10.2
> svn merge -c 295454 ^/stable/10
--- Merging r295454 into '.':
U    lib/libprocstat/libprocstat.c
U    lib/libutil/kinfo_getfile.c
U    lib/libutil/kinfo_getvmmap.c
U    share/man/man5/core.5
U    sys/kern/imgact_elf.c
U    sys/kern/kern_descrip.c
U    sys/kern/kern_exec.c
U    sys/kern/kern_proc.c
U    sys/sys/exec.h
U    sys/sys/user.h
 G   .
--- Recording mergeinfo for merge of r295454 into '.':
 G   .


It might be simplest to apply it to your tree using something like:

% cd /path/to/10.2/tree
% svn diff -c 295454 https://svn.freebsd.org/stable/10 > foo.patch
% svn patch foo.patch
Comment 37 Robert Blayzor 2016-02-10 19:47:59 UTC
Is building the kernel enough for this patch or does it require a complete buildworld?
Comment 38 John Baldwin freebsd_committer freebsd_triage 2016-02-10 22:38:16 UTC
You can get by with a new kernel only.  If the bug this protects against occurs, procstat from an old world might not be able to read file (-f) or VM info (-v) from a core dump correctly, but gdb, etc. should work fine.
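
A minimal sketch of the kernel-only rebuild after applying the patch (run 'make kernel-toolchain' first if you have not already done a buildworld from this tree, and add KERNCONF= if you use a custom kernel config):

% cd /path/to/10.2/tree
% make buildkernel
% make installkernel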
Comment 39 Robert Blayzor 2016-02-11 15:22:22 UTC
I did get the patch and it applied cleanly. When I booted the new kernel, I noticed I got the same error as previously, and a partial core file. I've decided to stop beating a dead horse on 10.2 and have checked out 10.3-BETA1 (FreeBSD 10.3-BETA1 #1 r295491M).

I've built the VMs and a new NFS root with this; however, my binaries/ports are still built against 10.2 (not sure if that's the problem). If I still see the problem on 10.3 I'll grab what I can and then attempt to rebuild the ports against 10.3.
Comment 40 Robert Blayzor 2016-02-11 22:24:52 UTC
Just experienced the same issue on 10.3-BETA1 (FreeBSD 10.3-BETA1 #1 r295491M):

Failed to write core file for process dovecot (error 14)
pid 716 (dovecot), uid 0: exited on signal 11

ls -l /var/spool/tmp/dovecot.core
-rw-------  1 root  wheel  8925184 Feb 11 19:47 /var/spool/tmp/dovecot.core


#0  service_process_create (service=Cannot access memory at address 0x7fffffffb250
) at service-process.c:311
311	service-process.c: No such file or directory.
	in service-process.c
(gdb) bt full
#0  service_process_create (service=Cannot access memory at address 0x7fffffffb250
) at service-process.c:311
	uid_counter = 7828
	process = (struct service_process *) Cannot access memory at address 0x7fffffffb248
Current language:  auto; currently minimal
(gdb) x/i $rip
0x40dd08 <service_process_create+296>:	mov    %eax,-0x2c(%rbp)
(gdb) info registers
rax            0x4a41	19009
rbx            0x0	0
rcx            0x800c66f0a	34372742922
rdx            0x0	0
rsi            0x40d760	4249440
rdi            0x0	0
rbp            0x7fffffffb260	0x7fffffffb260
rsp            0x7fffffffb220	0x7fffffffb220
r8             0x2	2
r9             0x1	1
r10            0x2710	10000
r11            0x202	514
r12            0x7fffffffb4b8	140737488336056
r13            0x7fffffffb4e0	140737488336096
r14            0x7fffffffb4c0	140737488336064
r15            0x3	3
rip            0x40dd08	0x40dd08 <service_process_create+296>
eflags         0x10202	66050
cs             0x43	67
ss             0x3b	59
ds             0x0	0
es             0x0	0
fs             0x0	0
gs             0x0	0


procstat -v /var/spool/tmp/dovecot.core
  PID              START                END PRT  RES PRES REF SHD   FL TP PATH
  716           0x400000           0x417000 r-x   22   23   1   0 CN-- vn
  716           0x617000           0x618000 rw-    1    1   1   0 CN-- df
  716        0x800617000        0x800635000 r-x   30   30  34   0 CN-- vn /libexec/ld-elf.so.1
  716        0x800635000        0x800663000 rw-   36   36   1   0 CN-- df
  716        0x800834000        0x800836000 rw-    2    2   1   0 CN-- df
  716        0x800836000        0x800983000 r-x  226  232  38  19 CN-- vn /usr/local/lib/dovecot/libdovecot.so.0.0.0
  716        0x800983000        0x800b82000 ---    0    0   1   0 CN-- df
  716        0x800b82000        0x800b89000 rw-    7    0   1   0 CN-- vn /usr/local/lib/dovecot/libdovecot.so.0.0.0
  716        0x800b89000        0x800b8b000 rw-    2    2   1   0 CN-- df
  716        0x800b8b000        0x800d03000 r-x  376  387  66  32 CN-- vn /lib/libc.so.7
  716        0x800d03000        0x800f02000 ---    0    0   1   0 CN-- df
  716        0x800f02000        0x800f0e000 rw-   12    0   1   0 CN-- vn /lib/libc.so.7
  716        0x800f0e000        0x800f38000 rw-   13   13   1   0 CN-- df
  716        0x801000000        0x801800000 rw-   94   94   1   0 CN-- df
  716     0x7ffffffdf000     0x7ffffffff000 rw-    8    0   1   0 C--D df
  716     0x7ffffffff000     0x800000000000 r-x    1    1  36   0 ---- ph
Comment 41 Robert Blayzor 2016-02-13 12:47:08 UTC
And here is a recent core (that actually dumped in full) for Exim:


#0  _thr_signal_postfork_child () at /.amd_mnt/juno/pub/FreeBSD/10-STABLE/lib/libthr/thread/thr_sig.c:508
508	/.amd_mnt/juno/pub/FreeBSD/10-STABLE/lib/libthr/thread/thr_sig.c: No such file or directory.
	in /.amd_mnt/juno/pub/FreeBSD/10-STABLE/lib/libthr/thread/thr_sig.c
[New Thread 803006400 (LWP 100045/<unknown>)]
(gdb) bt full
#0  _thr_signal_postfork_child () at /.amd_mnt/juno/pub/FreeBSD/10-STABLE/lib/libthr/thread/thr_sig.c:508
No locals.
#1  0x000000080120706d in __thr_fork () at /.amd_mnt/juno/pub/FreeBSD/10-STABLE/lib/libthr/thread/thr_fork.c:203
	rtld_locks = {-4800, 32767, -4760, 32767, -4792, 32767, 3, 0}
	curthread = (struct pthread *) 0x803006400
	cancelsave = 0
	ret = 0
	errsave = 4
#2  0x0000000000428c54 in handle_smtp_call (listen_sockets=0x803065140, listen_socket_count=4, accept_socket=7, accepted=0x7fffffffcd38) at daemon.c:355
	pid = 32767
	interface_sockaddr = {v4 = {sin_len = 28 '\034', sin_family = 28 '\034', sin_port = 6400, sin_addr = {s_addr = 0},
    sin_zero = 0x7fffffffc9a0 "&\aðX\001\020"}, v6 = {sin6_len = 28 '\034', sin6_family = 28 '\034', sin6_port = 6400, sin6_flowinfo = 0, sin6_addr = {
      __u6_addr = {__u6_addr8 = 0x7fffffffc9a0 "&\aðX\001\020", __u6_addr16 = 0x7fffffffc9a0, __u6_addr32 = 0x7fffffffc9a0}}, sin6_scope_id = 0}, v0 = {
    sa_len = 28 '\034', sa_family = 28 '\034', sa_data = 0x7fffffffc99a ""}}
	ifsize = 28
	dup_accept_socket = 9
	max_for_this_host = 10
	wfsize = 100
	wfptr = 22
	use_log_write_selector = 1980
	whofrom = (uschar *) 0x803065e30 "[2607:f058:110:2::f:0]"
	reset_point = (void *) 0x803065e00
#3  0x0000000000427c55 in daemon_go () at daemon.c:2040
	accept_socket = 7
	sk = 0
	max_socket = 6
	lcount = 0
	select_errno = 4
	select_failed = 0
	select_listen = {__fds_bits = 0x7fffffffcc40}
	accepted = {sin6_len = 28 '\034', sin6_family = 28 '\034', sin6_port = 24059, sin6_flowinfo = 0, sin6_addr = {__u6_addr = {
      __u6_addr8 = 0x7fffffffcd40 "&\aðX\001\020", __u6_addr16 = 0x7fffffffcd40, __u6_addr32 = 0x7fffffffcd40}}, sin6_scope_id = 0}
	len = 28
	pid = 56724
	pw = (struct passwd *) 0x802698328
	listen_sockets = (int *) 0x803065140
	listen_socket_count = 4
	addresses = (ip_address_item *) 0x803065040
	last_connection_time = 1455363261
#4  0x000000000044af25 in main (argc=3, cargv=0x7fffffffed48) at exim.c:4719
	rsopts = 0x761b00
	argv = (uschar **) 0x7fffffffed48
	arg_receive_timeout = -1
	arg_smtp_receive_timeout = -1
	arg_error_handling = 0
	filter_sfd = -1
	filter_ufd = -1
	group_count = 1
	i = 0
---Type <return> to continue, or q <return> to quit---
	rv = 0
	list_queue_option = 0
	msg_action = 0
	msg_action_arg = -1
	namelen = 20
	queue_only_reason = 0
	perl_start_option = 0
	recipients_arg = 3
	sender_address_domain = 0
	test_retry_arg = -1
	test_rewrite_arg = -1
	arg_queue_only = 0
	bi_option = 0
	checking = 0
	count_queue = 0
	expansion_test = 0
	extract_recipients = 0
	flag_G = 0
	flag_n = 0
	forced_delivery = 0
	f_end_dot = 0
	deliver_give_up = 0
	list_queue = 0
	list_options = 0
	local_queue_only = 7868416
	more = 1
	one_msg_action = 0
	queue_only_set = 0
	receiving_message = 0
	sender_ident_set = 0
	session_local_queue_only = -5288
	unprivileged = 0
	removed_privilege = 0
	usage_wanted = 0
	verify_address_mode = 0
	verify_as_sender = 0
	version_printed = 0
	alias_arg = (uschar *) 0x0
	called_as = (uschar *) 0x554616 ""
	cmdline_syslog_name = (uschar *) 0x0
	start_queue_run_id = (uschar *) 0x0
	stop_queue_run_id = (uschar *) 0x0
	expansion_test_message = (uschar *) 0x0
	ftest_domain = (uschar *) 0x0
	ftest_localpart = (uschar *) 0x0
	ftest_prefix = (uschar *) 0x0
	ftest_suffix = (uschar *) 0x0
	log_oneline = (uschar *) 0x0
	malware_test_file = (uschar *) 0x0
---Type <return> to continue, or q <return> to quit---
	real_sender_address = (uschar *) 0x7fffffffeb58 " \005v"
	originator_home = (uschar *) 0x803065018 "/root"
	sz = 140737488350168
	reset_point = (void *) 0x80231a310
	pw = (struct passwd *) 0x802698328
	statbuf = {st_dev = 1895890688, st_ino = 5, st_mode = 8576, st_nlink = 1, st_uid = 0, st_gid = 0, st_rdev = 5, st_atim = {tv_sec = 1455201710,
    tv_nsec = 90063000}, st_mtim = {tv_sec = 1455201749, tv_nsec = 0}, st_ctim = {tv_sec = 1455201749, tv_nsec = 0}, st_size = 0, st_blocks = 0,
  st_blksize = 4096, st_flags = 0, st_gen = 0, st_lspare = 0, st_birthtim = {tv_sec = -1, tv_nsec = 0}}
	passed_qr_pid = 0
	passed_qr_pipe = -1
	group_list = 0x7fffffffd8f0
	info_flag = CMDINFO_NONE
	info_stdout = 0
Current language:  auto; currently minimal
Comment 42 Robert Blayzor 2016-02-15 16:46:15 UTC
Caught full Exim and Dovecot core dumps on the same VM, a couple of hours apart...


#0  handle_smtp_call (listen_sockets=0x803065140, listen_socket_count=4, accept_socket=7,
    accepted=0x7fffffffcd38) at daemon.c:375
375	daemon.c: No such file or directory.
	in daemon.c
[New Thread 803006400 (LWP 100155/<unknown>)]
(gdb) bt full
#0  handle_smtp_call (listen_sockets=0x803065140, listen_socket_count=4, accept_socket=7,
    accepted=0x7fffffffcd38) at daemon.c:375
	i = 32767
	queue_only_reason = 0
	old_pool = 0
	save_debug_selector = 0
	local_queue_only = 32767
	session_local_queue_only = -4800
	act = {__sigaction_u = {__sa_handler = 0, __sa_sigaction = 0}, sa_flags = 0, sa_mask = {
    __bits = 0x7fffffffc91c}}
	pid = 0
	interface_sockaddr = {v4 = {sin_len = 28 '\034', sin_family = 28 '\034', sin_port = 6400,
    sin_addr = {s_addr = 0}, sin_zero = 0x7fffffffc9a0 "&\aðX\001\020"}, v6 = {
    sin6_len = 28 '\034', sin6_family = 28 '\034', sin6_port = 6400, sin6_flowinfo = 0,
    sin6_addr = {__u6_addr = {__u6_addr8 = 0x7fffffffc9a0 "&\aðX\001\020",
        __u6_addr16 = 0x7fffffffc9a0, __u6_addr32 = 0x7fffffffc9a0}}, sin6_scope_id = 0}, v0 = {
    sa_len = 28 '\034', sa_family = 28 '\034', sa_data = 0x7fffffffc99a ""}}
	ifsize = 28
	dup_accept_socket = 9
	max_for_this_host = 10
	wfsize = 100
	wfptr = 22
	use_log_write_selector = 1980
	whofrom = (uschar *) 0x803065e30 "[2607:f058:110:2::f:0]"
	reset_point = (void *) 0x803065e00
#1  0x0000000000427c55 in daemon_go () at daemon.c:2040
	accept_socket = 7
	sk = 0
	max_socket = 6
	lcount = 0
	select_errno = 4
	select_failed = 0
	select_listen = {__fds_bits = 0x7fffffffcc40}
	accepted = {sin6_len = 28 '\034', sin6_family = 28 '\034', sin6_port = 46531,
  sin6_flowinfo = 0, sin6_addr = {__u6_addr = {__u6_addr8 = 0x7fffffffcd40 "&\aðX\001\020",
      __u6_addr16 = 0x7fffffffcd40, __u6_addr32 = 0x7fffffffcd40}}, sin6_scope_id = 0}
	len = 28
	pid = 53159
	pw = (struct passwd *) 0x802698328
	listen_sockets = (int *) 0x803065140
	listen_socket_count = 4
	addresses = (ip_address_item *) 0x803065040
	last_connection_time = 1455463157
#2  0x000000000044af25 in main (argc=3, cargv=0x7fffffffed48) at exim.c:4719
	rsopts = 0x761b00
	argv = (uschar **) 0x7fffffffed48
	arg_receive_timeout = -1
	arg_smtp_receive_timeout = -1
	arg_error_handling = 0
---Type <return> to continue, or q <return> to quit---
	filter_sfd = -1
	filter_ufd = -1
	group_count = 1
	i = 0
	rv = 0
	list_queue_option = 0
	msg_action = 0
	msg_action_arg = -1
	namelen = 20
	queue_only_reason = 0
	perl_start_option = 0
	recipients_arg = 3
	sender_address_domain = 0
	test_retry_arg = -1
	test_rewrite_arg = -1
	arg_queue_only = 0
	bi_option = 0
	checking = 0
	count_queue = 0
	expansion_test = 0
	extract_recipients = 0
	flag_G = 0
	flag_n = 0
	forced_delivery = 0
	f_end_dot = 0
	deliver_give_up = 0
	list_queue = 0
	list_options = 0
	local_queue_only = 7868416
	more = 1
	one_msg_action = 0
	queue_only_set = 0
	receiving_message = 0
	sender_ident_set = 0
	session_local_queue_only = -5288
	unprivileged = 0
	removed_privilege = 0
	usage_wanted = 0
	verify_address_mode = 0
	verify_as_sender = 0
	version_printed = 0
	alias_arg = (uschar *) 0x0
	called_as = (uschar *) 0x554616 ""
	cmdline_syslog_name = (uschar *) 0x0
	start_queue_run_id = (uschar *) 0x0
	stop_queue_run_id = (uschar *) 0x0
	expansion_test_message = (uschar *) 0x0
	ftest_domain = (uschar *) 0x0
	ftest_localpart = (uschar *) 0x0
---Type <return> to continue, or q <return> to quit---
	ftest_prefix = (uschar *) 0x0
	ftest_suffix = (uschar *) 0x0
	log_oneline = (uschar *) 0x0
	malware_test_file = (uschar *) 0x0
	real_sender_address = (uschar *) 0x7fffffffeb58 " \005v"
	originator_home = (uschar *) 0x803065018 "/root"
	sz = 140737488350168
	reset_point = (void *) 0x80231a310
	pw = (struct passwd *) 0x802698328
	statbuf = {st_dev = 1895890688, st_ino = 5, st_mode = 8576, st_nlink = 1, st_uid = 0,
  st_gid = 0, st_rdev = 5, st_atim = {tv_sec = 1455201710, tv_nsec = 90063000}, st_mtim = {
    tv_sec = 1455201749, tv_nsec = 0}, st_ctim = {tv_sec = 1455201749, tv_nsec = 0}, st_size = 0,
  st_blocks = 0, st_blksize = 4096, st_flags = 0, st_gen = 0, st_lspare = 0, st_birthtim = {
    tv_sec = -1, tv_nsec = 0}}
	passed_qr_pid = 0
	passed_qr_pipe = -1
	group_list = 0x7fffffffd8f0
	info_flag = CMDINFO_NONE
	info_stdout = 0
Current language:  auto; currently minimal
(gdb) x/i $rip
0x428c9e <handle_smtp_call+2222>:	mov    %eax,0x764b34
(gdb) info registers
rax            0x7bc	1980
rbx            0x0	0
rcx            0x4	4
rdx            0x1	1
rsi            0x8030064e8	34410095848
rdi            0x3	3
rbp            0x7fffffffc9d0	0x7fffffffc9d0
rsp            0x7fffffffc830	0x7fffffffc830
r8             0x0	0
r9             0xfffff8003b2a1000	-8795100409856
r10            0x1	1
r11            0x246	582
r12            0x7fffffffed40	140737488350528
r13            0x7fffffffed68	140737488350568
r14            0x7fffffffed48	140737488350536
r15            0x3	3
rip            0x428c9e	0x428c9e <handle_smtp_call+2222>
eflags         0x10202	66050
cs             0x43	67
ss             0x3b	59
ds             0x0	0
es             0x0	0
fs             0x0	0
gs             0x0	0


procstat -v exim.core
  PID              START                END PRT  RES PRES REF SHD   FL TP PATH
53547           0x400000           0x560000 r-x  324  637   3   1 CN-- vn /usr/local/sbin/exim
53547           0x760000           0x76a000 rw-   10    0   1   0 C--- vn /usr/local/sbin/exim
53547           0x76a000           0x779000 rw-   15    0   1   0 C--- df
53547        0x800760000        0x80077e000 r-x   30   32  35   0 CN-- vn /libexec/ld-elf.so.1
53547        0x80077e000        0x800785000 rw-    7    7   2   0 CN-- df
53547        0x800787000        0x8007b7000 rw-   48   48   2   1 CN-- df
53547        0x80097d000        0x80097f000 rw-    2    2   2   0 CN-- df
53547        0x80097f000        0x80098d000 r-x   14   16   7   3 CN-- vn
53547        0x80098d000        0x800b8d000 ---    0    0   2   0 CN-- df
53547        0x800b8d000        0x800b8e000 rw-    1    0   2   0 CN-- vn
53547        0x800b8e000        0x800b9f000 rw-    0    0   0   0 ---- --
53547        0x800b9f000        0x800bc8000 r-x   41   46   9   4 CN-- vn
53547        0x800bc8000        0x800dc7000 ---    0    0   2   0 CN-- df
53547        0x800dc7000        0x800dc8000 rw-    1    0   2   0 CN-- vn
53547        0x800dc8000        0x800dd8000 r-x   16   18  19   9 CN-- vn
53547        0x800dd8000        0x800fd8000 ---    0    0   2   0 CN-- df
53547        0x800fd8000        0x800fd9000 rw-    1    0   2   0 CN-- vn
53547        0x800fd9000        0x800fdb000 rw-    0    0   0   0 ---- --
53547        0x800fdb000        0x800ff3000 r-x   20   23   3   1 CN-- vn
53547        0x800ff3000        0x8011f3000 ---    0    0   2   0 CN-- df
53547        0x8011f3000        0x8011f5000 rw-    2    0   2   0 CN-- vn
53547        0x8011f5000        0x80120d000 r-x   24   26  20   9 CN-- vn /lib/libthr.so.3
53547        0x80120d000        0x80140d000 ---    0    0   2   0 CN-- df
53547        0x80140d000        0x80140e000 rw-    1    0   1   0 C--- vn /lib/libthr.so.3
53547        0x80140e000        0x80141a000 rw-   12    0   1   0 C--- df
53547        0x80141a000        0x801445000 r-x   41   45   5   2 CN-- vn
53547        0x801445000        0x801645000 ---    0    0   2   0 CN-- df
53547        0x801645000        0x801648000 rw-    3    0   2   0 CN-- vn
53547        0x801648000        0x80181d000 r-x  425  434   3   1 CN-- vn
53547        0x80181d000        0x801a1c000 ---    0    0   2   0 CN-- df
53547        0x801a1c000        0x801a25000 rw-    9    0   2   0 CN-- vn
53547        0x801a25000        0x801a88000 r-x   87   98  27  13 CN-- vn /usr/lib/libssl.so.7
53547        0x801a88000        0x801c88000 ---    0    0   2   0 CN-- df
53547        0x801c88000        0x801c92000 rw-   10    0   2   0 CN-- vn /usr/lib/libssl.so.7
53547        0x801c92000        0x801e5f000 r-x  461  501  31  15 CN-- vn /lib/libcrypto.so.7
53547        0x801e5f000        0x80205f000 ---    0    0   2   0 CN-- df
53547        0x80205f000        0x802087000 rw-   40    0   2   0 CN-- vn /lib/libcrypto.so.7
53547        0x802087000        0x802089000 rw-    2    2   2   0 CN-- df
53547        0x802089000        0x8020fe000 r-x   45   46   8   3 CN-- vn
53547        0x8020fe000        0x8022fd000 ---    0    0   2   0 CN-- df
53547        0x8022fd000        0x8022fe000 rw-    1    0   2   0 CN-- vn
53547        0x8022fe000        0x802476000 r-x  376  390  68  33 CN-- vn /lib/libc.so.7
53547        0x802476000        0x802675000 ---    0    0   2   0 CN-- df
53547        0x802675000        0x802681000 rw-   12    0   1   0 C--- vn /lib/libc.so.7
53547        0x802681000        0x8026ab000 rw-   42    0   1   0 C--- df
53547        0x8026ab000        0x8026b5000 r-x   10   14  10   4 CN-- vn /usr/local/lib/libintl.so.8.1.4
53547        0x8026b5000        0x8028b5000 ---    0    0   2   0 CN-- df
53547        0x8028b5000        0x8028b6000 rw-    1    0   2   0 CN-- vn /usr/local/lib/libintl.so.8.1.4
53547        0x802c00000        0x803400000 rw-  512    0   1   0 C--- df
53547     0x7fffdfffe000     0x7fffdffff000 ---    0    0   0   0 ---- --
53547     0x7ffffffdf000     0x7ffffffff000 rw-   10    0   1   0 C--D df
53547     0x7ffffffff000     0x800000000000 r-x    1    1  37   0 ---- ph
Comment 43 Robert Blayzor 2016-02-15 16:47:32 UTC
#0  t_pop () at data-stack.c:275
275	data-stack.c: No such file or directory.
	in data-stack.c
(gdb) bt full
#0  t_pop () at data-stack.c:275
	frame_block = (struct stack_frame_block *) 0x8014b1378
#1  0x000000080090e584 in io_loop_call_io (io=0x80144e4c0) at ioloop.c:560
	ioloop = (struct ioloop *) 0x801421080
	t_id = 2
#2  0x0000000800911497 in io_loop_handler_run_internal (ioloop=0x801421080) at ioloop-kqueue.c:151
	ctx = (struct ioloop_handler_context *) 0x801419880
	events = (struct kevent *) 0x8014b9000
	event = (const struct kevent *) 0x8014b9000
	tv = {tv_sec = 43499, tv_usec = 883196}
	ts = {tv_sec = 43499, tv_nsec = 883196000}
	io = (struct io_file *) 0x80144e4c0
	events_count = 67
	ret = 1
	i = 0
#3  0x000000080090eb0e in io_loop_handler_run (ioloop=0x801421080) at ioloop.c:607
No locals.
#4  0x000000080090e92f in io_loop_run (ioloop=0x801421080) at ioloop.c:583
No locals.
#5  0x000000080086a8bb in master_service_run (service=0x80141b140, callback=0)
    at master-service.c:640
No locals.
#6  0x0000000000406558 in main (argc=3, argv=0x7fffffffb4c0) at main.c:888
	set = (struct master_settings *) 0x801429110
	error = 0x800ca9283 "H\203Ä ]Ã\017\037\200"
	doveconf_arg = 0x0
	orig_info_callback = (failure_callback_t *) 0x8008ee880 <i_syslog_error_handler>
	orig_debug_callback = (failure_callback_t *) 0x8008ee880 <i_syslog_error_handler>
	foreground = false
	ask_key_pass = false
	i = 3
	c = -1
	doubleopts = 0x7fffffffb3a0
Current language:  auto; currently minimal
(gdb) x/i $rip
0x8008eb306 <t_pop+86>:	mov    %rax,0x29d523(%rip)        # 0x800b88830 <current_block>
(gdb) info registers
rax            0x801406000	34380734464
rbx            0x0	0
rcx            0x80140f280	34380771968
rdx            0x0	0
rsi            0x7fffffffb198	140737488335256
rdi            0x0	0
rbp            0x7fffffffb280	0x7fffffffb280
rsp            0x7fffffffb260	0x7fffffffb260
r8             0x0	0
r9             0xfffffe007c476718	-2196938201320
r10            0x0	0
r11            0x246	582
r12            0x7fffffffb4b8	140737488336056
r13            0x7fffffffb4e0	140737488336096
r14            0x7fffffffb4c0	140737488336064
r15            0x3	3
rip            0x8008eb306	0x8008eb306 <t_pop+86>
eflags         0x10246	66118
cs             0x43	67
ss             0x3b	59
ds             0x0	0
es             0x0	0
fs             0x0	0
gs             0x0	0



procstat -v dovecot.core
  PID              START                END PRT  RES PRES REF SHD   FL TP PATH
  716           0x400000           0x417000 r-x   19   19   1   0 CN-- vn
  716           0x617000           0x618000 rw-    1    1   1   0 CN-- df
  716        0x800617000        0x800635000 r-x   30   32  32   0 CN-- vn /libexec/ld-elf.so.1
  716        0x800635000        0x800663000 rw-   36   36   1   0 C--- df
  716        0x800834000        0x800836000 rw-    2    2   1   0 CN-- df
  716        0x800836000        0x800983000 r-x  213  219  34  17 CN-- vn /usr/local/lib/dovecot/libdovecot.so.0.0.0
  716        0x800983000        0x800b82000 ---    0    0   1   0 CN-- df
  716        0x800b82000        0x800b89000 rw-    7    0   1   0 C--- vn /usr/local/lib/dovecot/libdovecot.so.0.0.0
  716        0x800b89000        0x800b8b000 rw-    2    2   1   0 CN-- df
  716        0x800b8b000        0x800d03000 r-x  376  390  63  31 CN-- vn /lib/libc.so.7
  716        0x800d03000        0x800f02000 ---    0    0   1   0 CN-- df
  716        0x800f02000        0x800f0e000 rw-   12    0   1   0 CN-- vn /lib/libc.so.7
  716        0x800f0e000        0x800f38000 rw-   13   13   1   0 CN-- df
  716        0x801000000        0x801800000 rw-   94   94   1   0 C--- df
  716     0x7ffffffdf000     0x7ffffffff000 rw-    8    8   1   0 C--D df
  716     0x7ffffffff000     0x800000000000 r-x    1    1  34   0 ---- ph
Comment 44 Robert Blayzor 2016-02-21 12:36:00 UTC
 FreeBSD 10.3-BETA1 #1 r295491M:

Failed to write core file for process exim (error 14)
pid 743 (exim), uid 26: exited on signal 11


#0  __thr_fork () at /.amd_mnt/juno/pub/FreeBSD/10-STABLE/lib/libthr/thread/thr_fork.c:186
186	/.amd_mnt/juno/pub/FreeBSD/10-STABLE/lib/libthr/thread/thr_fork.c: No such file or directory.
	in /.amd_mnt/juno/pub/FreeBSD/10-STABLE/lib/libthr/thread/thr_fork.c
[New Thread 803006400 (LWP 100046/<unknown>)]
(gdb) bt full
#0  __thr_fork () at /.amd_mnt/juno/pub/FreeBSD/10-STABLE/lib/libthr/thread/thr_fork.c:186
	rtld_locks = Cannot access memory at address 0x7fffffffc7d0
Current language:  auto; currently minimal
(gdb) x/i $rip
0x801206f3e <__thr_fork+190>:	callq  0x8012090e0 <__error_threaded>
(gdb) info registers
rax            0x17e44	97860
rbx            0x0	0
rcx            0x80238f70a	34397026058
rdx            0x0	0
rsi            0x801209658	34378651224
rdi            0x2	2
rbp            0x7fffffffc820	0x7fffffffc820
rsp            0x7fffffffc7c0	0x7fffffffc7c0
r8             0x2	2
r9             0x2	2
r10            0x2710	10000
r11            0x246	582
r12            0x803006400	34410095616
r13            0x76b5b0	7779760
r14            0x17e44	97860
r15            0x0	0
rip            0x801206f3e	0x801206f3e <__thr_fork+190>
eflags         0x10246	66118
cs             0x43	67
ss             0x3b	59
ds             0x0	0
es             0x0	0
fs             0x0	0
gs             0x0	0


procstat -f exim.core
  PID COMM               FD T V FLAGS     REF  OFFSET PRO NAME
  743 exim              text v r r-------  -       - -   /usr/local/sbin/exim
  743 exim               cwd v d r-------  -       - -   /var/spool/exim
  743 exim              root v d r-------  -       - -   /
  743 exim                 0 v c rw------  3       0 -   /dev/null
  743 exim                 1 v c rw------  3       0 -   /dev/null
  743 exim                 2 v c rw------  3       0 -   /dev/null
  743 exim                 3 s - rw------  1       0 TCP ::.25 ::.0
  743 exim                 4 s - rw------  1       0 TCP 0.0.0.0:25 0.0.0.0:0
  743 exim                 5 s - rw------  2       0 TCP 2607:f058:110:1::1:2.25 2607:f058:110:1::f:1.38009
  743 exim                 6 s - rw---n--  1       0 UDP 0.0.0.0:863 0.0.0.0:0
  743 exim                 7 s - rw------  2       0 TCP 2607:f058:110:1::1:2.25 2607:f058:110:1::f:1.38009
  743 exim                 8 s - rw------  1       0 UDD /var/run/log
Comment 45 Robert Blayzor 2016-02-21 21:36:43 UTC
#0  0x000000080120716f in __thr_fork ()
    at /.amd_mnt/juno/pub/FreeBSD/10-STABLE/lib/libthr/thread/thr_fork.c:257
257	/.amd_mnt/juno/pub/FreeBSD/10-STABLE/lib/libthr/thread/thr_fork.c: No such file or directory.
	in /.amd_mnt/juno/pub/FreeBSD/10-STABLE/lib/libthr/thread/thr_fork.c
Cannot find new threads: generic error
(gdb) bt full
#0  0x000000080120716f in __thr_fork ()
    at /.amd_mnt/juno/pub/FreeBSD/10-STABLE/lib/libthr/thread/thr_fork.c:257
	rtld_locks = Cannot access memory at address 0x7fffffffc7d0
Current language:  auto; currently minimal
(gdb) x/i $rip
0x80120716f <__thr_fork+751>:	mov    %ecx,(%rax)
(gdb) info registers
rax            0x8026a9358	34400277336
rbx            0x0	0
rcx            0x4	4
rdx            0x1	1
rsi            0x8030064e8	34410095848
rdi            0x3	3
rbp            0x7fffffffc820	0x7fffffffc820
rsp            0x7fffffffc7c0	0x7fffffffc7c0
r8             0x0	0
r9             0xfffff800484e49e0	-8794879931936
r10            0x1	1
r11            0x246	582
r12            0x803006400	34410095616
r13            0x76b5b0	7779760
r14            0x6f3f	28479
r15            0x0	0
rip            0x80120716f	0x80120716f <__thr_fork+751>
eflags         0x10202	66050
cs             0x43	67
ss             0x3b	59
ds             0x0	0
es             0x0	0
fs             0x0	0
gs             0x0	0
(gdb)


procstat -v exim.core
  PID              START                END PRT  RES PRES REF SHD   FL TP PATH
99063           0x400000           0x560000 r-x  313  626   2   1 CN-- vn /usr/local/sbin/exim
99063           0x760000           0x76a000 rw-   10    0   1   0 CN-- vn /usr/local/sbin/exim
99063           0x76a000           0x779000 rw-    5    5   1   0 CN-- df
99063        0x800760000        0x80077e000 r-x   30   32  31   0 CN-- vn /libexec/ld-elf.so.1
99063        0x80077e000        0x800785000 rw-    7    7   1   0 CN-- df
99063        0x800787000        0x8007b7000 rw-   40   40   1   0 CN-- df
99063        0x80097d000        0x80097f000 rw-    2    2   1   0 CN-- df
99063        0x80097f000        0x80098d000 r-x   14   16  11   3 CN-- vn
99063        0x80098d000        0x800b8d000 ---    0    0   1   0 CN-- df
99063        0x800b8d000        0x800b8e000 rw-    1    0   1   0 CN-- vn
99063        0x800b8e000        0x800b9f000 rw-    0    0   0   0 ---- --
99063        0x800b9f000        0x800bc8000 r-x   41   46  18   6 CN-- vn
99063        0x800bc8000        0x800dc7000 ---    0    0   1   0 CN-- df
99063        0x800dc7000        0x800dc8000 rw-    1    0   1   0 CN-- vn
99063        0x800dc8000        0x800dd8000 r-x   16   18  25  10 CN-- vn
99063        0x800dd8000        0x800fd8000 ---    0    0   1   0 CN-- df
99063        0x800fd8000        0x800fd9000 rw-    1    0   1   0 CN-- vn
99063        0x800fd9000        0x800fdb000 rw-    0    0   0   0 ---- --
99063        0x800fdb000        0x800ff3000 r-x   24   27   2   1 CN-- vn
99063        0x800ff3000        0x8011f3000 ---    0    0   1   0 CN-- df
99063        0x8011f3000        0x8011f5000 rw-    2    0   1   0 CN-- vn
99063        0x8011f5000        0x80120d000 r-x   24   26  23   8 CN-- vn /lib/libthr.so.3
99063        0x80120d000        0x80140d000 ---    0    0   1   0 CN-- df
99063        0x80140d000        0x80140e000 rw-    1    0   1   0 CN-- vn /lib/libthr.so.3
99063        0x80140e000        0x80141a000 rw-   12   12   1   0 C--- df
99063        0x80141a000        0x801445000 r-x   39   43   2   1 CN-- vn
99063        0x801445000        0x801645000 ---    0    0   1   0 CN-- df
99063        0x801645000        0x801648000 rw-    3    0   1   0 CN-- vn
99063        0x801648000        0x80181d000 r-x  429  438   9   2 CN-- vn /usr/local/lib/perl5/5.22/mach/CORE/libperl.so.5.22.1
99063        0x80181d000        0x801a1c000 ---    0    0   1   0 CN-- df
99063        0x801a1c000        0x801a25000 rw-    9    0   1   0 CN-- vn /usr/local/lib/perl5/5.22/mach/CORE/libperl.so.5.22.1
99063        0x801a25000        0x801a88000 r-x   81   92   8   4 CN-- vn /usr/lib/libssl.so.7
99063        0x801a88000        0x801c88000 ---    0    0   1   0 CN-- df
99063        0x801c88000        0x801c92000 rw-   10    0   1   0 CN-- vn /usr/lib/libssl.so.7
99063        0x801c92000        0x801e5f000 r-x  461  501  12   6 CN-- vn /lib/libcrypto.so.7
99063        0x801e5f000        0x80205f000 ---    0    0   1   0 CN-- df
99063        0x80205f000        0x802087000 rw-   40    0   1   0 CN-- vn /lib/libcrypto.so.7
99063        0x802087000        0x802089000 rw-    1    1   1   0 CN-- df
99063        0x802089000        0x8020fe000 r-x   43   44   7   3 CN-- vn
99063        0x8020fe000        0x8022fd000 ---    0    0   1   0 CN-- df
99063        0x8022fd000        0x8022fe000 rw-    1    0   1   0 CN-- vn
99063        0x8022fe000        0x802476000 r-x  376  390  55  24 CN-- vn /lib/libc.so.7
99063        0x802476000        0x802675000 ---    0    0   1   0 CN-- df
99063        0x802675000        0x802681000 rw-   12    0   1   0 CN-- vn /lib/libc.so.7
99063        0x802681000        0x8026ab000 rw-   11    0   1   0 C--- df
99063        0x8026ab000        0x8026b5000 r-x   10   14   7   3 CN-- vn /usr/local/lib/libintl.so.8.1.4
99063        0x8026b5000        0x8028b5000 ---    0    0   1   0 CN-- df
99063        0x8028b5000        0x8028b6000 rw-    1    0   1   0 CN-- vn /usr/local/lib/libintl.so.8.1.4
99063        0x802c00000        0x803400000 rw-  222  222   1   0 C--- df
99063     0x7fffdfffe000     0x7fffdffff000 ---    0    0   0   0 ---- --
99063     0x7ffffffdf000     0x7ffffffff000 rw-   10   10   1   0 C--D df
99063     0x7ffffffff000     0x800000000000 r-x    1    1  33   0 ---- ph
Comment 46 Konstantin Belousov freebsd_committer freebsd_triage 2016-02-22 14:02:37 UTC
(In reply to Robert Blayzor from comment #43)
So are your binaries and libs executed from an NFS mount?  Could you try mounting them statically instead of using the automounter?  Even better, try with a local filesystem.

Another thing to do is to set the sysctl machdep.uprintf_signal to 1.  I would need the matching kernel output on the failure and the procstat -v output from the (corrupted) core.

Do you run ntpd?  Do failures continue to appear if you stop both ntpd and the automounter?
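
A minimal sketch of collecting that data (the core file path is just an example):

  sysctl machdep.uprintf_signal=1           # have the kernel report signal details on each fault
  procstat -v /var/spool/tmp/dovecot.core   # memory map recorded in the core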
Comment 47 Robert Blayzor 2016-02-22 14:08:05 UTC
I'm not using the automounter on the VMs where I'm seeing this failure.

They do, however, have the root file system mounted over NFS at boot (NFS_ROOT, diskless boot).

The entire root file system is mounted read-only and includes the entire base system (including /usr/local).

The only other thing that's created is a memory disk for /var and /etc, plus any NFS file systems used for mail.


I do run ntpd (from ports) on each VM. I can try stopping ntpd to see if that makes a difference, but knowing VM hosts, the clocks may drift quite a bit.

Will disable ntpd and report back.
Comment 48 Konstantin Belousov freebsd_committer freebsd_triage 2016-02-22 14:36:30 UTC
(In reply to Robert Blayzor from comment #47)
The uprintf_signal output is the more interesting part.

Wait, a VM?  Which hypervisor?  Is it possible to reproduce the issue on bare metal?
Comment 49 Robert Blayzor 2016-02-22 14:59:11 UTC
VMware ESXi 5.1u2.

I have *NOT* seen this behavior on other VMs created on the same set of ESXi hosts. The common difference seems to be that these are set up as "diskless" boot VMs whose root filesystem is mounted read-only via NFS (NetApp).

I have bare-metal hosts running 10.2 with no problem.

At the same time, I have VMs with a local disk (vmdk) that are fine, and they run on the same physical ESXi hosts.

I'll set the machdep.uprintf_signal sysctl as well.
Comment 50 Robert Blayzor 2016-02-23 22:46:27 UTC
Different process, similar crash... this time the core file did dump. Note: the "amd" mount you see is just the machine I compiled the binaries on (buildworld). No automounter is used on the diskless boot VMs.


pid 63755 (managesieve-login), uid 0: exited on signal 11 (core dumped)


#0  memset () at /.amd_mnt/juno/pub/FreeBSD/10-STABLE/lib/libc/amd64/string/memset.S:51
51	/.amd_mnt/juno/pub/FreeBSD/10-STABLE/lib/libc/amd64/string/memset.S: No such file or directory.
	in /.amd_mnt/juno/pub/FreeBSD/10-STABLE/lib/libc/amd64/string/memset.S
(gdb) bt full
#0  memset () at /.amd_mnt/juno/pub/FreeBSD/10-STABLE/lib/libc/amd64/string/memset.S:51
No locals.
#1  0x00000000106189e0 in map_object (fd=9, path=0x1062a260 "/lib/libc.so.7", sb=0x7fffffffd920)
    at /.amd_mnt/juno/pub/FreeBSD/10-STABLE/libexec/rtld-elf/map_object.c:242
	phsize = <value optimized out>
	stack_flags = Cannot access memory at address 0x7
Current language:  auto; currently asm
(gdb) x/i $rip
0x1061c524 <memset+68>:	rep stos %rax,%es:(%rdi)
(gdb) info registers
rax            0x0	0
rbx            0x1	1
rcx            0x4a	74
rdx            0x0	0
rsi            0x0	0
rdi            0x11787db0	293109168
rbp            0x7fffffffd8f0	0x7fffffffd8f0
rsp            0x7fffffffd7a8	0x7fffffffd7a8
r8             0x250	592
r9             0xfffffe00945fd6b8	-2196533946696
r10            0x0	0
r11            0x11787db0	293109168
r12            0x9	9
r13            0x1063a000	274964480
r14            0x1062a260	274899552
r15            0x7fffffffd920	140737488345376
rip            0x1061c524	0x1061c524 <memset+68>
eflags         0x10202	66050
cs             0x43	67
ss             0x3b	59
ds             0x0	0
es             0x0	0
fs             0x0	0
gs             0x0	0


procstat -v managesieve-login.core
  PID              START                END PRT  RES PRES REF SHD   FL TP PATH
63755           0x400000           0x40b000 r-x    7    8   1   0 CN-- vn /usr/local/libexec/dovecot/managesieve-login
63755           0x60b000           0x60c000 rw-    1    1   1   0 ---- df
63755         0x1060b000         0x10629000 r-x   30   30  32   0 CN-- vn /libexec/ld-elf.so.1
63755         0x10629000         0x1063a000 rw-   11   11   1   0 ---- df
63755         0x1063a000         0x1063b000 r--    1  387  64  31 CN-- vn /lib/libc.so.7
63755         0x10828000         0x1082a000 rw-    2    2   1   0 ---- df
63755         0x1082a000         0x1084b000 r-x   27   28  18   9 CN-- vn /usr/local/lib/dovecot/libdovecot-login.so.0.0.0
63755         0x1084b000         0x10a4a000 ---    0    0   1   0 ---- df
63755         0x10a4a000         0x10a4c000 rw-    2    0   1   0 C--- vn /usr/local/lib/dovecot/libdovecot-login.so.0.0.0
63755         0x10a4c000         0x10aaf000 r-x   80   90  28  14 CN-- vn /usr/lib/libssl.so.7
63755         0x10aaf000         0x10caf000 ---    0    0   1   0 ---- df
63755         0x10caf000         0x10cb9000 rw-   10    0   1   0 C--- vn /usr/lib/libssl.so.7
63755         0x10cb9000         0x10e86000 r-x  461  501  30  15 CN-- vn /lib/libcrypto.so.7
63755         0x10e86000         0x11086000 ---    0    0   1   0 ---- df
63755         0x11086000         0x110ae000 rw-   40    0   1   0 C--- vn /lib/libcrypto.so.7
63755         0x110ae000         0x110b0000 rw-    0    0   0   0 ---- --
63755         0x110b0000         0x111fd000 r-x  213  219  36  18 CN-- vn /usr/local/lib/dovecot/libdovecot.so.0.0.0
63755         0x111fd000         0x113fc000 ---    0    0   1   0 ---- df
63755         0x113fc000         0x11403000 rw-    7    0   1   0 C--- vn /usr/local/lib/dovecot/libdovecot.so.0.0.0
63755         0x11403000         0x11405000 rw-    0    0   0   0 ---- --
63755         0x11405000         0x1157d000 r-x  376  387  64  31 CN-- vn /lib/libc.so.7
63755         0x1157d000         0x1177c000 ---    0    0   2   0 ---- df
63755         0x1177c000         0x11788000 rw-   12    0   1   0 C--- vn /lib/libc.so.7
63755         0x11788000         0x117b2000 ---    0    0   2   0 ---- df
63755     0x7ffffffdf000     0x7ffffffff000 rw-    3    3   1   0 ---D df
63755     0x7ffffffff000     0x800000000000 r-x    1    1  34   0 ---- ph
Comment 51 Konstantin Belousov freebsd_committer freebsd_triage 2016-02-24 12:18:05 UTC
(In reply to Robert Blayzor from comment #50)
This is the n-th copy of the same data.  I need to see the message from the kernel on the segfault; it is either printed on the controlling terminal of the process, or in the log if no controlling terminal was active.
Comment 52 Konstantin Belousov freebsd_committer freebsd_triage 2016-02-24 12:19:41 UTC
(In reply to Konstantin Belousov from comment #51)
Hm, no, we apparently do not fall back to printf() if no controlling terminal is assigned.  Please do the following: in sys/amd64/amd64/trap.c, near the end of the trap() function, in the fragment

	if (uprintf_signal) {
		uprintf("pid %d comm %s: signal %d err %lx code %d type %d "

replace uprintf() with printf().  The messages should appear in the dmesg.
Comment 53 Robert Blayzor 2016-02-24 12:30:57 UTC
I have made the change, installed the new kernel, and rebooted. I will report what dmesg shows the next time this happens.

I was merely supplying data to show that multiple different processes are crashing under what looks like a similar condition.

More to come...
Comment 54 Robert Blayzor 2016-02-29 11:14:47 UTC
pid 852 comm exim: signal 11 err 7 code 2 type 12 addr 0x7fffffffc7b8 rsp 0x7fffffffc7c0 rip 0x801206f3e <e8 9d 21 00 00 45 85 f6>
Failed to write core file for process exim (error 14)
pid 852 (exim), uid 26: exited on signal 11


procstat -v /var/spool/tmp/exim.core
  PID              START                END PRT  RES PRES REF SHD   FL TP PATH
  852           0x400000           0x560000 r-x  299  310   2   1 CN-- vn /usr/local/sbin/exim
  852           0x760000           0x76a000 rw-   10    0   1   0 CN-- vn /usr/local/sbin/exim
  852           0x76a000           0x779000 rw-    5    5   1   0 CN-- df
  852        0x800760000        0x80077e000 r-x   30   30  34   0 CN-- vn /libexec/ld-elf.so.1
  852        0x80077e000        0x800785000 rw-    7    7   1   0 CN-- df
  852        0x800787000        0x8007b7000 rw-   40   40   1   0 CN-- df
  852        0x80097d000        0x80097f000 rw-    2    2   1   0 CN-- df
  852        0x80097f000        0x80098d000 r-x    8    9   6   3 CN-- vn
  852        0x80098d000        0x800b8d000 ---    0    0   1   0 CN-- df
  852        0x800b8d000        0x800b8e000 rw-    1    0   1   0 CN-- vn
  852        0x800b8e000        0x800b9f000 rw-    0    0   0   0 ---- --
  852        0x800b9f000        0x800bc8000 r-x   41   41   8   4 CN-- vn
  852        0x800bc8000        0x800dc7000 ---    0    0   1   0 CN-- df
  852        0x800dc7000        0x800dc8000 rw-    1    0   1   0 CN-- vn
  852        0x800dc8000        0x800dd8000 r-x   13   14  18   9 CN-- vn
  852        0x800dd8000        0x800fd8000 ---    0    0   1   0 CN-- df
  852        0x800fd8000        0x800fd9000 rw-    1    0   1   0 CN-- vn
  852        0x800fd9000        0x800fdb000 rw-    0    0   0   0 ---- --
  852        0x800fdb000        0x800ff3000 r-x   15   17   2   1 CN-- vn
  852        0x800ff3000        0x8011f3000 ---    0    0   1   0 CN-- df
  852        0x8011f3000        0x8011f5000 rw-    2    0   1   0 CN-- vn
  852        0x8011f5000        0x80120d000 r-x   24   25  25  12 CN-- vn /lib/libthr.so.3
  852        0x80120d000        0x80140d000 ---    0    0   1   0 CN-- df
  852        0x80140d000        0x80140e000 rw-    1    0   1   0 CN-- vn /lib/libthr.so.3
  852        0x80140e000        0x80141a000 rw-   12   12   1   0 CN-- df
  852        0x80141a000        0x801445000 r-x   35   38   4   2 CN-- vn
  852        0x801445000        0x801645000 ---    0    0   1   0 CN-- df
  852        0x801645000        0x801648000 rw-    3    0   1   0 CN-- vn
  852        0x801648000        0x80181d000 r-x  368  376   2   1 CN-- vn
  852        0x80181d000        0x801a1c000 ---    0    0   1   0 CN-- df
  852        0x801a1c000        0x801a25000 rw-    9    0   1   0 CN-- vn
  852        0x801a25000        0x801a88000 r-x   80   90  26  13 CN-- vn /usr/lib/libssl.so.7
  852        0x801a88000        0x801c88000 ---    0    0   1   0 CN-- df
  852        0x801c88000        0x801c92000 rw-   10    0   1   0 CN-- vn /usr/lib/libssl.so.7
  852        0x801c92000        0x801e5f000 r-x  461  501  30  15 CN-- vn /lib/libcrypto.so.7
  852        0x801e5f000        0x80205f000 ---    0    0   1   0 CN-- df
  852        0x80205f000        0x802087000 rw-   40    0   1   0 CN-- vn /lib/libcrypto.so.7
  852        0x802087000        0x802089000 rw-    1    1   1   0 CN-- df
  852        0x802089000        0x8020fe000 r-x   43   43   7   3 CN-- vn
  852        0x8020fe000        0x8022fd000 ---    0    0   1   0 CN-- df
  852        0x8022fd000        0x8022fe000 rw-    1    0   1   0 CN-- vn
  852        0x8022fe000        0x802476000 r-x  376  387  67  33 CN-- vn /lib/libc.so.7
  852        0x802476000        0x802675000 ---    0    0   1   0 CN-- df
  852        0x802675000        0x802681000 rw-   12    0   1   0 CN-- vn /lib/libc.so.7
  852        0x802681000        0x8026ab000 rw-   11   11   1   0 CN-- df
  852        0x8026ab000        0x8026b5000 r-x   10   11   9   4 CN-- vn
  852        0x8026b5000        0x8028b5000 ---    0    0   1   0 CN-- df
  852        0x8028b5000        0x8028b6000 rw-    1    0   1   0 CN-- vn
  852        0x802c00000        0x803400000 rw-  218  218   1   0 CN-- df
  852     0x7fffdfffe000     0x7fffdffff000 ---    0    0   0   0 ---- --
  852     0x7ffffffdf000     0x7ffffffff000 rw-   10    0   1   0 C--D df
  852     0x7ffffffff000     0x800000000000 r-x    1    1  36   0 ---- ph
Comment 55 Robert Blayzor 2016-02-29 11:17:41 UTC
Backtrace from the last crash:


(gdb) bt full
#0  __thr_fork () at /.amd_mnt/juno/pub/FreeBSD/10-STABLE/lib/libthr/thread/thr_fork.c:186
	rtld_locks = Cannot access memory at address 0x7fffffffc7d0
Current language:  auto; currently minimal
(gdb)
Comment 56 Konstantin Belousov freebsd_committer freebsd_triage 2016-02-29 15:14:48 UTC
(In reply to Robert Blayzor from comment #55)
The uprintf_signal data was useful, thanks.

It is quite clear that the issue occurred in the child after a
multithreaded fork.  The evidence is the 'CN' flags (copy-on-write
and needs-copy) on all map entries of the process except the stack.
The number of mapped stacks also suggests that the parent had only
one thread executing during the fork.  This is also confirmed by the
backtrace from the core, but for other reasons I trust the core
less.

The fault occurred on a stack access, as indicated by the fault
address.  What is very interesting is the error code 7, which is the
hardware's way of saying that this was a user-mode write to a page
mapped read-only.  This is again consistent with the state after
fork.

What I do not understand is why the page fault handler did not
perform the COW and change the page permission to writable, which it
should have done given the rw permission on the map entry.

Could you please apply the attached debugging patch on top of the
sources you use and provide the same data as now, i.e. the procstat -v
output of the core and the kernel messages from the patch and uprintf_signal.
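
To make the scenario concrete, here is an illustrative, hypothetical sketch (not a reproducer) of the operation described above: a parent that also has a second thread calls fork(), and the child's first write to its copy-on-write stack is what should be resolved by the fault handler.  Names and structure are examples only; build with something like 'cc -o cowfork cowfork.c -lpthread'.

/*
 * Sketch only -- not a reproducer.  After a multithreaded fork, every
 * writable mapping in the child, including the stack, is copy-on-write;
 * the first write is expected to be handled transparently by the kernel.
 * The failure in this PR is that write instead raising SIGSEGV (err 7).
 */
#include <sys/types.h>
#include <sys/wait.h>
#include <err.h>
#include <pthread.h>
#include <unistd.h>

static void *
idle_thread(void *arg)
{
	(void)arg;
	pause();			/* keep a second thread alive across the fork */
	return (NULL);
}

int
main(void)
{
	pthread_t tid;
	pid_t pid;
	volatile char on_stack[64];	/* lives in the stack mapping */

	if (pthread_create(&tid, NULL, idle_thread, NULL) != 0)
		errx(1, "pthread_create failed");
	pid = fork();
	if (pid == -1)
		err(1, "fork");
	if (pid == 0) {
		on_stack[0] = 1;	/* first write into the child's COW stack page */
		_exit(0);
	}
	waitpid(pid, NULL, 0);
	return (0);
}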
Comment 57 Konstantin Belousov freebsd_committer freebsd_triage 2016-02-29 15:16:04 UTC
Created attachment 167570 [details]
Debugging patch to track source of KERN_PROTECTION_FAILURE
Comment 58 Robert Blayzor 2016-03-05 12:22:40 UTC
After both patches...



pid 716 (dovecot), uid 0: exited on signal 11 (core dumped)


(gdb) bt full
#0  memcpy () at bcopy.S:65
No locals.
#1  0x0000000800915484 in sig_handler (signo=20, si=0x7fffffffa9a0, context=0x7fffffffa630)
    at lib-signals.c:132
	h = (struct signal_handler *) 0x8014198e0
	saved_errno = 14
	c = 0 '\0'
	tmp_si = {si_signo = 0, si_errno = 0, si_code = 0, si_pid = 0, si_uid = 0, si_status = 0,
  si_addr = 0x0, si_value = {sival_int = 0, sival_ptr = 0x0, sigval_int = 0, sigval_ptr = 0x0},
  _reason = {_fault = {_trapno = 0}, _timer = {_timerid = 0, _overrun = 0}, _mesgq = {_mqd = 0},
    _poll = {_band = 0}, __spare__ = {__spare1__ = 0, __spare2__ = 0x7fffffffa5c8}}}
#2  <signal handler called>
No symbol table info available.
#3  0x0000000000411fc8 in _fini ()
No symbol table info available.
#4  0x000000080061c9a8 in objlist_call_fini ()
    at /.amd_mnt/juno/pub/FreeBSD/10-STABLE/libexec/rtld-elf/rtld.c:2354
	list = (Objlist *) Cannot access memory at address 0x0
Current language:  auto; currently asm
(gdb) q
dev@mta1 [/var/spool/tmp] sudo procstat -v /var/spool/tmp/dovecot.core
  PID              START                END PRT  RES PRES REF SHD   FL TP PATH
  716           0x400000           0x417000 r-x   20   20   1   0 CN-- vn
  716           0x617000           0x618000 rw-    1    1   1   0 C--- df
  716        0x800617000        0x800635000 r-x   30   30  32   0 CN-- vn /libexec/ld-elf.so.1
  716        0x800635000        0x800663000 rw-   36   36   1   0 C--- df
  716        0x800834000        0x800836000 rw-    2    2   1   0 C--- df
  716        0x800836000        0x800983000 r-x  225  231  34  17 CN-- vn /usr/local/lib/dovecot/libdovecot.so.0.0.0
  716        0x800983000        0x800b82000 ---    0    0   1   0 CN-- df
  716        0x800b82000        0x800b89000 rw-    7    0   1   0 C--- vn /usr/local/lib/dovecot/libdovecot.so.0.0.0
  716        0x800b89000        0x800b8b000 rw-    2    0   1   0 C--- df
  716        0x800b8b000        0x800d03000 r-x  376  387  63  31 CN-- vn /lib/libc.so.7
  716        0x800d03000        0x800f02000 ---    0    0   1   0 CN-- df
  716        0x800f02000        0x800f0e000 rw-   12    0   1   0 C--- vn /lib/libc.so.7
  716        0x800f0e000        0x800f38000 rw-   15   15   1   0 C--- df
  716        0x801000000        0x801800000 rw-   94   94   1   0 C--- df
  716     0x7ffffffdf000     0x7ffffffff000 rw-    8    8   1   0 C--D df
  716     0x7ffffffff000     0x800000000000 r-x    1    1  34   0 ---- ph



Comment 59 Robert Blayzor 2016-03-05 12:29:50 UTC
Thinking about this a little and going out on a limb: the only thing that differs on these servers compared to the other diskless boot servers we have (which use the same read-only NFS root) is that they have larger memory disks for /var, created from /etc/rc.initdiskless.

/dev/md0                       8.8M    3.8M    4.3M    47%    /etc
/dev/md1                       457M    516K    420M     0%    /var


Because these are mail servers, /var may be a bit busier than on, say, DNS and web servers (/var/tmp lives there, along with temporary files written while messages are scanned for viruses). I've never seen any "out of memory" errors or that mount come anywhere near full, but I'm sure there is a lot of reading and writing going on.


I know it's a stretch, but it's the only thing I can think of that differs from the other diskless servers, which are set up the exact same way.

This appears to be happening only on our mail servers, which also happen to be our busiest.
Comment 60 Konstantin Belousov freebsd_committer freebsd_triage 2016-03-05 16:51:24 UTC
(In reply to Robert Blayzor from comment #58)
And where is the uprintf_signal output, as well as the printfs from the debugging patch?
Comment 61 Robert Blayzor 2016-03-07 12:28:57 UTC
I will make sure the patches didn't get overwritten somewhere, but I believe they are in there.

One thing different about this crash is that Dovecot actually managed to log something before the crash that I've not seen before:

dovecot: master: Fatal: gettimeofday(): Bad address


That was in regard to the last backtrace.
Comment 62 Robert Blayzor 2016-03-07 12:33:06 UTC
Confirmed that the two patches are in the kernel I built and installed...

from trap.c, around line 601:

        if (uprintf_signal) {
                printf("pid %d comm %s: signal %d err %lx code %d type %d "
                    "addr 0x%lx rsp 0x%lx rip 0x%lx "
                    "<%02x %02x %02x %02x %02x %02x %02x %02x>\n",
...
Comment 63 Robert Blayzor 2016-03-09 20:27:39 UTC
I have rechecked and recompiled the patches against the latest 10.3.


Now, under normal conditions, dmesg floods with thousands of messages like the ones below:


pid 3142 kpf 1
pid 3142 kpf 1
pid 3142 kpf 1
pid 3143 kpf 1
pid 3143 kpf 1
pid 3143 kpf 1
pid 3144 kpf 1
pid 3144 kpf 1
pid 3145 kpf 1
pid 3145 kpf 1
pid 3146 kpf 1
pid 3146 kpf 1
pid 3146 kpf 1
pid 3147 kpf 1
pid 3147 kpf 1
pid 3147 kpf 1
pid 3149 kpf 1
pid 3149 kpf 1
pid 3149 kpf 1
pid 3147 kpf 1
pid 3147 kpf 1
pid 3147 kpf 1
pid 3147 kpf 1
pid 3147 kpf 1
pid 3150 kpf 1
pid 3150 kpf 1
pid 3147 kpf 1
pid 3151 kpf 1
pid 3151 kpf 1
pid 3151 kpf 1
pid 3151 kpf 1
pid 3151 kpf 1
pid 3151 kpf 1
...
Comment 64 Konstantin Belousov freebsd_committer freebsd_triage 2016-03-14 10:37:40 UTC
(In reply to Robert Blayzor from comment #63)
Sorry for the delay in replying.  For debugging purposes, please set debug.vn_io_fault_enable to 0; this should stop the message flood.
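
A one-line sketch of that setting (it can also be added to /etc/sysctl.conf):

  sysctl debug.vn_io_fault_enable=0   # silence the benign 'kpf 1' flood from the vn_io_fault path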
Comment 65 Robert Blayzor 2016-03-18 12:03:42 UTC
Here was a recent crash...

Mar 15 11:58:17.019 mta1.alb kernel: pid 69942 kpf 2
Mar 15 11:58:17.019 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.019 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.019 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.019 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.019 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.019 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.019 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.019 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.221 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.221 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.221 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.221 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.221 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.221 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.221 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.221 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.221 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.221 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.221 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.221 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.221 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.221 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.221 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.221 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.221 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.221 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.221 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.221 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.221 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.221 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.221 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.221 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.221 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.425 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.425 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.425 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.425 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.425 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.425 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.425 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.425 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.425 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.425 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.425 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.425 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.425 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.425 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.425 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.425 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.425 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.425 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.425 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.425 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.425 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.425 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.625 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.625 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.625 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.625 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.625 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.625 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.625 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.625 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.625 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.625 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.625 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.625 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.625 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.625 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.625 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.625 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.625 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.625 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.625 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.625 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.626 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.626 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.626 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.626 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.833 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.833 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.833 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.833 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.833 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.833 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.833 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.833 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.833 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.833 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.834 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.834 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.834 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.834 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.834 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.834 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.834 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.834 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.834 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.834 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.834 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.834 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.834 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.834 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:17.834 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:18.034 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:18.034 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:18.034 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:18.034 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:18.034 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:18.034 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:18.034 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:18.035 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:18.035 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:18.035 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:18.035 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:18.035 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:18.035 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:18.035 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:18.035 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:18.035 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:18.035 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:18.035 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:18.035 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:18.035 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:18.035 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:18.035 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:18.035 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:18.035 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:18.035 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:18.035 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:18.238 mta1.alb kernel: pid 69942 kpf 1
Mar 15 11:58:18.238 mta1.alb kernel: <6>pid 69942 (exim), uid 26: exited on signal 11 (core dumped)


procstat -v exim.core
Password:
  PID              START                END PRT  RES PRES REF SHD   FL TP PATH
69942           0x400000           0x560000 r-x  314  325   3   1 CN-- vn /usr/local/sbin/exim
69942           0x760000           0x76a000 rw-   10    0   2   1 CN-- vn /usr/local/sbin/exim
69942           0x76a000           0x779000 rw-    5    5   2   1 CN-- df
69942        0x800760000        0x80077e000 r-x   30   30  33   0 CN-- vn /libexec/ld-elf.so.1
69942        0x80077e000        0x800785000 rw-    7    7   2   0 CN-- df
69942        0x800787000        0x8007b7000 rw-   40   40   2   1 CN-- df
69942        0x80097d000        0x80097f000 rw-    2    2   2   0 CN-- df
69942        0x80097f000        0x80098d000 r-x   14   15   7   3 CN-- vn
69942        0x80098d000        0x800b8d000 ---    0    0   2   0 CN-- df
69942        0x800b8d000        0x800b8e000 rw-    1    0   2   0 CN-- vn
69942        0x800b8e000        0x800b9f000 rw-    0    0   0   0 ---- --
69942        0x800b9f000        0x800bc8000 r-x   41   41   9   4 CN-- vn
69942        0x800bc8000        0x800dc7000 ---    0    0   2   0 CN-- df
69942        0x800dc7000        0x800dc8000 rw-    1    0   2   0 CN-- vn
69942        0x800dc8000        0x800dd8000 r-x   16   17  19   9 CN-- vn
69942        0x800dd8000        0x800fd8000 ---    0    0   2   0 CN-- df
69942        0x800fd8000        0x800fd9000 rw-    1    0   2   0 CN-- vn
69942        0x800fd9000        0x800fdb000 rw-    0    0   0   0 ---- --
69942        0x800fdb000        0x800ff3000 r-x   18   20   3   1 CN-- vn
69942        0x800ff3000        0x8011f3000 ---    0    0   2   0 CN-- df
69942        0x8011f3000        0x8011f5000 rw-    2    0   2   0 CN-- vn
69942        0x8011f5000        0x80120d000 r-x   24   25  18   8 CN-- vn
69942        0x80120d000        0x80140d000 ---    0    0   2   0 CN-- df
69942        0x80140d000        0x80140e000 rw-    1    0   2   0 CN-- vn
69942        0x80140e000        0x80141a000 rw-   12    0   1   0 C--- df
69942        0x80141a000        0x801445000 r-x   38   41   5   2 CN-- vn
69942        0x801445000        0x801645000 ---    0    0   2   0 CN-- df
69942        0x801645000        0x801648000 rw-    3    0   2   0 CN-- vn
69942        0x801648000        0x80181d000 r-x  404  412   3   1 CN-- vn /usr/local/lib/perl5/5.22/mach/CORE/libperl.so.5.22.1
69942        0x80181d000        0x801a1c000 ---    0    0   2   0 CN-- df
69942        0x801a1c000        0x801a25000 rw-    9    0   2   0 CN-- vn /usr/local/lib/perl5/5.22/mach/CORE/libperl.so.5.22.1
69942        0x801a25000        0x801a88000 r-x   84   94  27  13 CN-- vn /usr/lib/libssl.so.7
69942        0x801a88000        0x801c88000 ---    0    0   2   0 CN-- df
69942        0x801c88000        0x801c92000 rw-   10    0   2   0 CN-- vn /usr/lib/libssl.so.7
69942        0x801c92000        0x801e5f000 r-x  461  501  31  15 CN-- vn /lib/libcrypto.so.7
69942        0x801e5f000        0x80205f000 ---    0    0   2   0 CN-- df
69942        0x80205f000        0x802087000 rw-   40    0   2   0 CN-- vn /lib/libcrypto.so.7
69942        0x802087000        0x802089000 rw-    1    1   2   0 CN-- df
69942        0x802089000        0x8020fe000 r-x   45   45   8   3 CN-- vn
69942        0x8020fe000        0x8022fd000 ---    0    0   2   0 CN-- df
69942        0x8022fd000        0x8022fe000 rw-    1    0   2   0 CN-- vn
69942        0x8022fe000        0x802476000 r-x  376  387  64  31 CN-- vn /lib/libc.so.7
69942        0x802476000        0x802675000 ---    0    0   2   0 CN-- df
69942        0x802675000        0x802681000 rw-   12    0   2   0 CN-- vn /lib/libc.so.7
69942        0x802681000        0x8026ab000 rw-   11   11   2   1 CN-- df
69942        0x8026ab000        0x8026b5000 r-x   10   11  10   4 CN-- vn /usr/local/lib/libintl.so.8.1.4
69942        0x8026b5000        0x8028b5000 ---    0    0   2   0 CN-- df
69942        0x8028b5000        0x8028b6000 rw-    1    0   2   0 CN-- vn /usr/local/lib/libintl.so.8.1.4
69942        0x802c00000        0x803400000 rw-  218    0   1   0 C--- df
69942     0x7fffdfffe000     0x7fffdffff000 ---    0    0   0   0 ---- --
69942     0x7ffffffdf000     0x7ffffffff000 rw-   10    0   1   0 C--D df
69942     0x7ffffffff000     0x800000000000 r-x    1    1  35   0 ---- ph
Comment 66 Robert Blayzor 2016-03-18 12:05:43 UTC
backtrace from previous crash:

#1  0x000000080120706d in __thr_fork () at /.amd_mnt/juno/pub/FreeBSD/10-STABLE/lib/libthr/thread/thr_fork.c:203
	rtld_locks = {-4800, 32767, -4760, 32767, -4792, 32767, 3, 0}
	curthread = (struct pthread *) 0x803006400
	cancelsave = 0
	ret = 0
	errsave = 4
#2  0x0000000000428c54 in handle_smtp_call (listen_sockets=0x803065140, listen_socket_count=4, accept_socket=7, accepted=0x7fffffffcd38) at daemon.c:355
	pid = 32767
	interface_sockaddr = {v4 = {sin_len = 28 '\034', sin_family = 28 '\034', sin_port = 6400, sin_addr = {s_addr = 0}, sin_zero = 0x7fffffffc9a0 "&\aðX\001\020"}, v6 = {
    sin6_len = 28 '\034', sin6_family = 28 '\034', sin6_port = 6400, sin6_flowinfo = 0, sin6_addr = {__u6_addr = {__u6_addr8 = 0x7fffffffc9a0 "&\aðX\001\020",
        __u6_addr16 = 0x7fffffffc9a0, __u6_addr32 = 0x7fffffffc9a0}}, sin6_scope_id = 0}, v0 = {sa_len = 28 '\034', sa_family = 28 '\034', sa_data = 0x7fffffffc99a ""}}
	ifsize = 28
	dup_accept_socket = 9
	max_for_this_host = 10
	wfsize = 100
	wfptr = 22
	use_log_write_selector = 1980
	whofrom = (uschar *) 0x803065e30 "[2607:f058:110:2::f:0]"
	reset_point = (void *) 0x803065e00
Comment 67 John Baldwin freebsd_committer freebsd_triage 2016-03-18 23:35:51 UTC
Sorry for another possible Hail Mary, but I've noticed that many of your faults are in the thread library.  There were several changes to libthr between 10.1 and 10.2.  They should all be bug fixes (in particular, fixes to allow applications to dlopen() an object that uses libthr).

If these changes to libthr are relevant, then there are a few things to investigate and/or try:

1) Can you run ldd on your dovecot and exim binaries to see if they are directly linked against libthr?

2) Do you know if your binaries load any "plugins" (e.g. SASL auth mechanisms, or milters -- I know postfix and sendmail can both use milters for things like greylisting)?  If they do, are any of those plugins linked against libthr?

If the answer to 1) is no, then the first thing to try is setting LD_PRELOAD=libthr.so.3 in the environment of the dovecot and exim binaries when you start them up.  (If you are using rc.d scripts, you can do this via the *_env variables in rc.conf.)
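
A sketch of the rc.conf side of that suggestion; the *_env variable names below are assumptions and depend on the actual rc.d scripts shipped with the ports:

  # /etc/rc.conf (hypothetical variable names -- check the rc.d scripts)
  exim_env="LD_PRELOAD=/lib/libthr.so.3"
  dovecot_env="LD_PRELOAD=/lib/libthr.so.3"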
Comment 68 Robert Blayzor 2016-03-18 23:47:04 UTC
Exim most certainly is linked directly...

ldd /usr/local/sbin/exim | grep thr
	libthr.so.3 => /lib/libthr.so.3 (0x8011f5000)


Many of the dovecot libs are linked against libthr as well...

ldd /usr/local/lib/dovecot/libdovecot* | grep thr
	libthr.so.3 => /lib/libthr.so.3 (0x801fb0000)
	libthr.so.3 => /lib/libthr.so.3 (0x801fb0000)
	libthr.so.3 => /lib/libthr.so.3 (0x801fb0000)
	libthr.so.3 => /lib/libthr.so.3 (0x801b91000)
	libthr.so.3 => /lib/libthr.so.3 (0x801b91000)
	libthr.so.3 => /lib/libthr.so.3 (0x801b91000)
Comment 69 John Baldwin freebsd_committer freebsd_triage 2016-03-19 18:53:41 UTC
(In reply to Robert Blayzor from comment #68)
Ok.  Some of the changes in 10.2 fixed problems related to loading libthr.so.3 at runtime after the binary had started (which is why I asked).  Given your reply, I think my LD_PRELOAD suggestion would not be a useful test.  It might still be useful to try putting a 10.1 libthr into your 10.2 image to see whether just swapping out libthr fixes the issue.  If it does, that at least helps narrow down where the problem lies.
Comment 70 Konstantin Belousov freebsd_committer freebsd_triage 2016-03-28 11:54:00 UTC
(In reply to Robert Blayzor from comment #65)
If vn_io_fault_enable is set to 0, I do not see how the 'kpf 1' message can occur.  Could you please double-check that the setting is right?

Also, do you have any third-party modules loaded?

Did you try upgrading the hypervisor?
Comment 71 Robert Blayzor 2016-03-28 12:09:48 UTC
I will double-check the sysctl and set it if that's not the case. It may not have persisted across the reboot into the last kernel.

We want to avoid upgrading the hypervisor across major revisions, since we have other VMs on it, without an NFS diskless root, that show no problems.

The only difference between these hosts and the other hosts on the same hypervisor is that these boot with no local (virtual) disk for the root FS; only tmp/swap.


We have no third-party modules loaded.
Comment 72 Robert Blayzor 2016-03-29 12:00:17 UTC
With the recent patches... I still get occasional unexpected crashes...


Here is all I got on the last one....

Mar 29 05:26:18.617 mxs0.alb kernel: pid 97296 kpf 2
Mar 29 05:26:18.617 mxs0.alb kernel: pid 97296 kpf 2
Mar 29 05:26:18.617 mxs0.alb kernel: pid 97296 kpf 2
Mar 29 05:26:18.617 mxs0.alb kernel: pid 97296 kpf 2
Mar 29 05:26:18.825 mxs0.alb kernel: <4>Failed to write core file for process exim (error 14)
Mar 29 05:26:18.825 mxs0.alb kernel: <6>pid 97296 (exim), uid 26: exited on signal 11


(gdb) bt full
#0  __thr_fork () at /.amd_mnt/juno/pub/FreeBSD/10-STABLE/lib/libthr/thread/thr_fork.c:186
	rtld_locks = Cannot access memory at address 0x7fffffffc7d0
Current language:  auto; currently min



So it still looks like a libthr issue. I will see if I can grab /lib/libthr.so.3 from a FreeBSD 10.1 install I have locally and check whether I still get these problems.
Comment 73 Robert Blayzor 2016-03-29 12:34:02 UTC
I copied /lib/libthr.so.3 from a 10.1 physical server I have and rebooted the diskless VMs in question. That did not go over well. The servers booted OK, but processes that use the library would start and then just hang. I'm not sure what the proper procedure would be here, but if copying /lib/libthr.so.3 into place and rebooting should have worked, it did not.

Needless to say, I had to roll that change back.
Comment 74 Konstantin Belousov freebsd_committer freebsd_triage 2016-03-31 08:12:41 UTC
(In reply to Robert Blayzor from comment #72)
If the only reports during the failure were about 'kpf 2', please try the new debugging patch.  It still requires setting debug.vn_io_fault_enable to zero. As before, I need the dmesg output from the failing episode.

The patch is against HEAD, and it would be extra good if you tried this with a HEAD kernel; there is no need to update usermode.
Comment 75 Konstantin Belousov freebsd_committer freebsd_triage 2016-03-31 08:13:25 UTC
Created attachment 168808 [details]
Debugging patch to track the origin of OBJ_DEAD
Comment 76 Robert Blayzor 2016-03-31 13:16:05 UTC
Can this be hand-patched against 10-STABLE?  That's what our SVN checkout and install are based on. Or is there other stuff in HEAD that is required? We are pre-production on this, so I'm trying to limit the pain of installing bleeding-edge changes.
Comment 77 Konstantin Belousov freebsd_committer freebsd_triage 2016-03-31 14:50:40 UTC
(In reply to Robert Blayzor from comment #76)
There are some changes in HEAD which might be relevant to the area I am trying to drill into, i.e. it might be beneficial for you to try it.  Note that only the kernel is needed; as already mentioned, you do not need to update usermode.

Anyway, I moved the patch to stable/10, but I cannot even compile-test it right now.  So if there are any compilation issues, please fix them.
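
For reference, a rough sketch of applying such a patch and rebuilding only the kernel; the patch file name is made up, and KERNCONF should be whatever config is already in use:

  cd /usr/src                              # or wherever the 10-STABLE checkout lives
  patch < /tmp/obj_dead_stable10.patch     # use -p0/-p1 to match how the diff was generated
  make buildkernel KERNCONF=GENERIC
  make installkernel KERNCONF=GENERIC
  shutdown -r now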
Comment 78 Konstantin Belousov freebsd_committer freebsd_triage 2016-03-31 14:51:12 UTC
Created attachment 168832 [details]
OBJ_DEAD tracker for stable/10
Comment 79 Robert Blayzor 2016-04-01 12:38:43 UTC
Here you go...

9718 vm_fault OBJ_DEAD 0xfffff8000afb4330
pid 9718 kpf 2
9718 vm_fault OBJ_DEAD 0xfffff8000afb4330
pid 9718 kpf 2
9718 vm_fault OBJ_DEAD 0xfffff8000afb4330
pid 9718 kpf 2
9718 vm_fault OBJ_DEAD 0xfffff8000afb4330
pid 9718 kpf 2
Failed to write core file for process dovecot (error 14)
pid 9718 (dovecot), uid 0: exited on signal 11


(gdb) bt full
#0  0x000000000040a85a in service_process_create ()
No symbol table info available.
Cannot access memory at address 0x7fffffffae18



procstat -v dovecot.core
  PID              START                END PRT  RES PRES REF SHD   FL TP PATH
 9718           0x400000           0x411000 r-x   17   17   2   0 CN-- vn
 9718           0x610000           0x611000 rw-    1    1   2   1 CN-- df
 9718        0x800610000        0x80062e000 r-x   30   30  33   0 CN-- vn /libexec/ld-elf.so.1
 9718        0x80062e000        0x80065c000 rw-   36   36   2   1 CN-- df
 9718        0x80082d000        0x80082f000 rw-    2    2   2   0 CN-- df
 9718        0x80082f000        0x800922000 r-x  171  177  35  17 CN-- vn /usr/local/lib/dovecot/libdovecot.so.0.0.0
 9718        0x800922000        0x800b22000 ---    0    0   2   0 CN-- df
 9718        0x800b22000        0x800b28000 rw-    6    0   2   1 CN-- vn /usr/local/lib/dovecot/libdovecot.so.0.0.0
 9718        0x800b28000        0x800b2a000 rw-    2    2   2   1 CN-- df
 9718        0x800b2a000        0x800ca2000 r-x  376  387  64  31 CN-- vn /lib/libc.so.7
 9718        0x800ca2000        0x800ea1000 ---    0    0   2   0 CN-- df
 9718        0x800ea1000        0x800ead000 rw-   12    0   2   0 CN-- vn /lib/libc.so.7
 9718        0x800ead000        0x800ed7000 rw-   13   13   2   0 CN-- df
 9718        0x801000000        0x801800000 rw-   94    0   1   0 C--- df
 9718     0x7ffffffdf000     0x7ffffffff000 rw-    8    0   1   0 C--D df
 9718     0x7ffffffff000     0x800000000000 r-x    1    1  35   0 ---- ph
Comment 80 Konstantin Belousov freebsd_committer freebsd_triage 2016-04-01 14:43:05 UTC
(In reply to Robert Blayzor from comment #79)
The page fault was not handled because some object in the shadow chain backing the faulted region has the OBJ_DEAD flag set.  I do not see why this state would be valid for the object in question.  There is another PR, 204764, where the same flag is set on a vnode object but the manifestation of the problem is different: instead of vm_fault, the object was found by vnode_create_vobject(), which sleeps forever waiting for the object termination to finish.  Since the flag was not set under termination conditions, termination does not happen and the lookup is stuck forever.

Right now, I have no idea why this happens.  Either we have a bug in the VM that spuriously sets the flag (but code reading does not support this possibility), or some random memory access corrupts the vm object memory.  I do not know.

I attached some additions to the debugging patch, which both assert that the object is write-locked when object->flags is modified and slightly change the layout of struct vm_object.  So if the issue is a VM bug, most likely unlocked modifications, it could be caught.  Or, if the problem is memory corruption, it should migrate to another place.

Still, this is only speculation.  Please make sure that you have INVARIANTS and WITNESS enabled in your kernel config.
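
Roughly, the kind of check being described, sketched from this comment rather than copied from the attachment (the helper name and its exact placement in sys/vm/vm_object.h are approximations):

	/* Sketch: refuse unlocked modifications of object->flags; needs INVARIANTS. */
	static __inline void
	vm_object_set_flag(vm_object_t object, u_short bits)
	{

		VM_OBJECT_ASSERT_WLOCKED(object);
		object->flags |= bits;
	}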
Comment 81 Konstantin Belousov freebsd_committer freebsd_triage 2016-04-01 14:51:19 UTC
Created attachment 168870 [details]
OBJ_DEAD tracker for stable/10, v2: assert that the object is locked, change layout
Comment 82 Robert Blayzor 2016-04-01 15:47:14 UTC
Getting a build error with the latest patch:

In file included from /.amd_mnt/juno/pub/FreeBSD/10-STABLE/sys/compat/freebsd32/freebsd32_misc.c:95:
/.amd_mnt/juno/pub/FreeBSD/10-STABLE/sys/vm/vm_object.h:257:3: error: implicit declaration of function 'rw_assert' is invalid in C99
      [-Werror,-Wimplicit-function-declaration]
  VM_OBJECT_ASSERT_WLOCKED(object);
  ^
/.amd_mnt/juno/pub/FreeBSD/10-STABLE/sys/vm/vm_object.h:229:2: note: expanded from macro 'VM_OBJECT_ASSERT_WLOCKED'
        rw_assert(&(object)->lock, RA_WLOCKED)
        ^
/.amd_mnt/juno/pub/FreeBSD/10-STABLE/sys/vm/vm_object.h:257:3: error: use of undeclared identifier 'RA_WLOCKED'
/.amd_mnt/juno/pub/FreeBSD/10-STABLE/sys/vm/vm_object.h:229:29: note: expanded from macro 'VM_OBJECT_ASSERT_WLOCKED'
        rw_assert(&(object)->lock, RA_WLOCKED)
Comment 83 Konstantin Belousov freebsd_committer freebsd_triage 2016-04-01 16:14:45 UTC
Created attachment 168872 [details]
OBJ_DEAD tracker for stable/10, v2.1

This one builds, but might not boot.

Ensure that you have WITNESS and INVARIANTS.
Comment 84 John Baldwin freebsd_committer freebsd_triage 2016-04-01 17:51:10 UTC
(In reply to Konstantin Belousov from comment #80)
I wonder if this is related to the MFC I did in r284100 of the changes to keep VM objects on the VM list.  It doesn't alter handling of the DEAD flag, but does affect the object type.  It was merged to 10 in between 10.1 and 10.2 so it would fit the time window.
Comment 85 Robert Blayzor 2016-04-01 17:59:09 UTC
This builds without the following... 

options         INVARIANTS
options         WITNESS



But when I add those two options, building the kernel fails...



sys_machdep.o:/.amd_mnt/juno/pub/FreeBSD/10-STABLE/sys/amd64/amd64/sys_machdep.c:514: more undefined references to `__mtx_assert' follow
intel_idpgtbl.o: In function `ctx_idmap_nextlvl':
/.amd_mnt/juno/pub/FreeBSD/10-STABLE/sys/x86/iommu/intel_idpgtbl.c:117: undefined reference to `__rw_assert'
intel_idpgtbl.o: In function `ctx_map_buf_locked':
/.amd_mnt/juno/pub/FreeBSD/10-STABLE/sys/x86/iommu/intel_idpgtbl.c:424: undefined reference to `__rw_assert'
intel_idpgtbl.o: In function `ctx_unmap_buf_locked':
/.amd_mnt/juno/pub/FreeBSD/10-STABLE/sys/x86/iommu/intel_idpgtbl.c:604: undefined reference to `__rw_assert'
intel_idpgtbl.o: In function `ctx_free_pgtbl':
/.amd_mnt/juno/pub/FreeBSD/10-STABLE/sys/x86/iommu/intel_idpgtbl.c:708: undefined reference to `__rw_assert'
/.amd_mnt/juno/pub/FreeBSD/10-STABLE/sys/x86/iommu/intel_idpgtbl.c:718: undefined reference to `__rw_assert'
intel_idpgtbl.o:/.amd_mnt/juno/pub/FreeBSD/10-STABLE/sys/x86/iommu/intel_idpgtbl.c:356: more undefined references to `__rw_assert' follow
intel_qi.o: In function `dmar_qi_invalidate_locked':
/.amd_mnt/juno/pub/FreeBSD/10-STABLE/sys/x86/iommu/intel_qi.c:222: undefined reference to `__mtx_assert'
intel_qi.o: In function `dmar_qi_advance_tail':
/.amd_mnt/juno/pub/FreeBSD/10-STABLE/sys/x86/iommu/intel_qi.c:99: undefined reference to `__mtx_assert'
intel_qi.o: In function `dmar_qi_ensure':
/.amd_mnt/juno/pub/FreeBSD/10-STABLE/sys/x86/iommu/intel_qi.c:109: undefined reference to `__mtx_assert'
intel_qi.o: In function `dmar_qi_advance_tail':
/.amd_mnt/juno/pub/FreeBSD/10-STABLE/sys/x86/iommu/intel_qi.c:99: undefined reference to `__mtx_assert'
intel_qi.o: In function `dmar_qi_emit':
/.amd_mnt/juno/pub/FreeBSD/10-STABLE/sys/x86/iommu/intel_qi.c:142: undefined reference to `__mtx_assert'
intel_qi.o:/.amd_mnt/juno/pub/FreeBSD/10-STABLE/sys/x86/iommu/intel_qi.c:178: more undefined references to `__mtx_assert' follow
*** Error code 1

Stop.
make[2]: stopped in /opt/obj/.amd_mnt/juno/pub/FreeBSD/10-STABLE/sys/ESXPXE
*** Error code 1

Stop.
make[1]: stopped in /.amd_mnt/juno/pub/FreeBSD/10-STABLE
*** Error code 1

Stop.
make: stopped in /.amd_mnt/juno/pub/FreeBSD/10-STABLE
Comment 86 Konstantin Belousov freebsd_committer freebsd_triage 2016-04-01 18:31:23 UTC
(In reply to Robert Blayzor from comment #85)
You also need option INVARIANT_SUPPORT.  I assumed this was common knowledge, sorry.
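
For reference, a kernel config fragment that provides the missing __mtx_assert/__rw_assert symbols (a sketch for a GENERIC-derived config):

	options         INVARIANTS              # enable KASSERT() and lock assertions
	options         INVARIANT_SUPPORT       # compile in __mtx_assert, __rw_assert, etc.
	options         WITNESS                 # lock order checking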
Comment 87 Konstantin Belousov freebsd_committer freebsd_triage 2016-04-01 18:33:59 UTC
(In reply to John Baldwin from comment #84)
Thank you for the pointer, it is something for me to keep in mind.

But I do not see how it would be directly related.  OBJT_DEAD and OBJ_DEAD are two different things.  In the debugging data from the other PR, the OBJ_DEAD flag is set, but the object itself is fine and not obliterated.  Results from the previous debugging patch make me confident that the OBJ_DEAD flag was not set with vm_object_set_flags(), while the source code only sets OBJ_DEAD through calls to that helper.
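
To make the distinction concrete: OBJT_DEAD is a value of the object's type field, whereas OBJ_DEAD is a bit in its flags word, so an object can carry OBJ_DEAD while its type is still, say, OBJT_DEFAULT.  A hypothetical predicate (not real kernel code; field names follow struct vm_object):

	/*
	 * Hypothetical check: termination has started (flag set) but the
	 * object type has not been switched to OBJT_DEAD yet.
	 */
	static __inline int
	vm_object_terminating(vm_object_t object)
	{
		return ((object->flags & OBJ_DEAD) != 0 &&
		    object->type != OBJT_DEAD);
	}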
Comment 88 Robert Blayzor 2016-04-04 13:04:10 UTC
Just a follow-up on this. The last patch has been running for a few days and I've not had an unexpected crash yet. I'm not saying it won't happen; it's not unusual for us to go several days without seeing this, and then suddenly we may start getting crashes at a rate of one or more per day. Just following up as we continue to monitor this...
Comment 89 Robert Blayzor 2016-04-06 18:30:25 UTC
Not sure if this is related or part of the problem, but this is the first time I've seen the following in the kernel messages on any of the VMs...



lock order reversal:
 1st 0xfffff80009d9e240 ufs (ufs) @ /.amd_mnt/juno/pub/FreeBSD/10-STABLE/sys/kern/vfs_subr.c:2253
 2nd 0xfffffe007c5bef50 bufwait (bufwait) @ /.amd_mnt/juno/pub/FreeBSD/10-STABLE/sys/ufs/ffs/ffs_vnops.c:262
 3rd 0xfffff800117edd50 ufs (ufs) @ /.amd_mnt/juno/pub/FreeBSD/10-STABLE/sys/kern/vfs_subr.c:2253
KDB: stack backtrace:
#0 0xffffffff80484260 at kdb_backtrace+0x60
#1 0xffffffff804a30d4 at witness_checkorder+0xe24
#2 0xffffffff8042b98a at __lockmgr_args+0x9ea
#3 0xffffffff806818f4 at ffs_lock+0x84
#4 0xffffffff8073dc2c at VOP_LOCK1_APV+0xfc
#5 0xffffffff805057da at _vn_lock+0xaa
#6 0xffffffff804f6de7 at vget+0x67
#7 0xffffffff804ea601 at vfs_hash_get+0xe1
#8 0xffffffff8067cf00 at ffs_vgetf+0x40
#9 0xffffffff80674bf0 at softdep_sync_buf+0xac0
#10 0xffffffff80682656 at ffs_syncvnode+0x286
#11 0xffffffff80659a5e at ffs_truncate+0x6ae
#12 0xffffffff8068987a at ufs_direnter+0x81a
#13 0xffffffff806926a5 at ufs_makeinode+0x575
#14 0xffffffff8068e58d at ufs_create+0x2d
#15 0xffffffff8073b5c1 at VOP_CREATE_APV+0xf1
#16 0xffffffff80504e93 at vn_open_cred+0x2c3
#17 0xffffffff804fdf8f at kern_openat+0x26f
Comment 90 Robert Blayzor 2016-04-14 12:15:14 UTC
In 12 days of running since the last debug patch, I've seen only the below on a couple of the VMs...


lock order reversal:
 1st 0xfffff80009e5c418 ufs (ufs) @ /.amd_mnt/juno/pub/FreeBSD/10-STABLE/sys/kern/vfs_subr.c:2253
 2nd 0xfffffe007c5c37f8 bufwait (bufwait) @ /.amd_mnt/juno/pub/FreeBSD/10-STABLE/sys/ufs/ffs/ffs_vnops.c:262
 3rd 0xfffff80064499418 ufs (ufs) @ /.amd_mnt/juno/pub/FreeBSD/10-STABLE/sys/kern/vfs_subr.c:2253
KDB: stack backtrace:
#0 0xffffffff80484260 at kdb_backtrace+0x60
#1 0xffffffff804a30d4 at witness_checkorder+0xe24
#2 0xffffffff8042b98a at __lockmgr_args+0x9ea
#3 0xffffffff806818f4 at ffs_lock+0x84
#4 0xffffffff8073dc2c at VOP_LOCK1_APV+0xfc
#5 0xffffffff805057da at _vn_lock+0xaa
#6 0xffffffff804f6de7 at vget+0x67
#7 0xffffffff804ea601 at vfs_hash_get+0xe1
#8 0xffffffff8067cf00 at ffs_vgetf+0x40
#9 0xffffffff80674bf0 at softdep_sync_buf+0xac0
#10 0xffffffff80682656 at ffs_syncvnode+0x286
#11 0xffffffff80659a5e at ffs_truncate+0x6ae
#12 0xffffffff8068987a at ufs_direnter+0x81a
#13 0xffffffff806926a5 at ufs_makeinode+0x575
#14 0xffffffff8068e58d at ufs_create+0x2d
#15 0xffffffff8073b5c1 at VOP_CREATE_APV+0xf1
#16 0xffffffff80504e93 at vn_open_cred+0x2c3
#17 0xffffffff804fdf8f at kern_openat+0x26f
Comment 91 Robert Blayzor 2016-04-18 15:44:11 UTC
dmesg:

14157 vm_fault OBJ_DEAD 0xfffff8004d9b74c8
pid 14157 kpf 2
pid 14157 comm exim: signal 11 err 7 code 2 type 12 addr 0x764b34 rsp 0x7fffffffc830 rip 0x428c9e <89 04 25 34 4b 76 00 c7>
pid 14157 (exim), uid 26: exited on signal 11 (core dumped)


procstat -v exim.core
  PID              START                END PRT  RES PRES REF SHD   FL TP PATH
14157           0x400000           0x560000 r-x  316  327   3   1 CN-- vn /usr/local/sbin/exim
14157           0x760000           0x76a000 rw-   10    0   1   0 C--- vn /usr/local/sbin/exim
14157           0x76a000           0x779000 rw-    5    0   1   0 C--- df
14157        0x800760000        0x80077e000 r-x   30   30  32   0 CN-- vn /libexec/ld-elf.so.1
14157        0x80077e000        0x800785000 rw-    7    7   2   0 CN-- df
14157        0x800787000        0x8007b7000 rw-   40   40   2   1 CN-- df
14157        0x80097d000        0x80097f000 rw-    2    2   2   0 CN-- df
14157        0x80097f000        0x80098d000 r-x   14   15  12   3 CN-- vn
14157        0x80098d000        0x800b8d000 ---    0    0   2   0 CN-- df
14157        0x800b8d000        0x800b8e000 rw-    1    0   2   0 CN-- vn
14157        0x800b8e000        0x800b9f000 rw-    0    0   0   0 ---- --
14157        0x800b9f000        0x800bc8000 r-x   41   41  19   6 CN-- vn
14157        0x800bc8000        0x800dc7000 ---    0    0   2   0 CN-- df
14157        0x800dc7000        0x800dc8000 rw-    1    0   2   0 CN-- vn
14157        0x800dc8000        0x800dd8000 r-x   16   17  26  10 CN-- vn
14157        0x800dd8000        0x800fd8000 ---    0    0   2   0 CN-- df
14157        0x800fd8000        0x800fd9000 rw-    1    0   2   0 CN-- vn
14157        0x800fd9000        0x800fdb000 rw-    0    0   0   0 ---- --
14157        0x800fdb000        0x800ff3000 r-x   24   26   3   1 CN-- vn
14157        0x800ff3000        0x8011f3000 ---    0    0   2   0 CN-- df
14157        0x8011f3000        0x8011f5000 rw-    2    0   2   0 CN-- vn
14157        0x8011f5000        0x80120d000 r-x   24   25  24   8 CN-- vn /lib/libthr.so.3
14157        0x80120d000        0x80140d000 ---    0    0   2   0 CN-- df
14157        0x80140d000        0x80140e000 rw-    1    0   1   0 C--- vn /lib/libthr.so.3
14157        0x80140e000        0x80141a000 rw-   12    0   1   0 C--- df
14157        0x80141a000        0x801445000 r-x   38   41   3   1 CN-- vn
14157        0x801445000        0x801645000 ---    0    0   2   0 CN-- df
14157        0x801645000        0x801648000 rw-    3    0   2   0 CN-- vn
14157        0x801648000        0x80181d000 r-x  424  432  10   2 CN-- vn
14157        0x80181d000        0x801a1c000 ---    0    0   2   0 CN-- df
14157        0x801a1c000        0x801a25000 rw-    9    0   2   0 CN-- vn
14157        0x801a25000        0x801a88000 r-x   84   94   9   4 CN-- vn /usr/lib/libssl.so.7
14157        0x801a88000        0x801c88000 ---    0    0   2   0 CN-- df
14157        0x801c88000        0x801c92000 rw-   10    0   2   0 CN-- vn /usr/lib/libssl.so.7
14157        0x801c92000        0x801e5f000 r-x  461  501  13   6 CN-- vn /lib/libcrypto.so.7
14157        0x801e5f000        0x80205f000 ---    0    0   2   0 CN-- df
14157        0x80205f000        0x802087000 rw-   40    0   2   0 CN-- vn /lib/libcrypto.so.7
14157        0x802087000        0x802089000 rw-    1    1   2   0 CN-- df
14157        0x802089000        0x8020fe000 r-x   45   45   8   3 CN-- vn
14157        0x8020fe000        0x8022fd000 ---    0    0   2   0 CN-- df
14157        0x8022fd000        0x8022fe000 rw-    1    0   2   0 CN-- vn
14157        0x8022fe000        0x802476000 r-x  376  387  56  24 CN-- vn /lib/libc.so.7
14157        0x802476000        0x802675000 ---    0    0   2   0 CN-- df
14157        0x802675000        0x802681000 rw-   12    0   1   0 C--- vn /lib/libc.so.7
14157        0x802681000        0x8026ab000 rw-   11    0   1   0 C--- df
14157        0x8026ab000        0x8026b5000 r-x   10   11   8   3 CN-- vn /usr/local/lib/libintl.so.8.1.4
14157        0x8026b5000        0x8028b5000 ---    0    0   2   0 CN-- df
14157        0x8028b5000        0x8028b6000 rw-    1    0   2   0 CN-- vn /usr/local/lib/libintl.so.8.1.4
14157        0x802c00000        0x803400000 rw-  222    0   1   0 C--- df
14157     0x7fffdfffe000     0x7fffdffff000 ---    0    0   0   0 ---- --
14157     0x7ffffffdf000     0x7ffffffff000 rw-   10    0   1   0 C--D df
14157     0x7ffffffff000     0x800000000000 r-x    1    1  34   0 ---- ph




(gdb) bt full
#0  handle_smtp_call (listen_sockets=0x8030650b8, listen_socket_count=2, accept_socket=5, accepted=0x7fffffffcd38) at daemon.c:375
	i = 32767
	queue_only_reason = 0
	old_pool = 0
	save_debug_selector = 0
	local_queue_only = 32767
	session_local_queue_only = -4800
	act = {__sigaction_u = {__sa_handler = 0, __sa_sigaction = 0}, sa_flags = 0, sa_mask = {__bits = 0x7fffffffc91c}}
	pid = 0
	interface_sockaddr = {v4 = {sin_len = 28 '\034', sin_family = 28 '\034', sin_port = 6400, sin_addr = {s_addr = 0},
    sin_zero = 0x7fffffffc9a0 "&\aðX\001\020"}, v6 = {sin6_len = 28 '\034', sin6_family = 28 '\034', sin6_port = 6400, sin6_flowinfo = 0, sin6_addr = {__u6_addr = {
        __u6_addr8 = 0x7fffffffc9a0 "&\aðX\001\020", __u6_addr16 = 0x7fffffffc9a0, __u6_addr32 = 0x7fffffffc9a0}}, sin6_scope_id = 0}, v0 = {sa_len = 28 '\034',
    sa_family = 28 '\034', sa_data = 0x7fffffffc99a ""}}
	ifsize = 28
	dup_accept_socket = 7
	max_for_this_host = 10
	wfsize = 100
	wfptr = 22
	use_log_write_selector = 1980
	whofrom = (uschar *) 0x80308a060 "[2607:f058:110:1::f:0]"
	reset_point = (void *) 0x80308a030
#1  0x0000000000427c55 in daemon_go () at daemon.c:2040
	accept_socket = 5
	sk = 0
	max_socket = 4
	lcount = 0
	select_errno = 4
	select_failed = 0
	select_listen = {__fds_bits = 0x7fffffffcc40}
	accepted = {sin6_len = 28 '\034', sin6_family = 28 '\034', sin6_port = 37545, sin6_flowinfo = 0, sin6_addr = {__u6_addr = {
      __u6_addr8 = 0x7fffffffcd40 "&\aðX\001\020", __u6_addr16 = 0x7fffffffcd40, __u6_addr32 = 0x7fffffffcd40}}, sin6_scope_id = 0}
	len = 28
	pid = 14053
	pw = (struct passwd *) 0x802698328
	listen_sockets = (int *) 0x8030650b8
	listen_socket_count = 2
	addresses = (ip_address_item *) 0x803065038
	last_connection_time = 1460982877
#2  0x000000000044af25 in main (argc=3, cargv=0x7fffffffed48) at exim.c:4719
	rsopts = 0x761b00
	argv = (uschar **) 0x7fffffffed48
	arg_receive_timeout = -1
	arg_smtp_receive_timeout = -1
	arg_error_handling = 0
	filter_sfd = -1
	filter_ufd = -1
	group_count = 2
---Type <return> to continue, or q <return> to quit---
	i = 0
	rv = 0
	list_queue_option = 0
	msg_action = 0
	msg_action_arg = -1
	namelen = 20
	queue_only_reason = 0
	perl_start_option = 0
	recipients_arg = 3
	sender_address_domain = 0
	test_retry_arg = -1
	test_rewrite_arg = -1
	arg_queue_only = 0
	bi_option = 0
	checking = 0
	count_queue = 0
	expansion_test = 0
	extract_recipients = 0
	flag_G = 0
	flag_n = 0
	forced_delivery = 0
	f_end_dot = 0
	deliver_give_up = 0
	list_queue = 0
	list_options = 0
	local_queue_only = 7868416
	more = 1
	one_msg_action = 0
	queue_only_set = 0
	receiving_message = 0
	sender_ident_set = 0
	session_local_queue_only = -5288
	unprivileged = 0
	removed_privilege = 0
	usage_wanted = 0
	verify_address_mode = 0
	verify_as_sender = 0
	version_printed = 0
	alias_arg = (uschar *) 0x0
	called_as = (uschar *) 0x554616 ""
	cmdline_syslog_name = (uschar *) 0x0
	start_queue_run_id = (uschar *) 0x0
	stop_queue_run_id = (uschar *) 0x0
	expansion_test_message = (uschar *) 0x0
	ftest_domain = (uschar *) 0x0
	ftest_localpart = (uschar *) 0x0
	ftest_prefix = (uschar *) 0x0
---Type <return> to continue, or q <return> to quit---
	ftest_suffix = (uschar *) 0x0
	log_oneline = (uschar *) 0x0
	malware_test_file = (uschar *) 0x0
	real_sender_address = (uschar *) 0x7fffffffeb58 " \005v"
	originator_home = (uschar *) 0x803065018 "/root"
	sz = 140737488350168
	reset_point = (void *) 0x80231a310
	pw = (struct passwd *) 0x802698328
	statbuf = {st_dev = 1895890688, st_ino = 66, st_mode = 8592, st_nlink = 1, st_uid = 1001, st_gid = 4, st_rdev = 66, st_atim = {tv_sec = 1460632918,
    tv_nsec = 0}, st_mtim = {tv_sec = 1460632918, tv_nsec = 0}, st_ctim = {tv_sec = 1460632918, tv_nsec = 0}, st_size = 0, st_blocks = 0, st_blksize = 4096,
  st_flags = 0, st_gen = 0, st_lspare = 0, st_birthtim = {tv_sec = -1, tv_nsec = 0}}
	passed_qr_pid = 0
	passed_qr_pipe = -1
	group_list = 0x7fffffffd8f0
	info_flag = CMDINFO_NONE
	info_stdout = 0
Comment 92 Konstantin Belousov freebsd_committer freebsd_triage 2016-04-20 14:13:03 UTC
(In reply to Robert Blayzor from comment #91)
This happens with the _latest_ debugging patch applied, am I right?

Please do 'kgdb kernel.debug /dev/mem' on the machine, and then at the kgdb prompt do 'p *(struct vm_object *)0xfffff8004d9b74c8'.
Comment 93 Robert Blayzor 2016-04-20 14:18:13 UTC
kgdb kernel.debug /dev/mem


This GDB was configured as "amd64-marcel-freebsd"...kernel.debug: No such file or directory.

Can't open a vmcore without a kernel



I assume I need to have a kernel compiled with something else?
Comment 94 Konstantin Belousov freebsd_committer freebsd_triage 2016-04-20 14:28:21 UTC
(In reply to Robert Blayzor from comment #93)
You need the kernel and kernel.debug files from the kernel installation you are running.

Erm, wait, in your case the issue is the SEGFAULT, not the hang, right?  The object is probably already destroyed then.
Comment 95 Robert Blayzor 2016-04-20 14:38:05 UTC
The processes are actually dying with signal 11 (core dump). They are not hanging.
Comment 96 Robert Blayzor 2016-04-20 18:38:18 UTC
Similar segfault today...


844 vm_fault OBJ_DEAD 0xfffff8000cf704c8
pid 844 kpf 2
pid 844 comm exim: signal 11 err 7 code 2 type 12 addr 0x803006434 rsp 0x7fffffffc7b0 rip 0x801201a67 <ff 48 34 74 02 5d c3 48>
844 vm_fault OBJ_DEAD 0xfffff8000cf704c8
pid 844 kpf 2
Failed to write core file for process exim (error 14)
pid 844 (exim), uid 26: exited on signal 11



(gdb) bt full
#0  _thr_signal_unblock (curthread=0x803006400)
    at /.amd_mnt/juno/pub/FreeBSD/10-STABLE/lib/libthr/thread/thr_sig.c:107
No locals.
Cannot access memory at address 0x7fffffffc7b8
Current language:  auto; currently minimal



procstat -v exim.core
  PID              START                END PRT  RES PRES REF SHD   FL TP PATH
  844           0x400000           0x560000 r-x  316  327   3   1 CN-- vn /usr/local/sbin/exim
  844           0x760000           0x76a000 rw-   10    0   2   1 CN-- vn /usr/local/sbin/exim
  844           0x76a000           0x779000 rw-    5    5   2   1 CN-- df
  844        0x800760000        0x80077e000 r-x   30   30  34   0 CN-- vn /libexec/ld-elf.so.1
  844        0x80077e000        0x800785000 rw-    7    7   2   0 CN-- df
  844        0x800787000        0x8007b7000 rw-   40   40   2   1 CN-- df
  844        0x80097d000        0x80097f000 rw-    2    2   2   0 CN-- df
  844        0x80097f000        0x80098d000 r-x   14   15  14   4 CN-- vn /lib/libcrypt.so.5
  844        0x80098d000        0x800b8d000 ---    0    0   2   0 CN-- df
  844        0x800b8d000        0x800b8e000 rw-    1    0   2   0 CN-- vn /lib/libcrypt.so.5
  844        0x800b8e000        0x800b9f000 rw-    0    0   0   0 ---- --
  844        0x800b9f000        0x800bc8000 r-x   41   41  19   6 CN-- vn /lib/libm.so.5
  844        0x800bc8000        0x800dc7000 ---    0    0   2   0 CN-- df
  844        0x800dc7000        0x800dc8000 rw-    1    0   2   0 CN-- vn /lib/libm.so.5
  844        0x800dc8000        0x800dd8000 r-x   16   17  26  10 CN-- vn /lib/libutil.so.9
  844        0x800dd8000        0x800fd8000 ---    0    0   2   0 CN-- df
  844        0x800fd8000        0x800fd9000 rw-    1    0   2   0 CN-- vn /lib/libutil.so.9
  844        0x800fd9000        0x800fdb000 rw-    0    0   0   0 ---- --
  844        0x800fdb000        0x800ff3000 r-x   24   26   3   1 CN-- vn /usr/local/lib/libspf2.so.2.1.0
  844        0x800ff3000        0x8011f3000 ---    0    0   2   0 CN-- df
  844        0x8011f3000        0x8011f5000 rw-    2    0   2   1 CN-- vn /usr/local/lib/libspf2.so.2.1.0
  844        0x8011f5000        0x80120d000 r-x   24   25  28  10 CN-- vn /lib/libthr.so.3
  844        0x80120d000        0x80140d000 ---    0    0   2   0 CN-- df
  844        0x80140d000        0x80140e000 rw-    1    0   2   1 CN-- vn /lib/libthr.so.3
  844        0x80140e000        0x80141a000 rw-   12    0   1   0 C--- df
  844        0x80141a000        0x801445000 r-x   38   41   5   2 CN-- vn /usr/local/lib/libpq.so.5
  844        0x801445000        0x801645000 ---    0    0   2   0 CN-- df
  844        0x801645000        0x801648000 rw-    3    0   2   0 CN-- vn /usr/local/lib/libpq.so.5
  844        0x801648000        0x80181d000 r-x  424  432  10   2 CN-- vn /usr/local/lib/perl5/5.22/mach/CORE/libperl.so.5.22.1
  844        0x80181d000        0x801a1c000 ---    0    0   2   0 CN-- df
  844        0x801a1c000        0x801a25000 rw-    9    0   2   0 CN-- vn /usr/local/lib/perl5/5.22/mach/CORE/libperl.so.5.22.1
  844        0x801a25000        0x801a88000 r-x   86   96  11   5 CN-- vn /usr/lib/libssl.so.7
  844        0x801a88000        0x801c88000 ---    0    0   2   0 CN-- df
  844        0x801c88000        0x801c92000 rw-   10    0   2   0 CN-- vn /usr/lib/libssl.so.7
  844        0x801c92000        0x801e5f000 r-x  461  501  15   7 CN-- vn /lib/libcrypto.so.7
  844        0x801e5f000        0x80205f000 ---    0    0   2   0 CN-- df
  844        0x80205f000        0x802087000 rw-   40    0   2   0 CN-- vn /lib/libcrypto.so.7
  844        0x802087000        0x802089000 rw-    1    1   2   0 CN-- df
  844        0x802089000        0x8020fe000 r-x   45   45   8   3 CN-- vn /usr/local/lib/libpcre.so.1.2.5
  844        0x8020fe000        0x8022fd000 ---    0    0   2   0 CN-- df
  844        0x8022fd000        0x8022fe000 rw-    1    0   2   0 CN-- vn /usr/local/lib/libpcre.so.1.2.5
  844        0x8022fe000        0x802476000 r-x  376  387  60  26 CN-- vn /lib/libc.so.7
  844        0x802476000        0x802675000 ---    0    0   2   0 CN-- df
  844        0x802675000        0x802681000 rw-   12    0   2   1 CN-- vn /lib/libc.so.7
  844        0x802681000        0x8026ab000 rw-   12   12   2   1 CN-- df
  844        0x8026ab000        0x8026b5000 r-x   10   11  10   4 CN-- vn /usr/local/lib/libintl.so.8.1.4
  844        0x8026b5000        0x8028b5000 ---    0    0   2   0 CN-- df
  844        0x8028b5000        0x8028b6000 rw-    1    0   2   0 CN-- vn /usr/local/lib/libintl.so.8.1.4
  844        0x802c00000        0x803400000 rw-  191    0   1   0 C--- df
  844     0x7fffdfffe000     0x7fffdffff000 ---    0    0   0   0 ---- --
  844     0x7ffffffdf000     0x7ffffffff000 rw-   10    0   1   0 C--D df
  844     0x7ffffffff000     0x800000000000 r-x    1    1  36   0 ---- ph
Comment 97 Konstantin Belousov freebsd_committer freebsd_triage 2016-04-25 12:03:22 UTC
https://reviews.freebsd.org/D6085
Comment 98 Robert Blayzor 2016-04-25 12:12:08 UTC
Sounds related for sure. Is there an official PR for that issue this could be marked a duplicate of, or is this it? Or are we still looking at two different problems?
Comment 99 Konstantin Belousov freebsd_committer freebsd_triage 2016-04-25 12:15:22 UTC
(In reply to Robert Blayzor from comment #98)
I have no idea why your debugging data does not contain backtraces of the OBJ_DEAD acquisition points.  Had it, I would have written the patch two months earlier.

I am highly confident that the issue fixed by the patch from the review is the same as your SIGSEGV.
Comment 100 Robert Blayzor 2016-04-25 12:20:21 UTC
It certainly seems like it. I'll work on patching this in and giving it a go. What's interesting is how this issue creeps up and why. I've only ever had this problem with diskless servers that mount / as NFS read-only.
Comment 101 Robert Blayzor 2016-04-25 12:34:38 UTC
In the review, which release is this patch for? I tried patching it against 10-STABLE and it failed.
Comment 102 Robert Blayzor 2016-04-25 14:40:17 UTC
I assume this is from CURRENT. Is there any way this can be MFC'd to 10-STABLE?
Comment 103 Konstantin Belousov freebsd_committer freebsd_triage 2016-04-25 15:13:33 UTC
Created attachment 169679 [details]
Patch from D6085, adopted for stable/10.
Comment 104 Robert Blayzor 2016-04-25 20:40:20 UTC
My first attempt at this patch against 10-STABLE.

I have a 10.3-RELEASE userland, and built a 10-STABLE kernel with this patch. When I boot the kernel it just hangs after the normal kernel boot messages, right after:

NFS ROOT: 10.1.60.211:/vol/v10


I will do a fresh checkout of 10-STABLE, build a kernel against that, and attempt to reboot to see if it's the patch or the STABLE kernel.

10.3-RELEASE still builds fine.
Comment 105 Robert Blayzor 2016-04-26 11:08:03 UTC
I have tried the patch for stable/10 and it causes the system to hang at boot, at what appears to be the start of init (but I'm not sure). The last message I get is the kernel message about root being mounted from NFS, and my vmx0 boot interface comes up. The system is pingable, but hangs at this point.

If I back out the patch, it works fine.

I've tried the patch against stable/10 and also releng/10.3, both of which seem to patch OK.
Comment 106 Konstantin Belousov freebsd_committer freebsd_triage 2016-04-26 11:48:24 UTC
(In reply to Robert Blayzor from comment #105)
Break into ddb and do 'ps' then 'alltrace', for the hung boot.
Comment 107 Robert Blayzor 2016-04-27 11:28:58 UTC
I did manage to get the patch to work against a pristine checkout of RELENG_10, so I have several VMs running now with the patch in place. Not sure what was in stable/10 that was hanging up the boot; perhaps something from a previous patch. Will monitor and update.
Comment 108 Robert Blayzor 2016-05-08 14:32:51 UTC
It's been longer than average now and I have not run into processes terminating abnormally with the last patch installed against 10.3. HOWEVER, I have noticed a new issue with the network stack that seems to be happening at roughly the same interval. I'm not sure if the two are related, or if fixing one problem manifested as another.

Basically, we're now getting servers filling up with TCP connections stuck in a "CLOSED" state. We'll end up with thousands of them until connections to the processes just time out.

Sometimes we'll see kernel messages:

sonewconn: pcb 0xfffff80001b36930: Listen queue overflow: 767 already in queue awaiting acceptance (46 occurrences)
sonewconn: pcb 0xfffff80001b36930: Listen queue overflow: 767 already in queue awaiting acceptance (46 occurrences)
sonewconn: pcb 0xfffff80001b36930: Listen queue overflow: 767 already in queue awaiting acceptance (50 occurrences)
sonewconn: pcb 0xfffff80001b36930: Listen queue overflow: 767 already in queue awaiting acceptance (42 occurrences)
sonewconn: pcb 0xfffff80001b36930: Listen queue overflow: 767 already in queue awaiting acceptance (44 occurrences)
sonewconn: pcb 0xfffff80001b36930: Listen queue overflow: 767 already in queue awaiting acceptance (38 occurrences)
sonewconn: pcb 0xfffff80001b36930: Listen queue overflow: 767 already in queue awaiting acceptance (40 occurrences)
[...]


But not always... Currently I have a server in this state and I'll see:


tcp6       0      0 mta1.imap              mta-slb-1.alb1.i.20511 CLOSED
tcp6       0      0 mta1.pop3              mta-slb-1.alb1.i.43879 CLOSED
tcp6       0      0 mta1.imap              mta-slb-0.alb1.i.5259  CLOSED
tcp6       0      0 mta1.pop3              mta-slb-0.alb1.i.12519 CLOSED
tcp6       0      0 mta1.imap              mta-slb-1.alb1.i.1316  CLOSED
tcp6       0      0 mta1.pop3              mta-slb-1.alb1.i.65343 CLOSED
tcp6       0      0 mta1.imap              mta-slb-0.alb1.i.16289 CLOSED
tcp6       0      0 mta1.pop3              mta-slb-0.alb1.i.19215 CLOSED
tcp6      32      0 mta1.sieve             mta-slb-0.alb1.i.19549 CLOSED
tcp6       0      0 mta1.imap              mta-slb-1.alb1.i.49287 CLOSED
tcp6      32      0 mta1.sieve             mta-slb-1.alb1.i.54187 CLOSED
tcp6       0      0 mta1.pop3              mta-slb-1.alb1.i.39767 CLOSED
tcp6       0      0 mta1.imap              mta-slb-0.alb1.i.54366 CLOSED
tcp6       0      0 mta1.pop3              mta-slb-0.alb1.i.47579 CLOSED
tcp6       0      0 mta1.imap              mta-slb-1.alb1.i.48798 CLOSED
tcp6       0      0 mta1.pop3              mta-slb-1.alb1.i.40190 CLOSED
... [ 1000's of lines truncated ]


It's not just Dovecot either; we also see several stuck in CLOSED from Exim. So it doesn't look like an application issue; in fact, sockstat shows these stuck sockets are no longer associated with any process, i.e.:

?        ?          ?     ?  tcp6   2607:f058:110:2::1:1:143 2607:f058:110:2::f:1:56602
?        ?          ?     ?  tcp6   2607:f058:110:2::1:1:4190 2607:f058:110:2::f:0:32558
?        ?          ?     ?  tcp6   2607:f058:110:2::1:1:110 2607:f058:110:2::f:1:53931
?        ?          ?     ?  tcp6   2607:f058:110:2::1:1:110 2607:f058:110:2::f:0:58671
?        ?          ?     ?  tcp6   2607:f058:110:2::1:1:143 2607:f058:110:2::f:1:58788
?        ?          ?     ?  tcp6   2607:f058:110:2::1:1:143 2607:f058:110:2::f:0:30523
?        ?          ?     ?  tcp6   2607:f058:110:2::1:1:110 2607:f058:10::10:32375
?        ?          ?     ?  tcp6   2607:f058:110:2::1:1:143 2607:f058:110:2::f:0:46131
?        ?          ?     ?  tcp6   2607:f058:110:2::1:1:110 2607:f058:110:2::f:1:50671
?        ?          ?     ?  tcp6   2607:f058:110:2::1:1:143 2607:f058:110:2::f:1:4223
?        ?          ?     ?  tcp6   2607:f058:110:2::1:1:143 2607:f058:110:2::f:1:15773
?        ?          ?     ?  tcp6   2607:f058:110:2::1:1:110 2607:f058:110:2::f:0:26610
?        ?          ?     ?  tcp6   2607:f058:110:2::1:1:4190 2607:f058:110:2::f:0:38765
?        ?          ?     ?  tcp6   2607:f058:110:2::1:1:143 2607:f058:110:2::f:0:42310
?        ?          ?     ?  tcp6   2607:f058:110:2::1:1:143 2607:f058:110:2::f:1:5643
?        ?          ?     ?  tcp6   2607:f058:110:2::1:1:143 2607:f058:110:2::f:0:37143
?        ?          ?     ?  tcp6   2607:f058:110:2::1:1:4190 2607:f058:110:2::f:0:24906
[...] (again, thousands of lines truncated)


netstat -ans -p tcp
tcp:
	8497364 packets sent
		3825626 data packets (484438498 bytes)
		12 data packets (5560 bytes) retransmitted
		1 data packet unnecessarily retransmitted
		0 resends initiated by MTU discovery
		4106401 ack-only packets (0 delayed)
		0 URG only packets
		0 window probe packets
		313890 window update packets
		251435 control packets
	5525333 packets received
		4126773 acks (for 483181992 bytes)
		108333 duplicate acks
		0 acks for unsent data
		3787497 packets (1012422242 bytes) received in-sequence
		151 completely duplicate packets (0 bytes)
		0 old duplicate packets
		0 packets with some dup. data (0 bytes duped)
		0 out-of-order packets (0 bytes)
		0 packets (0 bytes) of data after window
		0 window probes
		7067 window update packets
		6376 packets received after close
		0 discarded for bad checksums
		0 discarded for bad header offset fields
		0 discarded because packet too short
		0 discarded due to memory problems
	7546 connection requests
	315652 connection accepts
	0 bad connection attempts
	0 listen queue overflows
	73753 ignored RSTs in the windows
	323196 connections established (including accepts)
	323132 connections closed (including 1414 drops)
		121869 connections updated cached RTT on close
		121869 connections updated cached RTT variance on close
		0 connections updated cached ssthresh on close
	2 embryonic connections dropped
	4126615 segments updated rtt (of 3726607 attempts)
	12 retransmit timeouts
		0 connections dropped by rexmit timeout
	0 persist timeouts
		0 connections dropped by persist timeout
	14 Connections (fin_wait_2) dropped because of timeout
	19343 keepalive timeouts
		19298 keepalive probes sent
		45 connections dropped by keepalive
	240267 correct ACK header predictions
	792401 correct data packet header predictions
	315653 syncache entries added
		0 retransmitted
		0 dupsyn
		0 dropped
		315652 completed
		0 bucket overflow
		0 cache overflow
		1 reset
		0 stale
		0 aborted
		0 badack
		0 unreach
		0 zone failures
	315653 cookies sent
	0 cookies received
	164 hostcache entries added
		0 bucket overflow
	0 SACK recovery episodes
	0 segment rexmits in SACK recovery episodes
	0 byte rexmits in SACK recovery episodes
	0 SACK options (SACK blocks) received
	0 SACK options (SACK blocks) sent
	0 SACK scoreboard overflow
	0 packets with ECN CE bit set
	0 packets with ECN ECT(0) bit set
	0 packets with ECN ECT(1) bit set
	0 successful ECN handshakes
	0 times ECN reduced the congestion window
	0 packets with valid tcp-md5 signature received
	0 packets with invalid tcp-md5 signature received
	0 packets with tcp-md5 signature mismatch
	0 packets with unexpected tcp-md5 signature received
	0 packets without expected tcp-md5 signature received




If I attempt to kill and restart the processes, sometimes it works and sometimes it doesn't, and I end up having to reboot the server.
Comment 109 Konstantin Belousov freebsd_committer freebsd_triage 2016-05-09 02:19:52 UTC
(In reply to Robert Blayzor from comment #108)
Thank you for the testing.  I am confident that your SIGSEGV issue is resolved.

Also, I am sure that the network issue you described is unrelated.  Please track it in a separate bug report.
Comment 110 Twingly 2016-05-16 15:51:43 UTC
(In reply to Konstantin Belousov from comment #109)

> Also, I am sure that the network issue you described is unrelated.
> Please track it in the different bug report.

Are you referring to "Bug 204764 - Filesystem deadlock, process in vodead state"?
Comment 111 Robert Blayzor 2016-05-16 17:18:54 UTC
Opened a new PR on the excessive CLOSED socket issue I'm seeing:  PR 209471
Comment 112 Robert Blayzor 2016-05-16 22:22:15 UTC
Bug 209471 was opened for the socket issue. Interested parties may want to look at that; it is starting to look as if these two problems are related.
Comment 113 Robert Blayzor 2016-05-27 00:23:12 UTC
I have posted an update to Bug 209471.

The latest update is that when the problem reported in that bug appears, the process is stuck in "vmf_de", which is the sleep state from the patch suggested here.
Comment 114 Robert Blayzor 2016-06-21 00:20:17 UTC
Even with the suggested patch, we are now seeing processes occasionally hang in this state:

dev@mta0 [~] procstat -ta | grep 7893
 7893 100155 dovecot          -                  0  120 sleep   vmf_de
dev@mta0 [~] procstat -ta | grep 7893
 7893 100155 dovecot          -                  1  120 sleep   vmf_de
dev@mta0 [~] procstat -ta | grep 7893
 7893 100155 dovecot          -                  2  120 sleep   vmf_de
dev@mta0 [~] procstat -ta | grep 7893
 7893 100155 dovecot          -                  2  120 sleep   vmf_de
dev@mta0 [~] procstat -ta | grep 7893
 7893 100155 dovecot          -                  3  120 sleep   vmf_de
dev@mta0 [~] procstat -ta | grep 7893
 7893 100155 dovecot          -                  2  120 sleep   vmf_de
dev@mta0 [~] procstat -ta | grep 7893
 7893 100155 dovecot          -                  2  120 sleep   vmf_de



When this happens we usually end up with a process that leaves a lot of detached sockets stuck in a CLOSED state. The process normally becomes unkillable and only a reboot clears the problem.
Comment 115 Jason Wolfe 2016-06-23 22:57:00 UTC
Robert, check out https://svnweb.freebsd.org/changeset/base/302063 from 204764 and let us know if that helps any.
Comment 116 Robert Blayzor 2016-06-24 19:07:40 UTC
Should I leave the patch from comment #103 in place while using the patch/diff from PR 204764?
Comment 117 Konstantin Belousov freebsd_committer freebsd_triage 2016-06-24 19:59:18 UTC
(In reply to Robert Blayzor from comment #116)
They are complementary.  The committed change should fix the vmf_de problem for you.
Comment 118 Robert Blayzor 2016-06-24 20:35:06 UTC
I tried building a new kernel with the recommended changes; when compiling I get an error:

/sys/vm/vm_meter.c:145:13: error: use of undeclared identifier 'vm_cnt'
                                              &vm_cnt.v_free_count)
                                              ^
1 error generated.
*** Error code 1
Comment 119 Konstantin Belousov freebsd_committer freebsd_triage 2016-06-24 20:40:47 UTC
(In reply to Robert Blayzor from comment #118)
vm_cnt was called just cnt in previous branches.  Use &cnt.v_free_count.
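
In other words, the failing hunk just needs the old spelling of the counter struct; a hypothetical accessor to illustrate (assuming stable/10's <sys/vmmeter.h>, where the global is still named cnt rather than vm_cnt):

	#include <sys/param.h>
	#include <sys/vmmeter.h>

	/* Hypothetical helper: the free page count, spelled for stable/10. */
	static __inline u_int
	free_page_count(void)
	{
		return (cnt.v_free_count);	/* vm_cnt.v_free_count on head */
	}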
Comment 120 commit-hook freebsd_committer freebsd_triage 2016-06-27 21:54:36 UTC
A commit references this bug:

Author: kib
Date: Mon Jun 27 21:54:20 UTC 2016
New revision: 302236
URL: https://svnweb.freebsd.org/changeset/base/302236

Log:
  If the vm_fault() handler raced with the vm_object_collapse()
  sleepable scan, iteration over the shadow chain looking for a page
  could find an OBJ_DEAD object.  Such state of the mapping is only
  transient, the dead object will be terminated and removed from the
  chain shortly.  We must not return KERN_PROTECTION_FAILURE unless the
  object type is changed to OBJT_DEAD in the chain, indicating that
  paging on this address is really impossible.  Returning
  KERN_PROTECTION_FAILURE prematurely causes spurious SIGSEGV delivered
  to processes, or kernel accesses to UVA spuriously failing with
  EFAULT.

  If the object with OBJ_DEAD flag is found, only return
  KERN_PROTECTION_FAILURE when object type is already OBJT_DEAD.
  Otherwise, sleep a tick and retry the fault handling.

  Ideally, we would wait until the OBJ_DEAD flag is resolved, e.g. by
  waiting until the paging on this object is finished.  But to do so, we
  need to reference the dead object, while vm_object_collapse() insists
  on owning the final reference on the collapsed object.  This could be
  fixed by e.g. changing the assert to shared reference release between
  vm_fault() and vm_object_collapse(), but it seems to be too much
  complications for rare boundary condition.

  PR:	204426
  Tested by:    pho
  Reviewed by:  alc
  Sponsored by: The FreeBSD Foundation
  X-Differential revision:	https://reviews.freebsd.org/D6085
  MFC after:	2 weeks
  Approved by:	re (gjb)

Changes:
  head/sys/vm/vm_fault.c
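
In rough outline, the behaviour the log describes looks like the sketch below, placed in the shadow-chain walk of vm_fault().  This is a simplification and not the committed diff: "obj" is a placeholder for the chain object under examination, and the real code also releases the rest of the fault state before retrying.  The "vmf_de" wait channel is the sleep state seen earlier in the procstat output.

	if ((obj->flags & OBJ_DEAD) != 0) {
		if (obj->type == OBJT_DEAD) {
			/* Paging at this address is genuinely impossible. */
			VM_OBJECT_WUNLOCK(obj);
			return (KERN_PROTECTION_FAILURE);
		}
		/*
		 * Transient OBJ_DEAD while vm_object_collapse() finishes:
		 * drop the lock, sleep for a tick, and retry the fault.
		 */
		VM_OBJECT_WUNLOCK(obj);
		pause("vmf_de", 1);
		goto RetryFault;
	}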
Comment 121 Robert Blayzor 2016-07-01 14:43:16 UTC
I've been running the last set of patches for this, and so far so good. Then again, it's not unusual for us to go several days before we see the problem appear.

Will report back in one week.
Comment 122 Robert Blayzor 2016-07-05 12:37:24 UTC
We are still seeing a problem with servers getting hosed up with hundreds and hundreds of sockets in a CLOSED state. I've experienced this with two different Apache 2.4 servers, and both times it was an almost identical issue...

dmesg
sonewconn: pcb 0xfffff80001b34000: Listen queue overflow: 767 already in queue awaiting acceptance (34 occurrences)
sonewconn: pcb 0xfffff80001b34000: Listen queue overflow: 767 already in queue awaiting acceptance (39 occurrences)
sonewconn: pcb 0xfffff80001b34000: Listen queue overflow: 767 already in queue awaiting acceptance (36 occurrences)
sonewconn: pcb 0xfffff80001b34000: Listen queue overflow: 767 already in queue awaiting acceptance (37 occurrences)
sonewconn: pcb 0xfffff80001b34000: Listen queue overflow: 767 already in queue awaiting acceptance (42 occurrences)
sonewconn: pcb 0xfffff80001b34000: Listen queue overflow: 767 already in queue awaiting acceptance (38 occurrences)
sonewconn: pcb 0xfffff80001b34000: Listen queue overflow: 767 already in queue awaiting acceptance (39 occurrences)
sonewconn: pcb 0xfffff80001b34000: Listen queue overflow: 767 already in queue awaiting acceptance (36 occurrences)
...


webmail3 [~] netstat -an | grep " CLOSED"
tcp4     113      0 127.0.0.1.80           127.0.0.1.14511        CLOSED
tcp4     113      0 127.0.0.1.80           127.0.0.1.26064        CLOSED
tcp4     113      0 127.0.0.1.80           127.0.0.1.57807        CLOSED
tcp4     113      0 127.0.0.1.80           127.0.0.1.39385        CLOSED
tcp4     113      0 127.0.0.1.80           127.0.0.1.22623        CLOSED
tcp4     113      0 127.0.0.1.80           127.0.0.1.65280        CLOSED
tcp4     113      0 127.0.0.1.80           127.0.0.1.27494        CLOSED
tcp4     113      0 127.0.0.1.80           127.0.0.1.12996        CLOSED
tcp4     113      0 127.0.0.1.80           127.0.0.1.59834        CLOSED
tcp4     113      0 127.0.0.1.80           127.0.0.1.32036        CLOSED
tcp4     113      0 127.0.0.1.80           127.0.0.1.34364        CLOSED
tcp6       0      0 2607:f058:110:3:.80    2607:f058:110:3:.35467 CLOSED
tcp6       0      0 2607:f058:110:3:.80    2607:f058:110:3:.28605 CLOSED
tcp6       0      0 2607:f058:110:3:.80    2607:f058:110:3:.21232 CLOSED
tcp6       0      0 2607:f058:110:3:.80    2607:f058:110:3:.48099 CLOSED
tcp6       0      0 2607:f058:110:3:.80    2607:f058:110:3:.65026 CLOSED
tcp6       0      0 2607:f058:110:3:.80    2607:f058:110:3:.54254 CLOSED
tcp6       0      0 2607:f058:110:3:.80    2607:f058:110:3:.35803 CLOSED
tcp6       0      0 2607:f058:110:3:.80    2607:f058:110:3:.11435 CLOSED
tcp6       0      0 2607:f058:110:3:.80    2607:f058:110:3:.46650 CLOSED
tcp6       0      0 2607:f058:110:3:.80    2607:f058:110:3:.7447  CLOSED
...


webmail3 [~] netstat -an | grep " CLOSED" | wc -l
     980


webmail3 [~] sudo procstat -at | grep httpd
	  699 100076 httpd            -                  0  120 sleep   select
	  755 100093 httpd            -                  3  120 sleep   pipewr
	 4220 100124 httpd            -                  3  120 sleep   pipewr
	 6812 100448 httpd            -                  2  120 sleep   pipewr
	18402 100112 httpd            -                  0  120 sleep   pipewr
	24419 100122 httpd            -                  3  120 sleep   pipewr
	24590 100142 httpd            -                  2  120 sleep   pipewr
	24591 100431 httpd            -                  3  120 sleep   pipewr
	24592 100095 httpd            -                  2  120 sleep   pipewr
	24593 100432 httpd            -                  0  120 sleep   pipewr
	24594 100134 httpd            -                  2  120 sleep   pipewr
	24595 100052 httpd            -                  3  120 sleep   pipewr
	24596 100128 httpd            -                  2  120 sleep   pipewr
	24597 100062 httpd            -                  2  120 sleep   pipewr
	24598 100047 httpd            -                  3  120 sleep   pipewr
	24599 100073 httpd            -                  2  120 sleep   pipewr
	24600 100129 httpd            -                  0  120 sleep   pipewr
...

webmail3 [~] sudo procstat -at | grep httpd | wc -l
     201
Comment 123 Robert Blayzor 2016-08-03 19:40:09 UTC
After several weeks with the patches from this PR installed, we can report that we've not had either issue in some time.

Will these patches be MFC'd, and if possible to 10-STABLE?
Comment 124 Konstantin Belousov freebsd_committer freebsd_triage 2016-08-04 08:07:16 UTC
(In reply to Robert Blayzor from comment #123)
All relevant patches have already been merged to stable/10, about a month ago.
Comment 125 Oleksandr Tymoshenko freebsd_committer freebsd_triage 2019-01-21 18:00:41 UTC
There is a commit referencing this PR, but it's still not closed and has been inactive for some time. Closing the PR as fixed but feel free to re-open it if the issue hasn't been completely resolved.

Thanks