Recent upgrade of ntpd 4.2.8p4 to stable/9 results in a daemon which behaves very very oddly. Said upgrade: http://www.freshbsd.org/commit/freebsd/r289998 My log after several manual troubleshooting attempts -- note the intermixed segfaults: Oct 26 15:38:05 icarus ntpd[1092]: giving up resolving host clock.isc.org: servname not supported for ai_socktype (9) Oct 26 15:38:23 icarus ntpd[1116]: giving up resolving host clock.isc.org: servname not supported for ai_socktype (9) pid 1139 (ntpd), uid 0: exited on signal 11 (core dumped) Oct 26 15:39:07 icarus ntpd[1176]: giving up resolving host clock.isc.org: servname not supported for ai_socktype (9) Oct 26 15:39:59 icarus ntpd[1209]: giving up resolving host ntp-1.cso.uiuc.edu: servname not supported for ai_socktype (9) Oct 26 15:40:24 icarus ntpd[1268]: giving up resolving host clock.isc.org: servname not supported for ai_socktype (9) pid 1294 (ntpd), uid 0: exited on signal 11 (core dumped) pid 1312 (ntpd), uid 0: exited on signal 11 (core dumped) Oct 26 15:44:09 icarus ntpd[1409]: giving up resolving host clock.isc.org: servname not supported for ai_socktype (9) Oct 26 15:45:26 icarus ntpd[1490]: giving up resolving host 0.freebsd.pool.ntp.org: servname not supported for ai_socktype (9) Oct 26 15:50:18 icarus ntpd[1656]: giving up resolving host tick.jrc.us: servname not supported for ai_socktype (9) Segfaults are always here: root@icarus:~ # gdb /usr/sbin/ntpd /ntpd.core ... #0 0x000000080114d79d in _malloc_postfork () from /lib/libc.so.7 [New Thread 801807c00 (LWP 100797/ntpd)] [New Thread 801807400 (LWP 100791/ntpd)] (gdb) bt #0 0x000000080114d79d in _malloc_postfork () from /lib/libc.so.7 #1 0x000000080114fb3e in _malloc_postfork () from /lib/libc.so.7 #2 0x00000008011523fe in _malloc_prefork () from /lib/libc.so.7 #3 0x0000000801154482 in calloc () from /lib/libc.so.7 #4 0x000000080117aba6 in __res_state () from /lib/libc.so.7 #5 0x000000080118698c in freeaddrinfo () from /lib/libc.so.7 #6 0x00000008011ab61a in nsdispatch () from /lib/libc.so.7 #7 0x0000000801187ffb in getaddrinfo () from /lib/libc.so.7 #8 0x0000000000474f04 in blocking_getaddrinfo () #9 0x0000000000473a43 in blocking_child_common () #10 0x00000000004737e9 in blocking_thread () #11 0x0000000800afee70 in pthread_getprio () from /lib/libthr.so.3 #12 0x0000000000000000 in ?? () Important: The behaviour seen is very strange. Basically, the daemon starts, emits one of the aforementioned DNS errors, then proceeds to either a) exit, b) crash, or c) continue running. Sometimes when the daemon exits (possibly when crashing too), it restarts itself. There have been a couple times where ps -auxwww | grep ntp returns nothing, yet a few seconds later the daemon is found running. Things I've tried which made no difference: 1. Removing -4 from $ntpd_flags (I set this because while my system has IPv6, I prefer using IPv4 everywhere) 2. Using /etc/ntp.conf (r289998) instead of my own ntp.conf There is no workaround for this other than to roll back to something prior to r289998. Googling turns up several reports of this problem, but all relate to people trying to use chroot'ing with ntpd (I DO NOT use this feature). https://mail-index.netbsd.org/current-users/2014/01/26/msg024169.html https://mail-index.netbsd.org/current-users/2014/06/01/msg024998.html One report says that use of -O1 (on ARM) relieves the problem, but crashing is seen on VAX and other platforms. (My system uses gcc, not clang, just for the record) Footnote: upgrading to stable/10 is not an option until the load average bug there is rectified (I am not the only one to report this problem). I can try to test out this ntpd on a VM running stable/10 to see if the problem there is reproducible or not. My ntp.conf (w/ comments removed): server clock.isc.org iburst server ntp-1.cso.uiuc.edu iburst server clock.psu.edu iburst server tick.jrc.us iburst server 0.us.pool.ntp.org iburst restrict default limited kod nomodify notrap nopeer noquery restrict 127.0.0.1 restrict 192.168.1.0 mask 255.255.255.0 My rc.conf ntp-related flags: # ntpd_flags: temporary workaround for https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=199127 ntpd_enable="yes" ntpd_config="/conf/ME/ntp.conf" ntpd_sync_on_start="yes" ntpd_flags="-4 ${ntpd_flags}"
I can reproduce this problem on stable/9 using ports/net/ntp and ports/net/ntp-devel too, though the stack trace looks different (maybe getting clobbered, or maybe libevent takes care of something, not sure): root@icarus:~ # pkg info | grep ntp ntp-4.2.8p4 The Network Time Protocol Distribution root@icarus: # grep ntp /etc/rc.conf # ntpd_flags: temporary workaround for https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=199127 ntpd_enable="yes" ntpd_program="/usr/local/sbin/ntpd" ntpd_config="/conf/ME/ntp.conf" ntpd_sync_on_start="yes" ntpd_flags="-4 -p /var/run/ntpd.pid -f /var/db/ntpd.drift" root@icarus:~ # service ntpd start Starting ntpd. root@icarus:~ # dmesg -a | tail -1 pid 2425 (ntpd), uid 0: exited on signal 11 (core dumped) root@icarus:~ # gdb /usr/local/sbin/ntpd /ntpd.core ... #0 0x000000080134579d in _malloc_postfork () from /lib/libc.so.7 [New Thread 801807c00 (LWP 100886/ntpd)] [New Thread 801807400 (LWP 100878/ntpd)] (gdb) bt #0 0x000000080134579d in _malloc_postfork () from /lib/libc.so.7 #1 0x0000000801347b3e in _malloc_postfork () from /lib/libc.so.7 #2 0x000000080134a3fe in _malloc_prefork () from /lib/libc.so.7 #3 0x000000080134c482 in calloc () from /lib/libc.so.7 #4 0x0000000801372ba6 in __res_state () from /lib/libc.so.7 #5 0x000000080137e98c in freeaddrinfo () from /lib/libc.so.7 #6 0x00000008013a361a in nsdispatch () from /lib/libc.so.7 #7 0x000000080137fffb in getaddrinfo () from /lib/libc.so.7 #8 0x0000000000461c3d in ?? () #9 0x0000000000462293 in ?? () #10 0x00000000004648c9 in ?? () #11 0x00000008010a6e70 in pthread_getprio () from /lib/libthr.so.3 #12 0x0000000000000000 in ?? ()
Adding glebius (committer).
Take.
Jeremy, I'm trying to reproduce it, and I got the gai_strerror() log line only one. I didn't have any cores. Definitely, we need something special in the environment. While I continue my search, can you please do the following: cd /usr/src/usr.sbin/ntp make clean make WITH_DEBUG=-g make STRIP= install And please collect core files and share with us.
Jeremy, and if you have ideas what is special with your resolving environment, please share them. Is there anything special with your /etc/resolv.conf? Do you have a packet loss or unusually large RTT towards your DNS server?
Happy to rebuild with debug symbols, but I'm not sure that's going to shed too much light as most of the backtrace is in libc functions (I'd have to build world with -g). I'll do so and try to document step-by-step what I saw happen + provide backtrace and/or core dumps. resolv.conf is as bare bones as possible: search home.lan nameserver 127.0.0.1 named (from base) runs locally and is nearly a purely a caching nameserver (it hosts a couple zones (home.lan and a blackhole zone for unwanted FQDNs)). rc.conf settings for named: named_enable="yes" named_flags="-4" Hop count between myself and root nameservers varies, but is definitely below stock TTL (64). RTT between myself and nameservers varies, but usually 10-90ms (west coast vs. east coast). If RTT was a problem, I'd expect this to be manifesting itself badly universally. Whatever the issue is seems specific to ntp in some way. I did try fully stopping/starting named as a precaution. Other stuff I can think of off the top of my head: make.conf: CPUTYPE?=core2 # For DTrace; also affects ports STRIP= CFLAGS+= -fno-omit-frame-pointer src.conf: WITHOUT_ATM=true WITHOUT_BLUETOOTH=true WITHOUT_CLANG=true WITHOUT_FLOPPY=true WITHOUT_FREEBSD_UPDATE=true WITHOUT_IPFILTER=true WITHOUT_IPX=true WITHOUT_LIB32=true WITHOUT_NDIS=true WITHOUT_NETGRAPH=true WITHOUT_PPP=true WITHOUT_SENDMAIL=true WITHOUT_WIRELESS=true WITH_OPENSSH_NONE_CIPHER=true sysctl.conf: net.inet.icmp.icmplim=1500 net.inet.tcp.sendbuf_max=16777216 net.inet.tcp.recvbuf_max=16777216 net.inet.tcp.sendspace=65536 net.inet.tcp.recvspace=65536 kern.maxvnodes=393216 vfs.ufs.dirhash_maxmem=16777216 vfs.zfs.min_auto_ashift=12 System does not use a firewall. P.S. -- What exactly is ai_socktype 9? I see nothing about that in include/sys/socket.h, and it certainly can't be a reference to AF_DATAKIT. Or maybe it's a syscall return code; I haven't looked at the code.
9 is EAI_SERVICE, error code.
Rebuilt system per instructions in src/Makefile. Upon system boot, ntpd came up and appeared functioning (?!?!?!). However, I noticed the line about "restrict kod" doing nothing without "limited", so: Oct 27 05:22:38 icarus ntpd[708]: ntpd 4.2.8p4-a (1): Starting Oct 27 05:22:38 icarus kernel: Starting ntpd. Oct 27 05:22:38 icarus ntpd[708]: Command line: /usr/sbin/ntpd -g -c /conf/ME/ntp.conf -4 -p /var/run/ntpd.pid -f /var/db/ntpd.drift Oct 27 05:22:38 icarus ntpd[709]: proto: precision = 1.117 usec (-20) Oct 27 05:22:38 icarus ntpd[709]: restrict default: KOD does nothing without LIMITED. Oct 27 05:22:38 icarus ntpd[709]: Listen and drop on 0 v4wildcard 0.0.0.0:123 Oct 27 05:22:38 icarus ntpd[709]: Listen normally on 1 em0 192.168.1.51:123 Oct 27 05:22:38 icarus ntpd[709]: Listen normally on 2 lo0 127.0.0.1:123 Oct 27 05:22:38 icarus ntpd[709]: Listening on routing socket on fd #23 for interface updates Oct 27 05:22:40 icarus ntpd[709]: receive: Unexpected origin timestamp from 130.126.24.24 Oct 27 05:22:40 icarus ntpd[709]: receive: Unexpected origin timestamp from 128.118.25.3 I service ntpd stop'd, which got me these: Oct 27 05:23:38 icarus ntpd[709]: ntpd exiting on signal 15 (Terminated: 15) Oct 27 05:23:38 icarus ntpd[709]: 149.20.64.28 local addr 192.168.1.51 -> <null> Oct 27 05:23:38 icarus ntpd[709]: 130.126.24.24 local addr 192.168.1.51 -> <null> Oct 27 05:23:38 icarus ntpd[709]: 128.118.25.3 local addr 192.168.1.51 -> <null> Oct 27 05:23:38 icarus ntpd[709]: 208.90.144.52 local addr 192.168.1.51 -> <null> Oct 27 05:23:38 icarus ntpd[709]: 142.54.181.202 local addr 192.168.1.51 -> <null> I modified /conf/ME/ntp.conf to use: restrict default kod nomodify notrap nopeer noquery Then did service ntpd start. The result: Oct 27 05:23:39 icarus ntpd[1088]: ntpd 4.2.8p4-a (1): Starting Oct 27 05:23:39 icarus ntpd[1088]: Command line: /usr/sbin/ntpd -g -c /conf/ME/ntp.conf -4 -p /var/run/ntpd.pid -f /var/db/ntpd.drift Oct 27 05:23:39 icarus ntpd[1089]: proto: precision = 1.396 usec (-19) Oct 27 05:23:39 icarus ntpd[1089]: Listen and drop on 0 v4wildcard 0.0.0.0:123 Oct 27 05:23:39 icarus ntpd[1089]: Listen normally on 1 em0 192.168.1.51:123 Oct 27 05:23:39 icarus ntpd[1089]: Listen normally on 2 lo0 127.0.0.1:123 Oct 27 05:23:39 icarus ntpd[1089]: Listening on routing socket on fd #23 for interface updates Oct 27 05:23:39 icarus ntpd[1089]: giving up resolving host clock.isc.org: servname not supported for ai_socktype (9) Oct 27 05:23:39 icarus kernel: Oct 27 05:23:39 icarus ntpd[1089]: giving up resolving host clock.isc.org: servname not supported for ai_socktype (9) I did service ntpd stop: Oct 27 05:24:19 icarus ntpd[1089]: ntpd exiting on signal 15 (Terminated: 15) Oct 27 05:24:19 icarus ntpd[1089]: 130.126.24.24 local addr 192.168.1.51 -> <null> Oct 27 05:24:19 icarus ntpd[1089]: 128.118.25.3 local addr 192.168.1.51 -> <null> Oct 27 05:24:19 icarus ntpd[1089]: 208.90.144.52 local addr 192.168.1.51 -> <null> Oct 27 05:24:19 icarus ntpd[1089]: 172.82.134.52 local addr 192.168.1.51 -> <null> ...modified /conf/ME/ntp.conf to remove "limited" from restrict, then did service ntpd start. The result: Oct 27 05:24:20 icarus ntpd[1130]: ntpd 4.2.8p4-a (1): Starting Oct 27 05:24:20 icarus ntpd[1130]: Command line: /usr/sbin/ntpd -g -c /conf/ME/ntp.conf -4 -p /var/run/ntpd.pid -f /var/db/ntpd.drift Oct 27 05:24:20 icarus ntpd[1131]: proto: precision = 1.396 usec (-19) Oct 27 05:24:20 icarus ntpd[1131]: restrict default: KOD does nothing without LIMITED. Oct 27 05:24:20 icarus ntpd[1131]: Listen and drop on 0 v4wildcard 0.0.0.0:123 Oct 27 05:24:20 icarus ntpd[1131]: Listen normally on 1 em0 192.168.1.51:123 Oct 27 05:24:20 icarus ntpd[1131]: Listen normally on 2 lo0 127.0.0.1:123 Oct 27 05:24:20 icarus ntpd[1131]: Listening on routing socket on fd #23 for interface updates Oct 27 05:24:20 icarus kernel: pid 1131 (ntpd), uid 0: exited on signal 11 (core dumped) Backtrace is the same as what I've posted before (specifically the one containing all the blocking_* functions). Rebuild ntpd per your instructions in c#4, with one minor change: it should be make DEBUG_FLAGS=-g not WITH_DEBUG=-g. :-) I noticed a couple warnings in libntpevent that might be relevant (unsure): ===> libntpevent (all) cc -O2 -pipe -fno-omit-frame-pointer -march=core2 -I/usr/src/usr.sbin/ntp/libntpevent/../../../contrib/ntp/include -I/usr/src/usr.sbin/ntp/libntpevent/../../../contrib/ntp/sntp/libevent/include -I/usr/src/usr.sbin/ntp/libntpevent/ -DHAVE_BSD_NICE -DHAVE_STDINT_H -DSYS_FREEBSD -DPARSE -DHAVE_CONFIG_H -DOPENSSL -DUSE_OPENSSL_CRYPTO_RAND -DAUTOKEY -g -std=gnu99 -fstack-protector -Wno-pointer-sign -c /usr/src/usr.sbin/ntp/libntpevent/../../../contrib/ntp/sntp/libevent/buffer.c -o buffer.o /usr/src/usr.sbin/ntp/libntpevent/../../../contrib/ntp/sntp/libevent/buffer.c: In function 'evbuffer_run_callbacks': /usr/src/usr.sbin/ntp/libntpevent/../../../contrib/ntp/sntp/libevent/buffer.c:495: warning: implicit declaration of function 'LIST_END' /usr/src/usr.sbin/ntp/libntpevent/../../../contrib/ntp/sntp/libevent/buffer.c:495: warning: comparison between pointer and integer Verified /usr/sbin/ntpd mtime changed and file was substantially larger, then started ntpd. Result: Oct 27 05:40:03 icarus ntpd[4093]: ntpd 4.2.8p4-a (1): Starting Oct 27 05:40:03 icarus ntpd[4093]: Command line: /usr/sbin/ntpd -g -c /conf/ME/ntp.conf -4 -p /var/run/ntpd.pid -f /var/db/ntpd.drift Oct 27 05:40:03 icarus ntpd[4094]: proto: precision = 1.117 usec (-20) Oct 27 05:40:03 icarus ntpd[4094]: restrict default: KOD does nothing without LIMITED. Oct 27 05:40:03 icarus ntpd[4094]: Listen and drop on 0 v4wildcard 0.0.0.0:123 Oct 27 05:40:03 icarus ntpd[4094]: Listen normally on 1 em0 192.168.1.51:123 Oct 27 05:40:03 icarus ntpd[4094]: Listen normally on 2 lo0 127.0.0.1:123 Oct 27 05:40:03 icarus ntpd[4094]: Listening on routing socket on fd #23 for interface updates Oct 27 05:40:03 icarus ntpd[4094]: giving up resolving host clock.isc.org: servname not supported for ai_socktype (9) Oct 27 05:40:03 icarus kernel: Oct 27 05:40:03 icarus ntpd[4094]: giving up resolving host clock.isc.org: servname not supported for ai_socktype (9) At this point I tried something for fun: ntpq -c peers (should return a peer list). That broke very very badly: root@icarus:/usr/src/usr.sbin/ntp # ntpq -c peers /usr/src/usr.sbin/ntp/libntp/../../../contrib/ntp/lib/isc/unix/net.c:221: fatal error: RUNTIME_CHECK(((pthread_once((&once), (initialize_action)) == 0) ? 0 : 34) == 0) failed Abort (core dumped) bt + bt full: root@icarus:/usr/src/usr.sbin/ntp # gdb /usr/bin/ntpq ntpq.core #0 0x000000080115e13c in kill () from /lib/libc.so.7 (gdb) bt #0 0x000000080115e13c in kill () from /lib/libc.so.7 #1 0x000000080115cc73 in abort () from /lib/libc.so.7 #2 0x0000000000418ce8 in isc_error_fatal (file=<value optimized out>, line=<value optimized out>, format=<value optimized out>) at /usr/src/usr.sbin/ntp/libntp/../../../contrib/ntp/lib/isc/error.c:76 #3 0x0000000000418d40 in isc_error_runtimecheck (file=<value optimized out>, line=<value optimized out>, expression=<value optimized out>) at /usr/src/usr.sbin/ntp/libntp/../../../contrib/ntp/lib/isc/error.c:81 #4 0x0000000000413209 in isc_net_probeipv4 () at /usr/src/usr.sbin/ntp/libntp/../../../contrib/ntp/lib/isc/unix/net.c:226 #5 0x0000000000412005 in init_lib () at /usr/src/usr.sbin/ntp/libntp/../../../contrib/ntp/libntp/lib_strbuf.c:35 #6 0x0000000000407700 in ntpqmain (argc=3, argv=0x7fffffffebc8) at /usr/src/usr.sbin/ntp/ntpq/../../../contrib/ntp/ntpq/ntpq.c:461 #7 0x0000000000403a61 in _start () #8 0x000000080065a000 in ?? () #9 0x0000000000000000 in ?? () (gdb) bt full #0 0x000000080115e13c in kill () from /lib/libc.so.7 No symbol table info available. #1 0x000000080115cc73 in abort () from /lib/libc.so.7 No symbol table info available. #2 0x0000000000418ce8 in isc_error_fatal (file=<value optimized out>, line=<value optimized out>, format=<value optimized out>) at /usr/src/usr.sbin/ntp/libntp/../../../contrib/ntp/lib/isc/error.c:76 args = Could not find the frame base for "isc_error_fatal". (gdb) In fact, this happens even trying to do ntpq -h. Stopped ntpd: Oct 27 05:42:48 icarus ntpd[4094]: ntpd exiting on signal 15 (Terminated: 15) Oct 27 05:42:48 icarus ntpd[4094]: 130.126.24.24 local addr 192.168.1.51 -> <null> Oct 27 05:42:48 icarus ntpd[4094]: 128.118.25.3 local addr 192.168.1.51 -> <null> Oct 27 05:42:48 icarus ntpd[4094]: 208.90.144.52 local addr 192.168.1.51 -> <null> Oct 27 05:42:48 icarus ntpd[4094]: 104.41.150.68 local addr 192.168.1.51 -> <null> Modified /conf/ME/ntp.conf to include "limited" restrict bit, restarted daemon. Result: Oct 27 05:44:02 icarus ntpd[4199]: ntpd 4.2.8p4-a (1): Starting Oct 27 05:44:02 icarus ntpd[4199]: Command line: /usr/sbin/ntpd -g -c /conf/ME/ntp.conf -4 -p /var/run/ntpd.pid -f /var/db/ntpd.drift Oct 27 05:44:02 icarus ntpd[4200]: proto: precision = 1.117 usec (-20) Oct 27 05:44:02 icarus ntpd[4200]: Listen and drop on 0 v4wildcard 0.0.0.0:123 Oct 27 05:44:02 icarus ntpd[4200]: Listen normally on 1 em0 192.168.1.51:123 Oct 27 05:44:02 icarus ntpd[4200]: Listen normally on 2 lo0 127.0.0.1:123 Oct 27 05:44:02 icarus ntpd[4200]: Listening on routing socket on fd #23 for interface updates Oct 27 05:44:02 icarus ntpd[4200]: giving up resolving host clock.isc.org: servname not supported for ai_socktype (9) Oct 27 05:44:02 icarus kernel: Oct 27 05:44:02 icarus ntpd[4200]: giving up resolving host clock.isc.org: servname not supported for ai_socktype (9) Stopped daemon again and restarted it many times. No segfault, but I refused to believe it wouldn't happen. The problem is just to bizarre to give up like that. Modified /etc/rc.conf to remove ntpd_sync_on_start="yes", restarted daemon -- bam, segfault: Oct 27 05:46:51 icarus ntpd[4416]: ntpd 4.2.8p4-a (1): Starting Oct 27 05:46:51 icarus ntpd[4416]: Command line: /usr/sbin/ntpd -c /conf/ME/ntp.conf -4 -p /var/run/ntpd.pid -f /var/db/ntpd.drift Oct 27 05:46:51 icarus ntpd[4417]: proto: precision = 1.396 usec (-19) Oct 27 05:46:51 icarus ntpd[4417]: Listen and drop on 0 v4wildcard 0.0.0.0:123 Oct 27 05:46:51 icarus ntpd[4417]: Listen normally on 1 em0 192.168.1.51:123 Oct 27 05:46:51 icarus ntpd[4417]: Listen normally on 2 lo0 127.0.0.1:123 Oct 27 05:46:51 icarus ntpd[4417]: Listening on routing socket on fd #23 for interface updates Oct 27 05:46:51 icarus kernel: pid 4417 (ntpd), uid 0: exited on signal 11 (core dumped) bt + bt full: (gdb) bt #0 0x000000080114d79d in _malloc_postfork () from /lib/libc.so.7 #1 0x000000080114fb3e in _malloc_postfork () from /lib/libc.so.7 #2 0x00000008011523fe in _malloc_prefork () from /lib/libc.so.7 #3 0x0000000801154482 in calloc () from /lib/libc.so.7 #4 0x000000080117aba6 in __res_state () from /lib/libc.so.7 #5 0x000000080118698c in freeaddrinfo () from /lib/libc.so.7 #6 0x00000008011ab61a in nsdispatch () from /lib/libc.so.7 #7 0x0000000801187ffb in getaddrinfo () from /lib/libc.so.7 #8 0x0000000000474f04 in blocking_getaddrinfo (c=0x801a20980, req=0x801898780) at /usr/src/usr.sbin/ntp/libntp/../../../contrib/ntp/libntp/ntp_intres.c:352 #9 0x0000000000473a43 in blocking_child_common (c=0x801a20980) at /usr/src/usr.sbin/ntp/libntp/../../../contrib/ntp/libntp/ntp_worker.c:288 #10 0x00000000004737e9 in blocking_thread (ThreadArg=<value optimized out>) at /usr/src/usr.sbin/ntp/libntp/../../../contrib/ntp/libntp/work_thread.c:663 #11 0x0000000800afee70 in pthread_getprio () from /lib/libthr.so.3 #12 0x0000000000000000 in ?? () (gdb) bt full #0 0x000000080114d79d in _malloc_postfork () from /lib/libc.so.7 No symbol table info available. #1 0x000000080114fb3e in _malloc_postfork () from /lib/libc.so.7 No symbol table info available. #2 0x00000008011523fe in _malloc_prefork () from /lib/libc.so.7 No symbol table info available. #3 0x0000000801154482 in calloc () from /lib/libc.so.7 No symbol table info available. #4 0x000000080117aba6 in __res_state () from /lib/libc.so.7 No symbol table info available. #5 0x000000080118698c in freeaddrinfo () from /lib/libc.so.7 No symbol table info available. #6 0x00000008011ab61a in nsdispatch () from /lib/libc.so.7 No symbol table info available. #7 0x0000000801187ffb in getaddrinfo () from /lib/libc.so.7 No symbol table info available. #8 0x0000000000474f04 in blocking_getaddrinfo (c=0x801a20980, req=0x801898780) at /usr/src/usr.sbin/ntp/libntp/../../../contrib/ntp/libntp/ntp_intres.c:352 worker_ctx = (dnsworker_ctx *) 0x801a33400 resp = (blocking_pipe_header *) 0x801a72400 gai_resp = (blocking_gai_resp *) 0x801a72428 node = 0x801898820 "clock.isc.org" service = 0x80189882e "ntp" ai_res = (struct addrinfo *) 0x0 ai = <value optimized out> canons_octets = <value optimized out> this_octets = <value optimized out> resp_octets = <value optimized out> cp = <value optimized out> time_now = <value optimized out> #9 0x0000000000473a43 in blocking_child_common (c=0x801a20980) at /usr/src/usr.sbin/ntp/libntp/../../../contrib/ntp/libntp/ntp_worker.c:288 req = (blocking_pipe_header *) 0x801898780 #10 0x00000000004737e9 in blocking_thread (ThreadArg=<value optimized out>) at /usr/src/usr.sbin/ntp/libntp/../../../contrib/ntp/libntp/work_thread.c:663 No locals. #11 0x0000000800afee70 in pthread_getprio () from /lib/libthr.so.3 No symbol table info available. #12 0x0000000000000000 in ?? () No symbol table info available. (gdb) I've placed the cores, relevant system config files (including kernel config), and anything else I can think of here: http://jdc.koitsu.org/freebsd/204048/ I'm praying CFLAGS+=-fno-omit-frame-pointer for DTrace isn't causing this somehow.
The ntpq problem is already fixed in stable/9. We will update releng/9.3 as soon as we are done with ntpd problem, which definitely is a more complex one.
Can you please share the ntpd binary, that produced the core?
No problem -- binary is now there (ntpd.gz).
Alas, my gdb still fails too open your ntpd/ntpd.core :( Can you please compile and install libc with debug? This is done simply from the lib/libc dir with the same sequence: cd lib/libc make clean make DEBUG_FLAGS=-g make STRIP= install
Rather than continue to "destroy" (bad word, using the term loosely) my working FreeBSD system that must remain stable (and run a stable ntpd server), I built a VM under VMware Workstation 11.x to reproduce the problem. I've been 100% successful, including segfaults. The system configuration is even more bare-bones than my actual bare-metal box. Even named isn't running on it (instead resolv.conf uses 192.168.1.51 as a nameserver). The backtraces look the same as well. Gleb, if you could provide me with a public SSH key, I'd like to give you root-level SSH access to a VM where you can do whatever you need. This is a 2-core VM with 2GB RAM w/ dtrace. If you need console access (VNC), I can get you that. I can take system/VM snapshots (just hit me up here in the ticket) if needed as well, in case there's something major system-wide you need to do and might need/want to roll back quickly.
This appears to be a (technically) duplicate of bug 204013, but given the amount of information here, I'll leave it open. Lets not duplicate effort :) Treating this as the BASE ntp report, with bug 204013 about PORTS ntp, to the extent that of course they may be (likely?) related. Do we have reproductions on base versions other than stable/9 ?
Yes, here: http://marc.info/?l=freebsd-security&m=144603403915501&w=2
All current communication regarding this issue (as a whole) have been done in private Email between FreeBSD Project members Xin Li and Gleb Smirnoff, and NTP developers Harlan Stenn, Juergen Perlinger, and Danny Mayer. I've been CC'd on most (all?) the communications. There is still no "smoking gun" explanation for the problem as of this writing. There are several issues that have been brought up by Juergen via a quick code review, but nothing definitive has been found. One thing that has been proposed was to try using configure --without-threads and then really bang on the problem. On the FreeBSD side, Xin was able to reproduce the problem on a 10.2 machine (something called "psb"). In other words: this problem affects 9.x and 10.x alike (and that's worth noting because 10.x doesn't include ISC BIND in the base system (it provides a resolver library), instead using unbound, while 9.x does include ISC BIND). There has been mention of this problem on freebsd-security as well by another Project member: https://lists.freebsd.org/pipermail/freebsd-security/2015-October/008660.html -- don't conflate this with the ntpq/ntpdc SIGABRT problem (separate yet equally painful, but should be rectified now in the base system). What makes this matter so complicated: FreeBSD's NTP is nearly 6 years old (4.2.4p8 vs. 4.2.8p4). Harlan disclosed that between those two versions there have been between 1700 and 2100+ changes (not lines of code, but actual changes): http://support.ntp.org/bin/view/Dev/ReleaseTimeline The ChangeLog between these versions is massive: http://archive.ntp.org/ntp4/ChangeLog-stable
FWIW, trouble started for OPNsense when NTP was updated from 4.2.8p3 to 4.2.8p4. This doesn't mean that the bug wasn't there all the time or that it is found in NTP specifically (pthreads bug in FreeBSD?). It could mean the issue was only now unearthed, yet may provide a straw for going forward finding the commit that triggers this behaviour. Since there are already quite a few people involved (more or less visibly), a big thanks to everyone trying to get to the bottom of this. :)
I also have a suspicion that a bug in threading library is uncovered by NTP. In my core, I have __is_threaded global equal 0, while the process is definitely threaded.
NGie pointed to this commit: https://svnweb.freebsd.org/base?view=revision&revision=287591 I did a test build and it solved the issue for me on 10.1. :)
Did you do a blind reboot before? I mean, the problem can simply go away after a reboot. So rebooting with a patched kernel doesn't prove that the problem is fixed. :(
I rebooted to apply the kernel change, then restarted ntpd a few times using the procedure that would reliably crash it before. Stays up fine now. No more ntpd core dumps, no more dmesg output regarding this.
Yes, r287591 looks like the candidate very much. I'm now testing it on 10.2-RELEASE. Jeremy, can you please meanwhile test it on 9.3?
(In reply to Gleb Smirnoff from comment #22) Can/will do -- presently building a 2nd VM/test system for exactly this.
(In reply to Gleb Smirnoff from comment #22) Said commit never got MFC'd to stable/9: https://svnweb.freebsd.org/base/stable/9/sys/vm/vm_map.c Code on stable/9 presently looks like what was in r287591 on stable/10. Relevant sections of code: https://svnweb.freebsd.org/base/stable/9/sys/vm/vm_map.c?revision=266585&view=markup#l3881 https://svnweb.freebsd.org/base/stable/9/sys/vm/vm_map.c?revision=266585&view=markup#l4040
(In reply to Jeremy Chadwick from comment #24) > Code on stable/9 presently looks like what was in r287591 on stable/10. Typo on my part: should have read r287590, not r287591. Rest of previous comment still stands.
Jeremy, r287591 is revision from head. Its merge to stable/10 is r287846. It wasn't merged to stable/9 at all.
(In reply to Gleb Smirnoff from comment #26) I apologise -- I took c#19 to mean "this commit caused the problem" (I didn't look closely to see it was for CURRENT), not "this commit fixed the issue for me". Thanks for clarifying! Sure, I can merge that in and give it a try on stable/9.
Since backporting the change in r287591 (head) to stable/9, I have been unable to reproduce the problem. I've done everything possible I can think of that has triggered the problem and haven't seen a segfault yet. Likewise: the spurious "giving up resolving host XXX: servname not supported for ai_socktype (9)" log messages have also disappeared completely. I think we may have found our smoking gun.
Yes, I have same results on my 10.2 system. Thanks to kib@ for the fix. I'm starting to prepare Errata Notice.
The symptoms are entirely consistent with r287591. It should be harmless to mass backport to stable/8 and stable/7 as well.
(In reply to Gleb Smirnoff from comment #29) Others have experienced this in 10.2 as well, base and ports. See https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=204013.
*** Bug 204013 has been marked as a duplicate of this bug. ***
*** This bug has been marked as a duplicate of bug 204046 ***