Bug 204048 - stable/9: r289998: ntpd 4.2.8p4 DNS resolution misbehaves (occasional segfault)
Summary: stable/9: r289998: ntpd 4.2.8p4 DNS resolution misbehaves (occasional segfault)
Status: Closed DUPLICATE of bug 204046
Alias: None
Product: Base System
Classification: Unclassified
Component: bin (show other bugs)
Version: 9.3-RELEASE
Hardware: amd64 Any
: Normal Affects Many People
Assignee: FreeBSD Security Officer
URL:
Keywords: crash, needs-qa, regression
: 204013 (view as bug list)
Depends on:
Blocks:
 
Reported: 2015-10-26 23:06 UTC by Jeremy Chadwick
Modified: 2016-04-17 10:46 UTC (History)
9 users (show)

See Also:
koobs: mfc-stable10?
koobs: mfc-stable9?


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Jeremy Chadwick 2015-10-26 23:06:44 UTC
Recent upgrade of ntpd 4.2.8p4 to stable/9 results in a daemon which behaves very very oddly.  Said upgrade: http://www.freshbsd.org/commit/freebsd/r289998

My log after several manual troubleshooting attempts -- note the intermixed segfaults:

Oct 26 15:38:05 icarus ntpd[1092]: giving up resolving host clock.isc.org: servname not supported for ai_socktype (9)
Oct 26 15:38:23 icarus ntpd[1116]: giving up resolving host clock.isc.org: servname not supported for ai_socktype (9)
pid 1139 (ntpd), uid 0: exited on signal 11 (core dumped)
Oct 26 15:39:07 icarus ntpd[1176]: giving up resolving host clock.isc.org: servname not supported for ai_socktype (9)
Oct 26 15:39:59 icarus ntpd[1209]: giving up resolving host ntp-1.cso.uiuc.edu: servname not supported for ai_socktype (9)
Oct 26 15:40:24 icarus ntpd[1268]: giving up resolving host clock.isc.org: servname not supported for ai_socktype (9)
pid 1294 (ntpd), uid 0: exited on signal 11 (core dumped)
pid 1312 (ntpd), uid 0: exited on signal 11 (core dumped)
Oct 26 15:44:09 icarus ntpd[1409]: giving up resolving host clock.isc.org: servname not supported for ai_socktype (9)
Oct 26 15:45:26 icarus ntpd[1490]: giving up resolving host 0.freebsd.pool.ntp.org: servname not supported for ai_socktype (9)
Oct 26 15:50:18 icarus ntpd[1656]: giving up resolving host tick.jrc.us: servname not supported for ai_socktype (9)

Segfaults are always here:

root@icarus:~ # gdb /usr/sbin/ntpd /ntpd.core
...
#0  0x000000080114d79d in _malloc_postfork () from /lib/libc.so.7
[New Thread 801807c00 (LWP 100797/ntpd)]
[New Thread 801807400 (LWP 100791/ntpd)]
(gdb) bt
#0  0x000000080114d79d in _malloc_postfork () from /lib/libc.so.7
#1  0x000000080114fb3e in _malloc_postfork () from /lib/libc.so.7
#2  0x00000008011523fe in _malloc_prefork () from /lib/libc.so.7
#3  0x0000000801154482 in calloc () from /lib/libc.so.7
#4  0x000000080117aba6 in __res_state () from /lib/libc.so.7
#5  0x000000080118698c in freeaddrinfo () from /lib/libc.so.7
#6  0x00000008011ab61a in nsdispatch () from /lib/libc.so.7
#7  0x0000000801187ffb in getaddrinfo () from /lib/libc.so.7
#8  0x0000000000474f04 in blocking_getaddrinfo ()
#9  0x0000000000473a43 in blocking_child_common ()
#10 0x00000000004737e9 in blocking_thread ()
#11 0x0000000800afee70 in pthread_getprio () from /lib/libthr.so.3
#12 0x0000000000000000 in ?? ()

Important:

The behaviour seen is very strange.  Basically, the daemon starts, emits one of the aforementioned DNS errors, then proceeds to either a) exit, b) crash, or c) continue running.

Sometimes when the daemon exits (possibly when crashing too), it restarts itself.  There have been a couple times where ps -auxwww | grep ntp returns nothing, yet a few seconds later the daemon is found running.

Things I've tried which made no difference:

1. Removing -4 from $ntpd_flags (I set this because while my system has IPv6, I prefer using IPv4 everywhere)
2. Using /etc/ntp.conf (r289998) instead of my own ntp.conf

There is no workaround for this other than to roll back to something prior to r289998.

Googling turns up several reports of this problem, but all relate to people trying to use chroot'ing with ntpd (I DO NOT use this feature).

https://mail-index.netbsd.org/current-users/2014/01/26/msg024169.html
https://mail-index.netbsd.org/current-users/2014/06/01/msg024998.html

One report says that use of -O1 (on ARM) relieves the problem, but crashing is seen on VAX and other platforms.  (My system uses gcc, not clang, just for the record)

Footnote: upgrading to stable/10 is not an option until the load average bug there is rectified (I am not the only one to report this problem).  I can try to test out this ntpd on a VM running stable/10 to see if the problem there is reproducible or not.

My ntp.conf (w/ comments removed):

server clock.isc.org          iburst
server ntp-1.cso.uiuc.edu     iburst
server clock.psu.edu          iburst
server tick.jrc.us            iburst
server 0.us.pool.ntp.org      iburst

restrict default limited kod nomodify notrap nopeer noquery
restrict 127.0.0.1
restrict 192.168.1.0 mask 255.255.255.0

My rc.conf ntp-related flags:

# ntpd_flags: temporary workaround for https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=199127
ntpd_enable="yes"
ntpd_config="/conf/ME/ntp.conf"
ntpd_sync_on_start="yes"
ntpd_flags="-4 ${ntpd_flags}"
Comment 1 Jeremy Chadwick 2015-10-26 23:14:44 UTC
I can reproduce this problem on stable/9 using ports/net/ntp and ports/net/ntp-devel too, though the stack trace looks different (maybe getting clobbered, or maybe libevent takes care of something, not sure):

root@icarus:~ # pkg info | grep ntp
ntp-4.2.8p4                    The Network Time Protocol Distribution

root@icarus: # grep ntp /etc/rc.conf
# ntpd_flags: temporary workaround for https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=199127
ntpd_enable="yes"
ntpd_program="/usr/local/sbin/ntpd"
ntpd_config="/conf/ME/ntp.conf"
ntpd_sync_on_start="yes"
ntpd_flags="-4 -p /var/run/ntpd.pid -f /var/db/ntpd.drift"

root@icarus:~ # service ntpd start
Starting ntpd.
root@icarus:~ # dmesg -a | tail -1
pid 2425 (ntpd), uid 0: exited on signal 11 (core dumped)

root@icarus:~ # gdb /usr/local/sbin/ntpd /ntpd.core
...
#0  0x000000080134579d in _malloc_postfork () from /lib/libc.so.7
[New Thread 801807c00 (LWP 100886/ntpd)]
[New Thread 801807400 (LWP 100878/ntpd)]
(gdb) bt
#0  0x000000080134579d in _malloc_postfork () from /lib/libc.so.7
#1  0x0000000801347b3e in _malloc_postfork () from /lib/libc.so.7
#2  0x000000080134a3fe in _malloc_prefork () from /lib/libc.so.7
#3  0x000000080134c482 in calloc () from /lib/libc.so.7
#4  0x0000000801372ba6 in __res_state () from /lib/libc.so.7
#5  0x000000080137e98c in freeaddrinfo () from /lib/libc.so.7
#6  0x00000008013a361a in nsdispatch () from /lib/libc.so.7
#7  0x000000080137fffb in getaddrinfo () from /lib/libc.so.7
#8  0x0000000000461c3d in ?? ()
#9  0x0000000000462293 in ?? ()
#10 0x00000000004648c9 in ?? ()
#11 0x00000008010a6e70 in pthread_getprio () from /lib/libthr.so.3
#12 0x0000000000000000 in ?? ()
Comment 2 Jeremy Chadwick 2015-10-26 23:34:49 UTC
Adding glebius (committer).
Comment 3 Xin LI freebsd_committer freebsd_triage 2015-10-27 00:20:33 UTC
Take.
Comment 4 Gleb Smirnoff freebsd_committer freebsd_triage 2015-10-27 10:27:40 UTC
Jeremy,

I'm trying to reproduce it, and I got the gai_strerror() log line only one. I didn't have any cores. Definitely, we need something special in the environment.
While I continue my search, can you please do the following:

cd /usr/src/usr.sbin/ntp
make clean
make WITH_DEBUG=-g
make STRIP= install

And please collect core files and share with us.
Comment 5 Gleb Smirnoff freebsd_committer freebsd_triage 2015-10-27 10:28:48 UTC
Jeremy,

and if you have ideas what is special with your resolving environment, please share them. Is there anything special with your /etc/resolv.conf? Do you have a packet loss or unusually large RTT towards your DNS server?
Comment 6 Jeremy Chadwick 2015-10-27 11:49:22 UTC
Happy to rebuild with debug symbols, but I'm not sure that's going to shed too much light as most of the backtrace is in libc functions (I'd have to build world with -g).  I'll do so and try to document step-by-step what I saw happen + provide backtrace and/or core dumps.

resolv.conf is as bare bones as possible:

search home.lan
nameserver 127.0.0.1

named (from base) runs locally and is nearly a purely a caching nameserver (it hosts a couple zones (home.lan and a blackhole zone for unwanted FQDNs)).  rc.conf settings for named:

named_enable="yes"
named_flags="-4"

Hop count between myself and root nameservers varies, but is definitely below stock TTL (64).  RTT between myself and nameservers varies, but usually 10-90ms (west coast vs. east coast).  If RTT was a problem, I'd expect this to be manifesting itself badly universally.  Whatever the issue is seems specific to ntp in some way.

I did try fully stopping/starting named as a precaution.

Other stuff I can think of off the top of my head:

make.conf:

CPUTYPE?=core2
# For DTrace; also affects ports
STRIP=
CFLAGS+=        -fno-omit-frame-pointer

src.conf:

WITHOUT_ATM=true
WITHOUT_BLUETOOTH=true
WITHOUT_CLANG=true
WITHOUT_FLOPPY=true
WITHOUT_FREEBSD_UPDATE=true
WITHOUT_IPFILTER=true
WITHOUT_IPX=true
WITHOUT_LIB32=true
WITHOUT_NDIS=true
WITHOUT_NETGRAPH=true
WITHOUT_PPP=true
WITHOUT_SENDMAIL=true
WITHOUT_WIRELESS=true
WITH_OPENSSH_NONE_CIPHER=true

sysctl.conf:

net.inet.icmp.icmplim=1500
net.inet.tcp.sendbuf_max=16777216
net.inet.tcp.recvbuf_max=16777216
net.inet.tcp.sendspace=65536
net.inet.tcp.recvspace=65536
kern.maxvnodes=393216
vfs.ufs.dirhash_maxmem=16777216
vfs.zfs.min_auto_ashift=12

System does not use a firewall.

P.S. -- What exactly is ai_socktype 9?  I see nothing about that in include/sys/socket.h, and it certainly can't be a reference to AF_DATAKIT.  Or maybe it's a syscall return code; I haven't looked at the code.
Comment 7 Gleb Smirnoff freebsd_committer freebsd_triage 2015-10-27 12:05:00 UTC
9 is EAI_SERVICE, error code.
Comment 8 Jeremy Chadwick 2015-10-27 12:57:54 UTC
Rebuilt system per instructions in src/Makefile.  Upon system boot, ntpd came up and appeared functioning (?!?!?!).  However, I noticed the line about "restrict kod" doing nothing without "limited", so:

Oct 27 05:22:38 icarus ntpd[708]: ntpd 4.2.8p4-a (1): Starting
Oct 27 05:22:38 icarus kernel: Starting ntpd.
Oct 27 05:22:38 icarus ntpd[708]: Command line: /usr/sbin/ntpd -g -c /conf/ME/ntp.conf -4 -p /var/run/ntpd.pid -f /var/db/ntpd.drift
Oct 27 05:22:38 icarus ntpd[709]: proto: precision = 1.117 usec (-20)
Oct 27 05:22:38 icarus ntpd[709]: restrict default: KOD does nothing without LIMITED.
Oct 27 05:22:38 icarus ntpd[709]: Listen and drop on 0 v4wildcard 0.0.0.0:123
Oct 27 05:22:38 icarus ntpd[709]: Listen normally on 1 em0 192.168.1.51:123
Oct 27 05:22:38 icarus ntpd[709]: Listen normally on 2 lo0 127.0.0.1:123
Oct 27 05:22:38 icarus ntpd[709]: Listening on routing socket on fd #23 for interface updates
Oct 27 05:22:40 icarus ntpd[709]: receive: Unexpected origin timestamp from 130.126.24.24
Oct 27 05:22:40 icarus ntpd[709]: receive: Unexpected origin timestamp from 128.118.25.3

I service ntpd stop'd, which got me these:

Oct 27 05:23:38 icarus ntpd[709]: ntpd exiting on signal 15 (Terminated: 15)
Oct 27 05:23:38 icarus ntpd[709]: 149.20.64.28 local addr 192.168.1.51 -> <null>
Oct 27 05:23:38 icarus ntpd[709]: 130.126.24.24 local addr 192.168.1.51 -> <null>
Oct 27 05:23:38 icarus ntpd[709]: 128.118.25.3 local addr 192.168.1.51 -> <null>
Oct 27 05:23:38 icarus ntpd[709]: 208.90.144.52 local addr 192.168.1.51 -> <null>
Oct 27 05:23:38 icarus ntpd[709]: 142.54.181.202 local addr 192.168.1.51 -> <null>

I modified /conf/ME/ntp.conf to use:

restrict default kod nomodify notrap nopeer noquery

Then did service ntpd start.  The result:

Oct 27 05:23:39 icarus ntpd[1088]: ntpd 4.2.8p4-a (1): Starting
Oct 27 05:23:39 icarus ntpd[1088]: Command line: /usr/sbin/ntpd -g -c /conf/ME/ntp.conf -4 -p /var/run/ntpd.pid -f /var/db/ntpd.drift
Oct 27 05:23:39 icarus ntpd[1089]: proto: precision = 1.396 usec (-19)
Oct 27 05:23:39 icarus ntpd[1089]: Listen and drop on 0 v4wildcard 0.0.0.0:123
Oct 27 05:23:39 icarus ntpd[1089]: Listen normally on 1 em0 192.168.1.51:123
Oct 27 05:23:39 icarus ntpd[1089]: Listen normally on 2 lo0 127.0.0.1:123
Oct 27 05:23:39 icarus ntpd[1089]: Listening on routing socket on fd #23 for interface updates
Oct 27 05:23:39 icarus ntpd[1089]: giving up resolving host clock.isc.org: servname not supported for ai_socktype (9)
Oct 27 05:23:39 icarus kernel: Oct 27 05:23:39 icarus ntpd[1089]: giving up resolving host clock.isc.org: servname not supported for ai_socktype (9)

I did service ntpd stop:

Oct 27 05:24:19 icarus ntpd[1089]: ntpd exiting on signal 15 (Terminated: 15)
Oct 27 05:24:19 icarus ntpd[1089]: 130.126.24.24 local addr 192.168.1.51 -> <null>
Oct 27 05:24:19 icarus ntpd[1089]: 128.118.25.3 local addr 192.168.1.51 -> <null>
Oct 27 05:24:19 icarus ntpd[1089]: 208.90.144.52 local addr 192.168.1.51 -> <null>
Oct 27 05:24:19 icarus ntpd[1089]: 172.82.134.52 local addr 192.168.1.51 -> <null>

...modified /conf/ME/ntp.conf to remove "limited" from restrict, then did service ntpd start.  The result:

Oct 27 05:24:20 icarus ntpd[1130]: ntpd 4.2.8p4-a (1): Starting
Oct 27 05:24:20 icarus ntpd[1130]: Command line: /usr/sbin/ntpd -g -c /conf/ME/ntp.conf -4 -p /var/run/ntpd.pid -f /var/db/ntpd.drift
Oct 27 05:24:20 icarus ntpd[1131]: proto: precision = 1.396 usec (-19)
Oct 27 05:24:20 icarus ntpd[1131]: restrict default: KOD does nothing without LIMITED.
Oct 27 05:24:20 icarus ntpd[1131]: Listen and drop on 0 v4wildcard 0.0.0.0:123
Oct 27 05:24:20 icarus ntpd[1131]: Listen normally on 1 em0 192.168.1.51:123
Oct 27 05:24:20 icarus ntpd[1131]: Listen normally on 2 lo0 127.0.0.1:123
Oct 27 05:24:20 icarus ntpd[1131]: Listening on routing socket on fd #23 for interface updates
Oct 27 05:24:20 icarus kernel: pid 1131 (ntpd), uid 0: exited on signal 11 (core dumped)

Backtrace is the same as what I've posted before (specifically the one containing all the blocking_* functions).

Rebuild ntpd per your instructions in c#4, with one minor change: it should be make DEBUG_FLAGS=-g not WITH_DEBUG=-g.  :-)

I noticed a couple warnings in libntpevent that might be relevant (unsure):

===> libntpevent (all)
cc  -O2 -pipe -fno-omit-frame-pointer -march=core2 -I/usr/src/usr.sbin/ntp/libntpevent/../../../contrib/ntp/include  -I/usr/src/usr.sbin/ntp/libntpevent/../../../contrib/ntp/sntp/libevent/include  -I/usr/src/usr.sbin/ntp/libntpevent/ -DHAVE_BSD_NICE -DHAVE_STDINT_H -DSYS_FREEBSD -DPARSE -DHAVE_CONFIG_H  -DOPENSSL -DUSE_OPENSSL_CRYPTO_RAND -DAUTOKEY -g -std=gnu99  -fstack-protector -Wno-pointer-sign -c /usr/src/usr.sbin/ntp/libntpevent/../../../contrib/ntp/sntp/libevent/buffer.c -o buffer.o
/usr/src/usr.sbin/ntp/libntpevent/../../../contrib/ntp/sntp/libevent/buffer.c: In function 'evbuffer_run_callbacks':
/usr/src/usr.sbin/ntp/libntpevent/../../../contrib/ntp/sntp/libevent/buffer.c:495: warning: implicit declaration of function 'LIST_END'
/usr/src/usr.sbin/ntp/libntpevent/../../../contrib/ntp/sntp/libevent/buffer.c:495: warning: comparison between pointer and integer

Verified /usr/sbin/ntpd mtime changed and file was substantially larger, then started ntpd.  Result:

Oct 27 05:40:03 icarus ntpd[4093]: ntpd 4.2.8p4-a (1): Starting
Oct 27 05:40:03 icarus ntpd[4093]: Command line: /usr/sbin/ntpd -g -c /conf/ME/ntp.conf -4 -p /var/run/ntpd.pid -f /var/db/ntpd.drift
Oct 27 05:40:03 icarus ntpd[4094]: proto: precision = 1.117 usec (-20)
Oct 27 05:40:03 icarus ntpd[4094]: restrict default: KOD does nothing without LIMITED.
Oct 27 05:40:03 icarus ntpd[4094]: Listen and drop on 0 v4wildcard 0.0.0.0:123
Oct 27 05:40:03 icarus ntpd[4094]: Listen normally on 1 em0 192.168.1.51:123
Oct 27 05:40:03 icarus ntpd[4094]: Listen normally on 2 lo0 127.0.0.1:123
Oct 27 05:40:03 icarus ntpd[4094]: Listening on routing socket on fd #23 for interface updates
Oct 27 05:40:03 icarus ntpd[4094]: giving up resolving host clock.isc.org: servname not supported for ai_socktype (9)
Oct 27 05:40:03 icarus kernel: Oct 27 05:40:03 icarus ntpd[4094]: giving up resolving host clock.isc.org: servname not supported for ai_socktype (9)

At this point I tried something for fun: ntpq -c peers (should return a peer list).  That broke very very badly:

root@icarus:/usr/src/usr.sbin/ntp # ntpq -c peers
/usr/src/usr.sbin/ntp/libntp/../../../contrib/ntp/lib/isc/unix/net.c:221: fatal error: RUNTIME_CHECK(((pthread_once((&once), (initialize_action)) == 0) ? 0 : 34) == 0) failed
Abort (core dumped)

bt + bt full:

root@icarus:/usr/src/usr.sbin/ntp # gdb /usr/bin/ntpq ntpq.core
#0  0x000000080115e13c in kill () from /lib/libc.so.7
(gdb) bt
#0  0x000000080115e13c in kill () from /lib/libc.so.7
#1  0x000000080115cc73 in abort () from /lib/libc.so.7
#2  0x0000000000418ce8 in isc_error_fatal (file=<value optimized out>, line=<value optimized out>, format=<value optimized out>)
    at /usr/src/usr.sbin/ntp/libntp/../../../contrib/ntp/lib/isc/error.c:76
#3  0x0000000000418d40 in isc_error_runtimecheck (file=<value optimized out>, line=<value optimized out>,
    expression=<value optimized out>) at /usr/src/usr.sbin/ntp/libntp/../../../contrib/ntp/lib/isc/error.c:81
#4  0x0000000000413209 in isc_net_probeipv4 () at /usr/src/usr.sbin/ntp/libntp/../../../contrib/ntp/lib/isc/unix/net.c:226
#5  0x0000000000412005 in init_lib () at /usr/src/usr.sbin/ntp/libntp/../../../contrib/ntp/libntp/lib_strbuf.c:35
#6  0x0000000000407700 in ntpqmain (argc=3, argv=0x7fffffffebc8) at /usr/src/usr.sbin/ntp/ntpq/../../../contrib/ntp/ntpq/ntpq.c:461
#7  0x0000000000403a61 in _start ()
#8  0x000000080065a000 in ?? ()
#9  0x0000000000000000 in ?? ()
(gdb) bt full
#0  0x000000080115e13c in kill () from /lib/libc.so.7
No symbol table info available.
#1  0x000000080115cc73 in abort () from /lib/libc.so.7
No symbol table info available.
#2  0x0000000000418ce8 in isc_error_fatal (file=<value optimized out>, line=<value optimized out>, format=<value optimized out>)
    at /usr/src/usr.sbin/ntp/libntp/../../../contrib/ntp/lib/isc/error.c:76
        args = Could not find the frame base for "isc_error_fatal".
(gdb)

In fact, this happens even trying to do ntpq -h.

Stopped ntpd:

Oct 27 05:42:48 icarus ntpd[4094]: ntpd exiting on signal 15 (Terminated: 15)
Oct 27 05:42:48 icarus ntpd[4094]: 130.126.24.24 local addr 192.168.1.51 -> <null>
Oct 27 05:42:48 icarus ntpd[4094]: 128.118.25.3 local addr 192.168.1.51 -> <null>
Oct 27 05:42:48 icarus ntpd[4094]: 208.90.144.52 local addr 192.168.1.51 -> <null>
Oct 27 05:42:48 icarus ntpd[4094]: 104.41.150.68 local addr 192.168.1.51 -> <null>

Modified /conf/ME/ntp.conf to include "limited" restrict bit, restarted daemon.  Result:

Oct 27 05:44:02 icarus ntpd[4199]: ntpd 4.2.8p4-a (1): Starting
Oct 27 05:44:02 icarus ntpd[4199]: Command line: /usr/sbin/ntpd -g -c /conf/ME/ntp.conf -4 -p /var/run/ntpd.pid -f /var/db/ntpd.drift
Oct 27 05:44:02 icarus ntpd[4200]: proto: precision = 1.117 usec (-20)
Oct 27 05:44:02 icarus ntpd[4200]: Listen and drop on 0 v4wildcard 0.0.0.0:123
Oct 27 05:44:02 icarus ntpd[4200]: Listen normally on 1 em0 192.168.1.51:123
Oct 27 05:44:02 icarus ntpd[4200]: Listen normally on 2 lo0 127.0.0.1:123
Oct 27 05:44:02 icarus ntpd[4200]: Listening on routing socket on fd #23 for interface updates
Oct 27 05:44:02 icarus ntpd[4200]: giving up resolving host clock.isc.org: servname not supported for ai_socktype (9)
Oct 27 05:44:02 icarus kernel: Oct 27 05:44:02 icarus ntpd[4200]: giving up resolving host clock.isc.org: servname not supported for ai_socktype (9)

Stopped daemon again and restarted it many times.  No segfault, but I refused to believe it wouldn't happen.  The problem is just to bizarre to give up like that.

Modified /etc/rc.conf to remove ntpd_sync_on_start="yes", restarted daemon -- bam, segfault:

Oct 27 05:46:51 icarus ntpd[4416]: ntpd 4.2.8p4-a (1): Starting
Oct 27 05:46:51 icarus ntpd[4416]: Command line: /usr/sbin/ntpd -c /conf/ME/ntp.conf -4 -p /var/run/ntpd.pid -f /var/db/ntpd.drift
Oct 27 05:46:51 icarus ntpd[4417]: proto: precision = 1.396 usec (-19)
Oct 27 05:46:51 icarus ntpd[4417]: Listen and drop on 0 v4wildcard 0.0.0.0:123
Oct 27 05:46:51 icarus ntpd[4417]: Listen normally on 1 em0 192.168.1.51:123
Oct 27 05:46:51 icarus ntpd[4417]: Listen normally on 2 lo0 127.0.0.1:123
Oct 27 05:46:51 icarus ntpd[4417]: Listening on routing socket on fd #23 for interface updates
Oct 27 05:46:51 icarus kernel: pid 4417 (ntpd), uid 0: exited on signal 11 (core dumped)

bt + bt full:

(gdb) bt
#0  0x000000080114d79d in _malloc_postfork () from /lib/libc.so.7
#1  0x000000080114fb3e in _malloc_postfork () from /lib/libc.so.7
#2  0x00000008011523fe in _malloc_prefork () from /lib/libc.so.7
#3  0x0000000801154482 in calloc () from /lib/libc.so.7
#4  0x000000080117aba6 in __res_state () from /lib/libc.so.7
#5  0x000000080118698c in freeaddrinfo () from /lib/libc.so.7
#6  0x00000008011ab61a in nsdispatch () from /lib/libc.so.7
#7  0x0000000801187ffb in getaddrinfo () from /lib/libc.so.7
#8  0x0000000000474f04 in blocking_getaddrinfo (c=0x801a20980, req=0x801898780)
    at /usr/src/usr.sbin/ntp/libntp/../../../contrib/ntp/libntp/ntp_intres.c:352
#9  0x0000000000473a43 in blocking_child_common (c=0x801a20980)
    at /usr/src/usr.sbin/ntp/libntp/../../../contrib/ntp/libntp/ntp_worker.c:288
#10 0x00000000004737e9 in blocking_thread (ThreadArg=<value optimized out>)
    at /usr/src/usr.sbin/ntp/libntp/../../../contrib/ntp/libntp/work_thread.c:663
#11 0x0000000800afee70 in pthread_getprio () from /lib/libthr.so.3
#12 0x0000000000000000 in ?? ()
(gdb) bt full
#0  0x000000080114d79d in _malloc_postfork () from /lib/libc.so.7
No symbol table info available.
#1  0x000000080114fb3e in _malloc_postfork () from /lib/libc.so.7
No symbol table info available.
#2  0x00000008011523fe in _malloc_prefork () from /lib/libc.so.7
No symbol table info available.
#3  0x0000000801154482 in calloc () from /lib/libc.so.7
No symbol table info available.
#4  0x000000080117aba6 in __res_state () from /lib/libc.so.7
No symbol table info available.
#5  0x000000080118698c in freeaddrinfo () from /lib/libc.so.7
No symbol table info available.
#6  0x00000008011ab61a in nsdispatch () from /lib/libc.so.7
No symbol table info available.
#7  0x0000000801187ffb in getaddrinfo () from /lib/libc.so.7
No symbol table info available.
#8  0x0000000000474f04 in blocking_getaddrinfo (c=0x801a20980, req=0x801898780)
    at /usr/src/usr.sbin/ntp/libntp/../../../contrib/ntp/libntp/ntp_intres.c:352
        worker_ctx = (dnsworker_ctx *) 0x801a33400
        resp = (blocking_pipe_header *) 0x801a72400
        gai_resp = (blocking_gai_resp *) 0x801a72428
        node = 0x801898820 "clock.isc.org"
        service = 0x80189882e "ntp"
        ai_res = (struct addrinfo *) 0x0
        ai = <value optimized out>
        canons_octets = <value optimized out>
        this_octets = <value optimized out>
        resp_octets = <value optimized out>
        cp = <value optimized out>
        time_now = <value optimized out>
#9  0x0000000000473a43 in blocking_child_common (c=0x801a20980)
    at /usr/src/usr.sbin/ntp/libntp/../../../contrib/ntp/libntp/ntp_worker.c:288
        req = (blocking_pipe_header *) 0x801898780
#10 0x00000000004737e9 in blocking_thread (ThreadArg=<value optimized out>)
    at /usr/src/usr.sbin/ntp/libntp/../../../contrib/ntp/libntp/work_thread.c:663
No locals.
#11 0x0000000800afee70 in pthread_getprio () from /lib/libthr.so.3
No symbol table info available.
#12 0x0000000000000000 in ?? ()
No symbol table info available.
(gdb)

I've placed the cores, relevant system config files (including kernel config), and anything else I can think of here:

http://jdc.koitsu.org/freebsd/204048/

I'm praying CFLAGS+=-fno-omit-frame-pointer for DTrace isn't causing this somehow.
Comment 9 Gleb Smirnoff freebsd_committer freebsd_triage 2015-10-27 13:01:49 UTC
The ntpq problem is already fixed in stable/9. We will update releng/9.3 as soon as we are done with ntpd problem, which definitely is a more complex one.
Comment 10 Gleb Smirnoff freebsd_committer freebsd_triage 2015-10-27 13:12:03 UTC
Can you please share the ntpd binary, that produced the core?
Comment 11 Jeremy Chadwick 2015-10-27 13:14:36 UTC
No problem -- binary is now there (ntpd.gz).
Comment 12 Gleb Smirnoff freebsd_committer freebsd_triage 2015-10-27 13:17:19 UTC
Alas, my gdb still fails too open your ntpd/ntpd.core :(

Can you please compile and install libc with debug? This is done simply from the lib/libc dir with the same sequence:

cd lib/libc
make clean
make DEBUG_FLAGS=-g
make STRIP= install
Comment 13 Jeremy Chadwick 2015-10-27 20:37:15 UTC
Rather than continue to "destroy" (bad word, using the term loosely) my working FreeBSD system that must remain stable (and run a stable ntpd server), I built a VM under VMware Workstation 11.x to reproduce the problem.

I've been 100% successful, including segfaults.  The system configuration is even more bare-bones than my actual bare-metal box.  Even named isn't running on it (instead resolv.conf uses 192.168.1.51 as a nameserver).  The backtraces look the same as well.

Gleb, if you could provide me with a public SSH key, I'd like to give you root-level SSH access to a VM where you can do whatever you need.  This is a 2-core VM with 2GB RAM w/ dtrace.  If you need console access (VNC), I can get you that.

I can take system/VM snapshots (just hit me up here in the ticket) if needed as well, in case there's something major system-wide you need to do and might need/want to roll back quickly.
Comment 14 Kubilay Kocak freebsd_committer freebsd_triage 2015-10-30 08:26:56 UTC
This appears to be a (technically) duplicate of bug 204013, but given the amount of information here, I'll leave it open. Lets not duplicate effort :)

Treating this as the BASE ntp report, with bug 204013 about PORTS ntp, to the extent that of course they may be (likely?) related.

Do we have reproductions on base versions other than stable/9 ?
Comment 15 Franco Fichtner 2015-10-30 08:28:39 UTC
Yes, here: http://marc.info/?l=freebsd-security&m=144603403915501&w=2
Comment 16 Jeremy Chadwick 2015-10-30 08:48:53 UTC
All current communication regarding this issue (as a whole) have been done in private Email between FreeBSD Project members Xin Li and Gleb Smirnoff, and NTP developers Harlan Stenn, Juergen Perlinger, and Danny Mayer.  I've been CC'd on most (all?) the communications.

There is still no "smoking gun" explanation for the problem as of this writing.  There are several issues that have been brought up by Juergen via a quick code review, but nothing definitive has been found.  One thing that has been proposed was to try using configure --without-threads and then really bang on the problem.

On the FreeBSD side, Xin was able to reproduce the problem on a 10.2 machine (something called "psb").  In other words: this problem affects 9.x and 10.x alike (and that's worth noting because 10.x doesn't include ISC BIND in the base system (it provides a resolver library), instead using unbound, while 9.x does include ISC BIND).

There has been mention of this problem on freebsd-security as well by another Project member: https://lists.freebsd.org/pipermail/freebsd-security/2015-October/008660.html -- don't conflate this with the ntpq/ntpdc SIGABRT problem (separate yet equally painful, but should be rectified now in the base system).

What makes this matter so complicated: FreeBSD's NTP is nearly 6 years old (4.2.4p8 vs. 4.2.8p4).  Harlan disclosed that between those two versions there have been between 1700 and 2100+ changes (not lines of code, but actual changes): http://support.ntp.org/bin/view/Dev/ReleaseTimeline

The ChangeLog between these versions is massive: http://archive.ntp.org/ntp4/ChangeLog-stable
Comment 17 Franco Fichtner 2015-10-30 08:56:46 UTC
FWIW, trouble started for OPNsense when NTP was updated from 4.2.8p3 to 4.2.8p4. This doesn't mean that the bug wasn't there all the time or that it is found in NTP specifically (pthreads bug in FreeBSD?). It could mean the issue was only now unearthed, yet may provide a straw for going forward finding the commit that triggers this behaviour.

Since there are already quite a few people involved (more or less visibly), a big thanks to everyone trying to get to the bottom of this. :)
Comment 18 Gleb Smirnoff freebsd_committer freebsd_triage 2015-10-30 09:03:09 UTC
I also have a suspicion that a bug in threading library is uncovered by NTP. In my core, I have __is_threaded global equal 0, while the process is definitely threaded.
Comment 19 Franco Fichtner 2015-10-30 09:21:15 UTC
NGie pointed to this commit: https://svnweb.freebsd.org/base?view=revision&revision=287591

I did a test build and it solved the issue for me on 10.1.  :)
Comment 20 Gleb Smirnoff freebsd_committer freebsd_triage 2015-10-30 09:25:55 UTC
Did you do a blind reboot before? I mean, the problem can simply go away after a reboot. So rebooting with a patched kernel doesn't prove that the problem is fixed. :(
Comment 21 Franco Fichtner 2015-10-30 09:28:32 UTC
I rebooted to apply the kernel change, then restarted ntpd a few times using the procedure that would reliably crash it before. Stays up fine now. No more ntpd core dumps, no more dmesg output regarding this.
Comment 22 Gleb Smirnoff freebsd_committer freebsd_triage 2015-10-30 09:31:29 UTC
Yes, r287591 looks like the candidate very much. I'm now testing it on 10.2-RELEASE.

Jeremy, can you please meanwhile test it on 9.3?
Comment 23 Jeremy Chadwick 2015-10-30 09:34:31 UTC
(In reply to Gleb Smirnoff from comment #22)
Can/will do -- presently building a 2nd VM/test system for exactly this.
Comment 24 Jeremy Chadwick 2015-10-30 09:42:43 UTC
(In reply to Gleb Smirnoff from comment #22)
Said commit never got MFC'd to stable/9:

https://svnweb.freebsd.org/base/stable/9/sys/vm/vm_map.c

Code on stable/9 presently looks like what was in r287591 on stable/10.  Relevant sections of code:

https://svnweb.freebsd.org/base/stable/9/sys/vm/vm_map.c?revision=266585&view=markup#l3881
https://svnweb.freebsd.org/base/stable/9/sys/vm/vm_map.c?revision=266585&view=markup#l4040
Comment 25 Jeremy Chadwick 2015-10-30 09:46:47 UTC
(In reply to Jeremy Chadwick from comment #24)
> Code on stable/9 presently looks like what was in r287591 on stable/10.

Typo on my part: should have read r287590, not r287591.  Rest of previous comment still stands.
Comment 26 Gleb Smirnoff freebsd_committer freebsd_triage 2015-10-30 10:27:24 UTC
Jeremy,

r287591 is revision from head. Its merge to stable/10 is r287846. It wasn't merged to stable/9 at all.
Comment 27 Jeremy Chadwick 2015-10-30 10:44:17 UTC
(In reply to Gleb Smirnoff from comment #26)
I apologise -- I took c#19 to mean "this commit caused the problem" (I didn't look closely to see it was for CURRENT), not "this commit fixed the issue for me".  Thanks for clarifying!  Sure, I can merge that in and give it a try on stable/9.
Comment 28 Jeremy Chadwick 2015-10-30 11:07:26 UTC
Since backporting the change in r287591 (head) to stable/9, I have been unable to reproduce the problem.  I've done everything possible I can think of that has triggered the problem and haven't seen a segfault yet.

Likewise: the spurious "giving up resolving host XXX: servname not supported for ai_socktype (9)" log messages have also disappeared completely.

I think we may have found our smoking gun.
Comment 29 Gleb Smirnoff freebsd_committer freebsd_triage 2015-10-30 11:10:01 UTC
Yes, I have same results on my 10.2 system. Thanks to kib@ for the fix.

I'm starting to prepare Errata Notice.
Comment 30 Peter Wemm freebsd_committer freebsd_triage 2015-10-30 12:46:09 UTC
The symptoms are entirely consistent with r287591.  It should be harmless to mass backport to stable/8 and stable/7 as well.
Comment 31 Cy Schubert freebsd_committer freebsd_triage 2015-10-30 13:09:36 UTC
(In reply to Gleb Smirnoff from comment #29)
Others have experienced this in 10.2 as well, base and ports. See https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=204013.
Comment 32 Cy Schubert freebsd_committer freebsd_triage 2015-11-03 20:43:25 UTC
*** Bug 204013 has been marked as a duplicate of this bug. ***
Comment 33 Gleb Smirnoff freebsd_committer freebsd_triage 2015-11-04 10:42:54 UTC

*** This bug has been marked as a duplicate of bug 204046 ***