Note: this DOES NOT appear to be related to Bug 213448. This morning I found the following in my daily run output (periodic daily) on my stable/11 box: === ... Security check: (output mailed separately) usage: fetch [-146AadFlMmnPpqRrsUv] [-B bytes] [--bind-address=host] [--ca-cert=file] [--ca-path=dir] [--cert=file] [--crl=file] [-i file] [--key=file] [-N file] [--no-passive] [--no-proxy=list] [--no-sslv3] [--no-tlsv1] [--no-verify-hostname] [--no-verify-peer] [-o file] [--referer=URL] [-S bytes] [-T seconds] [--user-agent=agent-string] [-w seconds] URL ... fetch [-146AadFlMmnPpqRrsUv] [-B bytes] [--bind-address=host] [--ca-cert=file] [--ca-path=dir] [--cert=file] [--crl=file] [-i file] [--key=file] [-N file] [--no-passive] [--no-proxy=list] [--no-sslv3] [--no-tlsv1] [--no-verify-hostname] [--no-verify-peer] [-o file] [--referer=URL] [-S bytes] [-T seconds] [--user-agent=agent-string] [-w seconds] -h host -f file [-c dir] Checking userland and kernel versions: Userland and kernel are in sync. ... === Digging through old cron mails, this problem began November 28th (November 27th had no problem). The system has not been rebuilt (world, mergemaster, etc. -- system is running stable/11 r325712) or rebooted since November 24th. I modified rc.conf to contain ntp_leapfile_fetch_verbose="yes" and then ran service ntpd onefetch to achieve the following output: === root@mambo:~ # service ntpd onefetch ntp_src_leapfile version is 3676924800 ntp_db_leapfile version is 3676924800 not replacing /var/db/ntpd.leap-seconds.list with /etc/ntp/leap-seconds Within ntp leapfile expiry limit, initiating fetch fetching https://www.ietf.org/timezones/data/leap-seconds.list usage: fetch [-146AadFlMmnPpqRrsUv] [-B bytes] [--bind-address=host] [--ca-cert=file] [--ca-path=dir] [--cert=file] [--crl=file] [-i file] [--key=file] [-N file] [--no-passive] [--no-proxy=list] [--no-sslv3] [--no-tlsv1] [--no-verify-hostname] [--no-verify-peer] [-o file] [--referer=URL] [-S bytes] [-T seconds] [--user-agent=agent-string] [-w seconds] URL ... fetch [-146AadFlMmnPpqRrsUv] [-B bytes] [--bind-address=host] [--ca-cert=file] [--ca-path=dir] [--cert=file] [--crl=file] [-i file] [--key=file] [-N file] [--no-passive] [--no-proxy=list] [--no-sslv3] [--no-tlsv1] [--no-verify-hostname] [--no-verify-peer] [-o file] [--referer=URL] [-S bytes] [-T seconds] [--user-agent=agent-string] [-w seconds] -h host -f file [-c dir] {very VERY long stall here} === pstree output showed the following: === | \-+= 71172 root sshd: jdc [priv] (sshd) | \-+- 71174 jdc sshd: jdc@pts/1 (sshd) | \-+= 71175 jdc -bash (bash) | \-+= 71176 root sudo su - | \-+- 71177 root su - | \-+= 71178 root -su (csh) | \-+= 71181 root /bin/sh /etc/rc.d/ntpd onefetch | \-+- 71200 root /bin/sh /etc/rc.d/ntpd onefetch | \--- 71201 root awk $1 == "#$" { print $2 } === I then ^C'd the service ntpd onefetch and re-ran it under truss -af and found the following: === root@mambo:~ # truss -af service ntpd onefetch ... 71342: wait4(-1,{ EXITED,val=0 },0x0,0x0) = 71358 (0x116be) Within ntp leapfile expiry limit, initiating fetch 71342: write(1,"Within ntp leapfile expiry limit"...,51) = 51 (0x33) fetching https://www.ietf.org/timezones/data/leap-seconds.list 71342: write(1,"fetching https://www.ietf.org/ti"...,63) = 63 (0x3f) 71342: stat("/sbin/fetch",0x7fffffffdbe8) ERR#2 'No such file or directory' 71342: stat("/bin/fetch",0x7fffffffdbe8) ERR#2 'No such file or directory' 71342: stat("/usr/sbin/fetch",0x7fffffffdbe8) ERR#2 'No such file or directory' 71342: stat("/usr/bin/fetch",{ mode=-r-xr-xr-x ,inode=81231,size=36400,blksize=32768 }) = 0 (0x0) 71360: <new process> 71342: vfork() = 71360 (0x116c0) 71360: execve("/usr/bin/fetch",[ "fetch", "-mq", "-o", "https://www.ietf.org/timezones/data/leap-seconds.list" ],0x801418790) = 0 (0x0) 71360: mmap(0x0,32768,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_ANON,-1,0x0) = 34366189568 (0x800627000) ... usage: fetch [-146AadFlMmnPpqRrsUv] [-B bytes] [--bind-address=host] [--ca-cert=file] [--ca-path=dir] [--cert=file] [--crl=file] [-i file] [--key=file] [-N file] [--no-passive] [--no-proxy=list] [--no-sslv3] [--no-tlsv1] [--no-verify-hostname] [--no-verify-peer] [-o file] [--referer=URL] [-S bytes] [-T seconds] [--user-agent=agent-string] [-w seconds] URL ... fetch [-146AadFlMmnPpqRrsUv] [-B bytes] [--bind-address=host] [--ca-cert=file] [--ca-path=dir] [--cert=file] [--crl=file] [-i file] [--key=file] [-N file] [--no-passive] [--no-proxy=list] [--no-sslv3] [--no-tlsv1] [--no-verify-hostname] [--no-verify-peer] [-o file] [--referer=URL] [-S bytes] [-T seconds] [--user-agent=agent-string] [-w seconds] -h host -f file [-c dir] 71360: write(2,"usage: fetch [-146AadFlMmnPpqRrs"...,811) = 811 (0x32b) ... === Look closely at the execve() call arguments: there output file argument is missing, which is why fetch is spewing usage syntax: 71360: execve("/usr/bin/fetch",[ "fetch", "-mq", "-o", "https://www.ietf.org/timezones/data/leap-seconds.list" ],0x801418790) = 0 (0x0) This call essentially correlates with ntpd_fetch_leapfile() in /etc/rc.d/ntpd: === ... if ntpd_needfetch_leapfile ; then for url in $ntp_leapfile_sources ; do $verbose fetching $url fetch $ntp_leapfile_fetch_opts -o $ntp_tmp_leapfile $url && break done ... === This means that $ntp_tmp_leapfile is somehow empty/undefined when the fetch is being attempted -- and that's the bug. The long delay is an after-effect of get_ntp_leapfile_ver()'s use of awk. AFAICT it's sitting around waiting for data it never gets because of the failed fetch. ntp-related stuff in my rc.conf and periodic.conf: rc.conf ========= ntpd_enable="yes" ntpd_config="/conf/ME/ntp.conf" ntpd_sync_on_start="yes" ntpd_flags="-4 -p /var/run/ntpd.pid -f /var/db/ntpd.drift" ntp_leapfile_fetch_verbose="yes" periodic.conf =============== daily_ntpd_avoid_congestion="no" ntp-related files and both file and leap second timestamps: root@mambo:~ # ls -lT /etc/ntp/leap-seconds /var/db/ntpd.leap-seconds.list -rw-r--r-- 1 root wheel 10408 Nov 9 08:37:13 2017 /etc/ntp/leap-seconds -rw-r--r-- 1 root wheel 10408 Nov 9 08:37:13 2017 /var/db/ntpd.leap-seconds.list root@mambo:~ # grep '^#\$' /etc/ntp/leap-seconds /var/db/ntpd.leap-seconds.list /etc/ntp/leap-seconds:#$ 3676924800 /var/db/ntpd.leap-seconds.list:#$ 3676924800 /etc/rc.d/ntpd is: # $FreeBSD: stable/11/etc/rc.d/ntpd 325256 2017-11-01 01:03:44Z cy $ /etc/periodic/daily/480.leapfile-ntpd is: # $FreeBSD: stable/11/etc/periodic/daily/480.leapfile-ntpd 325256 2017-11-01 01:03:44Z cy $
Few things to add for clarification: 1. The very long stall IS NOT caused by sleep(1). It's caused by the model of piping used with awk, best I can tell, 2. I've since modified /etc/periodic.conf, removing daily_ntpd_avoid_congestion="no" since it's been deprecated and replaced with the more universal anticongestion_sleeptime (which I now set to 10), and the long delay during this problem still happens. 3. I can't tell what the full impact is here. periodic daily certainly gets delayed as a result, but I only noticed that in my most recent run from last night -- an additional 30 minutes added -- which is very strange. So for now let's not focus on the "long delay", but instead figure out why $ntp_tmp_leapfile is undefined/empty.
I believe this problem may be related to the following commit, but unsure; gut feeling says very likely: http://www.freshbsd.org/commit/freebsd/r325256
Adding committers involved with r325256
It appears it hasn't been mfc'd yet.
(In reply to Cy Schubert from comment #4) Not sure I understand correctly -- r325256, which is an MFC of head r324681 and head r324738, did in fact reach stable/11 (that's how I have it): https://svnweb.freebsd.org/base?view=revision&revision=325256
What we want is an MFC of r326343. I will MFC as soon as my checkouts of stable/10 & 11 complete.
(In reply to Cy Schubert from comment #6) Ahh! Yes, that looks good. :-) Thank you!
MFC committed.