Bug 230187

Summary: Hardware clock freezes until ntpd is killed on 11.2-RELEASE-p0
Product: Base System
Reporter: Danny McGrath <danmcgrath.ca>
Component: kern
Assignee: freebsd-bugs (Nobody) <bugs>
Status: New
Severity: Affects Only Me
CC: cem
Priority: ---
Keywords: regression
Version: 11.2-RELEASE
Hardware: amd64
OS: Any
Attachments:
dmesg output and example ntp related tests (flags: none)

Description Danny McGrath 2018-07-30 11:07:53 UTC
Created attachment 195628
dmesg output and example ntp related tests

I have a Dell PowerEdge 410 that seems to be having some clock problems. I have noticed occasional interruptions in Zabbix charts on previous FreeBSD versions, but since upgrading to 11.2-p0 the problem has gotten to the point where the clock freezes entirely and blocks disk I/O and various processes. Even the "date" command refuses to increment (see attached file).

I can probably stop ntpd entirely, but it would be nice to get some debugging commands to run so that I can try to get to the bottom of why ntpd is stopping the clock on this one server. I used to have a LOCAL(0) stratum 10 local source enabled, but disabling it had no effect.

I'm open to suggestions! Thanks.

=====
FreeBSD hostname.tld 11.2-RELEASE FreeBSD 11.2-RELEASE #0 r335856: Mon Jul  2 13:27:01 CEST 2018     root@hostname.tld:/usr/obj/usr/src/sys/ALTQ  amd64

(the ALTQ kernel runs on a half dozen other machines on 11.2 as well)

#> less ~/kernels/ALTQ 
#
# ALTQ -- Based on Generic kernel configuration file for FreeBSD/amd64
#
include         GENERIC

# ALTQ support -- troubled
options         ALTQ                    # Enable ALTQ
options         ALTQ_CBQ        # Build the ``Class Based Queuing'' discipline
options         ALTQ_RED        # Build the ``Random Early Detection'' extension
options         ALTQ_RIO        # Build ``Random Early Drop'' for input and output
options         ALTQ_HFSC       # Build the ``Hierarchical Packet Scheduler'' discipline
options         ALTQ_PRIQ       # Build the ``Priority Queuing'' discipline
Comment 1 Danny McGrath 2018-08-01 07:43:56 UTC
I have seen some more dmesg entries, and an ntpq -p command timed out trying to talk to ntpd:

+[702905] sonewconn: pcb 0xfffff80055cc8ae0: Listen queue overflow: 193 already in queue awaiting acceptance (10 occurrences)
+[702906] Limiting open port RST response from 4646 to 200 packets/sec
+[702966] sonewconn: pcb 0xfffff80055cc8ae0: Listen queue overflow: 193 already in queue awaiting acceptance (4885 occurrences)
+[703027] sonewconn: pcb 0xfffff80055cc8ae0: Listen queue overflow: 193 already in queue awaiting acceptance (71 occurrences)
+[703087] sonewconn: pcb 0xfffff80055cc8ae0: Listen queue overflow: 193 already in queue awaiting acceptance (68 occurrences)
+[703131] swap_pager: indefinite wait buffer: bufobj: 0, blkno: 131272, size: 12288
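
To see how bursty the overflows are, the sonewconn lines above can be tallied per socket. This is only a minimal sketch, assuming the log lines look exactly like the entries shown here (the leading "+" and the "[seconds]" prefix are specific to this host's output); the name tally_overflows is made up for illustration. Lines without an "(N occurrences)" suffix are counted as one event.

```python
import re
from collections import Counter

# Matches lines such as:
# +[702966] sonewconn: pcb 0xfffff80055cc8ae0: Listen queue overflow:
#   193 already in queue awaiting acceptance (4885 occurrences)
OVERFLOW_RE = re.compile(
    r"sonewconn: pcb (?P<pcb>0x[0-9a-f]+): Listen queue overflow: "
    r"(?P<queued>\d+) already in queue awaiting acceptance"
    r"(?: \((?P<count>\d+) occurrences\))?"
)


def tally_overflows(lines):
    """Return a Counter mapping pcb address -> total overflow occurrences."""
    totals = Counter()
    for line in lines:
        match = OVERFLOW_RE.search(line)
        if match is None:
            continue  # not a listen queue overflow line (e.g. swap_pager)
        # A missing "(N occurrences)" suffix is treated as a single event.
        totals[match.group("pcb")] += int(match.group("count") or 1)
    return totals
```

For example, with the dmesg output saved to a file, tally_overflows(open("dmesg.txt")) gives one total per pcb. On FreeBSD, netstat -Lan lists the listen queues and may help identify which listener is backing up.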

It's probably also worth noting that I am using PF on this host, and there seems to be a pattern of a port scan followed by a timeout. The current pf.conf isn't configured to allow any ntp packets in unless they are in the state table, if that helps anyone test things.

I will see about running some tests later. I have pflogd enabled with a much longer retention period now, so hopefully I can get some captures on this.
Comment 2 Danny McGrath 2018-08-01 13:20:55 UTC
Odd, it's doing it again, and not only is ntp still running, but the time actually jumped backwards briefly:

     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
+XXX.XXX.X.X (nt XXX.XX.XXX.XX    2 u   92  128  377    0.292   -0.098   0.057
*XXXX.XX.XX.XXX  .GPS.            1 u  132  128  377    2.657    0.102   0.054

<pts/4|root|hostname|~ #> date
Wed Aug  1 14:56:05 CEST 2018
<pts/4|root|hostname|~ #> date
Wed Aug  1 14:56:06 CEST 2018
<pts/4|root|hostname|~ #> date
Wed Aug  1 14:56:06 CEST 2018
<pts/4|root|hostname|~ #> date
Wed Aug  1 14:56:05 CEST 2018
<pts/4|root|hostname|~ #> date
Wed Aug  1 14:56:06 CEST 2018

The time keeps jumping back and forth, and ntpd -q is showing the exact same values. It's almost like ntpd's state is stuck and it's keeping the clock frozen.
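
One way to capture this behaviour without retyping date is to sample the wall clock alongside the monotonic clock and flag any disagreement. This is a rough sketch, assuming Python 3 is installed on the host; find_clock_anomalies, sample_clocks and the tolerance parameter are illustrative names, not part of any FreeBSD tool. If the underlying kernel timecounter itself stalls, both clocks may freeze together, in which case only the backward steps in wall time would show up.

```python
import time


def find_clock_anomalies(samples, tolerance=0.5):
    """Inspect (wall, mono) pairs and return (index, kind, delta) tuples.

    kind is "backward" when wall time decreased between two samples, or
    "diverged" when the wall and monotonic deltas differ by more than
    tolerance seconds (for example, wall time stood still).
    """
    anomalies = []
    for i in range(1, len(samples)):
        d_wall = samples[i][0] - samples[i - 1][0]
        d_mono = samples[i][1] - samples[i - 1][1]
        if d_wall < 0:
            anomalies.append((i, "backward", d_wall))
        elif abs(d_wall - d_mono) > tolerance:
            anomalies.append((i, "diverged", d_wall - d_mono))
    return anomalies


def sample_clocks(count, interval=1.0):
    """Collect count (wall, mono) pairs, sleeping interval seconds between."""
    samples = []
    for _ in range(count):
        samples.append((time.time(), time.monotonic()))
        time.sleep(interval)
    return samples
```

Running print(find_clock_anomalies(sample_clocks(60))) while the problem is occurring would list each backward step or stall seen during that minute.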

Sure enough, as soon as I stop ntpd, the time (as reported by date) continues to run. Any thoughts, anyone?
Comment 3 Danny McGrath 2018-08-03 00:20:33 UTC
Update:

I disabled ntpd and tested running a simple ntpdate via cron. It's been almost two days now and the problem not only hasn't happened again, but the clock barely even drifted:

3 Aug 00:00:06 ntpdate[25890]: adjust time server xxx.xxx.x.x offset 0.003696 sec
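
To track drift over a longer run, the offsets can be pulled out of the cron output. This is a small sketch, assuming every result line has the same shape as the one above; parse_offsets is a hypothetical helper name, and the regular expression has only been checked against that single line.

```python
import re

# Matches lines such as:
# 3 Aug 00:00:06 ntpdate[25890]: adjust time server xxx.xxx.x.x offset 0.003696 sec
NTPDATE_RE = re.compile(
    r"ntpdate\[\d+\]: (?P<action>adjust|step) time server (?P<server>\S+) "
    r"offset (?P<offset>[-+]?\d+\.\d+) sec"
)


def parse_offsets(lines):
    """Return a list of (action, server, offset_seconds) per ntpdate result."""
    results = []
    for line in lines:
        match = NTPDATE_RE.search(line)
        if match is None:
            continue  # skip anything that is not an ntpdate result line
        results.append(
            (match.group("action"), match.group("server"),
             float(match.group("offset")))
        )
    return results
```

Offsets that grow between runs, or any "step" entries, would suggest the clock is drifting more than the single sample above indicates.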

Anyway, it already seemed clear that this was ntpd related, but it's nice to eliminate things.
Comment 4 Danny McGrath 2018-08-03 05:02:05 UTC
So much for that idea. It seems another packet scan still triggers it, even while ntpd isn't running.

I'll try to narrow it down. About the only difference between this machine and the other 11.2 hosts is that this one has some of the Linux klds loaded, which I assume comes from poudriere. The pf.conf files are similar as well.

Anyway, sorry for the noise!