Bug 229644 - PINE64+ 12.0-CURRENT #0 r336112 - time management problem
Summary: PINE64+ 12.0-CURRENT #0 r336112 - time management problem
Status: In Progress
Alias: None
Product: Base System
Classification: Unclassified
Component: arm (show other bugs)
Version: CURRENT
Hardware: arm64 Any
: --- Affects Some People
Assignee: freebsd-arm mailing list
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-07-09 15:04 UTC by hlh
Modified: 2018-08-30 13:59 UTC (History)
4 users (show)

See Also:


Attachments
possible workaround for allwinner timer glitches (2.19 KB, patch)
2018-07-09 22:47 UTC, Ian Lepore
no flags Details | Diff
Reworked patch for unstable-a64 eventttimer. (3.14 KB, patch)
2018-07-10 16:50 UTC, Ian Lepore
no flags Details | Diff
Yet another try at fixing the event timer 2^32 glitch. (2.89 KB, patch)
2018-07-11 20:08 UTC, Ian Lepore
no flags Details | Diff
run a bunch of threads, and detect if sleep fails to wakeup (2.44 KB, text/plain)
2018-08-21 14:55 UTC, John-Mark Gurney
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description hlh 2018-07-09 15:04:18 UTC
Since r320599 until now (r336112) the time drift forward on my PINE64+ 2GB.

When i'm running something like `periodic daily` I get

[root@nakiska log]# grep clock_step /var/log/ntp.log
18 Jun 12:15:54 ntpd[612]: 0.0.0.0 061c 0c clock_step -178.960234 s
18 Jun 18:30:01 ntpd[612]: 0.0.0.0 061c 0c clock_step -178.955695 s
18 Jun 20:13:55 ntpd[612]: 0.0.0.0 061c 0c clock_step -536.871265 s
18 Jun 21:12:26 ntpd[612]: 0.0.0.0 061c 0c clock_step -715.825332 s
18 Jun 21:41:26 ntpd[612]: 0.0.0.0 061c 0c clock_step -178.958017 s
18 Jun 23:25:20 ntpd[612]: 0.0.0.0 061c 0c clock_step -178.960867 s
19 Jun 01:59:07 ntpd[612]: 0.0.0.0 061c 0c clock_step -178.962636 s
19 Jun 04:34:33 ntpd[612]: 0.0.0.0 061c 0c clock_step -894.789746 s
19 Jun 06:57:26 ntpd[612]: 0.0.0.0 061c 0c clock_step -357.918200 s
19 Jun 07:28:08 ntpd[612]: 0.0.0.0 c61c 0c clock_step -178.958874 s
19 Jun 08:38:21 ntpd[612]: 0.0.0.0 061c 0c clock_step -178.954737 s
19 Jun 11:50:59 ntpd[612]: 0.0.0.0 061c 0c clock_step -894.783737 s
19 Jun 12:18:09 ntpd[612]: 0.0.0.0 061c 0c clock_step -178.956662 s
19 Jun 12:52:37 ntpd[612]: 0.0.0.0 c61c 0c clock_step -178.958208 s
19 Jun 13:56:11 ntpd[612]: 0.0.0.0 061c 0c clock_step -178.955571 s
19 Jun 16:28:27 ntpd[612]: 0.0.0.0 c61c 0c clock_step -538.144980 s
19 Jun 19:01:44 ntpd[1671]: 0.0.0.0 c61c 0c clock_step -178.961168 s
21 Jun 14:41:37 ntpd[611]: 0.0.0.0 c61c 0c clock_step -7177.962418 s
21 Jun 16:38:42 ntpd[1675]: 0.0.0.0 061c 0c clock_step -178.956129 s
21 Jun 18:48:15 ntpd[614]: 0.0.0.0 c61c 0c clock_step -1.435511 s
21 Jun 19:48:19 ntpd[614]: 0.0.0.0 061c 0c clock_step -178.955568 s
24 Jun 17:27:39 ntpd[608]: 0.0.0.0 c61c 0c clock_step -7178.592515 s
24 Jun 16:30:34 ntpd[4620]: 0.0.0.0 c61c 0c clock_step -178.958430 s
24 Jun 17:24:06 ntpd[611]: 0.0.0.0 c61c 0c clock_step -1.564513 s
 9 Jul 12:07:35 ntpd[609]: 0.0.0.0 c61c 0c clock_step +26.217479 s
 9 Jul 14:11:01 ntpd[609]: 0.0.0.0 061c 0c clock_step -178.957072 s
 9 Jul 15:08:30 ntpd[609]: 0.0.0.0 061c 0c clock_step -178.954202 s
 9 Jul 15:38:16 ntpd[612]: 0.0.0.0 c61c 0c clock_step -0.880727 s
 9 Jul 15:57:34 ntpd[612]: 0.0.0.0 061c 0c clock_step -178.960327 s
 9 Jul 16:52:54 ntpd[612]: 0.0.0.0 061c 0c clock_step -894.784369 s
 9 Jul 16:52:07 ntpd[612]: 0.0.0.0 c61c 0c clock_step -178.957528 s
[root@nakiska log]#

As you can see the value of clock_step is ~ (n * -178.95) with n = 1, 2, 3, 4, 5
Comment 1 Mark Millard 2018-07-09 15:49:54 UTC
My E-mail history shows reports on freebsd-arm by hlh going back to at
least 2017-Oct-26. (Even mentioning the about -178.95 figure.) The
exchange included ntpq -p output and ntpq -c rl output, for example.

There are older reports from shigeru@os-hackers.jp going back to at
least 2017-Sep-12. jamlloc_areana.c was failing the assertion:
"nstime_compare(&decay->epoch, &time) <= 0" during buildworld. hlh's
early reports that I found were in response to this as I remember.

So at least 2 folks have seen issues over the last 10 months or so,
across a wide variety of versions overall.

There is some supporting material in the list archives.
Comment 2 Emmanuel Vadot freebsd_committer 2018-07-09 19:11:56 UTC
This is somewhat known that the generic timer in the Allwinner A64 is broken (it jumps forward and backward on some condition).
r335156 is an attemps to correct it based on finding by some linux guys. It's apparently not sufficient and I don't have another solution for now.
Comment 3 Ian Lepore freebsd_committer 2018-07-09 22:47:32 UTC
Created attachment 195014 [details]
possible workaround for allwinner timer glitches
Comment 4 Ian Lepore freebsd_committer 2018-07-09 22:50:56 UTC
I stumbled across the fact that apparently allwinner has a workaround in the linux BSP for their timer hardware that we don't have in freebsd.  I coded up a quick and dirty version of the same (I hope! I have no way to test) for us, can you please try the patch I attached and see if it fixes the problem?  Be sure you have some way to get back to a working kernel in case I botched this so bad the system panics or hangs (sorry, I've got no arm systems with the generic_timer hardware to test on).

See https://github.com/Allwinner-Homlet/H6-BSP4.9-linux/blob/e634a960316d/drivers/clocksource/arm_arch_timer.c#L272-L327
Comment 5 Ian Lepore freebsd_committer 2018-07-09 22:52:42 UTC
BTW, I should also add that Nathan Whitehorn noticed that 178.95 seconds corresponds to 2^32 ticks of a 24mhz clock.
Comment 6 Emmanuel Vadot freebsd_committer 2018-07-10 10:12:53 UTC
Thanks Ian,

I'm currently testing without the patch with a cpuburn program seeing how quirk I can stop ntpd issue and will then test with your patch.
Comment 7 Emmanuel Vadot freebsd_committer 2018-07-10 11:04:03 UTC
First test done
Running without the compatible and current HEAD.
Two cpuburn program running (from https://raw.githubusercontent.com/ssvb/cpuburn-arm/master/cpuburn-a53.S) and ntpd running.
A simple shell script that log the date every 10 seconds give this : https://people.freebsd.org/~manu/time_unpatched.txt

Really easy to see the time jump here.
I'm now running with the compatible and your patch to see if there is any difference.
Comment 8 hlh 2018-07-10 11:08:30 UTC
(In reply to Ian Lepore from comment #4)

I try your patch, but when running `periodic daily` I observe the same drift :-(

10 Jul 12:42:24 ntpd[786]: Listen normally on 4 lo0 [fe80::1%2]:123
10 Jul 12:42:24 ntpd[786]: Listen normally on 5 lo0 127.0.0.1:123
10 Jul 12:42:24 ntpd[786]: Listening on routing socket on fd #26 for interface updates
10 Jul 12:42:24 ntpd[786]: 0.0.0.0 c01d 0d kern kernel time sync enabled
10 Jul 12:42:24 ntpd[786]: 0.0.0.0 c012 02 freq_set kernel 30.240 PPM
10 Jul 12:42:24 ntpd[786]: 0.0.0.0 c016 06 restart
10 Jul 12:42:25 ntpd[786]: Soliciting pool server 109.88.7.29
10 Jul 12:42:26 ntpd[786]: Soliciting pool server 91.121.216.238
10 Jul 12:42:32 ntpd[786]: 0.0.0.0 c615 05 clock_sync
10 Jul 13:03:18 ntpd[786]: 0.0.0.0 0618 08 no_sys_peer
10 Jul 13:04:36 ntpd[786]: 0.0.0.0 0613 03 spike_detect -178.957519 s
10 Jul 13:05:45 ntpd[786]: 0.0.0.0 061c 0c clock_step -178.957190 s
10 Jul 13:02:46 ntpd[786]: 0.0.0.0 0615 05 clock_sync
10 Jul 13:02:49 ntpd[786]: 0.0.0.0 c618 08 no_sys_peer
Comment 9 Emmanuel Vadot freebsd_committer 2018-07-10 11:12:32 UTC
(In reply to hlh from comment #8)

Do you have the dtb propery that mark the timer as unstable ?
just run :

sysctl -b hw.fdt.dtb | dtc -I dtb -O dts | grep sun50i-a64-unstable-timer to confirm or not.
Comment 10 hlh 2018-07-10 11:56:02 UTC
(In reply to Emmanuel Vadot from comment #9)

I don't have sun50i-a64-unstable-timer in the dts.

How can I change this?
Comment 11 Emmanuel Vadot freebsd_committer 2018-07-10 13:42:43 UTC
(In reply to hlh from comment #10)

Easiest way now is to recompile u-boot after patching the DTS.

Ian, I still don't see a drift after ~3 hours.
I'll leave this running for the night but I guess you have the right fix.
Comment 12 Ian Lepore freebsd_committer 2018-07-10 16:50:00 UTC
Created attachment 195033 [details]
Reworked patch for unstable-a64 eventttimer.

Here is a reworked patch that seems simpler to me than what allwinner did in the linux BSP (and besides, I hate just copying linux driver code).  This takes a different approach... because the glitch is caused by setting tval during a counter register rollover, it just notices the rollover and sets tval again.  The loop should never actually iterate more than 1 extra time.
Comment 13 hlh 2018-07-10 17:39:41 UTC
(In reply to Emmanuel Vadot from comment #11)

After patching u-boot-pine64, installing and rebooting, no more time drift :-)

I will try the new patch

Thanks to all
Comment 14 hlh 2018-07-11 05:32:05 UTC
(In reply to Ian Lepore from comment #12)

The new patch is working fine for me!
Comment 15 hlh 2018-07-11 06:37:18 UTC
(In reply to hlh from comment #14)

Running more `periodic daily` lead to a time drift. It seems that the new patch is better that nothing but don't solve the problem completely.
Comment 16 Mark Millard 2018-07-11 15:54:34 UTC
(In reply to hlh from comment #15)

Is the drift still frequently involving approximate multiples of
-178.95 sec?
Comment 17 hlh 2018-07-11 18:46:28 UTC
(In reply to Mark Millard from comment #16)
For now, rebuilding the world, the drift is always ~ -178.95 never more (for the moment at least after 12h).
Comment 18 Ian Lepore freebsd_committer 2018-07-11 20:08:52 UTC
Created attachment 195062 [details]
Yet another try at fixing the event timer 2^32 glitch.

Here is another attempt to workaround this problem with code more straightforward than the allwinner linux workaround.  Their approach is to set tval then read back cval to double-check the hardware's arithmentic and loop to try again if the hardware got it wrong.  It occurred to me that if we're going to do the math ourselves, we should just set the compare register ourselves and call it done. :)
Comment 19 Emmanuel Vadot freebsd_committer 2018-07-12 10:15:47 UTC
(In reply to Ian Lepore from comment #18)

Running for 1 hour on mine pine, no problem of drifts. (with 4 cpuburn running just in case load has an impact).
Comment 20 hlh 2018-07-12 13:10:57 UTC
(In reply to Ian Lepore from comment #18)

Sorry to come back with bad news. With the 3th -version of the patch, during
`periodic daily` I get:

12 Jul 14:58:55 ntpd[674]: 0.0.0.0 0613 03 spike_detect -357.912634 s
12 Jul 15:05:46 ntpd[674]: 0.0.0.0 061c 0c clock_step -357.912845 s
12 Jul 14:59:48 ntpd[674]: 0.0.0.0 0615 05 clock_sync

Maybe something of interest:

I am running with a root file system on zfs on a USB drive:

Root mount waiting for: usbus0
ugen0.2: <VIA Labs, Inc. USB2.0 Hub> at usbus0
uhub2 on uhub0
uhub2: <VIA Labs, Inc. USB2.0 Hub, class 9/0, rev 2.10/b.e0, addr 2> on usbus0
uhub2: 4 ports with 4 removable, self powered
Root mount waiting for: usbus0
Root mount waiting for: usbus0
ugen0.3: <Western Digital Elements 1042> at usbus0
umass0 on uhub2
umass0: <MSC Bulk-Only Transport> on usbus0
da0 at umass-sim0 bus 0 scbus0 target 0 lun 0
da0: <WD Elements 1042 1006> Fixed Direct Access SPC-4 SCSI device
da0: Serial Number 575841314535314B54313238
da0: 40.000MB/s transfers
da0: 953867MB (1953519616 512 byte sectors)
da0: quirks=0x2<NO_6_BYTE>
Comment 21 Emmanuel Vadot freebsd_committer 2018-07-12 13:39:11 UTC
(In reply to hlh from comment #20)

I did test with 4 cpuburn + periodic daily.
No drift for me.
Comment 22 hlh 2018-07-12 16:33:41 UTC
(In reply to Emmanuel Vadot from comment #21)

Maybe an interference with the USB stack?
Comment 23 hlh 2018-07-13 10:39:55 UTC
I did more tests (svn update on /usr/ports, periodic daily with zfs filesystems
 with a total of ~20GB)

The drifts occur rarely but is always there:

13 Jul 08:31:25 ntpd[674]: Soliciting pool server 51.255.138.215
13 Jul 08:31:26 ntpd[674]: Soliciting pool server 194.78.244.172
13 Jul 08:41:14 ntpd[674]: Soliciting pool server 193.190.147.153
13 Jul 11:03:52 ntpd[674]: 0.0.0.0 0628 08 no_sys_peer
13 Jul 11:05:14 ntpd[674]: error resolving pool 0.freebsd.pool.ntp.org: hostname nor servname provided, or not known (8)
13 Jul 11:11:26 ntpd[674]: 0.0.0.0 0613 03 spike_detect -178.954309 s
13 Jul 11:15:01 ntpd[674]: 0.0.0.0 061c 0c clock_step -178.954045 s
13 Jul 11:12:02 ntpd[674]: 0.0.0.0 0615 05 clock_sync
13 Jul 11:12:03 ntpd[674]: 0.0.0.0 c618 08 no_sys_peer
13 Jul 11:12:55 ntpd[674]: Soliciting pool server 193.104.37.238
13 Jul 11:13:15 ntpd[674]: 51.255.138.215 local addr 192.168.22.4 -> <null>
13 Jul 11:19:31 ntpd[674]: Soliciting pool server 195.200.224.66
13 Jul 11:22:09 ntpd[674]: 109.88.7.29 local addr 192.168.22.4 -> <null>
13 Jul 11:47:35 ntpd[674]: Soliciting pool server 94.143.184.140
13 Jul 11:48:03 ntpd[674]: error resolving pool 0.freebsd.pool.ntp.org: hostname nor servname provided, or not known (8)
13 Jul 11:48:29 ntpd[674]: 0.0.0.0 0613 03 spike_detect -178.958891 s
13 Jul 11:48:41 ntpd[674]: Soliciting pool server 109.88.7.29
13 Jul 11:55:12 ntpd[674]: Soliciting pool server 193.190.253.212
13 Jul 11:56:23 ntpd[674]: 0.0.0.0 061c 0c clock_step -357.913259 s
13 Jul 11:50:25 ntpd[674]: 0.0.0.0 0615 05 clock_sync
13 Jul 11:50:26 ntpd[674]: 0.0.0.0 c618 08 no_sys_peer
13 Jul 11:57:14 ntpd[674]: 193.104.37.238 local addr 192.168.22.4 -> <null>
Comment 24 hlh 2018-07-15 14:36:48 UTC
(In reply to Ian Lepore from comment #4)

I retry the first version of the patch under the load of a ports upgrade:

portupgrade --batch -ua

and after 2 hours I get:

15 Jul 14:02:19 ntpd[676]: Soliciting pool server 194.78.244.172
15 Jul 14:02:21 ntpd[676]: 0.0.0.0 061c 0c clock_step -178.958106 s
15 Jul 13:59:22 ntpd[676]: 0.0.0.0 0615 05 clock_sync
15 Jul 13:59:23 ntpd[676]: 0.0.0.0 c618 08 no_sys_peer

which is way better than without the patch but with at least one drift
Comment 25 John-Mark Gurney freebsd_committer 2018-08-17 17:16:57 UTC
I can confirm that the attached patch does NOT fix the sleep problem for me on my PINE A64-LTS:
FreeBSD generic 12.0-ALPHA1 FreeBSD 12.0-ALPHA1 #4 r337599M: Mon Aug 13 13:51:22 UTC 2018     freebsd@generic:/usr/src/sys/arm64/compile/GENERIC  arm64

The issue that I see is that sleeping threads will not wake up.  Both a sx_sleep thread, and a simple userland sleep will not wake up, even days after their sleep was suppose to have expired.

I am running ntp, but I don't see any time skips though.  The only messages ntp gives me a messages about the leapseconds file being out of date.

So, I'm not sure that this is the same bug.
Comment 26 Mark Millard 2018-08-18 19:23:25 UTC
I normally do not see this from -r337400 (no
patches for the issue) that I'm running on a Pine64+ 2GB.
I do full buildworld buildkernel and poudreire-devel's
that build things like devel/llvm60 (around 14.5 hrs) and
lang/gcc8 (around 12.5hr) and whole runs that span well
over a day.

But see:

https://lists.freebsd.org/pipermail/freebsd-arm/2018-August/018680.html

for an example that I got during poudreiere-devel building
lang/gcc8. (There was something unusual that was involved
compared to my normal use and the timing of the problem
happens to look to be tied to that. But why it would make
any difference is not clear to me.)

Mostly I'm noting the "me too" and the variability in
how much it happens for different folks. At this point
another test with poudriere-devel is over 21 hours of
monitoring and still no no_sys_peer, ..., spike_detect,
clock_step, clock_sync, sequences.

The freebsd-arm message has material about my context that
shows some differences vs. hlh. For example: UFS for me
vs. ZFS for hlh and I build with -mcpu=cortex-a53 explicitly.
I also have a non-debug kernel (but with symbols). The
current test is using a powered USB 3.0 capable hub and
a 240 GiByte USB 3.0 400MiByte/s+ Read& Write capable SSD
stick, not the e.MMC via USB from the freebsd-arm message.
(Yes: on a USB 2.0 limited system.) The rest of the system
is the same.

hlh had requested that I do the monitoring during my
poudriere-devel activity for comparison/contrast with his
context and results. FYI for setting up the log:

I was told by hlh to add

logconfig =all
logfile /var/log/ntp.log

at the end of /etc/ntp.conf and then make sure ntpd was
restarted (such as via service ntpd restart or a
reboot). I did the restart since poudriere-devel was
running and would keep going for a long time. (True of
both tests: poudriere-devel was started first in the
second test as well.)

Not that I expect it is relevant, but I've been using
sysctl vm.pageout_oom_seq=120 to avoid the system killing
processes because of having low free RAM for significant
periods during the likes of devel/llvm60 being built (or
similar activity during buildworld).
Comment 27 Emmanuel Vadot freebsd_committer 2018-08-20 18:20:23 UTC
I've made a patch that uses the timer block in the A64 as the timecounter.
https://github.com/evadot/freebsd/tree/aw_timer

You will need a dtb overlay : https://people.freebsd.org/~manu/sun50i-a64-timer.dtbo

For overlays to apply you will need a dtb loaded by u-boot, not the one embedded.
https://people.freebsd.org/~manu/sun50i-a64-pine64-plus.dtb

Put this file in /boot/msdos/dtb/allwinner (dtb/allwinner on the fat partition).
Comment 28 hlh 2018-08-21 06:55:00 UTC
(In reply to Emmanuel Vadot from comment #27)

I update to r338128

I add your 3 updates:
  sys/arm/allwinner/a10_timer.c
  sys/arm64/conf/GENERIC
  sys/conf/files.arm64

I added sun50i-a64-pine64-plus.dtb
on the fat partition (under dtb/allwinner) as requested

Where do I put the dtb overlay?

Does I need to build a new u-boot with your update r477232 on
sysutils/u-boot-master?
Comment 29 Emmanuel Vadot freebsd_committer 2018-08-21 06:57:53 UTC
Ah yes sorry, the overlay should go under /boot/dtb/overlays in the root partition (UFS or ZFS, same as the kernel) and you need :
fdt_overlays=sun50i-a64-timer
in loader.conf
Comment 30 Emmanuel Vadot freebsd_committer 2018-08-21 06:59:57 UTC
If the dtbo applied correctly you should have :
# sysctl kern.timecounter.choice
kern.timecounter.choice: a10_timer timer0(1000) ARM MPCore Timecounter(1000) dummy(-1000000)
Comment 31 hlh 2018-08-21 13:35:18 UTC
(In reply to Emmanuel Vadot from comment #30)

# sysctl kern.timecounter
kern.timecounter.fast_gettime: 1
kern.timecounter.tick: 1
kern.timecounter.choice: a10_timer timer0(1000) ARM MPCore Timecounter(1000) dummy(-1000000)
kern.timecounter.hardware: a10_timer timer0
kern.timecounter.alloweddeviation: 5
kern.timecounter.stepwarnings: 0
kern.timecounter.tc.a10_timer timer0.quality: 1000
kern.timecounter.tc.a10_timer timer0.frequency: 24000000
kern.timecounter.tc.a10_timer timer0.counter: 2934755145
kern.timecounter.tc.a10_timer timer0.mask: 4294967295
kern.timecounter.tc.ARM MPCore Timecounter.quality: 1000
kern.timecounter.tc.ARM MPCore Timecounter.frequency: 24000000
kern.timecounter.tc.ARM MPCore Timecounter.counter: 3249730234
kern.timecounter.tc.ARM MPCore Timecounter.mask: 4294967295

time seems stable under load.

I log `ntpq -p` every 600s and after 3h ntp seems to "converge" and be happy.

for examble:

*94.143.184.140  .GPS.            1 u  195  256  377   19.049  -21.995   4.771
*94.143.184.140  .GPS.            1 u    3  256  377   19.049  -21.995   1.740
*94.143.184.140  .GPS.            1 u   70  256  377   19.049  -21.995   1.920
*94.143.184.140  .GPS.            1 u  146  256  377   19.049  -21.995   3.716
*94.143.184.140  .GPS.            1 u   85  128  373   19.974  -24.259   3.275
*94.143.184.140  .GPS.            1 u   19   64  177   20.107  -20.573   5.554
*94.143.184.140  .GPS.            1 u   24   64  377   19.301  -20.383   4.210
*94.143.184.140  .GPS.            1 u   35   64  377   19.159  -12.262   4.394
*94.143.184.140  .GPS.            1 u   46   64  377   19.477   -7.311   1.684
*94.143.184.140  .GPS.            1 u   61   64  377   19.254   -5.215  19.249
*94.143.184.140  .GPS.            1 u    9   64  377   19.456   -4.651   4.637
*94.143.184.140  .GPS.            1 u   17   64  377   19.260   -3.707  16.789
*94.143.184.140  .GPS.            1 u   98  128  377   19.240   -3.643   3.422
*94.143.184.140  .GPS.            1 u   39  128  377   19.240   -3.643   3.304
*94.143.184.140  .GPS.            1 u  113  128  377   20.073   -3.548   1.561
*94.143.184.140  .GPS.            1 u   62  128  377   19.631   -4.301   2.203
*94.143.184.140  .GPS.            1 u  139  256  377   19.626   -4.423   2.016
*94.143.184.140  .GPS.            1 u  220  256  375   19.626   -4.423   2.021
*94.143.184.140  .GPS.            1 u   33  256  357   19.626   -4.423   0.364
*94.143.184.140  .GPS.            1 u  102  256  277   19.626   -4.423   2.841

# ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 0.freebsd.pool. .POOL.          16 p    -   64    0    0.000    0.000   0.002
-193.104.37.238  193.190.230.66   2 u  214  256  377   20.822   -4.713   0.654
+ntp.gillam.be   172.16.200.253   2 u   71  256  337   24.189   -1.143   0.657
+ntp.vives.be    193.190.230.66   2 u  246  256  377   17.186   -1.391   0.593
-be1.justaguy.be 216.218.254.202  2 u  351  512  377   18.322   -4.369   0.639
*94.143.184.140  .GPS.            1 u   91  256  177   19.748   -4.821   3.167

Thanks a lot!

I rebuild world and I will test if 
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=228802
is also solved.
Comment 32 Emmanuel Vadot freebsd_committer 2018-08-21 13:38:05 UTC
That's good news that it worked for you since you seems able to trigger the panic quikly each time :)
I'm running jmg@ python script on my pine64 since yesterday 10pm (UTC+2) and it doesn't have any dead thread for now.
Comment 33 hlh 2018-08-21 13:44:35 UTC
(In reply to Emmanuel Vadot from comment #32)

I notice only one thing: On all machines `ntpq -p` list servers with offset >0 
_and_ <0.

here all ofsets are <0

eg on another pine64 at r320599

# ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 0.freebsd.pool. .POOL.          16 p    -   64    0    0.000    0.000   0.001
+hades.boxed-it. 193.190.230.66   2 u  125 1024  377   13.171    1.805   1.025
+ntp.rack66.net  193.67.79.202    2 u   73 1024  377   15.537   -0.506   1.670
-be1.justaguy.be 216.218.254.202  2 u  613 1024  377   15.246   -1.217   0.198
*ntp.gillam.be   172.16.200.253   2 u  675 1024  377   22.069    2.159   0.815
-193.104.37.238  193.190.230.66   2 u  629 1024  377   18.457   -1.633   0.695
Comment 34 John-Mark Gurney freebsd_committer 2018-08-21 14:55:50 UTC
Created attachment 196420 [details]
run a bunch of threads, and detect if sleep fails to wakeup

Here is a program I wrote to help detect if a sleep will not wake up.  This is related to bug 222126, and I'll link to it there too.


https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=222126
Comment 35 hlh 2018-08-21 15:14:01 UTC
(In reply to John-Mark Gurney from comment #34)

while building world I'm running your script
Comment 36 hlh 2018-08-22 12:26:47 UTC
(In reply to John-Mark Gurney from comment #34)

After 23 hours, no problem crop up :-)
Comment 37 hlh 2018-08-22 15:16:15 UTC
(In reply to Emmanuel Vadot from comment #29)

After building and installing world (r338128), when I reboot /boot/loader.efi  (on mmcsd) shown me a graphical boot menu. I can only strike the return key to boot but the fdt overlay is not loaded and so no a10_timer.

If I replace /boot/loader.efi by a older version (thanks zfs snapshot) no graphical menu but the a10_timer is used.
Comment 38 John-Mark Gurney freebsd_committer 2018-08-22 15:50:19 UTC
The loader issue is related to the lua/4th switch over.  See a thread on the mailing list.
Comment 39 hlh 2018-08-23 16:59:57 UTC
(In reply to Emmanuel Vadot from comment #32)

I just notice something in ntp.log since using a10_timer:

23 Aug 10:48:49 ntpd[696]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
23 Aug 10:48:49 ntpd[696]: 0.0.0.0 c612 02 freq_set kernel 35.152 PPM

everytime ntp try a freq_set
Comment 40 commit-hook freebsd_committer 2018-08-23 18:47:21 UTC
A commit references this bug:

Author: manu
Date: Thu Aug 23 18:46:06 UTC 2018
New revision: 338272
URL: https://svnweb.freebsd.org/changeset/base/338272

Log:
  a10_timer: Update the driver so we can use it on other SoC

  a10_timer is currently use in UP allwinner SoC (A10 and A13).
  Those don't have the generic arm timer.
  The arm generic timecounter is broken in the A64 SoC, some attempts have
  been made to fix the glitch but users still reported some minor ones.
  Since the A64 (and all Allwinner SoC) still have this timer controller, rework
  the driver so we can use it in any SoC.
  Since it doesn't have the 64 bits counter on all SoC, use one of the
  generic 32 bits counter as the timecounter source.

  PR:	229644

Changes:
  head/sys/arm/allwinner/a10_timer.c
  head/sys/arm64/conf/GENERIC
  head/sys/conf/files.arm64
Comment 41 Mark Millard 2018-08-29 05:28:48 UTC
(In reply to hlh from comment #39)

I finally updated from -r337400 to -r338341 based and
now I also get "TIME_ERROR: 0x41: Clock Unsynchronized"
notices, such as in the sequence below. But I've no
clue if they are important.

31 Dec 16:00:42 ntpd[665]: Listen and drop on 0 v6wildcard [::]:123
31 Dec 16:00:42 ntpd[665]: Listen and drop on 1 v4wildcard 0.0.0.0:123
31 Dec 16:00:42 ntpd[665]: Listen normally on 2 awg0 192.168.0.100:123
31 Dec 16:00:42 ntpd[665]: Listen normally on 3 lo0 [::1]:123
31 Dec 16:00:42 ntpd[665]: Listen normally on 4 lo0 [fe80::1%2]:123
31 Dec 16:00:42 ntpd[665]: Listen normally on 5 lo0 127.0.0.1:123
31 Dec 16:00:42 ntpd[665]: Listening on routing socket on fd #26 for interface updates
31 Dec 16:00:42 ntpd[665]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
31 Dec 16:00:42 ntpd[665]: 0.0.0.0 c01d 0d kern kernel time sync enabled
31 Dec 16:00:42 ntpd[665]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
31 Dec 16:00:42 ntpd[665]: 0.0.0.0 c012 02 freq_set kernel 0.000 PPM
31 Dec 16:00:42 ntpd[665]: 0.0.0.0 c011 01 freq_not_set
31 Dec 16:00:42 ntpd[665]: 0.0.0.0 c016 06 restart
31 Dec 16:00:43 ntpd[665]: Soliciting pool server 199.223.248.101
31 Dec 16:00:44 ntpd[665]: Soliciting pool server 45.76.244.193
31 Dec 16:00:45 ntpd[665]: Soliciting pool server 69.89.207.99
31 Dec 16:00:46 ntpd[665]: Soliciting pool server 72.30.35.89
31 Dec 16:00:53 ntpd[665]: 0.0.0.0 c61c 0c clock_step +273200393.748699 s
28 Aug 18:00:47 ntpd[665]: 0.0.0.0 c614 04 freq_mode
28 Aug 18:00:48 ntpd[665]: 0.0.0.0 c618 08 no_sys_peer
28 Aug 18:04:06 ntpd[665]: Soliciting pool server 204.2.134.164
28 Aug 18:06:27 ntpd[665]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
28 Aug 18:06:27 ntpd[665]: 0.0.0.0 c612 02 freq_set kernel 41.951 PPM
28 Aug 18:06:27 ntpd[665]: 0.0.0.0 c615 05 clock_sync
28 Aug 19:00:36 ntpd[665]: ntpd exiting on signal 15 (Terminated)
28 Aug 19:00:36 ntpd[665]: 199.223.248.101 local addr 192.168.0.100 -> <null>
28 Aug 19:00:36 ntpd[665]: 45.76.244.193 local addr 192.168.0.100 -> <null>
28 Aug 19:00:36 ntpd[665]: 69.89.207.99 local addr 192.168.0.100 -> <null>
28 Aug 19:00:36 ntpd[665]: 72.30.35.89 local addr 192.168.0.100 -> <null>
28 Aug 19:00:36 ntpd[665]: 204.2.134.164 local addr 192.168.0.100 -> <null>
28 Aug 19:00:36 ntpd[665]: 0.0.0.0 061d 0d kern kernel time sync disabled
28 Aug 19:01:27 ntpd[695]: Listen and drop on 0 v6wildcard [::]:123
28 Aug 19:01:27 ntpd[695]: Listen and drop on 1 v4wildcard 0.0.0.0:123
28 Aug 19:01:27 ntpd[695]: Listen normally on 2 awg0 192.168.0.100:123
28 Aug 19:01:27 ntpd[695]: Listen normally on 3 lo0 [::1]:123
28 Aug 19:01:27 ntpd[695]: Listen normally on 4 lo0 [fe80::1%2]:123
28 Aug 19:01:27 ntpd[695]: Listen normally on 5 lo0 127.0.0.1:123
28 Aug 19:01:27 ntpd[695]: Listening on routing socket on fd #26 for interface updates
28 Aug 19:01:28 ntpd[695]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
28 Aug 19:01:28 ntpd[695]: 0.0.0.0 c01d 0d kern kernel time sync enabled
28 Aug 19:01:28 ntpd[695]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
28 Aug 19:01:28 ntpd[695]: 0.0.0.0 c012 02 freq_set kernel 0.000 PPM
28 Aug 19:01:28 ntpd[695]: 0.0.0.0 c011 01 freq_not_set
28 Aug 19:01:28 ntpd[695]: 0.0.0.0 c016 06 restart
28 Aug 19:01:29 ntpd[695]: Soliciting pool server 108.61.56.35
28 Aug 19:01:30 ntpd[695]: Soliciting pool server 173.0.156.209
28 Aug 19:01:31 ntpd[695]: Soliciting pool server 207.192.69.118
28 Aug 19:01:31 ntpd[695]: Soliciting pool server 35.171.237.77
28 Aug 19:01:38 ntpd[695]: 0.0.0.0 c61c 0c clock_step -1.002265 s
28 Aug 19:01:37 ntpd[695]: 0.0.0.0 c614 04 freq_mode
28 Aug 19:01:37 ntpd[695]: receive: Unexpected origin timestamp 0xdf307d02.05aafc0e does not match aorg 0000000000.00000000 from server@108.61.56.35 xmt 0xdf307d01.0e9a0092
28 Aug 19:01:37 ntpd[695]: receive: Unexpected origin timestamp 0xdf307d02.05a3640a does not match aorg 0000000000.00000000 from server@207.192.69.118 xmt 0xdf307d01.1094063d
28 Aug 19:01:37 ntpd[695]: 0.0.0.0 c618 08 no_sys_peer
28 Aug 19:04:55 ntpd[695]: Soliciting pool server 45.79.187.10
28 Aug 19:06:37 ntpd[695]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
28 Aug 19:06:37 ntpd[695]: 0.0.0.0 c612 02 freq_set kernel 35.844 PPM
28 Aug 19:06:37 ntpd[695]: 0.0.0.0 c615 05 clock_sync

(It is now 28 Aug 22:27:?? local time.)
Comment 42 Mark Millard 2018-08-30 13:59:14 UTC
(In reply to hlh from comment #33)

In -r338341 I see a mix of >0 and <0 "offset" figures (and one
0). For example:

# uname -apKU
you have mail
FreeBSD pine64 12.0-ALPHA3 FreeBSD 12.0-ALPHA3  r338341M  arm64 aarch64 1200083 1200083

# ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 0.freebsd.pool. .POOL.          16 p    -   64    0    0.000    0.000   0.002
*time-c-g.nist.g .NIST.           1 u   28  256  377   79.752    3.732  12.078
+ntp2.wiktel.com 212.215.1.157    2 u  512  512  377   67.564    0.932   2.819
-ewr1.m-d.net    18.26.4.105      2 u  122  512  377   76.835    0.121   2.410
-time.tritn.com  198.60.22.240    2 u  590 1024  377  158.600   -3.355   2.189
+ntp.your.org    .CDMA.           1 u  212  512  377   61.884    5.245   0.668