Bug 153620

Summary: [xen] Xen guest system clock drifts in AWS EC2 (FreeBSD 9.0-CURRENT i386 T1-micro)
Product: Base System Reporter: Greg Holmberg <fbsd-9.0-aws-ec2-1293964000>
Component: kern Assignee: freebsd-bugs (Nobody) <bugs>
Status: Closed Overcome By Events    
Severity: Affects Only Me CC: cperciva
Priority: Normal    
Version: 9.0-CURRENT   
Hardware: Any   
OS: Any   

Description Greg Holmberg 2011-01-02 10:40:13 UTC
9.0-CURRENT system running as AMI in Amazon EC2 cloud keeps poor time.

System was under heavy load, repeatedly compiling packages to exercise memory allocation code.

Clock in guest should be updated faithfully by the host. Clock in this AMI drifted 2200 seconds over 11 hours.

Fix: 

No known fix. (Didn't rtfs yet)

Workaround: perhaps run ntpdate out of cron every twenty minutes?
How-To-Repeat: From Amazon AWS Console, start a FreeBSD instance. I used ami-a0fc0dc9, the most recent 9.0-CURRENT available on Dec 30, 2010.

Wait a few hours. (Maybe use it heavily?)

Compare the correct time from a good NTP source with the AMI system clock.
Comment 1 Colin Percival freebsd_committer freebsd_triage 2011-01-02 11:12:29 UTC
Responsible Changed
From-To: freebsd-bugs->freebsd-xen

Assign Xen clock bug to freebsd-xen list.
Comment 2 Colin Percival freebsd_committer freebsd_triage 2011-01-02 11:16:52 UTC
This is interesting -- I thought I had squashed all the clock drift bugs.

Can you tell me:
1. Did the clock run ahead, or behind?
2. Can you reproduce this?
3. Did the clock _drift_, or _jump_?

The 2200 seconds mentioned is almost exactly the 2^41 ns period of the Xen
timecounter, so if the clock jumped it's probably safe to guess that it's
involved somehow...
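
For reference, the arithmetic behind that comparison can be checked with a short sketch. The variable names are illustrative only; the 2^41 ns period and the 2200 second figure are the values quoted in this report.

```python
# Period of a counter that wraps every 2^41 nanoseconds, in seconds.
PERIOD_NS = 2**41
period_s = PERIOD_NS / 1e9

reported_loss_s = 2200  # figure from the original report

print(f"{period_s:.3f} s")        # 2199.023 s
print(f"{period_s / 60:.2f} min") # 36.65 min
print(f"difference from report: {reported_loss_s - period_s:.3f} s")
```

The reported loss is within about one second of a single period, which is what points at a wraparound rather than ordinary drift.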

-- 
Colin Percival
Security Officer, FreeBSD | freebsd.org | The power to serve
Founder / author, Tarsnap | tarsnap.com | Online backups for the truly paranoid
Comment 3 Greg Holmberg 2011-01-02 11:51:33 UTC
On Sun, Jan 02, 2011 at 03:16:52AM -0800, Colin Percival wrote:
>
> Can you tell me:
> 1. Did the clock run ahead, or behind?

The NTP adjustment is a positive number (see below). Does this mean the
clock is running slow?


> 2. Can you reproduce this?
>
                                                                                                                               
Yes. In the existing AMI, I just reset the clock again. Since I filed
the PR, it had drifted "offset 0.124360 sec". While writing this email,
it has drifted another "offset 0.009534 sec". I will let it go without
correction for a while now.


> 3. Did the clock _drift_, or _jump_?
>
                                                                                                                               
Good question. Based on a handful of invocations of ntpdate, I would
say that it drifts. The offset is different each time. ...

Regards,

Greg
Comment 4 greg 2011-01-04 01:10:29 UTC
> On Sun, Jan 02, 2011 at 03:16:52AM -0800, Colin Percival wrote:
>
> 3. Did the clock _drift_, or _jump_?
>

I started a new VM (ami-5b82b72f) using the latest available 2011-01-01 code.

The problem -- the system clock losing 2200 seconds -- is still present.

Last night, the clock in the new AMI seems to have lost 2200 seconds twice in a 14 hour idle period.

The system clock seemed to be stopped before login. It started incrementing smoothly again when I logged in to check on the system in the morning. The system clock itself only managed to advance 14 minutes (837 seconds) overnight.

After subtracting the accumulated drift from the previous evening's work, we see that the offset with a nearby NTP server at the moment I logged in was 4398.107077. This value is very close to  2 * (2^41) / 1e9.

   system clock            offset from NTP source (seconds)

   ...
   20110103-141513UTC      0.437444     # drift is gradual, linear
   20110103-141613UTC      0.438124     # clock is running slow
   20110103-141713UTC      0.438786
   20110103-141813UTC      0.439584
   20110103-141914UTC      0.440247
   20110103-142014UTC      0.440957
   20110103-142114UTC      0.441663
   20110103-142214UTC      0.442313
   20110103-142314UTC      0.443064
   20110103-143711UTC      4398.550141  <--- 4398.107077 second difference
   20110103-143811UTC      4398.550755       between slow clock and NTP src
   ...
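
The two effects in this log (slow linear drift, then one large step) can be separated with a small calculation. This is only a sketch over three rows copied from the table above; the helper name `parse` is hypothetical.

```python
from datetime import datetime

PERIOD_S = 2**41 / 1e9  # timecounter period in seconds

def parse(stamp):
    # Timestamps in the log look like 20110103-141513UTC.
    return datetime.strptime(stamp, "%Y%m%d-%H%M%SUTC")

# (system clock, offset from NTP in seconds), copied from the table above.
first = ("20110103-141513UTC", 0.437444)
last_before_jump = ("20110103-142314UTC", 0.443064)
first_after_jump = ("20110103-143711UTC", 4398.550141)

# Gradual drift: change in offset per second of system clock time.
elapsed_s = (parse(last_before_jump[0]) - parse(first[0])).total_seconds()
drift_rate = (last_before_jump[1] - first[1]) / elapsed_s
drift_per_day = drift_rate * 86400

# Step change: how many whole timecounter periods does the jump cover?
jump_s = first_after_jump[1] - last_before_jump[1]
periods = round(jump_s / PERIOD_S)
residual_s = jump_s - periods * PERIOD_S

print(f"drift: {drift_per_day:.2f} s/day")
print(f"jump: {jump_s:.6f} s = {periods} periods + {residual_s:.3f} s")
```

On these rows the drift comes out at roughly one second per day, and the jump is two periods plus a residual of a few hundredths of a second.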

   
I have noticed that it only seems to happen when there are no active login sessions. I have seen a large jump in system time happen with a monitor script backgrounded from a single login shell, sometime after logout.  I have seen it happen when the script runs in the foreground of a window in GNU screen.

I have not been able to provoke it during an interactive login session. Maybe I just need to be more creative.

Best regards,

Greg Holmberg
Comment 5 Colin Percival freebsd_committer freebsd_triage 2011-01-04 11:32:21 UTC
Ok, I think I see what's happening here: Under some conditions it seems that
the clock stops running.

If there are no interrupts from any source, the FreeBSD instance never gets
scheduled; and the clock loses time in multiples of the timecounter period
(2^41 ns) if it doesn't tick for that long.

The reason this doesn't show up with an interactive login session open is that
ssh generates enough network traffic to wake the kernel periodically. This
also provides a workaround for this bug: send a ping to the instance once every
30 minutes.
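
A toy model of the failure mode described above, for illustration. It assumes only that elapsed time is read from a counter that wraps every 2^41 ns; it is not the FreeBSD timecounter or Xen code, and the function names are hypothetical.

```python
PERIOD_NS = 2**41  # counter wraps after this many nanoseconds

def observed_elapsed_ns(true_elapsed_ns, period_ns=PERIOD_NS):
    # A reader that only sees the counter modulo its period cannot tell
    # how many full wraps happened between two reads.
    return true_elapsed_ns % period_ns

def lost_ns(true_elapsed_ns, period_ns=PERIOD_NS):
    # Time that silently disappears: always a whole number of periods.
    return true_elapsed_ns - observed_elapsed_ns(true_elapsed_ns, period_ns)

NS_PER_S = 10**9

# Guest not scheduled for 75 minutes: two full periods vanish.
idle_75_min = 75 * 60 * NS_PER_S
print(lost_ns(idle_75_min) // PERIOD_NS)  # 2

# Guest woken every 30 minutes: the gap is shorter than one period.
idle_30_min = 30 * 60 * NS_PER_S
print(lost_ns(idle_30_min))  # 0
```

In this model a 30 minute gap (1800 seconds) stays below the period of about 2199 seconds, so no whole period can be lost between wakeups.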

-- 
Colin Percival
Security Officer, FreeBSD | freebsd.org | The power to serve
Founder / author, Tarsnap | tarsnap.com | Online backups for the truly paranoid
Comment 6 Greg Holmberg 2011-01-05 16:32:53 UTC
The latest AMI (FreeBSD 9.0-CURRENT @ 2011-01-04) still drops time in chunks. It loses time in multiples of 2200 seconds while idle.

The clock in the latest AMIs and older AMIs also drifts. Without some kind of external correction, it runs slightly slower, losing time at a rate of a little less than one second every 24 hours.

In the 2011-01-01 and 2011-01-04 AMIs, a single ping from a remote host once every thirty minutes keeps an otherwise idle VM awake enough to prevent any time from being lost in chunks.
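
A minimal sketch of such a periodic ping, run from a remote host. The function names, the `rounds` parameter, and the example address are assumptions for illustration; only the thirty minute interval comes from this report. It shells out to the system `ping` with `-c 1` to send a single echo request.

```python
import subprocess
import time

PING_INTERVAL_S = 30 * 60  # thirty minutes, the interval reported to work

def ping_command(host):
    # One ICMP echo request; "-c 1" is the usual count flag for ping.
    return ["ping", "-c", "1", host]

def keep_awake(host, interval_s=PING_INTERVAL_S, rounds=None,
               sleep=time.sleep, run=subprocess.run):
    # Ping the host, then wait, forever or for a fixed number of rounds.
    # sleep and run are injectable so the loop can be exercised offline.
    sent = 0
    while rounds is None or sent < rounds:
        run(ping_command(host), stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL, check=False)
        sent += 1
        if rounds is None or sent < rounds:
            sleep(interval_s)
    return sent
```

Calling `keep_awake("192.0.2.10")` (a placeholder address) would ping indefinitely. This only addresses the loss of time in chunks, not the gradual drift.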

In the latest AMIs and older AMIs, no amount of system activity or interrupts seems to prevent, aggravate, or change the rate of gradual clock drift.

Regards,

Greg Holmberg
Comment 7 Greg Holmberg 2011-01-09 05:39:44 UTC
This problem -- dropping 2200 seconds of clock time -- is not seen in a
FreeBSD 8.2-RC1 AMI on Amazon EC2 (ami-f77e4a83), using the same type of
T1-micro VM as in the initial report.

The clock drift noted with 9-CURRENT is also present in 8.2-RC1, but at
about 40% of the rate, taking three hours and twenty-two minutes to lose
an entire second.

Best regards,

Greg Holmberg
Comment 8 Greg Holmberg 2011-01-09 14:01:44 UTC
Both Intel and AMD systems experience this problem.

Interestingly, the dual Xeon E5430 lost 4x more time over six hours of
idling than the Opteron 2218 HE. The image was ami-e388bd97.

Best,

Greg Holmberg
Comment 9 abaldwin 2011-04-02 21:20:50 UTC
I'm noticing the same behavior on FreeBSD 8.2-RELEASE on t1.micro 
instances (AMI ami-423bc82b). I'm routinely losing ~4400 seconds (which 
is enough to force ntpd to abort) when the instance is idle (no SSH 
sessions connected).
Comment 10 Eitan Adler freebsd_committer freebsd_triage 2017-12-31 08:01:35 UTC
For bugs matching the following criteria:

Status: In Progress Changed: (is less than) 2014-06-01

Reset to default assignee and clear in-progress tags.

Mail being skipped
Comment 11 Colin Percival freebsd_committer freebsd_triage 2020-02-15 22:25:48 UTC
Closing this since it applies to a very old FreeBSD running on very old EC2 instances using very old virtualization (Xen/PV).

If anyone encounters similar clock issues now it will almost certainly be unrelated to this old bug report.