Bug 218452 - I can cause an assertion with a dtrace script and when it does happen, dtrace fails until the module dtraceall is reloaded
Summary: I can cause an assertion with a dtrace script and when it does happen, dtrace...
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: 11.0-STABLE
Hardware: Any Any
: --- Affects Some People
Assignee: Mark Johnston
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-04-07 04:49 UTC by Adam Stylinski
Modified: 2018-03-29 11:46 UTC (History)
4 users (show)

See Also:


Attachments
Dtrace script that eventually fails with an assertion (440 bytes, text/plain)
2017-04-07 04:49 UTC, Adam Stylinski
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Adam Stylinski 2017-04-07 04:49:38 UTC
Created attachment 181559 [details]
Dtrace script that eventually fails with an assertion

I've attached a dtrace script where this problem happens reliably for me within a stable VM.  The assertion is about a timestamp.

Here's the assertion:

Assertion failed: (buf->dtbd_timestamp >= first_timestamp), file /usr/src/cddl/lib/libdtrace/../../../cddl/contrib/opensolaris/lib/libdtrace/common/dt_consume.c, line 3328.
Comment 1 Fabian Keil 2017-04-07 10:44:56 UTC
In ElectroBSD I replaced this and another timestamp-related assertion
with warnings:

https://www.fabiankeil.de/sourcecode/electrobsd/ElectroBSD-r312620-6cfa243f1516/0129-dtrace_consume-Warn-if-cpu-cores-are
https://www.fabiankeil.de/sourcecode/electrobsd/ElectroBSD-r312620-6cfa243f1516/0165-libdtrace-Replace-another-timestamp-related-assert-wi.diffn-t-completely-in.diff

Obviously that's not a real fix but in many cases the timestamp mismatch
doesn't matter and in these cases it seemed preferable to prematurly
aborting the trace and dumping core.
Comment 2 Fabian Keil 2017-04-07 11:27:09 UTC
Copy&paste is complicated but thanks to redirects the links should work now.
Comment 3 Adam Stylinski 2017-04-07 14:16:10 UTC
(In reply to Fabian Keil from comment #2)

Based on your description of what this assertion really means I wonder if the hypervisor I'm using exacerbates this issue.  I'm using VirtualBox.
Comment 4 Andriy Gapon freebsd_committer 2017-04-07 16:00:26 UTC
I've seen an issue like this in the past.  It was on a physical machine where TSC locks were noticeably different between processors.  Unfortunately, right now FreeBSD measures TSC differences only once and doesn't recalibrate deltas periodically like illumos does.
Comment 5 Fabian Keil 2017-04-07 16:30:57 UTC
I'm seeing the issue on a pysical system as well.

dmesg.boot excerpt:
CPU: AMD Athlon(tm) 64 X2 Dual Core Processor 4600+ (2542.18-MHz K8-class CPU)
  Origin="AuthenticAMD"  Id=0x60fb2  Family=0xf  Model=0x6b  Stepping=2
  Features=0x178bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2,HTT>
  Features2=0x2001<SSE3,CX16>
  AMD Features=0xea500800<SYSCALL,NX,MMX+,FFXSR,RDTSCP,LM,3DNow!+,3DNow!>
  AMD Features2=0x11f<LAHF,CMP,SVM,ExtAPIC,CR8,Prefetch>
  SVM: NAsids=64
real memory  = 4294967296 (4096 MB)
avail memory = 3984990208 (3800 MB)
Event timer "LAPIC" quality 100

/var/log/messages frequently contains messages like:
Apr  7 18:10:04 kendra kernel: calcru: runtime went backwards from 2321 usec to 2192 usec for pid 43082 (getty)
Apr  7 18:10:04 kendra kernel: calcru: runtime went backwards from 33837 usec to 32484 usec for pid 46789 (sshd)
Apr  7 18:10:04 kendra kernel: calcru: runtime went backwards from 236 usec to 233 usec for pid 45792 (g_eli[0] gpt/swap-a)
Apr  7 18:10:04 kendra kernel: calcru: runtime went backwards from 878257 usec to 829154 usec for pid 88073 (perl)
Apr  7 18:10:04 kendra kernel: calcru: runtime went backwards from 1228 usec to 1159 usec for pid 70123 (nginx)
Apr  7 18:10:04 kendra kernel: calcru: runtime went backwards from 930 usec to 878 usec for pid 68524 (nginx)
Apr  7 18:10:04 kendra kernel: calcru: runtime went backwards from 1297 usec to 1225 usec for pid 16461 (casperd)
which I believe is another symptom of the same problem.

My other systems use different hardware and don't trigger the DTrace sanity check.

I don't know if VirtualBox exacerbates the problem or if it simply
inherits it from the system it's running on.

I don't use VirtualBox and I'm not seeing the issue when using bhyve,
but then again, I can only run bhyve on systems where the host system
doesn't trigger the sanity checks either.
Comment 6 Adam Stylinski 2017-04-07 17:34:48 UTC
(In reply to Fabian Keil from comment #5)

Interesting, I don't see those in /var/log/messages but maybe you're building with more warning/debug stuff turned on (perhaps the invariant options?).  I'm encountering this issue with just STABLE and the generic kernel and no additional build flags.  It's possible it's machine and or virtual machine specific, it's also possible that machines on release aren't seeing this because assertions are turned off.  Then again, I thought that was the case for everything but CURRENT, but evidently I'm wrong on that part.
Comment 7 Adam Stylinski 2017-04-07 20:54:19 UTC
(In reply to Andriy Gapon from comment #4)

Andriy - given that this is happening, can I expect these microsecond timestamp deltas to be reasonably accurate or am I screwed without a fix for this?
Comment 8 Fabian Keil 2017-04-08 10:34:21 UTC
The calcru complaints I'm getting don't depend on debug flags:

sys/kern/kern_resource.c-914-   } else { /* tu < ruxp->rux_tu */
sys/kern/kern_resource.c-915-           /*
sys/kern/kern_resource.c-916-            * What happened here was likely that a laptop, which ran at
sys/kern/kern_resource.c-917-            * a reduced clock frequency at boot, kicked into high gear.
sys/kern/kern_resource.c-918-            * The wisdom of spamming this message in that case is
sys/kern/kern_resource.c-919-            * dubious, but it might also be indicative of something
sys/kern/kern_resource.c-920-            * serious, so lets keep it and hope laptops can be made
sys/kern/kern_resource.c-921-            * more truthful about their CPU speed via ACPI.
sys/kern/kern_resource.c-922-            */
sys/kern/kern_resource.c:923:           printf("calcru: runtime went backwards from %ju usec "
sys/kern/kern_resource.c-924-               "to %ju usec for pid %d (%s)\n",
sys/kern/kern_resource.c-925-               (uintmax_t)ruxp->rux_tu, (uintmax_t)tu,
sys/kern/kern_resource.c-926-               p->p_pid, p->p_comm);
Comment 9 commit-hook freebsd_committer 2017-12-18 17:26:37 UTC
A commit references this bug:

Author: markj
Date: Mon Dec 18 17:26:25 UTC 2017
New revision: 326935
URL: https://svnweb.freebsd.org/changeset/base/326935

Log:
  Avoid CPU migration in dtrace_gethrtime() on x86.

  dtrace_gethrtime() may be called outside of probe context, and in
  particular, from the DTRACEIOC_BUFSNAP handler.

  Disable interrupts rather than using sched_pin() to help ensure that
  we don't call any external functions when in probe context.

  PR:		218452
  MFC after:	1 week

Changes:
  head/sys/cddl/dev/dtrace/amd64/dtrace_subr.c
  head/sys/cddl/dev/dtrace/i386/dtrace_subr.c
Comment 10 Mark Johnston freebsd_committer 2017-12-18 17:27:42 UTC
I haven't been able to reproduce this in bhyve, but would be interested in knowing whether it still occurs after r326935.
Comment 11 Mark Linimon freebsd_committer 2018-02-19 01:36:40 UTC
To submitter: does this problem still recur?
Comment 12 Adam Stylinski 2018-02-19 02:16:15 UTC
(In reply to Mark Linimon from comment #11)

I do still have the VM this occurs in, but I'm presently out of town for the week.  I may have time to test it out remotely sometime during the week, though.
Comment 13 martin 2018-02-22 11:35:52 UTC
(In reply to Mark Johnston from comment #10)

The change in r326935 does not fix it for me ('dtruss pwd' triggers the original problem for me sometimes, but I've never seen any calcru messages).

Initially I was running the GENERIC 10.3-RELEASE from the iso, updated to p26 with freebsd-update (no virtualization).  I then recompiled dtruss.ko with the patched dtrace_gethrtime, but the problem persists.  I checked the code in the new module using kgdb.

My hardware is similar to comment #5:

CPU: AMD Athlon(tm) 64 X2 Dual Core Processor 4400+ (2216.54-MHz 686-class CPU)
  Origin="AuthenticAMD"  Id=0x20f32  Family=0xf  Model=0x23  Stepping=2
  Features=0x178bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2,HTT>
  Features2=0x1<SSE3>
  AMD Features=0xe2500800<SYSCALL,NX,MMX+,FFXSR,LM,3DNow!+,3DNow!>
  AMD Features2=0x3<LAHF,CMP>
real memory  = 2147483648 (2048 MB)
avail memory = 2081153024 (1984 MB)
Event timer "LAPIC" quality 400
ACPI APIC Table: <Nvidia AWRDACPI>
FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
FreeBSD/SMP: 1 package(s) x 2 core(s)
 cpu0 (BSP): APIC ID:  0
 cpu1 (AP): APIC ID:  1

I think this CPU is known to have TSC synchronization issues across cores.
Comment 14 Mark Johnston freebsd_committer 2018-03-10 16:08:03 UTC
Would anyone able to reproduce this problem be able to test the patch here and confirm whether it fixes the problem? https://reviews.freebsd.org/D14638
Comment 15 martin 2018-03-29 11:46:15 UTC
(In reply to Mark Johnston from comment #14)

I applied the D14638 patch to 10.3-RELEASE-p26, but still get the original assertion:

Assertion failed: (buf->dtbd_timestamp >= first_timestamp), file /usr/src/cddl/lib/libdtrace/../../../cddl/contrib/opensolaris/lib/libdtrace/common/dt_consume.c, line 3328.