Created attachment 181559 [details] Dtrace script that eventually fails with an assertion I've attached a dtrace script where this problem happens reliably for me within a stable VM. The assertion is about a timestamp. Here's the assertion: Assertion failed: (buf->dtbd_timestamp >= first_timestamp), file /usr/src/cddl/lib/libdtrace/../../../cddl/contrib/opensolaris/lib/libdtrace/common/dt_consume.c, line 3328.
In ElectroBSD I replaced this and another timestamp-related assertion with warnings: https://www.fabiankeil.de/sourcecode/electrobsd/ElectroBSD-r312620-6cfa243f1516/0129-dtrace_consume-Warn-if-cpu-cores-are https://www.fabiankeil.de/sourcecode/electrobsd/ElectroBSD-r312620-6cfa243f1516/0165-libdtrace-Replace-another-timestamp-related-assert-wi.diffn-t-completely-in.diff Obviously that's not a real fix but in many cases the timestamp mismatch doesn't matter and in these cases it seemed preferable to prematurly aborting the trace and dumping core.
Copy&paste is complicated but thanks to redirects the links should work now.
(In reply to Fabian Keil from comment #2) Based on your description of what this assertion really means I wonder if the hypervisor I'm using exacerbates this issue. I'm using VirtualBox.
I've seen an issue like this in the past. It was on a physical machine where TSC locks were noticeably different between processors. Unfortunately, right now FreeBSD measures TSC differences only once and doesn't recalibrate deltas periodically like illumos does.
I'm seeing the issue on a pysical system as well. dmesg.boot excerpt: CPU: AMD Athlon(tm) 64 X2 Dual Core Processor 4600+ (2542.18-MHz K8-class CPU) Origin="AuthenticAMD" Id=0x60fb2 Family=0xf Model=0x6b Stepping=2 Features=0x178bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2,HTT> Features2=0x2001<SSE3,CX16> AMD Features=0xea500800<SYSCALL,NX,MMX+,FFXSR,RDTSCP,LM,3DNow!+,3DNow!> AMD Features2=0x11f<LAHF,CMP,SVM,ExtAPIC,CR8,Prefetch> SVM: NAsids=64 real memory = 4294967296 (4096 MB) avail memory = 3984990208 (3800 MB) Event timer "LAPIC" quality 100 /var/log/messages frequently contains messages like: Apr 7 18:10:04 kendra kernel: calcru: runtime went backwards from 2321 usec to 2192 usec for pid 43082 (getty) Apr 7 18:10:04 kendra kernel: calcru: runtime went backwards from 33837 usec to 32484 usec for pid 46789 (sshd) Apr 7 18:10:04 kendra kernel: calcru: runtime went backwards from 236 usec to 233 usec for pid 45792 (g_eli[0] gpt/swap-a) Apr 7 18:10:04 kendra kernel: calcru: runtime went backwards from 878257 usec to 829154 usec for pid 88073 (perl) Apr 7 18:10:04 kendra kernel: calcru: runtime went backwards from 1228 usec to 1159 usec for pid 70123 (nginx) Apr 7 18:10:04 kendra kernel: calcru: runtime went backwards from 930 usec to 878 usec for pid 68524 (nginx) Apr 7 18:10:04 kendra kernel: calcru: runtime went backwards from 1297 usec to 1225 usec for pid 16461 (casperd) which I believe is another symptom of the same problem. My other systems use different hardware and don't trigger the DTrace sanity check. I don't know if VirtualBox exacerbates the problem or if it simply inherits it from the system it's running on. I don't use VirtualBox and I'm not seeing the issue when using bhyve, but then again, I can only run bhyve on systems where the host system doesn't trigger the sanity checks either.
(In reply to Fabian Keil from comment #5) Interesting, I don't see those in /var/log/messages but maybe you're building with more warning/debug stuff turned on (perhaps the invariant options?). I'm encountering this issue with just STABLE and the generic kernel and no additional build flags. It's possible it's machine and or virtual machine specific, it's also possible that machines on release aren't seeing this because assertions are turned off. Then again, I thought that was the case for everything but CURRENT, but evidently I'm wrong on that part.
(In reply to Andriy Gapon from comment #4) Andriy - given that this is happening, can I expect these microsecond timestamp deltas to be reasonably accurate or am I screwed without a fix for this?
The calcru complaints I'm getting don't depend on debug flags: sys/kern/kern_resource.c-914- } else { /* tu < ruxp->rux_tu */ sys/kern/kern_resource.c-915- /* sys/kern/kern_resource.c-916- * What happened here was likely that a laptop, which ran at sys/kern/kern_resource.c-917- * a reduced clock frequency at boot, kicked into high gear. sys/kern/kern_resource.c-918- * The wisdom of spamming this message in that case is sys/kern/kern_resource.c-919- * dubious, but it might also be indicative of something sys/kern/kern_resource.c-920- * serious, so lets keep it and hope laptops can be made sys/kern/kern_resource.c-921- * more truthful about their CPU speed via ACPI. sys/kern/kern_resource.c-922- */ sys/kern/kern_resource.c:923: printf("calcru: runtime went backwards from %ju usec " sys/kern/kern_resource.c-924- "to %ju usec for pid %d (%s)\n", sys/kern/kern_resource.c-925- (uintmax_t)ruxp->rux_tu, (uintmax_t)tu, sys/kern/kern_resource.c-926- p->p_pid, p->p_comm);
A commit references this bug: Author: markj Date: Mon Dec 18 17:26:25 UTC 2017 New revision: 326935 URL: https://svnweb.freebsd.org/changeset/base/326935 Log: Avoid CPU migration in dtrace_gethrtime() on x86. dtrace_gethrtime() may be called outside of probe context, and in particular, from the DTRACEIOC_BUFSNAP handler. Disable interrupts rather than using sched_pin() to help ensure that we don't call any external functions when in probe context. PR: 218452 MFC after: 1 week Changes: head/sys/cddl/dev/dtrace/amd64/dtrace_subr.c head/sys/cddl/dev/dtrace/i386/dtrace_subr.c
I haven't been able to reproduce this in bhyve, but would be interested in knowing whether it still occurs after r326935.
To submitter: does this problem still recur?
(In reply to Mark Linimon from comment #11) I do still have the VM this occurs in, but I'm presently out of town for the week. I may have time to test it out remotely sometime during the week, though.
(In reply to Mark Johnston from comment #10) The change in r326935 does not fix it for me ('dtruss pwd' triggers the original problem for me sometimes, but I've never seen any calcru messages). Initially I was running the GENERIC 10.3-RELEASE from the iso, updated to p26 with freebsd-update (no virtualization). I then recompiled dtruss.ko with the patched dtrace_gethrtime, but the problem persists. I checked the code in the new module using kgdb. My hardware is similar to comment #5: CPU: AMD Athlon(tm) 64 X2 Dual Core Processor 4400+ (2216.54-MHz 686-class CPU) Origin="AuthenticAMD" Id=0x20f32 Family=0xf Model=0x23 Stepping=2 Features=0x178bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2,HTT> Features2=0x1<SSE3> AMD Features=0xe2500800<SYSCALL,NX,MMX+,FFXSR,LM,3DNow!+,3DNow!> AMD Features2=0x3<LAHF,CMP> real memory = 2147483648 (2048 MB) avail memory = 2081153024 (1984 MB) Event timer "LAPIC" quality 400 ACPI APIC Table: <Nvidia AWRDACPI> FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs FreeBSD/SMP: 1 package(s) x 2 core(s) cpu0 (BSP): APIC ID: 0 cpu1 (AP): APIC ID: 1 I think this CPU is known to have TSC synchronization issues across cores.
Would anyone able to reproduce this problem be able to test the patch here and confirm whether it fixes the problem? https://reviews.freebsd.org/D14638
(In reply to Mark Johnston from comment #14) I applied the D14638 patch to 10.3-RELEASE-p26, but still get the original assertion: Assertion failed: (buf->dtbd_timestamp >= first_timestamp), file /usr/src/cddl/lib/libdtrace/../../../cddl/contrib/opensolaris/lib/libdtrace/common/dt_consume.c, line 3328.
Hi, I would like to bump this ticket. I am running into this issue on an AWS EC2 instance running 12.0-RELEASE-p2. Is additional debugging info needed to help squash this guy?
(In reply to pete from comment #16) I haven't been able to reproduce the issue in a long time despite a fair amount of effort. Could you attach a dmesg from the system and the dtrace invocation you're using?
(In reply to pete from comment #16) You may also wish to try applying r344366, though one user confirmed that it did not solve their problem.
A commit references this bug: Author: markj Date: Thu Mar 21 02:52:23 UTC 2019 New revision: 345359 URL: https://svnweb.freebsd.org/changeset/base/345359 Log: Don't attempt to measure TSC skew when running as a VM guest. It simply doesn't work in general since VCPUs may migrate between physical cores. The approach used to measure skew also doesn't make much sense in a VM. PR: 218452 MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Changes: head/sys/cddl/dev/dtrace/amd64/dtrace_subr.c head/sys/cddl/dev/dtrace/i386/dtrace_subr.c
If anyone able to reproduce the problem is willing, please try r345359. (It can be backported to any stable branch easily.) I don't believe it will fix the problem entirely, but I suspect that the attempt to correct for TSC skew hurts more than it helps when running in a VM.
(In reply to Mark Johnston from comment #20) Thanks Mark. I'll try to find steps which I can trigger this issue on and will test out. If there are any steps I can take to help debug the root cause on a system which is currently experiencing this please let me know.
(In reply to pete from comment #21) Could you please try running the following on an affected system, and paste the output here? # dtrace -n 'BEGIN{print(`tsc_skew); exit(0);}'
A commit references this bug: Author: markj Date: Thu Apr 4 02:05:51 UTC 2019 New revision: 345867 URL: https://svnweb.freebsd.org/changeset/base/345867 Log: MFC r345359, r345384: Don't attempt to measure TSC skew when running as a VM guest. PR: 218452 Changes: _U stable/12/ stable/12/sys/cddl/dev/dtrace/amd64/dtrace_subr.c stable/12/sys/cddl/dev/dtrace/i386/dtrace_subr.c
A commit references this bug: Author: markj Date: Thu Apr 4 02:07:24 UTC 2019 New revision: 345868 URL: https://svnweb.freebsd.org/changeset/base/345868 Log: MFC r345359, r345384: Don't attempt to measure TSC skew when running as a VM guest. PR: 218452 Changes: _U stable/11/ stable/11/sys/cddl/dev/dtrace/amd64/dtrace_subr.c stable/11/sys/cddl/dev/dtrace/i386/dtrace_subr.c
A commit references this bug: Author: markj Date: Thu Apr 4 02:08:37 UTC 2019 New revision: 345869 URL: https://svnweb.freebsd.org/changeset/base/345869 Log: MFC r345359, r345384: Don't attempt to measure TSC skew when running as a VM guest. PR: 218452 Changes: _U stable/10/ stable/10/sys/cddl/dev/dtrace/amd64/dtrace_subr.c stable/10/sys/cddl/dev/dtrace/i386/dtrace_subr.c
Please reopen if the bug is still reproducible.
# dtrace -n 'fbt:openzfs:vdev_prop_get:entry { stack(); }' dtrace: description 'fbt:openzfs:vdev_prop_get:entry ' matched 1 probe Assertion failed: (buf->dtbd_timestamp >= first_timestamp), file /usr/src/cddl/contrib/opensolaris/lib/libdtrace/common/dt_consume.c, line 3026. Abort (core dumped) # dtrace -n 'BEGIN{print(`tsc_skew); exit(0);}' dtrace: description 'BEGIN' matched 1 probe Assertion failed: (buf->dtbd_timestamp >= first_timestamp), file /usr/src/cddl/contrib/opensolaris/lib/libdtrace/common/dt_consume.c, line 3026. Abort (core dumped) # uname -a FreeBSD vdevprops 13.0-CURRENT FreeBSD 13.0-CURRENT r350702 GENERIC amd64 This is running inside virtualbox, so it might not be fixable.
(In reply to Allan Jude from comment #27) Does it occur consistently? Does kern.vm_guest give the right output under virtualbox?