Bug 266145 - Intel Alder Lake: crashes on CURRENT vfs
Summary: Intel Alder Lake: crashes on CURRENT vfs
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: CURRENT
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: freebsd-bugs (Nobody)
URL:
Keywords: crash
Depends on:
Blocks:
 
Reported: 2022-08-31 21:23 UTC by rkoberman
Modified: 2023-04-03 20:31 UTC (History)
6 users (show)

See Also:


Attachments
core.text from crash (192.83 KB, text/plain)
2022-08-31 21:23 UTC, rkoberman
Current experimental workaround (4.61 KB, patch)
2022-11-20 15:04 UTC, Konstantin Belousov

Description rkoberman 2022-08-31 21:23:28 UTC
Created attachment 236272 [details]
core.text from crash

I have had multiple crashes on a new Alder Lake system (Lenovo T16 (Intel)). The backtraces on all showed vfs and a couple also included softupdate. Softupdate is further implicated by the fact that "fsck -l /" consistently reported "SOFTUPDATE INCONSISTENCY" and required a full fsck and managed to corrupt the pkg database at least three times. In all cases, I was building ports on the system.

Due to this being a fresh install, I was unable to get a dump, but I finally got one this morning. I disabled softupdate and have yet to have a crash; yet another indication that softupdate is the trigger. Possibly a race caused by the different speeds and thread counts of the different cores.

FreeBSD 14.0-CURRENT #0 main-n257619-c1a0ab5ec52: Fri Aug 26 09:37:24 UTC 2022

I will attach the core.text file to this report.
Comment 1 rkoberman 2022-09-01 06:33:49 UTC
After building over 200 ports today after disabling soft updates, no crashes. While not complete proof, it looks pretty clear that the problem is in soft updates. I still have a lot of ports to build.
Comment 2 rkoberman 2022-09-01 23:41:32 UTC
I have now had a crash with soft updates disabled, so disabling them is not a workaround. It does seem to significantly reduce the frequency of the crashes. I am saving the vmcore and core.txt files.
Comment 3 Amar Takhar 2022-09-05 05:31:09 UTC
It's the same for me on my system with an Intel i9-12900KF. Disabling soft updates didn't make much of a change for me, though it did take a bit longer to crash. Once it hits somewhere it doesn't like on the filesystem, it consistently fails, even across reboots. It seems to be extremely random. Disabling the E-cores fixes all the issues.
Comment 4 rkoberman 2022-09-05 05:55:24 UTC
(In reply to Amar Takhar from comment #3)
I'm reasonably sure that it is not a problem with the file system itself. It's likely a race of some sort caused by the different performance of the cores. When a process moves between cores of different speeds, it looks like things go wonky, probably in the vfs or maybe the UFS code.

I now have about 7 dumps, though only one since disabling soft updates that looks tied to vfs.

I can make the cores available as they are all from a system being prepped for use with no sensitive data files.
Comment 5 Amar Takhar 2022-09-06 23:19:40 UTC
(In reply to rkoberman from comment #4)

Yes, I would believe that, as the issues aren't always in VFS. Sometimes my machine just reboots immediately: no dump, just an instant reboot. There is also the issue of the audio overruns. With E-cores enabled I also experience odd slowdowns out of nowhere; sometimes a single video playing in Firefox causes it, or just running a build. I experience one of these issues with E-cores turned off.
Comment 6 rkoberman 2022-09-07 00:20:42 UTC
(In reply to Amar Takhar from comment #5)
There are two unrelated issues causing the crashes. One is the UFS/VFS issue I described in this ticket. It is distinct from the iwlwifi crash that only occurs at driver load. I have only seen it at boot, but bz@ reports that it has been seen when reloading the driver on a running system. It is unclear to me that this is Alder Lake specific.

There is another iwlwifi issue that causes a lot of annoying console messages. I have been testing a fix for it, and bz@ should have it patched shortly. Again, not Alder Lake specific.

I have not opened a ticket on the iwlwifi crashes, but have reported them to bz@.
Comment 7 rkoberman 2022-10-09 06:09:39 UTC
Small update on this. I ran the Alder Lake system on only the 2 P cores for at least three days with no panics. I tried switching to only the 8 E cores and the system crashed again. I am back to P cores. I believe all crashes appeared to be tied to VFS/UFS activity, but I lost my core files except for the crash today and the one I posted to this ticket.
Comment 8 Amar Takhar 2022-10-09 06:29:37 UTC
(In reply to rkoberman from comment #7)

Every crash I had was UFS/VFS as well. I still have all my coredumps. I'm able to disable my E-cores via the BIOS, which is what I have done; the system runs very stable after that.

I should also mention there are severe audio issues as well, which do not go away when the E-cores are disabled. This may or may not be related; more info is in this ticket:

  https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=263385
Comment 9 Konstantin Belousov freebsd_committer freebsd_triage 2022-11-20 15:04:44 UTC
Created attachment 238192 [details]
Current experimental workaround
Comment 10 rkoberman 2022-11-20 17:26:54 UTC
Can I assume that, with this patch, I should re-enable PCID?

Building new kernel now. Will report any issues I see.
Comment 11 rkoberman 2022-12-18 20:53:56 UTC
I have now been running for over a month with all cores and PCID enabled, with no crashes on my Alder Lake. Is it time to call this "fixed"?
Comment 12 Amar Takhar 2022-12-19 17:55:18 UTC
(In reply to rkoberman from comment #11)

As far as I know, nothing has landed in src for this yet, just the patch that appears to work, so I wouldn't close this ticket until something is finalised.
Comment 13 rkoberman 2022-12-20 04:01:05 UTC
I am now very confused. I thought that the patch had been committed and checked out the latest sources. I've been running unpatched main for almost two weeks with no problems with all 12 "cores" enabled and vm.pmap.pcid_enabled: 1. Has some other fix been committed or am I just incredibly lucky? I never kept the system running with a MATE desktop for any length of time in the past.

Could some other change have fixed it?
Comment 14 Amar Takhar 2022-12-20 05:07:06 UTC
(In reply to rkoberman from comment #13)

I don't think anything has been pushed regarding this. It's possible you've been lucky. Are you sure you're running the unpatched version? How recent is it? I tried CURRENT from not too long ago and had the crash come up until I patched.
Comment 15 rkoberman 2022-12-20 06:03:06 UTC
(In reply to Amar Takhar from comment #14)
Sorry for the noise! The version I am running is from 2-Dec. I removed the patched version on 9-Dec, so I am running the patched code.

I'll go re-install the patch in a few minutes. Again, sorry for the noise.
Comment 16 Konstantin Belousov freebsd_committer freebsd_triage 2022-12-21 11:31:56 UTC
https://reviews.freebsd.org/D37770
Comment 17 commit-hook freebsd_committer freebsd_triage 2022-12-31 22:11:23 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=cde70e312c3fde5b37a29be1dacb7fde9a45b94a

commit cde70e312c3fde5b37a29be1dacb7fde9a45b94a
Author:     Konstantin Belousov <kib@FreeBSD.org>
AuthorDate: 2022-10-10 23:08:55 +0000
Commit:     Konstantin Belousov <kib@FreeBSD.org>
CommitDate: 2022-12-31 22:09:45 +0000

    amd64: for small cores, use (big hammer) INVPCID_CTXGLOB instead of INVLPG

    A hypothetical CPU bug makes invalidation of global PTEs using INVLPG
    in pcid mode unreliable, it seems.  The workaround is applied for all
    CPUs with small cores, since we do not know the scope of the issue, and
    the right fix.

    Reviewed by:    alc (previous version)
    Discussed with: emaste, markj
    Tested by:      karels
    PR:     261169, 266145
    Sponsored by:   The FreeBSD Foundation
    MFC after:      1 week
    Differential revision:  https://reviews.freebsd.org/D37770

 sys/amd64/amd64/initcpu.c    |  5 +++++
 sys/amd64/amd64/mp_machdep.c | 16 +++++++++++-----
 sys/amd64/amd64/pmap.c       | 36 +++++++++++++++++++++++++++++-------
 sys/amd64/include/pcpu.h     |  3 ++-
 sys/amd64/include/pmap.h     | 20 ++++++++++++++++++++
 5 files changed, 67 insertions(+), 13 deletions(-)
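
Editorial illustration of the commit message above, not the committed code: a minimal C sketch of the decision it describes, assuming a "small core" is identified by the core-type field that Intel documents in CPUID leaf 0x1A (EAX bits 31:24, where 0x20 is Atom and 0x40 is Core). The helper names, the enum, and the three boolean inputs are hypothetical; the real logic lives in the files listed in the diffstat.

```c
#include <stdbool.h>
#include <stdint.h>

/* Core-type values from CPUID leaf 0x1A, EAX bits 31:24 (Intel SDM). */
#define CORE_TYPE_ATOM 0x20u /* efficiency ("small") core */
#define CORE_TYPE_CORE 0x40u /* performance core */

/* Hypothetical names for the two invalidation strategies. */
enum inval_method {
	INVAL_INVLPG,          /* invalidate one page with INVLPG */
	INVAL_INVPCID_CTXGLOB  /* big hammer: flush all contexts, globals included */
};

/* Hypothetical helper: extract the core type from a CPUID.1AH EAX value. */
static inline uint32_t
core_type_from_cpuid_1a(uint32_t eax)
{
	return ((eax >> 24) & 0xffu);
}

/* A CPU counts as a small core only on a hybrid part reporting the Atom type. */
static inline bool
is_small_core(bool hybrid, uint32_t cpuid_1a_eax)
{
	return (hybrid &&
	    core_type_from_cpuid_1a(cpuid_1a_eax) == CORE_TYPE_ATOM);
}

/*
 * Choose how to invalidate a global PTE. The heavier method is used only
 * when PCID is on, the workaround is enabled, and this CPU is a small core.
 */
static inline enum inval_method
choose_global_inval(bool pcid_enabled, bool workaround_enabled,
    bool small_core)
{
	if (pcid_enabled && workaround_enabled && small_core)
		return (INVAL_INVPCID_CTXGLOB);
	return (INVAL_INVLPG);
}
```

In this sketch, choose_global_inval(true, true, is_small_core(true, 0x20000000)) selects the INVPCID path, while a P core (core type 0x40) or a system with PCID disabled keeps using INVLPG. Keeping the choice per CPU limits the cost of the heavier flush to the cores suspected of the problem.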
Comment 18 rkoberman 2023-01-01 18:33:33 UTC
After commit cde70e312c3fde5b37a29be1dacb7fde9a45b94a the issue seems to be resolved. All cores running with pmap enabled for several hours.

While I can't claim that this is the final resolution of all issues with cores running with different performance characteristics, at least this specific and drastic issue looks resolved. Thanks to all, especially KIB, for resolving this issue. As far as I am concerned, once this gets MFCed, this ticket may be closed.
Comment 19 Ed Maste freebsd_committer freebsd_triage 2023-01-09 19:17:30 UTC
cde70e312c3f should solve the problem as a workaround, but the underlying issue likely needs new microcode or a new CPU. Once those are available we'll ask folks to test the original code path (via the vm.pmap.pcid_invlpg_workaround tunable) and we should be able to tighten up the workaround to apply only to known microcode versions/CPUs.
Comment 20 commit-hook freebsd_committer freebsd_triage 2023-01-20 03:25:06 UTC
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=567cc4e6bfd92d7351e385569f2bb4b7c89b6db0

commit 567cc4e6bfd92d7351e385569f2bb4b7c89b6db0
Author:     Konstantin Belousov <kib@FreeBSD.org>
AuthorDate: 2022-10-10 23:08:55 +0000
Commit:     Konstantin Belousov <kib@FreeBSD.org>
CommitDate: 2023-01-20 03:21:57 +0000

    amd64: for small cores, use (big hammer) INVPCID_CTXGLOB instead of INVLPG

    PR:     261169, 266145
    Tested by:      pho

    (cherry picked from commit cde70e312c3fde5b37a29be1dacb7fde9a45b94a)

 sys/amd64/amd64/initcpu.c    |  5 +++++
 sys/amd64/amd64/mp_machdep.c | 16 +++++++++++-----
 sys/amd64/amd64/pmap.c       | 36 +++++++++++++++++++++++++++++-------
 sys/amd64/include/pcpu.h     |  3 ++-
 sys/amd64/include/pmap.h     | 20 ++++++++++++++++++++
 5 files changed, 67 insertions(+), 13 deletions(-)