Summary: | Intel Alder Lake: crashes on CURRENT vfs | | |
---|---|---|---|
Product: | Base System | Reporter: | rkoberman |
Component: | kern | Assignee: | freebsd-bugs (Nobody) <bugs> |
Status: | Closed FIXED | | |
Severity: | Affects Some People | CC: | emaste, fixer, grahamperrin, kib, ml, verm |
Priority: | --- | Keywords: | crash |
Version: | CURRENT | | |
Hardware: | amd64 | | |
OS: | Any | | |
See Also: | https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=261169 https://reviews.freebsd.org/D37770 | | |
After building over 200 ports today with soft updates disabled, I have had no crashes. While not complete proof, it looks pretty clear that the problem is in soft updates. I still have a lot of ports to build.

I have now had a crash with soft updates disabled, so disabling them is not a workaround. It does seem to significantly reduce the frequency of the crashes. I am saving the vmcore and core.txt files.

It's the same for me on my system, an Intel i9-12900KF. Disabling soft updates didn't make much of a change for me, though it did take a bit longer to crash. Once it hits somewhere it doesn't like on the filesystem it consistently fails, even across reboots. It seems to be extremely random. Disabling the E-cores fixes all the issues.

(In reply to Amar Takhar from comment #3) I'm reasonably sure that the problem is not with the file system itself. It's likely a race of some sort caused by the different performance of the cores. When a process moves between cores of different speeds, it looks like things go wonky, probably in the VFS or maybe the UFS code. I now have about 7 dumps, though only one since disabling soft updates, that look tied to VFS. I can make the cores available, as they are all from a system being prepped for use with no sensitive data files.

(In reply to rkoberman from comment #4) Yes, I would believe that, as the issues aren't always in VFS. Sometimes my machine just reboots immediately. No dump, it just instantly reboots. There is also the issue of the audio overruns. With E-cores enabled I also experience odd slowdowns out of nowhere; sometimes a single video playing in Firefox causes it, or just running a build. I experience one of these issues with E-cores turned off.

(In reply to Amar Takhar from comment #5) There are two unrelated issues causing the crashes. One is the UFS/VFS issue I described in this ticket. It is distinct from the iwlwifi crash that only occurs at driver load.
I have only seen it at boot, but bz@ reports that it has been seen when reloading the driver on a running system. It is unclear to me that this is Alder Lake specific. There is another iwlwifi issue that causes a lot of annoying console messages; I have been testing a fix for it, and bz@ should have it patched shortly. Again, not Alder Lake specific. I have not opened a ticket on the iwlwifi crashes, but have reported them to bz@.

Small update on this. I ran the Alder Lake system on only the 2 P-cores for at least three days with no panics. I tried switching to only the 8 E-cores and the system crashed again. I am back to P-cores. I believe all crashes appeared to be tied to VFS/UFS activity, but I lost my core files except for the crash today and the one I posted to this ticket.

(In reply to rkoberman from comment #7) Every crash I had was UFS/VFS as well. I still have all my coredumps. I'm able to disable my E-cores via the BIOS, which is what I have done; the system runs very stably after that. I should also mention there are severe audio issues as well, which do not go away when the E-cores are disabled.

This may or may not be related info in this ticket: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=263385

Created attachment 238192
Current experimental workaround
Can I assume that, with this patch, I should re-enable PCID? Building a new kernel now. Will report any issues I see.

I have now been running for over a month with all cores and PCID enabled, with no crashes on my Alder Lake system. Is it time to call this "fixed"?

(In reply to rkoberman from comment #11) As far as I know, nothing has landed in src for this yet, just the patch that appears to work, so I wouldn't close this ticket until something is finalised.

I am now very confused. I thought that the patch had been committed and checked out the latest sources. I've been running unpatched main for almost two weeks with no problems, with all 12 "cores" enabled and vm.pmap.pcid_enabled: 1. Has some other fix been committed, or am I just incredibly lucky? I never kept the system running with a MATE desktop for any length of time in the past. Could some other change have fixed it?

(In reply to rkoberman from comment #13) I don't think anything has been pushed regarding this. It's possible you've been lucky. Are you sure you're running the unpatched version? How recent is it? I tried CURRENT from not too long ago and had the crash come up until I patched.

(In reply to Amar Takhar from comment #14) Sorry for the noise! The version I am running is from 2-Dec. I removed the patched version on 9-Dec, so I am still running the patched code. I'll go re-install the patch in a few minutes. Again, sorry for the noise.

A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=cde70e312c3fde5b37a29be1dacb7fde9a45b94a

commit cde70e312c3fde5b37a29be1dacb7fde9a45b94a
Author: Konstantin Belousov <kib@FreeBSD.org>
AuthorDate: 2022-10-10 23:08:55 +0000
Commit: Konstantin Belousov <kib@FreeBSD.org>
CommitDate: 2022-12-31 22:09:45 +0000

amd64: for small cores, use (big hammer) INVPCID_CTXGLOB instead of INVLPG

A hypothetical CPU bug makes invalidation of global PTEs using INVLPG in pcid mode unreliable, it seems.
The workaround is applied for all CPUs with small cores, since we do not know the scope of the issue, and the right fix.

Reviewed by: alc (previous version)
Discussed with: emaste, markj
Tested by: karels
PR: 261169, 266145
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D37770

sys/amd64/amd64/initcpu.c | 5 +++++
sys/amd64/amd64/mp_machdep.c | 16 +++++++++++-----
sys/amd64/amd64/pmap.c | 36 +++++++++++++++++++++++++++++-------
sys/amd64/include/pcpu.h | 3 ++-
sys/amd64/include/pmap.h | 20 ++++++++++++++++++++
5 files changed, 67 insertions(+), 13 deletions(-)

After commit cde70e312c3fde5b37a29be1dacb7fde9a45b94a the issue seems to be resolved. All cores have been running with pmap enabled for several hours. While I can't claim that this is the final resolution of all issues with cores running with different performance characteristics, at least this specific and drastic issue looks resolved. Thanks to all, especially KIB, for resolving this issue. As far as I am concerned, once this gets MFCed, this ticket may be closed.

cde70e312c3f should solve the problem as a workaround, but the underlying issue likely needs new microcode or a new CPU. Once those are available we'll ask folks to test the original code path (via the vm.pmap.pcid_invlpg_workaround tunable) and we should be able to tighten up the workaround to apply only to known microcode versions/CPUs.
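The commit message and the follow-up comment describe a policy rather than show it, so here is a minimal C sketch of how such a choice could be expressed. It is not the code from cde70e312c3f: the struct, enum, and function names are hypothetical, and the rule that the fallback applies only when PCID mode, small cores, and the tunable are all in effect is an assumption inferred from the text above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names for illustration; none of these come from FreeBSD. */
enum inval_method {
	INVAL_INVLPG,		/* invalidate one page with INVLPG */
	INVAL_INVPCID_CTXGLOB	/* "big hammer": INVPCID, all contexts incl. globals */
};

struct cpu_state {
	bool pcid_enabled;	 /* assumed to mirror vm.pmap.pcid_enabled */
	bool has_small_cores;	 /* CPU has small (E) cores */
	bool workaround_enabled; /* assumed to mirror vm.pmap.pcid_invlpg_workaround */
};

/*
 * Pick how to invalidate a global PTE.  Assumption: the heavier
 * INVPCID_CTXGLOB path is used only when PCID mode is on, the CPU has
 * small cores, and the workaround has not been switched off.
 */
static enum inval_method
choose_global_inval(const struct cpu_state *cs)
{
	assert(cs != NULL);
	if (cs->pcid_enabled && cs->has_small_cores && cs->workaround_enabled)
		return (INVAL_INVPCID_CTXGLOB);
	return (INVAL_INVLPG);
}
```

In this sketch, clearing `workaround_enabled` models the "test the original code path" step from the comment: the function then falls back to `INVAL_INVLPG` even on a CPU with small cores.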
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=567cc4e6bfd92d7351e385569f2bb4b7c89b6db0

commit 567cc4e6bfd92d7351e385569f2bb4b7c89b6db0
Author: Konstantin Belousov <kib@FreeBSD.org>
AuthorDate: 2022-10-10 23:08:55 +0000
Commit: Konstantin Belousov <kib@FreeBSD.org>
CommitDate: 2023-01-20 03:21:57 +0000

amd64: for small cores, use (big hammer) INVPCID_CTXGLOB instead of INVLPG

PR: 261169, 266145
Tested by: pho

(cherry picked from commit cde70e312c3fde5b37a29be1dacb7fde9a45b94a)

sys/amd64/amd64/initcpu.c | 5 +++++
sys/amd64/amd64/mp_machdep.c | 16 +++++++++++-----
sys/amd64/amd64/pmap.c | 36 +++++++++++++++++++++++++++++-------
sys/amd64/include/pcpu.h | 3 ++-
sys/amd64/include/pmap.h | 20 ++++++++++++++++++++
5 files changed, 67 insertions(+), 13 deletions(-)
Created attachment 236272
core.text from crash

I have had multiple crashes on a new Alder Lake system (Lenovo T16 (Intel)). The backtraces all showed vfs, and a couple also included soft updates. Soft updates are further implicated by the fact that "fsck -l /" consistently reported "SOFTUPDATE INCONSISTENCY" and required a full fsck, and the crashes managed to corrupt the pkg database at least three times. In all cases, I was building ports on the system.

Due to this being a fresh install, I was unable to get a dump, but I finally got one this morning. I disabled soft updates and have yet to have a crash, yet another indication that soft updates are the trigger. Possibly this is a race caused by the different speeds and thread counts of the different cores.

FreeBSD 14.0-CURRENT #0 main-n257619-c1a0ab5ec52: Fri Aug 26 09:37:24 UTC 2022

I will attach the core.text file to this report.