Bug 261169 - Intel Alder Lake: data corruption with Read&Write files to FAT32 or UFS
Summary: Intel Alder Lake: data corruption with Read&Write files to FAT32 or UFS
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 13.0-STABLE
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: Konstantin Belousov
URL:
Keywords: crash
Depends on:
Blocks:
 
Reported: 2022-01-13 08:55 UTC by Vico
Modified: 2024-07-03 22:52 UTC

See Also:


Attachments
i3-N305 cpuid -r (45.68 KB, text/plain)
2024-06-30 21:54 UTC, Wes Morgan
cpuid-etallen -r for N97 (23.22 KB, text/plain)
2024-07-01 14:18 UTC, Jeff Cunningham

Description Vico 2022-01-13 08:55:47 UTC
On an Intel Alder Lake processor with both P-cores and E-cores (i7-12700T), files copied to a FAT32 partition are corrupted about 50% of the time, while ZFS is fine. After disabling the E-cores in the code by restricting the maximum CPU number, the issue is gone. Processors without E-cores, such as the i7-12400, do not show the issue.

HW ENV:
CPU: Intel AlderLake 12th Gen i7-12700T
Disk: NVME SSD

There are 3 ways to reproduce this issue:
1. Create a FreeBSD 13 USB installer, install FreeBSD on UFS, and select the source and ports distributions for installation; the txz package verification fails.

2. Boot to a shell from the USB installer, mount a FAT32 partition (on the SSD), and copy a 300MB file to it. Compare the sha256 checksums of the source and destination files; they differ about 50% of the time. Alternatively, if a 300MB file already exists on the FAT32 partition, mount the partition and run 'sha256 file.tgz': the first run gives a wrong checksum, but the second run gives the correct one.

3. Install FreeBSD 13 on ZFS, which works well. Then boot into FreeBSD, disable swap, format the swap partition as FAT32, and repeat the test above.
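
The checksum comparison from method 2 can be sketched as a small shell function. This is only an illustration, not the exact commands used for this report: the names 'digest' and 'copy_and_verify' are hypothetical, and the 'sha256sum' fallback is an assumption added so the sketch also runs on systems without FreeBSD's sha256(1).

```sh
#!/bin/sh
# Print only the SHA-256 digest of a file.
# FreeBSD ships sha256(1); the sha256sum fallback is an assumption for other systems.
digest() {
    if command -v sha256 >/dev/null 2>&1; then
        sha256 -q "$1"
    else
        sha256sum "$1" | cut -d ' ' -f 1
    fi
}

# Copy src to dst, then compare the two digests (hypothetical helper).
# Returns 0 when the copy matches, 1 on a mismatch or a failed copy.
copy_and_verify() {
    src=$1
    dst=$2
    cp "$src" "$dst" || return 1
    if [ "$(digest "$src")" = "$(digest "$dst")" ]; then
        echo "OK: $dst"
        return 0
    fi
    echo "MISMATCH: $dst"
    return 1
}
```

For example, 'copy_and_verify /tmp/file.tgz /mnt/fat32/file.tgz', where both paths are placeholders. Since the report notes that the first read after mounting can differ from the second, remounting the partition between runs is probably needed to avoid reading cached data.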
Comment 1 Vico 2022-02-14 09:11:27 UTC
Hi,
Any updates on the P-core/E-core issue?
Comment 2 Amar Takhar 2022-04-17 00:44:55 UTC
I ran into this exact issue with 13.1-RC3 on an i9-12900KF (8 P-cores, 8 E-cores): random corruption on UFS. UEFI doesn't work; I had to install with CSM enabled in the BIOS. The system has a Samsung 980 Pro NVMe drive and an Asus Prime Z690-P motherboard.
Comment 3 Amar Takhar 2022-05-17 04:17:08 UTC
As an update to this: I updated my BIOS to v1401 (2022-03 update) and the P/E core UFS corruption seems to have gone away.

However, a new issue has shown itself: when I'm playing videos, other parts of my interface skip around.  For instance, as I'm typing this it's jumping around quite a lot.  This only seems to happen when playing videos in Firefox; if I'm playing an external video it's fine.  I'll open a new ticket for this issue after I observe it for a day or two.
Comment 4 Amar Takhar 2022-08-20 18:10:23 UTC
I just realised I forgot to update this ticket.  The corruption is still there so my E cores are still disabled.
Comment 5 Vico 2022-11-14 06:09:20 UTC
Amar, you needn't disable the E-cores in the BIOS; the workaround is to disable PCID.
Comment 6 Amar Takhar 2022-12-14 20:21:42 UTC
(In reply to Vico from comment #5)

Disabling PCID did not change anything; I still had the same issues.  A patch from Konstantin Belousov fixed all my issues, including audio.
Comment 7 Amar Takhar 2022-12-14 20:22:12 UTC
This patch from Konstantin Belousov fixed these problems for me:

  https://kib.kiev.ua/git/gitweb.cgi?p=deviant3.git;a=commit;h=5d72240a8777b26d5e0a7d2d26bb919d05f60002

It also fixed all my audio issues; everything works fantastic.
Comment 8 Konstantin Belousov freebsd_committer freebsd_triage 2022-12-21 11:31:51 UTC
https://reviews.freebsd.org/D37770
Comment 9 commit-hook freebsd_committer freebsd_triage 2022-12-31 22:11:25 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=cde70e312c3fde5b37a29be1dacb7fde9a45b94a

commit cde70e312c3fde5b37a29be1dacb7fde9a45b94a
Author:     Konstantin Belousov <kib@FreeBSD.org>
AuthorDate: 2022-10-10 23:08:55 +0000
Commit:     Konstantin Belousov <kib@FreeBSD.org>
CommitDate: 2022-12-31 22:09:45 +0000

    amd64: for small cores, use (big hammer) INVPCID_CTXGLOB instead of INVLPG

    A hypothetical CPU bug makes invalidation of global PTEs using INVLPG
    in pcid mode unreliable, it seems.  The workaround is applied for all
    CPUs with small cores, since we do not know the scope of the issue, and
    the right fix.

    Reviewed by:    alc (previous version)
    Discussed with: emaste, markj
    Tested by:      karels
    PR:     261169, 266145
    Sponsored by:   The FreeBSD Foundation
    MFC after:      1 week
    Differential revision:  https://reviews.freebsd.org/D37770

 sys/amd64/amd64/initcpu.c    |  5 +++++
 sys/amd64/amd64/mp_machdep.c | 16 +++++++++++-----
 sys/amd64/amd64/pmap.c       | 36 +++++++++++++++++++++++++++++-------
 sys/amd64/include/pcpu.h     |  3 ++-
 sys/amd64/include/pmap.h     | 20 ++++++++++++++++++++
 5 files changed, 67 insertions(+), 13 deletions(-)
Comment 10 commit-hook freebsd_committer freebsd_triage 2023-01-20 03:25:01 UTC
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=567cc4e6bfd92d7351e385569f2bb4b7c89b6db0

commit 567cc4e6bfd92d7351e385569f2bb4b7c89b6db0
Author:     Konstantin Belousov <kib@FreeBSD.org>
AuthorDate: 2022-10-10 23:08:55 +0000
Commit:     Konstantin Belousov <kib@FreeBSD.org>
CommitDate: 2023-01-20 03:21:57 +0000

    amd64: for small cores, use (big hammer) INVPCID_CTXGLOB instead of INVLPG

    PR:     261169, 266145
    Tested by:      pho

    (cherry picked from commit cde70e312c3fde5b37a29be1dacb7fde9a45b94a)

 sys/amd64/amd64/initcpu.c    |  5 +++++
 sys/amd64/amd64/mp_machdep.c | 16 +++++++++++-----
 sys/amd64/amd64/pmap.c       | 36 +++++++++++++++++++++++++++++-------
 sys/amd64/include/pcpu.h     |  3 ++-
 sys/amd64/include/pmap.h     | 20 ++++++++++++++++++++
 5 files changed, 67 insertions(+), 13 deletions(-)
Comment 11 Mark Linimon freebsd_committer freebsd_triage 2024-01-10 04:29:44 UTC
^Triage: assign to committer that resolved.
Comment 12 Wes Morgan 2024-06-13 02:50:28 UTC
This bug seems to have reappeared, or there is a new, related one. I have an Alder Lake-N i3-N305 system, which only has 8 E-cores; no P-cores or hyperthreading.

Copying a file repeatedly on a UFS filesystem will result in checksum failures after about 10-20 copies, and continuing will eventually trigger a kernel panic "ffs_blkfree_cg: freeing free block". It does not happen with tmpfs, zfs, or "ufs to tmpfs" (the entire file gets cached quickly).

I don't think it is possible to disable the VFS read cache to repeatedly read the same file and test for only read errors, but copying the file and checking the new copy is eventually 100% reproducible, sometimes resulting in a filesystem corrupted in a way that fsck doesn't find it, but unlinking the file will trigger a panic.
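
A rough sketch of the repeated-copy test described above, for anyone who wants to try it. It is not the exact procedure used here: the function name and file layout are hypothetical, and chaining each copy from the previous one (so that every read hits freshly written data) is an assumption.

```sh
#!/bin/sh
# Copy a file in a chain (copy.1 from src, copy.2 from copy.1, ...) and
# compare every new copy against the original with cmp(1).
# Prints the index of the first mismatching copy, or 0 if all copies matched.
# All names here are hypothetical.
repeat_copy_check() {
    src=$1
    workdir=$2
    rounds=$3
    prev=$src
    i=1
    while [ "$i" -le "$rounds" ]; do
        cur="$workdir/copy.$i"
        cp "$prev" "$cur" || return 2
        if ! cmp -s "$src" "$cur"; then
            echo "$i"
            return 1
        fi
        prev=$cur
        i=$((i + 1))
    done
    echo 0
}
```

Something like 'repeat_copy_check bigfile /mnt/ufs 20' would cover the 10-20 copies mentioned above; the arguments are placeholders. A kernel panic will of course end the run before anything is printed.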

The pcid invlpg workaround seems to be active:

[morganw@boron:~$]: sysctl -a |grep pcid
vm.pmap.pcid_save_cnt: 55453
vm.pmap.pcid_invlpg_workaround: 1
vm.pmap.invpcid_works: 1
vm.pmap.pcid_enabled: 1


CPU: Intel(R) Core(TM) i3-N305 (998.40-MHz K8-class CPU)
  Origin="GenuineIntel"  Id=0xb06e0  Family=0x6  Model=0xbe  Stepping=0
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x7ffafbbf<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,EST,TM2,SSSE3,SDBG,FMA,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,TSCDLT,AESNI,XSAVE,OSXSAVE,AVX,F16C,RDRAND>
  AMD Features=0x2c100800<SYSCALL,NX,Page1GB,RDTSCP,LM>
  AMD Features2=0x121<LAHF,ABM,Prefetch>
  Structured Extended Features=0x239ca7eb<FSGSBASE,TSCADJ,BMI1,AVX2,FDPEXC,SMEP,BMI2,ERMS,INVPCID,NFPUSG,PQE,RDSEED,ADX,SMAP,CLFLUSHOPT,CLWB,PROCTRACE,SHA>
  Structured Extended Features2=0x98c007bc<UMIP,PKU,OSPKE,WAITPKG,GFNI,VAES,VPCLMULQDQ,RDPID,MOVDIRI,MOVDIR64B>
  Structured Extended Features3=0xfc184410<FSRM,MD_CLEAR,IBT,IBPB,STIBP,L1DFL,ARCH_CAP,CORE_CAP,SSBD>
  XSAVE Features=0xf<XSAVEOPT,XSAVEC,XINUSE,XSAVES>
  IA32_ARCH_CAPS=0x180fd6b<RDCL_NO,IBRS_ALL,SKIP_L1DFL_VME,MDS_NO,TAA_NO>
  VT-x: (disabled in BIOS) PAT,HLT,MTF,PAUSE,EPT,UG,VPID,VID,PostIntr
  TSC: P-state invariant, performance statistics
Comment 13 Jeff Cunningham 2024-06-30 16:02:53 UTC
I'm running 14.1-Release, and I'm seeing the same behavior on an E-Core only Alder Lake.  In my case, it's the N97, which has 4 E-cores.  Installing 14.1 onto a UFS filesystem would never complete successfully, but installing to a ZFS filesystem would always complete and succeed.

After a successful ZFS install, I can recreate the issue by creating a new UFS partition and filesystem, and doing some I/O on this UFS filesystem.  Typically, it only requires that I unfurl a large tarball, and then remove the directory created by the untar.  Kernel panics result, just as Wes describes.
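
The untar-and-remove cycle above could be scripted roughly as follows. This is a sketch only: the function name, the scratch directory layout, and the round count are hypothetical and not the exact commands that were run.

```sh
#!/bin/sh
# Extract a tarball into a scratch directory and remove the result, several times.
# Stops at the first failing mkdir, tar, or rm; all names are hypothetical.
untar_remove_loop() {
    tarball=$1
    scratch=$2
    rounds=$3
    i=1
    while [ "$i" -le "$rounds" ]; do
        mkdir -p "$scratch/round.$i" || return 1
        tar -xf "$tarball" -C "$scratch/round.$i" || return 1
        rm -rf "$scratch/round.$i" || return 1
        echo "round $i ok"
        i=$((i + 1))
    done
}
```

The scratch directory would sit on the UFS filesystem under test, for example 'untar_remove_loop /root/test.tar /mnt/ufs 12' with placeholder paths.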

root@flub:/usr/src/sys/amd64/amd64 # sysctl -a |grep pcid
vm.pmap.pcid_save_cnt: 576070
vm.pmap.pcid_invlpg_workaround: 1
vm.pmap.invpcid_works: 1
vm.pmap.pcid_enabled: 1
root@flub:/usr/src/sys/amd64/amd64 #

CPU: Intel(R) N97 (1996.80-MHz K8-class CPU)
  Origin="GenuineIntel"  Id=0xb06e0  Family=0x6  Model=0xbe  Stepping=0
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x7ffafbbf<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,EST,TM2,SSSE3,SDBG,FMA,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,TSCDLT,AESNI,XSAVE,OSXSAVE,AVX,F16C,RDRAND>
  AMD Features=0x2c100800<SYSCALL,NX,Page1GB,RDTSCP,LM>
  AMD Features2=0x121<LAHF,ABM,Prefetch>
  Structured Extended Features=0x239ca7eb<FSGSBASE,TSCADJ,BMI1,AVX2,FDPEXC,SMEP,BMI2,ERMS,INVPCID,NFPUSG,PQE,RDSEED,ADX,SMAP,CLFLUSHOPT,CLWB,PROCTRACE,SHA>
  Structured Extended Features2=0x98c007bc<UMIP,PKU,OSPKE,WAITPKG,GFNI,VAES,VPCLMULQDQ,RDPID,MOVDIRI,MOVDIR64B>
  Structured Extended Features3=0xfc184410<FSRM,MD_CLEAR,IBT,IBPB,STIBP,L1DFL,ARCH_CAP,CORE_CAP,SSBD>
  XSAVE Features=0xf<XSAVEOPT,XSAVEC,XINUSE,XSAVES>
  IA32_ARCH_CAPS=0x180fd6b<RDCL_NO,IBRS_ALL,SKIP_L1DFL_VME,MDS_NO,TAA_NO>
  VT-x: PAT,HLT,MTF,PAUSE,EPT,UG,VPID,VID,PostIntr
  TSC: P-state invariant, performance statistics

root@flub:/usr/src/sys/amd64/amd64 # freebsd-version
14.1-RELEASE-p1
root@flub:/usr/src/sys/amd64/amd64 #
Comment 14 Konstantin Belousov freebsd_committer freebsd_triage 2024-06-30 21:43:47 UTC
Please install sysutils/cpuid
and show the output from 'cpuid-etallen -r'.
Comment 15 Wes Morgan 2024-06-30 21:54:09 UTC
Created attachment 251803 [details]
i3-N305 cpuid -r

Output of cpuid-etallen -r
Comment 16 Konstantin Belousov freebsd_committer freebsd_triage 2024-06-30 22:14:48 UTC
I am interested in the result of the following experiment:

apply the following patch
diff --git a/sys/amd64/amd64/initcpu.c b/sys/amd64/amd64/initcpu.c
index c5266ffcc235..99e7e9c38da6 100644
--- a/sys/amd64/amd64/initcpu.c
+++ b/sys/amd64/amd64/initcpu.c
@@ -265,7 +265,7 @@ cpu_init_small_core(void)
 	if (pmap_pcid_enabled && invpcid_works &&
 	    pmap_pcid_invlpg_workaround_uena) {
 		PCPU_SET(pcid_invlpg_workaround, 1);
-		pmap_pcid_invlpg_workaround = 1;
+		atomic_add_int(&pmap_pcid_invlpg_workaround, 1);
 	}
 }
 
then show me the value of vm.pmap.pcid_invlpg_workaround from the patched
kernel.
Comment 17 Wes Morgan 2024-06-30 22:43:36 UTC
(In reply to Konstantin Belousov from comment #16)

Shall I also test a UFS filesystem for corruption with that patch applied?
Comment 18 Wes Morgan 2024-07-01 03:45:51 UTC
The "pcid_invlpg" value now has the cpu count instead of 1.

vm.pmap.pcid_save_cnt: 49860
vm.pmap.pcid_invlpg_workaround: 8
vm.pmap.invpcid_works: 1
vm.pmap.pcid_enabled: 1
hw.vmm.vmx.cap.invpcid: 1
Comment 19 Jeff Cunningham 2024-07-01 14:18:15 UTC
Created attachment 251817 [details]
cpuid-etallen -r for N97
Comment 20 Jeff Cunningham 2024-07-01 15:16:27 UTC
After patch application:

root@flub:~ # sysctl vm.pmap.pcid_invlpg_workaround
vm.pmap.pcid_invlpg_workaround: 4
root@flub:~ #
Comment 21 Konstantin Belousov freebsd_committer freebsd_triage 2024-07-03 16:53:20 UTC
So the fact that the counter reaches the number of CPUs means that the
workaround was applied correctly on all cores.  Perhaps there is one more
hw issue with small cores.
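
The same check can be sketched in shell. It only makes sense on a kernel carrying the experimental patch from comment 16 (an unpatched kernel reports 1), and only on E-core-only systems such as the N305 and N97, where every CPU is a small core. The helper name is hypothetical, and the values are passed as arguments so the comparison can be tried without sysctl.

```sh
#!/bin/sh
# Compare the workaround counter with the CPU count (hypothetical helper).
# Arguments are passed in so the logic can be tried without FreeBSD's sysctl.
workaround_covers_all_cpus() {
    counter=$1
    ncpu=$2
    if [ "$counter" -eq "$ncpu" ]; then
        echo "workaround set on all $ncpu CPUs"
        return 0
    fi
    echo "workaround set on $counter of $ncpu CPUs"
    return 1
}

# On a kernel with the comment 16 patch, one might call:
#   workaround_covers_all_cpus "$(sysctl -n vm.pmap.pcid_invlpg_workaround)" "$(sysctl -n hw.ncpu)"
```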

BTW, do you have the latest microcode update applied?
Comment 22 Jeff Cunningham 2024-07-03 21:33:45 UTC
I hadn't considered updating the microcode -- that seems to have fixed the problem.

The version sits at 00000010 before the update:

root@flub:/usr/ports/sysutils/cpupdate # cpupdate -i
Found CPU(s) from Intel
Core 0 to 3: CPUID: b06e0  Fam 06  Mod be  Step 00  Flag 01 uCode 00000010
root@flub:/usr/ports/sysutils/cpupdate #

And version 00000017 after:

root@flub:/mnt # cpupdate -i
Found CPU(s) from Intel
Core 0 to 3: CPUID: b06e0  Fam 06  Mod be  Step 00  Flag 01 uCode 00000017
root@flub:/mnt #
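
To compare the revision before and after an update without reading the full line, the uCode field can be pulled out with sed. This is a sketch based only on the output format shown above; the helper name is hypothetical, and other cpupdate versions may print a different layout.

```sh
#!/bin/sh
# Extract the microcode revision from "cpupdate -i" style output on stdin.
# Prints the hex value following "uCode"; lines without that field print nothing.
# Hypothetical helper; the pattern relies on the format quoted in this comment.
ucode_revision() {
    sed -n 's/.*uCode \([0-9A-Fa-f][0-9A-Fa-f]*\).*/\1/p'
}
```

Usage would be along the lines of 'cpupdate -i | ucode_revision'.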

I've now unfurled and removed my test tarball a dozen times with no errors.  I would previously get a kernel panic always within the first or second try.  To note, this is with running the kernel with your workaround patch listed in comment 16.

I'll update here if I can generate any new kernel panics with UFS I/O.
Comment 23 Konstantin Belousov freebsd_committer freebsd_triage 2024-07-03 22:32:09 UTC
(In reply to Jeff Cunningham from comment #22)
The last thing of general public interest would be to disable the sw
workaround in the kernel and see if updated microcode is enough to get
rid of the problem (it should be).
Comment 24 Jeff Cunningham 2024-07-03 22:52:18 UTC
Yes, you are correct.  I booted back to the unpatched kernel, and I am unable to produce kernel panics as before.  Seems the complete fix in my case was to simply update the microcode (I had wrongly assumed your patch was doing something besides a debug print).  Thanks for the help!