Bug 248514

Summary: jedec_dimm(4) and imcsmb(4): support of memory controllers in Skylake and newer Intel CPUs
Product: Base System
Reporter: Vladimir Druzenko <vvd>
Component: kern
Assignee: freebsd-bugs (Nobody) <bugs>
Status: New ---
Severity: Affects Many People
CC: rb, rpokala, rpokala, ruben
Priority: ---
Version: 12.2-RELEASE
Hardware: amd64
OS: Any
Attachments:
Description: Support Skylake-Xeon in imcsmb(4) (take 1)
Flags: none

Description Vladimir Druzenko freebsd_committer freebsd_triage 2020-08-07 12:07:23 UTC
I have a 1st-generation Xeon Scalable 3104 (Skylake), and after running "kldload smb imcsmb jedec_dimm" I can't find any of the related sysctl variables described in the manual pages.

Datasheets:

* Skylake:
xeon-scalable-mem-ds-vol2-336062-r003.pdf: https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/xeon-scalable-datasheet-vol-1.pdf
xeon-scalable-family-vol-1-ds-336063-002.pdf: https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/xeon-scalable-datasheet-vol-2.pdf
(Note: the volume numbers in the file names and in the URLs above do not match.)

* Cascade Lake:
2nd-gen-xeon-scalable-datasheet-vol-1.pdf: https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/2nd-gen-xeon-scalable-datasheet-vol-1.pdf
2nd-gen-xeon-scalable-datasheet-vol-2.pdf: https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/2nd-gen-xeon-scalable-datasheet-vol-2.pdf

* Cooper Lake:
I can't find datasheets for the 3rd generation of Xeon Scalable.

Thanks!
Comment 1 Ravi Pokala 2020-08-07 16:43:25 UTC
imcsmb(4) has not been updated to work with *Lake CPUs. I think I started taking a swing at this sometime last year, but eventually put it on hold because I do not have access to systems which both have those CPUs, and for which I know the SMBus address map.

I'll see if I can dig up my work-in-progress and attach it here. If you can test it for me, then we might be able to finish this off together.
Comment 2 Vladimir Druzenko freebsd_committer freebsd_triage 2020-08-07 18:16:07 UTC
(In reply to Ravi Pokala from comment #1)
Of course I can test!
A patch against 12.1 would be best: I'll rebuild the module and load it for testing.
But if you need HEAD only, I can boot it from a LiveUSB (https://download.freebsd.org/ftp/snapshots/amd64/amd64/ISO-IMAGES/13.0/) and kldload the modules.
Comment 3 Ravi Pokala 2020-08-11 04:07:03 UTC
Created attachment 217141 [details]
Support Skylake-Xeon in imcsmb(4) (take 1)

(In reply to VVD from comment #2)

The attached patch should apply cleanly to both -HEAD and stable/12.
Comment 4 Vladimir Druzenko freebsd_committer freebsd_triage 2020-08-11 12:52:23 UTC
(In reply to Ravi Pokala from comment #3)
Thanks!

kldload imcsmb.ko:
imcsmb_pci0: <Intel Skylake Xeon iMC 0 SMBus controllers> at device 30.5 numa-domain 0 on pci5
imcsmb0: <iMC SMBus controller> numa-domain 0 on imcsmb_pci0
smbus1: <System Management Bus> numa-domain 0 on imcsmb0
smb1: <SMBus generic I/O> on smbus1
imcsmb1: <iMC SMBus controller> numa-domain 0 on imcsmb_pci0
smbus2: <System Management Bus> numa-domain 0 on imcsmb1
smb2: <SMBus generic I/O> on smbus2
imcsmb_pci1: <Intel Skylake Xeon iMC 1 SMBus controllers> at device 30.6 numa-domain 0 on pci5
imcsmb2: <iMC SMBus controller> numa-domain 0 on imcsmb_pci1
smbus3: <System Management Bus> numa-domain 0 on imcsmb2
smb3: <SMBus generic I/O> on smbus3
imcsmb3: <iMC SMBus controller> numa-domain 0 on imcsmb_pci1
smbus4: <System Management Bus> numa-domain 0 on imcsmb3
smb4: <SMBus generic I/O> on smbus4

But after kldload jedec_dimm.ko:
sysctl -a | grep jedec | wc -l
       0

Added to /boot/device.hints:
hint.jedec_dimm.0.at="smbus1"
hint.jedec_dimm.0.addr="0xa0"
hint.jedec_dimm.0.slotid="Silkscreen"

After kldunload jedec_dimm imcsmb and kldload again, nothing changed.
Comment 5 Ravi Pokala 2020-08-11 18:47:53 UTC
(In reply to VVD from comment #4)

Are you sure that smbus1:0xa0 is the proper bus:address for the DIMM in question?

For experimentation purposes, you could configure the kernel environment to look at all possible addresses:

    kldunload imcsmb.ko smbus.ko jedec_dimm.ko
    unit=0
    for bus in 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ; do
        for addr in 0xa0 0xa2 0xa4 0xa6 0xa8 0xaa 0xac 0xae ; do
            kenv hint.jedec_dimm.${unit}.at="smbus${bus}"
            kenv hint.jedec_dimm.${unit}.addr="${addr}"
            unit=$(( ${unit} + 1 ))
        done
    done
    kldload /path/to/imcsmb.ko /boot/kernel/smbus.ko /boot/kernel/jedec_dimm.ko

Can you try that and let me know if any of them work? When you're done, you can run it again with `kenv -u' to remove all the extra entries, then configure device.hints for the real values.
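
The cleanup step could look roughly like the sketch below. It assumes the same 16 buses and 8 addresses as the loop above, and it only prints the `kenv -u` commands so they can be reviewed before anything is removed. The function name jedec_hint_cleanup_cmds is made up for illustration and is not part of any FreeBSD tool.

```shell
#!/bin/sh
# Sketch: print one `kenv -u` command for every hint the experiment loop set.
# Assumes the same bus and address ranges as that loop; the function name is
# hypothetical.
jedec_hint_cleanup_cmds() {
    unit=0
    for bus in 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ; do
        for addr in 0xa0 0xa2 0xa4 0xa6 0xa8 0xaa 0xac 0xae ; do
            # Each unit got two hints: the bus (.at) and the address (.addr).
            echo "kenv -u hint.jedec_dimm.${unit}.at"
            echo "kenv -u hint.jedec_dimm.${unit}.addr"
            unit=$(( unit + 1 ))
        done
    done
}

# Print the commands for review.
jedec_hint_cleanup_cmds
```

Once the list looks right, piping it to sh (jedec_hint_cleanup_cmds | sh) would actually unset the hints. Printing first is a deliberate choice here, since the kernel environment may contain hints you set for other purposes.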
Comment 6 Vladimir Druzenko freebsd_committer freebsd_triage 2020-08-11 19:40:51 UTC
(In reply to Ravi Pokala from comment #5)
imcsmb_pci0: <Intel Skylake Xeon iMC 0 SMBus controllers> at device 30.5 numa-domain 0 on pci5
imcsmb0: <iMC SMBus controller> numa-domain 0 on imcsmb_pci0
smbus1: <System Management Bus> numa-domain 0 on imcsmb0
smb1: <SMBus generic I/O> on smbus1
smbus1: <unknown device> at addr 0xa0
smbus1: <unknown device> at addr 0xa2
smbus1: <unknown device> at addr 0xa4
smbus1: <unknown device> at addr 0xa6
smbus1: <unknown device> at addr 0xa8
smbus1: <unknown device> at addr 0xaa
smbus1: <unknown device> at addr 0xac
smbus1: <unknown device> at addr 0xae
imcsmb1: <iMC SMBus controller> numa-domain 0 on imcsmb_pci0
smbus2: <System Management Bus> numa-domain 0 on imcsmb1
smb2: <SMBus generic I/O> on smbus2
smbus2: <unknown device> at addr 0xa0
smbus2: <unknown device> at addr 0xa2
smbus2: <unknown device> at addr 0xa4
smbus2: <unknown device> at addr 0xa6
smbus2: <unknown device> at addr 0xa8
smbus2: <unknown device> at addr 0xaa
smbus2: <unknown device> at addr 0xac
smbus2: <unknown device> at addr 0xae
imcsmb_pci1: <Intel Skylake Xeon iMC 1 SMBus controllers> at device 30.6 numa-domain 0 on pci5
imcsmb2: <iMC SMBus controller> numa-domain 0 on imcsmb_pci1
smbus3: <System Management Bus> numa-domain 0 on imcsmb2
smb3: <SMBus generic I/O> on smbus3
smbus3: <unknown device> at addr 0xa0
smbus3: <unknown device> at addr 0xa2
smbus3: <unknown device> at addr 0xa4
smbus3: <unknown device> at addr 0xa6
smbus3: <unknown device> at addr 0xa8
smbus3: <unknown device> at addr 0xaa
smbus3: <unknown device> at addr 0xac
smbus3: <unknown device> at addr 0xae
imcsmb3: <iMC SMBus controller> numa-domain 0 on imcsmb_pci1
smbus4: <System Management Bus> numa-domain 0 on imcsmb3
smb4: <SMBus generic I/O> on smbus4
smbus4: <unknown device> at addr 0xa0
smbus4: <unknown device> at addr 0xa2
smbus4: <unknown device> at addr 0xa4
smbus4: <unknown device> at addr 0xa6
smbus4: <unknown device> at addr 0xa8
smbus4: <unknown device> at addr 0xaa
smbus4: <unknown device> at addr 0xac
smbus4: <unknown device> at addr 0xae
jedec_dimm0: failed to read dram_type
jedec_dimm1: failed to read dram_type
jedec_dimm2: failed to read dram_type
jedec_dimm3: failed to read dram_type
jedec_dimm4: failed to read dram_type
jedec_dimm5: failed to read dram_type
jedec_dimm6: failed to read dram_type
jedec_dimm7: failed to read dram_type
imcsmb0: transfer timeout
jedec_dimm8: failed to read dram_type
imcsmb0: transfer timeout
jedec_dimm9: failed to read dram_type
imcsmb0: transfer timeout
jedec_dimm10: failed to read dram_type
imcsmb0: transfer timeout
jedec_dimm11: failed to read dram_type
imcsmb0: transfer timeout
jedec_dimm12: failed to read dram_type
imcsmb0: transfer timeout
jedec_dimm13: failed to read dram_type
imcsmb0: transfer timeout
jedec_dimm14: failed to read dram_type
imcsmb0: transfer timeout
jedec_dimm15: failed to read dram_type
imcsmb1: transfer timeout
jedec_dimm16: failed to read dram_type
imcsmb1: transfer timeout
jedec_dimm17: failed to read dram_type
imcsmb1: transfer timeout
jedec_dimm18: failed to read dram_type
imcsmb1: transfer timeout
jedec_dimm19: failed to read dram_type
imcsmb1: transfer timeout
jedec_dimm20: failed to read dram_type
imcsmb1: transfer timeout
jedec_dimm21: failed to read dram_type
imcsmb1: transfer timeout
jedec_dimm22: failed to read dram_type
imcsmb1: transfer timeout
jedec_dimm23: failed to read dram_type
imcsmb2: transfer timeout
jedec_dimm24: failed to read dram_type
imcsmb2: transfer timeout
jedec_dimm25: failed to read dram_type
imcsmb2: transfer timeout
jedec_dimm26: failed to read dram_type
imcsmb2: transfer timeout
jedec_dimm27: failed to read dram_type
imcsmb2: transfer timeout
jedec_dimm28: failed to read dram_type
imcsmb2: transfer timeout
jedec_dimm29: failed to read dram_type
imcsmb2: transfer timeout
jedec_dimm30: failed to read dram_type
imcsmb2: transfer timeout
jedec_dimm31: failed to read dram_type
imcsmb3: transfer timeout
jedec_dimm32: failed to read dram_type
imcsmb3: transfer timeout
jedec_dimm33: failed to read dram_type
imcsmb3: transfer timeout
jedec_dimm34: failed to read dram_type
imcsmb3: transfer timeout
jedec_dimm35: failed to read dram_type
imcsmb3: transfer timeout
jedec_dimm36: failed to read dram_type
imcsmb3: transfer timeout
jedec_dimm37: failed to read dram_type
imcsmb3: transfer timeout
jedec_dimm38: failed to read dram_type
imcsmb3: transfer timeout
jedec_dimm39: failed to read dram_type

"sysctl -a | grep jedec" still empty.
Comment 7 Vladimir Druzenko freebsd_committer freebsd_triage 2020-08-11 20:09:57 UTC
It looks like the addresses "0xa0 0xa2 0xa4 0xa6 0xa8 0xaa 0xac 0xae" are incorrect.

I'm on IRC (freenode and efnet); we could discuss this faster there.
Comment 8 Vladimir Druzenko freebsd_committer freebsd_triage 2020-08-11 20:21:42 UTC
Part of the dmidecode output:

Handle 0x002B, DMI type 17, 84 bytes
Memory Device
        Array Handle: 0x0029
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 16384 MB
        Form Factor: DIMM
        Set: None
        Locator: P1-DIMMA1
        Bank Locator: P0_Node0_Channel0_Dimm0
        Type: DDR4
        Type Detail: Synchronous Registered (Buffered)
        Speed: 2666 MT/s
        Manufacturer: Samsung
        Serial Number: 37984D9E
        Asset Tag: P1-DIMMA1_AssetTag (date:17/48)
        Part Number: M393A2K40BB2-CTD    
        Rank: 1
        Configured Memory Speed: 2133 MT/s
        Minimum Voltage: 1.2 V
        Maximum Voltage: 1.2 V
        Configured Voltage: 1.2 V
        Memory Technology: DRAM
        Memory Operating Mode Capability: Volatile memory
        Firmware Version: 0000 
        Module Manufacturer ID: Bank 1, Hex 0xCE
        Module Product ID: Unknown
        Memory Subsystem Controller Manufacturer ID: Unknown
        Memory Subsystem Controller Product ID: Unknown
        Non-Volatile Size: None
        Volatile Size: 16 GB
        Cache Size: None
        Logical Size: None
Comment 9 Vladimir Druzenko freebsd_committer freebsd_triage 2020-12-13 20:46:22 UTC
Maybe you need remote access to the hardware?
Comment 10 Ravi Pokala 2020-12-14 08:21:28 UTC
Sorry, this fell off my radar.

The problem here is that the iMC-SMBus controller was not really intended for use by the OS. During POST, the memory controller uses it to read the SPD information from the DIMMs and configure itself to use their DRAM; during normal operation, the system firmware (the Management Engine?) uses it to read the TSOD temperature from the DIMMs. The hardware has a BUSY indicator, but it appears to be advisory, and it's possible that firmware does not honor it, which could allow firmware-initiated operations to stomp on OS-initiated operations.

And to top it off, I know Intel board firmware disabled OS access to the iMC-SMBus controllers on *Well outright, as part of their security-hardening fixes after Spectre-Meltdown; I suspect other board vendors followed suit. It's possible that for *Lake, they disabled it from the start.

The upshot of all this is that the controller might not be usable by the OS on *Lake CPUs.

Try adding this line near the start of imcsmb_transfer():

================================================================
	orig_cntl_val = pci_read_config(sc->imcsmb_pci,
	    sc->regs->smb_cntl, 4);
+	device_printf(sc->dev, "cntl: 0x%08x\n", orig_cntl_val);
	cntl_val = orig_cntl_val;
================================================================

I'm particularly interested in bit 26 (0x04000000), SMB_DIS_WRT; if it is set, the BIOS has locked the OS out from using the iMC-SMBus controller, and that's game over. :-/
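
As a quick way to read the printed value, the bit-26 check can be sketched in shell arithmetic. The mask 0x04000000 is the one given above; the function name smb_dis_wrt_is_set and the two sample values are made up for illustration and are not taken from any real system.

```shell
#!/bin/sh
# Sketch: test bit 26 (0x04000000, SMB_DIS_WRT) of a "cntl:" value printed by
# the debug device_printf(). The function name is hypothetical.
smb_dis_wrt_is_set() {
    # $1 is the register value in hex; succeeds only if bit 26 is set.
    [ $(( $1 & 0x04000000 )) -ne 0 ]
}

# Illustrative values only.
for val in 0x04000000 0x00000100 ; do
    if smb_dis_wrt_is_set "$val" ; then
        echo "$val: SMB_DIS_WRT set"
    else
        echo "$val: SMB_DIS_WRT clear"
    fi
done
```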

While I appreciate your offer of remote access, I don't have any time to dig into this right now, and probably won't any time in the next few months.
Comment 11 Vladimir Druzenko freebsd_committer freebsd_triage 2021-01-06 05:45:51 UTC
(In reply to Ravi Pokala from comment #10)

imcsmb_pci0: <Intel Skylake Xeon iMC 0 SMBus controllers> at device 30.5 numa-domain 0 on pci5
imcsmb0: <iMC SMBus controller> numa-domain 0 on imcsmb_pci0
smbus0: <System Management Bus> numa-domain 0 on imcsmb0
smbus0: <unknown device> at addr 0xa0
smbus0: <unknown device> at addr 0xa2
smbus0: <unknown device> at addr 0xa4
smbus0: <unknown device> at addr 0xa6
smbus0: <unknown device> at addr 0xa8
smbus0: <unknown device> at addr 0xaa
smbus0: <unknown device> at addr 0xac
smbus0: <unknown device> at addr 0xae
imcsmb1: <iMC SMBus controller> numa-domain 0 on imcsmb_pci0
smbus1: <System Management Bus> numa-domain 0 on imcsmb1
smbus1: <unknown device> at addr 0xa0
smbus1: <unknown device> at addr 0xa2
smbus1: <unknown device> at addr 0xa4
smbus1: <unknown device> at addr 0xa6
smbus1: <unknown device> at addr 0xa8
smbus1: <unknown device> at addr 0xaa
smbus1: <unknown device> at addr 0xac
smbus1: <unknown device> at addr 0xae
imcsmb_pci1: <Intel Skylake Xeon iMC 1 SMBus controllers> at device 30.6 numa-domain 0 on pci5
imcsmb2: <iMC SMBus controller> numa-domain 0 on imcsmb_pci1
smbus2: <System Management Bus> numa-domain 0 on imcsmb2
smbus2: <unknown device> at addr 0xa0
smbus2: <unknown device> at addr 0xa2
smbus2: <unknown device> at addr 0xa4
smbus2: <unknown device> at addr 0xa6
smbus2: <unknown device> at addr 0xa8
smbus2: <unknown device> at addr 0xaa
smbus2: <unknown device> at addr 0xac
smbus2: <unknown device> at addr 0xae
imcsmb3: <iMC SMBus controller> numa-domain 0 on imcsmb_pci1
smbus3: <System Management Bus> numa-domain 0 on imcsmb3
smbus3: <unknown device> at addr 0xa0
smbus3: <unknown device> at addr 0xa2
smbus3: <unknown device> at addr 0xa4
smbus3: <unknown device> at addr 0xa6
smbus3: <unknown device> at addr 0xa8
smbus3: <unknown device> at addr 0xaa
smbus3: <unknown device> at addr 0xac
smbus3: <unknown device> at addr 0xae
imcsmb0: cntl: 0x00000000
imcsmb0: transfer timeout
jedec_dimm0: failed to read dram_type
imcsmb0: cntl: 0x00000000
imcsmb0: transfer timeout
jedec_dimm1: failed to read dram_type
imcsmb0: cntl: 0x00000000
imcsmb0: transfer timeout
jedec_dimm2: failed to read dram_type
imcsmb0: cntl: 0x00000000
imcsmb0: transfer timeout
jedec_dimm3: failed to read dram_type
imcsmb0: cntl: 0x00000000
imcsmb0: transfer timeout
jedec_dimm4: failed to read dram_type
imcsmb0: cntl: 0x00000000
imcsmb0: transfer timeout
jedec_dimm5: failed to read dram_type
imcsmb0: cntl: 0x00000000
imcsmb0: transfer timeout
jedec_dimm6: failed to read dram_type
imcsmb0: cntl: 0x00000000
imcsmb0: transfer timeout
jedec_dimm7: failed to read dram_type
imcsmb1: cntl: 0x00000000
imcsmb1: transfer timeout
jedec_dimm8: failed to read dram_type
imcsmb1: cntl: 0x00000000
imcsmb1: transfer timeout
jedec_dimm9: failed to read dram_type
imcsmb1: cntl: 0x00000000
imcsmb1: transfer timeout
jedec_dimm10: failed to read dram_type
imcsmb1: cntl: 0x00000000
imcsmb1: transfer timeout
jedec_dimm11: failed to read dram_type
imcsmb1: cntl: 0x00000000
imcsmb1: transfer timeout
jedec_dimm12: failed to read dram_type
imcsmb1: cntl: 0x00000000
imcsmb1: transfer timeout
jedec_dimm13: failed to read dram_type
imcsmb1: cntl: 0x00000000
imcsmb1: transfer timeout
jedec_dimm14: failed to read dram_type
imcsmb1: cntl: 0x00000000
imcsmb1: transfer timeout
jedec_dimm15: failed to read dram_type
imcsmb2: cntl: 0x00000000
imcsmb2: transfer timeout
jedec_dimm16: failed to read dram_type
imcsmb2: cntl: 0x00000000
imcsmb2: transfer timeout
jedec_dimm17: failed to read dram_type
imcsmb2: cntl: 0x00000000
imcsmb2: transfer timeout
jedec_dimm18: failed to read dram_type
imcsmb2: cntl: 0x00000000
imcsmb2: transfer timeout
jedec_dimm19: failed to read dram_type
imcsmb2: cntl: 0x00000000
imcsmb2: transfer timeout
jedec_dimm20: failed to read dram_type
imcsmb2: cntl: 0x00000000
imcsmb2: transfer timeout
jedec_dimm21: failed to read dram_type
imcsmb2: cntl: 0x00000000
imcsmb2: transfer timeout
jedec_dimm22: failed to read dram_type
imcsmb2: cntl: 0x00000000
imcsmb2: transfer timeout
jedec_dimm23: failed to read dram_type
imcsmb3: cntl: 0x00000000
imcsmb3: transfer timeout
jedec_dimm24: failed to read dram_type
imcsmb3: cntl: 0x00000000
imcsmb3: transfer timeout
jedec_dimm25: failed to read dram_type
imcsmb3: cntl: 0x00000000
imcsmb3: transfer timeout
jedec_dimm26: failed to read dram_type
imcsmb3: cntl: 0x00000000
imcsmb3: transfer timeout
jedec_dimm27: failed to read dram_type
imcsmb3: cntl: 0x00000000
imcsmb3: transfer timeout
jedec_dimm28: failed to read dram_type
imcsmb3: cntl: 0x00000000
imcsmb3: transfer timeout
jedec_dimm29: failed to read dram_type
imcsmb3: cntl: 0x00000000
imcsmb3: transfer timeout
jedec_dimm30: failed to read dram_type
imcsmb3: cntl: 0x00000000
imcsmb3: transfer timeout
jedec_dimm31: failed to read dram_type