Created attachment 202271 [details] thaiphoon-smbus.png intsmb0: <AMD FCH SMBus Controller> at device 20.0 on pci0 smbus0: <System Management Bus> on intsmb0 smbus0: <unknown device> at addr 0xa4 smbus0: <unknown device> at addr 0xa6 smbus0: <unknown device> at addr 0xa2 jedec_dimm0: <DDR4 DIMM> at addr 0xa4 on smbus0 jedec_dimm0: unable to change page for offset 0x0149: 1 jedec_dimm0: unable to restore page for offset 0x0049: 1 jedec_dimm0: unable to change page for offset 0x0145: 1 jedec_dimm0: unable to restore page for offset 0x0045: 1 jedec_dimm0: failed to read TSOD Manufacturer ID device_attach: jedec_dimm0 attach returned 1 jedec_dimm1: <DDR4 DIMM> at addr 0xa6 on smbus0 jedec_dimm1: unable to change page for offset 0x0149: 1 jedec_dimm1: unable to restore page for offset 0x0049: 1 jedec_dimm1: unable to change page for offset 0x0145: 1 jedec_dimm1: unable to restore page for offset 0x0045: 1 jedec_dimm1: failed to read TSOD Manufacturer ID device_attach: jedec_dimm1 attach returned 1 (interestingly, Thaiphoon Burner on Windows finds different SMBus addresses for the SPD EEPROMs: 0x52 and 0x53)
Created attachment 202272 [details] thaiphoon-spd-dump-0.png
Created attachment 202273 [details] thaiphoon-spd-dump-1.png
> (interestingly, Thaiphoon Burner on Windows finds different SMBus addresses for the SPD EEPROMs: 0x52 and 0x53) SMBus uses the low-order bit to denote read or write. It looks like Windows uses 7-bit addressing, where the read/write bit is simply omitted. FreeBSD uses 8-bit SMBus addresses, which are always passed around with the read/write bit cleared. Thus, 7-bit addresses 0x52 and 0x53 become 8-bit addresses 0xa4 and 0xa6.
Could you try loading the driver again with 'bootverbose' set? That should cause it try to read and dump every byte of the SPD device.
(In reply to Ravi Pokala from comment #4) Well, that doesn't look helpful: jedec_dimm0: <DDR4 DIMM> at addr 0xa4 on smbus0 intsmb0: error = 1, status = 0x6 jedec_dimm0: unable to change page: 1 intsmb0: error = 1, status = 0x6 jedec_dimm0: unable to restore page: 1 intsmb0: error = 1, status = 0x6 jedec_dimm0: unable to change page for offset 0x0149: 1 intsmb0: error = 1, status = 0x6 jedec_dimm0: unable to restore page for offset 0x0049: 1 intsmb0: error = 1, status = 0x6 jedec_dimm0: unable to change page for offset 0x0145: 1 intsmb0: error = 1, status = 0x6 jedec_dimm0: unable to restore page for offset 0x0045: 1 intsmb0: error = 1, status = 0x6 jedec_dimm0: failed to read TSOD Manufacturer ID device_attach: jedec_dimm0 attach returned 1 jedec_dimm1: <DDR4 DIMM> at addr 0xa6 on smbus0 intsmb0: error = 1, status = 0x6 jedec_dimm1: unable to change page: 1 intsmb0: error = 1, status = 0x6 jedec_dimm1: unable to restore page: 1 intsmb0: error = 1, status = 0x6 jedec_dimm1: unable to change page for offset 0x0149: 1 intsmb0: error = 1, status = 0x6 jedec_dimm1: unable to restore page for offset 0x0049: 1 intsmb0: error = 1, status = 0x6 jedec_dimm1: unable to change page for offset 0x0145: 1 intsmb0: error = 1, status = 0x6 jedec_dimm1: unable to restore page for offset 0x0045: 1 intsmb0: error = 1, status = 0x6 jedec_dimm1: failed to read TSOD Manufacturer ID device_attach: jedec_dimm1 attach returned 1 intsmb0: error = 1, status = 0x6 jedec_dimm2: failed to read dram_type looking at the code, it doesn't print the original page if the additional page fails. I'll try fixing that.
Here's the first page: 0000 24 10 0c 02 85 21 00 00 00 00 00 03 01 03 80 00 |$....!..........| 0010 00 00 08 0c 80 03 00 00 71 71 71 11 08 79 f0 0a |........qqq..y..| 0020 20 08 00 05 00 a8 1e 2b 2b 00 00 00 00 00 00 00 | ......++.......| 0030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| 0040 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| 0050 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| 0060 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| 0070 00 00 00 00 00 83 b5 ce c0 c0 c0 c0 00 c2 e7 f6 |................| 0080 11 11 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| 0090 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| 00a0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| 00b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| 00c0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| 00d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| 00e0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| 00f0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 b6 58 |...............X| Matches the output in Thaiphoon Burner.
The fact that the Windows util (and, for that matter, the firmware) can read the second (256-byte) page of the SPD, means that the SPD's page-change mechanism does in fact work. That suggests a driver issue. Conveniently, enabling bootverbose to get the SPD dump from jedec_dimm(4), *also* enabled error reporting for intsmb(4). That driver is implemented in the misleadingly-named sys/dev/intpm/intpm.c file; this is an SMBus controller on an AMD system, not an INTel Power Management controller. <sigh> > intsmb0: error = 1, status = 0x6 That message can only come from intsmb_error(): ================================================================ static int intsmb_error(device_t dev, int status) { int error = 0; if (status & PIIX4_SMBHSTSTAT_ERR) error |= SMB_EBUSERR; if (status & PIIX4_SMBHSTSTAT_BUSC) error |= SMB_ECOLLI; if (status & PIIX4_SMBHSTSTAT_FAIL) error |= SMB_ENOACK; if (error != 0 && bootverbose) device_printf(dev, "error = %d, status = %#x\n", error, status); return (error); } ================================================================ The PIIX4_SMBHSTSTAT* values are defined in sys/dev/intpm/intpmreg.h: ================================================================ #define PIIX4_SMBHSTSTAT_BUSY (1<<0) #define PIIX4_SMBHSTSTAT_INTR (1<<1) #define PIIX4_SMBHSTSTAT_ERR (1<<2) #define PIIX4_SMBHSTSTAT_BUSC (1<<3) #define PIIX4_SMBHSTSTAT_FAIL (1<<4) ================================================================ A "status" value of 0x6 is 00110b : (PIIX4_SMBHSTSTAT_ERR | PIIX4_SMBHSTSTAT_INTR). The SMB_E* values are defined in sys/dev/smbus/smbconf.h: ================================================================ #define SMB_ENOERR 0x0 #define SMB_EBUSERR 0x1 #define SMB_ENOTSUPP 0x2 #define SMB_ENOACK 0x4 #define SMB_ECOLLI 0x8 #define SMB_EABORT 0x10 #define SMB_ETIMEOUT 0x20 #define SMB_EBUSY 0x40 #define SMB_EINVAL 0x100 ================================================================ An "error" value of 1 is SMB_EBUSERR, which matches the PIIX4_SMBHSTSTAT => SMB_E mapping that intsmb_error() does.
The fact that the message from intsmb_error() comes immediately before a message about failing to change or restore the page, means that the call that failed was smbus_writeb(), which is ultimately handled by intsmb_writeb(): ================================================================ static int intsmb_writeb(device_t dev, u_char slave, char cmd, char byte) { struct intsmb_softc *sc = device_get_softc(dev); int error; INTSMB_LOCK(sc); error = intsmb_free(sc); if (error) { INTSMB_UNLOCK(sc); return (error); } bus_write_1(sc->io_res, PIIX4_SMBHSTADD, slave & ~LSB); bus_write_1(sc->io_res, PIIX4_SMBHSTCMD, cmd); bus_write_1(sc->io_res, PIIX4_SMBHSTDAT0, byte); intsmb_start(sc, PIIX4_SMBHSTCNT_PROT_BDATA, 0); error = intsmb_stop(sc); INTSMB_UNLOCK(sc); return (error); } ================================================================ You can see that intsmb_writeb() does not call intsmb_error() directly. However, it does call intsmb_stop(): ================================================================ static int intsmb_stop(struct intsmb_softc *sc) { int error, status; INTSMB_LOCK_ASSERT(sc); if (sc->poll || cold) /* So that it can use device during device probe on SMBus. */ return (intsmb_stop_poll(sc)); error = msleep(sc, &sc->lock, PWAIT | PCATCH, "SMBWAI", hz / 8); if (error == 0) { status = bus_read_1(sc->io_res, PIIX4_SMBHSTSTS); if (!(status & PIIX4_SMBHSTSTAT_BUSY)) { error = intsmb_error(sc->dev, status); if (error == 0 && !(status & PIIX4_SMBHSTSTAT_INTR)) device_printf(sc->dev, "unknown cause why?\n"); #ifdef ENABLE_ALART bus_write_1(sc->io_res, PIIX4_SMBSLVCNT, PIIX4_SMBSLVCNT_ALTEN); #endif return (error); } } /* Timeout Procedure. */ sc->isbusy = 0; /* Re-enable suppressed interrupt from slave part. */ bus_write_1(sc->io_res, PIIX4_SMBSLVCNT, PIIX4_SMBSLVCNT_ALTEN); if (error == EWOULDBLOCK) return (SMB_ETIMEOUT); else return (SMB_EABORT); } ================================================================ Unfortunately, I'm not familiar enough with this SMBus controller -- or the intsmb(4) driver -- to debug this further. It looks like the people who have made functional changes in the past decade are cem@ (already CCed) and avg@ (adding).
(In reply to Greg V from comment #0) > intsmb0: <AMD FCH SMBus Controller> at device 20.0 on pci0 Which AMD FCH controller is this? If this is the new Zen2 stuff, we may be missing support for it in intpm(4). Anything interesting in 'devinfo -rv'? Anyone have a datasheet for this FCH?
(In reply to Ravi Pokala from comment #8) Ravi, are you sure that this is correct? > rc = smbus_writeb(sc->smbus, > (JEDEC_DTI_PAGE | JEDEC_LSA_PAGE_SET1), 0, 0); Given smbus_writeb(bus,slave,cmd,byte).
(In reply to Conrad Meyer from comment #9) This is first gen Zen (R7 1700, X370). I'm not sure what qualifies as interesting, but here's its description: intsmb0 pnpinfo vendor=0x1022 device=0x790b subvendor=0x1462 subdevice=0x7a33 class=0x0c0500 at slot=20 function=0 dbsf=pci0:0:20:0 handle=\_SB_.PCI0.D028 ACPI I/O ports: 0xb00-0xb0f smbus0 jedec_dimm0 at addr=0xa4 jedec_dimm1 at addr=0xa6 Looks like the AMD website only has documentation for the previous (?) FCH — https://www.amd.com/system/files/TechDocs/51205_Bolton_FCH_BIOS_Dev_Guide.pdf mentions device ID 780b, not 790b. FreeBSD source calls 790b 'AMDCZ'.
(In reply to Greg V from comment #11) > (In reply to Conrad Meyer from comment #9) > > This is first gen Zen (R7 1700, X370). I'm not sure what qualifies as > interesting, but here's its description: Nvm, I thought you might have zen2 for some reason. I have the same device with zen1 X390 and it seems to work fine here: intsmb0 pnpinfo vendor=0x1022 device=0x790b subvendor=0x1849 subdevice=0xffff class=0x0c0500 at slot=20 function=0 dbsf=pci0:0:20:0 handle=\_SB_.PCI0.D07C ACPI I/O ports: 0xb00-0xb0f smbus0 jedec_dimm0 at addr=0xa2 jedec_dimm1 at addr=0xa6 jedec_dimm2 at addr=0xaa jedec_dimm3 at addr=0xae (Different subvendor/subdevice? Not sure that matters.) Same I/O port range. My SPD addresses are at different locations, but that is expected. > intsmb0 pnpinfo vendor=0x1022 device=0x790b subvendor=0x1462 > subdevice=0x7a33 class=0x0c0500 at slot=20 function=0 dbsf=pci0:0:20:0 > handle=\_SB_.PCI0.D028 > ACPI I/O ports: > 0xb00-0xb0f > smbus0 > jedec_dimm0 at addr=0xa4 > jedec_dimm1 at addr=0xa6 > FreeBSD source calls 790b 'AMDCZ'. Yeah, it is an old piece of IP, introduced in family 15h model ~70h ("Carrizo" APU, I guess). They reuse it in Zen.
(In reply to Andriy Gapon from comment #10) ================================================================ Ravi, are you sure that this is correct? > rc = smbus_writeb(sc->smbus, > (JEDEC_DTI_PAGE | JEDEC_LSA_PAGE_SET1), 0, 0); Given smbus_writeb(bus,slave,cmd,byte). ================================================================ Sadly, yes, I'm sure it's correct. SMBus only supports 8-bit command codes (which are used as EEPROM byte offsets), so page-change mechanism is required for accessing bytes [256-512]. For whatever reason, JEDEC (or the SPD EEPROM chip vendors) decided that the way to handle that is by writing to a specific SMBus address. In response, *all* devices on the bus switch to that page. As I noted in jedec_dimm.h: ================================================================ /* TSE2004av defines several Device Type Identifiers (DTIs), which are the high * nybble of the SMBus address. Addresses with DTIs of PROTECT (or PAGE, which * has the same value) are essentially "broadcast" addresses; all SPD devices * respond to them, changing their mode based on the Logical Serial Address * (LSA) encoded in bits [3:1]. For normal SPD access, bits [3:1] encode the * DIMM slot number. */ ================================================================ Thus, using "(JEDEC_DTI_PAGE | JEDEC_LSA_PAGE_SET1)" as the slave address is in fact correct.
(In reply to Ravi Pokala from comment #13) Oh, I recall this now. Thank you for refreshing my memory. And I could have checked the code more thoroughly myself. So, sorry for the noise. In one part-specific datasheet I noticed this curious bit: https://www.onsemi.com/pub/Collateral/N34C04-D.PDF > 17. The N34C04MU3ETG will respond with NoACK following the dummy Data Byte, > while the N34C04MU3EKTG will respond with ACK. Maybe that NoACK (NACK) is what we are seeing here as an SMBus error. Although the code mapped PIIX4_SMBHSTSTAT_ERR | PIIX4_SMBHSTSTAT_INTR to SMB_EBUSERR rather than to SMB_ENOACK, I am not sure if that mapping is (universally) correct. E.g., in one AMD BKDG document I see this: > Bits Description > 2 DeviceErr. Read; Write-1-to-clear; Set-by-hardware. > Reset: 0. > 1=An error of one of the following occurred: > • Illegal command field. > • Unclaimed cycle > • Host device time-out. > 1 SMBusInterrupt. Read; Write-1-to-clear; Set-by-hardware. > Reset: 0. 1=Completion of the last host command. A combination of these bits can mean an "unclaimed cycle" and maybe that means "NACK" in the SMBus protocol terminology. So, it would be interesting to see - what specific EEPROM part the DIMM in question uses (a looking glass can help) - what happens if we ignore the error from the page change write
Ravi, also, I think that it is wrong to fail the attach on a failure to access TSOD. Many DDR3 DIMMs simply do not have it (not sure about DDR4). I think that all "server" RDIMMS/LRDIMMs and consumer ECC-capable UDIMMS that I've seen had it. But all non-ECC UDIMMs didn't. So, a missing TSOD should not be a fatal error.
(In reply to Andriy Gapon from comment #14) > In one part-specific datasheet I noticed this curious bit: > https://www.onsemi.com/pub/Collateral/N34C04-D.PDF >> 17. The N34C04MU3ETG will respond with NoACK following the dummy Data Byte, >> while the N34C04MU3EKTG will respond with ACK. Oh, that's very clever. <facepalm> Seriously though, nice find Andriy! That could very well be the Greg is seeing.
(In reply to Andriy Gapon from comment #15) > also, I think that it is wrong to fail the attach on a failure to access TSOD. > Many DDR3 DIMMs simply do not have it (not sure about DDR4). > I think that all "server" RDIMMS/LRDIMMs and consumer ECC-capable UDIMMS that I've seen had it. But all non-ECC UDIMMs didn't. > So, a missing TSOD should not be a fatal error. Agreed, which is why it isn't: ================================================================ /* The MSBit of the TSOD-presence byte reports whether or not the TSOD * is in fact present. If it is, read manufacturer and device info from * it to confirm that it's a valid TSOD device. It's an error if any of * those bytes are unreadable; it's not an error if the device is simply * not known to us (tsod_match == NULL). * While DDR3 and DDR4 don't explicitly require a TSOD, essentially all * DDR3 and DDR4 DIMMs include one. */ rc = smbus_readb(sc->smbus, sc->spd_addr, tsod_present_offset, &byte); ... if (byte & 0x80) { tsod_present = true; ... } else { tsod_match = NULL; tsod_present = false; } ... /* Create the temperature sysctl IFF the TSOD is present and valid */ if (tsod_present && (tsod_match != NULL)) { ================================================================
(In reply to Ravi Pokala from comment #16) > That could very well be the Greg is seeing. That could very well be the *problem* Greg is seeing.
(In reply to Ravi Pokala from comment #17) In my svn checkout the first ellipses expand to: if (rc != 0) { device_printf(dev, "failed to read TSOD-present byte: %d\n", rc); goto out; }
The TSOD-present byte is just another byte in the SPD; it should be readable if the SPD in general is readable. If that specific byte is not readable, that suggests a real problem reading the SPD.
Sorry, I've been inattentive again. What if that bit lies, e.g., because of a vendor's mistake... Then, we would fail here: if (byte & 0x80) { tsod_present = true; rc = jedec_dimm_readw_be(sc, TSOD_REG_MANUFACTURER, &vendorid); if (rc != 0) { device_printf(dev, "failed to read TSOD Manufacturer ID\n"); goto out; } And we see that in comment #0: > jedec_dimm0: failed to read TSOD Manufacturer ID > device_attach: jedec_dimm0 attach returned 1 P.S. it also seems that smbus error codes ideally need translating to "normal" error codes.
(In reply to Andriy Gapon from comment #14) > what specific EEPROM part the DIMM in question uses I don't really want to take the heatsink off, but the high resolution photos helpfully provided by KFA2 on the official website show a chip marked "44 5MC 5AC" ("5" might be "S"?), unfortunately google doesn't find anything for that query.
Ignoring the page errors makes it read the second page successfully (contents match the ones seen in Windows)! And yeah, the TSOD presence bit seems to lie on these DIMMs… So pretty much the only thing I can see is the capacity. I guess I shouldn't be surprised, since the SPD is so full of zeros. dev.jedec_dimm.0.serial: 00000000 dev.jedec_dimm.0.part: dev.jedec_dimm.0.capacity: 8192 dev.jedec_dimm.0.type: DDR4
Awesome! Can you attach a patch with your changes?
Created attachment 202365 [details] dimm-hack.patch (In reply to Ravi Pokala from comment #24) It's not a good patch, but this is what I did
Thanks Greg! I'll massage it as needed and commit it for you sometime in the next few days.
(In reply to Greg V from comment #23) I once got some Kingston modules, they all had the same serial number. Turned out to be counterfeit. Hope that that is not the case with the memory you got. Also, could you please share a link to the picture?
(In reply to Andriy Gapon from comment #27) http://www.kfa2.com/kfa2/kfa2-halloffame-extreme.html / http://ftp.kfa2.com/downloads/KFA2_HOF_Extreme.zip
> Thanks Greg! I'll massage it as needed and commit it for you sometime in the next few days. For some very large definitions of "few"... :-/ I've finally circled back around to this, and have a patch in Phabricator: https://reviews.freebsd.org/D19681 Greg, could you apply that patch and confirm that it works for you? Thanks!
(In reply to Ravi Pokala from comment #29) Yes, it does
(In reply to Greg V from comment #30) > Yes, it does Excellent, thanks!
A commit references this bug: Author: rpokala Date: Wed Mar 27 21:50:02 UTC 2019 New revision: 345611 URL: https://svnweb.freebsd.org/changeset/base/345611 Log: Teach jedec_dimm(4) to be more forgiving of non-fatal errors. It looks like some DIMMs claim to have a TSOD, but actually don't. Some claim they weren't able to change the SPD page, but they did. Neither of those should be fatal errors. PR: 235944 Submitted by: Greg V <greg@unrelenting.technology> Reported by: Greg V <greg@unrelenting.technology> Reviewed by: cem MFC after: 1 weeks Sponsored by: Panasas Differential Revision: https://reviews.freebsd.org/D19681 Changes: head/sys/dev/jedec_dimm/jedec_dimm.c
MFCed to stable/12 in r345836 MFCed to stable/11 in r345837 MFCed to stable/10 in r345838