Bug 235944 - jedec_dimm(4) does not attach to KFA2 (aka Galax) Hall of Fame DDR4 sticks
Summary: jedec_dimm(4) does not attach to KFA2 (aka Galax) Hall of Fame DDR4 sticks
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: CURRENT
Hardware: Any Any
: --- Affects Only Me
Assignee: Ravi Pokala
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-02-22 20:55 UTC by Val Packett
Modified: 2019-04-03 19:40 UTC (History)
5 users (show)

See Also:
rpokala: mfc-stable12+
rpokala: mfc-stable11+
rpokala: mfc-stable10+


Attachments
thaiphoon-smbus.png (7.48 KB, image/png)
2019-02-22 20:55 UTC, Val Packett
no flags Details
thaiphoon-spd-dump-0.png (32.78 KB, image/png)
2019-02-22 20:56 UTC, Val Packett
no flags Details
thaiphoon-spd-dump-1.png (33.18 KB, image/png)
2019-02-22 20:57 UTC, Val Packett
no flags Details
dimm-hack.patch (1.03 KB, patch)
2019-02-25 20:24 UTC, Val Packett
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Val Packett 2019-02-22 20:55:21 UTC
Created attachment 202271 [details]
thaiphoon-smbus.png

intsmb0: <AMD FCH SMBus Controller> at device 20.0 on pci0
smbus0: <System Management Bus> on intsmb0
smbus0: <unknown device> at addr 0xa4
smbus0: <unknown device> at addr 0xa6
smbus0: <unknown device> at addr 0xa2

jedec_dimm0: <DDR4 DIMM> at addr 0xa4 on smbus0
jedec_dimm0: unable to change page for offset 0x0149: 1
jedec_dimm0: unable to restore page for offset 0x0049: 1
jedec_dimm0: unable to change page for offset 0x0145: 1
jedec_dimm0: unable to restore page for offset 0x0045: 1
jedec_dimm0: failed to read TSOD Manufacturer ID
device_attach: jedec_dimm0 attach returned 1
jedec_dimm1: <DDR4 DIMM> at addr 0xa6 on smbus0
jedec_dimm1: unable to change page for offset 0x0149: 1
jedec_dimm1: unable to restore page for offset 0x0049: 1
jedec_dimm1: unable to change page for offset 0x0145: 1
jedec_dimm1: unable to restore page for offset 0x0045: 1
jedec_dimm1: failed to read TSOD Manufacturer ID
device_attach: jedec_dimm1 attach returned 1

(interestingly, Thaiphoon Burner on Windows finds different SMBus addresses for the SPD EEPROMs: 0x52 and 0x53)
Comment 1 Val Packett 2019-02-22 20:56:46 UTC
Created attachment 202272 [details]
thaiphoon-spd-dump-0.png
Comment 2 Val Packett 2019-02-22 20:57:15 UTC
Created attachment 202273 [details]
thaiphoon-spd-dump-1.png
Comment 3 Ravi Pokala 2019-02-22 23:21:24 UTC
> (interestingly, Thaiphoon Burner on Windows finds different SMBus addresses for the SPD EEPROMs: 0x52 and 0x53)

SMBus uses the low-order bit to denote read or write. It looks like Windows uses 7-bit addressing, where the read/write bit is simply omitted. FreeBSD uses 8-bit SMBus addresses, which are always passed around with the read/write bit cleared. Thus, 7-bit addresses 0x52 and 0x53 become 8-bit addresses 0xa4 and 0xa6.
Comment 4 Ravi Pokala 2019-02-22 23:38:52 UTC
Could you try loading the driver again with 'bootverbose' set? That should cause it try to read and dump every byte of the SPD device.
Comment 5 Val Packett 2019-02-23 00:45:17 UTC
(In reply to Ravi Pokala from comment #4)

Well, that doesn't look helpful:

jedec_dimm0: <DDR4 DIMM> at addr 0xa4 on smbus0
intsmb0: error = 1, status = 0x6
jedec_dimm0: unable to change page: 1
intsmb0: error = 1, status = 0x6
jedec_dimm0: unable to restore page: 1
intsmb0: error = 1, status = 0x6
jedec_dimm0: unable to change page for offset 0x0149: 1
intsmb0: error = 1, status = 0x6
jedec_dimm0: unable to restore page for offset 0x0049: 1
intsmb0: error = 1, status = 0x6
jedec_dimm0: unable to change page for offset 0x0145: 1
intsmb0: error = 1, status = 0x6
jedec_dimm0: unable to restore page for offset 0x0045: 1
intsmb0: error = 1, status = 0x6
jedec_dimm0: failed to read TSOD Manufacturer ID
device_attach: jedec_dimm0 attach returned 1
jedec_dimm1: <DDR4 DIMM> at addr 0xa6 on smbus0
intsmb0: error = 1, status = 0x6
jedec_dimm1: unable to change page: 1
intsmb0: error = 1, status = 0x6
jedec_dimm1: unable to restore page: 1
intsmb0: error = 1, status = 0x6
jedec_dimm1: unable to change page for offset 0x0149: 1
intsmb0: error = 1, status = 0x6
jedec_dimm1: unable to restore page for offset 0x0049: 1
intsmb0: error = 1, status = 0x6
jedec_dimm1: unable to change page for offset 0x0145: 1
intsmb0: error = 1, status = 0x6
jedec_dimm1: unable to restore page for offset 0x0045: 1
intsmb0: error = 1, status = 0x6
jedec_dimm1: failed to read TSOD Manufacturer ID
device_attach: jedec_dimm1 attach returned 1
intsmb0: error = 1, status = 0x6
jedec_dimm2: failed to read dram_type

looking at the code, it doesn't print the original page if the additional page fails. I'll try fixing that.
Comment 6 Val Packett 2019-02-23 00:47:32 UTC
Here's the first page:

0000   24 10 0c 02 85 21 00 00 00 00 00 03 01 03 80 00  |$....!..........|
0010   00 00 08 0c 80 03 00 00 71 71 71 11 08 79 f0 0a  |........qqq..y..|
0020   20 08 00 05 00 a8 1e 2b 2b 00 00 00 00 00 00 00  | ......++.......|
0030   00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  |................|
0040   00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  |................|
0050   00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  |................|
0060   00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  |................|
0070   00 00 00 00 00 83 b5 ce c0 c0 c0 c0 00 c2 e7 f6  |................|
0080   11 11 00 00 00 00 00 00 00 00 00 00 00 00 00 00  |................|
0090   00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  |................|
00a0   00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  |................|
00b0   00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  |................|
00c0   00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  |................|
00d0   00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  |................|
00e0   00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  |................|
00f0   00 00 00 00 00 00 00 00 00 00 00 00 00 00 b6 58  |...............X|

Matches the output in Thaiphoon Burner.
Comment 7 Ravi Pokala 2019-02-23 03:25:47 UTC
The fact that the Windows util (and, for that matter, the firmware) can read the second (256-byte) page of the SPD, means that the SPD's page-change mechanism does in fact work. That suggests a driver issue.

Conveniently, enabling bootverbose to get the SPD dump from jedec_dimm(4), *also* enabled error reporting for intsmb(4). That driver is implemented in the misleadingly-named sys/dev/intpm/intpm.c file; this is an SMBus controller on an AMD system, not an INTel Power Management controller. <sigh>

> intsmb0: error = 1, status = 0x6

That message can only come from intsmb_error():

================================================================
static int
intsmb_error(device_t dev, int status)
{
	int error = 0;

	if (status & PIIX4_SMBHSTSTAT_ERR)
		error |= SMB_EBUSERR;
	if (status & PIIX4_SMBHSTSTAT_BUSC)
		error |= SMB_ECOLLI;
	if (status & PIIX4_SMBHSTSTAT_FAIL)
		error |= SMB_ENOACK;

	if (error != 0 && bootverbose)
		device_printf(dev, "error = %d, status = %#x\n", error, status);

	return (error);
}
================================================================

The PIIX4_SMBHSTSTAT* values are defined in sys/dev/intpm/intpmreg.h:

================================================================
#define	PIIX4_SMBHSTSTAT_BUSY	(1<<0)
#define	PIIX4_SMBHSTSTAT_INTR	(1<<1)
#define	PIIX4_SMBHSTSTAT_ERR	(1<<2)
#define	PIIX4_SMBHSTSTAT_BUSC	(1<<3)
#define	PIIX4_SMBHSTSTAT_FAIL	(1<<4)
================================================================

A "status" value of 0x6 is 00110b : (PIIX4_SMBHSTSTAT_ERR | PIIX4_SMBHSTSTAT_INTR).

The SMB_E* values are defined in sys/dev/smbus/smbconf.h:

================================================================
#define SMB_ENOERR	0x0
#define SMB_EBUSERR	0x1
#define SMB_ENOTSUPP	0x2
#define SMB_ENOACK	0x4
#define SMB_ECOLLI	0x8
#define SMB_EABORT	0x10
#define SMB_ETIMEOUT	0x20
#define SMB_EBUSY	0x40
#define	SMB_EINVAL	0x100
================================================================

An "error" value of 1 is SMB_EBUSERR, which matches the PIIX4_SMBHSTSTAT => SMB_E mapping that intsmb_error() does.
Comment 8 Ravi Pokala 2019-02-23 03:45:47 UTC
The fact that the message from intsmb_error() comes immediately before a message about failing to change or restore the page, means that the call that failed was smbus_writeb(), which is ultimately handled by intsmb_writeb():

================================================================
static int
intsmb_writeb(device_t dev, u_char slave, char cmd, char byte)
{
	struct intsmb_softc *sc = device_get_softc(dev);
	int error;

	INTSMB_LOCK(sc);
	error = intsmb_free(sc);
	if (error) {
		INTSMB_UNLOCK(sc);
		return (error);
	}
	bus_write_1(sc->io_res, PIIX4_SMBHSTADD, slave & ~LSB);
	bus_write_1(sc->io_res, PIIX4_SMBHSTCMD, cmd);
	bus_write_1(sc->io_res, PIIX4_SMBHSTDAT0, byte);
	intsmb_start(sc, PIIX4_SMBHSTCNT_PROT_BDATA, 0);
	error = intsmb_stop(sc);
	INTSMB_UNLOCK(sc);
	return (error);
}
================================================================

You can see that intsmb_writeb() does not call intsmb_error() directly. However, it does call intsmb_stop():

================================================================
static int
intsmb_stop(struct intsmb_softc *sc)
{
	int error, status;

	INTSMB_LOCK_ASSERT(sc);

	if (sc->poll || cold)
		/* So that it can use device during device probe on SMBus. */
		return (intsmb_stop_poll(sc));

	error = msleep(sc, &sc->lock, PWAIT | PCATCH, "SMBWAI", hz / 8);
	if (error == 0) {
		status = bus_read_1(sc->io_res, PIIX4_SMBHSTSTS);
		if (!(status & PIIX4_SMBHSTSTAT_BUSY)) {
			error = intsmb_error(sc->dev, status);
			if (error == 0 && !(status & PIIX4_SMBHSTSTAT_INTR))
				device_printf(sc->dev, "unknown cause why?\n");
#ifdef ENABLE_ALART
			bus_write_1(sc->io_res, PIIX4_SMBSLVCNT,
			    PIIX4_SMBSLVCNT_ALTEN);
#endif
			return (error);
		}
	}

	/* Timeout Procedure. */
	sc->isbusy = 0;

	/* Re-enable suppressed interrupt from slave part. */
	bus_write_1(sc->io_res, PIIX4_SMBSLVCNT, PIIX4_SMBSLVCNT_ALTEN);
	if (error == EWOULDBLOCK)
		return (SMB_ETIMEOUT);
	else
		return (SMB_EABORT);
}
================================================================

Unfortunately, I'm not familiar enough with this SMBus controller -- or the intsmb(4) driver -- to debug this further. It looks like the people who have made functional changes in the past decade are cem@ (already CCed) and avg@ (adding).
Comment 9 Conrad Meyer freebsd_committer freebsd_triage 2019-02-23 04:34:20 UTC
(In reply to Greg V from comment #0)
> intsmb0: <AMD FCH SMBus Controller> at device 20.0 on pci0

Which AMD FCH controller is this?  If this is the new Zen2 stuff, we may be missing support for it in intpm(4).  Anything interesting in 'devinfo -rv'?

Anyone have a datasheet for this FCH?
Comment 10 Andriy Gapon freebsd_committer freebsd_triage 2019-02-23 08:56:47 UTC
(In reply to Ravi Pokala from comment #8)

Ravi, are you sure that this is correct?
>                 rc = smbus_writeb(sc->smbus,
>                    (JEDEC_DTI_PAGE | JEDEC_LSA_PAGE_SET1), 0, 0);
Given smbus_writeb(bus,slave,cmd,byte).
Comment 11 Val Packett 2019-02-23 10:22:10 UTC
(In reply to Conrad Meyer from comment #9)

This is first gen Zen (R7 1700, X370). I'm not sure what qualifies as interesting, but here's its description:

        intsmb0 pnpinfo vendor=0x1022 device=0x790b subvendor=0x1462 subdevice=0x7a33 class=0x0c0500 at slot=20 function=0 dbsf=pci0:0:20:0 handle=\_SB_.PCI0.D028
            ACPI I/O ports:
                0xb00-0xb0f
          smbus0
            jedec_dimm0 at addr=0xa4
            jedec_dimm1 at addr=0xa6

Looks like the AMD website only has documentation for the previous (?) FCH — https://www.amd.com/system/files/TechDocs/51205_Bolton_FCH_BIOS_Dev_Guide.pdf mentions device ID 780b, not 790b. FreeBSD source calls 790b 'AMDCZ'.
Comment 12 Conrad Meyer freebsd_committer freebsd_triage 2019-02-23 17:00:41 UTC
(In reply to Greg V from comment #11)
> (In reply to Conrad Meyer from comment #9)
> 
> This is first gen Zen (R7 1700, X370). I'm not sure what qualifies as
> interesting, but here's its description:

Nvm, I thought you might have zen2 for some reason.  I have the same device with zen1 X390 and it seems to work fine here:

        intsmb0 pnpinfo vendor=0x1022 device=0x790b subvendor=0x1849 subdevice=0xffff class=0x0c0500 at slot=20 function=0 dbsf=pci0:0:20:0 handle=\_SB_.PCI0.D07C
            ACPI I/O ports:
                0xb00-0xb0f
          smbus0
            jedec_dimm0 at addr=0xa2
            jedec_dimm1 at addr=0xa6
            jedec_dimm2 at addr=0xaa
            jedec_dimm3 at addr=0xae

(Different subvendor/subdevice?  Not sure that matters.)  Same I/O port range.  My SPD addresses are at different locations, but that is expected.


>         intsmb0 pnpinfo vendor=0x1022 device=0x790b subvendor=0x1462
> subdevice=0x7a33 class=0x0c0500 at slot=20 function=0 dbsf=pci0:0:20:0
> handle=\_SB_.PCI0.D028
>             ACPI I/O ports:
>                 0xb00-0xb0f
>           smbus0
>             jedec_dimm0 at addr=0xa4
>             jedec_dimm1 at addr=0xa6


> FreeBSD source calls 790b 'AMDCZ'.

Yeah, it is an old piece of IP, introduced in family 15h model ~70h ("Carrizo" APU, I guess).  They reuse it in Zen.
Comment 13 Ravi Pokala 2019-02-25 04:38:47 UTC
(In reply to Andriy Gapon from comment #10)

================================================================
Ravi, are you sure that this is correct?
>                 rc = smbus_writeb(sc->smbus,
>                    (JEDEC_DTI_PAGE | JEDEC_LSA_PAGE_SET1), 0, 0);

Given smbus_writeb(bus,slave,cmd,byte).
================================================================

Sadly, yes, I'm sure it's correct. SMBus only supports 8-bit command codes (which are used as EEPROM byte offsets), so page-change mechanism is required for accessing bytes [256-512]. For whatever reason, JEDEC (or the SPD EEPROM chip vendors) decided that the way to handle that is by writing to a specific SMBus address. In response, *all* devices on the bus switch to that page. As I noted in jedec_dimm.h:

================================================================
/* TSE2004av defines several Device Type Identifiers (DTIs), which are the high
 * nybble of the SMBus address. Addresses with DTIs of PROTECT (or PAGE, which
 * has the same value) are essentially "broadcast" addresses; all SPD devices
 * respond to them, changing their mode based on the Logical Serial Address
 * (LSA) encoded in bits [3:1]. For normal SPD access, bits [3:1] encode the
 * DIMM slot number.
 */
================================================================

Thus, using "(JEDEC_DTI_PAGE | JEDEC_LSA_PAGE_SET1)" as the slave address is in fact correct.
Comment 14 Andriy Gapon freebsd_committer freebsd_triage 2019-02-25 08:21:48 UTC
(In reply to Ravi Pokala from comment #13)
Oh, I recall this now.  Thank you for refreshing my memory.  And I could have checked the code more thoroughly myself.  So, sorry for the noise.

In one part-specific datasheet I noticed this curious bit:
https://www.onsemi.com/pub/Collateral/N34C04-D.PDF
> 17. The N34C04MU3ETG will respond with NoACK following the dummy Data Byte,
> while the N34C04MU3EKTG will respond with ACK.

Maybe that NoACK (NACK) is what we are seeing here as an SMBus error.
Although the code mapped PIIX4_SMBHSTSTAT_ERR | PIIX4_SMBHSTSTAT_INTR to SMB_EBUSERR rather than to SMB_ENOACK, I am not sure if that mapping is (universally) correct.

E.g., in one AMD BKDG document I see this:
> Bits            Description
> 2               DeviceErr. Read; Write-1-to-clear; Set-by-hardware.
>                 Reset: 0.
>                 1=An error of one of the following occurred:
>                 • Illegal command field.
>                 • Unclaimed cycle
>                 • Host device time-out.
> 1               SMBusInterrupt. Read; Write-1-to-clear; Set-by-hardware.
>                 Reset: 0. 1=Completion of the last host command.

A combination of these bits can mean an "unclaimed cycle" and maybe that means "NACK" in the SMBus protocol terminology.

So, it would be interesting to see
- what specific EEPROM part the DIMM in question uses (a looking glass can help)
- what happens if we ignore the error from the page change write
Comment 15 Andriy Gapon freebsd_committer freebsd_triage 2019-02-25 08:25:29 UTC
Ravi,
also, I think that it is wrong to fail the attach on a failure to access TSOD.
Many DDR3 DIMMs simply do not have it (not sure about DDR4).
I think that all "server" RDIMMS/LRDIMMs and consumer ECC-capable UDIMMS that I've seen had it.  But all non-ECC UDIMMs didn't.
So, a missing TSOD should not be a fatal error.
Comment 16 Ravi Pokala 2019-02-25 15:10:05 UTC
(In reply to Andriy Gapon from comment #14)

> In one part-specific datasheet I noticed this curious bit:
> https://www.onsemi.com/pub/Collateral/N34C04-D.PDF
>> 17. The N34C04MU3ETG will respond with NoACK following the dummy Data Byte,
>> while the N34C04MU3EKTG will respond with ACK.

Oh, that's very clever. <facepalm>

Seriously though, nice find Andriy! That could very well be the Greg is seeing.
Comment 17 Ravi Pokala 2019-02-25 15:17:52 UTC
(In reply to Andriy Gapon from comment #15)

> also, I think that it is wrong to fail the attach on a failure to access TSOD.
> Many DDR3 DIMMs simply do not have it (not sure about DDR4).
> I think that all "server" RDIMMS/LRDIMMs and consumer ECC-capable UDIMMS that I've seen had it.  But all non-ECC UDIMMs didn't.
> So, a missing TSOD should not be a fatal error.

Agreed, which is why it isn't:

================================================================
	/* The MSBit of the TSOD-presence byte reports whether or not the TSOD
	 * is in fact present. If it is, read manufacturer and device info from
	 * it to confirm that it's a valid TSOD device. It's an error if any of
	 * those bytes are unreadable; it's not an error if the device is simply
	 * not known to us (tsod_match == NULL).
	 * While DDR3 and DDR4 don't explicitly require a TSOD, essentially all
	 * DDR3 and DDR4 DIMMs include one.
	 */
	rc = smbus_readb(sc->smbus, sc->spd_addr, tsod_present_offset, &byte);
...
	if (byte & 0x80) {
		tsod_present = true;
...
	} else {
		tsod_match = NULL;
		tsod_present = false;
	}
...
	/* Create the temperature sysctl IFF the TSOD is present and valid */
	if (tsod_present && (tsod_match != NULL)) {
================================================================
Comment 18 Ravi Pokala 2019-02-25 15:20:41 UTC
(In reply to Ravi Pokala from comment #16)
> That could very well be the Greg is seeing.

That could very well be the *problem* Greg is seeing.
Comment 19 Andriy Gapon freebsd_committer freebsd_triage 2019-02-25 15:27:58 UTC
(In reply to Ravi Pokala from comment #17)
In my svn checkout the first ellipses expand to:
        if (rc != 0) {
                device_printf(dev, "failed to read TSOD-present byte: %d\n",
                    rc);
                goto out;
        }
Comment 20 Ravi Pokala 2019-02-25 17:04:29 UTC
The TSOD-present byte is just another byte in the SPD; it should be readable if the SPD in general is readable. If that specific byte is not readable, that suggests a real problem reading the SPD.
Comment 21 Andriy Gapon freebsd_committer freebsd_triage 2019-02-25 17:34:21 UTC
Sorry, I've been inattentive again.
What if that bit lies, e.g., because of a vendor's mistake...
Then, we would fail here:
        if (byte & 0x80) {
                tsod_present = true;
                rc = jedec_dimm_readw_be(sc, TSOD_REG_MANUFACTURER, &vendorid);
                if (rc != 0) {
                        device_printf(dev,
                            "failed to read TSOD Manufacturer ID\n");
                        goto out;
                }

And we see that in comment #0:
> jedec_dimm0: failed to read TSOD Manufacturer ID
> device_attach: jedec_dimm0 attach returned 1

P.S. it also seems that smbus error codes ideally need translating to "normal" error codes.
Comment 22 Val Packett 2019-02-25 19:17:27 UTC
(In reply to Andriy Gapon from comment #14)

> what specific EEPROM part the DIMM in question uses

I don't really want to take the heatsink off, but the high resolution photos helpfully provided by KFA2 on the official website show a chip marked "44 5MC 5AC" ("5" might be "S"?), unfortunately google doesn't find anything for that query.
Comment 23 Val Packett 2019-02-25 19:52:07 UTC
Ignoring the page errors makes it read the second page successfully (contents match the ones seen in Windows)!

And yeah, the TSOD presence bit seems to lie on these DIMMs…

So pretty much the only thing I can see is the capacity. I guess I shouldn't be surprised, since the SPD is so full of zeros.

dev.jedec_dimm.0.serial: 00000000
dev.jedec_dimm.0.part:
dev.jedec_dimm.0.capacity: 8192
dev.jedec_dimm.0.type: DDR4
Comment 24 Ravi Pokala 2019-02-25 20:22:18 UTC
Awesome! Can you attach a patch with your changes?
Comment 25 Val Packett 2019-02-25 20:24:46 UTC
Created attachment 202365 [details]
dimm-hack.patch

(In reply to Ravi Pokala from comment #24)
It's not a good patch, but this is what I did
Comment 26 Ravi Pokala 2019-02-25 21:11:09 UTC
Thanks Greg! I'll massage it as needed and commit it for you sometime in the next few days.
Comment 27 Andriy Gapon freebsd_committer freebsd_triage 2019-02-25 22:02:21 UTC
(In reply to Greg V from comment #23)
I once got some Kingston modules, they all had the same serial number.
Turned out to be counterfeit. Hope that that is not the case with the memory you got.

Also, could you please share a link to the picture?
Comment 29 Ravi Pokala 2019-03-22 18:41:20 UTC
> Thanks Greg! I'll massage it as needed and commit it for you sometime in the next few days.

For some very large definitions of "few"... :-/

I've finally circled back around to this, and have a patch in Phabricator:

https://reviews.freebsd.org/D19681

Greg, could you apply that patch and confirm that it works for you?

Thanks!
Comment 30 Val Packett 2019-03-23 14:50:32 UTC
(In reply to Ravi Pokala from comment #29)
Yes, it does
Comment 31 Ravi Pokala 2019-03-25 23:23:33 UTC
(In reply to Greg V from comment #30)

> Yes, it does

Excellent, thanks!
Comment 32 commit-hook freebsd_committer freebsd_triage 2019-03-27 21:50:10 UTC
A commit references this bug:

Author: rpokala
Date: Wed Mar 27 21:50:02 UTC 2019
New revision: 345611
URL: https://svnweb.freebsd.org/changeset/base/345611

Log:
  Teach jedec_dimm(4) to be more forgiving of non-fatal errors.

  It looks like some DIMMs claim to have a TSOD, but actually don't. Some
  claim they weren't able to change the SPD page, but they did. Neither of
  those should be fatal errors.

  PR:		235944
  Submitted by:	Greg V <greg@unrelenting.technology>
  Reported by:	Greg V <greg@unrelenting.technology>
  Reviewed by:	cem
  MFC after:	1 weeks
  Sponsored by:	Panasas
  Differential Revision:	https://reviews.freebsd.org/D19681

Changes:
  head/sys/dev/jedec_dimm/jedec_dimm.c
Comment 33 Ravi Pokala freebsd_committer freebsd_triage 2019-04-03 19:40:21 UTC
MFCed to stable/12 in r345836
MFCed to stable/11 in r345837
MFCed to stable/10 in r345838