Bug 222079 - [ichwd] New Intel TCO Watchdog Timer (Skylake/Sunrise Point and newer) is not supported
Summary: [ichwd] New Intel TCO Watchdog Timer (Skylake/Sunrise Point and newer) is not...
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: CURRENT
Hardware: amd64 Any
: --- Affects Some People
Assignee: Andriy Gapon
URL:
Keywords: feature, needs-patch
Depends on:
Blocks:
 
Reported: 2017-09-05 18:47 UTC by Val Packett
Modified: 2018-12-14 09:31 UTC (History)
7 users (show)

See Also:
driesm: mfc-stable12?


Attachments
Attempt to clear NO_REBOOT bit on tco_version==4 hardware (3.73 KB, patch)
2018-11-06 04:48 UTC, t_uemura
no flags Details | Diff
(Maybe) fixed resource allocation. (4.28 KB, patch)
2018-11-08 06:18 UTC, t_uemura
no flags Details | Diff
my update of the NO_REBOOT patch (5.90 KB, patch)
2018-12-05 13:27 UTC, Andriy Gapon
no flags Details | Diff
my update of the NO_REBOOT patch (5.89 KB, patch)
2018-12-05 13:30 UTC, Andriy Gapon
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Val Packett 2017-09-05 18:47:05 UTC
Looks like Intel moved the watchdog from LPC to SMBus in Sunrise Point, so the ichwd driver does not work anymore.

Relevant Linux patches: https://lkml.org/lkml/2015/7/27/456
Comment 1 Konstantin Belousov freebsd_committer freebsd_triage 2017-10-10 15:28:46 UTC
I have some WIP but it is far from being able to drive the hardware.  The current state is that I have attachments for the power controller, added code which understands location of the TCO registers in SMBUS but the stumbling block is to correctly reserve TCO window in rman.  It is already reserved by motherboard PNP0C02 on the server where I debug this.
Comment 2 Andriy Gapon freebsd_committer freebsd_triage 2018-09-19 10:42:27 UTC
Just curious if there are any news about this.
Comment 3 Konstantin Belousov freebsd_committer freebsd_triage 2018-09-19 12:00:13 UTC
(In reply to Andriy Gapon from comment #2)
It is in the same state, nothing happens.  I did not received the advise about co-existence of drivers.
Comment 4 Andriy Gapon freebsd_committer freebsd_triage 2018-10-16 14:43:01 UTC
A proposed patch (first iteration as of now): https://reviews.freebsd.org/D17585
Please see the review request for its limitations.
If PCI ID of your pci0:0:31:4 device is different, then please add it before testing.
Comment 5 commit-hook freebsd_committer freebsd_triage 2018-10-22 14:44:48 UTC
A commit references this bug:

Author: avg
Date: Mon Oct 22 14:44:45 UTC 2018
New revision: 339591
URL: https://svnweb.freebsd.org/changeset/base/339591

Log:
  ichwd: add support for TCO watchdog timer in Lewisburg PCH (C620)

  The change is based on public documents listed below as well as Linux
  changes and the code developed by Kostik.

  The documents:
  - Intel? C620 Series Chipset Platform Controller Hub Datasheet
  - Intel? 100 Series and Intel? C230 Series Chipset Family Platform
    Controller Hub (PCH) Datasheet - Volume 2 of 2

  Interesting Linux commits:
  - https://github.com/torvalds/linux/commit/9424693035a57961a8eb09e96aab315a7096535d
  - https://github.com/torvalds/linux/commit/2a7a0e9bf7b32e838d873226808ab8a6c00148f7

  The peculiarity of the new chipsets is that the watchdog resources are
  configured in PCI registers of SMBus controller and Power Management
  function as opposed to the LPC bridge.  I took a simplistic approach of
  querying the resources from the respective PCI devices.  ichwd is still
  a device on isa bus.  The PCI devices are found by their slot and
  function defined in the datasheets as siblings of the upstream LPC
  bridge.

  There are some shortcuts and missing features.

  First of all, I have not implemented the functionality required to clear
  the no-reboot bit.  That would require writing to a special PCI
  configuration register of a hidden / invisible PCI device after which
  the device would start responding to accesses to other registers.  The
  no-reboot bit was not set on my test hardware, so I decided to leave its
  handling for the later time.

  Also, I did not try to handle the case where the watchdog resources are
  not configured by the hardware as well as the case where ACPI defined
  operational region conflicts with the watchdog resources.  My test
  system did not have either of those problem, so, again, I decided to
  leave those cases until later.
  See this Linux commit for some details of the ACPI problem:
  https://github.com/torvalds/linux/commit/a7ae81952cdab56a1277bd2f9ed7284c0f575120

  Finally, I have added only the PCI ID found on my test system.  I think
  that more IDs can be added as the change gets tested.

  Tested on Dell PowerEdge R740.

  PR:		222079
  Reviewed by:	mav, kib
  MFC after:	3 weeks
  Relnotes:	maybe
  Sponsored by:	Panzura
  Differential Revision: https://reviews.freebsd.org/D17585

Changes:
  head/sys/dev/ichwd/ichwd.c
  head/sys/dev/ichwd/ichwd.h
Comment 6 t_uemura 2018-11-01 04:14:26 UTC
The patch (with the relevant PCI device ID added) can be applied cleanly
against 11.2-STABLE as of 28th Sep. on my Shuttle DS77U mini PC (Sunrise
Point-LP chipset), compiled with no error, and the patched ichwd now
is able to probe/attach the watchdog timer as follows.

ichwd0: <Sunrise Point-LP watchdog timer> at port 0x400-0x41f on isa0

Though, the timer never fires by killing the running watchdogd daemon,
as it should.

/etc/rc.d/watchdogd onestart
sleep 10
killall -KILL watchdogd

I can see the following two verbose lines when I start watchdogd, just
for information.

ichwd0: timer enabled
ichwd0: timeout set to 229 ticks

Is this related to the missing no-reboot bit handling?

P.S. The added PCI device ID is 0x9d23 .
Comment 7 Andriy Gapon freebsd_committer freebsd_triage 2018-11-01 06:36:46 UTC
(In reply to t_uemura from comment #6)

I think that the missing no-reboot handling is the most likely issue here.
If you are interested, this is what Linux does for that bit:
https://github.com/torvalds/linux/blob/9424693035a57961a8eb09e96aab315a7096535d/drivers/i2c/busses/i2c-i801.c#L1201

On FreeBSD/x86 pci_cfgregread() and pci_cfgregwrite() [see sys/x86/include/pci_cfgreg.h] are functions that can be used to directly access the hardware even if it's hidden from the pci bus driver.
Comment 8 t_uemura 2018-11-06 04:48:17 UTC
Created attachment 198998 [details]
Attempt to clear NO_REBOOT bit on tco_version==4 hardware
Comment 9 t_uemura 2018-11-06 04:54:58 UTC
Andriy

Thank you for your pointer. I just added some lines to try to clear
NO_REBOOT bit but no success.

The readout of SMBus PCR GC register (offset 0xc6000c of SBREG_BAR) is
always 0xffffffff and I can't change the value. I don't know whether
this is because of programming error or WDT is disabled completely.
Comment 10 t_uemura 2018-11-06 06:11:50 UTC
Reply to myself. My hardware seems to be a culprit. The watchdog
doesn't fire also on Linux even it is probed and attached successfully.
The Linux test detail is described in the following page.

http://www.madore.org/~david/linux/iTCO-wdt-test.html
Comment 11 Andriy Gapon freebsd_committer freebsd_triage 2018-11-06 10:11:17 UTC
(In reply to t_uemura from comment #8)

Thank you very much for the patch!

Does it work as far as device attachment goes?
I mean, are you able to read the BAR and to create the resource based on it?
What is the BAR that you get?

My own preference would be NOT to create a device_t handle for 31:1.  I would instead just set the resource described by the BAR on the ichwd device.
Comment 12 Andriy Gapon freebsd_committer freebsd_triage 2018-11-06 10:12:50 UTC
(In reply to t_uemura from comment #10)
I see...
Maybe the BIOS on your system configures the watchdog to generate an SMI and locks that configuration.  Maybe something else.
Thank you for the investigation and the patch.
Comment 13 commit-hook freebsd_committer freebsd_triage 2018-11-06 13:55:15 UTC
A commit references this bug:

Author: avg
Date: Tue Nov  6 13:54:25 UTC 2018
New revision: 340182
URL: https://svnweb.freebsd.org/changeset/base/340182

Log:
  MFC r339591: ichwd: add support for TCO watchdog timer in Lewisburg PCH (C620)

  PR:		222079
  Relnotes:	maybe
  Sponsored by:	Panzura

Changes:
_U  stable/11/
  stable/11/sys/dev/ichwd/ichwd.c
  stable/11/sys/dev/ichwd/ichwd.h
Comment 14 Dries Michiels freebsd_committer freebsd_triage 2018-11-06 16:26:56 UTC
Hi, Can this be comitted to stable 12 too? Or is this already in? Thanks
Comment 15 commit-hook freebsd_committer freebsd_triage 2018-11-06 17:31:22 UTC
A commit references this bug:

Author: avg
Date: Tue Nov  6 17:31:10 UTC 2018
New revision: 340190
URL: https://svnweb.freebsd.org/changeset/base/340190

Log:
  MFC r339591: ichwd: add support for TCO watchdog timer in Lewisburg PCH (C620)

  PR:		222079
  Approved by:	re (rgrimes)
  Relnotes:	maybe
  Sponsored by:	Panzura

Changes:
_U  stable/12/
  stable/12/sys/dev/ichwd/ichwd.c
  stable/12/sys/dev/ichwd/ichwd.h
Comment 16 t_uemura 2018-11-07 00:53:57 UTC
Andriy

> I mean, are you able to read the BAR and to create the resource based on it?
> What is the BAR that you get?

With my patch applied, I can read SBREG_BAR by pci_cfgregread() calls.
For my hardware, the BAR value is 0xfd000000, so the SMBus PCR GC
register should be at 0xfdc6000c .

In order to access to the pointed location, I think I must call
bus_alloc_resource() with relevant device_t specified as its first
argument.

By specifying dev or something like device_get_parent(dev), all calls
fail to allocate resource. I think dev must be PCI one, for
pci_alloc_resource() to be called.

> My own preference would be NOT to create a device_t handle for 31:1.  I would
> instead just set the resource described by the BAR on the ichwd device.

Sure. If there is an API without needing device_t, I would use that.
Anyway, I've got the following verbose message with my patch applied.

ichwd0: <Sunrise Point-LP watchdog timer> at port 0x400-0x41f on isa0
pcib0: allocated type 4 (0x400-0x41f) for rid 0 of ichwd0
pcib0: allocated type 4 (0x1830-0x1837) for rid 1 of ichwd0
found-> vendor=0x8086, dev=0x9d20, revid=0x21
	domain=0, bus=0, slot=31, func=1
	class=05-80-00, hdrtype=0x00, mfdev=0
	cmdreg=0x0006, statreg=0x0000, cachelnsz=0 (dwords)
	lattimer=0x00 (0 ns), mingnt=0x00 (0 ns), maxlat=0x00 (0 ns)
	map[10]: type Memory, range 64, base rxfd000000, size 24, enabled
pcib0: allocated type 3 (0xfd000000-0xfdffffff) for rid 10 of pci0:0:31:1
ichwd0: ICH WDT present but disabled in BIOS or hardware
device_attach: ichwd0 attach returned 6
Comment 17 t_uemura 2018-11-07 05:49:43 UTC
I noticed that my ThinkPad T460s also had a Sunrise Point-LP PCH. The
Linux test worked correctly (successfully rebooted) on this ThinkPad,
though my patched ichwd didn't.

Now I think there is programming error in my patch, maybe around
mapping SMBus PCR GC register (around line 783 in patched ichwd.c) or
accessing the register (around line 505).
Comment 18 Andriy Gapon freebsd_committer freebsd_triage 2018-11-07 06:57:40 UTC
(In reply to t_uemura from comment #16)

See how ichwd steals / borrows a resource based at ICH_TCOBASE.
You would need to call bus_set_resource() before you can call bus_alloc_resource().  Also, you would need to invent a unique rid for the new resource and consistently use it in both set and alloc.
Comment 19 Andriy Gapon freebsd_committer freebsd_triage 2018-11-07 07:03:59 UTC
(In reply to t_uemura from comment #17)
I do not see any obvious mistakes in your patch _except_ for using a resource associated with a PCI device after hiding that PCI device.  It's possible that the resource becomes invalid.
Comment 20 Andriy Gapon freebsd_committer freebsd_triage 2018-11-07 07:08:08 UTC
(In reply to t_uemura from comment #16)
Oh, based on this
> pcib0: allocated type 3 (0xfd000000-0xfdffffff) for rid 10 of pci0:0:31:1
I suspect that ichwd_read_p2sb_4 / ichwd_write_p2sb_4 operate on address 0xfd000000 instead of 0xfdc6000c.  Resource allocation is hard.
Comment 21 t_uemura 2018-11-08 06:18:46 UTC
Created attachment 199074 [details]
(Maybe) fixed resource allocation.
Comment 22 t_uemura 2018-11-08 06:47:55 UTC
Andriy

By calling bus_set_resource() instantly panics my system. I think this
device doesn't need bus_set_resource() because the target memory
range is mapped correctly in pci_add_map(), perhaps called within
pci_add_child() .

> map[10]: type Memory, range 64, base rxfd000000, size 24, enabled

I don't know why bus_alloc_resource() in my previous patch allocates
not only 0xfdc6000c ~ 0xfdc6000f but whole 0xfd000000 ~ 0xfdffffff.

In the modified patch, bus_alloc_resource() now try to allocate whole
16 MiB range, and ichwd_read_p2sb_4 / ichwd_write_p2sb_4 is accessing
offset 0xc6000c of that range.

On both Shuttle mini PC and ThinkPad, the initial readout value of
ichwd_read_p2sb_4 is 0x00000000. Now I can set/unset the NO_REBOOT
bit, so at least the resource allocation issue is fixed.

On the other hand, with or without my new patch, ThinkPad now reboots
successfully upon watchdog timeout. However Shuttle mini PC doesn't.
This may be a BIOS or even a hardware issue.
Comment 23 Andriy Gapon freebsd_committer freebsd_triage 2018-12-05 13:27:02 UTC
Created attachment 199847 [details]
my update of the NO_REBOOT patch

(In reply to t_uemura from comment #22)
Sorry for the long silence.
Could you please test the patch that I am attaching and see if it doesn't cause any regressions comparing to your patch?
Comment 24 Andriy Gapon freebsd_committer freebsd_triage 2018-12-05 13:30:39 UTC
Created attachment 199848 [details]
my update of the NO_REBOOT patch

Oops, fixed a copy+paste error.
Comment 25 t_uemura 2018-12-12 01:23:24 UTC
Sorry for my delay. Just tested but to no avail. Specifically, it
probed and attached successfully, enabled once watchdogd was started,
and never fired after the running watchdogd was killed by SIGKILL.

The verbose dmesg is as follows.

ichwd0: <Sunrise Point-LP watchdog timer> at port 0x400-0x41f iomem 0xfdc6000c-0xfdc6000f on isa0
pcib0: allocated type 4 (0x400-0x41f) for rid 0 of ichwd0
pcib0: allocated type 3 (0xfdc6000c-0xfdc6000f) for rid 1 of ichwd0
pcib0: allocated type 4 (0x1830-0x1837) for rid 2 of ichwd0
ichwd0: timer disabled
random: harvesting attach, 8 bytes (4 bits) from ichwd0
ichwd0: timer enabled
ichwd0: timeout set to 229 ticks

Note that a person at Shuttle Technical Support replied to my inquiry
that their DS77U "cannot support onboard hardware watchdog function".

On the other hand, the original ichwd surely worked on ThinkPad T460s,
so at least the corresponding device ID 0x9d23 can be added.
Comment 26 Andriy Gapon freebsd_committer freebsd_triage 2018-12-12 07:41:43 UTC
(In reply to t_uemura from comment #25)

Thank you for testing!

I didn't suppose that my change will fix the watchdog functionality on DS77U.
I just wanted to check that it works no worse than the original patch that you developed.
Comment 27 t_uemura 2018-12-12 07:57:06 UTC
OK, I'll try your patch on ThinkPad shortly.
Comment 28 t_uemura 2018-12-14 06:23:11 UTC
I just confirmed on ThinkPad T460s that ichwd with your patch applied
worked correctly.
Comment 29 Andriy Gapon freebsd_committer freebsd_triage 2018-12-14 07:38:27 UTC
(In reply to t_uemura from comment #28)
Thank you!