Bug 280817 - With DMAR enabled, my laptop panics
Summary: With DMAR enabled, my laptop panics
Status: Closed Unable to Reproduce
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 15.0-CURRENT
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: freebsd-bugs (Nobody)
URL:
Keywords: crash, regression
Depends on:
Blocks:
 
Reported: 2024-08-14 15:52 UTC by Warner Losh
Modified: 2024-08-28 13:00 UTC

See Also:


Attachments
dmesg log with boot from 15/08 and 30/07 for comparison (51.48 KB, text/plain)
2024-08-15 11:23 UTC, Nuno Teixeira
dmesg log with boot from 15/08 and 30/07 for comparison (cleaned) (25.44 KB, text/plain)
2024-08-15 11:43 UTC, Nuno Teixeira
dmesg main-n271681-82cb2a4158fa (13.92 KB, text/plain)
2024-08-15 23:22 UTC, Nuno Teixeira
pciconf -lv main-n271681-82cb2a4158fa (5.27 KB, text/plain)
2024-08-15 23:23 UTC, Nuno Teixeira

Description Warner Losh freebsd_committer freebsd_triage 2024-08-14 15:52:01 UTC
The recent changes to enable DMAR by default on Intel result in a panic on my system (partial stack, I can't get to it easily):

null pointer dereference in dmar_match_by_path because unit is NULL.

trap
dmar_match_by_path() +0x20
dmar_find()+0x185
iommu_get_dma_tag()
acpi_pci_get_dma_tag()
xhci_init()
xhci_pci_attach()
...

Adding a workaround to return false when unit == NULL in dmar_match_by_path results in a system without all its interrupts, so it doesn't boot. Lots of
CPU0: local APIC error 0x40

There may be other errors in the log, but my keyboard is jammed when it breaks to debugger, so I can't scroll back, or ask for dmesg from the debugger.

Only 'hw.dmar.enable=0' in loader.conf offers any relief.
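
For illustration, the workaround mentioned above (returning false when unit == NULL) amounts to a guard of roughly the following shape. This is a minimal standalone sketch, not the kernel code: struct unit_sketch, its segment field, and match_by_path_sketch() are hypothetical stand-ins for the real driver types and function.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the driver's per-unit state. */
struct unit_sketch {
	int segment;	/* PCI segment (domain) this unit covers; assumed field */
};

/*
 * Return false instead of dereferencing a unit that was never set up.
 * Otherwise compare the unit's segment with the device's domain.
 */
bool
match_by_path_sketch(const struct unit_sketch *unit, int dev_domain)
{
	if (unit == NULL)
		return (false);
	return (unit->segment == dev_domain);
}
```

As noted above, a guard like this only avoids the NULL dereference; the system still did not boot with it in place.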
Comment 1 Warner Losh freebsd_committer freebsd_triage 2024-08-14 15:53:32 UTC
This is an 8th-generation i7 Lenovo YOGA.
Comment 2 Konstantin Belousov freebsd_committer freebsd_triage 2024-08-14 21:18:29 UTC
If unit is not set, it is either a BIOS bug, or something prevented attach from
finishing.  In either case, there should be some messages in the (verbose)
dmesg giving a hint.

Do you have AMT on this machine?  It might work as a serial console, to catch
boot-time messages.
Comment 3 Konstantin Belousov freebsd_committer freebsd_triage 2024-08-14 21:28:07 UTC
You might also disable interrupt remapping.  Then the driver should only attach,
without affecting either DMA or interrupt operations, and the system should boot.
Then we can get dmesg and see why attach (?) failed.
Comment 4 Nuno Teixeira freebsd_committer freebsd_triage 2024-08-15 11:23:13 UTC
Created attachment 252773 [details]
dmesg log with boot from 15/08 and 30/07 for comparison

Laptop amd64:
Lenovo Legion 5 Intel
(Legion 5-15IMH05 (Lenovo) - Type 82AU)

Upgrading from around 30/07 -> 15/08 boot is OK and I've noticed:

---
nda0: Serial Number S4DZNFDMAR0: <unknown dev>:pci0:30:7 sid f7 fault acc 0 adt 0x0 reason 0x25 addr 100000000000000
DMAR0: <unknown dev>:pci0:30:7 sid f7 fault acc 0 adt 0x0 reason 0x25 addr 100000000000000
DMAR0: <unknown dev>:pci0:30:7 sid f7 fault acc 0 adt 0x0 reason 0x25 addr 400000000000000
DMAR0: <unknown dev>:pci0:30:7 sid f7 fault acc 0 adt 0x0 reason 0x25 addr 500000000000000
---
Comment 5 Nuno Teixeira freebsd_committer freebsd_triage 2024-08-15 11:26:24 UTC
(In reply to Nuno Teixeira from comment #4)
(...)

Also shows:

dmar0: <DMA remap> iomem 0xfed91000-0xfed91fff on acpi0
Comment 6 Nuno Teixeira freebsd_committer freebsd_triage 2024-08-15 11:43:10 UTC
Created attachment 252774 [details]
dmesg log with boot from 15/08 and 30/07 for comparison (cleaned)

Removed the duplicate boots, leaving a log with just 2 boots for comparison.
Comment 7 Konstantin Belousov freebsd_committer freebsd_triage 2024-08-15 15:30:14 UTC
(In reply to Nuno Teixeira from comment #6)
I do not understand what you want to say there.
Is your machine bootable after the update, or not?

Regardless, what is the device at pci0:30:7? Use pciconf -lv to identify it.
The culprit is that the device is issuing compat-mode MSI(-X) interrupt
messages, which are aborted by DMAR and reported as faults.  This is expected
and really is the purpose of DMAR.  The only question is why the device does
that.
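
As an aside on reading these fault lines: the sid value is a PCI requester ID, which packs bus, device and function into 16 bits (bus in bits 15:8, device in bits 7:3, function in bits 2:0). A minimal sketch of the decoding follows; the names struct pci_rid_sketch and decode_sid() are hypothetical and not taken from the driver.

```c
#include <stdint.h>

/* Hypothetical holder for a decoded PCI requester ID. */
struct pci_rid_sketch {
	unsigned bus;	/* bits 15:8 */
	unsigned slot;	/* bits 7:3 (device number) */
	unsigned func;	/* bits 2:0 */
};

/* Split a 16-bit requester ID ("sid" in the fault line) into its parts. */
struct pci_rid_sketch
decode_sid(uint16_t sid)
{
	struct pci_rid_sketch r;

	r.bus = (sid >> 8) & 0xff;
	r.slot = (sid >> 3) & 0x1f;
	r.func = sid & 0x7;
	return (r);
}
```

For sid f7 this gives bus 0, device 30, function 7, which matches the pci0:30:7 printed in the fault line.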
Comment 8 Nuno Teixeira freebsd_committer freebsd_triage 2024-08-15 15:42:15 UTC
(In reply to Konstantin Belousov from comment #7)

My laptop boots fine as I mentioned earlier:

> Upgrading from around 30/07 -> 15/08 boot is OK ...

Maybe these dmesg logs are useful.
Comment 9 Nuno Teixeira freebsd_committer freebsd_triage 2024-08-15 15:45:49 UTC
(In reply to Konstantin Belousov from comment #7)

> Regardless, what is the device at pci0:30:7?

I'm not at the laptop right now, but I'm using passthru on the Intel wireless card so it can be used in a Windows 11 bhyve guest. Later I will check pciconf and do a test running win11/bhyve.
Comment 10 Konstantin Belousov freebsd_committer freebsd_triage 2024-08-15 15:56:25 UTC
(In reply to Nuno Teixeira from comment #9)
You cannot combine DMAR and bhyve pass-through right now.
Comment 11 Nuno Teixeira freebsd_committer freebsd_triage 2024-08-15 16:04:19 UTC
(In reply to Konstantin Belousov from comment #10)

Ok, good to know. I will disable it.
Comment 12 Nuno Teixeira freebsd_committer freebsd_triage 2024-08-15 23:22:26 UTC
Created attachment 252790 [details]
dmesg main-n271681-82cb2a4158fa

dmesg main-n271681-82cb2a4158fa

<snip>
nda0 at nvme0 bus 0 scbus1 target 0 lun 1
nda0: <SAMSUNG MZVLB1T0HBLR-000L2 3L1QEXF7 S4DZNF0N126179>
nda0: Serial Number S4DZNFDMAR0: <unknown dev>:pci0:30:7 sid f7 fault acc 0 adt 0x0 reason 0x25 addr 100000000000000
DMAR0: <unknown dev>:pci0:30:7 sid f7 fault acc 0 adt 0x0 reason 0x25 addr 100000000000000
0N126179
nda0: nvme version 1.3
nda0: 976762MB (2000409264 512 byte sectors)
<snip>
Comment 13 Nuno Teixeira freebsd_committer freebsd_triage 2024-08-15 23:23:06 UTC
Created attachment 252791 [details]
pciconf -lv main-n271681-82cb2a4158fa

pciconf -lv main-n271681-82cb2a4158fa
Comment 14 Nuno Teixeira freebsd_committer freebsd_triage 2024-08-15 23:25:51 UTC
(In reply to Konstantin Belousov from comment #10)

Hello,

Just removed passthru from loader.conf.

Maybe this warning/error could be important, as described in #c12.
I've uploaded pciconf and there is no pci0:30:7 in there...

Thanks
Comment 15 Nuno Teixeira freebsd_committer freebsd_triage 2024-08-15 23:32:30 UTC
(In reply to Nuno Teixeira from comment #14)

(...)

Something was messing with the nda serial number:

* from 30/07:
nda0: <SAMSUNG MZVLB1T0HBLR-000L2 3L1QEXF7 S4DZNF0N126179>
nda0: Serial Number S4DZNF0N126179

* from today:
nda0: <SAMSUNG MZVLB1T0HBLR-000L2 3L1QEXF7 S4DZNF0N126179>
nda0: Serial Number S4DZNFDMAR0: <unknown dev>:pci0:30:7 sid f7 fault acc 0 adt 0x0 reason 0x25 addr 100000000000000
Comment 16 Nuno Teixeira freebsd_committer freebsd_triage 2024-08-15 23:43:58 UTC
(In reply to Nuno Teixeira from comment #15)

(...)

It seems that some dmesg log lines overlapped.
Comment 17 Konstantin Belousov freebsd_committer freebsd_triage 2024-08-17 15:15:49 UTC
(In reply to Warner Losh from comment #0)
What is the source line where the trap occurs?
Comment 18 Konstantin Belousov freebsd_committer freebsd_triage 2024-08-20 14:47:14 UTC
Try https://reviews.freebsd.org/D46382
This should stop the panic (I hope), but I still need a verbose dmesg to understand
why the attach is failing.
Comment 19 commit-hook freebsd_committer freebsd_triage 2024-08-20 15:50:50 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=0875f3cd74b2f305e82bff4e640c89f891ca84f8

commit 0875f3cd74b2f305e82bff4e640c89f891ca84f8
Author:     Ed Maste <emaste@FreeBSD.org>
AuthorDate: 2024-08-20 15:43:11 +0000
Commit:     Ed Maste <emaste@FreeBSD.org>
CommitDate: 2024-08-20 15:49:25 +0000

    Revert "x86: Enable Intel DMAR by default"

    A number of people have reported panics with it enabled by default,
    possibly due to broken ACPI tables, which we do not handle well. D46382
    is a potential fix for this issue.

    Additionally DMAR is currently not compatible with bhyve passthrough
    (see comment #10 in PR280817), with a draft patch to address that in
    D25672.

    Revert to disabling DMAR by default pending the resolution of those two
    issues.

    This reverts commit 3192fc30230ae432b80cca783abc2dbea9d3f383.

    PR:             280817
    Sponsored by:   The FreeBSD Foundation

 sys/x86/iommu/intel_drv.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)
Comment 20 Cy Schubert freebsd_committer freebsd_triage 2024-08-20 17:47:53 UTC
hw.dmar.enable="0" in loader.conf is the quick and dirty workaround.
Comment 21 Mark Johnston freebsd_committer freebsd_triage 2024-08-20 18:15:25 UTC
Enabling DMAR also breaks suspend-to-S3 on my framework 13.  During resume, the screen doesn't turn on and I can hear fans spinning.
Comment 22 Ed Maste freebsd_committer freebsd_triage 2024-08-20 18:21:14 UTC
(In reply to Mark Johnston from comment #21)
> Enabling DMAR also breaks suspend-to-S3 on my framework 13.

Oh yeah, from Val Packett: https://reviews.freebsd.org/D22642
Comment 23 commit-hook freebsd_committer freebsd_triage 2024-08-21 15:25:03 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=45543d3424d46f84a5399879e190fc359dcefbd4

commit 45543d3424d46f84a5399879e190fc359dcefbd4
Author:     Konstantin Belousov <kib@FreeBSD.org>
AuthorDate: 2024-08-20 14:41:33 +0000
Commit:     Konstantin Belousov <kib@FreeBSD.org>
CommitDate: 2024-08-21 15:23:07 +0000

    DMAR: clear dmar_devs[unit] if attach failed

    This should stop attempts to use a unit which was not completely
    initialized, but referenced by ACPI DMAR table during scoped devices
    operations.

    PR:     280817
    Sponsored by:   Advanced Micro Devices (AMD)
    Sponsored by:   The FreeBSD Foundation
    MFC after:      1 week
    Differential revision:  https://reviews.freebsd.org/D46382

 sys/x86/iommu/intel_drv.c | 11 +++++++++++
 1 file changed, 11 insertions(+)
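
The idea named in the commit subject can be illustrated with a small standalone sketch. This is not the committed change: unit_table, struct unit_sketch, attach_sketch() and find_sketch() are hypothetical names, and the only assumption is that a unit is published in a lookup table before its initialization has finished, so the slot has to be cleared again if attach fails.

```c
#include <stddef.h>

#define	MAX_UNITS_SKETCH	4	/* arbitrary size for the sketch */

/* Hypothetical per-unit state. */
struct unit_sketch {
	int id;		/* index into unit_table */
	int attached;	/* set once initialization completed */
};

/* Hypothetical lookup table indexed by unit number. */
static struct unit_sketch *unit_table[MAX_UNITS_SKETCH];

/*
 * Publish the unit, then run initialization (simulated by init_ok).
 * On failure, clear the slot so later lookups cannot return a
 * half-initialized unit.
 */
int
attach_sketch(struct unit_sketch *u, int init_ok)
{
	if (u == NULL || u->id < 0 || u->id >= MAX_UNITS_SKETCH)
		return (-1);
	unit_table[u->id] = u;
	if (!init_ok) {
		unit_table[u->id] = NULL;
		return (-1);
	}
	u->attached = 1;
	return (0);
}

/* Look up a unit; returns NULL for unknown or failed units. */
struct unit_sketch *
find_sketch(int id)
{
	if (id < 0 || id >= MAX_UNITS_SKETCH)
		return (NULL);
	return (unit_table[id]);
}
```

With the slot cleared on failure, callers that already handle a NULL lookup result never see a unit whose attach did not complete.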
Comment 24 commit-hook freebsd_committer freebsd_triage 2024-08-28 00:41:10 UTC
A commit in branch stable/14 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=d66c4853b84002c064bc314a0824a8667a0089c6

commit d66c4853b84002c064bc314a0824a8667a0089c6
Author:     Konstantin Belousov <kib@FreeBSD.org>
AuthorDate: 2024-08-20 14:41:33 +0000
Commit:     Konstantin Belousov <kib@FreeBSD.org>
CommitDate: 2024-08-28 00:26:33 +0000

    DMAR: clear dmar_devs[unit] if attach failed

    PR:     280817

    (cherry picked from commit 45543d3424d46f84a5399879e190fc359dcefbd4)

 sys/x86/iommu/intel_drv.c | 11 +++++++++++
 1 file changed, 11 insertions(+)
Comment 25 Ed Maste freebsd_committer freebsd_triage 2024-08-28 01:39:11 UTC
DMAR has now been disabled by default (i.e. hw.dmar.enable=0). It would be good to confirm that, after Kostik's 45543d3424d4, the panic does not occur with DMAR re-enabled.
Comment 26 Ed Maste freebsd_committer freebsd_triage 2024-08-28 13:00:00 UTC
Presumed addressed by Kostik's change; Warner's laptop that demonstrated this issue is no longer functional.

Others can test with hw.dmar.enable=1 set and submit a PR if additional issues are encountered. I'll see about posting a call for testing to -CURRENT.