Bug 262765 - Random lockups, data loss, and poor I/O and sound quality after 95edb10b47fc1a919cd1687aaf16be9e14456c89
Status: Closed DUPLICATE of bug 262150
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: CURRENT
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: freebsd-bugs (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-03-24 20:48 UTC by tod.jackson
Modified: 2022-03-31 02:50 UTC

See Also:


Attachments
Probably relevant dmesg (88.91 KB, text/plain)
2022-03-30 21:32 UTC, tod.jackson

Description tod.jackson 2022-03-24 20:48:18 UTC
This is way beyond my level, and one of the reasons I didn't want to move beyond 13.0.

Reverting the commit "LinuxKPI: implement dma_sync_single_for_*, apply to (un)map single/sg" fixes all sorts of problems for me, but it's a big hammer that probably breaks things for everyone else. I have no idea who the culprit is.

I first started having troubles in Linux a few years ago, and finally narrowed it down here. It's entirely possible my firmware is broken, but is there anything we can do?

My drm-kmod is also devoid of panic(), but unfortunately this doesn't manifest as panics. I had to make some stuff up or return (-ENOMEM) to accommodate these changes, but it's nothing of interest.

This is really complicated because multiple drivers are trying to manage memory owned by the firmware, and they don't cooperate.

I found my workaround, and it solves what has been a mystery for several years, but maybe we can do better.

I don't even know what kind of quirk this could be. If I had to guess, the relevant parts are dma_sync_single_for_cpu and cache flushing.

This is from some i915 documentation:

Now the pagetables are a bit tricky. In the end, they're all in system memory, but there are a few hoops to jump through to get at them. The GTT pagetables has just one level, so with a 4 byte entry size we need 2MB of contiguous pagetable space. The firmware allocates that for us from stolen memory (that is, a part of the system memory that is not listed in the e820 map, so it's not managed by the Linux kernel). But we write these PTEs through an alias in the register mmio bar! The reason for that is to allow the SA to invalidate TLBs. Note, though, that this only invalidates TLBs for cpu access. Any other access to the GTT (such as from the GT or the display block) has its own rules for TLB invalidation. Also, on recent generations we need to (depending upon circumstances) manually invalidate the SA TLB by writing to a magic register. To speed up map/unmap operations, we map that GTT PTE aliasing region in the mmio with wc (if this is possible, which means the cpu needs to support PAT).

A lot of this is just stubbed or nonexistent right now, notably runtime PM and the more complicated GT/engine bits. And we really have no idea what the Nvidia driver is doing, aside from trying and failing to write in write-protected regions. I took this upstream, but nobody really cares because they don't want to deal with a proprietary blob.

scbus0 on ahcich1 bus 0:
<TOSHIBA MQ02ABD100H HEF01D>       at scbus0 target 0 lun 0 (pass0,ada0)
<>                                 at scbus0 target -1 lun ffffffff ()
scbus1 on ahciem0 bus 0:
<AHCI SGPIO Enclosure 2.00 0001>   at scbus1 target 0 lun 0 (pass1,ses0)
<>                                 at scbus1 target -1 lun ffffffff ()
scbus-1 on xpt0 bus 0:
<>                                 at scbus-1 target -1 lun ffffffff (xpt0)

I can provide any relevant information, but I don't fully understand the problem. I'm on a few day old CURRENT with evadot's drm-subtree on top of it, but I don't think my drm-kmod grabs anything of interest from there.
Comment 1 tod.jackson 2022-03-30 21:32:59 UTC
Created attachment 232830
Probably relevant dmesg

Nothing explodes after resume, but I have to reboot to restore network functionality. Note I was having problems with XHCI in 13.0 too. I don't think this is specifically an iwlwifi bug.
Comment 2 Graham Perrin freebsd_committer freebsd_triage 2022-03-30 22:16:23 UTC
(In reply to tod.jackson from comment #1)

> … reboot to restore network functionality. …

Does /etc/netstart work as well in this situation?
Comment 3 tod.jackson 2022-03-30 22:37:30 UTC
I had to do a number of amateur hacks to avoid the original problem, so the device is toast at that point.

iwlwifi0: Error sending TXPATH_FLUSH: enqueue_hcmd failed: -5
iwlwifi0: Failed to send flush command (-5)
iwlwifi0: flush request fail
iwlwifi0: Error sending TIME_EVENT_CMD: enqueue_hcmd failed: -5
iwlwifi0: Couldn't send TIME_EVENT_CMD: -5
wlan0: link state changed to DOWN
iwlwifi0: fail to flush all tx fifo queues Q 4
iwlwifi0: Queue 4 is active on fifo 2 and stuck for 10000 ms. SW [97, 126] HW [90, 90] FH TRB=0x05a5a5a5a
iwlwifi0: Error sending MAC_CONTEXT_CMD: enqueue_hcmd failed: -5
iwlwifi0: Failed to send MAC context (action:2): -5
iwlwifi0: Error sending REPLY_BEACON_FILTERING_CMD: enqueue_hcmd failed: -5
WARNING ret && !!!(({ __typeof(((volatile const unsigned long *)(&mvm->status))[((IWL_MVM_STATUS_HW_RESTART_REQUESTED) / 64)]) __var = ({ __asm__ __volatile__("": : :"memory"); (*(volatile __typeof(((volatile const unsigned long *)(&mvm->>
iwlwifi0: Error sending ADD_STA: enqueue_hcmd failed: -5
iwlwifi0: lkpi_iv_newstate: error -5 during state transition 5 (RUN) -> 0 (INIT)
iwlwifi0: Error sending SCD_QUEUE_CFG: enqueue_hcmd failed: -5
iwlwifi0: Failed to disable queue 1 (ret=-5)
iwlwifi0: Error sending REMOVE_STA: enqueue_hcmd failed: -5
iwlwifi0: Failed to remove station. Id=1
iwlwifi0: Failed sending remove station
iwlwifi0: iwl_trans_send_cmd bad state = 0
iwlwifi0: Failed to synchronize multicast groups update
Comment 4 tod.jackson 2022-03-31 02:50:32 UTC
I'm going to give the benefit of the doubt and assume https://github.com/freebsd/freebsd-src/commit/ecb691143d2e4fe1100086ff73b5687c0cb29963 fixes it in releng. 

CURRENT now defines a version that is higher than the driver I'm stuck on, so I won't be of much more help. There's also the problem of (proprietary) drivers exposing Linux functionality that we don't fully implement.

I think another issue at play here is witness triggering a panic on "possibly incorrect MADV_DONTNEED", which doesn't affect releng.

So as not to dissuade people from testing, I'm going to close this as a bad interaction in the ieee80211 code. Technically the only issue I'm still able to reproduce appears to be a duplicate. Oh well. At least the past few weeks were kind of fun.

*** This bug has been marked as a duplicate of bug 262150 ***