Bug 258551 - e1000: "Rework em_msi_link interrupt filter" commit causes hard lock-up
Summary: e1000: "Rework em_msi_link interrupt filter" commit causes hard lock-up
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 13.0-STABLE
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: Kevin Bowling
URL:
Keywords: IntelNetworking, regression
Depends on:
Blocks:
 
Reported: 2021-09-17 00:52 UTC by t_uemura
Modified: 2021-09-28 17:35 UTC

See Also:
kbowling: mfc-stable13+
kbowling: mfc-stable12+
kbowling: mfc-stable11-


Attachments
Successful boot sequence (108.03 KB, image/jpeg)
2021-09-17 00:52 UTC, t_uemura
Hard lock-up (116.73 KB, image/jpeg)
2021-09-17 00:53 UTC, t_uemura
Partial revert msi-x, unconditional re-arms (1.14 KB, patch)
2021-09-17 16:23 UTC, Kevin Bowling
kbowling: maintainer-approval-

Description t_uemura 2021-09-17 00:52:48 UTC
Created attachment 227954
Successful boot sequence

https://cgit.freebsd.org/src/commit/sys/dev/e1000?h=stable/13&id=1fb96c59b4ce265ea94eddef5a97c7c075ceaec5

On my Shuttle DS77U Bhyve host, the above-mentioned commit 1fb96c59b4ce265ea94eddef5a97c7c075ceaec5 causes the system to freeze during boot. Simply backing out the commit solves the problem. Details follow.

* Yesterday's 13-STABLE (8895170347fcfd9c9acf413ed408f11b15760b4b)

* uname -a
  FreeBSD fwina.tesla.local 13.0-STABLE FreeBSD 13.0-STABLE #0: Thu Sep 16 18:42:17 JST 2021     root@fwina.tesla.local:/usr/obj/usr/src/amd64.amd64/sys/FWINA  amd64

* Part of pciconf -lv:
  em0@pci0:0:31:6:        class=0x020000 rev=0x21 hdr=0x00 vendor=0x8086 device=0x156f subvendor=0x8086 subdevice=0x0000
      vendor     = 'Intel Corporation'
      device     = 'Ethernet Connection I219-LM'
      class      = network
      subclass   = ethernet
  igb0@pci0:1:0:0:        class=0x020000 rev=0x03 hdr=0x00 vendor=0x8086 device=0x1539 subvendor=0x1297 subdevice=0x4052
      vendor     = 'Intel Corporation'
      device     = 'I211 Gigabit Network Connection'
      class      = network
      subclass   = ethernet

* The live LAN connection is hooked up only to igb0. em0 has no connection during the test.

* A bridge is configured because this host also serves as a traffic filter.
  autobridge_interfaces="bridge0"
  autobridge_bridge0="igb0 em0"
  cloned_interfaces="bridge0"
  ifconfig_bridge0="up"
  ifconfig_em0="up"
  ifconfig_igb0="inet 10.141.30.22 netmask 255.255.255.0 up"

* Various modules are listed in $devmatch_blacklist so that no kld is loaded during the test.

With the mentioned commit backed out, the system boots without any glitch. With the commit applied, the system freezes at the "adding net default: gateway" step, possibly while bringing up bridge0 and its member interfaces.

The attached ok.jpg shows the successful activation of bridge0, while ng.jpg is a capture of the hard lock-up. The only way to recover is to power-cycle, load the previous kernel, and run fsck.
Comment 1 t_uemura 2021-09-17 00:53:15 UTC
Created attachment 227955
Hard lock-up
Comment 2 Kevin Bowling freebsd_committer 2021-09-17 04:25:54 UTC
(In reply to t_uemura from comment #1)
Thanks, can you try backing out fc7682b17f3738573099b8b03f5628dcc8148adb instead?  I think this is your I219 not initializing.
Comment 3 t_uemura 2021-09-17 06:45:41 UTC
(In reply to Kevin Bowling from comment #2)
The same hard lock-up happened with only fc7682b17f3738573099b8b03f5628dcc8148adb backed out. The culprit must be 1fb96c59b4ce265ea94eddef5a97c7c075ceaec5.
Thanks.
Comment 4 t_uemura 2021-09-17 07:05:36 UTC
(In reply to Kevin Bowling from comment #2)
FYI, after moving the LAN connection from igb0 to em0, the
  em0: Hardware Initialization Failed
warnings disappeared.

Also note that a
  Could not read PHY page 769
warning was shown very late in the shutdown sequence when em0 had no connection.
Comment 5 Kevin Bowling freebsd_committer 2021-09-17 16:23:36 UTC
Created attachment 227963
Partial revert msi-x, unconditional re-arms

Can you try this patch?  It re-enables the link interrupts unconditionally in the fast handler instead of waiting to do it in the interrupt filter.

I am not sure how this would cause your lockup but there is little downside to reverting if this stabilizes it for you.
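
For readers unfamiliar with the term, the difference between a conditional and an unconditional re-arm can be sketched with a toy model in C. This is not the code from if_em.c or from the attached patch: the struct, the bit values, and every toy_* name are made up for illustration, and the model assumes the link cause is masked while the handler runs. It does not claim to explain the lock-up reported here; it only shows how a link interrupt can stay masked when the re-arm depends on the cause that was read.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cause bits for this toy model; not taken from the e1000 driver. */
#define TOY_CAUSE_OTHER 0x1u
#define TOY_CAUSE_LINK  0x4u

struct toy_nic {
    uint32_t cause; /* pending interrupt causes */
    uint32_t mask;  /* causes that are currently enabled */
};

/* The device raises a cause; returns true if an interrupt would be delivered. */
static bool toy_raise(struct toy_nic *nic, uint32_t cause)
{
    assert(nic != NULL);
    nic->cause |= cause;
    return (nic->cause & nic->mask) != 0;
}

/*
 * Toy handler: reads and clears the pending causes, masks the link cause
 * while it works (an assumption of this model), then re-arms it either
 * always or only when a link cause was actually seen.
 */
static void toy_handle(struct toy_nic *nic, bool unconditional_rearm)
{
    assert(nic != NULL);
    uint32_t cause = nic->cause;

    nic->cause = 0;
    nic->mask &= ~TOY_CAUSE_LINK;

    if (unconditional_rearm || (cause & TOY_CAUSE_LINK) != 0)
        nic->mask |= TOY_CAUSE_LINK;
}

/*
 * Scenario: the handler runs once for a non-link cause, then the link
 * changes. Returns true if that later link change is still delivered.
 */
static bool toy_link_seen_after_other(bool unconditional_rearm)
{
    struct toy_nic nic = { .cause = 0, .mask = TOY_CAUSE_OTHER | TOY_CAUSE_LINK };

    toy_raise(&nic, TOY_CAUSE_OTHER);
    toy_handle(&nic, unconditional_rearm);
    return toy_raise(&nic, TOY_CAUSE_LINK);
}
```

In this model, toy_link_seen_after_other(false) returns false because the link cause was never re-enabled, while toy_link_seen_after_other(true) returns true.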

One thing I am worried about in fc7682b17f3738573099b8b03f5628dcc8148adb is that it can brick I219s and their interaction with the Management Engine.  Can you try reverting that and doing a hard power cycle too?  I have the changes split out into individual patches here: https://github.com/freebsd/freebsd-src/pull/538/commits. It would be helpful if you could test them.
Comment 6 t_uemura 2021-09-23 02:43:15 UTC
(In reply to Kevin Bowling from comment #5)
Sorry for the delay.

With only the attached partial-revert MSI-X patch applied, the system sometimes booted successfully and sometimes did not. I rebuilt the kernel several times and rebooted or power-cycled nearly 20 times, but I could not determine the cause of this inconsistency.

With 1fb96c59b4ce265ea94eddef5a97c7c075ceaec5 backed out, the system always worked as expected regardless of
  a) live LAN connection on em0 or igb0, and
  b) reboot or full powercycle.

Furthermore, with both 1fb96c59b4ce265ea94eddef5a97c7c075ceaec5 and fc7682b17f3738573099b8b03f5628dcc8148adb backed out, the system still worked as expected, same as above.

The cause of my hard lock-up may lie in a part of 1fb96c59b4ce265ea94eddef5a97c7c075ceaec5 that the partial-revert MSI-X patch does not touch. At least fc7682b17f3738573099b8b03f5628dcc8148adb seems to be harmless.
Comment 7 Kevin Bowling freebsd_committer 2021-09-23 21:09:00 UTC
(In reply to t_uemura from comment #6)
Thanks, can you try this instead? https://reviews.freebsd.org/D32087
Comment 8 t_uemura 2021-09-24 09:43:11 UTC
(In reply to Kevin Bowling from comment #7)
I tried two installkernel runs, two reboots, and four full power cycles. Only two boots succeeded, both after a full power cycle; the other four attempts ended in lock-ups.
Comment 9 Kevin Bowling freebsd_committer 2021-09-24 22:18:20 UTC
(In reply to t_uemura from comment #8)
I've updated https://reviews.freebsd.org/D32087, it should be equivalent to a revert for these cards.
Comment 10 Kevin Bowling freebsd_committer 2021-09-26 16:36:41 UTC
Are you able to test the diff https://reviews.freebsd.org/D32087?  It seems straightforward, but I would like confirmation that it fixes the issue before committing, since this bug is intermittent.
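
Why a single good boot proves little for an intermittent failure can be shown with simple arithmetic. The sketch below assumes that boots are independent trials and takes the 2-in-6 success rate from comment 8 as the chance that an unfixed kernel boots cleanly; both are assumptions for illustration, and the function name is hypothetical.

```c
#include <assert.h>

/*
 * Probability that an unfixed kernel passes n boots in a row by luck,
 * assuming each boot succeeds independently with probability p_ok.
 */
static double chance_all_pass(double p_ok, int n)
{
    double p = 1.0;

    assert(p_ok >= 0.0 && p_ok <= 1.0 && n >= 0);
    for (int i = 0; i < n; i++)
        p *= p_ok;
    return p;
}
```

With p_ok = 2.0 / 6.0, three clean boots in a row would still happen by chance about 3.7% of the time and six in a row about 0.14% of the time, so a run of clean boots across reboots and power cycles is what makes the confirmation convincing.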
Comment 11 t_uemura 2021-09-27 00:57:05 UTC
(In reply to Kevin Bowling from comment #10)
The patched kernel worked flawlessly regardless of LAN connection and boot/power cycle. This completely fixes the issue. Much appreciated. Thanks.
Comment 12 commit-hook freebsd_committer 2021-09-27 16:30:12 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=450c3f8b3d259c7eb82488319aff45f1f6554aaf

commit 450c3f8b3d259c7eb82488319aff45f1f6554aaf
Author:     Kevin Bowling <kbowling@FreeBSD.org>
AuthorDate: 2021-09-27 16:17:48 +0000
Commit:     Kevin Bowling <kbowling@FreeBSD.org>
CommitDate: 2021-09-27 16:25:58 +0000

    e1000: Re-arm link changes

    A change to MSI-X link handler was somehow causing issues on
    MSI-based em(4) NICs.

    Revert the change based on user reports and testing.

    PR:             258551
    Reported by:    Franco Fichtner <franco@opnsense.org>, t_uemura@macome.co.jp
    Reviewed by:    markj, Franco Fichtner <franco@opnsense.org>
    Tested by:      t_uemura@macome.co.jp
    MFC after:      1 day

 sys/dev/e1000/if_em.c | 22 ++++++----------------
 1 file changed, 6 insertions(+), 16 deletions(-)
Comment 13 commit-hook freebsd_committer 2021-09-28 16:58:07 UTC
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=594a25fa43049c336d6016002538cad7a5383284

commit 594a25fa43049c336d6016002538cad7a5383284
Author:     Kevin Bowling <kbowling@FreeBSD.org>
AuthorDate: 2021-09-27 16:17:48 +0000
Commit:     Kevin Bowling <kbowling@FreeBSD.org>
CommitDate: 2021-09-28 16:55:59 +0000

    e1000: Re-arm link changes

    A change to MSI-X link handler was somehow causing issues on
    MSI-based em(4) NICs.

    Revert the change based on user reports and testing.

    PR:             258551
    Reported by:    Franco Fichtner <franco@opnsense.org>, t_uemura@macome.co.jp
    Reviewed by:    markj, Franco Fichtner <franco@opnsense.org>
    Tested by:      t_uemura@macome.co.jp
    MFC after:      1 day

    (cherry picked from commit 450c3f8b3d259c7eb82488319aff45f1f6554aaf)

 sys/dev/e1000/if_em.c | 22 ++++++----------------
 1 file changed, 6 insertions(+), 16 deletions(-)
Comment 14 commit-hook freebsd_committer 2021-09-28 17:31:15 UTC
A commit in branch stable/12 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=a6640bca4827036ad9374696381513da7d9df0f9

commit a6640bca4827036ad9374696381513da7d9df0f9
Author:     Kevin Bowling <kbowling@FreeBSD.org>
AuthorDate: 2021-09-27 16:17:48 +0000
Commit:     Kevin Bowling <kbowling@FreeBSD.org>
CommitDate: 2021-09-28 17:28:54 +0000

    e1000: Re-arm link changes

    A change to MSI-X link handler was somehow causing issues on
    MSI-based em(4) NICs.

    Revert the change based on user reports and testing.

    PR:             258551
    Reported by:    Franco Fichtner <franco@opnsense.org>, t_uemura@macome.co.jp
    Reviewed by:    markj, Franco Fichtner <franco@opnsense.org>
    Tested by:      t_uemura@macome.co.jp
    MFC after:      1 day

    (cherry picked from commit 450c3f8b3d259c7eb82488319aff45f1f6554aaf)

 sys/dev/e1000/if_em.c | 22 ++++++----------------
 1 file changed, 6 insertions(+), 16 deletions(-)
Comment 15 Kevin Bowling freebsd_committer 2021-09-28 17:35:15 UTC
(In reply to t_uemura from comment #11)
Thank you very much for your report and testing!