224059 – LOR or deadlock in em0 when resuming from suspend

Status:	Closed FIXED

Alias:	None

Product:	Base System
Classification:	Unclassified
Component:	bin
Version:	CURRENT
Hardware:	Any Any

Importance:	--- Affects Many People
Assignee:	Sean Bruno

URL:
Keywords:	IntelNetworking

Depends on:
Blocks:	233817

Reported:	2017-12-03 20:11 UTC by Niclas Zeising
Modified:	2021-04-30 03:40 UTC
CC List:	8 users

See Also:

Flags:	koobs: mfc-stable12+ koobs: mfc-stable11+

Attachments
Output from pciconf -lvbc (8.74 KB, text/plain) 2017-12-03 20:11 UTC, Niclas Zeising

Description Niclas Zeising freebsd_committer

2017-12-03 20:11:07 UTC

Created attachment 188497
Output from pciconf -lvbc

I get a deadlock, LOR, or something similar when resuming from suspend with the network cable plugged in. This is on a Lenovo Thinkpad X270.
The machine resumes fine, but after a little while (minutes, if not less) it freezes. I can feel it get warm and the fans spin up, as if the CPU is running at 100%. The lockup seems to happen once there's traffic on the NIC after the resume. Suspend/resume when not using the NIC (such as when using WiFi) works fine.
Nothing is printed on the screen when this happens; the display simply freezes as it was, with no reaction to keyboard input and nothing on the console. When I reboot, however, there is the following in /var/log/messages, which is what led me to em0.

kernel: reversal:
kernel: em0:tx(0):callo (em0:tx(0):callo) @ /usr/src/sys/kern/kern_mutex.c:182
kernel: /usr/src/sys/net/iflib.c:2143
kernel: backtrace:
kernel: #0 0xffffffff805a3e93 at witness_debugger+0x73
kernel: #1 0xffffffff805a3d12 at witness_checkorder+0xe02
kernel: #2 0xffffffff8051fd6c at __mtx_lock_flags+0x9c
kernel: #3 0xffffffff80653789 at iflib_timer+0x149
kernel: #4 0xffffffff8055856c at softclock_call_cc+0x14c
kernel: #5 0xffffffff8055892c at softclock+0x7c
kernel: #6 0xffffffff805046a9 at intr_event_execute_handlers+0x99
kernel: #7 0xffffffff80504d96 at ithread_loop+0xb6
kernel: #8 0xffffffff80501ae4 at fork_exit+0x84
kernel: #9 0xffffffff8087718e at fork_trampoline+0xe

System is:
FreeBSD garnet.daemonic.se 12.0-CURRENT FreeBSD 12.0-CURRENT #0 r325963M: Sat Nov 18 14:01:30 CET 2017     root@garnet.daemonic.se:/usr/obj/usr/src/amd64.amd64/sys/GARNET  amd64

The output from pciconf -lvbc is also attached.

Comment 1 Niclas Zeising freebsd_committer

2017-12-04 21:11:01 UTC

Updated to latest source (r326539) and the deadlock is still there.  Same trace, only difference is the line number in sys/kern/kern_mutex.c, which is 184 now.

Comment 2 Johannes Lundberg 2018-02-28 10:55:07 UTC

Here's a bug report I was just about to file when I heard about this one.
---
if_em is a loadable module.
Networking on em0 does not work after suspend/resume.
The ifconfig output hangs after the "nd6 options..." line.

procstat -ak:
---
ifconfig  -  mi_switch turnstile_wait __mtx_lock_sleep __mtx_lock_flags iflib_media_status ifmedia_ioctl ifioctl ...
---

While I was writing this on another machine, the machine with the stuck ifconfig rebooted by itself (about 5 minutes after resuming and issuing the ifconfig command).

backtrace:
#0  doadump (textdump=0) at pcpu.h:230
#1  0xffffffff81d94528 in vt_kms_postswitch () from /boot/modules.drm-v4.9/drm.ko
#2  0xffffffff80543b78 in vt_window_switch (vw=0xffffffff80c99e28) at /usr/src/sys/dev/vt/vt_core.c:563
#3  0xffffffff805412a0 in vtterm_cngrab (tm=<value optimized out>) at /usr/src/sys/dev/vt/vt_core.c:1530
#4  0xffffffff80648162 in cngrab () at /usr/src/sys/kern/kern_cons.c:370
#5  0xffffffff806a8acb in vpanic (fmt=0xffffffff80b0fac3 "%s: possible deadlock detected for %p, blocked for %d ticks\n", 
    ap=0xfffffe00407f2a00) at /usr/src/sys/kern/kern_shutdown.c:786
#6  0xffffffff806a8c03 in panic (fmt=<value optimized out>) at /usr/src/sys/kern/kern_shutdown.c:739
#7  0xffffffff806429dc in deadlkres () at /usr/src/sys/kern/kern_clock.c:242
#8  0xffffffff80669144 in fork_exit (callout=0xffffffff80642680 <deadlkres>, arg=0x0, frame=0xfffffe00407f2ac0)
    at /usr/src/sys/kern/kern_fork.c:1039
#9  0xffffffff809f9dbe in fork_trampoline () at /usr/src/sys/amd64/amd64/exception.S:843
#10 0x0000000000000000 in ?? ()
Current language:  auto; currently minimal
(kgdb)

Comment 3 Kaho Toshikazu 2018-03-15 05:03:27 UTC

(In reply to Niclas Zeising from comment #1)

Please try the patch below. The problem seems to be caused by the call to
iflib_init_locked() inside iflib_device_resume(); iflib_init_locked()
should only be invoked after iflib_stop().

The i219 probably has a problem when its PCI power state changes to D3.
You may lose the network connection after resume because the device
fails to wake up.

As a workaround, hw.pci.do_power_suspend=0 prevents the change
from D1 to D3 state, at the cost of higher energy consumption.

--- sys/net/iflib.c	(revision 330961)
+++ sys/net/iflib.c	(working copy)
@@ -4526,6 +4526,7 @@
 	if_ctx_t ctx = device_get_softc(dev);
 
 	CTX_LOCK(ctx);
+	iflib_stop(ctx);
 	IFDI_SUSPEND(ctx);
 	CTX_UNLOCK(ctx);

Comment 4 Kurt Jaeger freebsd_committer

2018-12-17 09:22:58 UTC

I have this problem on two Lenovo laptops with a 12.0-REL install, although not always after wakeup.

One is an X220; it logs this to syslog (not in all cases):

Dec 16 13:32:08 udog kernel: em0: TX(0) desc avail = 1024, pidx = 0

The other is an X201, which worked fine for years under 11.XpX.

Both have hw.pci.do_power_suspend=0 in /etc/sysctl.conf.

I'll try the patch given in comment 3 now and report back.

Comment 5 Kurt Jaeger freebsd_committer

2018-12-18 09:32:51 UTC

It looks like the patch from comment #3 fixes the problem.

Comment 6 Stephen Hurd freebsd_committer

2018-12-18 18:12:08 UTC

Could you try this patch instead?

Index: iflib.c
===================================================================
--- iflib.c	(revision 341824)
+++ iflib.c	(working copy)
@@ -4894,7 +4894,7 @@
 
 	CTX_LOCK(ctx);
 	IFDI_RESUME(ctx);
-	iflib_init_locked(ctx);
+	iflib_if_init_locked(ctx);
 	CTX_UNLOCK(ctx);
 	for (int i = 0; i < NTXQSETS(ctx); i++, txq++)
 		iflib_txq_check_drain(txq, IFLIB_RESTART_BUDGET);

Comment 7 Kurt Jaeger freebsd_committer

2018-12-18 21:51:48 UTC

Kernel is installed, tests pending. I'll get back to you in approx. 24h.

Comment 8 Kurt Jaeger freebsd_committer

2018-12-19 06:54:43 UTC

The test looks fine on the X220 with hw.pci.do_power_suspend at its default of 1.

Comment 9 commit-hook freebsd_committer

2019-01-07 23:47:46 UTC

A commit references this bug:

Author: shurd
Date: Mon Jan  7 23:46:54 UTC 2019
New revision: 342855
URL: https://svnweb.freebsd.org/changeset/base/342855

Log:
  Use iflib_if_init_locked() during resume instead of iflib_init_locked().

  iflib_init_locked() assumes that iflib_stop() has been called, however,
  it is not called for suspend.  iflib_if_init_locked() calls stop then init,
  so fixes the problem.

  This was causing errors after a resume from suspend.

  PR:		224059
  Reported by:	zeising
  MFC after:	1 week
  Sponsored by:	Limelight Networks

Changes:
  head/sys/net/iflib.c
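
For readers following the thread, the ordering issue described in the commit log can be sketched with a small user-space model. This is only an illustration, not iflib code: every `toy_` name is hypothetical, the "interface" is a single flag, and the real failure mode of iflib_init_locked() is reduced to a counter of inits performed without a preceding stop. It compares the unpatched path with the two fixes proposed in this bug (stop on suspend, as in comment 3, and stop-then-init on resume, as in r342855).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy context; it is NOT the real if_ctx_t. */
struct toy_ctx {
	bool running;    /* true between an init and the next stop */
	int  bad_inits;  /* inits performed while still running */
};

enum toy_variant {
	TOY_UNPATCHED,        /* suspend does not stop, resume only inits */
	TOY_STOP_ON_SUSPEND,  /* approach of the patch in comment 3 */
	TOY_STOP_ON_RESUME    /* approach of r342855 */
};

static void
toy_stop(struct toy_ctx *ctx)
{
	assert(ctx != NULL);
	ctx->running = false;
}

/* Models an init routine that assumes stop was already called. */
static void
toy_init_locked(struct toy_ctx *ctx)
{
	assert(ctx != NULL);
	if (ctx->running)
		ctx->bad_inits++;  /* assumption violated */
	ctx->running = true;
}

/* Models a wrapper that stops first, then inits. */
static void
toy_if_init_locked(struct toy_ctx *ctx)
{
	toy_stop(ctx);
	toy_init_locked(ctx);
}

/* Runs attach, suspend, resume; returns the number of bad inits. */
static int
toy_cycle(enum toy_variant v)
{
	struct toy_ctx ctx = { .running = false, .bad_inits = 0 };

	toy_init_locked(&ctx);            /* attach: interface comes up */

	if (v == TOY_STOP_ON_SUSPEND)     /* suspend */
		toy_stop(&ctx);

	if (v == TOY_STOP_ON_RESUME)      /* resume */
		toy_if_init_locked(&ctx);
	else
		toy_init_locked(&ctx);

	return (ctx.bad_inits);
}
```

In this model, calling `toy_cycle(TOY_UNPATCHED)` reports one init without a preceding stop, while both patched variants report none. The resume-side variant has the property that it does not depend on what the suspend path did, which may be why the committed change took that form; the thread itself does not state the reason.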

Comment 10 commit-hook freebsd_committer

2019-01-14 18:41:09 UTC

A commit references this bug:

Author: shurd
Date: Mon Jan 14 18:40:37 UTC 2019
New revision: 343024
URL: https://svnweb.freebsd.org/changeset/base/343024

Log:
  MFC r342855:

  Use iflib_if_init_locked() during resume instead of iflib_init_locked().

  iflib_init_locked() assumes that iflib_stop() has been called, however,
  it is not called for suspend.  iflib_if_init_locked() calls stop then init,
  so fixes the problem.

  This was causing errors after a resume from suspend.

  PR:		224059
  Reported by:	zeising
  Sponsored by:	Limelight Networks

Changes:
_U  stable/12/
  stable/12/sys/net/iflib.c

Comment 11 Rodney W. Grimes freebsd_committer

2019-01-15 21:13:34 UTC

Is this applicable to stable/11, or which merge would make it applicable to stable/11?  I have concerns that this regression is, or could end up, in the path of the upcoming 11.3 release.

Thanks,
Rod <RE

Comment 12 Stephen Hurd freebsd_committer

2019-01-16 19:02:08 UTC

(In reply to Rodney W. Grimes from comment #11)

It should be directly applicable to stable/11.  I'll take a closer look now.

Comment 13 commit-hook freebsd_committer

2019-01-16 19:20:44 UTC

A commit references this bug:

Author: shurd
Date: Wed Jan 16 19:20:14 UTC 2019
New revision: 343099
URL: https://svnweb.freebsd.org/changeset/base/343099

Log:
  MFC r342855:

  Use iflib_if_init_locked() during resume instead of iflib_init_locked().

  iflib_init_locked() assumes that iflib_stop() has been called, however,
  it is not called for suspend.  iflib_if_init_locked() calls stop then init,
  so fixes the problem.

  This was causing errors after a resume from suspend.

  PR:		224059
  Reported by:	zeising
  Sponsored by:	Limelight Networks

Changes:
_U  stable/11/
  stable/11/sys/net/iflib.c

Comment 14 Kurt Jaeger freebsd_committer

2019-07-08 16:22:26 UTC

Well, there's still no fix for 12.0p7. Isn't it reasonable to issue an EN for this?