|Summary:||LOR or deadlock in em0 when resuming from suspend|
|Product:||Base System||Reporter:||Niclas Zeising <zeising>|
|Component:||bin||Assignee:||Sean Bruno <sbruno>|
|Severity:||Affects Many People||CC:||erj, johalun0, kaho, pi, re, rgrimes, rudolphfroger, shurd|
Description Niclas Zeising 2017-12-03 20:11:07 UTC
Created attachment 188497
Output from pciconf -lvbc

I get a lock or LOR or similar when resuming from suspend with the network cable plugged in. This is on a Lenovo Thinkpad X270. The machine resumes fine, but after a little while (minutes, if not less) the machine freezes. I can feel it get warm and the fans spin, as if the CPU is working 100%. It feels like the lockup happens once there's traffic on the NIC after the resume. Suspend/resume when not using the NIC works fine (such as when using WiFi).

There is nothing on the screen when this happens, the screen just freezes in the way it was, with no reaction on keyboard input and nothing on the console. When I reboot, however, there is the following in /var/log/messages, which is what led me to em0.

kernel: reversal:
kernel: em0:tx(0):callo (em0:tx(0):callo) @ /usr/src/sys/kern/kern_mutex.c:182
kernel: /usr/src/sys/net/iflib.c:2143
kernel: backtrace:
kernel: #0 0xffffffff805a3e93 at witness_debugger+0x73
kernel: #1 0xffffffff805a3d12 at witness_checkorder+0xe02
kernel: #2 0xffffffff8051fd6c at __mtx_lock_flags+0x9c
kernel: #3 0xffffffff80653789 at iflib_timer+0x149
kernel: #4 0xffffffff8055856c at softclock_call_cc+0x14c
kernel: #5 0xffffffff8055892c at softclock+0x7c
kernel: #6 0xffffffff805046a9 at intr_event_execute_handlers+0x99
kernel: #7 0xffffffff80504d96 at ithread_loop+0xb6
kernel: #8 0xffffffff80501ae4 at fork_exit+0x84
kernel: #9 0xffffffff8087718e at fork_trampoline+0xe

System is:
FreeBSD garnet.daemonic.se 12.0-CURRENT FreeBSD 12.0-CURRENT #0 r325963M: Sat Nov 18 14:01:30 CET 2017 firstname.lastname@example.org:/usr/obj/usr/src/amd64.amd64/sys/GARNET amd64

Attached is also output from pciconf -lvbc
Comment 1 Niclas Zeising 2017-12-04 21:11:01 UTC
Updated to latest source (r326539) and the deadlock is still there. Same trace, only difference is the line number in sys/kern/kern_mutex.c, which is 184 now.
Comment 2 Johannes Lundberg 2018-02-28 10:55:07 UTC
Here's a bug report I was just about to file when I heard about this one.

---
if_em is loadable module.
Network on em0 not working after suspend/resume.
ifconfig output stuck after nd6 options... line.

procstat -ak:
---
ifconfig - mi_switch turnstile_wait __mtx_lock_sleep __mtx_lock_flags iflib_media_status ifmedia_ioctl ifioctl ...
---

While writing this on another machine, the machine with stuck ifconfig rebooted by itself (about 5 minutes after doing resume and issuing ifconfig command).

backtrace:
#0 doadump (textdump=0) at pcpu.h:230
#1 0xffffffff81d94528 in vt_kms_postswitch () from /boot/modules.drm-v4.9/drm.ko
#2 0xffffffff80543b78 in vt_window_switch (vw=0xffffffff80c99e28) at /usr/src/sys/dev/vt/vt_core.c:563
#3 0xffffffff805412a0 in vtterm_cngrab (tm=<value optimized out>) at /usr/src/sys/dev/vt/vt_core.c:1530
#4 0xffffffff80648162 in cngrab () at /usr/src/sys/kern/kern_cons.c:370
#5 0xffffffff806a8acb in vpanic (fmt=0xffffffff80b0fac3 "%s: possible deadlock detected for %p, blocked for %d ticks\n", ap=0xfffffe00407f2a00) at /usr/src/sys/kern/kern_shutdown.c:786
#6 0xffffffff806a8c03 in panic (fmt=<value optimized out>) at /usr/src/sys/kern/kern_shutdown.c:739
#7 0xffffffff806429dc in deadlkres () at /usr/src/sys/kern/kern_clock.c:242
#8 0xffffffff80669144 in fork_exit (callout=0xffffffff80642680 <deadlkres>, arg=0x0, frame=0xfffffe00407f2ac0) at /usr/src/sys/kern/kern_fork.c:1039
#9 0xffffffff809f9dbe in fork_trampoline () at /usr/src/sys/amd64/amd64/exception.S:843
#10 0x0000000000000000 in ?? ()
Current language: auto; currently minimal
(kgdb)
Comment 3 Kaho Toshikazu 2018-03-15 05:03:27 UTC
(In reply to Niclas Zeising from comment #1)

Please try the patch below. The problem seems to be caused by the call to iflib_init_locked() inside iflib_device_resume(): iflib_init_locked() should only be invoked after iflib_stop().

The i219 probably also has a problem when its PCI power state changes to D3. You may lose the network connection after resume because the device fails to wake up. As a workaround, hw.pci.do_power_suspend=0 prevents the change from D1 to D3 state, at the cost of higher energy consumption.

--- sys/net/iflib.c (revision 330961)
+++ sys/net/iflib.c (working copy)
@@ -4526,6 +4526,7 @@
     if_ctx_t ctx = device_get_softc(dev);
     CTX_LOCK(ctx);
+    iflib_stop(ctx);
     IFDI_SUSPEND(ctx);
     CTX_UNLOCK(ctx);
Comment 4 Kurt Jaeger 2018-12-17 09:22:58 UTC
I have this problem on two Lenovo laptops with a 12.0-REL install, although not always after wakeup.

One is an X220; it displays this in syslog (not in all cases):

Dec 16 13:32:08 udog kernel: em0: TX(0) desc avail = 1024, pidx = 0

The other is an X201, which worked fine for years under 11.XpX.

Both have hw.pci.do_power_suspend=0 in /etc/sysctl.conf. I'll try the patch given in comment 3 now and report back.
Comment 5 Kurt Jaeger 2018-12-18 09:32:51 UTC
It looks like the patch from comment #3 fixes the problem.
Comment 6 Stephen Hurd 2018-12-18 18:12:08 UTC
Could you try this patch instead?

Index: iflib.c
===================================================================
--- iflib.c (revision 341824)
+++ iflib.c (working copy)
@@ -4894,7 +4894,7 @@
     CTX_LOCK(ctx);
     IFDI_RESUME(ctx);
-    iflib_init_locked(ctx);
+    iflib_if_init_locked(ctx);
     CTX_UNLOCK(ctx);
     for (int i = 0; i < NTXQSETS(ctx); i++, txq++)
         iflib_txq_check_drain(txq, IFLIB_RESTART_BUDGET);
Comment 7 Kurt Jaeger 2018-12-18 21:51:48 UTC
Kernel is installed, tests pending. I'll get back to you in approx. 24h.
Comment 8 Kurt Jaeger 2018-12-19 06:54:43 UTC
Test looks fine on X220 with hw.pci.do_power_suspend on default 1.
Comment 9 commit-hook 2019-01-07 23:47:46 UTC
A commit references this bug:

Author: shurd
Date: Mon Jan 7 23:46:54 UTC 2019
New revision: 342855
URL: https://svnweb.freebsd.org/changeset/base/342855

Log:
  Use iflib_if_init_locked() during resume instead of iflib_init_locked().

  iflib_init_locked() assumes that iflib_stop() has been called, however, it is not called for suspend. iflib_if_init_locked() calls stop then init, so fixes the problem.

  This was causing errors after a resume from suspend.

  PR: 224059
  Reported by: zeising
  MFC after: 1 week
  Sponsored by: Limelight Networks

Changes:
  head/sys/net/iflib.c
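The ordering problem described in the commit log can be illustrated with a toy model. This is only a sketch under stated assumptions, not the actual iflib code: struct toy_ctx, its fields, and every toy_* function are hypothetical names invented for illustration, and the "armed timer" counter merely stands in for whatever state an init routine expects a prior stop to have cleaned up.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy context; NOT the real iflib if_ctx_t. */
struct toy_ctx {
    bool running;      /* interface marked as running */
    int  timers_armed; /* number of timers currently scheduled */
};

/* Models a stop: cancel outstanding timers and mark the interface down. */
static void toy_stop(struct toy_ctx *ctx)
{
    ctx->timers_armed = 0;
    ctx->running = false;
}

/* Models an init that assumes a stop already ran: it arms a timer
 * unconditionally, without checking for leftover state. */
static void toy_init_locked(struct toy_ctx *ctx)
{
    ctx->timers_armed += 1;
    ctx->running = true;
}

/* Models the stop-then-init ordering named in the commit log. */
static void toy_if_init_locked(struct toy_ctx *ctx)
{
    toy_stop(ctx);
    toy_init_locked(ctx);
    /* Postcondition in this model: exactly one timer after re-init. */
    assert(ctx->timers_armed == 1);
}

/* Resume path that re-initialises without stopping first. */
static int toy_resume_without_stop(struct toy_ctx *ctx)
{
    toy_init_locked(ctx);
    return ctx->timers_armed;
}

/* Resume path that stops first, then initialises. */
static int toy_resume_with_stop(struct toy_ctx *ctx)
{
    toy_if_init_locked(ctx);
    return ctx->timers_armed;
}
```

In this model, a context that was still running with one timer armed when the machine suspended ends up with stale state layered under the new state if resume skips the stop, while the stop-then-init path always starts from a clean slate. Whether the real driver's failure takes exactly this shape is not established by this thread.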
Comment 10 commit-hook 2019-01-14 18:41:09 UTC
A commit references this bug:

Author: shurd
Date: Mon Jan 14 18:40:37 UTC 2019
New revision: 343024
URL: https://svnweb.freebsd.org/changeset/base/343024

Log:
  MFC r342855:
  Use iflib_if_init_locked() during resume instead of iflib_init_locked().

  iflib_init_locked() assumes that iflib_stop() has been called, however, it is not called for suspend. iflib_if_init_locked() calls stop then init, so fixes the problem.

  This was causing errors after a resume from suspend.

  PR: 224059
  Reported by: zeising
  Sponsored by: Limelight Networks

Changes:
  _U stable/12/
  stable/12/sys/net/iflib.c
Comment 11 Rodney W. Grimes 2019-01-15 21:13:34 UTC
Is this applicable to stable/11, or what merge would cause it to be applicable to stable/11? I have concerns that this regression is or could end up in the path of the upcoming 11.3 release. Thanks, Rod <RE
Comment 12 Stephen Hurd 2019-01-16 19:02:08 UTC
(In reply to Rodney W. Grimes from comment #11) It should be directly applicable to stable/11. I'll take a closer look now.
Comment 13 commit-hook 2019-01-16 19:20:44 UTC
A commit references this bug:

Author: shurd
Date: Wed Jan 16 19:20:14 UTC 2019
New revision: 343099
URL: https://svnweb.freebsd.org/changeset/base/343099

Log:
  MFC r342855:
  Use iflib_if_init_locked() during resume instead of iflib_init_locked().

  iflib_init_locked() assumes that iflib_stop() has been called, however, it is not called for suspend. iflib_if_init_locked() calls stop then init, so fixes the problem.

  This was causing errors after a resume from suspend.

  PR: 224059
  Reported by: zeising
  Sponsored by: Limelight Networks

Changes:
  _U stable/11/
  stable/11/sys/net/iflib.c
Comment 14 Kurt Jaeger 2019-07-08 16:22:26 UTC
Well, there's still no fix for 12.0p7. Isn't it reasonable to do an EN for this?