Bug 227213

Summary: FreeBSD 10.4 kernel deadlocks on sysctlmemlock
Product: Base System
Reporter: Mark Knight <markk>
Component: kern
Assignee: freebsd-bugs (Nobody) <bugs>
Status: Closed Overcome By Events
Severity: Affects Only Me
CC: acpi, bahlgren, eugen, kib, pi, terry-freebsd
Priority: ---
Keywords: crash, needs-qa
Version: 10.4-STABLE
Flags: koobs: mfc-stable11?, koobs: mfc-stable10?
Hardware: amd64
OS: Any

Attachments (description, flags):
sysctl.conf, none
loader.conf, none
some debug info for snmpd from the crashdump, none
Kernel config file, none
thread 257/100247, none
Giant lock(?), none

Description Mark Knight 2018-04-02 20:47:10 UTC
After upgrading the kernel from FreeBSD 10.3-RELEASE-p28 to FreeBSD 10.4-RELEASE or FreeBSD 10.4-RELEASE-p7, various userland programs hang. The problem is the same with 10.3 or 10.4 userland, so simply upgrading the kernel triggers the issue.

For example, I cannot log in as myself on the console, and if I log in as root and try to su or sudo to myself, the tty hangs (CTRL-C usually does nothing, CTRL-T shows [sysctl mem]) and I have to log in via another virtual tty. Sadly, nothing appears in /var/log/all.log when the deadlock occurs.

Another command that causes a hang is "killall -HUP syslogd" when logged in as root.

For additional information, see the similarly titled thread on freebsd-stable.

Kernel boot log: http://www.knigma.org/scratch/010418.10.4.txt

I created the following kernel dump shortly after a normal boot (booting hasn't been affected so far) by immediately trying to log in as my normal user and hitting Ctrl-Alt-ESC after the tty hung. CTRL-T showed [sysctl-mem]. The kernel dump was created with "call doadump":

   http://www.knigma.org/scratch/core.txt.0
   http://www.knigma.org/scratch/vmcore.0.gz
   http://www.knigma.org/scratch/kernel.debug.gz

Motherboard is running the latest BIOS and system has been rock solid on 10.3 for months.

All advice/assistance appreciated - thank you :)
Comment 1 Eugene Grosbein freebsd_committer freebsd_triage 2018-04-02 21:20:18 UTC
As a temporary workaround, you could 1) disable snmpd, which seems to trigger this kernel problem, and 2) avoid using "sysctl -a", as it may trigger it too.

Do you have some non-standard tuning in /boot/loader.conf and/or /etc/sysctl.conf?
If so, please add them as attachments.
Comment 2 Eugene Grosbein freebsd_committer freebsd_triage 2018-04-02 21:25:31 UTC
You should also rebuild the net-snmp port or update the package, as a stale version may be the culprit if it digs into kernel data the wrong way. Upgrading it may solve your problem completely.
Comment 3 Mark Knight 2018-04-02 21:46:14 UTC
Created attachment 192121 [details]
sysctl.conf

Adding sysctl.conf as requested.
Comment 4 Mark Knight 2018-04-02 21:46:58 UTC
Created attachment 192122 [details]
loader.conf

Adding loader.conf as requested.
Comment 5 Mark Knight 2018-04-02 22:01:12 UTC
Thanks. It sounds like you have seen something similar before?

I've just tried commenting out snmpd and snmptrap from /etc/rc.conf but the problem still occurs. I haven't run sysctl -a.

In better news, I've found an even simpler way to trigger the deadlock.

1) Boot the OS into multiuser
2) Login on the console as root
3) Type ps followed by enter. ps never returns.
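
A reproduction like the one above could be checked from a script instead of by watching the console. The following is a minimal sketch, not something used in this report: the helper name returns_within and the 10-second timeout are assumptions, and whether the Python interpreter itself avoids the hanging code path on an affected system is untested.

```python
import subprocess


def returns_within(cmd, timeout_s=10.0):
    """Return True if cmd exits within timeout_s seconds, False otherwise."""
    proc = subprocess.Popen(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    try:
        proc.wait(timeout=timeout_s)
        return True
    except subprocess.TimeoutExpired:
        # Best-effort kill. Deliberately not waited on: a process blocked
        # inside the kernel may never react to the signal, and waiting for
        # it would hang this script as well.
        proc.kill()
        return False


if __name__ == "__main__":
    print("ps returned" if returns_within(["ps"]) else "ps appears hung")
```

The helper uses Popen.wait() with a timeout rather than subprocess.run(), because run() waits for the child again after killing it, which would block if the child cannot be killed.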
Comment 6 Eugene Grosbein freebsd_committer freebsd_triage 2018-04-02 22:02:03 UTC
What do you use vm.defer_swapspace_pageouts=1 for?

Note this sysctl was removed from FreeBSD 11 altogether. I would recommend trying to comment it out for a test, too.
Comment 7 Eugene Grosbein freebsd_committer freebsd_triage 2018-04-02 22:04:42 UTC
(In reply to Mark Knight from comment #5)

> Thanks. It sounds like you have seen something similar before?

No. I've just dug into your crashdump a little.

> In better news, I've found an even simpler way to trigger the deadlock.

Without snmpd running? If so, please reproduce it and generate another crashdump and make it available.
Comment 8 Eugene Grosbein freebsd_committer freebsd_triage 2018-04-03 15:58:19 UTC
Created attachment 192172 [details]
some debug info for snmpd from the crashdump

Note that snmpd is hung on sx_xlock() with insanely large req.validlen == 140737488349408
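
The decimal value is easier to judge in hexadecimal. The sketch below only does the arithmetic; the 47-bit user address space limit is an assumption about amd64, and reading the value as something that resembles a user-space address rather than a plausible buffer length is an interpretation that the thread does not confirm.

```python
validlen = 140737488349408      # req.validlen as reported in this comment
USER_VA_LIMIT = 1 << 47         # assumption: 47-bit user address space on amd64

print(hex(validlen))            # 0x7fffffffe8e0
print(validlen < USER_VA_LIMIT)  # True
print(USER_VA_LIMIT - validlen)  # 5920 bytes below the assumed limit
```

Taken as a length, the value is just under 128 TiB, which is why it stands out.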
Comment 9 Mark Knight 2018-04-03 18:04:25 UTC
You asked:

> What do you use vm.defer_swapspace_pageouts=1 for?

After moving existing disks to a new chassis I have more RAM than swap, and repartitioning would be a pain - I really don't need 16GB of swap! Without this sysctl I was finding my swap was becoming full, with lots of associated "out of swap space" errors. Since adding this sysctl that problem has ceased. I hope 11.x won't have the same problem...

As requested, here's a fresh kernel dump created as follows:

   1) Disabled snmpd and snmptrapd in /etc/rc.conf, so definitely no snmpd running
   2) reboot with 10.4 kernel
   3) Login to console as root
   4) Type ps <return>
   5) Deadlock occurs
   6) CTRL-T shows [sysctl-mem]
   7) Ctrl-Alt-ESC followed by call doadump

http://www.knigma.org/scratch/core.txt.1
http://www.knigma.org/scratch/vmcore.1.gz

Kernel is the same file as before, so no need to download that again!

Many thanks!!
Comment 10 Mark Knight 2018-04-03 18:14:35 UTC
As suggested, I just tried commenting out vm.defer_swapspace_pageouts=1. Commenting it out from /etc/sysctl.conf didn't make any difference, 10.4 still deadlocks.
Comment 11 Eugene Grosbein freebsd_committer freebsd_triage 2018-04-03 19:32:44 UTC
(In reply to Mark Knight from comment #9)

What is UID 104 here? It seems it is used for some perl process that may be triggering the problem and not snmpd.
Comment 12 Mark Knight 2018-04-03 21:10:30 UTC
UID 104 is:

slimserv:*:104:104:Slim Devices SlimServer pseudo-user:/nonexistent:/usr/sbin/nologin

ps (on a working kernel), shows:

slimserv    1117   0.0  0.9 253920 147596  -  Ss    7:13PM    0:15.34 /usr/local/bin/perl /usr/local/share/logitechmediaserver/slimserver.pl --daemon --pidfile=/var/run/logitechmediaserver/logitechmediaserver.pid --user=slimserv --group=slimserv

This is logitechmediaserver-7.9.1.g2017.07.11, from /usr/ports/audio/logitechmediaserver

I'm surprised that's mischievous - it's just a music server, but I'll try disabling it in a moment and seeing if that helps :)

The only other perl I have running as a daemon is mrtg-2.17.4_7,1 from /usr/ports/net-mgmt/mrtg.

I should mention that all my ports are up-to-date.
Comment 13 Mark Knight 2018-04-03 21:25:13 UTC
I just tried disabling logitechmediaserver and restarting. Didn't help. Then I disabled mrtg and restarted. Still didn't help.

Here's a fresh dump with neither running. Same repro steps (root login followed by ps) and same symptoms.

   http://www.knigma.org/scratch/core.txt.2
   http://www.knigma.org/scratch/vmcore.2.gz

Incidentally, ps works in single user mode, but it would be very tedious to do a binary chop to find what's tickling the problem.
Comment 14 Eugene Grosbein freebsd_committer freebsd_triage 2018-04-04 06:47:58 UTC
(In reply to Mark Knight from comment #13)

Do you use GENERIC kernel? If not, please attach your kernel config file.
Comment 15 Mark Knight 2018-04-04 07:21:17 UTC
Created attachment 192199 [details]
Kernel config file

Attached kernel config. My kernel is close to generic, extras are:

> options         IPFIREWALL
> options         IPDIVERT
> options         DUMMYNET

then added these for debug, but the problem occurred without them:

> options         KDB_UNATTENDED
> options         DDB                     # Support DDB.
> options         GDB                     # Support remote GDB.
> options         INVARIANTS              # Enable calls of extra sanity checking
> options         INVARIANT_SUPPORT       # Extra sanity checks of internal structures, required by INVARIANTS
> options         WITNESS                 # Enable checks to detect deadlocks and cycles
> options         WITNESS_SKIPSPIN        # Don't run witness on spinlocks for speed

You've probably got these from loader.conf, but I also have these:

mkn@shrewd$ kldstat
Id Refs Address            Size     Name
 1   35 0xffffffff80200000 17dbb70  kernel
 2    1 0xffffffff819dc000 215a8    geom_eli.ko
 3    2 0xffffffff819fe000 352d0    crypto.ko
 4    1 0xffffffff81a34000 23fb0    geom_mirror.ko
 5    1 0xffffffff81a58000 157c0    aio.ko
 6    1 0xffffffff81a6e000 4a70     coretemp.ko
 7    1 0xffffffff81a73000 b188     sem.ko
 8    1 0xffffffff81a7f000 26e0     cc_cubic.ko
 9    1 0xffffffff81c11000 56c6     fdescfs.ko
10    1 0xffffffff81c17000 8df4     uftdi.ko
11    1 0xffffffff81c20000 54c4     ucom.ko
12    1 0xffffffff81c26000 3d12c    linux.ko
13    2 0xffffffff81c64000 683b     linux_common.ko
14    1 0xffffffff81c6b000 37ae9    linux64.ko
Comment 16 Mark Knight 2018-05-03 08:00:24 UTC
Any more thoughts on this please?

Now Ports collection support for 10.3 has ended, this bug is now a hard blocker preventing me from upgrading to 10.4 and continuing to apply security fixes.
Comment 17 Kurt Jaeger freebsd_committer freebsd_triage 2018-05-03 08:12:28 UTC
I once had a box where sysctl -a would hang, too. It's been a while, so no more details until I stumble upon that box again.
Comment 18 Eugene Grosbein freebsd_committer freebsd_triage 2018-05-03 10:09:29 UTC
(In reply to Mark Knight from comment #16)

I'm out of ideas.

Could you please test 11.1-RELEASE or 11.2-PRERELEASE? I've seen similar reports that were 'fixed' by upgrading to 11.1.
Comment 19 Konstantin Belousov freebsd_committer freebsd_triage 2018-05-03 10:26:23 UTC
(In reply to Eugene Grosbein from comment #8)
Well, there is the sx owner which blocks snmpd.  It does not make your note about the request length less valid, but still the first thing to look at is the state of the lock owner.

Print p *(struct thread *)0xfffff80079fd3000, then get the backtrace for that thread.
Comment 20 Eugene Grosbein freebsd_committer freebsd_triage 2018-05-03 15:33:53 UTC
Created attachment 193023 [details]
thread 257/100247

(In reply to Konstantin Belousov from comment #19)

How does one recover the original sysctl name? This one seems to be ACPI-related, see attachment.
Comment 21 Eugene Grosbein freebsd_committer freebsd_triage 2018-05-03 15:37:06 UTC
Mark, you should check whether there is a BIOS update for this box.
Comment 22 Mark Knight 2018-05-03 16:48:20 UTC
Thanks. BIOS is 1006 (updated before I filed this bug) from https://www.asus.com/uk/Motherboards/PRIME-H270M-PLUS/HelpDesk_Download/. I see there's now a 1010 I could try. To reiterate this is fine under 10.3.
Comment 23 Eugene Grosbein freebsd_committer freebsd_triage 2018-05-03 19:31:02 UTC
The process holding sx lock is hald:

257 Thread 100247 (PID=1343: hald)  sched_switch (td=0xfffff80079fd3000,
    newtd=<value optimized out>, flags=<value optimized out>)
    at /usr/src/sys/kern/sched_ule.c:1956

Hald is installed with ports/sysutils/hal.

Mark, why do you need hald? If it's possible, try running 10.4 with hald disabled.
Comment 24 Mark Knight 2018-05-03 20:14:08 UTC
Ah, thank you!!

I don't recall why hald was installed (it was probably from a long time ago). I've removed it, and so far, I'm not seeing any symptoms, so long as I avoid sysctl -a.

:)
Comment 25 Eugene Grosbein freebsd_committer freebsd_triage 2018-05-04 10:54:49 UTC
(In reply to Mark Knight from comment #24)

Have you rebuilt/reinstalled sysutils/hal after upgrade from 10.3 to 10.4?
Comment 26 Bengt Ahlgren 2018-05-04 11:47:46 UTC
(In reply to Eugene Grosbein from comment #25)
I have run hald on my desktop and laptop with 10.4-REL and 10.4-STABLE for a long time without issues.

To OP: have you rebuilt kernel modules?  You only write that you tried with a new kernel.  I believe that there are internal kernel interfaces that have changed between 10.3 and 10.4, so it is essential to rebuild all modules.  Likewise, I don't expect "ps" from 10.3 to work on 10.4.
Comment 27 Mark Knight 2018-05-04 12:01:32 UTC
The move to 10.4 was a buildworld buildkernel installkernel installworld and mergemaster, so all modules and userland were rebuilt. Indeed, for a long time I've been running a 10.3 kernel with 10.4 userland as a stopgap until we found this (don't run hald) workaround.

I'll try a freshly rebuilt hald and report back shortly.
Comment 28 Eugene Grosbein freebsd_committer freebsd_triage 2018-05-04 12:13:22 UTC
(In reply to Bengt Ahlgren from comment #26)

I think this is a kernel or hardware problem regardless of any possible desync.
In no event should a simple sysctl system call hang "forever" holding an sx lock that prevents other processes from calling other sysctl OIDs.
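
The failure mode described here, one request blocking indefinitely while holding a lock that other requests also need, can be modelled in userland. This is a minimal sketch with hypothetical names (demo, stuck_handler, other_request); it uses an ordinary Python lock as a stand-in for the kernel sx lock and is not kernel code. Unlike this model, the waiters seen in the crashdumps block without a timeout.

```python
import threading


def demo(timeout_s=0.2):
    """Model a handler that stalls while holding a lock shared by all requests."""
    lock = threading.Lock()        # stand-in for the shared sx lock
    holding = threading.Event()    # set once the stuck handler owns the lock
    release = threading.Event()    # set to let the stuck handler finish

    def stuck_handler():
        with lock:
            holding.set()
            release.wait()         # models the handler blocking on something else

    def other_request():
        # An unrelated request that needs the same lock; gives up after timeout_s.
        got = lock.acquire(timeout=timeout_s)
        if got:
            lock.release()
        return got

    worker = threading.Thread(target=stuck_handler, daemon=True)
    worker.start()
    holding.wait()

    blocked = not other_request()  # the lock is held, so this request times out
    release.set()                  # unblock the handler so it drops the lock
    worker.join()
    recovered = other_request()    # the lock is free again
    return blocked, recovered


if __name__ == "__main__":
    print(demo())
```

In the model the second request recovers only because the first handler is eventually released; nothing plays that role in the hang reported here.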
Comment 29 Konstantin Belousov freebsd_committer freebsd_triage 2018-05-04 12:27:22 UTC
(In reply to Eugene Grosbein from comment #28)
Did you rebuild hald itself when you upgraded from 10.3 to 10.4? If not, do that.
Comment 30 Mark Knight 2018-05-06 16:45:41 UTC
Kernel and user land are now FreeBSD 10.4-RELEASE-p8 (SHREWD) #0 r333193: Thu May  3 22:30:26 BST 2018.

Has been fine since I stopped running hald. As a test I just rebuilt hal-0.5.14_32 and all of its dependencies using my up-to-date ports tree.

Then, as soon as I ran "service hald onestart", the ps command (without any parameters) hangs. Nasty.
Comment 31 Andriy Gapon freebsd_committer freebsd_triage 2018-05-06 19:25:47 UTC
Have we seen a stack trace of what hald is doing in kernel yet?
Konstantin requested that information in comment #19.
Comment 32 Eugene Grosbein freebsd_committer freebsd_triage 2018-05-06 20:48:30 UTC
(In reply to Andriy Gapon from comment #31)

You probably overlooked the attachment https://bz-attachments.freebsd.org/attachment.cgi?id=193023 named "thread 257/100247", it has that.
Comment 33 Eugene Grosbein freebsd_committer freebsd_triage 2018-05-06 21:01:14 UTC
Created attachment 193116 [details]
Giant lock(?)

It seems we have the Giant lock here? See attachment.
Comment 34 Andriy Gapon freebsd_committer freebsd_triage 2018-05-07 06:37:15 UTC
(In reply to Eugene Grosbein from comment #32)
Oh, yes, I have.
I think that that stack trace indicates two issues.

1. That a sysctl handler to print dev.XXX.N.%pnpinfo performs an ACPI evaluation.
That seems to be too heavy.  The code should probably use a cached value.

2. That the thread is stuck in AcpiOsAcquireMutex when trying to acquire ACPI_MTX_INTERPRETER (the interpreter lock).
This could be either because some other thread is holding that lock (for whatever reason) or because of a bug in AcpiOsAcquireMutex (or maybe a bug in DSDT).
This needs further analysis.
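
The cached-value idea in point 1 can be illustrated in general terms. This is a userland sketch under stated assumptions, not the kernel's implementation: the Device class, attach(), pnpinfo_handler() and the evaluate_pnpinfo callback are all hypothetical names, and the callback merely stands in for an expensive evaluation. Whether the real value is stable enough to cache is not settled in this thread.

```python
class Device:
    """Toy device that computes its pnpinfo string once and serves it from a cache."""

    def __init__(self, name, evaluate_pnpinfo):
        self.name = name
        self._evaluate = evaluate_pnpinfo  # hypothetical expensive call
        self._pnpinfo = None

    def attach(self):
        # Evaluate once, outside of any request path.
        self._pnpinfo = self._evaluate()

    def pnpinfo_handler(self):
        # Request path: return the cached string and never call the evaluator,
        # so a slow or stuck evaluation cannot stall a reader here.
        return "" if self._pnpinfo is None else self._pnpinfo


if __name__ == "__main__":
    calls = []
    dev = Device("dev0", lambda: calls.append(1) or "_HID=EXAMPLE")
    dev.attach()
    print(dev.pnpinfo_handler(), dev.pnpinfo_handler(), len(calls))
```

The trade-off is staleness: a cached string does not reflect any change that happens after attach().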
Comment 35 Eugene Grosbein freebsd_committer freebsd_triage 2018-05-11 00:40:41 UTC
Adding freebsd-acpi@ in the hope of reaching more people potentially interested in this case.
Comment 36 Eugene Grosbein freebsd_committer freebsd_triage 2018-12-02 21:09:57 UTC
Can you test FreeBSD 11.2? 10.4 went EoL already.
Comment 37 Mark Knight 2018-12-08 13:15:59 UTC
Seems fine / resolved under FreeBSD 11.2-RELEASE-p4, so I guess this can now be closed.
Comment 38 Rodney W. Grimes freebsd_committer freebsd_triage 2019-02-13 01:56:17 UTC
Please do not put bugs on stable@, current@, hackers@, etc
Placing this bug back on bugs@ to be triaged properly to a technical list