When upgrading the kernel from FreeBSD 10.3-RELEASE-p28 to FreeBSD 10.4-RELEASE or FreeBSD 10.4-RELEASE-p7, various userland programs hang. The problem is the same with 10.3 or 10.4 userland, so simply upgrading the kernel triggers the issue.
For example, I cannot log in as myself on the console, or if I log in as root and try to su or sudo to myself, the tty hangs (CTRL-C usually does nothing, CTRL-T shows [sysctl mem]), and I have to log in via another virtual tty. Sadly, nothing appears in /var/log/all.log when the deadlock occurs.
Another command that causes a hang is "killall -HUP syslogd" when logged in as root.
For additional information, see the similarly titled thread on freebsd-stable.
Kernel boot log: http://www.knigma.org/scratch/010418.10.4.txt
I created the following kernel dump shortly after a normal boot (booting hasn't been affected so far) by immediately trying to log in as my normal user and hitting Ctrl-Alt-Esc after the tty hung. CTRL-T showed [sysctl-mem]. The kernel dump was created with "call doadump":
Motherboard is running the latest BIOS and system has been rock solid on 10.3 for months.
All advice/assistance appreciated - thank you :)
As a temporary workaround, you could 1) disable snmpd, which seems to trigger this kernel problem, and 2) avoid using "sysctl -a", as it may trigger it too.
Do you have some non-standard tuning in /boot/loader.conf and/or /etc/sysctl.conf?
If so, please add them as attachments.
You should also rebuild the net-snmp port or update the package, as a stale version may be the culprit if it digs through kernel data in the wrong way. Upgrading it may solve your problem completely.
Created attachment 192121
Adding sysctl.conf as requested.
Created attachment 192122
Adding loader.conf as requested.
Thanks. It sounds like you have seen something similar before?
I've just tried commenting out snmpd and snmptrap from /etc/rc.conf but the problem still occurs. I haven't run sysctl -a.
In better news, I've found an even simpler way to trigger the deadlock.
1) Boot the OS into multiuser
2) Login on the console as root
3) Type ps followed by enter. ps never returns.
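The "ps never returns" check can also be scripted so that the test shell itself does not get stuck. The following is only a minimal sketch: the helper name `command_hangs` and the 5-second timeout are assumptions for illustration, not part of the original report. It deliberately does not wait for the child after a timeout, because a process blocked inside the kernel may ignore the kill.

```python
import subprocess


def command_hangs(argv, timeout_s=5.0):
    """Return True if argv has not exited within timeout_s seconds.

    Hypothetical helper for illustration only.
    """
    proc = subprocess.Popen(
        argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    try:
        proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Best effort only: a process sleeping uninterruptibly in the
        # kernel may not react to the signal, so we do not wait for it.
        proc.kill()
        return True
    return False
```

On an affected system, `command_hangs(["ps"])` would be expected to return True once the deadlock has been triggered, and False on a working kernel.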
What do you use vm.defer_swapspace_pageouts=1 for?
Note this sysctl was removed from FreeBSD 11 altogether. I would recommend trying to comment it out for a test, too.
(In reply to Mark Knight from comment #5)
> Thanks. It sounds like you have seen something similar before?
No. I've just dug into your crash dump a little.
> In better news, I've found an even simpler way to trigger the deadlock.
Without snmpd running? If so, please reproduce it and generate another crashdump and make it available.
Created attachment 192172
some debug info for snmpd from the crashdump
Note that snmpd is hung on sx_xlock() with an insanely large req.validlen == 140737488349408.
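To put that number in perspective, here is a small arithmetic sketch. The only input taken from the crash dump is the validlen value. The 47-bit user address space limit is an assumption about amd64, and the idea that the value resembles a user-space address rather than a length is an interpretation that the dump itself does not confirm.

```python
validlen = 140737488349408  # req.validlen as reported from the crash dump

# Assumption: user virtual addresses on amd64 lie below 2**47.
USER_VA_LIMIT = 1 << 47

as_hex = hex(validlen)                     # the value as a hex string
below_limit = USER_VA_LIMIT - validlen     # distance below 2**47, in bytes
as_tib = validlen / 2**40                  # size if it really were a length

print(as_hex)       # 0x7fffffffe8e0
print(below_limit)  # 5920
print(as_tib)       # just under 128 (TiB)
```

A "length" that sits 5920 bytes below 2**47 is almost 128 TiB, which is why it stands out in the debug output.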
> What do you use vm.defer_swapspace_pageouts=1 for?
After moving existing disks to a new chassis I have more RAM than swap, and repartitioning would be a pain - I really don't need 16GB of swap! Without this sysctl I was finding that my swap was becoming full, with lots of associated "out of swap space" errors. Since adding this sysctl that problem has ceased. I hope 11.x won't have the same problem...
As requested, here's a fresh kernel dump created as follows:
1) Disabled snmpd and snmptrapd in /etc/rc.conf, so definitely no snmpd running
2) reboot with 10.4 kernel
3) Login to console as root
4) Type ps <return>
5) Deadlock occurs
6) CTRL-T shows [sysctl-mem]
7) Ctrl-Alt-ESC followed by call doadump
Kernel is the same file as before, so no need to download that again!
As suggested, I just tried commenting out vm.defer_swapspace_pageouts=1. Commenting it out from /etc/sysctl.conf didn't make any difference, 10.4 still deadlocks.
(In reply to Mark Knight from comment #9)
What is UID 104 here? It seems it is used for some perl process that may be triggering the problem and not snmpd.
UID 104 is:
slimserv:*:104:104:Slim Devices SlimServer pseudo-user:/nonexistent:/usr/sbin/nologin
ps (on a working kernel), shows:
slimserv 1117 0.0 0.9 253920 147596 - Ss 7:13PM 0:15.34 /usr/local/bin/perl /usr/local/share/logitechmediaserver/slimserver.pl --daemon --pidfile=/var/run/logitechmediaserver/logitechmediaserver.pid --user=slimserv --group=slimserv
This is logitechmediaserver-7.9.1.g2017.07.11, from /usr/ports/audio/logitechmediaserver
I'm surprised that's mischievous - it's just a music server, but I'll try disabling it in a moment and seeing if that helps :)
The only other perl I have running as a daemon is mrtg-2.17.4_7,1 from /usr/ports/net-mgmt/mrtg.
I should mention that all my ports are up-to-date.
I just tried disabling logitechmediaserver and restarting. Didn't help. Then I disabled mrtg and restarted. Still didn't help.
Here's a fresh dump with neither running. Same repro steps (root login followed by ps) and same symptoms.
Incidentally, ps works in single user mode, but it would be very tedious to do a binary chop to find what's tickling the problem.
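For reference, the binary chop mentioned above could be sketched as follows. This is only an illustration under two assumptions: exactly one service triggers the hang, and the hang reproduces reliably. The callback `hangs_with` is hypothetical. In practice it would mean enabling only the given services, rebooting, and checking whether ps hangs, which is exactly the tedious part.

```python
def find_culprit(services, hangs_with):
    """Bisect a list of services to find the single one causing the hang.

    hangs_with(subset) is a hypothetical callback that returns True when
    the hang occurs with only `subset` enabled.
    """
    candidates = list(services)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        if hangs_with(half):
            candidates = half              # culprit is in the first half
        else:
            candidates = candidates[len(half):]  # culprit is in the rest
    return candidates[0]


# Placeholder service names; "d" plays the role of the culprit.
demo = find_culprit(["a", "b", "c", "d", "e"], lambda enabled: "d" in enabled)
print(demo)  # d
```

With N services this needs about log2(N) reboots instead of N, at the cost of assuming a single culprit.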
(In reply to Mark Knight from comment #13)
Do you use GENERIC kernel? If not, please attach your kernel config file.
Created attachment 192199
Kernel config file
Attached kernel config. My kernel is close to generic, extras are:
> options IPFIREWALL
> options IPDIVERT
> options DUMMYNET
then added these for debug, but the problem occurred without them:
> options KDB_UNATTENDED
> options DDB # Support DDB.
> options GDB # Support remote GDB.
> options INVARIANTS # Enable calls of extra sanity checking
> options INVARIANT_SUPPORT # Extra sanity checks of internal structures, required by INVARIANTS
> options WITNESS # Enable checks to detect deadlocks and cycles
> options WITNESS_SKIPSPIN # Don't run witness on spinlocks for speed
You've probably got these from loader.conf, but I also have these:
Id Refs Address            Size     Name
 1   35 0xffffffff80200000 17dbb70  kernel
 2    1 0xffffffff819dc000 215a8    geom_eli.ko
 3    2 0xffffffff819fe000 352d0    crypto.ko
 4    1 0xffffffff81a34000 23fb0    geom_mirror.ko
 5    1 0xffffffff81a58000 157c0    aio.ko
 6    1 0xffffffff81a6e000 4a70     coretemp.ko
 7    1 0xffffffff81a73000 b188     sem.ko
 8    1 0xffffffff81a7f000 26e0     cc_cubic.ko
 9    1 0xffffffff81c11000 56c6     fdescfs.ko
10    1 0xffffffff81c17000 8df4     uftdi.ko
11    1 0xffffffff81c20000 54c4     ucom.ko
12    1 0xffffffff81c26000 3d12c    linux.ko
13    2 0xffffffff81c64000 683b     linux_common.ko
14    1 0xffffffff81c6b000 37ae9    linux64.ko
Any more thoughts on this please?
Now that Ports Collection support for 10.3 has ended, this bug is a hard blocker preventing me from upgrading to 10.4 and continuing to apply security fixes.
I once had a box where sysctl -a would hang, too. It's been a while, so no more details until I stumble upon that box again.
(In reply to Mark Knight from comment #16)
I'm out of ideas.
Could you please test 11.1-RELEASE or 11.2-PRERELEASE? I've seen similar reports that were 'fixed' by upgrading to 11.1.
(In reply to Eugene Grosbein from comment #8)
Well, there is the sx owner which blocks snmpd. That does not make your note about the request length less valid, but still, the first thing to look at is the state of the lock owner.
Print p *(struct thread *)0xfffff80079fd3000, then get the backtrace for that thread.
Created attachment 193023
(In reply to Konstantin Belousov from comment #19)
How does one recover the original sysctl name? This one seems to be ACPI-related, see attachment.
Mark, you should check whether there is a BIOS update for this box.
Thanks. BIOS is 1006 (updated before I filed this bug) from https://www.asus.com/uk/Motherboards/PRIME-H270M-PLUS/HelpDesk_Download/. I see there's now a 1010 I could try. To reiterate, this is fine under 10.3.
The process holding the sx lock is hald:
257 Thread 100247 (PID=1343: hald) sched_switch (td=0xfffff80079fd3000,
newtd=<value optimized out>, flags=<value optimized out>)
Hald is installed with ports/sysutils/hal.
Mark, why do you need hald? If it's possible, try running 10.4 with hald disabled.
Ah, thank you!!
I don't recall why hald was installed (it was probably from a long time ago). I've removed it, and so far, I'm not seeing any symptoms, so long as I avoid sysctl -a.
(In reply to Mark Knight from comment #24)
Have you rebuilt/reinstalled sysutils/hal after upgrade from 10.3 to 10.4?
(In reply to Eugene Grosbein from comment #25)
I have run hald on my desktop and laptop with 10.4-REL and 10.4-STABLE for a long time without issues.
To OP: have you rebuilt kernel modules? You only write that you tried with a new kernel. I believe that there are internal kernel interfaces that have changed between 10.3 and 10.4, so it is essential to rebuild all modules. Likewise, I don't expect "ps" from 10.3 to work on 10.4.
The move to 10.4 was a buildworld, buildkernel, installkernel, installworld and mergemaster, so all modules and userland were rebuilt. Indeed, for a long time I'd been running a 10.3 kernel with 10.4 userland as a workaround until we found this (don't run hald) workaround.
I'll try a freshly rebuilt hald and report back shortly.
(In reply to Bengt Ahlgren from comment #26)
I think this is a kernel or hardware problem, despite a possible desync.
In no event should a simple sysctl system call hang "forever" while holding an sx lock that prevents other processes from calling other sysctl OIDs.
(In reply to Eugene Grosbein from comment #28)
Did you rebuild hald itself when you upgraded from 10.3 to 10.4? If not, do that.
Kernel and user land are now FreeBSD 10.4-RELEASE-p8 (SHREWD) #0 r333193: Thu May 3 22:30:26 BST 2018.
It has been fine since I stopped running hald. As a test I just rebuilt hal-0.5.14_32 and all of its dependencies using my up-to-date ports tree.
Then, as soon as I ran "service hald onestart", the ps command (without any parameters) hung. Nasty.
Have we seen a stack trace of what hald is doing in the kernel yet?
Konstantin requested that information in comment #19.
(In reply to Andriy Gapon from comment #31)
You probably overlooked the attachment https://bz-attachments.freebsd.org/attachment.cgi?id=193023 named "thread 257/100247", it has that.
Created attachment 193116
It seems we have the Giant lock here? See attachment.
(In reply to Eugene Grosbein from comment #32)
Oh, yes, I have.
I think that that stack trace indicates two issues.
1. That a sysctl handler to print dev.XXX.N.%pnpinfo performs an ACPI evaluation.
That seems to be too heavy. The code should probably use a cached value.
2. That the thread is stuck in AcpiOsAcquireMutex when trying to acquire ACPI_MTX_INTERPRETER (the interpreter lock).
This could be either because some other thread is holding that lock (for whatever reason) or because of a bug in AcpiOsAcquireMutex (or maybe a bug in DSDT).
This needs further analysis.
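The lock chain described above can be modelled in user space to show why one stuck handler stalls every other sysctl caller. This is only a toy sketch with ordinary Python locks. The names `sysctl_lock`, `interpreter_lock`, `pnpinfo_handler` and `simulate` are illustrative stand-ins, not kernel code, and timeouts are used so that the model terminates instead of really deadlocking.

```python
import threading
import time

sysctl_lock = threading.Lock()       # stand-in for the sysctl sx lock
interpreter_lock = threading.Lock()  # stand-in for ACPI_MTX_INTERPRETER


def pnpinfo_handler(result, timeout_s):
    """Toy model of a sysctl handler that needs the interpreter lock."""
    with sysctl_lock:
        # The handler blocks here while still holding sysctl_lock.
        got = interpreter_lock.acquire(timeout=timeout_s)
        if got:
            interpreter_lock.release()
        result["handler_finished"] = got


def simulate():
    result = {}
    interpreter_lock.acquire()  # someone else holds the interpreter lock
    handler = threading.Thread(target=pnpinfo_handler, args=(result, 5.0))
    handler.start()
    while not sysctl_lock.locked():  # wait until the handler owns the lock
        time.sleep(0.01)
    # An unrelated sysctl caller (think "ps") now cannot make progress.
    got = sysctl_lock.acquire(timeout=0.2)
    result["other_caller_got_lock"] = got
    if got:
        sysctl_lock.release()
    interpreter_lock.release()  # once released, the handler can finish
    handler.join()
    return result
```

In this model the unrelated caller times out on `sysctl_lock` for as long as the interpreter lock is unavailable. If the handler used a cached value instead of taking the second lock, as suggested in point 1, it would not block while holding `sysctl_lock`.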
Adding freebsd-acpi@ in the hope of gathering more people potentially interested in this case.