Bug 239822 - deadlock since FreeBSD 12.0 p9
Summary: deadlock since FreeBSD 12.0 p9
Status: Closed Overcome By Events
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: 12.0-RELEASE
Hardware: Any Any
: --- Affects Only Me
Assignee: freebsd-bugs mailing list
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-08-13 15:59 UTC by Martin Birgmeier
Modified: 2019-11-10 11:29 UTC (History)
1 user (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Martin Birgmeier 2019-08-13 15:59:45 UTC
Scenario:
- FreeBSD releng/12.0 updated via SVN
- recompiled (using -DNO_CLEAN) after FreeBSD-EN-19:14.epoch
- Running on 3 machines
- I have not recompiled the following ports with files in /boot/modules:
  . Machine 1: intel-em-kmod-7.7.5, nvidia-driver-390.87_3, virtualbox-ose-kmod-5.2.32_1
  . Machine 2: drm-fbsd12.0-kmod-4.16.g20190722_1, gpu-firmware-kmod-g20190620, virtualbox-ose-kmod-5.2.32_1
- uname -a: FreeBSD hal.xyzzy 12.0-RELEASE-p9 FreeBSD 12.0-RELEASE-p9 #9 r350677M: Wed Aug  7 18:35:55 CEST 2019     root@v908.xyzzy:/.../hal/z/OBJ/FreeBSD/amd64/releng/12.0/amd64.amd64/sys/XYZZY_SMP  amd64

Result:
- On two of the three machines I have now experienced something which seems like a livelock: Some processes are stuck, others still progress and can be terminated (e.g., EOF in shells).
- Rebooting the machines never succeeds completely: At some point there is no progress anymore and the machines have to be hard reset.
- In the first livelock, two tabs of Firefox seemed to be "stuck" against each other; 100% CPU was used (1 out of 4 cores).
- In the second livelock, VirtualBox (VBoxHeadless) was spinning at 100% (1 out of 4 cores), and a running svn command was stuck.

I have the impression that this started with -p9. Before, I mostly got panics with immediate reboots (no kernel dumps although configured; see bug report #235865).

-- Martin
Comment 1 Martin Birgmeier 2019-08-13 16:03:23 UTC
Btw in the second case involving VBoxHeadless:
- Tried to interrupt the VBoxHeadless process (kill -INT)
- 100% CPU load fell to 0%
- Process did not get killed, instead pstree reported "STOP"
Comment 2 Martin Birgmeier 2019-08-14 19:14:50 UTC
I just had another of these deadlocks:
- Running firefox
- Opening a context menu
- Firefox does not respond anymore
- Klicked on upper right corner to close it
- KDE says that window does not close, do I want to forcibly close it
- "yes"
- result: complete system deadlocked
Comment 3 Mark Johnston freebsd_committer 2019-08-14 19:22:33 UTC
(In reply to Martin Birgmeier from comment #2)
Are you able to capture the output of "procstat -kka" once an application changes?
Comment 4 Mark Johnston freebsd_committer 2019-08-14 19:25:04 UTC
s/changes/hangs/
Comment 5 Martin Birgmeier 2019-08-14 19:27:28 UTC
I'll try it next time.

Currently I am recompiling all those ports with kernel modules.

-- Martin
Comment 6 Martin Birgmeier 2019-08-16 05:32:03 UTC
Tried to let the machines run overnight (to recompile KDE applications 19.08.0). Machine 1 crashed at around midnight. In the morning found a blank screen and complete unresponsiveness.

Firefox was not running, but a vbox client was hosted to run portmaster.

The system was definitely more stable before -p9.

-- Martin
Comment 7 Martin Birgmeier 2019-08-16 05:33:06 UTC
To be added: unresponsive, with fan at full blast (so most likely more than 1 of the 4 cores fully loaded).
Comment 8 Martin Birgmeier 2019-08-16 05:45:15 UTC
I am not sure the following is of relevance, but /var/log/messages yesterday was spammed like this:

Aug 15 23:00:00 mizar newsyslog[10909]: logfile turned over due to size>1000K
Aug 15 23:00:00 mizar wpa_supplicant[394]: wlan0: Reject scan trigger since one 
is already pending
Aug 15 23:00:00 mizar wpa_supplicant[394]: wlan0: Failed to initiate AP scan
Aug 15 23:00:01 mizar wpa_supplicant[394]: wlan0: Reject scan trigger since one 
is already pending
Aug 15 23:00:01 mizar wpa_supplicant[394]: wlan0: Failed to initiate AP scan
Aug 15 23:00:02 mizar wpa_supplicant[394]: wlan0: Reject scan trigger since one 
is already pending
Aug 15 23:00:02 mizar wpa_supplicant[394]: wlan0: Failed to initiate AP scan
Aug 15 23:00:03 mizar wpa_supplicant[394]: wlan0: Reject scan trigger since one 
is already pending

But is it doing this since at least Aug 10 (which is the oldest messages left due to excessive rotation because of this).
Comment 9 Martin Birgmeier 2019-08-16 05:52:25 UTC
I am backup up the machines using rsync and zfs snapshots which I keep on a monthly basis. This allows me to look at older messages...

[0]# bzgrep -c 'Reject scan trigger since one is already pending' */var/log/messages*
...
backup.2019-04-28.19:40:47/var/log/messages:0
backup.2019-04-28.19:40:47/var/log/messages.0.bz2:0
backup.2019-04-28.19:40:47/var/log/messages.1.bz2:0
backup.2019-04-28.19:40:47/var/log/messages.2.bz2:0
backup.2019-04-28.19:40:47/var/log/messages.3.bz2:0
backup.2019-04-28.19:40:47/var/log/messages.4.bz2:0
backup.2019-06-02.16:32:51/var/log/messages:0
backup.2019-06-02.16:32:51/var/log/messages.0.bz2:6540
backup.2019-06-02.16:32:51/var/log/messages.1.bz2:5153
backup.2019-06-02.16:32:51/var/log/messages.2.bz2:5565
backup.2019-06-02.16:32:51/var/log/messages.3.bz2:5096
backup.2019-06-02.16:32:51/var/log/messages.4.bz2:8572
...
[0]#

So I guess this is related to -p4 of 20190514: "Fix multiple vulnerabilities in hostapd/wpa_supplicant. [SA-19:03.wpa]"
Comment 10 Mark Johnston freebsd_committer 2019-08-16 14:48:28 UTC
So, you still see problems after recompiling the kernel modules against -p9 sources?  What version were you running before the update?
Comment 11 Martin Birgmeier 2019-08-17 14:51:38 UTC
Before I was running -p7:

[0]# strings /z/backup/rsync/mizar/ROOT/.zfs/snapshot/backup.2019-08-04.12:27:28/boot/kernel/kernel | tail | grep FreeBSD
@(#)FreeBSD 12.0-RELEASE-p7 #8 r349733M: Thu Jul  4 19:39:47 CEST 2019
FreeBSD 12.0-RELEASE-p7 #8 r349733M: Thu Jul  4 19:39:47 CEST 2019
    root@v908.xyzzy:/.../hal/z/OBJ/FreeBSD/amd64/releng/12.0/amd64.amd64/sys/XYZZY_SMP
FreeBSD clang version 6.0.1 (tags/RELEASE_601/final 335540) (based on LLVM 6.0.1)
FreeBSD
[0]#
Comment 12 Mark Johnston freebsd_committer 2019-08-19 16:12:52 UTC
(In reply to Martin Birgmeier from comment #11)
And to be clear, you were seeing panics on -p7 too?
Comment 13 Martin Birgmeier 2019-08-19 16:51:39 UTC
With -p7 the machine would simply reboot, but less frequently than before the fix of bug #236121.

-- Martin
Comment 14 Martin Birgmeier 2019-11-10 11:29:50 UTC
This does not seem to occur anymore. (Now at FreeBSD 12.0-RELEASE-p10 #10 r351430M)

-- Martin