Summary: | FreeBSD 11.0p11 - system freeze on intensive I/O | |
---|---|---|---
Product: | Base System | Reporter: | Gautam Mani <execve>
Component: | kern | Assignee: | freebsd-bugs (Nobody) <bugs>
Status: | New | |
Severity: | Affects Some People | CC: | marklmi26-fbsd
Priority: | --- | |
Version: | 11.0-RELEASE | |
Hardware: | amd64 | |
OS: | Any | |
See Also: | https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=206048 | |
Description
Gautam Mani
2017-07-24 16:33:23 UTC
See bugzilla 206048 for more examples and notes. It is a long-standing issue, not new with 11.x. This submittal is a duplicate; there may be others besides 206048.

One update on this one: I installed a custom stable/11 kernel and world (sched = 4BSD was the only change), and the problem is no longer seen. To my untrained eye, it looks like some kind of swap request starvation causing a hang when ULE is in use.

root@mellon:~ # uname -a
FreeBSD mellon 11.1-STABLE FreeBSD 11.1-STABLE #0 r313908+14aefcc16ee(stable/11): Sat Aug 12 00:33:04 IST 2017  root@mellon:/usr/obj/usr/home/user1/src/freebsd/sys/MYKERNEL  amd64

(In reply to execve from comment #2)

I suggest the test of using the port sysutils/stress and trying:

stress -d 2 -m 3 --vm-keep

I'd be interested to know if a swap file context handles that.

Yes, it seems to hold itself together fine for more than 10 minutes :).

root@mellon:~ # swapinfo
Device          1K-blocks     Used    Avail Capacity
/dev/md99         8388608        0  8388608     0%
root@mellon:~ # sysctl hw.physmem
hw.physmem: 8463953920
root@mellon:~ # date
Sat Aug 12 12:55:36 IST 2017
root@mellon:~ # stress -d 2 -m 3 --vm-keep
stress: info: [11727] dispatching hogs: 0 cpu, 0 io, 3 vm, 2 hdd
^C
root@mellon:~ # date
Sat Aug 12 13:06:37 IST 2017
root@mellon:~ # uname -a
FreeBSD mellon 11.1-STABLE FreeBSD 11.1-STABLE #0 r313908+14aefcc16ee(stable/11): Sat Aug 12 00:33:04 IST 2017  root@mellon:/usr/obj/usr/home/user1/src/freebsd/sys/MYKERNEL  amd64

(In reply to execve from comment #4)

Cool. (I suppose top or some such could be used to confirm the expected activity, given the amount of RAM and other such context.)

I wonder what it would do without the "sched = 4BSD was the only change". (Historically 11.x has been a problem, but likely all the examples had not adjusted that.)

(In reply to execve from comment #4)

FYI: since you have more RAM than the original context for that stress command, I'll quote from the man page:

    -m, --vm N       spawn N workers spinning on malloc()/free()
    --vm-bytes B     malloc B bytes per vm worker (default is 256MB)
    -d, --hdd N      spawn N workers spinning on write()/unlink()
    --vm-keep        redirty memory instead of freeing and reallocating

So: stress -d 2 -m 3 --vm-keep is only doing 3*256MB = 768MB of VM use. That was a large percentage of the 1GB of RAM that the related bugzilla 206048 indicated as the context for the command. It is not that much of the roughly 8 GiBytes of RAM here.

I do not think there is any need to increase the memory usage. Like I mentioned in the original PR description, even without X running on the same system with 8GB RAM, I could reproduce this using a split command on a 6-7GB file via the console.
>> running split on a 6.4G file (filesystem dump of disk using dump) -- something like
>> split -d -b 200M -a 4 - part
>> This would then freeze at one point - making the system unusable. I tried this 2-3 times.
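For reference, the quoted workload can be approximated without an actual dump image. A minimal sketch, assuming /dev/zero as a stand-in data source (the split options, the roughly 6.4 GB size, and the "part" output prefix mirror the quoted command):

    # Stream ~6.4 GB through split(1) to mimic the dump | split write load.
    # /dev/zero stands in for the real filesystem dump used in the report.
    dd if=/dev/zero bs=1m count=6554 | split -d -b 200M -a 4 - part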
It is very clear there is an issue, and from my experience it is narrowed down to when the ULE scheduler and a swap file are in use.
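For readers unfamiliar with the /dev/md99 entry in the swapinfo output above: that device name is typical of a file-backed swap area. A minimal sketch of one common way such swap is set up on FreeBSD follows (the 8 GB size mirrors the swapinfo output; the /usr/swap0 path and the manual mdconfig/swapon steps are assumptions, not necessarily the reporter's exact configuration):

    # Create an 8 GB backing file and attach it as the md99 swap device.
    dd if=/dev/zero of=/usr/swap0 bs=1m count=8192
    chmod 0600 /usr/swap0
    mdconfig -a -t vnode -f /usr/swap0 -u 99    # creates /dev/md99
    swapon /dev/md99
    swapinfo                                    # should now list /dev/md99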
(In reply to execve from comment #7)

I tried a couple of variations of the experiment that I suggested. Unfortunately the results are a little complicated to interpret.

Context: under VirtualBox (on Windows 10 Pro) with. . . (Bugzilla 206048 has pointed out reproducibility under virtual machines.)

FreeBSDx64OPC11S# uname -apKU
FreeBSD FreeBSDx64OPC11S 11.1-STABLE FreeBSD 11.1-STABLE r322433M amd64 amd64 1101501 1101501

# svnlite diff /usr/src/
Index: /usr/src/sys/amd64/conf/GENERIC
===================================================================
--- /usr/src/sys/amd64/conf/GENERIC     (revision 322433)
+++ /usr/src/sys/amd64/conf/GENERIC     (working copy)
@@ -24,7 +24,8 @@
 makeoptions     DEBUG=-g        # Build kernel with gdb(1) debug symbols
 makeoptions     WITH_CTF=1      # Run ctfconvert(1) for DTrace support

-options         SCHED_ULE       # ULE scheduler
+#options        SCHED_ULE       # ULE scheduler
+options         SCHED_4BSD      # 4BSD scheduler
 options         PREEMPTION      # Enable kernel thread preemption
 options         INET            # InterNETworking
 options         INET6           # IPv6 communications protocols

I tried:

4 processors and 1 GiByte of RAM assigned, using: stress -d 2 -m 3 --vm-keep

and separately:

8 processors and 1 GiByte of RAM assigned, using: stress -d 6 -m 3 --vm-keep

I had a top -Cawopid running in each case with its own ssh into the virtual machine. stress was via ssh as well.

In the 2nd case I got to a lock-up: top stopped updating, and input was ignored by both ssh sessions (top and stress) and by the console window, including input such as ^C and ^T. The console window did eventually show:

swap_pager: I/O error - pageout failed; blkno 7367, size 4096, error 12

(After seeing that I waited a while longer, but I gave up on waiting and eventually killed the virtual machine.)

I later found a list message reporting about such "error 12" variants of the message:

QUOTE
> I think it might be ENOMEM from a geom when trying to g_clone_bio. . . .
> It shouldn't happen, but you should notice no ill effects (that is, the
> page isn't lost, it just wasn't paged out and there's a few bytes less
> that the pager could do at the moment).
END QUOTE

As for the lock-up structure. . . Unfortunately top did not happen to update to show any of the lock-up structure in other processes before it locked up.

It does at least appear not as easy to get a lock-up (or to get ENOMEM and a failure to page out) with SCHED_4BSD, to the degree that just a couple of tests indicate anything about such. But getting stuck appears possible, and pageouts can fail to happen for lack of memory, or so it appears.

(In reply to Mark Millard from comment #8)

I should also have said: the Windows 10 Task Manager Performance tab display of CPU usage on threads/cores suggested a possible live-lock instead of a dead-lock: 7 of 8 "processors" (in VirtualBox terms) fairly busy, but no 8th one noticeably busy. (Windows 10 Pro was not otherwise busy in any sustained way.) But technically I cannot prove, off my evidence, which it was: lack of overall progress vs. very slow overall progress.
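For anyone wanting to repeat the scheduler comparison, a minimal sketch of rebuilding and installing the kernel after editing GENERIC as in the svnlite diff above (the -j4 value is an assumption, and the stock GENERIC config name is used here; the earlier uname output shows the reporter instead built a config named MYKERNEL):

    # Rebuild and install the kernel with SCHED_4BSD in place of SCHED_ULE.
    cd /usr/src
    make -j4 buildkernel KERNCONF=GENERIC
    make installkernel KERNCONF=GENERIC
    shutdown -r now    # the scheduler choice is compile-time, so a reboot is required

Since the scheduler is selected at kernel build time, each ULE vs. 4BSD comparison requires a reboot onto the corresponding kernel before re-running the stress tests.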