Hi, I am duplicating my text from the email to freebsd-questions and also adding clues from my investigation so far.

I came back to using FreeBSD after many years, on my laptop. The machine is currently in a dual-boot configuration; I basically installed FreeBSD over the earlier Ubuntu installation. The machine is at 11.0-RELEASE, updated to p11 via freebsd-update. I encountered a similar outcome of a total system freeze in two kinds of usage.

State of the system when frozen:
1) X is not usable - I use xfce - no login manager
2) I cannot ssh into the box - I do not get the username or password prompt - the connection just times out
3) the network interface is ping-able
4) I am not able to switch to the system consoles using Ctrl-Alt-F1..8
5) no mouse movement nor screen update
6) I suspected an issue with soft updates and turned them off; the filesystem is UFS and fsck is clean

Recovery: hard-boot; the system comes up, fsck runs, some errors are fixed, and it resumes working normally.

The problem became visible in two usage scenarios:
1) Running the backup port duplicity to create a backup of the / filesystem. It would start but at some point get stuck. Running it in verbose mode would sometimes indicate that this happens when the write to the volume (default size of 200M) occurs. This was tried 4-5 times.
2) Running split on a 6.4G file (a filesystem dump of the disk made with dump) -- something like:
   split -d -b 200M -a 4 - part
This would freeze at some point, making the system unusable. I tried this 2-3 times. I finally got it to work by prefixing the split command with idprio 31. I tried that only once and haven't tried it with the duplicity command.

The machine has 8GB RAM and is clearly not reaching an out-of-memory kind of situation - only about 1.1GB or so is used. I also ran this from the system console without X and faced the same issue - no panic message, and nothing in the logs either.
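For clarity, a small-scale sketch of the reproducing command is below. The input file and sizes here are tiny stand-ins so the commands can be run anywhere; the hang itself only showed up at multi-GB sizes with the swap file and ULE in use, and the original read a ~6.4G dump file from stdin.

```shell
# Tiny stand-in for the ~6.4G dump file (names/sizes here are illustrative):
dd if=/dev/zero of=dumpfile bs=1k count=8 2>/dev/null

# Same shape as the original command (original: split -d -b 200M -a 4 - part);
# -d = numeric suffixes, -b = piece size, -a 4 = four suffix digits:
split -d -b 2k -a 4 - part < dumpfile   # produces part0000 .. part0003

# Workaround reported above, FreeBSD-specific, shown here as a comment:
#   idprio 31 split -d -b 200M -a 4 - part
```

Running split at idle priority (idprio 31) presumably gives the pager room to make progress, which fits the swap-starvation theory discussed later in the thread.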
I got a clue from further searching of the freebsd mailing lists and forums. It has to do with the swap file. I don't have a swap partition, since I just put the FreeBSD root over the Ubuntu partition, and I had created a swap file based on the instructions in 11.12.2 "Creating a Swap File":
https://www.freebsd.org/doc/handbook/adding-swap-space.html

The clue came from the freebsd-forums:
https://forums.freebsd.org/threads/58266/
If the link is down (for some reason, I see that server down quite often these days), use the cached link:
https://webcache.googleusercontent.com/search?q=cache:y1bJLmSEjWUJ:https://forums.freebsd.org/threads/58266/+&cd=2&hl=en&ct=clnk&gl=in

So, I just did a swapoff -La and then ran the split command again - no issues whatsoever!

Some idea of the configuration can be had from the information below; please let me know if any further logs / debugging information is needed.

$ freebsd-version
11.0-RELEASE-p11
$ uname -a
FreeBSD mellon 11.0-RELEASE-p9 FreeBSD 11.0-RELEASE-p9 #0: Tue Apr 11 08:48:40 UTC 2017  root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC  amd64

CPU: Intel(R) Core(TM) i3-2330M CPU @ 2.20GHz (2195.06-MHz K8-class CPU)
  Origin="GenuineIntel"  Id=0x206a7  Family=0x6  Model=0x2a  Stepping=7
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x1dbae3bf<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,POPCNT,TSCDLT,XSAVE,OSXSAVE,AVX>
  AMD Features=0x28100800<SYSCALL,NX,RDTSCP,LM>
  AMD Features2=0x1<LAHF>
  XSAVE Features=0x1<XSAVEOPT>
  VT-x: PAT,HLT,MTF,PAUSE,EPT,UG,VPID
  TSC: P-state invariant, performance statistics
real memory  = 8589934592 (8192 MB)
avail memory = 8172896256 (7794 MB)
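For reference, the handbook's swap-file recipe (section 11.12.2, linked above) amounts to the following. The file name /usr/swap0, the md99 device, and the size are the handbook's examples, not necessarily exactly what was used on this machine.

```shell
# Handbook-style swap file setup on FreeBSD 11.x (run as root).
# Names and size are the handbook's examples:
dd if=/dev/zero of=/usr/swap0 bs=1m count=4096   # 4 GiB backing file
chmod 0600 /usr/swap0                            # must not be world-readable

# Attach the file as memory disk md99 and use it for swap at boot:
echo "md99 none swap sw,file=/usr/swap0,late 0 0" >> /etc/fstab

swapon -aL    # enable all swap, including "late" entries, without a reboot

# The workaround described above simply turns that swap back off:
swapoff -La
```

With swapoff -La the system runs with no swap at all, so this is a diagnostic workaround rather than a fix: it avoids the pageout path entirely at the cost of making real memory exhaustion fatal.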
See bugzilla 206048 for more examples and notes. It is a long-standing issue, not new with 11.x. This report is a duplicate, and there may be other duplicates besides 206048.
One update on this one. I installed a custom stable/11 kernel and world (sched = 4BSD was the only change), and the problem is no longer seen. To my untrained eye, it looks like some kind of swap-request starvation causing a hang when ULE is in use.

root@mellon:~ # uname -a
FreeBSD mellon 11.1-STABLE FreeBSD 11.1-STABLE #0 r313908+14aefcc16ee(stable/11): Sat Aug 12 00:33:04 IST 2017  root@mellon:/usr/obj/usr/home/user1/src/freebsd/sys/MYKERNEL  amd64
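For anyone wanting to repeat this, the scheduler swap is a one-line kernel config change plus the standard build. A sketch follows; MYKERNEL is the config name visible in the uname output above (its contents are not shown in this thread, so the config edit here is the assumed minimal change), and the world build is omitted for brevity.

```shell
# Sketch, run as root with stable/11 sources in /usr/src.
# Assumed config change in sys/amd64/conf/MYKERNEL:
#   replace:  options SCHED_ULE     # ULE scheduler
#   with:     options SCHED_4BSD    # 4BSD scheduler
cd /usr/src
make -j4 buildkernel KERNCONF=MYKERNEL
make installkernel KERNCONF=MYKERNEL
shutdown -r now
```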
(In reply to execve from comment #2)

I suggest testing with the port sysutils/stress, trying:

stress -d 2 -m 3 --vm-keep

I'd be interested to know whether a swap-file context handles that.
Yes, it seems to hold itself together fine for more than 10 minutes :).

root@mellon:~ # swapinfo
Device          1K-blocks     Used    Avail Capacity
/dev/md99         8388608        0  8388608     0%
root@mellon:~ # sysctl hw.physmem
hw.physmem: 8463953920
root@mellon:~ # date
Sat Aug 12 12:55:36 IST 2017
root@mellon:~ # stress -d 2 -m 3 --vm-keep
stress: info: [11727] dispatching hogs: 0 cpu, 0 io, 3 vm, 2 hdd
^C
root@mellon:~ # date
Sat Aug 12 13:06:37 IST 2017
root@mellon:~ # uname -a
FreeBSD mellon 11.1-STABLE FreeBSD 11.1-STABLE #0 r313908+14aefcc16ee(stable/11): Sat Aug 12 00:33:04 IST 2017  root@mellon:/usr/obj/usr/home/user1/src/freebsd/sys/MYKERNEL  amd64
(In reply to execve from comment #4)

Cool. (I suppose top or some such could be used to confirm the expected activity, given the amount of RAM and other such context.)

I wonder what it would do without the "sched = 4BSD was the only change". (Historically 11.x has been a problem, but likely all the examples had left the scheduler unchanged.)
(In reply to execve from comment #4)

FYI: since you have more RAM than the original context for that stress command, I'll quote from the man page:

  -m, --vm N       spawn N workers spinning on malloc()/free()
  --vm-bytes B     malloc B bytes per vm worker (default is 256MB)
  -d, --hdd N      spawn N workers spinning on write()/unlink()
  --vm-keep        redirty memory instead of freeing and reallocating

So:

stress -d 2 -m 3 --vm-keep

is only doing 3*256MB = 768MB of VM use. That was a large percentage of the 1GB of RAM that the related bugzilla 206048 indicated as the context for the command, but it is not that much relative to around 8 GiBytes of RAM.
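The arithmetic behind that point can be sanity-checked with plain POSIX shell (treating the 256MB default as MiB, which is how stress actually allocates):

```shell
# Back-of-the-envelope check of the stress memory footprint:
vm_workers=3
vm_bytes=$((vm_workers * 256 * 1024 * 1024))   # 3 workers * 256 MiB default
ram_small=$((1024 * 1024 * 1024))              # the 1 GiB context of bug 206048
ram_large=$((8 * 1024 * 1024 * 1024))          # this machine's ~8 GiB

echo "$((100 * vm_bytes / ram_small))% of 1 GiB"   # -> 75% of 1 GiB
echo "$((100 * vm_bytes / ram_large))% of 8 GiB"   # -> 9% of 8 GiB
```

So on the 8 GiB machine the default invocation barely pressures memory at all; adding something like --vm-bytes 2G (or more vm workers) would be needed to force comparable paging activity.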
I do not think there is any need to increase the memory usage. As I mentioned in the original PR description, even without X running, on the same system with 8GB RAM I could reproduce this from the console using a split command on a 6-7GB file:

>> running split on a 6.4G file (filesystem dump of disk using dump) -- something like
>> split -d -b 200M -a 4 - part
>> This would then freeze at one point - making the system unusable. I tried this 2-3 times.

It is very clear there is an issue - and from my experience it is narrowed down to when the ULE scheduler and a swap file are in use.
(In reply to execve from comment #7)

I tried a couple of variations of the experiment that I suggested. Unfortunately the results are a little complicated to interpret.

Context: under VirtualBox (on Windows 10 Pro) with. . . (Bugzilla 206048 has pointed out reproducibility under virtual machines.)

FreeBSDx64OPC11S# uname -apKU
FreeBSD FreeBSDx64OPC11S 11.1-STABLE FreeBSD 11.1-STABLE r322433M amd64 amd64 1101501 1101501

# svnlite diff /usr/src/
Index: /usr/src/sys/amd64/conf/GENERIC
===================================================================
--- /usr/src/sys/amd64/conf/GENERIC     (revision 322433)
+++ /usr/src/sys/amd64/conf/GENERIC     (working copy)
@@ -24,7 +24,8 @@
 makeoptions    DEBUG=-g                # Build kernel with gdb(1) debug symbols
 makeoptions    WITH_CTF=1              # Run ctfconvert(1) for DTrace support
 
-options        SCHED_ULE               # ULE scheduler
+#options       SCHED_ULE               # ULE scheduler
+options        SCHED_4BSD              # 4BSD scheduler
 options        PREEMPTION              # Enable kernel thread preemption
 options        INET                    # InterNETworking
 options        INET6                   # IPv6 communications protocols

I tried:

4 processors and 1 GiByte of RAM assigned, using: stress -d 2 -m 3 --vm-keep

and, separately:

8 processors and 1 GiByte of RAM assigned, using: stress -d 6 -m 3 --vm-keep

I had a top -Cawopid running in each case with its own ssh into the virtual machine; stress was via ssh as well.

In the 2nd case I got a lock-up: top stopped updating, and input was ignored by both of the ssh sessions (top and stress) and by the console window, including input such as ^C and ^T. The console window did eventually show:

swap_pager: I/O error - pageout failed; blkno 7367, size 4096, error 12

(After seeing that I waited a while longer, but I eventually gave up on waiting and killed the virtual machine.)

I later found a list message reporting on such "error 12" variants of the message:

QUOTE
> I think it might be ENOMEM from a geom when trying to g_clone_bio. . . .
It shouldn't happen, but you should notice no ill effects (that is, the page isn't lost, it just wasn't paged out and there's a few bytes less that the pager could do at the moment).
END QUOTE

As for the lock-up structure. . . unfortunately top did not happen to update to show any of the lock-up structure in other processes before it locked up.

It does at least appear to be not as easy to get a lock-up (or to get ENOMEM and a failure to page out) with SCHED_4BSD, to the degree that just a couple of tests indicate anything about such. But getting stuck appears possible, and pageouts can fail to happen for lack of memory, or so it appears.
(In reply to Mark Millard from comment #8)

I should also have said: the Windows 10 Task Manager Performance-tab display of CPU usage across threads/cores suggested a possible live-lock instead of a dead-lock: 7 of the 8 "processors" (in VirtualBox terms) were fairly busy, but no 8th one was noticeably busy. (Windows 10 Pro was not otherwise busy in any sustained way.) But technically I cannot prove, from my evidence, which it was: a lack of overall progress vs. very slow overall progress.