I built a new kernel on -current. The build was successfully, and I rebooted the machine. To my surprise the machine hangs in a kernel panic during swapoff sys call. It turns out that if the machine used some swap, it will panic at the next reboot. This is 100% reproducible. How to repeat: # make sure that the machine used some swap. Here I’m using a perl script which will # grow at least a 1GB RAM big $ perl -e '$a=`man tcsh`; for(0..3000) { $b.=$a}' # check the swap usage, e.g. $ top -b 2 | egrep ^Swap Swap: 2500M Total, 18M Used, 2482M Free # reboot the machine $ reboot on the console I get this: swap_pager: I/O error - pagein failed; blkno 280694, size 4096, error 5 panic: swap_pager_force_pagein: read from swap failed I saw this on two machines, one use a 1GB swap device and the other a 2GB swap file cat /etc/fstab /dev/vtbd0p3 / ufs rw 1 1 cat /etc/fstab md99 none swap sw,file=/var/swap/swap.0,late 0 0 This problem is new to me, I didn’t saw this 3 weeks ago in November. Using swapoff/swapon on the command line works fine. Workaround: before a reboot, disable swap with the swapoff command: $ swapoff -a swapoff: removing /dev/md99 as swap device $ reboot
here are screen shots from the VNC console: https://people.freebsd.org/~wosch/tmp/pr-224479/swapoff-device.png https://people.freebsd.org/~wosch/tmp/pr-224479/swapoff-file.png
the suspect commit is: commit c123c7433b7eb3ccacfa1bae8ae136c61cfe8462 Author: alc <alc@FreeBSD.org> Date: Tue Nov 28 17:46:03 2017 +0000 When the swap pager allocates space on disk, it requests contiguous blocks in a single call to blist_alloc(). However, when it frees that space, it previously called blist_free() on each block, one at a time. With this change, the swap pager identifies ranges of contiguous blocks to be freed, and calls blist_free() once per range. In one extreme case, that is described in the review, the time to perform an munmap(2) was reduced by 55%. Submitted by: Doug Moore <dougm@rice.edu> Reviewed by: kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D12397 I did a git revert, recompiled the kernel and the problem was gone. Note: I do not say this commit created a new bug. It is very likely that it triggered an old bug. There are many reports about this kind of bug for older releases on the net.
I can repeat this with: Unmount a file system with an active swap file. Details @ https://people.freebsd.org/~pho/stress/log/swapoff2.txt
Note: it seems that you need some swap space in use. 5MB may be not enough, make sure you use at least 10-20MB
Older, similar reports: FreeBSD 7.x/8.0 http://www.bsdportal.ru/viewtopic.php?p=137518 FreeBSD 9.3 http://forum.lissyara.su/viewtopic.php?t=43340 FreeBSD 6.x http://www.bsdforen.de/threads/swap_pager-indefinite-wait-buffer-unter-7-rc-mit-swapfile.20324/
(In reply to Peter Holm from comment #3) Peter, can you repeat the stress test with the mount flag sync (mount -o sync /mnt) for the filesystem with the active swap file? Thanks.
I don't have the means to reproduce this bug. For someone who does (Peter?), could you possibly add this assertion to the code and see if the conditions that trigger the bug trigger the assertion first? index 22bf6c72b8b..c026dadf8be 100644 --- a/sys/vm/swap_pager.c +++ b/sys/vm/swap_pager.c @@ -778,6 +778,7 @@ swp_pager_freeswapspace(daddr_t blk, daddr_t npages) mtx_lock(&sw_dev_mtx); TAILQ_FOREACH(sp, &swtailq, sw_list) { if (blk >= sp->sw_first && blk < sp->sw_end) { + MPASS(blk + npages <= sp->sw_end); sp->sw_used -= npages; /*
(In reply to Doug Moore from comment #7) The scenario is: - Use a vnode backed memory disk as swap device. - unmount the file system containing the vnode. The problem is also seen with an old 10.3-STABLE FreeBSD 10.3-STABLE #0 r321658M The assert did not trigger: https://people.freebsd.org/~pho/stress/log/swapoff2-2.txt
(In reply to Wolfram Schneider from comment #4) I do not see the kernel panic if there are more than 100MB swap space in use. So, we have a lower bound of 10MB and and upper bound of 100MB swap usage on a swap file. For 20MB swap usage I always get a kernel panic.
(In reply to Peter Holm from comment #8) I suggest seeing bugzilla 206048 about file system based swap spaces hanging up. In particular comments #7 and #8 that quote Konstantin Belousov on the subject (or reference his material). Part of the text says: As result, swapfile swapping is more prone to the trivial and unavoidable deadlocks where the pagedaemon thread, which produces free memory, needs more free memory to make a progress. Swap write on the raw partition over simple partitioning scheme directly over HBA are usually safe, while e.g. zfs over geli over umass is the worst construction.
I do not quite understand what a cause for the complain is. If the system cannot get the swapped out page data on swapoff, this means that the data is corrupted. Preventing further user data corruption by panicing is a normal reaction. In the case of the swap file on the unmounted filesystem, this is expected mode of failure. Unmount reclaims the vnode, making all io to fail, putting the system into situation described in the previous paragraph.
(In reply to Konstantin Belousov from comment #11) the complain is: I built and installed a new kernel on 12-current. I run "reboot" and the kernel panics during shutdown and hangs in the kernel debugger on the console. It takes me less than a minute to reproduce this, see the howto in the bug report. The machine crashes 9 out of 10 times. Pretty close to a deterministic failure. Apparently, there is a race condition here: we unmount a file system while swapoff is not done yet with the swapfile on the file system. It would be great if we can fix this. Thanks.
(In reply to Wolfram Schneider from comment #12) Does the problem happen for: shutdown -r now instead of reboot? I ask because the man page for reboot says that reboot does not cleanly shut down all processes, recommending the use of shutdown instead: Normally, the shutdown(8) utility is used when the system needs to be halted or restarted, giving users advance warning of their impending doom and cleanly terminating specific programs. So, reboot is documented to be more likely to have race conditions. Side note: Going the other direction, given what is known about file-system swapspaces (i.e., vnode-backed), I'm not sure why they are allowed by default instead of needing support to be explicitly enabled. (The enabling indicating that one is supposed to understand the issues involved in their use.)
(In reply to Mark Millard from comment #13) There is no race, the system behaves as it supposed to. If swap storage is revoked, the swap in procedure panics. I am not sure that trying to convert this situation into SIGSEGV to the affected process is useful, because kernel stacks can be also swapped out, and the kernel has no other way out except panicing. You complain about swap over file being enabled by default does not make sense either. The configuration requires explicit user involvement, and it is allowed by combination of the components that were designed to fit. The consequences of doing so is up to the user, we cannot and must not hold user' hands. Taking your line of reasoning, shall we disallow for installing the system on unreliable media like sdcard 'by default', because wearing makes the system fail fast ?
(In reply to Konstantin Belousov from comment #14) On the other hand, if the kernel knows that a vnode is used for swap and the kernel knows that the force reclaim of that vnode will lead to a panic, then why should the kernel allow that? I see several possibilities, but not sure if any of them makes sense from the FreeBSD architecture and design point of view. 1. When vgone-ing the swap vnode somehow perform an equivalent of swap off on it. 2. Do not allow even the force unmount of a filesystem if there is a vnode used for swap. 3. "Orphan" the swap vnode such that it is removed from its mount list, the name cache, etc, but it is still alive and is owned by the swap pager. I feel that this is impossible to do, however. Also, I am not sure about any "race", but it seems that during a shutdown we should first turn off all the file-backed swap and only then start unmounting filesystems. From comment #0 it seems that we do not follow this order.
(In reply to Konstantin Belousov from comment #14) You have written in the past that for file based (vnode-based) swap files the system is deadlock prone in a manor not under significant user control. From what I've seen lots of folks set up the file based swap spaces without having a clue that such is the status. I was one of them at one time, following http://www.wonkity.com/~wblock/docs/html/ssd.html#_filesystems_and_trim material for SSDs that gave no hint of the issues involved. If the instructions had told me that I needed to enable the mode of use because of deadlock issues that do not happen with partition based swap spaces, I never would have tied it. I had swap space based deadlocks vastly faster than any sdcard wear out would have occurred: the configuration was simply unreliable over fairly short time frames. Also, the deadlocks are not examples of wear-and-tear. (You might want a better analogy for your point in that respect.) I view FreeBSD as designed to automatically avoid deadlocks for swapping only for partition-based swap spaces. Lesser points but more tied to this report: I was expecting that "shutdown -r now" might stop some processes before initiating the v-node removal, making such processes no longer sources of swap-in activity for later stages. (I was not thinking of any general fix to the deadlocking issues.) reboot, by contrast, I was expecting leaves more processes around that might try to swap-in in an untimely manor. I freely admit my expectations might be garbage-in/garbage-out. I was not expecting the sending of SIGSEGV or other signals to a process that is trying to swap-in after there is effectively nothing available to swap-in from. Does some involved kernel stack need to be in a swapped out state to have the problem that has been described? Or can it be a problem when no kernel stacks are swapped out?
(In reply to Andriy Gapon from comment #15) +1 to all of this. We know what devices/files are used for swap and should be able to prevent this panic for user-initiated action like umount.
(In reply to Andriy Gapon from comment #15) This is perhaps going too far on the silly fest. Andrey, you do understand VFS, so I am quite frustrated. 1. If the vnode carriers some database, should kernel stop the database on the basis that otherwise user data might be corrupted ? Should kernel print (Abort, Retry, Ignore ?) 2.If force unmount is not allowed because there is the swap on a file, would the next request be to add really-force option which causes unmount to proceed even in this case. 3. You agree on your own that io to reclaimed vnode is not possible. May be we should not reclaim such vnode, but then add VOP_RECLAIMFORREAL. It is very clear situation, system was directed to change its configuration in a way which makes the continuation of the operations sometimes problematic, sometimes not. Why people consider it is reasonable to either deny the reconfiguration or to deny and have kernel to spit the whole man page on the console as the additional punishment, is beyond me.
(In reply to Konstantin Belousov from comment #18) It is probably not clear about my notes (other and shutdown -r now vs. reboot testing directly): My experience was deadlocks during normal operation when I used a swapfile, not during shutdown. The specifics of just this submittal do not by itself drive why I wrote what I wrote. As such, I'll stop making such notes here. Sorry for the noise.
(In reply to Konstantin Belousov from comment #18) Re: (1) If the database is owned and operated by the kernel, YES. Why not swapoff any such vnode backed file before reclaiming the vnode, or at least before umounting the backing filesystem?
(In reply to Mark Millard from comment #13) Hi Mark, good point about "reboot" <-> "shutdown -r now". The handbook recommends to use shutdown. My bad. I tried shutdown -r now on the console and it works fine, not kernel panic. Good. I will try to setup a test script which checks if "shutdown -r now" will always work (untils it fails).
(In reply to Conrad Meyer from comment #20) Because we do not enforce a policy in the kernel on the possible md(4) usage. Even assuming that we do, we would need (a) teach VFS to see through md(4) to understand the usages for the files (b) call swap_pager_swapoff() while VFS flushes vnodes. Both are intolerable layering violations, see the reference to 'policy' above.
I have a feeling that the current design for the file-backed swap is broken.
(In reply to Andriy Gapon from comment #23) [Ignore if you care not for my history and opinion for swapfile usage vs. partition usage.] I will note that when I ran into the deadlock problems from using a swapfile instead of a partition, it was not obvious at the time that swapspace handling was involved at all: it was also first use of a platform as well. I was not lucky to have changed just one thing to know to revert it. For all I knew the type of platform might have been having problems in general. It was a pain to figure out that swapfile use was what I could avoid using in order to avoid the hangups that were happening. To me the discovery seemed to mean that FreeBSD violated a variation on the principle of least astonishment (but not tied to changes to historical behavior). (Having a partition with UFS and only one user-file, the swapfile, might avoid some of the issues. I had done this earlier [by an accidental choice] on powerpc64 and powerpc and had not seen any problems. This contributed to my not challenging the swapfile use earlier in trying to isolate contributions to the hangups: I did not then realize that I'd done something possibly special in that earlier use.)
Isn't this problem because "umount" happens before "swapoff"? I also use NFS swapfiles from time to time and when I forget to swapoff such files before shutdown, kernel panics, too. # mount -t nfs xxx:/yyy /mnt/nfs # dd if=/dev/zero of=/mnt/nfs/swap bs=1M seek=999 # swapon /mnt/ntfs/swap # --- use swap.... # i.e. mount -t tmpfs tmp /mnt/tmpfs; dd if=/dev/zero of=/mnt/tmpfs/fill # shutdown -p now I'm not sure what kinds of side effects will happen if we "swapoff" before "umount", though...
(In reply to Wolfram Schneider from comment #21) shutdown case is taken care by https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=187081 - swaplate "reboot" doesn't call this as you discover.
Created attachment 189605 [details] swapoff, umount, swapoff Primary target of the patch is for NFS swapfiles but also reduces chance of panics in reboot cases as well.
(In reply to ota from comment #27) Patch seems incomplete. There is a mismatched brace added in swapoff_all.
Created attachment 189630 [details] swapoff, umount, swapoff 2 The previous patch didn't delete TAILQ_FOREACH_SAFE() line properly, my bad.
(In reply to ota from comment #29) Why is this line changed? It seems gratuitous: > static TAILQ_HEAD(swdevt_head, swdevt) swtailq = TAILQ_HEAD_INITIALIZER(swtailq); Second: Can swapoff_all be moved much later? Possibly just before vfs_unmountall()? My hunch is that syncing / flushing buffers may free up enough memory to disable swap, if we were in an impossible-to-swapoff scenario before. I may be missing something, though.
(In reply to Conrad Meyer from comment #30) One other comment: Should the existing swapoff_all() call at the end of bufshutdown() should be deleted if we are moving the call higher up? Or is there value in invoking it twice? (Maybe unmounting a filesystem frees sufficient memory for a swap partition to swapoff.)
(In reply to Conrad Meyer from comment #31) if tmpfs uses large memory so that files on tmpfs were swapped out, the 1st swapoff_all() can fail due to insufficient memory. vfs_unmountall() will unmount and release such tmpfs memory so that 2nd swapoff_all() may be able to swapoff this time; however, if swap spaces left are still NFS files or mdconfig files, we will still get panic. So, I changed swapoff_all() to release in the reverse order swap spaces are added. System adds swap devices at start (given the penalty of mdconfig/NFS, raw devices are best and most suitable for default entries). When memory and swap space go short, I add md-file or NFS files to avoid immediate issues. So, reverse order in swap_all will take outs manually added entries first to reduce chances of panic. Actually... if all file systems are already unmounted, any processes get swapped-in from swap spaces cannot do much useful anyway; indeed, we may not need 2nd(existing) swapoff_all()...
(In reply to ota from comment #32) I removed the bottom/2nd swapoff_all() since the above post and testing ( I shutdown multiple times in a day ) but haven't had any issues such as panic or any data loss.
I didn't had the issue for a long time, closing.