Bug 224479 - kernel panic in reboot+swapoff sys call
Summary: kernel panic in reboot+swapoff sys call
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: CURRENT
Hardware: Any
OS: Any
Importance: --- Affects Only Me
Assignee: freebsd-bugs mailing list
URL: https://reviews.freebsd.org/D12397
Keywords: patch
Depends on:
Blocks: 206048 224975
Reported: 2017-12-20 11:17 UTC by Wolfram Schneider
Modified: 2018-06-27 01:07 UTC (History)
9 users

See Also:


Attachments
swapoff, umount, swapoff (1.77 KB, patch), 2018-01-11 05:32 UTC, ota
swapoff, umount, swapoff 2 (1.77 KB, patch), 2018-01-11 19:38 UTC, ota

Description Wolfram Schneider freebsd_committer 2017-12-20 11:17:30 UTC
I built a new kernel on -current. The build was successful, and I rebooted the machine. To my surprise, the machine hangs with a kernel panic during the swapoff sys call.

It turns out that if the machine used some swap, it will panic at the next reboot. This is 100% reproducible.

How to repeat:

# make sure that the machine has used some swap. Here I'm using a perl script which
# grows to at least 1 GB of RAM
$ perl -e '$a=`man tcsh`; for(0..3000) { $b.=$a}'

# check the swap usage, e.g.
$  top -b 2 | egrep ^Swap
Swap: 2500M Total, 18M Used, 2482M Free

# reboot the machine
$ reboot

on the console I get this:
swap_pager: I/O error - pagein failed; blkno 280694, size 4096, error 5
panic: swap_pager_force_pagein: read from swap failed

I saw this on two machines: one uses a 1 GB swap device and the other a 2 GB swap file.
cat /etc/fstab 
/dev/vtbd0p3    /               ufs     rw      1       1

cat /etc/fstab 
md99	none	swap	sw,file=/var/swap/swap.0,late	0	0

This problem is new to me; I didn't see this 3 weeks ago in November. Using swapoff/swapon on the command line works fine.

Workaround: before a reboot, disable swap with the swapoff command:
$ swapoff  -a
swapoff: removing /dev/md99 as swap device
$ reboot
Comment 2 Wolfram Schneider freebsd_committer 2017-12-20 12:51:38 UTC
the suspect commit is:

commit c123c7433b7eb3ccacfa1bae8ae136c61cfe8462
Author: alc <alc@FreeBSD.org>
Date:   Tue Nov 28 17:46:03 2017 +0000

    When the swap pager allocates space on disk, it requests contiguous
    blocks in a single call to blist_alloc().  However, when it frees
    that space, it previously called blist_free() on each block, one at a
    time.  With this change, the swap pager identifies ranges of
    contiguous blocks to be freed, and calls blist_free() once per
    range.  In one extreme case, that is described in the review, the time
    to perform an munmap(2) was reduced by 55%.
    
    Submitted by:   Doug Moore <dougm@rice.edu>
    Reviewed by:    kib
    MFC after:      1 week
    Differential Revision:  https://reviews.freebsd.org/D12397


I did a git revert, recompiled the kernel and the problem was gone.

Note: I do not say this commit created a new bug. It is very likely that it triggered an old bug. There are many reports about this kind of bug for older releases on the net.
Comment 3 Peter Holm freebsd_committer 2017-12-20 15:09:56 UTC
I can repeat this with:
Unmount a file system with an active swap file.

Details @ https://people.freebsd.org/~pho/stress/log/swapoff2.txt
Comment 4 Wolfram Schneider freebsd_committer 2017-12-20 15:55:24 UTC
Note: it seems that you need some swap space in use. 5 MB may not be enough; make sure you use at least 10-20 MB.
Comment 6 Wolfram Schneider freebsd_committer 2017-12-20 16:52:03 UTC
(In reply to Peter Holm from comment #3)

Peter, can you repeat the stress test with the mount flag sync (mount -o sync /mnt) for the filesystem with the active swap file? Thanks.
Comment 7 Doug Moore 2017-12-24 00:28:31 UTC
I don't have the means to reproduce this bug.  For someone who does (Peter?), could you possibly add this assertion to the code and see if the conditions that trigger the bug trigger the assertion first?


index 22bf6c72b8b..c026dadf8be 100644
--- a/sys/vm/swap_pager.c
+++ b/sys/vm/swap_pager.c
@@ -778,6 +778,7 @@ swp_pager_freeswapspace(daddr_t blk, daddr_t npages)
        mtx_lock(&sw_dev_mtx);
        TAILQ_FOREACH(sp, &swtailq, sw_list) {
                if (blk >= sp->sw_first && blk < sp->sw_end) {
+                       MPASS(blk + npages <= sp->sw_end);
                        sp->sw_used -= npages;
                        /*
Comment 8 Peter Holm freebsd_committer 2017-12-24 07:44:35 UTC
(In reply to Doug Moore from comment #7)
The scenario is:

- Use a vnode backed memory disk as swap device.
- unmount the file system containing the vnode.

The problem is also seen with an old 10.3-STABLE (FreeBSD 10.3-STABLE #0 r321658M).
The assert did not trigger:
https://people.freebsd.org/~pho/stress/log/swapoff2-2.txt
Comment 9 Wolfram Schneider freebsd_committer 2017-12-24 11:25:24 UTC
(In reply to Wolfram Schneider from comment #4)

I do not see the kernel panic if there are more than 100 MB of swap space in use. So, we have a lower bound of 10 MB and an upper bound of 100 MB of swap usage on a swap file.

For 20MB swap usage I always get a kernel panic.
Comment 10 Mark Millard 2017-12-25 00:27:18 UTC
(In reply to Peter Holm from comment #8)

I suggest seeing bug 206048 about file-system-based swap spaces hanging up, in particular comments #7 and #8, which quote Konstantin Belousov on the subject (or reference his material).

Part of the text says:

As result, swapfile swapping is more prone to the trivial and unavoidable
deadlocks where the pagedaemon thread, which produces free memory, needs
more free memory to make a progress.  Swap write on the raw partition over
simple partitioning scheme directly over HBA are usually safe, while e.g.
zfs over geli over umass is the worst construction.
Comment 11 Konstantin Belousov freebsd_committer 2017-12-26 21:12:42 UTC
I do not quite understand what the cause for the complaint is. If the system cannot get the swapped-out page data on swapoff, this means that the data is corrupted. Preventing further user data corruption by panicking is a normal reaction.

In the case of a swap file on the unmounted filesystem, this is the expected mode of failure. Unmount reclaims the vnode, making all I/O fail and putting the system into the situation described in the previous paragraph.
Comment 12 Wolfram Schneider freebsd_committer 2017-12-27 19:51:21 UTC
(In reply to Konstantin Belousov from comment #11)

The complaint is: I built and installed a new kernel on 12-current. I ran "reboot" and the kernel panics during shutdown and hangs in the kernel debugger on the console.

It takes me less than a minute to reproduce this, see the howto in the bug report. The machine crashes 9 out of 10 times. Pretty close to a deterministic failure.

Apparently, there is a race condition here: we unmount a file system while swapoff is not yet done with the swap file on that file system.

It would be great if we can fix this. Thanks.
Comment 13 Mark Millard 2017-12-27 20:20:35 UTC
(In reply to Wolfram Schneider from comment #12)

Does the problem happen for:

shutdown -r now

instead of reboot?

I ask because the man page for reboot says that reboot does not cleanly shut down all processes, recommending the use of shutdown instead:

    Normally, the shutdown(8) utility is used when the system needs to be
    halted or restarted, giving users advance warning of their impending doom
    and cleanly terminating specific programs.

So, reboot is documented to be more likely to have
race conditions.

Side note:

Going the other direction, given what is known about file-system swap spaces (i.e., vnode-backed), I'm not sure why they are allowed by default instead of needing support to be explicitly enabled. (The enabling would indicate that one is supposed to understand the issues involved in their use.)
Comment 14 Konstantin Belousov freebsd_committer 2017-12-28 12:31:25 UTC
(In reply to Mark Millard from comment #13)
There is no race; the system behaves as it is supposed to. If the swap storage is revoked, the swap-in procedure panics. I am not sure that trying to convert this situation into a SIGSEGV to the affected process is useful, because kernel stacks can also be swapped out, and then the kernel has no way out except panicking.

Your complaint about swap over a file being enabled by default does not make sense either. The configuration requires explicit user involvement, and it is allowed by a combination of components that were designed to fit together. The consequences of doing so are up to the user; we cannot and must not hold users' hands. Taking your line of reasoning, shall we disallow installing the system on unreliable media like an sdcard "by default", because wear makes the system fail fast?
Comment 15 Andriy Gapon freebsd_committer 2017-12-28 13:09:05 UTC
(In reply to Konstantin Belousov from comment #14)
On the other hand, if the kernel knows that a vnode is used for swap, and the kernel knows that a forced reclaim of that vnode will lead to a panic, then why should the kernel allow that?

I see several possibilities, but I am not sure whether any of them makes sense from the FreeBSD architecture and design point of view.

1. When vgone-ing the swap vnode, somehow perform the equivalent of swapoff on it.

2. Do not allow even a forced unmount of a filesystem if one of its vnodes is used for swap.

3. "Orphan" the swap vnode such that it is removed from its mount list, the name cache, etc., but is still alive and owned by the swap pager. I feel that this is impossible to do, however.

Also, I am not sure about any "race", but it seems that during a shutdown we should first turn off all file-backed swap and only then start unmounting filesystems. From comment #0 it seems that we do not follow this order.
Comment 16 Mark Millard 2017-12-28 14:12:18 UTC
(In reply to Konstantin Belousov from comment #14)

You have written in the past that with file-based (vnode-backed) swap files the system is deadlock-prone in a manner not under significant user control. From what I've seen, lots of folks set up file-based swap spaces without having a clue that this is the case.

I was one of them at one time, following the http://www.wonkity.com/~wblock/docs/html/ssd.html#_filesystems_and_trim material for SSDs, which gave no hint of the issues involved. If the instructions had told me that I needed to enable this mode of use because of deadlock issues that do not happen with partition-based swap spaces, I never would have tried it.

I had swap-space-based deadlocks vastly faster than any sdcard wear-out would have occurred: the configuration was simply unreliable over fairly short time frames. Also, the deadlocks are not examples of wear-and-tear. (You might want a better analogy for your point in that respect.)

I view FreeBSD as designed to automatically avoid swapping deadlocks only for partition-based swap spaces.

Lesser points but more tied to this report:

I was expecting that "shutdown -r now" might stop some processes before initiating the vnode removal, making those processes no longer sources of swap-in activity in later stages. (I was not thinking of any general fix to the deadlocking issues.) reboot, by contrast, I was expecting to leave more processes around that might try to swap in at an untimely moment. I freely admit my expectations might be garbage-in/garbage-out.

I was not expecting the sending of SIGSEGV or other signals to a process that is trying to swap in after there is effectively nothing left to swap in from.

Does some involved kernel stack need to be in a swapped-out state for the described problem to occur? Or can it be a problem when no kernel stacks are swapped out?
Comment 17 Conrad Meyer freebsd_committer 2017-12-28 17:17:37 UTC
(In reply to Andriy Gapon from comment #15)
+1 to all of this. We know which devices/files are used for swap and should be able to prevent this panic for a user-initiated action like umount.
Comment 18 Konstantin Belousov freebsd_committer 2017-12-28 17:36:32 UTC
(In reply to Andriy Gapon from comment #15)
This is perhaps going too far with the silly fest. Andriy, you do understand VFS, so I am quite frustrated.

1. If the vnode carries some database, should the kernel stop the database on the basis that otherwise user data might be corrupted? Should the kernel print (Abort, Retry, Ignore?)
2. If a forced unmount is not allowed because there is swap on a file, would the next request be to add a really-force option which causes the unmount to proceed even in this case?
3. You agree on your own that I/O to a reclaimed vnode is not possible. Maybe we should not reclaim such a vnode, but then add VOP_RECLAIMFORREAL.

It is a very clear situation: the system was directed to change its configuration in a way which makes the continuation of operations sometimes problematic, sometimes not. Why people consider it reasonable either to deny the reconfiguration, or to deny it and have the kernel spit the whole man page on the console as additional punishment, is beyond me.
Comment 19 Mark Millard 2017-12-28 17:56:03 UTC
(In reply to Konstantin Belousov from comment #18)

My notes (beyond the shutdown -r now vs. reboot testing) were probably unclear: my experience was deadlocks during normal operation when I used a swap file, not during shutdown. The specifics of just this submittal do not by themselves drive why I wrote what I wrote.

As such, I'll stop making such notes here. Sorry
for the noise.
Comment 20 Conrad Meyer freebsd_committer 2017-12-28 18:12:36 UTC
(In reply to Konstantin Belousov from comment #18)
Re: (1) If the database is owned and operated by the kernel, YES.

Why not swapoff any such vnode backed file before reclaiming the vnode, or at least before umounting the backing filesystem?
Comment 21 Wolfram Schneider freebsd_committer 2017-12-28 20:51:56 UTC
(In reply to Mark Millard from comment #13)

Hi Mark,

Good point about "reboot" <-> "shutdown -r now". The handbook recommends using shutdown. My bad.

I tried shutdown -r now on the console and it works fine, no kernel panic. Good.

I will try to set up a test script which checks whether "shutdown -r now" always works (until it fails).
Comment 22 Konstantin Belousov freebsd_committer 2017-12-28 23:09:46 UTC
(In reply to Conrad Meyer from comment #20)
Because we do not enforce a policy in the kernel on possible md(4) usage.

Even assuming that we did, we would need to
(a) teach VFS to see through md(4) to understand the usages of the files, and
(b) call swap_pager_swapoff() while VFS flushes vnodes.
Both are intolerable layering violations; see the reference to "policy" above.
Comment 23 Andriy Gapon freebsd_committer 2017-12-29 10:53:15 UTC
I have a feeling that the current design for the file-backed swap is broken.
Comment 24 Mark Millard 2017-12-29 22:58:44 UTC
(In reply to Andriy Gapon from comment #23)

[Ignore if you don't care for my history and opinions on swapfile usage vs. partition usage.]

I will note that when I ran into the deadlock problems from using a swapfile instead of a partition, it was not obvious at the time that swap-space handling was involved at all: it was also my first use of that platform. I was not lucky enough to have changed just one thing and thus know what to revert. For all I knew, that type of platform might have been having problems in general.

It was a pain to figure out that swapfile use was what I could avoid in order to avoid the hangups that were happening. To me the discovery seemed to mean that FreeBSD violated a variation on the principle of least astonishment (though not one tied to changes in historical behavior).

(Having a partition with UFS and only one user file, the swapfile, might avoid some of the issues. I had done this earlier [by an accidental choice] on powerpc64 and powerpc and had not seen any problems. This contributed to my not challenging the swapfile use earlier when trying to isolate contributions to the hangups: I did not then realize that I'd done something possibly special in that earlier use.)
Comment 25 ota 2018-01-03 03:48:28 UTC
Isn't this problem because "umount" happens before "swapoff"?

I also use NFS swap files from time to time, and when I forget to swapoff such files before shutdown, the kernel panics, too.

# mount -t nfs xxx:/yyy /mnt/nfs
# dd if=/dev/zero of=/mnt/nfs/swap bs=1M seek=999
# swapon /mnt/nfs/swap
# --- use swap....
# i.e. mount -t tmpfs tmp /mnt/tmpfs; dd if=/dev/zero of=/mnt/tmpfs/fill
# shutdown -p now

I'm not sure what kinds of side effects would happen if we "swapoff" before "umount", though...
Comment 26 ota 2018-01-07 05:32:01 UTC
(In reply to Wolfram Schneider from comment #21)

The shutdown case is taken care of by https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=187081 (swaplate).

"reboot" doesn't call this, as you discovered.
Comment 27 ota 2018-01-11 05:32:00 UTC
Created attachment 189605 [details]
swapoff, umount, swapoff

The primary target of the patch is NFS swap files, but it also reduces the chance of panics in the reboot case.
Comment 28 Conrad Meyer freebsd_committer 2018-01-11 06:04:35 UTC
(In reply to ota from comment #27)
Patch seems incomplete.  There is a mismatched brace added in swapoff_all.
Comment 29 ota 2018-01-11 19:38:52 UTC
Created attachment 189630 [details]
swapoff, umount, swapoff 2

The previous patch didn't delete the TAILQ_FOREACH_SAFE() line properly, my bad.
Comment 30 Conrad Meyer freebsd_committer 2018-01-11 21:53:56 UTC
(In reply to ota from comment #29)
Why is this line changed?  It seems gratuitous:

> static TAILQ_HEAD(swdevt_head, swdevt) swtailq = TAILQ_HEAD_INITIALIZER(swtailq);

Second: Can swapoff_all be moved much later?  Possibly just before vfs_unmountall()?

My hunch is that syncing / flushing buffers may free up enough memory to disable swap, if we were in an impossible-to-swapoff scenario before.  I may be missing something, though.
Comment 31 Conrad Meyer freebsd_committer 2018-01-11 21:55:39 UTC
(In reply to Conrad Meyer from comment #30)
One other comment: should the existing swapoff_all() call at the end of bufshutdown() be deleted if we are moving the call higher up? Or is there value in invoking it twice? (Maybe unmounting a filesystem frees sufficient memory for a swap partition to swapoff.)
Comment 32 ota 2018-01-11 23:46:15 UTC
(In reply to Conrad Meyer from comment #31)

If tmpfs uses so much memory that files on tmpfs were swapped out, the 1st swapoff_all() can fail due to insufficient memory.

vfs_unmountall() will unmount and release such tmpfs memory, so the 2nd swapoff_all() may be able to swapoff this time; however, if the remaining swap spaces are still NFS files or mdconfig files, we will still get a panic.

So, I changed swapoff_all() to release swap spaces in the reverse order in which they were added. The system adds swap devices at startup (given the penalty of mdconfig/NFS, raw devices are best and most suitable for the default entries).

When memory and swap space run short, I add md-file or NFS files to avoid immediate issues. So, the reverse order in swapoff_all() will take out manually added entries first, reducing the chance of a panic.

Actually... if all file systems are already unmounted, any process swapped in from swap space cannot do much useful work anyway; indeed, we may not need the 2nd (existing) swapoff_all()...
Comment 33 ota 2018-01-24 03:11:06 UTC
(In reply to ota from comment #32)

I removed the bottom/2nd swapoff_all() since the above post and have been testing (I shut down multiple times a day), but haven't had any issues such as panics or data loss.