195458 – Hang on shutdown/root unmount after FreeBSD 10.1R upgrade

Status:	Closed FIXED

Alias:	None

Product:	Base System
Classification:	Unclassified
Component:	kern
Version:	10.1-RELEASE
Hardware:	Any Any

Importance:	--- Affects Many People
Assignee:	FreeBSD Release Engineering

Reported:	2014-11-27 22:44 UTC by Walter Hop
Modified:	2015-05-13 23:30 UTC
CC List:	33 users

Attachments
Screenshot of the hang (63.38 KB, image/png) 2014-11-27 22:44 UTC, Walter Hop
DIAGNOSTIC panic (1/2) (16.34 KB, image/png) 2015-03-11 16:22 UTC, Glen Barber
DIAGNOSTIC panic (2/2) (18.13 KB, image/png) 2015-03-11 16:22 UTC, Glen Barber
lock order reversal after kern_reboot() (14.92 KB, image/png) 2015-03-11 16:24 UTC, Glen Barber
ddb transcript of panic-on-boot with debugging options enabled (21.86 KB, text/plain) 2015-03-11 17:54 UTC, Glen Barber
debugging session after freebsd-update and reboot (47.15 KB, text/plain) 2015-03-11 20:27 UTC, Glen Barber

Description Walter Hop 2014-11-27 22:44:46 UTC

Created attachment 149945
Screenshot of the hang

On three out of six FreeBSD installs upgraded from 10.0R to 10.1R, the system freezes as the 10.1 system shuts down for the first time. After "All buffers synced." the system remains at 100% CPU and makes no progress for a long time. After a forced reset, the file system is dirty.

On another upgrade, the first reboot in 10.1 seemed to go fine. However, a subsequent shutdown flashed the following error:

    All buffers synced.
    softdep_waitidle: Failed to flush worklist for 0xfffff800027b4330
    unmount of / failed (BUSY)

also leaving the file system dirty. This seems related to the first problem (see below).

--

Throughout the 10.1-RC cycle, various others have described the hang after upgrading to 10.1 on the freebsd-stable mailing list:
https://lists.freebsd.org/pipermail/freebsd-stable/2014-October/080595.html

I spoke to another user on IRC who confirmed the hang with 10.1-RELEASE on two physical servers.

In #195183, another hang at reboot after updating to 10.1 is reported (although I am not sure about the effect of ipfw; in any case enabling/disabling ipfw has no effect for me.)

--

I reproduced the problem on a clean 10.0-RELEASE install in VMware after a freebsd-update to 10.1-RELEASE. I snapshotted this VM after freebsd-update but before rebooting it, so I can do experiments on it if needed.

It appears to me that the problem happens during unmounting of the UFS root filesystem. If after upgrading I drop to single user mode ("shutdown now"), and attempt the command "/sbin/mount -o ro /", this should normally succeed. However, on a failed 10.1 machine, the CPU goes to 100% and the command never finishes. Just like during a shutdown, the kernel is alive (e.g. the host pings) but it's not possible to recover from the situation.

I have not seen this problem on subsequent reboots of 10.1 systems, nor on clean 10.1 installs (non-upgraded), but very consistently after a 10.0 to 10.1 upgrade.

It's a pretty big showstopper for me at this point, so please let me know if I can provide more info from the test box or help in other ways.

--

Reproduce:
- download 10.0-RELEASE amd64 ISO
- create a VM in VMware
- install 10.0-RELEASE, UFS2, all defaults # problem happens on SU and SU+J
- freebsd-update upgrade -r 10.1-RELEASE
- freebsd-update install
- shutdown -r now
- freebsd-update install
- freebsd-update install # make a VM snapshot before continuing
- shutdown -r now # CPU goes to 100% after "All buffers synced."

Comment 1 Walter Hop 2014-11-29 22:38:48 UTC

I noticed today that the same problem happens when upgrading from 10.1-RC3 to 10.1-RELEASE, and also on a clean install of 10.1-RELEASE.

I did some more digging. As a refresher, the hang didn't occur just when booting the new kernel. It happened only after "freebsd-update install" was executed to replace userland. What is so special about "freebsd-update install" that would trigger the problem?

I think the interesting bit might be that it replaced /sbin/init.

I can completely reliably trigger a hang on a default 10.1-RELEASE install on UFS2 in VMware Fusion with the following procedure:

# chflags noschg /sbin/init
# cp -Rp /sbin/init /sbin/init2
# rm -f /sbin/init
# mv /sbin/init2 /sbin/init
# chflags schg /sbin/init
# reboot
=> Hang after "All buffers synced."

I created two clean 10.1 UFS2 installs, both of which exhibit the problem 100% of the time.

I tried doing the same on two clean 10.1 ZFS (auto) installs. ZFS does NOT exhibit the problem so far.

I tried disabling softupdates (tunefs -n disable /dev/da0p2) before doing the procedure. In this case, on multiple machines, it also does NOT hang.

On 10.0 there's also no problem.

So: on FreeBSD 10.1, when using UFS2 with journaled softupdates, replacing init leads to a hang when rebooting/unmounting root afterwards.

And a workaround might be to disable softupdates before upgrading to 10.1.

Comment 2 Juan Ramón Molina Menor 2014-12-10 13:35:33 UTC

Confirmed again with today's updates:
https://lists.freebsd.org/pipermail/freebsd-stable/2014-December/081246.html

And sorry Walter, I just realized that I managed to delete your name from the mail quotes… :(

Comment 3 Walter Hop 2014-12-10 20:59:13 UTC

I’ve checked whether the hang also occurs in an 11-CURRENT snapshot, and it does. (Tried the ISO snapshot available today, which is r273635.) WITNESS indicates a lock order reversal, which looks related to me:

root@current:~ # chflags noschg /sbin/init
root@current:~ # cp -Rp /sbin/init /sbin/init2
lock order reversal:
1st 0xfffffe007b842fa0 bufwait (bufwait) @ /usr/src/sys/kern/vfs_bio.c:3093
2nd 0xfffff80002b9ea00 dirhash (dirhash) @ /usr/src/sys/ufs/ufs/ufs_dirhash.c:284
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe000025c270
kdb_backtrace() at kdb_backtrace+0x39/frame 0xfffffe000025c320
witness_checkorder() at witness_checkorder+0xdad/frame 0xfffffe000025c3b0
_sx_xlock() at _sx_xlock+0x75/frame 0xfffffe000025c3f0
ufsdirhash_add() at ufsdirhash_add+0x3a/frame 0xfffffe000025c430
ufs_direnter() at ufs_direnter+0x6a0/frame 0xfffffe000025c4f0
ufs_makeinode() at ufs_makeinode+0x560/frame 0xfffffe000025c6a0
VOP_CREATE_APV() at VOP_CREATE_APV+0xf1/frame 0xfffffe000025c6d0
vn_open_cred() at vn_open_cred+0x29d/frame 0xfffffe000025c820
kern_openat() at kern_openat+0x26f/frame 0xfffffe000025c9a0
amd64_syscall() at amd64_syscall+0x25a/frame 0xfffffe000025cab0
Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe000025cab0
--- syscall (5, FreeBSD ELF64, sys_open), rip = 0x80094f01a, rsp = 0x7fffffffe958, rbp = 0x7fffffffe9c0 ---

Finally, when rebooting, another lock order reversal appears and the system hangs.
I don’t have a text log of this, so I’ll copy the first few lines:

Syncing disks, vnodes remaining...1 0 0 done
All buffers synced.
lock order reversal:
1st 0xfffff80002e65d50 ufs (ufs) @ /usr/src/sys/kern/vfs_mount.c:1223
2nd 0xfffff80002e665f0 devfs (devfs) @ /usr/src/sys/kern/vfs_subr.c:2144

Screenshot is here: http://lf.ms/current-r273635-hang-2.png

Comment 4 Walter Hop 2014-12-10 21:21:32 UTC

Sorry, I lied! r273635 was not the latest snapshot available, my apologies. I retried with a snapshot from 7 Dec (r275582).

In r275582, the LOR when fiddling with /sbin/init is the same as in 10.1, but the behavior at reboot is different, although it is still pathological (“Giving up on 1 buffers”). Screenshot: http://lf.ms/current-r275582-hang-givingup.png

I have hit this behavior twice on r275582, but not on 10.1. So something has changed in CURRENT after r273635, but the basic problem still seems to be there.

Comment 5 quickfox 2014-12-13 10:34:29 UTC

I also experienced it. I freebsd-update upgraded two physical 10.0-RELEASE-p13 installs to 10.1-RELEASE-p1.

For the first, I didn't disable soft updates and it froze at "All buffers synced." during the last reboot (as per [1]).

For the second, I disabled soft updates and everything went fine.

[1] https://www.freebsd.org/releases/10.1R/installation.html

Comment 6 Walter Hop 2014-12-14 19:40:10 UTC

According to Jilles Tjoelker the LORs are false positives, so please disregard them.

Comment 7 stoa 2014-12-19 12:47:27 UTC

I have the same problem, however, I noticed the issue after installing patch-1 to a 10.1-RELEASE box that had been previously (weeks before) upgraded from 10.0-RELEASE.  I'm using UFS, not ZFS.  More confounding, it doesn't always happen, but I haven't been able to pin down the trigger.  As with others above, this issue only presents on one of two very similar installs (while the installs are very similar, the hardware is different, but both are amd64.)

Comment 8 Davide Davini 2014-12-29 15:16:22 UTC

Same issue here on three VMs running 10.1R-p1 during today's upgrade to p3. They were recently upgraded to 10.1R from 10.0R. All three VMs run on ESXi 5.5. All three VMs are clones of the first one I installed. We are talking about amd64 machines using UFS with soft-updates on.

Comment 9 Daniel Ylitalo 2015-01-05 09:55:50 UTC

I just went from 10.1-p0 to 10.1-p3 on a completely new server and got stuck after "All buffers synced", so I started looking around and found this PR :)

Last week I upgraded 16 Supermicro boxes from 10.0 to 10.1 and every single one of them got stuck at "All buffers synced" after freebsd-update.

Comment 10 Andrew Smith 2015-01-14 09:44:07 UTC

It seems that this at least affects all Kernels since 10.1-RELEASE through p3.

If you have one of these Kernels with Softupdates active on the root filesystem and you replace /sbin/init then you get this behaviour.

If you either disable file system softupdates on the filesystem or you disable the softupdates option in a new Kernel build then the issue does not exist.

I suspect people that haven't had this issue have some other environmental difference that nobody has highlighted yet.

The issue of course hits 10.0-RELEASE to 10.1-RELEASE upgrades since when freebsd-update install is run, the first thing done is to replace the Kernel and ask for a reboot then run freebsd-update again. The Kernel exhibiting the problem is then in place.

Could this be related to the changes made to ufs for the per FFS-Filesystem threading around August? (r269457, r269533, r269583).

Comment 11 Andrew Smith 2015-01-14 10:32:06 UTC

Additionally it appears that if you leave soft updates on but disable soft update journal (tunefs -j disable) the problem disappears so the problem may be more focussed than generally soft updates.
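
Since the reports so far distinguish between plain soft updates (-n) and soft updates journaling (-j), it can help to confirm which of the two is actually enabled on the root filesystem before and after trying a workaround. A minimal sketch, assuming the report format printed by `tunefs -p` (one "tunefs: <label>: (<flag>) enabled|disabled" line per setting, written to stderr). The function name tunefs_flag_state is hypothetical and not part of any FreeBSD tool:

```sh
#!/bin/sh
# Hypothetical helper (not part of FreeBSD): read a `tunefs -p` report on
# stdin and print the state of one setting: enabled, disabled or unknown.
# Assumes report lines look like:
#   tunefs: soft update journaling: (-j)        enabled
tunefs_flag_state() {
    label=$1
    # Keep only the line for the requested label; the state is the last field.
    state=$(grep -F "${label}: " | awk '{ print $NF }')
    # Fall back to "unknown" when the label is absent from the report.
    echo "${state:-unknown}"
}
```

On an affected machine this might be called as `tunefs -p /dev/da0p2 2>&1 | tunefs_flag_state "soft update journaling"` (device name as in comment 1; the `2>&1` is there on the assumption that tunefs writes its report to stderr). Passing "soft updates" as the label reports the -n setting instead.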

Comment 12 Walter Hop 2015-01-14 11:11:50 UTC

I have spent some hours bisecting -STABLE between 10.0 and 10.1, but I couldn't pinpoint the revision, as the bug seems to depend on the clang build! So, the bisecting procedure should probably involve building a clean toolchain at every revision.

Comment 13 Andrew Smith 2015-01-14 14:45:04 UTC

Have rebuilt the 10.1 Kernel using CLANG 3.3 on a FreeBSD 10.0 system and on a 10.1-RELEASE-p3 system from the current 10.1-RELEASE-p3 source tree delivered via freebsd-update.

The problem is apparent with both builds so it would suggest this isn't directly tied to the updated toolchain.

Comment 14 stoa 2015-01-14 23:56:44 UTC

I was (concurrently) having many problems with the transition to xorg-server-1.14.  Long story short, I disabled hald (8) and have not had any hangs since. I will note I have no real evidence of any connection here, only the end result on this particular install.

Comment 15 Andrew Smith 2015-01-15 10:41:42 UTC

To confirm, p4 seemed to appear as available today and the problem persists.

Comment 16 Matt Kollross 2015-01-16 15:06:55 UTC

I've also noticed this issue. I upgraded 4 machines last night, and all had the same problem.  

Specs:  10.0-RELEASE p4 -> 10.1-RELEASE

Two were VMs on VMware ESXi, and the other two were bare metal Dell R310.  I believe all are running UFS.

During the freebsd-update process, it installs the kernel with no issues, prompts for a reboot.  This appeared to work fine on all 4 machines.  However after running freebsd-update install an additional two times (one to install userland and again to remove older libraries) I reboot again and this is where it hangs.

Comment 17 Charley Sheets 2015-02-04 09:07:34 UTC

I've also just seen this issue. I installed 10.1-RELEASE on an HP ProLiant DL360 G5. After initial configuration, I did a freebsd-update fetch, freebsd-update install, and reboot.

Assuming I'm experiencing the same issue as others, it seems to be 10.1 itself and not anything particular to the 10.0 -> 10.1 upgrade, as I did not perform that upgrade.

Comment 18 Matt Simerson 2015-02-04 18:11:30 UTC

I just experienced this as well. It was on a system I just upgraded from 9.3 -> 10.1-RELEASE via the "make world" dance (because of bug #195484 in freebsd-update). Installing the 10.1 kernel and reboot worked fine. Then I installed world, mergemaster, deleted old libs, and finally ran freebsd-update to install the latest security patches. Then it hung on reboot. Ouch.

Comment 19 Matt Simerson 2015-02-04 18:21:32 UTC

This seems related. On a server I upgraded to 10.1-RELEASE (via buildworld), and then updated to the latest patch level, I get this in dmesg after boot:

Trying to mount root from ufs:/dev/mirror/root [rw]...
WARNING: / was not properly dismounted
WARNING: / was not properly dismounted
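
For anyone upgrading many machines unattended, the warning above is easy to check for after each reboot. A minimal sketch that extracts the affected mount points from kernel messages on stdin; the function name unclean_mounts is hypothetical, and the message format is assumed to match the lines quoted in this report:

```sh
#!/bin/sh
# Hypothetical helper: list mount points the kernel reported as
# "not properly dismounted". Reads log text (e.g. from dmesg) on stdin
# and prints each affected mount point once.
unclean_mounts() {
    sed -n 's/^.*WARNING: \(.*\) was not properly dismounted.*$/\1/p' | sort -u
}
```

Running `dmesg | unclean_mounts` after boot would print one line per filesystem that came back dirty, and nothing if every filesystem was unmounted cleanly.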

Comment 20 rmthomas1947 2015-02-05 22:51:56 UTC

I installed Release 10.1 on an ancient desktop computer from a DVD image:

10.1-RELEASE FreeBSD 10.1-RELEASE #0 r274401: Tue Nov 11 22:51:51 UTC 2014

I can't remember what happened with the first two or three boots, but the shutdown problem became apparent rather quickly.  I have many packages installed, including xdm and the xfce desktop.  I've set up the necessary .pkla file for polkit and put all users in the operator group, who consequently have a non-greyed Shutdown option in the GUI menu.  Clicking on this and answering "yes" to "do you want to shut down?" results in the computer becoming unresponsive, in the sense that it can no longer be reached by ssh from another machine, but it does not power down.

An alternative scenario is that the command "shutdown -p now" is issued by root in a ssh terminal window.  In this case the "System going down immediately" message is received and the connection is lost as expected, but again the computer does not power down.

On reboot into single-user mode, one always sees the message "/ was not properly dismounted".  Typing "shutdown -p now" at the console prompt does effect a true shutdown with power off.

This is a troublesome fault.  If one takes no remedial action, the filesystem corruption eventually reaches a stage where the computer reboots spontaneously and unpredictably while in normal use.  It's tiresome to have to run fsck at every boot.  A workaround would be useful if a cure is not immediately available.

--Mike

Comment 21 ncrogers 2015-02-08 23:04:51 UTC

FWIW I've been having this problem as well, so far under VMWare Fusion and ESXi. Happens when upgrading from 9.1-STABLE to 10.1-RELEASE during the last reboot of the upgrade process (i.e., after the third freebsd-update install). I am using UFS and have softupdates + journaling enabled.

Comment 22 ncrogers 2015-02-10 21:39:10 UTC

Disabling soft-updates journaling on just the root partition before the upgrade alleviates the hanging issue for me as well. However, the filesystem still ends up being dirty with an incorrect block count on the subsequent reboot during fsck.

Comment 23 Andrew Smith 2015-02-10 22:05:59 UTC

It's a bit of an issue for me since I need to plan a series of unattended upgrades. Thankfully those systems are on an unaffected 10.0 Kernel level but I don't have the option of disabling softupdates by booting from alternative media.

I'll try to prioritise some time to look at the problem but I'm midway through another large piece of work and can't stop :(

Comment 24 ncrogers 2015-02-10 22:14:18 UTC

I am in the same situation of having to do remote unattended upgrades of a few hundred boxes... For a while now I've been using a custom rc.d script to run tunefs before the filesystems are mounted. I am trying to use this to disable soft-updates journaling on the root partition before the upgrade with something like this...

cat /etc/rc.d/tunefs
#!/bin/sh

# PROVIDE: tunefs
# REQUIRE: root
# BEFORE: fsck FILESYSTEMS
# KEYWORD: nojail

. /etc/rc.subr

name="tunefs"
start_cmd="tunefs_start"
stop_cmd=":"

tunefs_start()
{
    echo -n "Tuning devices..."
    tunefs -j disable /
}

load_rc_config $name
run_rc_command "$1"


Perhaps this will help someone. The problem is that at least for me the root filesystem comes back dirty even though it does not hang. Maybe I am not disabling journaling correctly.

Comment 25 jamie.maher 2015-02-27 19:16:57 UTC

This just happened to a brand new install of FreeBSD 10.1 from disk1 AMD64 ISO inside ESXi 5.5. 

1. Installed under ESXi 5.5, ran through installer - chose the default UFS filesystem layout with default options (SSH, etc.).
2. Rebooted into new system, performed freebsd-update fetch, freebsd-update install.
3. ran "reboot"
4. console was stuck after "All buffers synced."

Had to reset the VM, and then upon bootup:

Feb 27 10:52:30 proxy-01 kernel: Trying to mount root from ufs:/dev/da0p2 [rw]...
Feb 27 10:52:30 proxy-01 kernel: WARNING: / was not properly dismounted

So it affects fresh installs of 10.1R when updating to 10.1-RELEASE-p6.

Comment 26 elofu17 2015-03-02 14:48:21 UTC

I have no idea if this is related or not, but...


When I install FreeBSD 10.0 via an ILO/IPMI/KVM-mounted .iso image, the machine usually hangs in very much the same manner as described in this thread, just when it is supposed to reboot.

The problem seems to be related to the USB system, which simply won't die, and the machine will therefore never reboot itself.
(I guess the USB system is busy with my virtual CD/keyboard/mouse)

Workarounds:
press the reset-button, and the machine will reboot just fine...
or first set the sysctl hw.usb.no_shutdown_wait=1 before the installer reboots the machine.



So if someone has a machine where the problem can *always* be reproduced, test the sysctl command above just to see if the USB system is the real reason for the hang and the subsequent dirty filesystem.

/Elof

Comment 27 Matt Simerson 2015-03-02 21:46:58 UTC

Re: elofu17@hotmail.com

I needed a KVM to be put on two servers exactly *because* of this issue. So maybe that's part of it, but it's certainly not all of it.

Comment 28 elofu17 2015-03-03 13:44:36 UTC

That's what I was thinking...

Perhaps the more connected USB-stuff (such as KVM-keyboard, etc), the more prone to freezing the machine gets?




Does anyone have an old snapshot of a non-upgraded VM that *always* freezes when upgraded?
Please run 'sysctl hw.usb.no_shutdown_wait=1' and 'echo "hw.usb.no_shutdown_wait=1" >> /etc/sysctl.conf' and then upgrade the vm as usual.
Could you reproduce the problem or did it pass?

Comment 29 jamie.maher 2015-03-03 17:02:37 UTC

I just tried the following suggestion in a new stock 10.1R VM with UFS fs - same hang on "All buffers synced." when trying to reboot after update:

1. 'sysctl hw.usb.no_shutdown_wait=1' 
   and 'echo "hw.usb.no_shutdown_wait=1" >> /etc/sysctl.conf'
2.  freebsd-update fetch
3.  freebsd-update install
4.  reboot -> hang after "All buffers synced."

The CPU usage jumps up from being fairly low to a steady 65 - 75% on a dual CPU VM after it hits the All buffers synced.

Comment 30 elofu17 2015-03-03 22:40:04 UTC

Thanks Jamie.
It was a long shot, but worth testing.

Comment 31 ncrogers 2015-03-09 16:28:35 UTC

+1 for this breaking upgrading to 10.1-RELEASE-p6 from p5. Not being able to update to the latest patch level without having to physically reboot a production server when it hangs during reboot really sucks...

In my opinion based on testing, the problem seems more to do with having UFS soft updates journaling enabled, possibly coupled with replacing /sbin/init. Temporarily disabling soft updates journaling somewhat alleviates the problem for me when going from 9.x to 10.1 (the hang stops, but the file system comes back dirty). However it is not a tractable solution when going from one 10.1 patch level to the next (e.g., 10.1-RELEASE-p5 to p6). It would be really nice to get some sort of resolution to this bug.

Comment 32 stoa 2015-03-09 16:58:53 UTC

I can refine my comments above.  Apparently, while I can shutdown cleanly at all times, I cannot reboot at anytime without hanging.  I am presently on 10.1-RELEASE-p6 on both a "generally Intel" desktop and a "generally AMD/NVIDIA" desktop; both act identically, i.e., can shutdown cleanly but not reboot.  (I know those general descriptions are mostly useless, so, if anyone needs dmesgs, etc, I'll gladly provide.)

In response to some of the above comments, I do not have soft updates enabled and none of the kernel tunables had any positive effect.

Comment 33 Glen Barber (FreeBSD committer) 2015-03-09 17:02:39 UTC

For clarification, when you say "reboot", do you mean reboot(8), or shutdown(8) with the '-r' flag?

Comment 34 ncrogers 2015-03-09 17:07:23 UTC

In my cases shutdown(8) ("shutdown -r now") and reboot(8) exhibit the same behavior.

Comment 35 Glen Barber (FreeBSD committer) 2015-03-09 17:09:11 UTC

(In reply to ncrogers from comment #34)
> In my cases shutdown(8) ("shutdown -r now") and reboot(8) exhibit the same
> behavior.

Ok, thank you.

Comment 36 stoa 2015-03-09 17:18:31 UTC

I'm using "shutdown -r now" to reboot and "shutdown -p now" to shutdown.

Comment 37 Glen Barber (FreeBSD committer) 2015-03-09 17:49:47 UTC

(In reply to stoa from comment #36)
> I'm using "shutdown -r now" to reboot and "shutdown -p now" to shutdown.

Thank you.

Comment 38 Walter Hop 2015-03-09 19:07:06 UTC

Please note that in my tests it's not the actual reboot that's the problem, rather unmounting the root FS. Replacing /sbin/init binary on UFS+S and doing a "mount -o ro /" afterwards also hangs the system at 100%CPU.

Comment 39 ncrogers 2015-03-09 19:55:05 UTC

(In reply to Walter Hop from comment #38)
I concur. The problem is with unmounting the root filesystem, and not so much the reboot itself. In my tests simply executing a "mount -r /" (which should fail) after the last freebsd-update stage results in a hang. This happens before mount(8) is able to return a "Device busy" error.

Comment 40 Glen Barber (FreeBSD committer) 2015-03-09 20:35:17 UTC

Can a few people please paste output from 'service -e | grep ^/etc' and 'sysctl kern.shutdown' ?

Comment 41 ncrogers 2015-03-09 20:38:24 UTC

(In reply to Glen Barber from comment #40)
# uname -v
FreeBSD 10.1-RELEASE-p6 #8 r279296M: Wed Feb 25 16:15:37 EST 2015     root@fbsd_101_amd64_builder.rgnets.com:/usr/obj/usr/src/sys/RGNETS
# service -e | grep ^/etc
/etc/rc.d/hostid
/etc/rc.d/hostid_save
/etc/rc.d/cleanvar
/etc/rc.d/ip6addrctl
/etc/rc.d/devd
/etc/rc.d/newsyslog
/etc/rc.d/syslogd
/etc/rc.d/dmesg
/etc/rc.d/virecover
/etc/rc.d/motd
/etc/rc.d/sshd
/etc/rc.d/sendmail
/etc/rc.d/cron
/etc/rc.d/mixer
/etc/rc.d/gptboot
# sysctl kern.shutdown
kern.shutdown.show_busybufs: 0
kern.shutdown.poweroff_delay: 5000
kern.shutdown.kproc_shutdown_wait: 60
kern.shutdown.dumpdevname: 
rxg#

Comment 42 stoa 2015-03-09 20:42:51 UTC

/usr/home/dutch $ service -e | grep ^/etc
/etc/rc.d/hostid
/etc/rc.d/hostid_save
/etc/rc.d/cleanvar
/etc/rc.d/ip6addrctl
/etc/rc.d/devd
/etc/rc.d/pflog
/etc/rc.d/pf
/etc/rc.d/newsyslog
/etc/rc.d/syslogd
/etc/rc.d/dmesg
/etc/rc.d/virecover
/etc/rc.d/lpd
/etc/rc.d/motd
/etc/rc.d/ntpd
/etc/rc.d/moused
/etc/rc.d/sendmail
/etc/rc.d/cron
/etc/rc.d/mixer
/etc/rc.d/gptboot
/etc/rc.d/bgfsck

/usr/home/dutch $ sysctl kern.shutdown
kern.shutdown.show_busybufs: 0
kern.shutdown.poweroff_delay: 5000
kern.shutdown.kproc_shutdown_wait: 60
kern.shutdown.dumpdevname: ada0s2b

/usr/home/dutch $

Comment 43 Daniel Ylitalo 2015-03-09 20:45:53 UTC

[root@www-04-portlane ~]# uname -a
FreeBSD www-04-portlane.p203.se 10.1-RELEASE FreeBSD 10.1-RELEASE #0 r274401: Tue Nov 11 21:02:49 UTC 2014     root@releng1.nyi.freebsd.org:/usr/obj/usr/src/sys/GENERIC  amd64
[root@www-04-portlane ~]# service -e | grep ^/etc
/etc/rc.d/hostid
/etc/rc.d/hostid_save
/etc/rc.d/cleanvar
/etc/rc.d/ip6addrctl
/etc/rc.d/devd
/etc/rc.d/newsyslog
/etc/rc.d/syslogd
/etc/rc.d/dmesg
/etc/rc.d/virecover
/etc/rc.d/motd
/etc/rc.d/ntpd
/etc/rc.d/sshd
/etc/rc.d/cron
/etc/rc.d/mixer
/etc/rc.d/gptboot
/etc/rc.d/bgfsck
[root@www-04-portlane ~]# sysctl kern.shutdown
kern.shutdown.show_busybufs: 0
kern.shutdown.poweroff_delay: 5000
kern.shutdown.kproc_shutdown_wait: 60
kern.shutdown.dumpdevname: da0p3
[root@www-04-portlane ~]#

Comment 44 Glen Barber (FreeBSD committer) 2015-03-09 20:51:44 UTC

Thanks.

Comment 45 rkoberman 2015-03-09 23:05:09 UTC

I have succeeded in working around the issue by:
# shutdown now
# sync
wait a minute
# reboot

Since I started doing this I have not had my system hang. Note that "shutdown now" followed by a reboot is functionally identical to "shutdown -r now" except for the very long delay between termination of multi-user services and the actual reboot. I am guessing that something is flushing. I really doubt the "sync" is required, but I do it just to be sure.

I have no explanation as to why this seems to work, but it has for the three updates I have used it on. I can only hope that this gives someone a clue as to what the actual issue is.

Comment 46 Glen Barber (FreeBSD committer) 2015-03-09 23:15:56 UTC

(In reply to rkoberman from comment #45)
> I have succeeded in working around the issue by:
> # shutdown now
> # sync
> wait a minute
> # reboot
> 
> Since I started doing this I have not had my system hang. Note that
> "shutdown now" followed by a reboot is functionally identical to "shutdown
> -r now" except for the very long delay between termination of multi-user
> services and the actual reboot. I am guessing that something is flushing. I
> really doubt the "sync" is required, but I do it just to be sure.
> 
> I have no explanation as to why this seems to work, but it has for the three
> updates I have used it on. I can only hope that this gives someone a clue as
> to what the actual issue is.

This is somewhat the direction I was going when asking for the service(8) and sysctl(8) output.  I've finally reproduced the issue in VirtualBox, so now that I can reproduce it reliably, and with your findings, hope to be able to identify the underlying cause soon.

Thank you for providing this information.

Comment 47 Guy Helmer (FreeBSD committer) 2015-03-10 17:15:30 UTC

(In reply to rkoberman from comment #45)

It seems the complaints here all involve FreeBSD 10, but I am seeing similar issues on FreeBSD 9.3 on VMware ESXi servers.
The systems in question have / filesystem without soft-updates, and /usr filesystem with soft-updates enabled. File contents have been disappearing from the /usr partition after reboot.
A "sync" before "shutdown -r now" seems to have significantly reduced loss of file contents.

Comment 48 Glen Barber (FreeBSD committer) 2015-03-10 17:41:33 UTC

While continuing to look into this, I think I may have found a workaround.

Can someone test running 'freebsd-update install' twice *without* the intermediate reboot between the kernel and userland updates?

The specific command sequence I'm interested in is:

 # freebsd-update -r 10.1-RELEASE upgrade
 # freebsd-update install
 # freebsd-update install
 # shutdown -r now

Comment 49 rkoberman 2015-03-10 19:34:32 UTC

(In reply to Guy Helmer from comment #47)
I suspect that this is a different, though possibly related issue. In all cases reported, while all filesystems were unclean on reboot, fsck never found any errors on my system and ended up simply marking the volume "clean" (after a long time had passed).

OTOH, the problem started on a system that had installed 10.0-R right after I received it and figured out how to turn off boot signature checking so I could boot the memstick install media. (That was hidden several menus deep in a menu that only could be brought up when another, seemingly unrelated BIOS option was modified.)

I can say that before I retired, when I still had many FreeBSD systems to maintain, I never saw this with freebsd-update. Those were all version 9 systems, and all were physical systems, no virtualization involved.

Comment 50 Glen Barber (FreeBSD committer) 2015-03-10 20:07:32 UTC

(In reply to Andrew Smith from comment #10)
> It seems that this at least affects all Kernels since 10.1-RELEASE through p3.
> 
> If you have one of these Kernels with Softupdates active on the root
> filesystem and you replace /sbin/init then you get this behaviour.
> 
> If you either disable file system softupdates on the filesystem or you
> disable the softupdates option in a new Kernel build then the issue does not
> exist.
> 
> I suspect people that haven't had this issue have some other environmental
> difference that nobody has highlighted yet.
> 
> The issue of course hits 10.0-RELEASE to 10.1-RELEASE upgrades since when
> freebsd-update install is run, the first thing done is to replace the Kernel
> and ask for a reboot then run freebsd-update again. The Kernel exhibiting
> the problem is then in place.
> 
> Could this be related to the changes made to ufs for the per FFS-Filesystem
> threading around August? (r269457, r269533, r269583).

I built and installed GENERIC with these commits reverted (releng/10.1@r270157), and still see this behavior after freebsd-update(8) installs the userland updates, so I'm inclined to think these are not the cause.

Comment 51 Glen Barber (FreeBSD committer) 2015-03-10 21:12:01 UTC

After editing sys/kern/kern_shutdown.c to be a bit more verbose, it appears kern_reboot() is getting stuck on line 429:

421                 if (nbusy) {
422                         /*
423                          * Failed to sync all blocks. Indicate this and don't
424                          * unmount filesystems (thus forcing an fsck on reboot).
425                          */
426                         printf("Giving up on %d buffers\n", nbusy);
427                         DELAY(5000000); /* 5 seconds */
428                 } else {
429                         if (!first_buf_printf)
430                                 printf("Final sync complete\n");
431                         /*
432                          * Unmount filesystems
433                          */
434                         if (panicstr == 0)
435                                 vfs_unmountall();
436                 }
437                 swapoff_all();

Comment 52 Glen Barber (FreeBSD committer) 2015-03-11 03:15:01 UTC

(In reply to Glen Barber from comment #51)
> After editing sys/kern/kern_shutdown.c to be a bit more verbose, it appears
> kern_reboot() is getting stuck on line 429:
> 
> 421                 if (nbusy) {
> 422                         /*
> 423                          * Failed to sync all blocks. Indicate this and don't
> 424                          * unmount filesystems (thus forcing an fsck on reboot).
> 425                          */
> 426                         printf("Giving up on %d buffers\n", nbusy);
> 427                         DELAY(5000000); /* 5 seconds */
> 428                 } else {
> 429                         if (!first_buf_printf)
> 430                                 printf("Final sync complete\n");
> 431                         /*
> 432                          * Unmount filesystems
> 433                          */
> 434                         if (panicstr == 0)
> 435                                 vfs_unmountall();
> 436                 }
> 437                 swapoff_all();

After looking further, it appears to make it through the if/else to at least line 436, and swapoff_all() is triggered.  So, still looking...

Comment 53 Konstantin Belousov freebsd_committer

2015-03-11 09:28:03 UTC

(In reply to Glen Barber from comment #52)
It is physically impossible to hang on a line which is not a loop.

I suspect that either the buffer flush code or the softdep worker thread is looping and causing the shutdown thread to wait.  It is good that you have a reproducible case and are willing to move it further; previous reporters only bothered to whine.

See https://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html for the instructions on how to configure your kernel and what information to get.

Comment 54 elofu17 2015-03-11 16:12:23 UTC

Install a VirtualBox VM via 10.1-RELEASE CD.
Reboot after installer: OK

Boot the installed system.
Reboot after doing nothing at all but logging in: OK

Boot again.
Run 'freebsd-update fetch install' (to 10.1-RELEASE-p6, 705 patches, 462 files)
Now I take a snapshot in VirtualBox - "Updates installed, not rebooted yet".
I now run 'reboot': Fail
Syncing disks, vnodes remaining...3 1 0 done
All buffers synced.

Poweroff in VirtualBox.
Boot the machine again.
Reboot: OK.

I now Restore Snapshot "Updates installed, not rebooted yet" and start the VM again.
'reboot' again fails:
Syncing disks, vnodes remaining...3 1 0 0 done
All buffers synced.

Restore again and run 'sync && reboot': Fail
Syncing disks, vnodes remaining...3 1 0 0 done
All buffers synced.

Restore again.
sync
sync
sleep 30
reboot: Fail
Syncing disks, vnodes remaining...3 1 0 0 done
All buffers synced.

Restore again.
sync ; sleep 5 ; sync ; sleep 5 ; shutdown -r now
Syncing disks, vnodes remaining...2 0 0 done
All buffers synced.

So... It fails with both 'reboot' and 'shutdown -r now'.

Restore again.
shutdown now
stopping cron
stopping sshd
stopping devd
Writing entropy file:.
.
syslogd exiting on signal 15
Enter full pathname of shell or RETURN for /bin/sh:
<return>
sync; sleep 5; reboot
Syncing disks, vnodes remaining...0 done
All buffers synced.

Now it hangs for 20 seconds, so it looks like it once again failed, BUT...
Suddenly the machine reboots!!!

(Normally the machine waits less than 1 second after the "All buffers synced" message when I've run a 'reboot' command, so this must be a 20 second timeout somewhere)

Also, I see no root (/) fs warnings upon booting. Yay!

I went back and re-ran the 'sync ; sleep 5 ; sync ; sleep 5 ; shutdown -r now' command and waited several minutes. No reboot. Fail.

Restore again, ran 'shutdown now', enter single-user-shell and 'reboot'
Syncing disks, vnodes remaining...1 0 done
All buffers synced.
After 20 seconds, the machine reboots.

Restore again, stopped devd and killed cron, syslogd, adjkerntz, dhclient, sendmail and ran 'reboot'
Syncing disks, vnodes remaining...1 0 done
All buffers synced.
Nope it fails. Waited several minutes.

Restore again, ran 'shutdown -ro now' (execute 'reboot' instead of signalling init(8)).
Syncing disks, vnodes remaining...2 2 0 0 done
All buffers synced.
Fail.

Restore again, ran 'shutdown -ron now' (prevent filesystem cache from being flushed)
Syncing disks, vnodes remaining...2 2 0 0 done
All buffers synced.
Now the machine instantly reboots! Yay!
/ was not properly dismounted
/: mount pending error: blocks 8512 files 5
...Rebuilding fs from journal...

Findings:
'reboot' or 'shutdown -r' get the same results.
Manual pre 'sync' does nothing.
Running 'shutdown now' and hence entering single-user mode apparently does something good.
Some buffers seem to be connected to a 20 second timeout.
Not flushing the buffers at all on shutdown removes the 20 second timeout (but generates a corrupt fs).

You can easily reproduce this yourselves in VirtualBox to debug further. See above.

The main problem is still a CRITICAL one, since even if you use the 'shutdown now+single-user+20sec timeout'-approach to get the machine to finally _reboot_ OK, you still need KVM-access for the single-user-mode.
And if you use the 'shutdown -ron now'-approach, you do get the much needed reboot, but you also get a corrupt fs... :-(
So remote FreeBSD machines without any iLO/IPMI still suffer badly from this. I hope someone will find a fix soon.

/Elof

Comment 55 Glen Barber freebsd_committer

2015-03-11 16:21:10 UTC

(In reply to Konstantin Belousov from comment #53)
> (In reply to Glen Barber from comment #52)
> It is physically impossible to hang on a line which is not loop.
> 

Yes, understood.

> I suspect that it is either the buffer flush code, or softdep worker thread
> which loop and cause shutdown thread to wait.  It is good that you have
> reproducable case and willing to move it further, previous reporters only
> bother to whine.
> 
> See
> https://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/
> kerneldebug-deadlocks.html for the instructions on how to configure your
> kernel and what information to get.

The test machine now has INVARIANTS, INVARIANT_SUPPORT, WITNESS, DEBUG_LOCKS, DEBUG_VFS_LOCKS, DIAGNOSTIC, and ALT_BREAK_TO_DEBUGGER.  Unfortunately, it panics on boot now, so I cannot proceed to the 'freebsd-update install; reboot' phase.

Just prior to this, I left out DIAGNOSTIC and saw a lock order reversal after the "All buffers synced." message.  (I will provide screenshots in a separate update.)

It looks like I will need to remove DIAGNOSTIC to get the system to boot.

Comment 56 Glen Barber freebsd_committer

2015-03-11 16:22:26 UTC

Created attachment 154206 [details]
DIAGNOSTIC panic (1/2)

Comment 57 Glen Barber freebsd_committer

2015-03-11 16:22:53 UTC

Created attachment 154207 [details]
DIAGNOSTIC panic (2/2)

Comment 58 Glen Barber freebsd_committer

2015-03-11 16:24:23 UTC

Created attachment 154208 [details]
lock order reversal after kern_reboot()

Comment 59 Glen Barber freebsd_committer

2015-03-11 17:53:35 UTC

(In reply to Glen Barber from comment #55)
> The test machine now has INVARIANTS, INVARIANT_SUPPORT, WITNESS,
> DEBUG_LOCKS, DEBUG_VFS_LOCKS, DIAGNOSTIC, and ALT_BREAK_TO_DEBUGGER. 
> Unfortunately, it panics on boot now, so I cannot proceed to the
> 'freebsd-update install; reboot' phase.
> 
> Just prior to this, I left out DIAGNOSTIC and saw a lock order reversal
> after the "All buffers synced." message.  (I will provide screenshots in a
> separate update.)
> 
> It looks like I will need to remove DIAGNOSTIC to get the system to boot.

It seems removing DIAGNOSTIC alone was not enough, since now the test machine panics on boot.  Although unrelated to the original problem in this PR, the ddb session will be included in a followup.

Comment 60 Glen Barber freebsd_committer

2015-03-11 17:54:53 UTC

Created attachment 154211 [details]
ddb transcript of panic-on-boot with debugging options enabled

Comment 61 Glen Barber freebsd_committer

2015-03-11 20:26:28 UTC

I've finally gotten the machine into a state where I can access the debugger after it hangs.  script(1) output of the debugging session will be attached.

Comment 62 Glen Barber freebsd_committer

2015-03-11 20:27:30 UTC

Created attachment 154224 [details]
debugging session after freebsd-update and reboot

Comment 63 Konstantin Belousov freebsd_committer

2015-03-11 21:35:34 UTC

(In reply to Glen Barber from comment #62)
You should add the WITNESS_SKIPSPIN kernel option; it is known that console spinlocks are not in order.

So for the attachment id=154224, is it possible to do 'show mount' and 'show mount <addr>' for the root mp?

You can 'set $lines 0' to disable the pager in ddb.
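
For reference, the debugging options named in comments 55 and 63 can be collected into one custom kernel configuration. The following is only a sketch: the config name DEBUG-PR195458 and the scratch output directory are hypothetical, and it assumes that the GENERIC config being included already provides the kernel debugger (KDB/DDB) needed to reach a ddb prompt. DIAGNOSTIC is deliberately left out because comment 55 reports a panic on boot with it enabled.

```sh
#!/bin/sh
# Write a debugging kernel configuration layered on top of GENERIC.
# The file name DEBUG-PR195458 is hypothetical; the option names are the
# ones listed in comments 55 and 63 of this PR.

write_debug_kernconf() {
    conf_dir=$1
    cat > "$conf_dir/DEBUG-PR195458" <<'EOF'
include GENERIC
ident   DEBUG-PR195458

options INVARIANTS
options INVARIANT_SUPPORT
options WITNESS
options WITNESS_SKIPSPIN
options DEBUG_LOCKS
options DEBUG_VFS_LOCKS
options ALT_BREAK_TO_DEBUGGER
EOF
}

# Example run: write into a scratch directory rather than the source tree.
scratch=$(mktemp -d)
write_debug_kernconf "$scratch"
cat "$scratch/DEBUG-PR195458"
rm -rf "$scratch"
```

On a real system the file would go into the kernel config directory of the source tree (for example sys/amd64/conf under /usr/src, assuming an amd64 machine) and be built with 'make buildkernel KERNCONF=DEBUG-PR195458' followed by 'make installkernel KERNCONF=DEBUG-PR195458'.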

Comment 64 Glen Barber freebsd_committer

2015-03-11 21:45:28 UTC

(In reply to Konstantin Belousov from comment #63)
> (In reply to Glen Barber from comment #62)
> You should add WITNESS_SKIPSPIN kernel option, it is known that console
> spinlocks are not in order.
> 

Okay, I wasn't sure if we wanted to see spinlocks.

> So for the attachment id=154224, is it possible to do show mount and show
> mount <addr> for the root mp ?
> 

Sure.  One thing to note (though it shouldn't matter) is that each iteration requires a rollback of the VM.  I only mention this in case there are inconsistencies between ddb sessions.

> You can 'set $lines 0' to disable pager in ddb.

Thank you, I wasn't aware of this.

Comment 65 Glen Barber freebsd_committer

2015-03-12 01:52:38 UTC

(In reply to Glen Barber from comment #59)
> (In reply to Glen Barber from comment #55)
> > The test machine now has INVARIANTS, INVARIANT_SUPPORT, WITNESS,
> > DEBUG_LOCKS, DEBUG_VFS_LOCKS, DIAGNOSTIC, and ALT_BREAK_TO_DEBUGGER. 
> > Unfortunately, it panics on boot now, so I cannot proceed to the
> > 'freebsd-update install; reboot' phase.
> > [...]
> > It looks like I will need to remove DIAGNOSTIC to get the system to boot.
> 
> It seems removing DIAGNOSTIC alone was not enough, since now the test
> machine panics on boot.

Just a note:

This particular issue (panic-on-boot with DIAGNOSTIC) will need to be reinvestigated after the original issue discussed in this PR is identified and resolved, as right now it is difficult to tell if this is an effect of a larger issue.

Comment 66 Julien Cigar 2015-03-13 10:20:26 UTC

FYI I had an issue on an HP ProLiant box running 10.0-RELEASE-p7 which may be related: the machine had been installed and had worked perfectly for ~30 days, and one day it suddenly froze. I had to physically power-cycle the box, and the FS was corrupted afterwards. SU+J was unable to recover it (it segfaulted everywhere), but fortunately a manual fsck was able to repair it. As I had had a *lot* of issues with SU+J in the past, I turned it off (the +J part) on all my FS, and since then the machine has been rock solid (~200 days of uptime). I was told that SU+J had been fixed in 10, but apparently there are still problems.
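
For readers who want to try the same thing, turning off only the journal (the +J part) while keeping soft updates is done with tunefs(8). The sketch below only prints the commands instead of running them; the device name /dev/ada0p2 is a placeholder, and tunefs can only change a file system that is unmounted or mounted read-only, so for the root file system this normally means single-user mode.

```sh
#!/bin/sh
# Print (do not execute) the tunefs(8) commands for turning off soft
# updates journaling on a UFS file system.  The device is a placeholder.

print_disable_suj() {
    dev=$1
    # -p prints the current tunable settings, including the journal state.
    echo "tunefs -p $dev"
    # -j disable turns off the journal only; soft updates (-n) stay as is.
    echo "tunefs -j disable $dev"
}

print_disable_suj /dev/ada0p2
```

This is a dry run by design, since running tunefs against the wrong device or a mounted file system is easy to get wrong.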

Comment 67 elofu17 2015-03-20 15:09:42 UTC

Has anyone had any luck debugging this critical issue?

1)
A normal upgrade+reboot always freezes the machine permanently. :-(

2)
A normal upgrade+reboot, followed by 'shutdown now', entering single user mode and finally rebooting from there always reboots my machine after a 20 second timeout.
Better, but still not a solution since I need ssh-access to do this remotely (no iLO/IPMI/KVM exists).

3)
A normal upgrade+reboot, followed by 'shutdown -ron now' (to prevent filesystem cache from being flushed) always makes my machine reboot immediately as it should.
However, this too is not a perfect workaround since the filesystem gets corrupted.



Given these three scenarios, and because they are reproducible every time, I hope a solution will be found soon.
FreeBSD 10.0 is now unsupported, so we FreeBSD users need to upgrade all our machines. In my case this is >100 machines located all over the world.

Comment 68 elofu17 2015-03-21 02:56:04 UTC

Sorry. I was a bit fast with the copy-n-paste there.

Sections 2 and 3 should read:
"A normal upgrade, followed by"...

/Elof

Comment 69 Glen Barber freebsd_committer

2015-03-21 03:03:25 UTC

Just an update to note that this issue is not forgotten, and is being actively (and heavily) investigated.

The underlying causes are not yet fully understood, and are quite complex by nature.

Comment 70 elofu17 2015-03-21 11:18:36 UTC

Glen, that sounds good.

Just an update from me too:
I counted the seconds once more today when I installed and upgraded a 10.1-machine, and I think that all my "20 seconds" above should actually read "30 seconds".

Comment 71 commit-hook freebsd_committer

2015-03-27 13:56:45 UTC

A commit references this bug:

Author: kib
Date: Fri Mar 27 13:55:57 UTC 2015
New revision: 280760
URL: https://svnweb.freebsd.org/changeset/base/280760

Log:
  Fix the hang after the immediate reboot when the following command
  sequence is performed on UFS SU+J rootfs:
  cp -Rp /sbin/init /sbin/init.old
  mv -f /sbin/init.old /sbin/init

  Hang occurs on the rootfs unmount.  There are two issues:

  1. Removed init binary, which is still mapped, creates a reference to
  the removed vnode. The inodeblock for such vnode must have active
  inodedep, which is (eventually) linked through the unlinked list. This
  means that ffs_sync(MNT_SUSPEND) cannot succeed, because number of
  softdep workitems for the mp is always > 0.  FFS is suspended during
  unmount, so unmount just hangs.

  2. As noted above, the inodedep is linked eventually.  It is not
  linked until the superblock is written.  But at the vfs_unmountall()
  time, when the rootfs is unmounted, the call is made to
  ffs_unmount()->ffs_sync() before vflush(), and ffs_sync() only calls
  ffs_sbupdate() after all workitems are flushed.  It is masked for
  normal system operations, because syncer works in parallel and
  eventually flushes superblock.  Syncer is stopped when rootfs
  unmounted, so ffs_sync() must do sb update on its own.

  Correct the issues listed above. For MNT_SUSPEND, count the number of
  linked unlinked inodedeps (this is not a typo) and subtract the count
  of such workitems from the total. For the second issue, the
  ffs_sbupdate() is called right after device sync in ffs_sync() loop.

  There is a third problem, occurring with both SU and SU+J. The
  softdep_waitidle() loop, which waits for softdep_flush() thread to
  clear the worklist, only waits 20ms max. It seems that the 1 tick,
  specified for msleep(9), was a typo.

  Add fsync(devvp, MNT_WAIT) call to softdep_waitidle(), which seems to
  significantly help the softdep thread, and change the MNT_LAZY update
  at the reboot time to MNT_WAIT for similar reasons.  Note that
  userspace cannot create more work while devvp is flushed, since the
  mount point is always suspended before the call to softdep_waitidle()
  in unmount or remount path.

  PR:	195458
  In collaboration with:	gjb, pho
  Reviewed by:	mckusick
  Sponsored by:	The FreeBSD Foundation
  MFC after:	2 weeks

Changes:
  head/sys/ufs/ffs/ffs_softdep.c
  head/sys/ufs/ffs/ffs_vfsops.c
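
The first issue in the log rests on a general property: a file that is replaced while something still holds it open (or mapped) stays alive as an unlinked inode until the last reference goes away. The sketch below shows that property with the same cp/mv sequence, run in a scratch directory so nothing under /sbin is touched. An open file descriptor stands in for the running init process, and rewriting the copy's contents is an addition made only so that the two inodes can be told apart; neither is part of the commit.

```sh
#!/bin/sh
# Replace a file while it is held open, then compare what the path and
# the still-open descriptor refer to.  Everything lives in a scratch dir.

replace_while_open() {
    workdir=$(mktemp -d) || return 1
    printf 'old contents\n' > "$workdir/init"

    # Hold the original open on descriptor 3 (stand-in for the running
    # init process that keeps its binary mapped).
    exec 3< "$workdir/init"

    # Same sequence as in the commit log, applied to the scratch file.
    cp -Rp "$workdir/init" "$workdir/init.old"
    printf 'new contents\n' > "$workdir/init.old"  # illustration only
    mv -f "$workdir/init.old" "$workdir/init"

    # The path now names the new inode; descriptor 3 still reaches the
    # old one, which has no name left in the directory.
    echo "via path: $(cat "$workdir/init")"
    echo "via open descriptor: $(cat <&3)"

    exec 3<&-
    rm -rf "$workdir"
}

replace_while_open
```

The path shows the new contents while the descriptor still shows the old ones. On the root file system of the affected releases, that lingering unlinked inode is what kept the softdep work item count above zero during unmount, as described in the log.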

Comment 72 ncrogers 2015-04-09 21:53:27 UTC

(In reply to commit-hook from comment #71)

Great work tracking this one down. I am guessing this is a no, but are there any plans for this to make it into the 10.1-RELEASE branch? Another one of my systems was hit by this today going from 10.1-p8 to 10.1-p9. Or is patching a custom kernel the recommended solution?

Comment 73 Xin LI freebsd_committer

2015-04-14 23:32:42 UTC

Take.

Comment 74 Bryan Drewery freebsd_committer

2015-04-24 20:41:30 UTC

(In reply to Xin LI from comment #73)
> Take.

Ping? Is this being scheduled for EN?

Comment 75 ncrogers 2015-04-27 15:06:40 UTC

(In reply to Bryan Drewery from comment #74)

Glen replied to me on the freebsd-stable list, and said that an EN was planned but there was no ETA.

https://www.mail-archive.com/freebsd-stable@freebsd.org/msg129134.html

Comment 76 Xin LI freebsd_committer

2015-05-13 23:30:09 UTC

This is fixed in FreeBSD-EN-15:05.ufs.
