Summary: | Hang on shutdown/root unmount after FreeBSD 10.1R upgrade | ||
---|---|---|---|
Product: | Base System | Reporter: | Walter Hop <walter> |
Component: | kern | Assignee: | FreeBSD Release Engineering <re> |
Status: | Closed FIXED | ||
Severity: | Affects Many People | CC: | ari, bc, bdrewery, ben, csheets, dan.offord, daniel, delphij, diotonante, elofu17, freebsd, freebsd, garrych, ghelmer, hal, iamasmith.home, jamie.maher, john, julien, kib, kollross, matt, ncrogers, nomoo, olof.hellqvist, payam_hekmat, petteri.valkonen, quickfox, re, rkoberman, rmthomas1947, stoa, vendion |
Priority: | --- | ||
Version: | 10.1-RELEASE | ||
Hardware: | Any | ||
OS: | Any | ||
Attachments: |
Description
Walter Hop
2014-11-27 22:44:46 UTC
I noticed today that the same problem happens when upgrading from 10.1-RC3 to 10.1-RELEASE, and also on a clean install of 10.1-RELEASE.

I did some more digging. As a refresher, the hang didn't occur just from booting the new kernel. It happened only after "freebsd-update install" was executed to replace userland. What is so special about "freebsd-update install" that would trigger the problem? I think the interesting bit might be that it replaced /sbin/init.

I can trigger a hang completely reliably on a default 10.1-RELEASE install on UFS2 in VMware Fusion with the following procedure:

# chflags noschg /sbin/init
# cp -Rp /sbin/init /sbin/init2
# rm -f /sbin/init
# mv /sbin/init2 /sbin/init
# chflags schg /sbin/init
# reboot
=> Hang after "All buffers synced."

I created two clean 10.1 UFS2 installs, and both exhibit the problem on every attempt.

I tried doing the same on two clean 10.1 ZFS (auto) installs. ZFS does NOT exhibit the problem so far.

I tried disabling softupdates (tunefs -n disable /dev/da0p2) before doing the procedure. In this case, on multiple machines, it also does NOT hang.

On 10.0 there's also no problem.

So: on FreeBSD 10.1, when using UFS2 with journaled softupdates, replacing init leads to a hang when rebooting/unmounting root afterwards. And a workaround might be to disable softupdates before upgrading to 10.1.

Confirmed again with today's updates: https://lists.freebsd.org/pipermail/freebsd-stable/2014-December/081246.html And sorry Walter, I just realized that I managed to delete your name from the mail quotes… :(

I checked whether the hang also occurs in an 11-CURRENT snapshot, and it does. (Tried the iso snapshot available today, which is r273635).
WITNESS indicates a lock order reversal, which looks related to me: root@current:~ # chflags noschg /sbin/init root@current:~ # cp -Rp /sbin/init /sbin/init2 lock order reversal: 1st 0xfffffe007b842fa0 bufwait (bufwait) @ /usr/src/sys/kern/vfs_bio.c:3093 2nd 0xfffff80002b9ea00 dirhash (dirhash) @ /usr/src/sys/ufs/ufs/ufs_dirhash.c:284 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe000025c270 kdb_backtrace() at kdb_backtrace+0x39/frame 0xfffffe000025c320 witness_checkorder() at witness_checkorder+0xdad/frame 0xfffffe000025c3b0 _sx_xlock() at _sx_xlock+0x75/frame 0xfffffe000025c3f0 ufsdirhash_add() at ufsdirhash_add+0x3a/frame 0xfffffe000025c430 ufs_direnter() at ufs_direnter+0x6a0/frame 0xfffffe000025c4f0 ufs_makeinode() at ufs_makeinode+0x560/frame 0xfffffe000025c6a0 VOP_CREATE_APV() at VOP_CREATE_APV+0xf1/frame 0xfffffe000025c6d0 vn_open_cred() at vn_open_cred+0x29d/frame 0xfffffe000025c820 kern_openat() at kern_openat+0x26f/frame 0xfffffe000025c9a0 amd64_syscall() at amd64_syscall+0x25a/frame 0xfffffe000025cab0 Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe000025cab0 --- syscall (5, FreeBSD ELF64, sys_open), rip = 0x80094f01a, rsp = 0x7fffffffe958, rbp = 0x7fffffffe9c0 --- Finally, when rebooting, another lock order reversal appears and the system hangs. I don’t have a text log of this, so I’ll copy the first few lines: Syncing disks, vnodes remaining...1 0 0 done All buffers synced. lock order reversal: 1st 0xfffff80002e65d50 ufs (ufs) @ /usr/src/sys/kern/vfs_mount.c:1223 2nd 0xfffff80002e665f0 devfs (devfs) @ /usr/src/sys/kern/vfs_subr.c:2144 Screenshot is here: http://lf.ms/current-r273635-hang-2.png Sorry, I lied! r273635 was not the latest snapshot available, my apologies. I retried with a snapshot from 7 Dec (r275582). In r275582, the LOR when fiddling with /sbin/init is the same as in 10.1, but the behavior at reboot is different, although it is still pathological (“Giving up on 1 buffers”) . 
Screenshot: http://lf.ms/current-r275582-hang-givingup.png

I have hit this behavior twice on r275582, but not on 10.1. So something has changed in CURRENT after r273635, but the basic problem still seems to be there.

I also experienced it. I used freebsd-update to upgrade two physical 10.0-RELEASE-p13 installs to 10.1-RELEASE-p1. For the first, I didn't disable soft updates and it froze at "All buffers synced." during the last reboot (as per [1]). For the second, I disabled soft updates and everything went fine.

[1] https://www.freebsd.org/releases/10.1R/installation.html

According to Jilles Tjoelker the LORs are false positives, so please disregard them.

I have the same problem; however, I noticed the issue after installing patch-1 on a 10.1-RELEASE box that had previously (weeks before) been upgraded from 10.0-RELEASE. I'm using UFS, not ZFS. More confounding, it doesn't always happen, and I haven't been able to pin down the trigger. As with others above, this issue only presents on one of two very similar installs (while the installs are very similar, the hardware is different, but both are amd64).

Same issue here on three VMs running 10.1R-p1 during today's upgrade to p3. They were recently upgraded to 10.1R from 10.0R. All three VMs run on ESXi 5.5. All three VMs are clones of the first one I installed. We are talking about amd64 machines using UFS with soft-updates on.

I just went from 10.1-p0 to 10.1-p3 on a completely new server and got stuck after "All buffers synced", so I started looking around and found this PR :) Last week I upgraded 16 Supermicro boxes from 10.0 to 10.1 and every single one of them got stuck at "All buffers synced" after freebsd-update.

It seems that this at least affects all Kernels since 10.1-RELEASE through p3. If you have one of these Kernels with Softupdates active on the root filesystem and you replace /sbin/init then you get this behaviour.
If you either disable file system softupdates on the filesystem or you disable the softupdates option in a new Kernel build then the issue does not exist. I suspect people that haven't had this issue have some other environmental difference that nobody has highlighted yet. The issue of course hits 10.0-RELEASE to 10.1-RELEASE upgrades since when freebsd-update install is run, the first thing done is to replace the Kernel and ask for a reboot then run freebsd-update again. The Kernel exhibiting the problem is then in place. Could this be related to the changes made to ufs for the per FFS-Filesystem threading around August? (r269457, r269533, r269583). Additionally it appears that if you leave soft updates on but disable soft update journal (tunefs -j disable) the problem disappears so the problem may be more focussed than generally soft updates. I have spent some hours bisecting -STABLE between 10.0 and 10.1, but I couldn't pinpoint the revision, as the bug seems to depend on the clang build! So, the bisecting procedure should probably involve building a clean toolchain at every revision. Have rebuilt the 10.1 Kernel using CLANG 3.3 on a FreeBSD 10.0 system and on a 10.1-RELEASE-p3 system from the current 10.1-RELEASE-p3 source tree delivered via freebsd-update. The problem is apparent with both builds so it would suggest this isn't directly tied to the updated toolchain. I was (concurrently) having many problems with the transition to xorg-server-1.14. Long story short, I disabled hald (8) and have not had any hangs since. I will note I have no real evidence of any connection here, only the end result on this particular install. To confirm, p4 seemed to appear as available today and the problem persists. I've also noticed this issue. I upgraded 4 machines last night, and all had the same problem. Specs: 10.0-RELEASE p4 -> 10.1-RELEASE Two where VMs on VMWARE EXI, and the other two were bare metal Dell R310. I believe all are running UFS. 
During the freebsd-update process, it installs the kernel with no issues, prompts for a reboot. This appeared to work fine on all 4 machines. However after running freebsd-update install an additional two times (one to install userland and again to remove older libraries) I reboot again and this is where it hangs. I've also just seen this issue. I installed 10.1-RELEASE on an HP ProLiant DL360 G5. After initial configuration, I did a freebsd-update fetch, freebsd-update install, and reboot. Assuming I'm experiencing the same issue as others, it seems to be 10.1 itself and not anything particular to the 10.0 -> 10.1 upgrade, as I did not perform that upgrade. I just experienced this as well. It was on a system I just upgraded from 9.3 -> 10.1-RELEASE via the "make world" dance (because of bug #195484 in freebsd-update). Installing the 10.1 kernel and reboot worked fine. Then I installed world, mergemaster, deleted old libs, and finally ran freebsd-update to install the latest security patches. Then it hung on reboot. Ouch. This seems related. On a server I upgraded to 10.1-RELEASE (via buildworld), and then updated to the latest patch level, I get this in dmesg after boot: Trying to mount root from ufs:/dev/mirror/root [rw]... WARNING: / was not properly dismounted WARNING: / was not properly dismounted I installed Release 10.1 on an ancient desktop computer from a DVD image: 10.1-RELEASE FreeBSD 10.1-RELEASE #0 r274401: Tue Nov 11 22:51:51 UTC 2014 I can't remember what happened with the first two or three boots, but the shutdown problem became apparent rather quickly. I have many packages installed, including xdm and the xfce desktop. I've set up the necessary .pkla file for polkit and put all users in the operator group, who conseqently have a non-greyed Shutdown option in the GUI menu. Clicking on this and answering "yes" to "do you want to shut down?" 
results in the computer becoming unresponsive, in the sense that it can no longer be reached by ssh from another machine, but it does not power down. An alternative scenario is that the command "shutdown -p now" is issued by root in an ssh terminal window. In this case the "System going down immediately" message is received and the connection is lost as expected, but again the computer does not power down. On reboot into single-user mode, one always sees the message "/ was not properly dismounted". Typing "shutdown -p now" at the console prompt does effect a true shutdown with power off. This is a troublesome fault. If one takes no remedial action, the filesystem corruption eventually reaches a stage where the computer reboots spontaneously and unpredictably while in normal use. It's tiresome to have to run fsck at every boot. A workaround would be useful if a cure is not immediately available. --Mike

FWIW I've been having this problem as well, so far under VMware Fusion and ESXi. It happens when upgrading from 9.1-STABLE to 10.1-RELEASE during the last reboot of the upgrade process (i.e., after the third freebsd-update install). I am using UFS and have softupdates + journaling enabled.

Disabling soft-updates journaling on just the root partition before the upgrade alleviates the hanging issue for me as well. However, the filesystem still ends up being dirty with an incorrect block count on the subsequent reboot during fsck.

It's a bit of an issue for me since I need to plan a series of unattended upgrades. Thankfully those systems are on an unaffected 10.0 Kernel level, but I don't have the option of disabling softupdates by booting from alternative media. I've tried to prioritise some time to look at the problem but I'm midway through another large piece of work and can't stop :(

I am in the same situation of having to do remote unattended upgrades of a few hundred boxes...
For a while now I've been using a custom rc.d script to run tunefs before the filesystems are mounted. I am trying to use this to disable soft-updates journaling on the root partition before the upgrade with something like this...

cat /etc/rc.d/tunefs
#!/bin/sh

# PROVIDE: tunefs
# REQUIRE: root
# BEFORE: fsck FILESYSTEMS
# KEYWORD: nojail

. /etc/rc.subr

name="tunefs"
start_cmd="tunefs_start"
stop_cmd=":"

tunefs_start()
{
        echo -n "Tuning devices..."
        tunefs -j disable /
}

load_rc_config $name
run_rc_command "$1"

Perhaps this will help someone. The problem is that, at least for me, the root filesystem comes back dirty even though it does not hang. Maybe I am not disabling journaling correctly.

This just happened to a brand new install of FreeBSD 10.1 from the disk1 AMD64 ISO inside ESXi 5.5.

1. Installed under ESXi 5.5, ran through the installer, choosing the default UFS filesystem layout with the default options (SSH, etc.).
2. Rebooted into the new system, performed freebsd-update fetch, freebsd-update install.
3. Ran "reboot".
4. Console was stuck after "All buffers synced."

Had to reset the VM, and then upon bootup:

Feb 27 10:52:30 proxy-01 kernel: Trying to mount root from ufs:/dev/da0p2 [rw]...
Feb 27 10:52:30 proxy-01 kernel: WARNING: / was not properly dismounted

So it affects fresh installs of 10.1R when updating to 10.1-RELEASE-p6.

I have no idea if this is related or not, but... When I install FreeBSD 10.0 via an ILO/IPMI/KVM-mounted .iso image, the machine usually hangs in very much the same manner as described in this thread, just when it is supposed to reboot. The problem seems to be related to the USB system, which simply won't die, and the machine will therefore never reboot itself. (I guess the USB system is busy with my virtual CD/keyboard/mouse.) Workarounds: press the reset button, and the machine will reboot just fine... or first set the sysctl hw.usb.no_shutdown_wait=1 before the installer reboots the machine.
So if someone has a machine where the problem can *always* be reproduced, test the sysctl command above just to see if the USB system is the real reason for the hang and the subsequent dirty filesystem. /Elof Re: elofu17@hotmail.com I needed a KVM to be put on two servers exactly *because* of this issue. So maybe that's part of it, but it's certainly not all of it. That's what I was thinking... Perhaps the more connected USB-stuff (such as KVM-keyboard, etc), the more prone to freezing the machine gets? Anyone have an old snapshot of a non-upgraded vm that *always* freeze when upgraded? Please run 'sysctl hw.usb.no_shutdown_wait=1' and 'echo "hw.usb.no_shutdown_wait=1" >> /etc/sysctl.conf' and then upgrade the vm as usual. Could you reproduce the problem or did it pass? I just tried the following suggestion in a new stock 10.1R VM with UFS fs - same hang on "All buffers synced." when trying to reboot after update: 1. 'sysctl hw.usb.no_shutdown_wait=1' and 'echo "hw.usb.no_shutdown_wait=1" >> /etc/sysctl.conf' 2. freebsd-update fetch 3. freebsd-update install 4. reboot -> hang after "All buffers synced." The CPU usage jumps up from being fairly low to a steady 65 - 75% on a dual CPU VM after it hits the All buffers synced. Thanks Jamie. It was a long shot, but worth testing. +1 for this breaking upgrading to 10.1-RELEASE-p6 from p5. Not being able to update to the latest patch level without having to physically reboot a production server when it hangs during reboot really sucks... In my opinion based on testing, the problem seems more to do with having UFS soft updates journaling enabled, possibly coupled with replacing /sbin/init. Temporarily disabling soft updates journaling somewhat alleviates the problem for me when going from 9.x to 10.1 (the hang stops, but the file system comes back dirty). However it is not a tractable solution when going from one 10.1 patch level to the next (e.g., 10.1-RELEASE-p5 to p6). 
It would be really nice to get some sort of resolution to this bug.

I can refine my comments above. Apparently, while I can shut down cleanly at all times, I cannot reboot at any time without hanging. I am presently on 10.1-RELEASE-p6 on both a "generally Intel" desktop and a "generally AMD/NVIDIA" desktop; both act identically, i.e., can shut down cleanly but not reboot. (I know those general descriptions are mostly useless, so, if anyone needs dmesgs, etc, I'll gladly provide.) In response to some of the above comments, I do not have soft updates enabled and none of the kernel tunables had any positive effect.

For clarification, when you say "reboot", do you mean reboot(8), or shutdown(8) with the '-r' flag?

In my cases shutdown(8) ("shutdown -r now") and reboot(8) exhibit the same behavior.

(In reply to ncrogers from comment #34)
> In my cases shutdown(8) ("shutdown -r now") and reboot(8) exhibit the same
> behavior.

Ok, thank you.

I'm using "shutdown -r now" to reboot and "shutdown -p now" to shut down.

(In reply to stoa from comment #36)
> I'm using "shutdown -r now" to reboot and "shutdown -p now" to shutdown.

Thank you.

Please note that in my tests it's not the actual reboot that's the problem, but rather unmounting the root FS. Replacing the /sbin/init binary on UFS+S and doing a "mount -o ro /" afterwards also hangs the system at 100% CPU.

(In reply to Walter Hop from comment #38)
I concur. The problem is with unmounting the root filesystem, and not so much the reboot itself. In my tests simply executing a "mount -r /" (which should fail) after the last freebsd-update stage results in a hang. This happens before mount(8) is able to return a "Device busy" error.

Can a few people please paste output from 'service -e | grep ^/etc' and 'sysctl kern.shutdown'?
(In reply to Glen Barber from comment #40)

# uname -v
FreeBSD 10.1-RELEASE-p6 #8 r279296M: Wed Feb 25 16:15:37 EST 2015 root@fbsd_101_amd64_builder.rgnets.com:/usr/obj/usr/src/sys/RGNETS
# service -e | grep ^/etc
/etc/rc.d/hostid
/etc/rc.d/hostid_save
/etc/rc.d/cleanvar
/etc/rc.d/ip6addrctl
/etc/rc.d/devd
/etc/rc.d/newsyslog
/etc/rc.d/syslogd
/etc/rc.d/dmesg
/etc/rc.d/virecover
/etc/rc.d/motd
/etc/rc.d/sshd
/etc/rc.d/sendmail
/etc/rc.d/cron
/etc/rc.d/mixer
/etc/rc.d/gptboot
# sysctl kern.shutdown
kern.shutdown.show_busybufs: 0
kern.shutdown.poweroff_delay: 5000
kern.shutdown.kproc_shutdown_wait: 60
kern.shutdown.dumpdevname: rxg#

/usr/home/dutch $ service -e | grep ^/etc
/etc/rc.d/hostid
/etc/rc.d/hostid_save
/etc/rc.d/cleanvar
/etc/rc.d/ip6addrctl
/etc/rc.d/devd
/etc/rc.d/pflog
/etc/rc.d/pf
/etc/rc.d/newsyslog
/etc/rc.d/syslogd
/etc/rc.d/dmesg
/etc/rc.d/virecover
/etc/rc.d/lpd
/etc/rc.d/motd
/etc/rc.d/ntpd
/etc/rc.d/moused
/etc/rc.d/sendmail
/etc/rc.d/cron
/etc/rc.d/mixer
/etc/rc.d/gptboot
/etc/rc.d/bgfsck
/usr/home/dutch $ sysctl kern.shutdown
kern.shutdown.show_busybufs: 0
kern.shutdown.poweroff_delay: 5000
kern.shutdown.kproc_shutdown_wait: 60
kern.shutdown.dumpdevname: ada0s2b
/usr/home/dutch $

[root@www-04-portlane ~]# uname -a
FreeBSD www-04-portlane.p203.se 10.1-RELEASE FreeBSD 10.1-RELEASE #0 r274401: Tue Nov 11 21:02:49 UTC 2014 root@releng1.nyi.freebsd.org:/usr/obj/usr/src/sys/GENERIC amd64
[root@www-04-portlane ~]# service -e | grep ^/etc
/etc/rc.d/hostid
/etc/rc.d/hostid_save
/etc/rc.d/cleanvar
/etc/rc.d/ip6addrctl
/etc/rc.d/devd
/etc/rc.d/newsyslog
/etc/rc.d/syslogd
/etc/rc.d/dmesg
/etc/rc.d/virecover
/etc/rc.d/motd
/etc/rc.d/ntpd
/etc/rc.d/sshd
/etc/rc.d/cron
/etc/rc.d/mixer
/etc/rc.d/gptboot
/etc/rc.d/bgfsck
[root@www-04-portlane ~]# sysctl kern.shutdown
kern.shutdown.show_busybufs: 0
kern.shutdown.poweroff_delay: 5000
kern.shutdown.kproc_shutdown_wait: 60
kern.shutdown.dumpdevname: da0p3
[root@www-04-portlane ~]#

Thanks.
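The two commands requested above can be bundled so that every reporter pastes the same fields in the same order. The following is only a sketch: the function names `base_rc_scripts` and `collect_shutdown_info` are made up for illustration and are not part of FreeBSD, and it assumes `service -e` prints one enabled rc script path per line, as in the pastes above.

```sh
#!/bin/sh
# Sketch only: bundles the diagnostics requested in this PR.
# Function names are hypothetical, not part of FreeBSD.

# Read rc script paths on stdin and keep only base-system ones
# (paths under /etc), dropping e.g. /usr/local/etc/rc.d entries.
# "|| true" keeps the exit status 0 when nothing matches.
base_rc_scripts() {
    grep '^/etc' || true
}

# Print kernel version, enabled base rc scripts and shutdown sysctls.
# Only meaningful on FreeBSD; nothing runs until this is called.
collect_shutdown_info() {
    echo "== uname -v =="
    uname -v
    echo "== enabled base rc scripts =="
    service -e | base_rc_scripts
    echo "== kern.shutdown =="
    sysctl kern.shutdown
}
```

On an affected machine, sourcing the file and running `collect_shutdown_info > report.txt` would produce one paste-ready file.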
I have succeeded in working around the issue by: # shutdown now # sync wait a minute # reboot Since I started doing this I have not had my system hang. Note that "shutdown now" followed by a reboot is functionally identical to "shutdown -r now" except for the very long delay between termination of multi-user services the actual reboot. I am guessing that something is flushing. I really doubt the "sync" is required, but I do it just to be sure. I have no explanation as to why this seems to work, but it has for the three updates I have used it on. I can only hope that this gives someone a clue as to what the actual issue is. (In reply to rkoberman from comment #45) > I have succeeded in working around the issue by: > # shutdown now > # sync > wait a minute > # reboot > > Since I started doing this I have not had my system hang. Note that > "shutdown now" followed by a reboot is functionally identical to "shutdown > -r now" except for the very long delay between termination of multi-user > services the actual reboot. I am guessing that something is flushing. I > really doubt the "sync" is required, but I do it just to be sure. > > I have no explanation as to why this seems to work, but it has for the three > updates I have used it on. I can only hope that this gives someone a clue as > to what the actual issue is. This is somewhat the direction I was going when asking for the service(8) and sysctl(8) output. I've finally reproduced the issue in VirtualBox, so now that I can reproduce it reliably, and with your findings, hope to be able to identify the underlying cause soon. Thank you for providing this information. (In reply to rkoberman from comment #45) It seems the complaints here all involve FreeBSD 10, but I am seeing similar issues on FreeBSD 9.3 on VMware ESXi servers. The systems in question have / filesystem without soft-updates, and /usr filesystem with soft-updates enabled. File contents have been disappearing from the /usr partition after reboot. 
A "sync" before "shutdown -r now" seems to have significantly reduced loss of file contents. While continuing to look into this, I think I may have found a workaround. Can someone test running 'freebsd-update install' twice *without* the intermediate reboot between the kernel and userland updates? The specific command sequence I'm interested in is: # freebsd-update -r 10.1-RELEASE upgrade # freebsd-update install # freebsd-update install # shutdown -r now (In reply to Guy Helmer from comment #47) I suspect that this is a different, though possibly related issue. In all cases reported, while all filesystems were unclean on reboot, fsck never found any errors on my system and ended up simply marking he volume "clean" (after a long time had passed). OTOH, the problem started on a system that had installed 10.0-R right after I received it and figured out how to turn off boot signature checking so I could boot the memstick install media. (That was hidden several menus deep in a menu that only could be brought up when another, seemingly unrelated BIOS option was modified.) I can say that before I retired and still had many FreeBSD systems to maintain that I never saw this with freebsd-update. Those were all version 9 systems, and all were physical system, no virtualization involved. (In reply to Andrew Smith from comment #10) > It seems that this at last effects all Kernels since 10.1-RELEASE through p3. > > If you have one of these Kernels with Softupdates active on the root > filesystem and you replace /sbin/init then you get this behaviour. > > If you either disable file system softupdates on the filesystem or you > disable the softupdates option in a new Kernel build then the issue does not > exist. > > I suspect people that haven't had this issue have some other environmental > difference that nobody has highlighted yet. 
>
> The issue of course hits 10.0-RELEASE to 10.1-RELEASE upgrades since when
> freebsd-update install is run, the first thing done is to replace the Kernel
> and ask for a reboot then run freebsd-update again. The Kernel exhibiting
> the problem is then in place.
>
> Could this be related to the changes made to ufs for the per FFS-Filesystem
> threading around August? (r269457, r269533, r269583).

I built and installed GENERIC with these commits reverted (releng/10.1@r270157), and still see this behavior after freebsd-update(8) installs the userland updates, so I'm inclined to think these are not the cause.

After editing sys/kern/kern_shutdown.c to be a bit more verbose, it appears kern_reboot() is getting stuck on line 429:

421         if (nbusy) {
422                 /*
423                  * Failed to sync all blocks. Indicate this and don't
424                  * unmount filesystems (thus forcing an fsck on reboot).
425                  */
426                 printf("Giving up on %d buffers\n", nbusy);
427                 DELAY(5000000); /* 5 seconds */
428         } else {
429                 if (!first_buf_printf)
430                         printf("Final sync complete\n");
431                 /*
432                  * Unmount filesystems
433                  */
434                 if (panicstr == 0)
435                         vfs_unmountall();
436         }
437         swapoff_all();

(In reply to Glen Barber from comment #51)
> After editing sys/kern/kern_shutdown.c to be a bit more verbose, it appears
> kern_reboot() is getting stuck on line 429:
>
> 421         if (nbusy) {
> 422                 /*
> 423                  * Failed to sync all blocks. Indicate this and don't
> 424                  * unmount filesystems (thus forcing an fsck on reboot).
> 425                  */
> 426                 printf("Giving up on %d buffers\n", nbusy);
> 427                 DELAY(5000000); /* 5 seconds */
> 428         } else {
> 429                 if (!first_buf_printf)
> 430                         printf("Final sync complete\n");
> 431                 /*
> 432                  * Unmount filesystems
> 433                  */
> 434                 if (panicstr == 0)
> 435                         vfs_unmountall();
> 436         }
> 437         swapoff_all();

After looking further, it appears to make it through the if/else to at least line 436, and swapoff_all() is triggered. So, still looking...
(In reply to Glen Barber from comment #52) It is physically impossible to hang on a line which is not loop. I suspect that it is either the buffer flush code, or softdep worker thread which loop and cause shutdown thread to wait. It is good that you have reproducable case and willing to move it further, previous reporters only bother to whine. See https://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html for the instructions on how to configure your kernel and what information to get. Install a VirtualBox VM via 10.1-RELEASE CD. Reboot after installer: OK Boot the installed system. Reboot after doing nothing at all but logging in: OK Boot again. Run 'freebsd-update fetch install' (to 10.1-RELEASE-p6, 705 patches, 462 files) Now I take a snapshot in VirtualBox - "Updates installed, not rebooted yet". I now run 'reboot': Fail Syncing disks, vnodes remaining...3 1 0 done All buffers synced. Poweroff in VirtualBox. Boot the machine again. Reboot: OK. I now Restore Snapshot "Updates installed, not rebooted yet" and start the VM again. 'reboot' again fails: Syncing disks, vnodes remaining...3 1 0 0 done All buffers synced. Restore again and run 'sync && reboot': Fail Syncing disks, vnodes remaining...3 1 0 0 done All buffers synced. Restore again. sync sync sleep 30 reboot Fail Syncing disks, vnodes remaining...3 1 0 0 done All buffers synced. Restore again. sync ; sleep 5 ; sync ; sleep 5 ; shutdown -r now Syncing disks, vnodes remaining...2 0 0 done All buffers synced. So... It fails with both 'reboot' and 'shutdown -r now'. Restore again. shutdown now stopping cron stopping sshd stopping devd Writing entropy file:. . syslogd exiting on signal 15 Enter full pathname of shell or RETURN for /bin/sh: <return> sync; sleep 5; reboot Syncing disks, vnodes remaining...0 done All buffers synced. Now it hangs for 20 seconds, so it looks like it once again failed, BUT... Suddenly the machine reboots!!! 
(Normally the machine waits less than 1 second after the "All buffers synced" message when I've run a 'reboot' command, so this must be a 20 second timeout somewhere) Also, I see no root (/) fs warnings upon booting. Yay! I went back and re-ran the 'sync ; sleep 5 ; sync ; sleep 5 ; shutdown -r now' command and waited several minutes. No reboot. Fail. Restore again, ran 'shutdown now', enter single-user-shell and 'reboot' Syncing disks, vnodes remaining...1 0 done All buffers synced. After 20 seconds, the machine reboots. Restore again, stopped devd and killed cron, syslogd, adjkerntz, dhclient, sendmail and ran 'reboot' Syncing disks, vnodes remaining...1 0 done All buffers synced. Nope it fails. Waited several minutes. Restore again, ran 'shutdown -ro now' (execute 'reboot' instead of signalling init(). Syncing disks, vnodes remaining...2 2 0 0 done All buffers synced. Fail. Restore again, ran 'shutdown -ron now' (prevent filesystem cache from being flushed) Syncing disks, vnodes remaining...2 2 0 0 done All buffers synced. Now the machine instantly reboots! Yay! / was not properly dismounted /: mount pending error: blocks 8512 files 5 ...Rebuilding fs from journal... Findings: 'reboot' or 'shutdown -r' get the same results. Manual pre 'sync' does nothing. Running 'shutdown now' and hence entering single-user mode apparently does something good. Some buffers seem to be connected to a 20 second timeout. Not flushing the buffers at all on shutdown removes the 20 second timeout (but generates a corrupt fs). You can easily reproduce this yourselves in VirtualBox to debug further. See above. The main problem is still a CRITICAL one, since even if you use the 'shutdown now+single-user+20sec timeout'-approach to get the machine to finally _reboot_ OK, you still need KVM-access for the single-user-mode. And if you use the 'shutdown -ron now'-approach, you do get the much needed reboot, but you also get a corrupt fs... 
:-( So remote FreeBSD machines without any iLO/IPMI still suffer badly from this. I hope someone will find a fix soon. /Elof (In reply to Konstantin Belousov from comment #53) > (In reply to Glen Barber from comment #52) > It is physically impossible to hang on a line which is not loop. > Yes, understood. > I suspect that it is either the buffer flush code, or softdep worker thread > which loop and cause shutdown thread to wait. It is good that you have > reproducable case and willing to move it further, previous reporters only > bother to whine. > > See > https://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/ > kerneldebug-deadlocks.html for the instructions on how to configure your > kernel and what information to get. The test machine now has INVARIANTS, INVARIANT_SUPPORT, WITNESS, DEBUG_LOCKS, DEBUG_VFS_LOCKS, DIAGNOSTIC, and ALT_BREAK_TO_DEBUGGER. Unfortunately, it panics on boot now, so I cannot proceed to the 'freebsd-update install; reboot' phase. Just prior to this, I left out DIAGNOSTIC and saw a lock order reversal after the "All buffers synced." message. (I will provide screenshots in a separate update.) It looks like I will need to remove DIAGNOSTIC to get the system to boot. Created attachment 154206 [details]
DIAGNOSTIC panic (1/2)
Created attachment 154207 [details]
DIAGNOSTIC panic (2/2)
Created attachment 154208 [details]
lock order reversal after kern_reboot()
(In reply to Glen Barber from comment #55) > The test machine now has INVARIANTS, INVARIANT_SUPPORT, WITNESS, > DEBUG_LOCKS, DEBUG_VFS_LOCKS, DIAGNOSTIC, and ALT_BREAK_TO_DEBUGGER. > Unfortunately, it panics on boot now, so I cannot proceed to the > 'freebsd-update install; reboot' phase. > > Just prior to this, I left out DIAGNOSTIC and saw a lock order reversal > after the "All buffers synced." message. (I will provide screenshots in a > separate update.) > > It looks like I will need to remove DIAGNOSTIC to get the system to boot. It seems removing DIAGNOSTIC alone was not enough, since now the test machine panics on boot. Although unrelated to the original problem in this PR, the ddb session will be included in a followup. Created attachment 154211 [details]
ddb transcript of panic-on-boot with debugging options enabled
I've finally gotten the machine into a state where I can access the debugger after it hangs. script(1) output of the debugging session will be attached. Created attachment 154224 [details]
debugging session after freebsd-update and reboot
(In reply to Glen Barber from comment #62) You should add WITNESS_SKIPSPIN kernel option, it is known that console spinlocks are not in order. So for the attachment id=154224, is it possible to do show mount and show mount <addr> for the root mp ? You can 'set $lines 0' to disable pager in ddb. (In reply to Konstantin Belousov from comment #63) > (In reply to Glen Barber from comment #62) > You should add WITNESS_SKIPSPIN kernel option, it is known that console > spinlocks are not in order. > Okay, I wasn't sure if we wanted to see spinlocks. > So for the attachment id=154224, is it possible to do show mount and show > mount <addr> for the root mp ? > Sure. One thing to note (though it shouldn't matter) is that each iteration requires a rollback of the VM. I only mention this in case there is inconsitencies between ddb sessions. > You can 'set $lines 0' to disable pager in ddb. Thank you, I wasn't aware of this. (In reply to Glen Barber from comment #59) > (In reply to Glen Barber from comment #55) > > The test machine now has INVARIANTS, INVARIANT_SUPPORT, WITNESS, > > DEBUG_LOCKS, DEBUG_VFS_LOCKS, DIAGNOSTIC, and ALT_BREAK_TO_DEBUGGER. > > Unfortunately, it panics on boot now, so I cannot proceed to the > > 'freebsd-update install; reboot' phase. > > [...] > > It looks like I will need to remove DIAGNOSTIC to get the system to boot. > > It seems removing DIAGNOSTIC alone was not enough, since now the test > machine panics on boot. Just a note: This particular issue (panic-on-boot with DIAGNOSTIC) will need to be reinvestigated after the original issue discussed in this PR is identified and resolved, as right now it is difficult to tell if this is an effect of a larger issue. FYI I had an issue on a HP Proliant 10.0-RELEASE-p7 box which may be related: the machine has been installed and worked perfectly for ~30 days, and one day it suddenly "froze".. I had to physically power off/power on the box and the FS was corrupted afterwards. 
SU+J was unable to recover it (it segfaulted everywhere) but fortunately a manual fsck was able to repair it. As I had a *lot* of issues with SU+J in the past I turned it off (the +J part) on all my FS and since then it has been rock solid (the machine has ~200 days of uptime). I was told that SU+J had been fixed on 10, but apparently there are still problems.

Has anyone had any luck debugging this critical issue?

1) A normal upgrade+reboot always freezes the machine permanently. :-(
2) A normal upgrade+reboot, followed by 'shutdown now', entering single user mode and finally rebooting from there always reboots my machine after a 20 second timeout. Better, but still not a solution since I need ssh-access to do this remotely (no iLO/IPMI/KVM exists).
3) A normal upgrade+reboot, followed by 'shutdown -ron now' (to prevent filesystem cache from being flushed) always makes my machine reboot immediately as it should. However, this too is not a perfect workaround since the filesystem gets corrupted.

Given these three scenarios, and because they are reproducible every time, I hope a solution will be found soon. FreeBSD 10.0 is now unsupported so we FreeBSD users need to upgrade all our machines. In my case this is >100 machines located all over the world.

Sorry. I was a bit fast with the copy-n-paste there. Sections 2 and 3 should read: "A normal upgrade, followed by"...
/Elof

Just an update to note that this issue is not forgotten, and is being actively (and heavily) investigated. The underlying causes are not yet fully understood, and are quite complex by nature.

Glen, that sounds good. Just an update from me too: I counted the seconds once more today when I installed and upgraded a 10.1-machine, and I think that all my "20 seconds" above should actually read "30 seconds".
A commit references this bug:

Author: kib
Date: Fri Mar 27 13:55:57 UTC 2015
New revision: 280760
URL: https://svnweb.freebsd.org/changeset/base/280760

Log:
Fix the hang after the immediate reboot when the following command sequence is performed on UFS SU+J rootfs:
cp -Rp /sbin/init /sbin/init.old
mv -f /sbin/init.old /sbin/init

Hang occurs on the rootfs unmount. There are two issues:

1. Removed init binary, which is still mapped, creates a reference to the removed vnode. The inodeblock for such vnode must have active inodedep, which is (eventually) linked through the unlinked list. This means that ffs_sync(MNT_SUSPEND) cannot succeed, because number of softdep workitems for the mp is always > 0. FFS is suspended during unmount, so unmount just hangs.

2. As noted above, the inodedep is linked eventually. It is not linked until the superblock is written. But at the vfs_unmountall() time, when the rootfs is unmounted, the call is made to ffs_unmount()->ffs_sync() before vflush(), and ffs_sync() only calls ffs_sbupdate() after all workitems are flushed. It is masked for normal system operations, because syncer works in parallel and eventually flushes superblock. Syncer is stopped when rootfs unmounted, so ffs_sync() must do sb update on its own.

Correct the issues listed above. For MNT_SUSPEND, count the number of linked unlinked inodedeps (this is not a typo) and subtract the count of such workitems from the total. For the second issue, the ffs_sbupdate() is called right after device sync in ffs_sync() loop.

There is a third problem, occurring with both SU and SU+J. The softdep_waitidle() loop, which waits for softdep_flush() thread to clear the worklist, only waits 20ms max. It seems that the 1 tick, specified for msleep(9), was a typo.

Add fsync(devvp, MNT_WAIT) call to softdep_waitidle(), which seems to significantly help the softdep thread, and change the MNT_LAZY update at the reboot time to MNT_WAIT for similar reasons.
Note that userspace cannot create more work while devvp is flushed, since the mount point is always suspended before the call to softdep_waitidle() in unmount or remount path.

PR: 195458
In collaboration with: gjb, pho
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks

Changes:
head/sys/ufs/ffs/ffs_softdep.c
head/sys/ufs/ffs/ffs_vfsops.c

(In reply to commit-hook from comment #71)
Great work tracking this one down. I am guessing this is a no, but are there any plans for this to make it into the 10.1-RELEASE branch? Another one of my systems was hit by this today going from 10.1-p8 to 10.1-p9. Or is patching a custom kernel the recommended solution?

Take.

(In reply to Xin LI from comment #73)
> Take.

Ping? Is this being scheduled for EN?

(In reply to Bryan Drewery from comment #74)
Glen replied to me on the freebsd-stable list, and said that an EN was planned but there was no ETA.
https://www.mail-archive.com/freebsd-stable@freebsd.org/msg129134.html

This is fixed in FreeBSD-EN-15:05.ufs. |
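
For readers trying to follow the r280760 log, the first two issues can be illustrated with a toy user-space model in C. This is only a sketch under stated assumptions: the struct, its fields, and every function name below are hypothetical and invented for illustration, and none of it is the actual code in ffs_softdep.c or ffs_vfsops.c. It only mirrors the accounting the log describes: a removed-but-still-mapped file leaves one inodedep behind, that inodedep is linked only once the superblock is written, and the corrected MNT_SUSPEND check subtracts linked unlinked inodedeps from the work item total.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model of per-mount softdep accounting.
 * All names are hypothetical; this is not kernel code.
 */
struct softdep_model {
    int workitems;         /* all outstanding softdep work items */
    int unlinked_pending;  /* inodedeps of removed-but-referenced inodes,
                              not yet on the unlinked list */
    int unlinked_linked;   /* the same kind of inodedep, already linked */
};

/* Removing a file that is still mapped (such as the running init)
 * leaves one inodedep that cannot be flushed away. */
static void remove_mapped_file(struct softdep_model *m)
{
    m->workitems++;
    m->unlinked_pending++;
}

/* Model of a superblock write: pending inodedeps become linked
 * (the second issue in the log). */
static void superblock_update(struct softdep_model *m)
{
    m->unlinked_linked += m->unlinked_pending;
    m->unlinked_pending = 0;
}

/* Old rule: suspension requires zero work items, which never
 * happens in this scenario (the first issue in the log). */
static bool can_suspend_old(const struct softdep_model *m)
{
    return m->workitems == 0;
}

/* New rule: linked unlinked inodedeps are subtracted from the total. */
static bool can_suspend_new(const struct softdep_model *m)
{
    assert(m->unlinked_linked <= m->workitems);
    return m->workitems - m->unlinked_linked == 0;
}
```

Walking one replaced init through the model: after remove_mapped_file(), neither rule allows suspension. After superblock_update(), the old rule still refuses (the hang), while the new rule allows the unmount to proceed. This is why, in the model as in the log, both the changed count and the earlier superblock update are needed.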