Bug 211491 - System hangs after "Uptime" on reboot with ZFS
Summary: System hangs after "Uptime" on reboot with ZFS
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 11.0-STABLE
Hardware: amd64 Any
Importance: --- Affects Many People
Assignee: freebsd-fs mailing list
URL:
Keywords: needs-qa
Depends on:
Blocks:
 
Reported: 2016-08-01 02:56 UTC by g_amanakis
Modified: 2017-05-16 00:21 UTC

See Also:
koobs: mfc-stable11?
koobs: mfc-stable10?


Attachments
dmesg (11.41 KB, text/plain)
2016-08-01 02:56 UTC, g_amanakis
loader.conf (367 bytes, text/plain)
2016-08-01 02:56 UTC, g_amanakis

Description g_amanakis 2016-08-01 02:56:01 UTC
Created attachment 173147
dmesg

The system is a Supermicro X9SCM running FreeBSD 11.0-BETA3 amd64 with a GENERIC kernel. When issuing "reboot", the system hangs after the "Uptime" message. The system then has to be reset by physically holding the power button or by issuing a power cycle from the IPMI interface.

I am attaching the dmesg and loader.conf output.
Comment 1 g_amanakis 2016-08-01 02:56:31 UTC
Created attachment 173148
loader.conf
Comment 2 g_amanakis 2016-08-01 02:57:23 UTC
Of note, despite issuing the "reboot" command, I can still ping the em0 interface on this machine.
Comment 3 Eric van Gyzen freebsd_committer 2016-08-09 15:10:02 UTC
I have also seen this on a Dell Precision Workstation running 11-CURRENT.  It doesn't always happen, but it seems more likely after a buildworld/installworld.  I'm using ZFS on a single SSD and an iSCSI LUN using the in-kernel initiator.  I'm also using the nvidia driver, since this is my main desktop.

I'll try to reproduce and get more details (i.e. a core dump).
Comment 4 Eric van Gyzen freebsd_committer 2016-08-09 18:23:51 UTC
I think I've narrowed it down to having a ZFS pool imported from an iSCSI LUN using the in-kernel software initiator.

g_amanakis:  Is this true on your system, too?
Comment 5 Eric van Gyzen freebsd_committer 2016-08-09 19:42:47 UTC
As an amendment to my last comment, the zpool has to be imported with altroot:

    zpool import -R /foo bar_pool

I was failing to reproduce this on two other systems by having the pool imported without -R (altroot).  When I added the altroot, those two systems began reproducing it consistently.
Comment 6 Eric van Gyzen freebsd_committer 2016-08-09 19:45:37 UTC
One of those systems is a bhyve VM.  On the host, I see that all CPU threads are "vmidle" and consuming no CPU time, so they're blocked on some event, not spinning.  (I realize this doesn't help much.)
Comment 7 Eric van Gyzen freebsd_committer 2016-08-09 20:09:57 UTC
So far, I've been importing the volume using the iSCSI initiator in a bhyve guest.  To determine whether iSCSI is really involved, I reconfigured to import the volume onto the bhyve host and pass it to the VM as a virtio block device.  I failed to reproduce the hang.  (I'm still importing with altroot.)

Also, I have only reproduced the hang on systems with debugging kernel options such as INVARIANTS and WITNESS.  To determine whether they're really involved, I'm rebuilding my VM's kernel without these options.
Comment 8 Eric van Gyzen freebsd_committer 2016-08-09 20:29:14 UTC
The debugging kernel options are not part of the problem.  I removed them and still reproduced the hang.
Comment 9 Eric van Gyzen freebsd_committer 2016-08-09 20:36:25 UTC
Steps to reproduce on a stable/11 r303878 GENERIC kernel (and many earlier revs):

1. Attach to an iSCSI LUN with /etc/iscsi.conf and iscsid.
2. Create a ZFS pool on that LUN:  zpool create iscsi_test da0
3. Export the pool:  zpool export iscsi_test
4. Import the pool with an altroot: zpool import -R /blah iscsi_test
5. shutdown -r now

The system will hang after printing Uptime: XdYmZs.

iSCSI and altroot are key to producing the hang.
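
For convenience, steps 2 through 5 above can be sketched as a small shell function. This is only an illustration of the listed steps, not a tested reproducer: the function name repro_hang and the RUN switch are made up for this sketch, and it assumes the iSCSI LUN is already attached (step 1) and shows up as da0. By default it only prints the commands, since zpool create destroys whatever is on the device.

```shell
#!/bin/sh
# Sketch of steps 2-5 from the comment above (hypothetical helper).
# Assumes the iSCSI LUN is already attached (step 1).
# By default the commands are only printed; set RUN=1 to execute them.
repro_hang() {
    dev=${1:-da0}            # device node of the iSCSI LUN (da0 in the report)
    pool=${2:-iscsi_test}    # pool name used in the report
    altroot=${3:-/blah}      # altroot used in the report

    for cmd in \
        "zpool create $pool $dev" \
        "zpool export $pool" \
        "zpool import -R $altroot $pool" \
        "shutdown -r now"
    do
        if [ "${RUN:-0}" = "1" ]; then
            # Word splitting of $cmd is intentional here.
            $cmd || return 1
        else
            echo "$cmd"
        fi
    done
}
```

Calling repro_hang with no arguments prints the four commands exactly as listed in the steps; RUN=1 repro_hang runs them as root and ends with the reboot that is expected to hang.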
Comment 10 Xin LI freebsd_committer 2016-08-09 20:39:40 UTC
I noticed this too, but it is not 100% reproducible.

I don't have iSCSI set up, but I do have a zvol.  It was a fresh -CURRENT.
Comment 11 Eric van Gyzen freebsd_committer 2016-08-09 21:25:51 UTC
I just reproduced this on 10.3-STABLE r303633.  I'll try to reproduce on 10.3-RELEASE to see if it would be a new regression in 11.0-RELEASE.
Comment 12 Eric van Gyzen freebsd_committer 2016-08-09 21:35:07 UTC
I just reproduced this on 12-CURRENT r303626.  I'm now updating that machine to the latest head.
Comment 13 Enji Cooper freebsd_committer 2016-08-09 23:10:17 UTC
Please don't add -current or -stable to bugs like this; it spams the list unnecessarily (this issue impacts users of iSCSI + ZFS -- which seems a bit niche right now)
Comment 14 g_amanakis 2016-08-09 23:21:44 UTC
On my system ctld is enabled but there are no clients. The bug persists with ctld disabled, too. Also there is a bhyve-VM with passthrough of an onboard NIC but this doesn't affect the bug. The root filesystem is ZFS. Summing up, no iSCSI, no altroot, ZFS on root. I could normally reboot and shutdown the system on 10.3-RELEASE and 10.3-STABLE before upgrading to 11.0-BETA1, which is when I noticed the bug.

No kernel panic happens though. Could I get more verbose logging during shutdown to see what is going on? Most strikingly, after the system "hangs" on "Uptime ..." I can still successfully ping one of the onboard ifaces, not the VT-d one.
Comment 15 Eric van Gyzen freebsd_committer 2016-08-10 14:56:44 UTC
This bug is not limited to iSCSI.  I have updated the summary accordingly.
Comment 16 Eric van Gyzen freebsd_committer 2016-08-10 15:34:34 UTC
I could NOT reproduce this on 10.3-RELEASE, so this will be a new regression in 11.0-RELEASE.

I can still reproduce it on head at r303895 (9 August).

I can't spend any more time on this.  I suggest that someone reproduce and bisect the commits between 10.3-RELEASE and 10-STABLE.
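
A bisection like the one suggested above could be driven by a plain binary search over revision numbers. The sketch below is a rough outline only: bisect_revs and the predicate it calls are hypothetical names, and the predicate (check out the given revision, build, install, boot, and report whether a reboot completes) is the expensive part that is not implemented here. It assumes exactly one transition from good to bad in the range.

```shell
#!/bin/sh
# Hypothetical helper: binary search for the first bad revision.
# usage: bisect_revs GOOD BAD PREDICATE
#   GOOD      revision known to reboot cleanly
#   BAD       revision known to hang
#   PREDICATE command that exits 0 if the revision given as $1 is good
bisect_revs() {
    good=$1
    bad=$2
    is_good=$3

    # Invariant: $good is good, $bad is bad. Stop when they are adjacent.
    while [ $((bad - good)) -gt 1 ]; do
        mid=$(( (good + bad) / 2 ))
        if "$is_good" "$mid"; then
            good=$mid
        else
            bad=$mid
        fi
    done

    # $bad is now the lowest revision number that tests bad.
    echo "$bad"
}
```

Because Subversion revision numbers are repository-wide, many midpoints will not correspond to a commit on the branch being tested. The result is therefore the lowest revision number at which the branch state tests bad, and the offending commit is the last branch commit at or before that number.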
Comment 17 Xin LI freebsd_committer 2016-08-14 23:21:21 UTC
FYI, I can no longer reproduce this on -CURRENT (currently at r304072).
Comment 18 Eric van Gyzen freebsd_committer 2016-08-17 15:50:47 UTC
I can still reproduce this on head at r304162.
Comment 19 Adam Stylinski 2016-11-27 09:20:39 UTC
I too am having this issue: no iSCSI, no altroot, just zvols.  This is happening on 11.0-RELEASE, and on reboot the zpool complains that it was already imported on another system.
Comment 20 g_amanakis 2017-05-16 00:21:20 UTC
This bug seems to be the same as "https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=167685"

Setting hw.usb.no_shutdown_wait=1 resolves it. 

Strangely, though, if I manually export the zpool with 6 SATA drives first, I can reboot, even with hw.usb.no_shutdown_wait=0 (the default value).

It is unclear to me how the 6 SATA-drive zpool is connected with hw.usb.no_shutdown_wait.
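
For anyone wanting to try the workaround above, the setting can be made persistent with something like the following. This is a minimal sketch: the function name persist_no_shutdown_wait is made up here, and /etc/sysctl.conf as the default target is an assumption (the comment above does not say where the value was set).

```shell
#!/bin/sh
# Hypothetical helper: append the workaround to a sysctl.conf-style file,
# but only if the exact line is not already present.
persist_no_shutdown_wait() {
    conf=${1:-/etc/sysctl.conf}          # assumed default location
    line='hw.usb.no_shutdown_wait=1'     # value reported to resolve the hang

    # -x matches whole lines, -F treats the pattern as a fixed string.
    grep -qxF "$line" "$conf" 2>/dev/null || echo "$line" >> "$conf"
}
```

To test the effect before making it persistent, the value can be set for the running system with sysctl hw.usb.no_shutdown_wait=1 as root, assuming the sysctl is writable at runtime on the release in use.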