Bug 247544 - zvol: unbootable system with root on zfs
Summary: zvol: unbootable system with root on zfs
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: CURRENT
Hardware: Any Any
: --- Affects Only Me
Assignee: freebsd-fs (Nobody)
URL: https://reviews.freebsd.org/D23478
Keywords: regression
Depends on:
Reported: 2020-06-25 18:29 UTC by Evilham
Modified: 2020-07-21 11:26 UTC (History)
3 users (show)

See Also:

core.txt (135.87 KB, text/plain)
2020-07-15 09:44 UTC, Evilham
no flags Details
gdb.txt (763.80 KB, text/plain)
2020-07-18 18:18 UTC, Evilham
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Evilham 2020-06-25 18:29:35 UTC
I have bisected the source (took me a few days, wanted to report with more info) and traced my issue down to revisions 362048 and 362047, coming from differential: https://reviews.freebsd.org/D23478

Running HEAD the two afore mentioned revisions reverted results in a properly working system.

What I am experiencing with the mentioned revisions applied is a hanging boot sequence with following message (modulo typos because it's a transcript):

/etc/rc: DEBUG: checkyesno: zfs_enable is set to YES.
/etc/rc: DEBUG: load_kld: zfs kernel module already loaded.
/etc/rc: DEBUG: run_rc_command: doit: zvol_start

(no further messages are shown)

The setup itself is somewhat standard:

Created with the installer geli full disk encryption + zfs on root using the whole disk (with partitions for EFI, boot, swap as created by the installer).

Inside the main pool, which contains the root dataset, I do have a zvol which in turn is also encrypted with geli and I only unlock when necessary.

Hardware wise: this is a Lenovo A485 laptop with an AMD Ryzen processor (BIOS being up to date) and the disk is nvme using the nda driver (as opposed to nvd).

I'll try to re-create this with a simpler setup that isn't my work one, but since this is scheduled to be MFC'd, I thought I'd notify this with as much information as I could first.

Comment 1 Evilham 2020-06-25 18:31:35 UTC
Added avg to CC since they might still have this in memory / have more hints as to what to check or how to reproduce this.
Comment 2 Andriy Gapon freebsd_committer 2020-06-25 19:08:51 UTC
Do you have swap on zvol ?
Comment 3 Evilham 2020-06-25 19:48:35 UTC
(In reply to Andriy Gapon from comment #2)
I don't have swap on zvol, sorry for not sending more complete information in the first place.

Please let me know if I can help provide more information.

# This partitioning scheme was created by the installer:
# gpart show
=>        40  1000215136  nda0  GPT  (477G)
          40      409600     1  efi  (200M)
      409640        1024     2  freebsd-boot  (512K)
      410664         984        - free -  (492K)
      411648     4194304     3  freebsd-swap  (2.0G)
     4605952   995608576     4  freebsd-zfs  (475G)
  1000214528         648        - free -  (324K)

# This zpool was also created by the installer
# zpool status zroot
  pool: zroot
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
	still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
	the pool may no longer be accessible by software that does not support
	the features. See zpool-features(7) for details.
  scan: scrub repaired 0 in 0 days 00:14:28 with 0 errors on Thu Dec 19 11:37:05 2019
checkpoint: created Thu Jun 25 18:13:14 2020, consumes 586M

	zroot         ONLINE       0     0     0
	  nda0p4.eli  ONLINE       0     0     0

errors: No known data errors

# This is the only zvol in the 'zroot' zpool
# zfs list zroot/things
zroot/things  2,79G  17,2G   265M  -

# The zroot/things zvol is actually geli-encrypted with a zpool in it
# the 'nested' zpool shouldn't be relevant for the boot sequence though as
# this zvol is decrypted only manually
# geli attach /dev/zvol/zroot/things
# zpool status things
  pool: things
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
	still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
	the pool may no longer be accessible by software that does not support
	the features. See zpool-features(7) for details.
  scan: none requested

	NAME                      STATE     READ WRITE CKSUM
	things                   ONLINE       0     0     0
	  zvol/zroot/things.eli  ONLINE       0     0     0

errors: No known data errors
Comment 4 Andriy Gapon freebsd_committer 2020-06-26 08:29:01 UTC
Can you try pressing Ctrl-T when it hang?
Is there any interesting output?

Can Ctrl-C interrupt it and allow the startup to proceed?
If yes, then it would be easier to get more information.
Comment 5 Evilham 2020-06-28 11:41:03 UTC
(In reply to Andriy Gapon from comment #4)
Hello, sorry for the delay.
^C does not help the boot sequence to move on, ^T does give out some information (transcription):

load: 0.26 cmd: sh 7099 [piperd] 172.27r 0.00u 0.00s 0% 2888k
#0 0xffffffff80be6331 at mi_switch+0xc1
#1 0xffffffff80c36551 at sleepq_catch_signals+0x3d1
#2 0xffffffff80c36169 at sleepq_wait_sig+0x9
#3 0xffffffff80be5a4e at _sleep+0x1be
#4 0xffffffff80c49c9a at pipe_read+0x43a
#5 0xffffffff80c46961 at dofileread+0x81
#6 0xffffffff80c464e0 at sys_read+0xc0
#7 0xffffffff8106f089 at amd64_syscall+0x119
#8 0xffffffff81045630 at fast_syscall_common+0x101

It doesn't seem to change after some time either.

I tried manually running service zvol start from a single-user session and it doesn't hang but exits successfully instead.
Comment 6 Andriy Gapon freebsd_committer 2020-06-30 20:45:38 UTC
Can you get a crash dump?
E.g. via Ctrl+Alt+Esc or Ctrl+SysRq ?

Note that if you can, then I will ask you to run some kgdb commands on it.
I do not need a crash dump file (vmcore.*).
Comment 7 Evilham 2020-07-09 22:09:14 UTC
(In reply to Andriy Gapon from comment #6)

Hello, thank you for the reply, I had totally missed it, sorry.

Just gave it a go, by modifying /etc/rc.d/zvol to REQUIRE dumpon I was able to get the prompt, create a dump and get a savecore after booting the system normally.
I'd be delighted to run kgdb on the crash dump as requested.

Perhaps it's interesting to mention that I have dropped my reverting of the mentioned revisions, and am running HEAD without patches; so far I haven't noticed anything off.
Setting rc_debug="YES" in /etc/rc.conf is what allows me to boot the system and now it's easier for me to test potential fixes since it's just a matter of (un)commenting that in single-user mode.

Dropping /etc/rc.d/zvol *without* rc_debug="YES" resulted in the hang moving to "Mounting filesystems...", so it's a tad misleading that I'm seeing this with /etc/rc.d/zvol, according to service -e it just happens to be the second rc script (first one being hostid).

I tried reproducing this under bhyve / with a different hard drive and an "identical setup" (as much as I was able to) but was not successful at it; maybe it has something to do with the nvme being quick enough for something to get deadlocked, and also that'd be why rc_debug="YES" allows me to boot normally.
Comment 8 Andriy Gapon freebsd_committer 2020-07-15 08:07:24 UTC
(In reply to Evilham from comment #7)
Sorry for the delay.
Have you got a core.txt file corresponding to the core (in /var/crash) ?
If yes, could you please share it?
Comment 9 Evilham 2020-07-15 09:44:46 UTC
Created attachment 216460 [details]

Here you go!

I've been testing the Matt's OpenZFS and haven't upgraded the pool, so I can rollback and keep helping debug this since non-CURRENT could be affected.

Thank you for looking into it!
Comment 10 Andriy Gapon freebsd_committer 2020-07-15 10:29:57 UTC
(In reply to Evilham from comment #9)
Could you please open the core in kgdb, run the following commands and share the resulting gdb.txt:
set pagination off
set logging on
set logging redirect on
thread apply all bt
<this may take a while to complete>
Comment 11 Evilham 2020-07-18 18:18:39 UTC
Created attachment 216546 [details]

I made a mistake and deleted the BE where the core.txt came from, so this was result of recompiling >,<.

Please let me know if the matching core.txt is necessary.
Comment 12 Andriy Gapon freebsd_committer 2020-07-21 08:15:56 UTC
(In reply to Evilham from comment #11)
That's not a problem.
Thank you very much for the data.

I think that I see a deadlock.
It seems to happen as a result of a race condition where two unrelated events can cause the deadlock if they happen at the same time.
I will need some more time to analyze the problem and to, hopefully, come up with a fix.

The threads involved:
Thread 815 (Thread 100046):
#0  sched_switch (td=0xfffffe003a5b7100, flags=<optimized out>) at /freebsd/src/sys/kern/sched_ule.c:2147
#1  0xffffffff80be1481 in mi_switch (flags=260) at /freebsd/src/sys/kern/kern_synch.c:542
#2  0xffffffff80bdf761 in _sx_xlock_hard (sx=0xffffffff822bea08 <spa_namespace_lock>, x=<optimized out>, opts=<optimized out>) at /freebsd/src/sys/kern/kern_sx.c:858
#3  0xffffffff8221062b in __sx_xlock (sx=<optimized out>, td=0xfffffe003a5b7100, opts=<error reading variable: Cannot access memory at address 0x0>, file=<optimized out>, line=<optimized out>) at /freebsd/src/sys/sys/sx.h:168
#4  zvol_close (pp=0xfffff8002710e700, flag=<optimized out>, count=<optimized out>) at /freebsd/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zvol.c:1270
#5  zvol_geom_access (pp=0xfffff8002710e700, acr=<optimized out>, acw=0, ace=<optimized out>) at /freebsd/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zvol.c:2753
#6  0xffffffff80b1f304 in g_access (cp=0xfffff8003da02580, dcr=<optimized out>, dcw=<optimized out>, dce=0) at /freebsd/src/sys/geom/geom_subr.c:1041
#7  0xffffffff80b22586 in g_label_taste (mp=0xffffffff818f2ac0 <g_label_class>, pp=<optimized out>, flags=<optimized out>) at /freebsd/src/sys/geom/label/g_label.c:411
#8  0xffffffff80b1faaa in g_new_provider_event (arg=0xfffff8002710e700, flag=<optimized out>) at /freebsd/src/sys/geom/geom_subr.c:596
#9  0xffffffff80b1acee in one_event () at /freebsd/src/sys/geom/geom_event.c:265
#10 g_run_events () at /freebsd/src/sys/geom/geom_event.c:288
#11 0xffffffff80b9360e in fork_exit (callout=0xffffffff80b1d240 <g_event_procbody>, arg=0x0, frame=0xfffffe003a809c00) at /freebsd/src/sys/kern/kern_fork.c:1052

Thread 322 (Thread 100542):
#0  sched_switch (td=0xfffffe00dca8c000, flags=<optimized out>) at /freebsd/src/sys/kern/sched_ule.c:2147
#1  0xffffffff80be1481 in mi_switch (flags=260) at /freebsd/src/sys/kern/kern_synch.c:542
#2  0xffffffff80be0bab in _sleep (ident=0xfffff8002710e800, lock=<optimized out>, priority=0, wmesg=0xffffffff811b9f44 "gtopol", sbt=0, pr=<optimized out>, flags=256) at /freebsd/src/sys/kern/kern_synch.c:221
#3  0xffffffff80b1f207 in g_access (cp=0xfffff8003d6b3480, dcr=1, dcw=0, dce=1) at /freebsd/src/sys/geom/geom_subr.c:1000
#4  0xffffffff8221277a in vdev_geom_attach (pp=0xfffff8002710e700, vd=0x0, sanity=1) at /freebsd/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c:289
#5  0xffffffff822149b9 in vdev_attach_ok (vd=0xfffff8003ddfa000, pp=0xfffff8002710e700) at /freebsd/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c:661
#6  0xffffffff822139fa in vdev_geom_attach_by_guids (vd=0xfffff8003ddfa000) at /freebsd/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c:735
#7  vdev_geom_open_by_guids (vd=0xfffff8003ddfa000) at /freebsd/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c:772
#8  vdev_geom_open (vd=0xfffff8003ddfa000, psize=0xfffffe00dd1342d8, max_psize=0xfffffe00dd1342e0, logical_ashift=<optimized out>, physical_ashift=0xfffffe00dd1342d0) at /freebsd/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c:877
#9  0xffffffff821a5081 in vdev_open (vd=0xfffff8003ddfa000) at /freebsd/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev.c:1696
#10 0xffffffff821a4f29 in vdev_open_children (vd=0xfffff8003dd6a000) at /freebsd/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev.c:1620
#11 0xffffffff821bbc1a in vdev_root_open (vd=0xfffff8003dd6a000, asize=0xfffffe00dd1343b8, max_asize=0xfffffe00dd1343c0, logical_ashift=0xfffffe00dd1343a8, physical_ashift=0xfffffe00dd1343b0) at /freebsd/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_root.c:96
#12 0xffffffff821a5081 in vdev_open (vd=0xfffff8003dd6a000) at /freebsd/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev.c:1696
#13 0xffffffff82195abc in spa_ld_open_vdevs (spa=0xfffffe00dddfd000) at /freebsd/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c:2921
#14 0xffffffff82194309 in spa_ld_mos_init (spa=0xfffffe00dddfd000, type=SPA_IMPORT_EXISTING) at /freebsd/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c:3970
#15 0xffffffff82194099 in spa_ld_mos_with_trusted_config (spa=0xfffffe00dddfd000, type=SPA_IMPORT_EXISTING, update_config_cache=0xfffffe00dd13461c) at /freebsd/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c:4113
#16 0xffffffff8218d2e9 in spa_load_impl (spa=0xfffffe00dddfd000, type=SPA_IMPORT_EXISTING, ereport=<optimized out>) at /freebsd/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c:4167
#17 spa_load (spa=0xfffffe00dddfd000, state=<optimized out>, type=SPA_IMPORT_EXISTING) at /freebsd/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c:2433
#18 0xffffffff8218cbd5 in spa_load_best (spa=0xfffffe00dddfd000, state=SPA_LOAD_OPEN, max_request=<optimized out>, rewind_flags=1) at /freebsd/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c:4441
#19 0xffffffff82189304 in spa_open_common (pool=0xfffffe00d11f7000 "servers", spapp=0xfffffe00dd134748, tag=0xffffffff8223fb89, nvpolicy=<optimized out>, config=0x0) at /freebsd/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c:4576
#20 0xffffffff821eeae9 in pool_status_check (name=<unavailable>, type=<optimized out>, check=POOL_CHECK_SUSPENDED) at /freebsd/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ioctl.c:6804
#21 0xffffffff821f86c1 in zfsdev_ioctl (dev=<optimized out>, zcmd=<optimized out>, arg=<optimized out>, flag=<optimized out>, td=<optimized out>) at /freebsd/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ioctl.c:7117
#22 0xffffffff80a80007 in devfs_ioctl (ap=0xfffffe00dd134878) at /freebsd/src/sys/fs/devfs/devfs_vnops.c:824
#23 0xffffffff80cbdab9 in vn_ioctl (fp=0xfffff8003d61df50, com=<optimized out>, data=0xfffffe00dd134a20, active_cred=0xfffff8003b9a9900, td=<unavailable>) at /freebsd/src/sys/kern/vfs_vnops.c:1588
#24 0xffffffff80a8063e in devfs_ioctl_f (fp=<unavailable>, com=<unavailable>, data=<unavailable>, cred=<unavailable>, td=0xfffffe00dca8c000) at /freebsd/src/sys/fs/devfs/devfs_vnops.c:755
#25 0xffffffff80c42dce in fo_ioctl (fp=0xfffff8003d61df50, com=<optimized out>, data=<unavailable>, active_cred=<unavailable>, td=0xfffffe00dca8c000) at /freebsd/src/sys/sys/file.h:339
#26 kern_ioctl (td=<unavailable>, fd=3, com=<optimized out>, data=<unavailable>) at /freebsd/src/sys/kern/sys_generic.c:802
#27 0xffffffff80c42aca in sys_ioctl (td=0xfffffe00dca8c000, uap=0xfffffe00dca8c3e8) at /freebsd/src/sys/kern/sys_generic.c:710
#28 0xffffffff80ff5569 in syscallenter (td=0xfffffe00dca8c000) at /freebsd/src/sys/amd64/amd64/../../kern/subr_syscall.c:162
#29 amd64_syscall (td=0xfffffe00dca8c000, traced=0) at /freebsd/src/sys/amd64/amd64/trap.c:1161

This is a more subtle variation of a known FreeBSD problem with pools on top of zvols.
The deadlock is between GEOM and ZFS's spa_namespace_lock.

By the way, what vfs.zfs.vol.recursive is set to on that system?
Comment 13 Evilham 2020-07-21 08:57:27 UTC
Yay! We found a bug!

# sysctl vfs.zfs.vol.recursive
vfs.zfs.vol.recursive: 0

Just out of curiosity and because I haven't seen a mention of it: do you know how support for ZFS in the other branches is going to be maintained after OpenZFS lands in CURRENT?
Because e.g. this doesn't show up with OpenZFS and I guess spotting future issues before MFC is going to be trickier.

Anyway, good thing that this starts making sense, thank you!
Comment 14 Andriy Gapon freebsd_committer 2020-07-21 11:26:22 UTC
(In reply to Evilham from comment #13)
I don't know for sure.  But my guess that ZFS in 12/11 is going to be mostly frozen save for bug fixed.  I do not expect new features, etc.  Those who need them will be encouraged to use the port.