Bug 228384 - zfs utilities hang on [spa_namespace_lock] (transferred from 203906 as an unrelated issue)
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 11.1-STABLE
Hardware: i386 Any
Importance: --- Affects Only Me
Assignee: freebsd-fs mailing list
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-05-20 22:52 UTC by Alexandre Kovalenko
Modified: 2018-05-31 15:15 UTC
CC List: 1 user

See Also:


Attachments
procstat output (4.57 KB, application/x-xz), 2018-05-20 22:52 UTC, Alexandre Kovalenko
Lock owner (970.38 KB, image/jpeg), 2018-05-20 22:53 UTC, Alexandre Kovalenko
Stack trace part 1 (810.89 KB, image/jpeg), 2018-05-20 22:53 UTC, Alexandre Kovalenko
Stack trace part 2 (794.22 KB, image/jpeg), 2018-05-20 22:54 UTC, Alexandre Kovalenko

Description Alexandre Kovalenko 2018-05-20 22:52:30 UTC
Created attachment 193578
procstat output

zpool status hangs on [spa_namespace_lock]. Output of procstat -a -kk and pictures of the stack trace from the thread holding the lock are attached.

From Andriy Gapon <avg@FreeBSD.org>:

So, it's:
   13 100015 geom                g_event             mi_switch+0xe6 sleepq_wait+0x2c _sx_xlock_hard+0x314 zvol_geom_access+0x148 g_access+0x1fd g_eli_read_metadata+0x249 g_eli_config+0x3ed g_run_events+0x13e fork_exit+0x83 fork_trampoline+0xe
  930 100354 zfsd                -                   mi_switch+0xe6 sleepq_wait+0x2c _sleep+0x23e g_access+0xf7 vdev_geom_attach+0x61c vdev_attach_ok+0x29 vdev_geom_open+0x394 vdev_open+0x115 vdev_open_children+0x30 vdev_root_open+0x3a vdev_open+0x115 spa_ld_open_vdevs+0x5e spa_ld_mos_init+0x1be spa_ld_mos_with_trusted_config+0x19 spa_load+0x5c spa_load_best+0x6b spa_open_common+0x11d spa_get_stats+0x4f

I think I know what's going on in your case (not sure if it's the same as the previous reports in this bug).  It's probably a consequence of base r330977, a fix for bug 225960.
I didn't fully realize it at the time, but that commit introduced a "g_access lock" in disguise.
So, we went from a LOR between the GEOM topology lock and spa_namespace_lock, to a race caused by dropping the topology lock, and now to a LOR between spa_namespace_lock and the "g_access lock".

The toughest part now is deciding how to solve the LOR without re-introducing the race, or alternatively, how to solve the race without introducing a deadlock.
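
For illustration only, one textbook way to break such a LOR is a try-lock with back-off on the side that would otherwise block in the wrong order. Continuing the hypothetical sketch above (sched_yield() needs <sched.h>); this is a generic pattern, not the fix applied in FreeBSD:

    /* Acquire lock_b then lock_a without ever blocking in the wrong order. */
    for (;;) {
            pthread_mutex_lock(&lock_b);
            if (pthread_mutex_trylock(&lock_a) == 0)
                    break;                  /* got both locks safely */
            pthread_mutex_unlock(&lock_b);  /* back off so the other thread can progress */
            sched_yield();
    }
    /* ... work that needs both locks ... */
    pthread_mutex_unlock(&lock_a);
    pthread_mutex_unlock(&lock_b);

The catch, per the dilemma stated above, is that backing off by releasing the "g_access lock" risks reopening exactly the race window that r330977 closed, which is why neither direction of the fix is trivial here.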
Comment 1 Alexandre Kovalenko 2018-05-20 22:53:17 UTC
Created attachment 193579
Lock owner
Comment 2 Alexandre Kovalenko 2018-05-20 22:53:43 UTC
Created attachment 193580
Stack trace part 1
Comment 3 Alexandre Kovalenko 2018-05-20 22:54:12 UTC
Created attachment 193581
Stack trace part 2
Comment 4 Alexandre Kovalenko 2018-05-20 22:59:25 UTC
I can test patches, produce and provide a link to the core dump, and do whatever else is necessary -- the machine is under my total control and can be taken offline at will. The only caveat is that the CPU is not all that powerful -- a complete rebuild of the kernel takes several hours.