Bug 210409 - zfs: panic during boot, spa_refcount < spa_minref
Summary: zfs: panic during boot, spa_refcount < spa_minref
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: CURRENT
Hardware: Any
OS: Any
Importance: --- Affects Only Me
Assignee: freebsd-fs mailing list
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-06-20 15:30 UTC by Kristof Provost
Modified: 2018-12-06 13:54 UTC
CC List: 4 users

See Also:


Attachments
Initialize needs_update in vdev_geom_set_physpath (521 bytes, patch)
2016-06-20 23:24 UTC, Alan Somers

Description Kristof Provost freebsd_committer 2016-06-20 15:30:57 UTC
I’m running a root-on-ZFS system and reliably see this panic during boot.
It’s a 4-disk raidz1 with no log or cache devices.

Hardware is an HP Microserver.

Likely culprit (through bisect) is r300881.
It’s now running r302028 with r300881 backed out, and booting fine.

The panic:
panic: solaris assert: refcount_count(&spa->spa_refcount) >= spa->spa_minref ||
MUTEX_HELD(&spa_namespace_lock), file:
/usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa_misc.c, line: 863

Unfortunately I can’t get a dump, but here’s a picture of the backtrace:
https://people.freebsd.org/~kp/zfs_panic.jpg
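For reference, the failing check boils down to: the pool's reference count must never drop below spa_minref unless spa_namespace_lock is held. A hedged, simplified model of the condition (names mirror the panic message, not the real spa_misc.c code):

```c
#include <assert.h>

/* Simplified model of the assertion that panics at spa_misc.c:863.
 * Returns 1 when the solaris assert would hold, 0 when it would panic. */
static int
spa_refcount_check(int spa_refcount, int spa_minref, int namespace_lock_held)
{
	return (spa_refcount >= spa_minref || namespace_lock_held);
}
```

In other words, the panic means the refcount fell below the recorded floor while spa_namespace_lock was not held.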
Comment 1 Alan Somers freebsd_committer 2016-06-20 23:24:15 UTC
Created attachment 171628 [details]
Initialize needs_update in vdev_geom_set_physpath
Comment 2 Alan Somers freebsd_committer 2016-06-20 23:25:07 UTC
Are your disks SAS or SATA?  I can't reproduce this bug using a 4 disk RAIDZ1 SATA pool.  Also, could you please try the attached patch?
Comment 3 Kristof Provost freebsd_committer 2016-06-21 08:06:51 UTC
(In reply to Alan Somers from comment #2)
They're all SATA disks. 3 x 4TB and one 3TB disk.

The patch appears to be working for me, the box boots again.
Comment 4 Alan Somers freebsd_committer 2016-06-21 15:09:30 UTC
I am certain that the patch does not address the root cause of your panic, but I'll commit it anyway.  Thanks for testing it.
Comment 5 Kristof Provost freebsd_committer 2016-06-21 15:10:34 UTC
Let me know if there's anything else I can test, or any more information that would help.
Comment 6 commit-hook freebsd_committer 2016-06-21 15:28:07 UTC
A commit references this bug:

Author: asomers
Date: Tue Jun 21 15:27:16 UTC 2016
New revision: 302058
URL: https://svnweb.freebsd.org/changeset/base/302058

Log:
  Fix uninitialized variable from r300881

  sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
  	Initialize needs_update in vdev_geom_set_physpath

  PR:		210409
  Reported by:	kp
  Reviewed by:	kp
  Approved by:	re (hrs)
  MFC after:	4 weeks
  X-MFC-With:	300881
  Sponsored by:	Spectra Logic Corp

Changes:
  head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
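The commit message describes the classic uninitialized-stack-variable pattern. A hypothetical reduction of the bug class (the function name, parameters, and body below are illustrative assumptions, not the actual vdev_geom.c code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Illustrative reduction: decide whether the vdev config must be
 * dirtied because the physical path appeared or changed. */
static bool
physpath_needs_update(const char *old_physpath, const char *new_physpath)
{
	bool needs_update = false;	/* the fix: start from a defined value */

	if (old_physpath == NULL ||
	    strcmp(old_physpath, new_physpath) != 0)
		needs_update = true;	/* path appeared or changed */
	/* Before the fix, the "paths equal" branch fell through with
	 * needs_update never written, so the caller read stack garbage
	 * and behavior varied from boot to boot. */

	return (needs_update);
}
```

Reading an automatic variable before it is written is undefined behavior in C, which is consistent with a panic that reproduces on some machines and not others.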
Comment 7 Alan Somers freebsd_committer 2016-06-21 15:53:24 UTC
Fixed by 302058
Comment 8 Andriy Gapon freebsd_committer 2016-08-09 13:56:22 UTC
It doesn't seem that the problem was really caused by r300881, nor that it was really fixed by r302058.
Comment 9 Andriy Gapon freebsd_committer 2018-01-09 09:54:29 UTC
I am seeing this problem from time to time (very rarely) in my test VMs.
I suspect that under some conditions there is a race between the thread doing the pool import and the txg sync thread spawned by it. If spa_minref is recorded while the sync thread is holding a reference to the pool, then the recorded value would be higher than it should be.
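That theory can be sketched with a deterministic toy model (the struct, function names, and accounting below are illustrative assumptions, not the real spa code):

```c
#include <assert.h>

/* Toy model of the suspected race: spa_minref is snapshotted from the
 * live refcount at import time, so a transient reference held by the
 * txg sync thread at that instant inflates the floor permanently. */
typedef struct {
	int	spa_refcount;
	int	spa_minref;
} spa_model_t;

/* import-time snapshot of the current reference count */
static void
spa_model_record_minref(spa_model_t *spa)
{
	spa->spa_minref = spa->spa_refcount;
}

/* drop one reference; returns 1 if the spa_misc.c:863 check would fire */
static int
spa_model_close_panics(spa_model_t *spa)
{
	spa->spa_refcount--;
	return (spa->spa_refcount < spa->spa_minref);
}
```

With a clean snapshot the close is fine; if the snapshot races with a transient sync-thread reference, spa_minref ends up one too high and a later, perfectly legitimate close trips the assertion.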
Comment 10 Bryan Drewery freebsd_committer 2018-03-05 16:49:41 UTC
I hit this like 7 times yesterday on r330386.

# zpool status
  pool: scratch
 state: ONLINE
  scan: scrub repaired 0 in 0h20m with 0 errors on Mon Jan 29 10:08:25 2018
config:

        NAME          STATE     READ WRITE CKSUM
        scratch       ONLINE       0     0     0
          gpt/disk2   ONLINE       0     0     0
        logs
          gpt/log1    ONLINE       0     0     0
        cache
          gpt/cache1  ONLINE       0     0     0

errors: No known data errors

  pool: zroot
 state: ONLINE
  scan: scrub repaired 0 in 2h43m with 0 errors on Tue Feb 20 06:42:23 2018
config:

        NAME           STATE     READ WRITE CKSUM
        zroot          ONLINE       0     0     0
          mirror-0     ONLINE       0     0     0
            gpt/disk0  ONLINE       0     0     0
            gpt/disk1  ONLINE       0     0     0
        logs
          gpt/log0     ONLINE       0     0     0
        cache
          gpt/cache0   ONLINE       0     0     0

errors: No known data errors

# gpart show
=>        40  1953525088  ada0  GPT  (932G)
          40         256     1  freebsd-boot  (128K)
         296    16777216     2  freebsd-swap  (8.0G)
    16777512  1936747616     3  freebsd-zfs  (924G)

=>        40  1953525088  ada1  GPT  (932G)
          40         256     1  freebsd-boot  (128K)
         296    16777216     2  freebsd-swap  (8.0G)
    16777512  1936747616     3  freebsd-zfs  (924G)

=>       34  250069613  ada2  GPT  (119G)
         34       2014        - free -  (1.0M)
       2048    2097152     1  freebsd-zfs  (1.0G)
    2099200  104857600     2  freebsd-zfs  (50G)
  106956800    2097152     3  freebsd-zfs  (1.0G)
  109053952  104857600     4  freebsd-zfs  (50G)
  213911552   36158095        - free -  (17G)

=>        40  1953525088  ada3  GPT  (932G)
          40         256     1  freebsd-boot  (128K)
         296   960495616     2  freebsd-zfs  (458G)
   960495912   209715200     3  freebsd-swap  (100G)
  1170211112   783314016        - free -  (374G)
Comment 11 Andriy Gapon freebsd_committer 2018-03-07 08:24:59 UTC
(In reply to Bryan Drewery from comment #10)
This information doesn't tell us anything in particular...
I think that the problem might correlate with ZFS restarting some pool activity after a reboot (like processing async free list).
Comment 12 Bryan Drewery freebsd_committer 2018-03-09 17:32:25 UTC
Something weird: I've consistently seen that if I drop to the loader for a moment and then boot, the problem does not show up.