Bug 202607 - [panic] Poudriere umounting file systems causes 'solaris assert: avl_is_empty(&dn->dn_dbufs)' panic
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: CURRENT
Hardware: Any Any
Importance: --- Affects Some People
Assignee: freebsd-fs (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-08-24 01:21 UTC by Jason Unovitch
Modified: 2015-10-14 17:11 UTC
13 users in CC

See Also:


Attachments
11.0-CURRENT r287028 built 20150823 avl_is_empty panic (6.26 KB, text/plain)
2015-08-24 01:21 UTC, Jason Unovitch
Some debugging data from ZoL (5.32 KB, text/plain)
2015-09-22 18:09 UTC, Andriy Gapon
debugging patch (666 bytes, patch)
2015-09-22 20:03 UTC, Andriy Gapon
full dumps of the dnode and dbuf (12.26 KB, text/plain)
2015-09-22 20:13 UTC, Andriy Gapon
more debug data, some kernel stacks (6.44 KB, text/plain)
2015-09-22 20:20 UTC, Andriy Gapon
Generically handle deferred dbuf eviction (9.58 KB, patch)
2015-09-27 21:43 UTC, Justin T. Gibbs

Description Jason Unovitch freebsd_committer freebsd_triage 2015-08-24 01:21:14 UTC
Created attachment 160278 [details]
11.0-CURRENT r287028 built 20150823 avl_is_empty panic

Panic seen while doing a Poudriere run to test ports.

solaris assert: avl_is_empty(&dn->dn_dbufs), file: /usr/src/head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dnode_sync.c, line: 495

Good - 11.0-CURRENT r286204 built 20150803
Bad  - 11.0-CURRENT built 20150817 (didn't note revision)
Bad  - 11.0-CURRENT r287028 built 20150823

Full panicmail log is in the attachment.
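For reference, the failing assertion sits near the end of dnode_sync_free() in the cited file; a paraphrased sketch of that spot, from memory of the code around that revision rather than an exact quote:

    static void
    dnode_sync_free(dnode_t *dn, dmu_tx_t *tx)
    {
            ...
            /* Try to evict every dbuf still attached to the dnode ... */
            dnode_evict_dbufs(dn);

            /* ... after which nothing may remain in the dbuf tree. */
            ASSERT(avl_is_empty(&dn->dn_dbufs));    /* <-- line 495, the panic */
            ...
    }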
Comment 1 Jason Unovitch freebsd_committer freebsd_triage 2015-09-11 19:24:19 UTC
CC the folks from freebsd-fs mailing list thread,

Has this been fixed?  I'm no longer seeing the panic when a Poudriere run finishes on this build:
FreeBSD xts-bsd 11.0-CURRENT FreeBSD 11.0-CURRENT #1 r287666: Fri Sep 11 14:46:32 UTC 2015     root@xts-bsd:/usr/obj/usr/src/head/sys/GENERIC  amd64

I have /var/crash contents from the earlier r287028 build if needed.
Comment 2 Xin LI freebsd_committer freebsd_triage 2015-09-11 22:44:53 UTC
(In reply to Jason Unovitch from comment #1)
I haven't fixed it (*), and looking at the code I don't think anyone else has touched it either.

But are you running without INVARIANTS?  You would probably not see it on a kernel built without INVARIANTS.

(*) I have an impression that Justin was on it but I could be wrong.

Based on my understanding of Justin's changes, I _think_ the assertion can be safely removed, but I am not yet confident enough to commit that. I'm currently running a kernel from a few days ago with the assertion removed and a few new assertions instrumented, and I haven't seen anything weird so far. However, my new assertions seem to have made an unrelated panic go away, which makes me very uncomfortable committing the change at this time.
Comment 3 Jean-Sébastien Pédron freebsd_committer freebsd_triage 2015-09-11 23:10:27 UTC
I still hit this panic, with Poudriere too. I'm running r287695.
Comment 4 Justin T. Gibbs freebsd_committer freebsd_triage 2015-09-12 04:05:24 UTC
(In reply to Xin LI from comment #2)
I spent some more time this evening reviewing the code. I believe that removing the assert is safe. However, it would be good to assert that the only dbufs on the dnode are either for the bonus buffer or in the evicting state. This should hold true because the dnode free paths at higher levels (e.g. ZPL and ZVOL) enforce mutual exclusion between reads/writes and frees. 

Currently, dbuf_rele_and_unlock() doesn't contain any logic to ensure regular (non-bonus) dbufs are immediately evicted for freed dnodes. So if, due to a bug or regression, a dbuf for a dnode being freed is held during the call to dnode_sync_free(), the dbuf and dnode will stick around in memory until memory pressure evicts the dbuf. From the perspective of dnode_sync_free(), and even of memory usage, this is no big deal, but it probably indicates a bug.
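A minimal sketch of the check being suggested here, assuming the stock avl(9) iterators and the dbuf fields as they existed at the time; this is illustrative only, not a committed patch:

    /* In dnode_sync_free(), instead of insisting that dn_dbufs is empty: */
    dmu_buf_impl_t *db;

    for (db = avl_first(&dn->dn_dbufs); db != NULL;
        db = AVL_NEXT(&dn->dn_dbufs, db)) {
            /* Anything still linked here must already be on its way out. */
            ASSERT(db->db_state == DB_EVICTING);
    }
    /*
     * The bonus buffer lives in dn->dn_bonus rather than dn_dbufs and
     * auto-evicts on freed dnodes, so it needs no explicit check here.
     */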
Comment 5 Jason Unovitch freebsd_committer freebsd_triage 2015-09-13 13:14:13 UTC
Interesting. I thought I might have just missed a fix, since I did 4 Poudriere runs with no panic, but then I hit a panic after that. It's odd that it panics for me, yet if I restart the exact same job after a reboot it usually succeeds.

(In reply to Xin LI from comment #2)
Completely GENERIC for head, so INVARIANTS and the default debugging options are all present.
Comment 6 Sean Bruno freebsd_committer freebsd_triage 2015-09-18 15:06:56 UTC
I can trivially make this happen in the FreeBSD cluster on a single machine.  It "started" doing it while I was testing jails at different revisions.
Comment 7 Andriy Gapon freebsd_committer freebsd_triage 2015-09-22 16:54:27 UTC
FWIW, I am able to reproduce this issue with ZoL. So far the affected dnode has always been of type DMU_OT_BPOBJ (deadlist?).  Not sure if that has any significance.

I'll try to patch the code to remove the original ASSERT and replace it with what Justin suggested.
Comment 8 Andriy Gapon freebsd_committer freebsd_triage 2015-09-22 18:09:08 UTC
Created attachment 161281 [details]
Some debugging data from ZoL

Okay, seems that there is actually a bug, not a race:
VERIFY3(db->db_state == DB_EVICTING) failed (4 == 5)

Some debug info is attached (I am not very fluent with Linux's crash(8) utility).
Comment 9 Justin T. Gibbs freebsd_committer freebsd_triage 2015-09-22 18:14:47 UTC
If I'm reading the output correctly, you're asserting on a bonus buffer. Bonus buffers "auto-evict" on freed dnodes and so are safe to skip.
Comment 10 Andriy Gapon freebsd_committer freebsd_triage 2015-09-22 18:19:45 UTC
(In reply to Justin T. Gibbs from comment #9)

What's the sign that it is a bonus buffer?  I thought a bonus buffer would have a special value in db_blkid (a small negative number cast to a large unsigned one).
But I could be confusing things.
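For what it's worth, the special value being recalled here does exist; a short sketch of the usual test, with DMU_BONUS_BLKID defined in dmu.h as (-1ULL):

    /* A bonus dbuf carries a sentinel block id rather than a real one: */
    if (db->db_blkid == DMU_BONUS_BLKID) {
            /* db is the dnode's bonus buffer */
    }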
Comment 11 Justin T. Gibbs freebsd_committer freebsd_triage 2015-09-22 18:41:24 UTC
I was looking at the stack trace, but I guess that can't be trusted?  It's not clear from the attachment how you got that dbuf pointer, but the bonus buffer shouldn't be in dn_dbufs. So if you are asserting on entries within dn_dbufs and using that to find what asserted, then I guess there is a data dbuf with outstanding references.

Can you provide a complete dump of all of the fields in both the dnode and the dbuf?
Comment 12 Andriy Gapon freebsd_committer freebsd_triage 2015-09-22 20:02:19 UTC
(In reply to Justin T. Gibbs from comment #11)
The stack trace can be trusted if the entries starting with '?' are ignored.
Comment 13 Andriy Gapon freebsd_committer freebsd_triage 2015-09-22 20:03:55 UTC
Created attachment 161282 [details]
debugging patch

This is the patch that I used to check the outstanding buffer(s).
Comment 14 Andriy Gapon freebsd_committer freebsd_triage 2015-09-22 20:13:43 UTC
Created attachment 161283 [details]
full dumps of the dnode and dbuf

(In reply to Justin T. Gibbs from comment #11)

Requested data is attached.
Comment 15 Andriy Gapon freebsd_committer freebsd_triage 2015-09-22 20:20:04 UTC
Created attachment 161285 [details]
more debug data, some kernel stacks

My impression is that the thread with PID 124 was trying to dbuf_rele() the outstanding dbuf concurrently with dnode_sync().
Comment 16 Justin T. Gibbs freebsd_committer freebsd_triage 2015-09-22 22:57:37 UTC
The stack traces indicate that the dataset being destroyed was referenced earlier, creating a separate copy of its in-core state (dsl_dataset_t). This other copy is still being evicted at the time the dataset is destroyed on disk, which is why there is still a reference on the dbuf for the dataset's deadlist. This is harmless, except that the dbuf will hang around longer than necessary.

Immediate eviction in the same way as for bonus buffers could be added to dbuf_rele_and_unlock(), but it cannot be exact without acquiring locks. Dnodes cannot transition from free to allocated until after their last hold is released, so it should be safe to perform a "best effort" check that the dnode is still allocated without locks held, but I'm not sure it is worth it.
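A hedged sketch of that "best effort" idea; the placement inside dbuf_rele_and_unlock() and the exact test for a dnode being freed (dn_free_txg) are my assumptions, not code from any committed change:

    /* In dbuf_rele_and_unlock(), once the hold count reaches zero: */
    boolean_t evict_now = B_FALSE;
    dnode_t *dn;

    DB_DNODE_ENTER(db);
    dn = DB_DNODE(db);
    /*
     * Best-effort, lock-free check: a freed dnode cannot return to the
     * allocated state until its last hold is released, so peeking at it
     * without the dnode lock is good enough here.
     */
    if (db->db_blkid != DMU_BONUS_BLKID && dn->dn_free_txg != 0)
            evict_now = B_TRUE;
    DB_DNODE_EXIT(db);

    if (evict_now)
            dbuf_evict(db);     /* evict immediately, as is done for bonus buffers */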
Comment 17 Andriy Gapon freebsd_committer freebsd_triage 2015-09-23 07:04:03 UTC
I agree that in this case there is no problem and the buffer gets evicted very soon.
However, I wonder if we could somehow avoid this kind of situation in higher-level code, or detect it in dnode_sync_free()...
I mean, removing the assertion(s) is super easy, but we might miss a real bug later on.  Am I too paranoid?
Comment 18 Andriy Gapon freebsd_committer freebsd_triage 2015-09-23 09:33:45 UTC
Another note: if I exclude DMU_OT_BPOBJ dnodes from the check, then the assertion is not hit.
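In code terms, the exclusion that silences the assertion looks roughly like this (a diagnostic aid, not a fix):

    /* Relax the check only for the object type seen in these reports: */
    if (dn->dn_type != DMU_OT_BPOBJ)
            ASSERT(avl_is_empty(&dn->dn_dbufs));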
Comment 19 Justin T. Gibbs freebsd_committer freebsd_triage 2015-09-23 16:46:34 UTC
Thinking about this some more, I think it may be possible for a dbuf for an older version of a dnode to be reused if it hasn't been evicted before the object id is reused. For most objects (e.g. files), there are no dangling references due to dbuf user eviction. But this can happen for datasets, and we can't know the future type of the object that will inherit this object id.

My preference would be a solution that doesn't cause TXG syncing to block. All we need to do is mark all the dbufs as no longer cacheable in dnode_evict_dbufs(), during the pass in which we try to evict them. This would allow dbuf_rele_and_unlock() to determine whether eviction processing is necessary without taking the dnode lock. It looks like dmu_buf_impl_t has internal padding due to the order of its fields, so this state can probably be added without any space penalty.

If you do this, the code to forcibly evict the bonus buffer in dbuf_rele_and_unlock() should be simplified to use the same logic.
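A rough sketch of that approach; the field name db_pending_evict is a placeholder of mine and the fragments below are illustrative, not the patch attached later in this bug:

    /* dbuf.h: one extra byte of state, fitting into existing struct padding */
    uint8_t db_pending_evict;           /* placeholder name */

    /* dnode_evict_dbufs(): set while walking dn_dbufs and trying to evict */
    db->db_pending_evict = TRUE;        /* this dbuf must not be cached again */

    /* dbuf_rele_and_unlock(): when the last hold is dropped */
    if (db->db_pending_evict)
            dbuf_evict(db);             /* decided without taking the dnode lock */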
Comment 20 Will Andrews freebsd_committer freebsd_triage 2015-09-23 19:03:52 UTC
dbuf_rele_and_unlock() already contains logic that detects whether a dbuf is uncacheable and sets the db_is_ephemeral flag.  Maybe it could be extended to allow this flag to be set elsewhere, e.g. under the circumstances Justin proposes, with dbuf_rele_and_unlock() then performing the eviction based on that state.
Comment 21 Will Andrews freebsd_committer freebsd_triage 2015-09-23 19:18:01 UTC
Hmm, looks like I'm referring to local changes I forgot about.
Comment 22 Justin T. Gibbs freebsd_committer freebsd_triage 2015-09-27 21:43:36 UTC
Created attachment 161461 [details]
Generically handle deferred dbuf eviction

This patch was developed and tested using an illumos VM. I don't have access to a physical ZFS test rig right now, so only apply it if you understand the risks and have a real test environment. It passes ztest, but the objset eviction path still needs to be validated via the ZFS test suite.
Comment 23 Andriy Gapon freebsd_committer freebsd_triage 2015-09-28 15:07:06 UTC
(In reply to Justin T. Gibbs from comment #22)

Justin, I am testing this patch now.  Should I expect any verifiable effects from it other than those from the simple removal of the assertion?
Comment 24 Justin T. Gibbs freebsd_committer freebsd_triage 2015-09-28 16:18:13 UTC
(In reply to Andriy Gapon from comment #23)
You should not see any different behavior. The areas that were changed but that I couldn't test locally were filesystem mount/unmount, pool import/export, and completely unloading the module. All three of these exercise the objset eviction path.
Comment 25 Andriy Gapon freebsd_committer freebsd_triage 2015-09-29 10:30:17 UTC
(In reply to Justin T. Gibbs from comment #24)
I do not see any regressions while running typical workloads and tests.
Thank you!
Comment 26 Oliver Pinter freebsd_committer freebsd_triage 2015-10-01 20:06:59 UTC
http://permalink.gmane.org/gmane.os.freebsd.current/165824

#5  0xffffffff8039c391 in zfs_get_data (arg=<value optimized out>,
    lr=<value optimized out>, buf=<value optimized out>, zio=0xfffff8019eeb1760)
    at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c:1355
1355                            ASSERT(error || lr->lr_length <= zp->z_blksz);
(kgdb) l
1350                            ASSERT(db->db_offset == offset);
1351                            ASSERT(db->db_size == size);
1352
1353                            error = dmu_sync(zio, lr->lr_common.lrc_txg,
1354                                zfs_get_done, zgd);
1355                            ASSERT(error || lr->lr_length <= zp->z_blksz);
1356
1357                            /*
1358                             * On success, we need to wait for the write I/O
1359                             * initiated by dmu_sync() to complete before we can
(kgdb) p *lr
Cannot access memory at address 0xa5a5a5a5a5a5a5a5
(kgdb) p *zp
Cannot access memory at address 0xa5a5a5a5a5a5a5a5
(kgdb)
Comment 27 Justin T. Gibbs freebsd_committer freebsd_triage 2015-10-01 20:12:53 UTC
(In reply to Oliver Pinter from comment #26)

Is there a reason you believe this crash is caused by this bug?
Comment 28 Oliver Pinter freebsd_committer freebsd_triage 2015-10-01 21:19:32 UTC
The INVARIANTS output implicates the newly added use-after-free checker in UMA, which also explains why this bug goes away when INVARIANTS is disabled.
Comment 29 Justin T. Gibbs freebsd_committer freebsd_triage 2015-10-01 21:21:49 UTC
(In reply to Oliver Pinter from comment #28)

This bug is about a particular assert firing. It is not a catchall for all invariants that fire in ZFS. Please open another bug for your issue.
Comment 31 Oliver Pinter freebsd_committer freebsd_triage 2015-10-01 21:27:05 UTC
Ahh, sorry, ignore my previous comment; it is getting a bit late here. I was referring to another ZFS-related assert/bug.
Comment 32 Justin T. Gibbs freebsd_committer freebsd_triage 2015-10-05 16:46:58 UTC
(In reply to Andriy Gapon from comment #25)
Can you add feedback to https://reviews.csiden.org/r/245/?
Comment 33 mikej 2015-10-06 17:36:28 UTC
I have been testing the "Generically handle deferred dbuf eviction" patch under both r288019 and r288826 for several days and I can no longer provoke this panic.  Without the patch it was easy to get the machine to panic.
Comment 34 Mark Felder freebsd_committer freebsd_triage 2015-10-09 15:01:21 UTC
I've been running this for almost a week now with no crashes.

10:00:54 skeletor:~ > uptime
10:00AM  up 6 days,  3:06, 5 users, load averages: 0.07, 0.12, 0.09
10:00:55 skeletor:~ > uname -a
FreeBSD skeletor.feld.me 11.0-CURRENT FreeBSD 11.0-CURRENT #46 r288524M: Sat Oct  3 06:41:42 CDT 2015     root@skeletor.feld.me:/usr/obj/usr/src/sys/GENERIC  amd64
Comment 35 Sean Bruno freebsd_committer freebsd_triage 2015-10-10 23:05:14 UTC
I'm super happy with the lack of assert -> panic with this patch.  Can we fire it into the tree now?
Comment 36 Jason Unovitch freebsd_committer freebsd_triage 2015-10-10 23:26:23 UTC
Me 4!  Thanks for the great work!

Uptime is about 24 hours on "FreeBSD 11.0-CURRENT #0 r289076M" with only the "generically handle deferred dbuf eviction" patch applied to src.  I've fired off multiple panic-free Poudriere builds testing out some port updates since the update.
Comment 37 Justin T. Gibbs freebsd_committer freebsd_triage 2015-10-11 04:51:19 UTC
(In reply to Sean Bruno from comment #35)
The patch has been accepted for integration upstream in illumos and should land in the next day or two. I'm sure Xin will pull it into FreeBSD as soon as it does.
Comment 38 Justin T. Gibbs freebsd_committer freebsd_triage 2015-10-14 17:11:56 UTC
https://svnweb.freebsd.org/changeset/base/289309