Seen on CURRENT (as of r340474) when receiving an incremental stream (zfs send -e) from an 11.2-RELEASE system.

Panic message:

panic: solaris assert: db->db_buf == ((void *)0) (0xfffff801f7f43e40 == 0x0), file: /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dbuf.c, line: 1097
cpuid = 6
time = 1542352871
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00b12f9940
vpanic() at vpanic+0x1a3/frame 0xfffffe00b12f99a0
panic() at panic+0x43/frame 0xfffffe00b12f9a00
assfail3() at assfail3+0x2c/frame 0xfffffe00b12f9a20
dmu_buf_will_fill() at dmu_buf_will_fill+0x24b/frame 0xfffffe00b12f9a60
dmu_buf_write_embedded() at dmu_buf_write_embedded+0xe0/frame 0xfffffe00b12f9ab0
dmu_write_embedded() at dmu_write_embedded+0xea/frame 0xfffffe00b12f9b10
receive_writer_thread() at receive_writer_thread+0x513/frame 0xfffffe00b12f9bb0
fork_exit() at fork_exit+0x84/frame 0xfffffe00b12f9bf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00b12f9bf0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
KDB: enter: panic
[ thread pid 21 tid 101208 ]
Stopped at      kdb_enter+0x3b: movq    $0,kdb_why

db> where
Tracing pid 21 tid 101208 td 0xfffff8001cb49580
kdb_enter() at kdb_enter+0x3b/frame 0xfffffe00b12f9940
vpanic() at vpanic+0x1c0/frame 0xfffffe00b12f99a0
panic() at panic+0x43/frame 0xfffffe00b12f9a00
assfail3() at assfail3+0x2c/frame 0xfffffe00b12f9a20
dmu_buf_will_fill() at dmu_buf_will_fill+0x24b/frame 0xfffffe00b12f9a60
dmu_buf_write_embedded() at dmu_buf_write_embedded+0xe0/frame 0xfffffe00b12f9ab0
dmu_write_embedded() at dmu_write_embedded+0xea/frame 0xfffffe00b12f9b10
receive_writer_thread() at receive_writer_thread+0x513/frame 0xfffffe00b12f9bb0
fork_exit() at fork_exit+0x84/frame 0xfffffe00b12f9bf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00b12f9bf0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---

It seems that the call path was:

  receive_writer_thread -> receive_process_record (DRR_WRITE_EMBEDDED) ->
  receive_write_embedded -> dmu_write_embedded -> dmu_buf_hold_noread()

and then:

  dmu_buf_write_embedded -> dmu_buf_will_not_fill -> dmu_buf_will_fill ->
  dbuf_noread (db->db_state == DB_NOFILL) -> dbuf_clear_data ->
  ASSERT3P(db->db_buf, ==, NULL);

and the assertion failed there. Looking at the code, dmu_buf_will_not_fill unconditionally sets db_state to DB_NOFILL, but for some reason the dbuf may still have a db_buf attached (perhaps it was DB_CACHED?).

I can't reproduce the issue when -e is not used on the sending side.
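For reference, here is a heavily simplified paraphrase of the dbuf.c/dmu.c logic I believe is involved. This is my own sketch, not the literal FreeBSD sources (function bodies differ between revisions, and the non-relevant states and waits are omitted); it is only meant to show how a dbuf that still has an ARC buffer attached can reach the failing ASSERT3P() once DB_NOFILL is forced on it:

/*
 * Simplified paraphrase of the suspected path (not the literal sources).
 * The DRR_WRITE_EMBEDDED record is applied to a dbuf obtained via
 * dmu_buf_hold_noread(), so the dbuf may still be DB_CACHED with an
 * ARC buffer (db->db_buf) attached from earlier activity.
 */
void
dmu_buf_will_not_fill(dmu_buf_t *db_fake, dmu_tx_t *tx)
{
	dmu_buf_impl_t *db = (dmu_buf_impl_t *)db_fake;

	/* Unconditional: a DB_CACHED state is simply overwritten here. */
	db->db_state = DB_NOFILL;

	dmu_buf_will_fill(db_fake, tx);		/* ends up in dbuf_noread() */
}

void
dbuf_noread(dmu_buf_impl_t *db)	/* paraphrased; wait/error handling omitted */
{
	mutex_enter(&db->db_mtx);
	if (db->db_state == DB_UNCACHED) {
		ASSERT(db->db_buf == NULL);
		/* normal case: attach a fresh ARC buffer and go to DB_FILL */
	} else if (db->db_state == DB_NOFILL) {
		/* assumes no buffer is attached -- but here one still is */
		dbuf_clear_data(db);
	}
	mutex_exit(&db->db_mtx);
}

static void
dbuf_clear_data(dmu_buf_impl_t *db)
{
	ASSERT(MUTEX_HELD(&db->db_mtx));
	dbuf_evict_user(db);
	ASSERT3P(db->db_buf, ==, NULL);		/* <-- dbuf.c:1097, fires here */
	db->db.db_data = NULL;
	if (db->db_state != DB_NOFILL)
		db->db_state = DB_UNCACHED;
}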
I am hitting this same assert when receiving streams with embedded blocks (zfs send -e)
Does anyone have a small reproducer for this issue, or a small send stream they could upload? We haven't seen this on Illumos or Linux, but I'd be happy to run a reproducer to see if the same issue crops up.
I've just reproduced it in a clean environment. I suspect it is related to my recent patch fixing spill block loss on block size change (https://svnweb.freebsd.org/base?view=revision&revision=340096).

Here is the reproduction scenario:

# zfs create ssd/zzz
# dd if=/dev/zero of=/ssd/zzz/f bs=4k count=1
1+0 records in
1+0 records out
4096 bytes transferred in 0.001220 secs (3356604 bytes/sec)
# dd if=/dev/random of=/ssd/zzz/f bs=4 count=1 conv=notrunc
1+0 records in
1+0 records out
4 bytes transferred in 0.002598 secs (1540 bytes/sec)
# zfs snapshot ssd/zzz@a
# zfs send -e ssd/zzz@a | zfs receive ssd/xxx
# dd if=/dev/zero of=/ssd/zzz/f bs=16k count=1 conv=notrunc
1+0 records in
1+0 records out
16384 bytes transferred in 0.002887 secs (5674227 bytes/sec)
# dd if=/dev/random of=/ssd/zzz/f bs=4 count=1 conv=notrunc
1+0 records in
1+0 records out
4 bytes transferred in 0.000199 secs (20141 bytes/sec)
# zfs snapshot ssd/zzz@b
# zfs send -e -i a ssd/zzz@b | zfs receive ssd/xxx

panic: solaris assert: db->db_buf == ((void *)0) (0xfffff80116233c00 == 0x0), file: /sources/FreeBSD/head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dbuf.c, line: 1097
cpuid = 44
time = 1542744219
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe010bc1f800
vpanic() at vpanic+0x1a3/frame 0xfffffe010bc1f860
panic() at panic+0x43/frame 0xfffffe010bc1f8c0
assfail3() at assfail3+0x2c/frame 0xfffffe010bc1f8e0
dmu_buf_will_fill() at dmu_buf_will_fill+0x24b/frame 0xfffffe010bc1f920
dmu_buf_write_embedded() at dmu_buf_write_embedded+0xe0/frame 0xfffffe010bc1f970
dmu_write_embedded() at dmu_write_embedded+0xea/frame 0xfffffe010bc1f9d0
receive_writer_thread() at receive_writer_thread+0x513/frame 0xfffffe010bc1fa70
fork_exit() at fork_exit+0x84/frame 0xfffffe010bc1fab0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe010bc1fab0
Yes, r340096 is what is causing this. Now I need to figure out why.
A commit references this bug:

Author: mav
Date: Wed Nov 21 18:18:57 UTC 2018
New revision: 340737
URL: https://svnweb.freebsd.org/changeset/base/340737

Log:
  Revert r340096: 9952 Block size change during zfs receive drops spill block

  It was reported, and I easily reproduced it, that this change triggers panic
  when receiving replication stream with enabled embedded blocks, when short
  file compressing into one embedded block changes its block size.  I am not
  sure that the problem is in this particular patch, not just triggered by it,
  but since investigation and fix will take some time, I've decided to revert
  this for now.

  PR:	198457, 233277

Changes:
  head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_send.c
I don't know if this is helpful, but the panic only seems to happen on CURRENT. I can't get it to trigger when applying r340096 to 11.2-RELEASE or 12.0-RELEASE: there the patch seems to fix the spill block issue and I haven't been able to trip the panic on either branch, while I can trip it on CURRENT. This is with very light testing so far.
No, it isn't helpful: the panic doesn't happen there only because INVARIANTS isn't enabled on the release branches. Enabling it brings the panic back. Sorry for the noise.
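For completeness, a conceptual sketch of why the release kernels stay quiet. This is my own paraphrase, not the literal FreeBSD/OpenSolaris compat debug headers (the real definitions go through assfail3() and differ in detail): without INVARIANTS the ZFS assertion macros compile to nothing, so the bad dbuf state is silently ignored instead of panicking.

/*
 * Conceptual sketch only -- the real macros live in the opensolaris
 * compat debug headers and are implemented differently.
 */
#ifdef INVARIANTS	/* GENERIC on 11.2/12.0-RELEASE is built without this */
#define	ASSERT3P(l, op, r)						\
	((void)(((uintptr_t)(l) op (uintptr_t)(r)) ||			\
	    (panic("assertion failed: %s %s %s, file: %s, line: %d",	\
	    #l, #op, #r, __FILE__, __LINE__), 0)))
#else
#define	ASSERT3P(l, op, r)	((void)0)	/* compiled away on release kernels */
#endif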
Seems like base r340737 fixed this one. Not sure why it is still open.