Bug 233277 - Panic when receiving incremental ZFS send (solaris assert: db->db_buf == ((void *)0))
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: CURRENT
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: Alexander Motin
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-11-17 21:28 UTC by Xin LI
Modified: 2019-09-19 17:49 UTC (History)
7 users

See Also:


Attachments

Description Xin LI freebsd_committer freebsd_triage 2018-11-17 21:28:45 UTC
Seen on CURRENT (as of r340474) when receiving an incremental stream (zfs send -e) from an 11.2-RELEASE system.

Panic message:

panic: solaris assert: db->db_buf == ((void *)0) (0xfffff801f7f43e40 == 0x0), file: /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dbuf.c, line: 1097
cpuid = 6
time = 1542352871
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00b12f9940
vpanic() at vpanic+0x1a3/frame 0xfffffe00b12f99a0
panic() at panic+0x43/frame 0xfffffe00b12f9a00
assfail3() at assfail3+0x2c/frame 0xfffffe00b12f9a20
dmu_buf_will_fill() at dmu_buf_will_fill+0x24b/frame 0xfffffe00b12f9a60
dmu_buf_write_embedded() at dmu_buf_write_embedded+0xe0/frame 0xfffffe00b12f9ab0
dmu_write_embedded() at dmu_write_embedded+0xea/frame 0xfffffe00b12f9b10
receive_writer_thread() at receive_writer_thread+0x513/frame 0xfffffe00b12f9bb0
fork_exit() at fork_exit+0x84/frame 0xfffffe00b12f9bf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00b12f9bf0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
KDB: enter: panic
[ thread pid 21 tid 101208 ]
Stopped at      kdb_enter+0x3b: movq    $0,kdb_why
db> where
Tracing pid 21 tid 101208 td 0xfffff8001cb49580
kdb_enter() at kdb_enter+0x3b/frame 0xfffffe00b12f9940
vpanic() at vpanic+0x1c0/frame 0xfffffe00b12f99a0
panic() at panic+0x43/frame 0xfffffe00b12f9a00
assfail3() at assfail3+0x2c/frame 0xfffffe00b12f9a20
dmu_buf_will_fill() at dmu_buf_will_fill+0x24b/frame 0xfffffe00b12f9a60
dmu_buf_write_embedded() at dmu_buf_write_embedded+0xe0/frame 0xfffffe00b12f9ab0
dmu_write_embedded() at dmu_write_embedded+0xea/frame 0xfffffe00b12f9b10
receive_writer_thread() at receive_writer_thread+0x513/frame 0xfffffe00b12f9bb0
fork_exit() at fork_exit+0x84/frame 0xfffffe00b12f9bf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00b12f9bf0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---


It seems that the call path was:

receive_writer_thread -> receive_process_record (DRR_WRITE_EMBEDDED) ->
receive_write_embedded -> dmu_write_embedded -> dmu_buf_hold_noread()

Then:

dmu_buf_write_embedded -> dmu_buf_will_not_fill -> dmu_buf_will_fill ->
dbuf_noread (db->db_state == DB_NOFILL) -> dbuf_clear_data,
which failed the ASSERT3P(db->db_buf, ==, NULL) check.

Looking at the code, it seems that dmu_buf_will_not_fill unconditionally
sets db_state to DB_NOFILL, but for some reason the dbuf might still have
a db_buf attached (perhaps from the DB_CACHED state?).

I can't reproduce the issue when -e is not used on the sending side.
Comment 1 Allan Jude freebsd_committer freebsd_triage 2018-11-19 18:57:49 UTC
I am hitting this same assert when receiving streams with embedded blocks (zfs send -e)
Comment 2 Paul Dagnelie 2018-11-20 20:15:37 UTC
Does anyone have a small reproducer for this issue, or a small send stream they could upload? We haven't seen this on Illumos or Linux, but I'd be happy to run a reproducer to see if the same issue crops up.
Comment 3 Alexander Motin freebsd_committer freebsd_triage 2018-11-20 20:24:26 UTC
I've just reproduced it in a clean environment.  I suspect it is related to my recent patch fixing spill block loss on block size change (https://svnweb.freebsd.org/base?view=revision&revision=340096).

Here is the reproduction scenario:

# zfs create ssd/zzz
# dd if=/dev/zero of=/ssd/zzz/f bs=4k count=1
1+0 records in
1+0 records out
4096 bytes transferred in 0.001220 secs (3356604 bytes/sec)
# dd if=/dev/random of=/ssd/zzz/f bs=4 count=1 conv=notrunc
1+0 records in
1+0 records out
4 bytes transferred in 0.002598 secs (1540 bytes/sec)
# zfs snapshot ssd/zzz@a
# zfs send -e ssd/zzz@a |zfs receive ssd/xxx
# dd if=/dev/zero of=/ssd/zzz/f bs=16k count=1 conv=notrunc
1+0 records in
1+0 records out
16384 bytes transferred in 0.002887 secs (5674227 bytes/sec)
# dd if=/dev/random of=/ssd/zzz/f bs=4 count=1 conv=notrunc
1+0 records in
1+0 records out
4 bytes transferred in 0.000199 secs (20141 bytes/sec)
# zfs snapshot ssd/zzz@b
# zfs send -e -i a ssd/zzz@b | zfs receive ssd/xxx

panic: solaris assert: db->db_buf == ((void *)0) (0xfffff80116233c00 == 0x0), file: /sources/FreeBSD/head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dbuf.c, line: 1097
cpuid = 44
time = 1542744219
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe010bc1f800
vpanic() at vpanic+0x1a3/frame 0xfffffe010bc1f860
panic() at panic+0x43/frame 0xfffffe010bc1f8c0
assfail3() at assfail3+0x2c/frame 0xfffffe010bc1f8e0
dmu_buf_will_fill() at dmu_buf_will_fill+0x24b/frame 0xfffffe010bc1f920
dmu_buf_write_embedded() at dmu_buf_write_embedded+0xe0/frame 0xfffffe010bc1f970
dmu_write_embedded() at dmu_write_embedded+0xea/frame 0xfffffe010bc1f9d0
receive_writer_thread() at receive_writer_thread+0x513/frame 0xfffffe010bc1fa70
fork_exit() at fork_exit+0x84/frame 0xfffffe010bc1fab0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe010bc1fab0
Comment 4 Alexander Motin freebsd_committer freebsd_triage 2018-11-20 20:41:03 UTC
Yes, r340096 is what is causing this.  Now I need to figure out why.
Comment 5 commit-hook freebsd_committer freebsd_triage 2018-11-21 18:19:46 UTC
A commit references this bug:

Author: mav
Date: Wed Nov 21 18:18:57 UTC 2018
New revision: 340737
URL: https://svnweb.freebsd.org/changeset/base/340737

Log:
  Revert r340096: 9952 Block size change during zfs receive drops spill block

  It was reported, and I easily reproduced it, that this change triggers a panic
  when receiving a replication stream with embedded blocks enabled, when a short
  file compressing into one embedded block changes its block size.  I am not
  sure that the problem is in this particular patch, rather than merely triggered
  by it, but since investigation and fix will take some time, I've decided to
  revert this for now.

  PR:		198457, 233277

Changes:
  head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_send.c
Comment 6 Nathan Huff 2019-04-05 21:10:54 UTC
Not sure if this is helpful, but the panic only seems to happen on CURRENT.  I can't get it to trigger when applying r340096 to 11.2-RELEASE or 12.0-RELEASE: there it seems to fix the spill block issue, and I haven't been able to trip the panic on either branch, while I can trip it on CURRENT.  This is with very light testing so far.
Comment 7 Nathan Huff 2019-04-07 20:58:39 UTC
No, it isn't helpful: the panic doesn't happen on the release branches only because invariants aren't enabled there.  Enabling them brings the panic back.  Sorry for the noise.
Comment 8 Andriy Gapon freebsd_committer freebsd_triage 2019-09-19 13:32:06 UTC
Seems like base r340737 fixed this one.
Not sure why it is still open.