Bug 238333 - bhyve random crash in rfb.c on FreeBSD current (after r346011)
Summary: bhyve random crash in rfb.c on FreeBSD current (after r346011)
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: bhyve (show other bugs)
Version: CURRENT
Hardware: Any Any
: --- Affects Only Me
Assignee: Mark Johnston
URL:
Keywords: crash, needs-qa, regression
Depends on:
Blocks:
 
Reported: 2019-06-05 09:02 UTC by Oleg Ginzburg
Modified: 2019-06-26 17:43 UTC (History)
4 users (show)

See Also:


Attachments
bhyve bin + bhyve.core + bhyve.debug (586.52 KB, application/x-compressed-tar)
2019-06-10 09:33 UTC, Oleg Ginzburg
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Oleg Ginzburg 2019-06-05 09:02:44 UTC
After -O0 flag removal ( https://svnweb.freebsd.org/base?view=revision&revision=r346011 ) bhyve process begins to crash with an active VNC session.

Not all VNC clients can crash. It seems that it depends on the default settings (some use the default compression by default, some do not), for example, it does not cause problems when using vncviewer (without any arguments) from tigervnc-viewer, but is well reproduced using KDE's KRDC: net/krdc ).

When you connect to VNC console via KRDC bhyve process crashed with SIGBUS:

pid 88831 (bhyve), jid 0, uid 0: exited on signal 10 (core dumped)

Backtrace:

Thread 29 "rfbout" received signal SIGBUS, Bus error.
[Switching to LWP 101756 of process 93578]
memcpy () at /usr/src/lib/libc/amd64/string/memmove.S:306
306             MEMMOVE erms=0 overlap=1 begin=MEMMOVE_BEGIN end=MEMMOVE_END

(gdb) bt
#0  memcpy () at /usr/src/lib/libc/amd64/string/memmove.S:306
#1  0x00000008002c85dc in flush_pending (strm=<optimized out>) at /usr/src/sys/contrib/zlib/deflate.c:741
#2  deflate (strm=0x800bb1030, flush=2) at /usr/src/sys/contrib/zlib/deflate.c:787
#3  0x000000000023f141 in rfb_send_rect (rc=<optimized out>, cfd=<optimized out>, gc=<optimized out>, x=0, y=0, w=128, h=32) at /usr/src/usr.sbin/bhyve/rfb.c:355
#4  rfb_send_screen (rc=<optimized out>, cfd=6, all=<optimized out>) at /usr/src/usr.sbin/bhyve/rfb.c:600
#5  0x000000000023f5c7 in rfb_wr_thr (arg=0x800bb1000) at /usr/src/usr.sbin/bhyve/rfb.c:730
#6  0x000000080064b726 in thread_start (curthread=0x800be1000) at /usr/src/lib/libthr/thread/thr_create.c:291
#7  0x0000000000000000 in ?? ()
Backtrace stopped: Cannot access memory at address 0x7fffdc9e3000

(gdb) p len
$1 = 9600016

print sizeof(strm)
$2 = 8

(gdb) print sizeof(s)
$3 = 8

There are no problems when bhyve is compiled without optimization (-O0), so we do not see this problem in FreeBSD <= 12.0, but the problem occurs in FreeBSD-current
Comment 1 Conrad Meyer freebsd_committer freebsd_triage 2019-06-05 15:18:12 UTC
Stack track indicates that zlib compression of some kind is required to repro.  Thanks for reporting.
Comment 2 Conrad Meyer freebsd_committer freebsd_triage 2019-06-05 16:11:27 UTC
Ok, here's what rfb.c tries to do:

1. Allocate a single zstream per rfb_softc.  (fine)
2. Handle a single connection at a time per rfb_softc.  (fine)
3. Periodically re-send the screen from a co-thread.  (fine? could just be
   done from main rfb_handle thread)
4. Serialize screen send using a mutex + flag.  (fine)

Here is some dubious stuff:

1. Client commands, including "set encoding" (i.e., RAW, ZLIB, RESIZE) are not serialized against periodic worker thread.
   a. Also including things like "change dimensions of the screen" and "enable
      ZLIB".

2. The zstream is never reinitialized, but reused.  This may be desirable
   (i.e., better compression is achieved if the client can reliably inflate
   references to historical stream data).

3. Client is allowed to arbitrarily update softc "height" and "width" fields,
   although amusingly these are never used?

4. zbuf isn't really sized for totally incompressible full size maximal
   screens?  2000*1200*4 is 9.1 MB and 16 bytes slop is definitely not enough
   for zlib --fast overhead.

---

The line number in deflate() and stack in flush_pending() suggest that assumptions in rfb.c were violated.  rfb initiates every single operation with Z_SYNC_FLUSH, which indicates that all input should be processed and nothing should be left pending in zstream's output buffer.  However, it does nothing to *verify* this assumption (avail_out >0).  I'm not immediately seeing how this leads to this crash, though.

E.g.:

  $ dd if=/dev/urandom bs=$((2000*1200*4)) count=1 of=./randomscreen.dat
  9600000 bytes transferred in 0.026229 secs (366010901 bytes/sec)
  $ gzip --fast --no-name ./randomscreen.dat
  $ ls -l ./randomscreen.dat.gz
  ... 9602938 Jun  5 09:04 ./randomscreen.dat.gz

So that's 2938 bytes of zlib overhead for a maximally sized, incompressible screen.  Subtract a handful of bytes for gzip header/trailer (maybe 32) and another 16 for the manual slop in rfb.c and you're still short 2.8 kB re: unchecked assumptions made in rfb.c zbuf sizing.  (And to what end?  zlib can
operate efficiently with much smaller buffers.)

Can you also print `*stream` and `*s`?  What are avail_out, etc?  What size of
framebuffer are you attempting to use?  Do you have a core you can share or repro steps I could follow?

Thanks!
Comment 3 Conrad Meyer freebsd_committer freebsd_triage 2019-06-05 16:17:38 UTC
I should add, if you still have gdb open:

(In reply to olevole from comment #0)
> Thread 29 "rfbout" received signal SIGBUS, Bus error.
> [Switching to LWP 101756 of process 93578]
> memcpy () at /usr/src/lib/libc/amd64/string/memmove.S:306
> 306             MEMMOVE erms=0 overlap=1 begin=MEMMOVE_BEGIN end=MEMMOVE_END
> (gdb) bt
> #0  memcpy () at /usr/src/lib/libc/amd64/string/memmove.S:306

'info registers' might be interesting here, in tandem with 'show disas' or 'x/i $rip'.

> (gdb) p len
> $1 = 9600016

That's our 2000*1200*4 + 16 magic number from earlier.

> print sizeof(strm)
> $2 = 8
> 
> (gdb) print sizeof(s)
> $3 = 8

These are just the pointers.  Please 'print *strm' and '*s'.
Comment 4 Mark Johnston freebsd_committer freebsd_triage 2019-06-05 16:57:46 UTC
I tried using krdc to connect to a UEFI-booted FreeBSD VM and haven't been able to trigger a crash so far.  I verified with gdb that the rfd thread has enc_zlib_ok = true.
Comment 5 Rodney W. Grimes freebsd_committer freebsd_triage 2019-06-05 17:05:40 UTC
Try using tightvnc, which iirc itself crashes in many situations due to rect outside limits.  I think this is caused by what cem spotted in that we ignore the viewer set size stuff, and tightvnc almost always sends a resize on startup to match what ever the display window is, ignoring the window size sent by the server, which probably leads to the rect outside limits error.

Our bhyve/vnc implementation is extremly finicky about how it works with clients,
most fail infact when doing anything but very simple stuff.
Comment 6 Rodney W. Grimes freebsd_committer freebsd_triage 2019-06-05 17:07:05 UTC
(In reply to Conrad Meyer from comment #2)
> 3. Client is allowed to arbitrarily update softc "height" and "width" fields,
>   although amusingly these are never used?

I think fixing this bug would fix a major headache many people have had, including the tossing of illegal rect or outside limit rect errors in the client.
Comment 7 Peter Grehan freebsd_committer freebsd_triage 2019-06-05 18:49:47 UTC
(In reply to Conrad Meyer from comment #2)

The maximum screen size is actually 1920 x 1200 due to clamping in pci_fbuf.c so this may allow enough overhead.
Comment 8 Conrad Meyer freebsd_committer freebsd_triage 2019-06-05 18:57:00 UTC
(In reply to Peter Grehan from comment #7)
> (In reply to Conrad Meyer from comment #2)
> 
> The maximum screen size is actually 1920 x 1200 due to clamping in
> pci_fbuf.c so this may allow enough overhead.

Yes, that provides enough overhead (80/2000 ~= 4%).  I get 9,218,823 bytes with gzip --fast of 1920x1200x4 random data (~0.03% overhead), which is << 9,600,016.  So at least buffer size + Z_FLUSH_SYNC is not a problem per se, although basically all of that should still be fixed.
Comment 9 Oleg Ginzburg 2019-06-06 09:34:01 UTC
(In reply to Conrad Meyer from comment #3)


memcpy () at /usr/src/lib/libc/amd64/string/memmove.S:306                       
306             MEMMOVE erms=0 overlap=1 begin=MEMMOVE_BEGIN end=MEMMOVE_END    
(gdb) info registers                                                            
rax            0x944e00180         39810236800                                  
rbx            0x941a7e030         39756226608                                  
rcx            0x124f82            1200002                                      
rdx            0x927c10            9600016                                      
rsi            0xa5a5a5a5a5a5a5a5  -6510615555426900571                         
rdi            0x944e00180         39810236800                                  
rbp            0x7fffdc7e1db0      0x7fffdc7e1db0                               
rsp            0x7fffdc7e1d68      0x7fffdc7e1d68                               
r8             0x5a5a5a639f3a5bdb  6510615595237137371                          
r9             0x0                 0
r10            0x0                 0
r11            0x246               582
r12            0x927c10            9600016
r13            0x927c10            9600016
r14            0x941b5f800         39757150208
r15            0x2                 2
rip            0x80083da1a         0x80083da1a <memcpy+266>
eflags         0x10206             [ PF IF RF ]
cs             0x43                67
ss             0x3b                59
ds             <unavailable>
es             <unavailable>
fs             <unavailable>
gs             <unavailable>
fs_base        0x941a0c8d0         39755761872
gs_base        0x0


(gdb) x/10i $rip
=> 0x80083da1a <memcpy+266>:    rep movsq %ds:(%rsi),%es:(%rdi)
   0x80083da1d <memcpy+269>:    mov    %rdx,%rcx
   0x80083da20 <memcpy+272>:    and    $0x7,%ecx
   0x80083da23 <memcpy+275>:    jne    0x80083d9c0 <memcpy+176>
   0x80083da25 <memcpy+277>:    retq   
   0x80083da26 <memcpy+278>:    mov    (%rsi),%r8
   0x80083da29 <memcpy+281>:    mov    0x8(%rsi),%r9
   0x80083da2d <memcpy+285>:    mov    %rdi,%r10
   0x80083da30 <memcpy+288>:    mov    %rdi,%rcx
   0x80083da33 <memcpy+291>:    and    $0xf,%rcx


Unfortunately I can not see the information for struct due to bhyve compiled without -O0:

(gdb) up
#1  0x00000008002c85dc in flush_pending (strm=<optimized out>) at /usr/src/sys/contrib/zlib/deflate.c:741
741         zmemcpy(strm->next_out, s->pending_out, len);
(gdb) print *strm
value has been optimized out
(gdb) print *s
value has been optimized out


But with -O0 i can't get crash
Comment 10 Mark Johnston freebsd_committer freebsd_triage 2019-06-06 14:43:58 UTC
(In reply to olevole from comment #9)
Please go to frame 2 and run

p *strm
p *(struct internal_state *)strm->state
Comment 11 Oleg Ginzburg 2019-06-06 15:38:46 UTC
(In reply to Mark Johnston from comment #10)

#2  deflate (strm=0x800bb1030, flush=2) at /usr/src/sys/contrib/zlib/deflate.c:787
787             flush_pending(strm);                                            

(gdb) p *strm                                                                   
$1 = {next_in = 0x941e00000 "", avail_in = 128, total_in = 0, next_out = 0x944c00180 '\245' <repeats 200 times>..., avail_out = 9600016, total_out = 0, msg = 0x0, state = 0x800bf0800, 
  zalloc = 0x8002d2110 <zcalloc>, zfree = 0x8002d2120 <zcfree>, opaque = 0x0, data_type = 0, adler = 1, reserved = 0}


(gdb) p *(struct internal_state *)strm->state
$2 = {strm = 0x800bb1030, status = 42, pending_buf = 0xa5a5a5a5a5a5a5a5 <error: Cannot access memory at address 0xa5a5a5a5a5a5a5a5>, pending_buf_size = 11936128518282651045, 
  pending_out = 0xa5a5a5a5a5a5a5a5 <error: Cannot access memory at address 0xa5a5a5a5a5a5a5a5>, pending = 11936128518282651045, wrap = 1, gzhead = 0x0, gzindex = 11936128518282651045, method = 165 '\245', 
  last_flush = 2, w_size = 32768, w_bits = 15, w_mask = 32767, window = 0x9433a9140 '\245' <repeats 200 times>..., window_size = 11936128518282651045, prev = 0x9433ba580, head = 0x9433cb780, 
  ins_h = 2779096485, hash_size = 32768, hash_bits = 15, hash_mask = 32767, hash_shift = 5, block_start = -6510615555426900571, match_length = 2779096485, prev_match = 2779096485, 
  match_available = -1515870811, strstart = 2779096485, match_start = 2779096485, lookahead = 2779096485, prev_length = 2779096485, max_chain_length = 2779096485, max_lazy_match = 2779096485, 
  level = -1515870811, strategy = -1515870811, good_match = 2779096485, nice_match = -1515870811, dyn_ltree = {{fc = {freq = 42405, code = 42405}, dl = {dad = 42405, len = 42405}} <repeats 573 times>}, 
  dyn_dtree = {{fc = {freq = 42405, code = 42405}, dl = {dad = 42405, len = 42405}} <repeats 61 times>}, bl_tree = {{fc = {freq = 42405, code = 42405}, dl = {dad = 42405, len = 42405}} <repeats 39 times>}, 
  l_desc = {dyn_tree = 0xa5a5a5a5a5a5a5a5, max_code = -1515870811, stat_desc = 0xa5a5a5a5a5a5a5a5}, d_desc = {dyn_tree = 0xa5a5a5a5a5a5a5a5, max_code = -1515870811, stat_desc = 0xa5a5a5a5a5a5a5a5}, bl_desc = {
    dyn_tree = 0xa5a5a5a5a5a5a5a5, max_code = -1515870811, stat_desc = 0xa5a5a5a5a5a5a5a5}, bl_count = {42405 <repeats 16 times>}, heap = {-1515870811 <repeats 573 times>}, heap_len = -1515870811, 
  heap_max = -1515870811, depth = '\245' <repeats 573 times>, l_buf = 0xa5a5a5a5a5a5a5a5 <error: Cannot access memory at address 0xa5a5a5a5a5a5a5a5>, lit_bufsize = 16384, last_lit = 2779096485, 
  d_buf = 0xa5a5a5a5a5a5a5a5, opt_len = 11936128518282651045, static_len = 11936128518282651045, matches = 2779096485, insert = 2779096485, bi_buf = 42405, bi_valid = -1515870811, high_water = 0}
Comment 12 Mark Johnston freebsd_committer freebsd_triage 2019-06-06 20:50:19 UTC
I think the rfb and rfbout threads are insufficiently synchronized.  As a quick hack, see if this fixes the problem:

diff --git a/usr.sbin/bhyve/rfb.c b/usr.sbin/bhyve/rfb.c
index d572037b25fe..8992995204be 100644
--- a/usr.sbin/bhyve/rfb.c
+++ b/usr.sbin/bhyve/rfb.c
@@ -273,8 +273,8 @@ rfb_recv_set_encodings_msg(struct rfb_softc *rc, int cfd)
                        rc->enc_raw_ok = true;
                        break;
                case RFB_ENCODING_ZLIB:
-                       rc->enc_zlib_ok = true;
                        deflateInit(&rc->zstream, Z_BEST_SPEED);
+                       rc->enc_zlib_ok = true;
                        break;
                case RFB_ENCODING_RESIZE:
                        rc->enc_resize_ok = true;
Comment 13 Conrad Meyer freebsd_committer freebsd_triage 2019-06-06 22:00:21 UTC
(In reply to Mark Johnston from comment #12)
> I think the rfb and rfbout threads are insufficiently synchronized.  As a
> quick hack, see if this fixes the problem:

Yeah, they aren't really synchronized at all. :-/

I think it would be relatively straightforward to just remove the rfbout thread entirely, which makes synchronization more straightforward.
Comment 14 Oleg Ginzburg 2019-06-07 09:47:51 UTC
(In reply to Mark Johnston from comment #12)

No, this change does not help. Panic in the same place
Comment 15 Mark Johnston freebsd_committer freebsd_triage 2019-06-07 15:44:35 UTC
(In reply to olevole from comment #14)
Ok.  Could you also provide

frame 5
p *(struct rfb_softc *)arg

and switch to the "rfb" thread and get its backtrace?

If you are willing to provide the core and debug info (a copy of /usr/sbin/bhyve and /usr/lib/debug/usr/sbin/bhyve.debug), it would be easier for us to look at that instead.
Comment 16 Oleg Ginzburg 2019-06-10 09:31:54 UTC
(gdb) p *(struct rfb_softc *)arg
$1 = {sfd = 9, tid = 0x800b1d200, cfd = 10, width = 0, height = 0, password = 0x0, enc_raw_ok = true, enc_zlib_ok = true, enc_resize_ok = true, zstream = {
    next_in = 0x942600000 <error: Cannot access memory at address 0x942600000>, avail_in = 128, total_in = 0, next_out = 0x944a00180 '\245' <repeats 200 times>..., avail_out = 9600016, total_out = 0, 
    msg = 0x0, state = 0x941b4e800, zalloc = 0x8002d2110 <zcalloc>, zfree = 0x8002d2120 <zcfree>, opaque = 0x0, data_type = 0, adler = 1, reserved = 0}, zbuf = 0x944a00180 '\245' <repeats 200 times>..., 
  zbuflen = 0, conn_wait = 0, sending = 1, mtx = 0x800ae2400, cond = 0x0, hw_crc = 1, crc = 0x941a7f700, crc_tmp = 0x941ad0a00, crc_width = 800, crc_height = 600}
Comment 17 Oleg Ginzburg 2019-06-10 09:33:50 UTC
Created attachment 204940 [details]
bhyve bin + bhyve.core + bhyve.debug

bhyve bin + bhyve.core + bhyve.debug
Comment 18 Mark Johnston freebsd_committer freebsd_triage 2019-06-11 16:55:41 UTC
(In reply to olevole from comment #17)
Thanks.  Are you sure that you properly rebuilt bhyve with the patch provided in comment 12?  In the core I see exactly the race that my patch tries to fix, and the debug info shows that the patch was not applied to the executable that you attached here.

After applying the patch, run "make -C usr.sbin/bhyve all install" from the root of the src tree to rebuild and reinstall.
Comment 19 Mark Johnston freebsd_committer freebsd_triage 2019-06-13 16:01:14 UTC
Sorry, it occurred to me that the crash could also arise if we received multiple encodings messages.  Please try this:

diff --git a/usr.sbin/bhyve/rfb.c b/usr.sbin/bhyve/rfb.c
index d572037b25fe..0a0538f80e53 100644
--- a/usr.sbin/bhyve/rfb.c
+++ b/usr.sbin/bhyve/rfb.c
@@ -273,8 +273,10 @@ rfb_recv_set_encodings_msg(struct rfb_softc *rc, int cfd)
                        rc->enc_raw_ok = true;
                        break;
                case RFB_ENCODING_ZLIB:
-                       rc->enc_zlib_ok = true;
-                       deflateInit(&rc->zstream, Z_BEST_SPEED);
+                       if (!rc->enc_zlib_ok) {
+                               deflateInit(&rc->zstream, Z_BEST_SPEED);
+                               rc->enc_zlib_ok = true;
+                       }
                        break;
                case RFB_ENCODING_RESIZE:
                        rc->enc_resize_ok = true;
Comment 20 Oleg Ginzburg 2019-06-14 15:17:44 UTC
(In reply to Mark Johnston from comment #19)

Mark, this patch really fixes the crash issue for me.

And yes, I'm sure I rebuilt and installed the bhyve binary with the previous patch ( https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=238333#c12 )

Is there a chance to see this patch in the upstream soon? Thanks!
Comment 21 Mark Johnston freebsd_committer freebsd_triage 2019-06-17 15:35:03 UTC
https://reviews.freebsd.org/D20673
Comment 22 commit-hook freebsd_committer freebsd_triage 2019-06-19 16:10:01 UTC
A commit references this bug:

Author: markj
Date: Wed Jun 19 16:09:21 UTC 2019
New revision: 349196
URL: https://svnweb.freebsd.org/changeset/base/349196

Log:
  Make zlib encoding messages idempotent.

  Otherwise duplicate messages can trigger a reinitialization of the
  compression stream while the update thread is running.  Also ensure
  that the stream is initialized before the update thread may attempt
  to use it.

  PR:		238333
  Reviewed by:	cem, rgrimes
  MFC after:	3 days
  Sponsored by:	The FreeBSD Foundation
  Differential Revision:	https://reviews.freebsd.org/D20673

Changes:
  head/usr.sbin/bhyve/rfb.c
Comment 23 commit-hook freebsd_committer freebsd_triage 2019-06-26 17:42:42 UTC
A commit references this bug:

Author: markj
Date: Wed Jun 26 17:41:38 UTC 2019
New revision: 349434
URL: https://svnweb.freebsd.org/changeset/base/349434

Log:
  MFC r349196:
  Make zlib encoding messages idempotent.

  PR:	238333

Changes:
_U  stable/12/
  stable/12/usr.sbin/bhyve/rfb.c
Comment 24 commit-hook freebsd_committer freebsd_triage 2019-06-26 17:43:46 UTC
A commit references this bug:

Author: markj
Date: Wed Jun 26 17:42:47 UTC 2019
New revision: 349435
URL: https://svnweb.freebsd.org/changeset/base/349435

Log:
  MFC r349196:
  Make zlib encoding messages idempotent.

  PR:	238333

Changes:
_U  stable/11/
  stable/11/usr.sbin/bhyve/rfb.c