After -O0 flag removal ( https://svnweb.freebsd.org/base?view=revision&revision=r346011 ) bhyve process begins to crash with an active VNC session. Not all VNC clients can crash. It seems that it depends on the default settings (some use the default compression by default, some do not), for example, it does not cause problems when using vncviewer (without any arguments) from tigervnc-viewer, but is well reproduced using KDE's KRDC: net/krdc ). When you connect to VNC console via KRDC bhyve process crashed with SIGBUS: pid 88831 (bhyve), jid 0, uid 0: exited on signal 10 (core dumped) Backtrace: Thread 29 "rfbout" received signal SIGBUS, Bus error. [Switching to LWP 101756 of process 93578] memcpy () at /usr/src/lib/libc/amd64/string/memmove.S:306 306 MEMMOVE erms=0 overlap=1 begin=MEMMOVE_BEGIN end=MEMMOVE_END (gdb) bt #0 memcpy () at /usr/src/lib/libc/amd64/string/memmove.S:306 #1 0x00000008002c85dc in flush_pending (strm=<optimized out>) at /usr/src/sys/contrib/zlib/deflate.c:741 #2 deflate (strm=0x800bb1030, flush=2) at /usr/src/sys/contrib/zlib/deflate.c:787 #3 0x000000000023f141 in rfb_send_rect (rc=<optimized out>, cfd=<optimized out>, gc=<optimized out>, x=0, y=0, w=128, h=32) at /usr/src/usr.sbin/bhyve/rfb.c:355 #4 rfb_send_screen (rc=<optimized out>, cfd=6, all=<optimized out>) at /usr/src/usr.sbin/bhyve/rfb.c:600 #5 0x000000000023f5c7 in rfb_wr_thr (arg=0x800bb1000) at /usr/src/usr.sbin/bhyve/rfb.c:730 #6 0x000000080064b726 in thread_start (curthread=0x800be1000) at /usr/src/lib/libthr/thread/thr_create.c:291 #7 0x0000000000000000 in ?? () Backtrace stopped: Cannot access memory at address 0x7fffdc9e3000 (gdb) p len $1 = 9600016 print sizeof(strm) $2 = 8 (gdb) print sizeof(s) $3 = 8 There are no problems when bhyve is compiled without optimization (-O0), so we do not see this problem in FreeBSD <= 12.0, but the problem occurs in FreeBSD-current
Stack track indicates that zlib compression of some kind is required to repro. Thanks for reporting.
Ok, here's what rfb.c tries to do: 1. Allocate a single zstream per rfb_softc. (fine) 2. Handle a single connection at a time per rfb_softc. (fine) 3. Periodically re-send the screen from a co-thread. (fine? could just be done from main rfb_handle thread) 4. Serialize screen send using a mutex + flag. (fine) Here is some dubious stuff: 1. Client commands, including "set encoding" (i.e., RAW, ZLIB, RESIZE) are not serialized against periodic worker thread. a. Also including things like "change dimensions of the screen" and "enable ZLIB". 2. The zstream is never reinitialized, but reused. This may be desirable (i.e., better compression is achieved if the client can reliably inflate references to historical stream data). 3. Client is allowed to arbitrarily update softc "height" and "width" fields, although amusingly these are never used? 4. zbuf isn't really sized for totally incompressible full size maximal screens? 2000*1200*4 is 9.1 MB and 16 bytes slop is definitely not enough for zlib --fast overhead. --- The line number in deflate() and stack in flush_pending() suggest that assumptions in rfb.c were violated. rfb initiates every single operation with Z_SYNC_FLUSH, which indicates that all input should be processed and nothing should be left pending in zstream's output buffer. However, it does nothing to *verify* this assumption (avail_out >0). I'm not immediately seeing how this leads to this crash, though. E.g.: $ dd if=/dev/urandom bs=$((2000*1200*4)) count=1 of=./randomscreen.dat 9600000 bytes transferred in 0.026229 secs (366010901 bytes/sec) $ gzip --fast --no-name ./randomscreen.dat $ ls -l ./randomscreen.dat.gz ... 9602938 Jun 5 09:04 ./randomscreen.dat.gz So that's 2938 bytes of zlib overhead for a maximally sized, incompressible screen. Subtract a handful of bytes for gzip header/trailer (maybe 32) and another 16 for the manual slop in rfb.c and you're still short 2.8 kB re: unchecked assumptions made in rfb.c zbuf sizing. (And to what end? zlib can operate efficiently with much smaller buffers.) Can you also print `*stream` and `*s`? What are avail_out, etc? What size of framebuffer are you attempting to use? Do you have a core you can share or repro steps I could follow? Thanks!
I should add, if you still have gdb open: (In reply to olevole from comment #0) > Thread 29 "rfbout" received signal SIGBUS, Bus error. > [Switching to LWP 101756 of process 93578] > memcpy () at /usr/src/lib/libc/amd64/string/memmove.S:306 > 306 MEMMOVE erms=0 overlap=1 begin=MEMMOVE_BEGIN end=MEMMOVE_END > (gdb) bt > #0 memcpy () at /usr/src/lib/libc/amd64/string/memmove.S:306 'info registers' might be interesting here, in tandem with 'show disas' or 'x/i $rip'. > (gdb) p len > $1 = 9600016 That's our 2000*1200*4 + 16 magic number from earlier. > print sizeof(strm) > $2 = 8 > > (gdb) print sizeof(s) > $3 = 8 These are just the pointers. Please 'print *strm' and '*s'.
I tried using krdc to connect to a UEFI-booted FreeBSD VM and haven't been able to trigger a crash so far. I verified with gdb that the rfd thread has enc_zlib_ok = true.
Try using tightvnc, which iirc itself crashes in many situations due to rect outside limits. I think this is caused by what cem spotted in that we ignore the viewer set size stuff, and tightvnc almost always sends a resize on startup to match what ever the display window is, ignoring the window size sent by the server, which probably leads to the rect outside limits error. Our bhyve/vnc implementation is extremly finicky about how it works with clients, most fail infact when doing anything but very simple stuff.
(In reply to Conrad Meyer from comment #2) > 3. Client is allowed to arbitrarily update softc "height" and "width" fields, > although amusingly these are never used? I think fixing this bug would fix a major headache many people have had, including the tossing of illegal rect or outside limit rect errors in the client.
(In reply to Conrad Meyer from comment #2) The maximum screen size is actually 1920 x 1200 due to clamping in pci_fbuf.c so this may allow enough overhead.
(In reply to Peter Grehan from comment #7) > (In reply to Conrad Meyer from comment #2) > > The maximum screen size is actually 1920 x 1200 due to clamping in > pci_fbuf.c so this may allow enough overhead. Yes, that provides enough overhead (80/2000 ~= 4%). I get 9,218,823 bytes with gzip --fast of 1920x1200x4 random data (~0.03% overhead), which is << 9,600,016. So at least buffer size + Z_FLUSH_SYNC is not a problem per se, although basically all of that should still be fixed.
(In reply to Conrad Meyer from comment #3) memcpy () at /usr/src/lib/libc/amd64/string/memmove.S:306 306 MEMMOVE erms=0 overlap=1 begin=MEMMOVE_BEGIN end=MEMMOVE_END (gdb) info registers rax 0x944e00180 39810236800 rbx 0x941a7e030 39756226608 rcx 0x124f82 1200002 rdx 0x927c10 9600016 rsi 0xa5a5a5a5a5a5a5a5 -6510615555426900571 rdi 0x944e00180 39810236800 rbp 0x7fffdc7e1db0 0x7fffdc7e1db0 rsp 0x7fffdc7e1d68 0x7fffdc7e1d68 r8 0x5a5a5a639f3a5bdb 6510615595237137371 r9 0x0 0 r10 0x0 0 r11 0x246 582 r12 0x927c10 9600016 r13 0x927c10 9600016 r14 0x941b5f800 39757150208 r15 0x2 2 rip 0x80083da1a 0x80083da1a <memcpy+266> eflags 0x10206 [ PF IF RF ] cs 0x43 67 ss 0x3b 59 ds <unavailable> es <unavailable> fs <unavailable> gs <unavailable> fs_base 0x941a0c8d0 39755761872 gs_base 0x0 (gdb) x/10i $rip => 0x80083da1a <memcpy+266>: rep movsq %ds:(%rsi),%es:(%rdi) 0x80083da1d <memcpy+269>: mov %rdx,%rcx 0x80083da20 <memcpy+272>: and $0x7,%ecx 0x80083da23 <memcpy+275>: jne 0x80083d9c0 <memcpy+176> 0x80083da25 <memcpy+277>: retq 0x80083da26 <memcpy+278>: mov (%rsi),%r8 0x80083da29 <memcpy+281>: mov 0x8(%rsi),%r9 0x80083da2d <memcpy+285>: mov %rdi,%r10 0x80083da30 <memcpy+288>: mov %rdi,%rcx 0x80083da33 <memcpy+291>: and $0xf,%rcx Unfortunately I can not see the information for struct due to bhyve compiled without -O0: (gdb) up #1 0x00000008002c85dc in flush_pending (strm=<optimized out>) at /usr/src/sys/contrib/zlib/deflate.c:741 741 zmemcpy(strm->next_out, s->pending_out, len); (gdb) print *strm value has been optimized out (gdb) print *s value has been optimized out But with -O0 i can't get crash
(In reply to olevole from comment #9) Please go to frame 2 and run p *strm p *(struct internal_state *)strm->state
(In reply to Mark Johnston from comment #10) #2 deflate (strm=0x800bb1030, flush=2) at /usr/src/sys/contrib/zlib/deflate.c:787 787 flush_pending(strm); (gdb) p *strm $1 = {next_in = 0x941e00000 "", avail_in = 128, total_in = 0, next_out = 0x944c00180 '\245' <repeats 200 times>..., avail_out = 9600016, total_out = 0, msg = 0x0, state = 0x800bf0800, zalloc = 0x8002d2110 <zcalloc>, zfree = 0x8002d2120 <zcfree>, opaque = 0x0, data_type = 0, adler = 1, reserved = 0} (gdb) p *(struct internal_state *)strm->state $2 = {strm = 0x800bb1030, status = 42, pending_buf = 0xa5a5a5a5a5a5a5a5 <error: Cannot access memory at address 0xa5a5a5a5a5a5a5a5>, pending_buf_size = 11936128518282651045, pending_out = 0xa5a5a5a5a5a5a5a5 <error: Cannot access memory at address 0xa5a5a5a5a5a5a5a5>, pending = 11936128518282651045, wrap = 1, gzhead = 0x0, gzindex = 11936128518282651045, method = 165 '\245', last_flush = 2, w_size = 32768, w_bits = 15, w_mask = 32767, window = 0x9433a9140 '\245' <repeats 200 times>..., window_size = 11936128518282651045, prev = 0x9433ba580, head = 0x9433cb780, ins_h = 2779096485, hash_size = 32768, hash_bits = 15, hash_mask = 32767, hash_shift = 5, block_start = -6510615555426900571, match_length = 2779096485, prev_match = 2779096485, match_available = -1515870811, strstart = 2779096485, match_start = 2779096485, lookahead = 2779096485, prev_length = 2779096485, max_chain_length = 2779096485, max_lazy_match = 2779096485, level = -1515870811, strategy = -1515870811, good_match = 2779096485, nice_match = -1515870811, dyn_ltree = {{fc = {freq = 42405, code = 42405}, dl = {dad = 42405, len = 42405}} <repeats 573 times>}, dyn_dtree = {{fc = {freq = 42405, code = 42405}, dl = {dad = 42405, len = 42405}} <repeats 61 times>}, bl_tree = {{fc = {freq = 42405, code = 42405}, dl = {dad = 42405, len = 42405}} <repeats 39 times>}, l_desc = {dyn_tree = 0xa5a5a5a5a5a5a5a5, max_code = -1515870811, stat_desc = 0xa5a5a5a5a5a5a5a5}, d_desc = {dyn_tree = 0xa5a5a5a5a5a5a5a5, max_code = -1515870811, stat_desc = 0xa5a5a5a5a5a5a5a5}, bl_desc = { dyn_tree = 0xa5a5a5a5a5a5a5a5, max_code = -1515870811, stat_desc = 0xa5a5a5a5a5a5a5a5}, bl_count = {42405 <repeats 16 times>}, heap = {-1515870811 <repeats 573 times>}, heap_len = -1515870811, heap_max = -1515870811, depth = '\245' <repeats 573 times>, l_buf = 0xa5a5a5a5a5a5a5a5 <error: Cannot access memory at address 0xa5a5a5a5a5a5a5a5>, lit_bufsize = 16384, last_lit = 2779096485, d_buf = 0xa5a5a5a5a5a5a5a5, opt_len = 11936128518282651045, static_len = 11936128518282651045, matches = 2779096485, insert = 2779096485, bi_buf = 42405, bi_valid = -1515870811, high_water = 0}
I think the rfb and rfbout threads are insufficiently synchronized. As a quick hack, see if this fixes the problem: diff --git a/usr.sbin/bhyve/rfb.c b/usr.sbin/bhyve/rfb.c index d572037b25fe..8992995204be 100644 --- a/usr.sbin/bhyve/rfb.c +++ b/usr.sbin/bhyve/rfb.c @@ -273,8 +273,8 @@ rfb_recv_set_encodings_msg(struct rfb_softc *rc, int cfd) rc->enc_raw_ok = true; break; case RFB_ENCODING_ZLIB: - rc->enc_zlib_ok = true; deflateInit(&rc->zstream, Z_BEST_SPEED); + rc->enc_zlib_ok = true; break; case RFB_ENCODING_RESIZE: rc->enc_resize_ok = true;
(In reply to Mark Johnston from comment #12) > I think the rfb and rfbout threads are insufficiently synchronized. As a > quick hack, see if this fixes the problem: Yeah, they aren't really synchronized at all. :-/ I think it would be relatively straightforward to just remove the rfbout thread entirely, which makes synchronization more straightforward.
(In reply to Mark Johnston from comment #12) No, this change does not help. Panic in the same place
(In reply to olevole from comment #14) Ok. Could you also provide frame 5 p *(struct rfb_softc *)arg and switch to the "rfb" thread and get its backtrace? If you are willing to provide the core and debug info (a copy of /usr/sbin/bhyve and /usr/lib/debug/usr/sbin/bhyve.debug), it would be easier for us to look at that instead.
(gdb) p *(struct rfb_softc *)arg $1 = {sfd = 9, tid = 0x800b1d200, cfd = 10, width = 0, height = 0, password = 0x0, enc_raw_ok = true, enc_zlib_ok = true, enc_resize_ok = true, zstream = { next_in = 0x942600000 <error: Cannot access memory at address 0x942600000>, avail_in = 128, total_in = 0, next_out = 0x944a00180 '\245' <repeats 200 times>..., avail_out = 9600016, total_out = 0, msg = 0x0, state = 0x941b4e800, zalloc = 0x8002d2110 <zcalloc>, zfree = 0x8002d2120 <zcfree>, opaque = 0x0, data_type = 0, adler = 1, reserved = 0}, zbuf = 0x944a00180 '\245' <repeats 200 times>..., zbuflen = 0, conn_wait = 0, sending = 1, mtx = 0x800ae2400, cond = 0x0, hw_crc = 1, crc = 0x941a7f700, crc_tmp = 0x941ad0a00, crc_width = 800, crc_height = 600}
Created attachment 204940 [details] bhyve bin + bhyve.core + bhyve.debug bhyve bin + bhyve.core + bhyve.debug
(In reply to olevole from comment #17) Thanks. Are you sure that you properly rebuilt bhyve with the patch provided in comment 12? In the core I see exactly the race that my patch tries to fix, and the debug info shows that the patch was not applied to the executable that you attached here. After applying the patch, run "make -C usr.sbin/bhyve all install" from the root of the src tree to rebuild and reinstall.
Sorry, it occurred to me that the crash could also arise if we received multiple encodings messages. Please try this: diff --git a/usr.sbin/bhyve/rfb.c b/usr.sbin/bhyve/rfb.c index d572037b25fe..0a0538f80e53 100644 --- a/usr.sbin/bhyve/rfb.c +++ b/usr.sbin/bhyve/rfb.c @@ -273,8 +273,10 @@ rfb_recv_set_encodings_msg(struct rfb_softc *rc, int cfd) rc->enc_raw_ok = true; break; case RFB_ENCODING_ZLIB: - rc->enc_zlib_ok = true; - deflateInit(&rc->zstream, Z_BEST_SPEED); + if (!rc->enc_zlib_ok) { + deflateInit(&rc->zstream, Z_BEST_SPEED); + rc->enc_zlib_ok = true; + } break; case RFB_ENCODING_RESIZE: rc->enc_resize_ok = true;
(In reply to Mark Johnston from comment #19) Mark, this patch really fixes the crash issue for me. And yes, I'm sure I rebuilt and installed the bhyve binary with the previous patch ( https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=238333#c12 ) Is there a chance to see this patch in the upstream soon? Thanks!
https://reviews.freebsd.org/D20673
A commit references this bug: Author: markj Date: Wed Jun 19 16:09:21 UTC 2019 New revision: 349196 URL: https://svnweb.freebsd.org/changeset/base/349196 Log: Make zlib encoding messages idempotent. Otherwise duplicate messages can trigger a reinitialization of the compression stream while the update thread is running. Also ensure that the stream is initialized before the update thread may attempt to use it. PR: 238333 Reviewed by: cem, rgrimes MFC after: 3 days Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D20673 Changes: head/usr.sbin/bhyve/rfb.c
A commit references this bug: Author: markj Date: Wed Jun 26 17:41:38 UTC 2019 New revision: 349434 URL: https://svnweb.freebsd.org/changeset/base/349434 Log: MFC r349196: Make zlib encoding messages idempotent. PR: 238333 Changes: _U stable/12/ stable/12/usr.sbin/bhyve/rfb.c
A commit references this bug: Author: markj Date: Wed Jun 26 17:42:47 UTC 2019 New revision: 349435 URL: https://svnweb.freebsd.org/changeset/base/349435 Log: MFC r349196: Make zlib encoding messages idempotent. PR: 238333 Changes: _U stable/11/ stable/11/usr.sbin/bhyve/rfb.c