Running an internal fork of FreeBSD -CURRENT (synced as of r312388), we have had several machines crash early in bootup with backtraces such as these: db> bt Tracing pid 20 tid 100135 td 0xfffff80015090500 vga_bitblt_one_text_pixels_block() at vga_bitblt_one_text_pixels_block+0x135/frame 0xfffffe15a81af570 vga_bitblt_text() at vga_bitblt_text+0xd0/frame 0xfffffe15a81af5d0 vt_flush() at vt_flush+0x2ae/frame 0xfffffe15a81af610 termcn_cnputc() at termcn_cnputc+0x108/frame 0xfffffe15a81af650 cnputc() at cnputc+0x6d/frame 0xfffffe15a81af680 cnputs() at cnputs+0xd8/frame 0xfffffe15a81af6a0 putchar() at putchar+0x15e/frame 0xfffffe15a81af720 kvprintf() at kvprintf+0x119/frame 0xfffffe15a81af830 _vprintf() at _vprintf+0x8d/frame 0xfffffe15a81af910 printf() at printf+0x53/frame 0xfffffe15a81af970 g_mirror_update_device() at g_mirror_update_device+0x8aa/frame 0xfffffe15a81af9b0 g_mirror_worker() at g_mirror_worker+0x1709/frame 0xfffffe15a81afa70 fork_exit() at fork_exit+0x85/frame 0xfffffe15a81afab0 fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe15a81afab0 --- trap 0, rip = 0, rsp = 0, rbp = 0 --- This appears to be occurring in /FreeBSD/sys/dev/vt/hw/vga/vt_vga.c at line 648: c = VTBUF_GET_FIELD(vb, row, col); It is crashing while trying to extract the column from the array of rows. (In other words, the row points to a NULL value.) There are 500 rows: db> x/lx 0xffffffff80d638c0 vt_conswindow+0x38: 1f4 The code is trying to access row 383: rdx 0x17f And, that does indeed point to a NULL column: db> x/gx 0xfffff80015886bf8 0xfffff80015886bf8: fffffe000493eec0 db> x/gx 0xfffffe000493eec0 0xfffffe000493eec0: 0 However, this other thread was running at the same time: db> bt 100000 Tracing pid 0 tid 100000 td 0xffffffff81020020 cpustop_handler() at cpustop_handler+0x28/frame 0xfffffe1362b62d40 ipi_nmi_handler() at ipi_nmi_handler+0x4a/frame 0xfffffe1362b62d60 trap() at trap+0x3a/frame 0xfffffe1362b62f20 nmi_calltrap() at nmi_calltrap+0x8/frame 0xfffffe1362b62f20 --- trap 0x13, rip = 0xffffffff807f683a, rsp = 0xffffffff815369a0, rbp = 0xffffffff815369a0 --- bcopy() at bcopy+0x1a/frame 0xffffffff815369a0 memmove() at memmove+0x14/frame 0xffffffff815369c0 vtbuf_grow() at vtbuf_grow+0x2d5/frame 0xffffffff81536a50 vt_change_font() at vt_change_font+0x1ac/frame 0xffffffff81536ac0 vt_upgrade() at vt_upgrade+0x66c/frame 0xffffffff81536b50 mi_startup() at mi_startup+0x118/frame 0xffffffff81536b70 btext() at btext+0x2c It looks like there is a race condition between vt_change_font() and termcn_cnputc(). vt_change_font() calls vtbuf_grow(), which messes around with the vt data structures. (In fact, vtbuf_grow() was in the middle of re-initializing these row data structures, which is probably why they were NULL.) vt_change_font() uses terminal_mute() to keep the terminal from using the vt data structures during the vtbuf_grow() call. That is all well and good except that termcn_cnputc() drops the terminal lock (and, necessarily, doesn't check if TF_MUTE is set) prior to running tm->tm_class->tc_done(tm). And, the tc_done function points to vtterm_done(), which calls vt_flush(), which eventually tries to read from those data structures. So, some sort of locking change is needed here to prevent the terminal from using the vt data structures while they are being re-initialized.
FWIW, I *suspect* we are only now seeing this problem because EARLY_AP_STARTUP lets threads run concurrently earlier than before. (Note: I'm not *blaming* EARLY_AP_STARTUP; this looks like this is a simple race condition. We just exercise code differently in startup now that EARLY_AP_STARTUP allows SMP to start earlier.)
A commit references this bug: Author: jtl Date: Thu Feb 23 01:18:47 UTC 2017 New revision: 314116 URL: https://svnweb.freebsd.org/changeset/base/314116 Log: Fix a panic during boot caused by inadequate locking of some vt(4) driver data structures. vt_change_font() calls vtbuf_grow() to change some vt driver data structures. It uses TF_MUTE to prevent the console from trying to use those data structures while it changes them. During the early stage of the boot process, the vt driver's tc_done routine uses those data structures; however, it is currently called outside the TF_MUTE check. Move the tc_done routine inside the locked TF_MUTE check. PR: 217282 Reviewed by: ed, ray Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D9709 Changes: head/sys/kern/subr_terminal.c
Committed fix to -CURRENT. This still needs an MFC to stable/11. After some more investigation, it appears that it is only really feasible to hit this during the early stages of the boot process or during a panic (when you can't use locking to protect this anyway). I think we are only hitting this now on -CURRENT because EARLY_AP_STARTUP now allows SMP to start much earlier in the boot process. Because EARLY_AP_STARTUP is an option in stable/11, this needs to be MFCd there.
Still applies to 11.1. Can this be MFCd?
Confirmed to work on 11-STABLE with VMware Workstation, which seemed to trigger the race condition on one boot out of three.
Tried 12.0-CURRENT and 11-STABLE with VMware Fusion, and this fix looks good to me :)
Any news on this? 11 is still affected.
Yes, please MFC to stable/11. :)
Ping for the MFC as VMware is affected.
A commit references this bug: Author: jtl Date: Fri Apr 20 15:55:10 UTC 2018 New revision: 332831 URL: https://svnweb.freebsd.org/changeset/base/332831 Log: MFC r314116: Fix a panic during boot caused by inadequate locking of some vt(4) driver data structures. vt_change_font() calls vtbuf_grow() to change some vt driver data structures. It uses TF_MUTE to prevent the console from trying to use those data structures while it changes them. During the early stage of the boot process, the vt driver's tc_done routine uses those data structures; however, it is currently called outside the TF_MUTE check. Move the tc_done routine inside the locked TF_MUTE check. PR: 217282 Sponsored by: Netflix, Inc. Changes: _U stable/11/ stable/11/sys/kern/subr_terminal.c
I'm sorry for the delay! It should be in 11.1 now and be part of the 11.2 release. Thanks for your patience!
(In reply to Jonathan T. Looney from comment #11) > It should be in 11.1 now and be part of the 11.2 release. Sorry, that should say: "It should be in stable/11 now and be part of the 11.2 release."
*** Bug 220923 has been marked as a duplicate of this bug. ***