Bug 217282 - panics during bootup due to a race between vt_change_font() and termcn_cnputc()
Summary: panics during bootup due to a race between vt_change_font() and termcn_cnputc()
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: CURRENT
Hardware: amd64 Any
: --- Affects Some People
Assignee: Jonathan T. Looney
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-02-21 23:59 UTC by Jonathan T. Looney
Modified: 2018-04-20 20:39 UTC (History)
3 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Jonathan T. Looney freebsd_committer 2017-02-21 23:59:36 UTC
Running an internal fork of FreeBSD -CURRENT (synced as of r312388), we have had several machines crash early in bootup with backtraces such as these:
  db> bt
  Tracing pid 20 tid 100135 td 0xfffff80015090500
  vga_bitblt_one_text_pixels_block() at vga_bitblt_one_text_pixels_block+0x135/frame 0xfffffe15a81af570
  vga_bitblt_text() at vga_bitblt_text+0xd0/frame 0xfffffe15a81af5d0
  vt_flush() at vt_flush+0x2ae/frame 0xfffffe15a81af610
  termcn_cnputc() at termcn_cnputc+0x108/frame 0xfffffe15a81af650
  cnputc() at cnputc+0x6d/frame 0xfffffe15a81af680
  cnputs() at cnputs+0xd8/frame 0xfffffe15a81af6a0
  putchar() at putchar+0x15e/frame 0xfffffe15a81af720
  kvprintf() at kvprintf+0x119/frame 0xfffffe15a81af830
  _vprintf() at _vprintf+0x8d/frame 0xfffffe15a81af910
  printf() at printf+0x53/frame 0xfffffe15a81af970
  g_mirror_update_device() at g_mirror_update_device+0x8aa/frame 0xfffffe15a81af9b0
  g_mirror_worker() at g_mirror_worker+0x1709/frame 0xfffffe15a81afa70
  fork_exit() at fork_exit+0x85/frame 0xfffffe15a81afab0
  fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe15a81afab0
  --- trap 0, rip = 0, rsp = 0, rbp = 0 ---


This appears to be occurring in /FreeBSD/sys/dev/vt/hw/vga/vt_vga.c at line 648:
  c = VTBUF_GET_FIELD(vb, row, col);


It is crashing while trying to extract the column from the array of rows. (In other words, the row points to a NULL value.)


There are 500 rows:
  db> x/lx 0xffffffff80d638c0
  vt_conswindow+0x38:     1f4


The code is trying to access row 383:
  rdx                      0x17f


And, that does indeed point to a NULL column:
  db> x/gx 0xfffff80015886bf8
  0xfffff80015886bf8:     fffffe000493eec0
  db> x/gx 0xfffffe000493eec0
  0xfffffe000493eec0:     0


However, this other thread was running at the same time:
  db> bt 100000
  Tracing pid 0 tid 100000 td 0xffffffff81020020
  cpustop_handler() at cpustop_handler+0x28/frame 0xfffffe1362b62d40
  ipi_nmi_handler() at ipi_nmi_handler+0x4a/frame 0xfffffe1362b62d60
  trap() at trap+0x3a/frame 0xfffffe1362b62f20
  nmi_calltrap() at nmi_calltrap+0x8/frame 0xfffffe1362b62f20
  --- trap 0x13, rip = 0xffffffff807f683a, rsp = 0xffffffff815369a0, rbp = 0xffffffff815369a0 ---
  bcopy() at bcopy+0x1a/frame 0xffffffff815369a0
  memmove() at memmove+0x14/frame 0xffffffff815369c0
  vtbuf_grow() at vtbuf_grow+0x2d5/frame 0xffffffff81536a50
  vt_change_font() at vt_change_font+0x1ac/frame 0xffffffff81536ac0
  vt_upgrade() at vt_upgrade+0x66c/frame 0xffffffff81536b50
  mi_startup() at mi_startup+0x118/frame 0xffffffff81536b70
  btext() at btext+0x2c


It looks like there is a race condition between vt_change_font() and termcn_cnputc().

vt_change_font() calls vtbuf_grow(), which messes around with the vt data structures. (In fact, vtbuf_grow() was in the middle of re-initializing these row data structures, which is probably why they were NULL.) vt_change_font() uses terminal_mute() to keep the terminal from using the vt data structures during the vtbuf_grow() call.

That is all well and good except that termcn_cnputc() drops the terminal lock (and, necessarily, doesn't check if TF_MUTE is set) prior to running tm->tm_class->tc_done(tm). And, the tc_done function points to vtterm_done(), which calls vt_flush(), which eventually tries to read from those data structures.

So, some sort of locking change is needed here to prevent the terminal from using the vt data structures while they are being re-initialized.
Comment 1 Jonathan T. Looney freebsd_committer 2017-02-22 00:12:07 UTC
FWIW, I *suspect* we are only now seeing this problem because EARLY_AP_STARTUP lets threads run concurrently earlier than before. (Note: I'm not *blaming* EARLY_AP_STARTUP; this looks like this is a simple race condition. We just exercise code differently in startup now that EARLY_AP_STARTUP allows SMP to start earlier.)
Comment 2 commit-hook freebsd_committer 2017-02-23 01:19:14 UTC
A commit references this bug:

Author: jtl
Date: Thu Feb 23 01:18:47 UTC 2017
New revision: 314116
URL: https://svnweb.freebsd.org/changeset/base/314116

Log:
  Fix a panic during boot caused by inadequate locking of some vt(4) driver
  data structures.

  vt_change_font() calls vtbuf_grow() to change some vt driver data
  structures. It uses TF_MUTE to prevent the console from trying to use those
  data structures while it changes them.

  During the early stage of the boot process, the vt driver's tc_done routine
  uses those data structures; however, it is currently called outside the
  TF_MUTE check.

  Move the tc_done routine inside the locked TF_MUTE check.

  PR:		217282
  Reviewed by:	ed, ray
  Sponsored by:	Netflix
  Differential Revision:	https://reviews.freebsd.org/D9709

Changes:
  head/sys/kern/subr_terminal.c
Comment 3 Jonathan T. Looney freebsd_committer 2017-02-23 01:38:34 UTC
Committed fix to -CURRENT. This still needs an MFC to stable/11.

After some more investigation, it appears that it is only really feasible to hit this during the early stages of the boot process or during a panic (when you can't use locking to protect this anyway). I think we are only hitting this now on -CURRENT because EARLY_AP_STARTUP now allows SMP to start much earlier in the boot process. Because EARLY_AP_STARTUP is an option in stable/11, this needs to be MFCd there.
Comment 4 Jimmy Olgeni freebsd_committer 2017-11-06 10:06:05 UTC
Still applies to 11.1. Can this be MFCd?
Comment 5 Jimmy Olgeni freebsd_committer 2017-11-08 12:44:15 UTC
Confirmed to work on 11-STABLE with VMware Workstation, which seemed to trigger the race condition on one boot out of three.
Comment 6 Zhenlei Huang 2017-11-11 16:21:08 UTC
Tried 12.0-CURRENT and 11-STABLE with VMware Fusion, and this fix looks good to me :)
Comment 7 Jimmy Olgeni freebsd_committer 2017-12-17 09:51:21 UTC
Any news on this? 11 is still affected.
Comment 8 Franco Fichtner 2017-12-17 14:48:37 UTC
Yes, please MFC to stable/11. :)
Comment 9 Jimmy Olgeni freebsd_committer 2018-04-01 09:19:54 UTC
Ping for the MFC as VMware is affected.
Comment 10 commit-hook freebsd_committer 2018-04-20 15:55:34 UTC
A commit references this bug:

Author: jtl
Date: Fri Apr 20 15:55:10 UTC 2018
New revision: 332831
URL: https://svnweb.freebsd.org/changeset/base/332831

Log:
  MFC r314116:
    Fix a panic during boot caused by inadequate locking of some vt(4) driver
    data structures.

    vt_change_font() calls vtbuf_grow() to change some vt driver data
    structures. It uses TF_MUTE to prevent the console from trying to use
    those data structures while it changes them.

    During the early stage of the boot process, the vt driver's tc_done
    routine uses those data structures; however, it is currently called
    outside the TF_MUTE check.

    Move the tc_done routine inside the locked TF_MUTE check.

  PR:		217282
  Sponsored by:	Netflix, Inc.

Changes:
_U  stable/11/
  stable/11/sys/kern/subr_terminal.c
Comment 11 Jonathan T. Looney freebsd_committer 2018-04-20 20:38:28 UTC
I'm sorry for the delay! It should be in 11.1 now and be part of the 11.2 release. Thanks for your patience!
Comment 12 Jonathan T. Looney freebsd_committer 2018-04-20 20:39:25 UTC
(In reply to Jonathan T. Looney from comment #11)
> It should be in 11.1 now and be part of the 11.2 release.

Sorry, that should say: "It should be in stable/11 now and be part of the 11.2 release."