Bug 215209

Summary: Kernel panic at boot under XEN: integer divide fault while in kernel mode
Product: Base System
Reporter: Sylvain Garrigues <sylvain>
Component: kern
Assignee: freebsd-xen (Nobody) <xen>
Status: Closed FIXED    
Severity: Affects Only Me
CC: cperciva, dim, meyer.sydney, royger, sylvain, xen
Priority: ---    
Version: CURRENT   
Hardware: amd64   
OS: Any   
Attachments:
Description Flags
EC2 log none

Description Sylvain Garrigues 2016-12-11 10:55:25 UTC
Created attachment 177864
EC2 log

I had a system running FreeBSD 12 dated Nov. 3 2016. I updated from source to 2016-12-10. Now the kernel panics on boot:

xenbusb_front0: <Xen Frontend Devices> on xenstore0
xn0: <Virtual Network Interface> at device/vif/0 on xenbusb_front0
xn0: Ethernet address: 06:f6:cd:2a:fa:e1
xenbusb_back0: <Xen Backend Devices> on xenstore0
xn0: backend features:xenballoon0: <Xen Balloon Device> on xenstore0
 feature-sg feature-gso-tcp4


Fatal trap 18: integer divide fault while in kernel mode
xctrl0: <Xen Control Device> on xenstore0
cpuid = 0; apic id = 00
instruction pointer	= 0x20:0xffffffff808f24ad
stack pointer	        = 0x28:0xfffffe003c91d9c0
frame pointer	        = 0x28:0xfffffe003c91da50
code segment		= base rx0, limit 0xfffff, type 0x1b
xs_dev0: <Xenstore user-space device> on xenstore0
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 14 (xenwatch)
[ thread pid 14 tid 100029 ]
Stopped at      xbd_backend_changed+0x84d:      divq    0xffffffffffffffa8(%rbp),%eax
db>

Attached is the full system log.

When I install a fresh FreeBSD snapshot on Amazon EC2, dated 2016-11-01, I don't have this bug. With the 2016-11-30 snapshot, I do.
Comment 1 Sylvain Garrigues 2016-12-11 11:01:45 UTC
I would advise all FreeBSD users against upgrading their CURRENT systems if they run on Amazon EC2! The machine won't reboot until this is fixed. This is an urgent bug.
Comment 2 Colin Percival freebsd_committer freebsd_triage 2016-12-11 19:10:47 UTC
This seems to have been introduced by the import of clang 3.9.0 in r309124.  I'll work on tracking this down further next week; I'm working on NFS right now and it's best if I don't context-switch.

As you say, best to avoid HEAD right now on EC2.
Comment 3 Roger Pau Monné freebsd_committer freebsd_triage 2016-12-12 10:15:57 UTC
I'm aware of this; I noticed it last week but haven't been able to debug it yet. I'll get to it now. In the meantime, you can boot the previous kernel using:

> boot kernel.old

From the loader command line.
Comment 4 Roger Pau Monné freebsd_committer freebsd_triage 2016-12-12 14:57:01 UTC
I don't seem to be able to reproduce this with r309875. Can you check whether you still get the panic with that or any later revision?

Thanks, Roger.
Comment 5 Colin Percival freebsd_committer freebsd_triage 2016-12-12 21:03:40 UTC
Which compiler did you use?  I'm seeing this only with clang 3.9.0; if you check out a new src tree on a pre-r309124 system and just buildkernel then it will use the 3.8.0 compiler.
Comment 6 Sylvain Garrigues 2016-12-12 23:11:32 UTC
(In reply to Colin Percival from comment #5)

I had a system running CURRENT as of Nov. 3 so it was pre-r309124 and I upgraded to CURRENT as of Dec. 10.

At first, since the machine didn't survive a reboot, I thought I had botched a mergemaster run on the passwd files. So I installed a fresh EC2 AMI, which also happened to be from CURRENT-amd64-2016-12-10, and I had the same issue. I looked at the console and saw the kernel panic.

Tried the EC2 AMI from CURRENT-amd64-2016-11-30, same problem.
Finally tried the EC2 AMI from CURRENT-amd64-2016-11-01, NO problem.
Comment 7 commit-hook freebsd_committer freebsd_triage 2016-12-13 06:54:36 UTC
A commit references this bug:

Author: cperciva
Date: Tue Dec 13 06:54:14 UTC 2016
New revision: 310013
URL: https://svnweb.freebsd.org/changeset/base/310013

Log:
  Check that blkfront devices have a non-zero number of sectors and a
  non-zero sector size.  Such a device would be a virtual disk of zero
  bytes; clearly not useful, and not something we should try to attach.

  As a fortuitous side effect, checking that these values are non-zero
  here results in them not *becoming* zero later in the function.  This
  odd behaviour began with r309124 (clang 3.9.0) but is challenging to
  debug; making any changes to this function whatsoever seems to affect
  the llvm optimizer behaviour enough to make the unexpected zeroing of
  the sector_size variable cease.

  PR:		215209
  Security:	The potential for variables to unexpectedly become zero
  		has worrying consequences for security in general, but
  		not so much in this particular context.

Changes:
  head/sys/dev/xen/blkfront/blkfront.c
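
The guard described in r310013 can be sketched in userland C roughly as follows. This is only an illustration under stated assumptions, not the code from blkfront.c: the struct `disk_geom` and the functions `geom_is_usable` and `bytes_to_sectors` are hypothetical names, and the report does not show which division in the driver actually faulted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the geometry a block frontend reads from its backend. */
struct disk_geom {
	uint64_t sectors;	/* number of sectors reported */
	uint64_t sector_size;	/* bytes per sector reported */
};

/* Reject a zero-byte disk before any arithmetic uses these values. */
static bool
geom_is_usable(const struct disk_geom *g)
{
	return (g->sectors != 0 && g->sector_size != 0);
}

/*
 * Convert a byte count to whole sectors.  The divisor must already have
 * been validated; a zero divisor here is the source-level equivalent of
 * an integer divide fault.
 */
static uint64_t
bytes_to_sectors(const struct disk_geom *g, uint64_t bytes)
{
	assert(g->sector_size != 0);
	return (bytes / g->sector_size);
}
```

A caller would check geom_is_usable() first and refuse to attach the device when it returns false, so that bytes_to_sectors() is never reached with a zero sector size.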
Comment 8 commit-hook freebsd_committer freebsd_triage 2016-12-14 19:28:36 UTC
A commit references this bug:

Author: dim
Date: Wed Dec 14 19:28:19 UTC 2016
New revision: 310086
URL: https://svnweb.freebsd.org/changeset/base/310086

Log:
  In xbd_connect(), use correct scanf conversion specifiers for the
  feature_barrier and feature_flush variables.  Otherwise, adjacent
  variables on the stack, such as sector_size, may be overwritten, with
  disastrous results.

  Note that I did not see a good reason to revert the addition of zero
  checks introduced in r310013.  Better safe than sorry.

  PR:		215209
  Tested by:	royger
  MFC after:	3 days

Changes:
  head/sys/dev/xen/blkfront/blkfront.c
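
To illustrate the class of bug fixed in r310086: a scanf-style conversion specifier must match the type of the object it stores into. On amd64, `long` is 8 bytes and `int` is 4, so a specifier such as `%lu` applied to the address of an `int` writes past the end of that object and can clobber a neighbouring stack variable. The sketch below shows matching specifiers only. The report does not quote the original format strings, so the types chosen here (`int` flags, `unsigned long` sector size) and the names `xbd_features` and `parse_features` are assumptions made for illustration.

```c
#include <stdio.h>

/* Hypothetical container; field types are assumed, not taken from blkfront.c. */
struct xbd_features {
	int		barrier;	/* parsed with %d: matches int */
	int		flush;		/* parsed with %d: matches int */
	unsigned long	sector_size;	/* parsed with %lu: matches unsigned long */
};

/*
 * Parse three decimal strings into the struct.  Each specifier matches
 * the pointed-to type, so sscanf writes exactly sizeof(field) bytes.
 * Returns 0 on success, -1 if any field fails to parse.
 */
static int
parse_features(const char *barrier, const char *flush, const char *size,
    struct xbd_features *f)
{
	if (sscanf(barrier, "%d", &f->barrier) != 1)
		return (-1);
	if (sscanf(flush, "%d", &f->flush) != 1)
		return (-1);
	if (sscanf(size, "%lu", &f->sector_size) != 1)
		return (-1);
	return (0);
}
```

Compilers that implement format checking (for example, clang and gcc with -Wformat) can flag a mismatch for the standard scanf family at build time, which is usually cheaper than finding it from a panic.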
Comment 9 commit-hook freebsd_committer freebsd_triage 2016-12-18 14:32:16 UTC
A commit references this bug:

Author: dim
Date: Sun Dec 18 14:31:12 UTC 2016
New revision: 310228
URL: https://svnweb.freebsd.org/changeset/base/310228

Log:
  MFC r310013 (by cperciva):

  Check that blkfront devices have a non-zero number of sectors and a
  non-zero sector size.  Such a device would be a virtual disk of zero
  bytes; clearly not useful, and not something we should try to attach.

  As a fortuitous side effect, checking that these values are non-zero
  here results in them not *becoming* zero later in the function.  This
  odd behaviour began with r309124 (clang 3.9.0) but is challenging to
  debug; making any changes to this function whatsoever seems to affect
  the llvm optimizer behaviour enough to make the unexpected zeroing of
  the sector_size variable cease.

  PR:		215209
  Security:	The potential for variables to unexpectedly become zero
  		has worrying consequences for security in general, but
  		not so much in this particular context.

  MFC r310086:

  In xbd_connect(), use correct scanf conversion specifiers for the
  feature_barrier and feature_flush variables.  Otherwise, adjacent
  variables on the stack, such as sector_size, may be overwritten, with
  disastrous results.

  Note that I did not see a good reason to revert the addition of zero
  checks introduced in r310013.  Better safe than sorry.

  PR:		215209
  Tested by:	royger

Changes:
_U  stable/10/
  stable/10/sys/dev/xen/blkfront/blkfront.c
_U  stable/11/
  stable/11/sys/dev/xen/blkfront/blkfront.c
_U  stable/9/
_U  stable/9/sys/
  stable/9/sys/dev/xen/blkfront/blkfront.c