Bug 217994 - Kernel panic in native_lapic_setup with 12-CURRENT on EC2 machine
Status: Closed Not A Bug
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: CURRENT
Hardware: Any Any
Importance: --- Affects Only Me
Assignee: freebsd-bugs (Nobody)
Reported: 2017-03-22 08:03 UTC by Sylvain Garrigues
Modified: 2017-03-22 16:12 UTC

Attachments
Kernel panic (525.46 KB, image/jpeg)
2017-03-22 08:03 UTC, Sylvain Garrigues

Description Sylvain Garrigues 2017-03-22 08:03:22 UTC
Created attachment 181046
Kernel panic

EC2 machines cannot boot anymore.
Comment 1 Sylvain Garrigues 2017-03-22 08:08:46 UTC
The EC2 machine did boot with the official EC2 snapshot dated March 10. I then rebuilt CURRENT from a src tree dated a few days after that.
Comment 2 Konstantin Belousov freebsd_committer freebsd_triage 2017-03-22 08:11:16 UTC
What do you mean by 'anymore'? Which revision causes the breakage for you?

Show the source line number for the faulting instruction. It might also be useful to provide a disassembly of native_lapic_setup for the panicked kernel, done using objdump or gdb, since the ddb disassembly shows something that is highly unlikely to be valid.

And this looks very strange, since Xen should not use native lapic at all.
Comment 3 Sylvain Garrigues 2017-03-22 08:17:57 UTC
I no longer have access to that machine: it is "in the cloud", and since it cannot boot I cannot log in. I only got the kernel fault screenshot from the EC2 admin interface.

I installed it from an official EC2 snapshot of 12-CURRENT-2017-03-10, logged in, updated src to HEAD last week, built world and kernel this weekend, installed them this morning, and boom, kernel fault.

For what it's worth, this is a t2.micro instance.
Comment 4 Sylvain Garrigues 2017-03-22 08:23:06 UTC
To be extra clear, I got the kernel fault at boot time (right after rebooting the machine).

I did reboot several times (on 2017-03-10) the same machine with CURRENT kernel from 2017-03-10 EC2 snapshot and did not experience this issue. So I expect the regression to have been introduced after that date.
Comment 5 Andriy Gapon freebsd_committer freebsd_triage 2017-03-22 08:51:21 UTC
Should native_lapic_setup be called on Xen at all?  I thought that it overrides LAPIC methods...  Are you sure that you are using the right kernel config? But then I don't know much about xen and ec2.
Comment 6 Colin Percival freebsd_committer freebsd_triage 2017-03-22 08:58:23 UTC
I don't have time to dig into this right now (it's 2AM here), but FreeBSD/EC2 now uses GENERIC kernels since EC2 is running Xen in HVM mode.

Sylvain, can you check if the 2017-03-16 snapshot works?
Comment 7 Andriy Gapon freebsd_committer freebsd_triage 2017-03-22 09:08:17 UTC
(In reply to Sylvain Garrigues from comment #3)
If you still have the image / kernel somewhere local (or on another cloud machine), then you can open the kernel in kgdb and disassemble the function.
Comment 8 Andriy Gapon freebsd_committer freebsd_triage 2017-03-22 09:09:21 UTC
(In reply to Colin Percival from comment #6)
Ah, okay, I keep forgetting about PVM vs HVM vs something else.
Sorry for the noise.
Comment 9 Sylvain Garrigues 2017-03-22 09:15:57 UTC
@Colin: I cannot see the 2017-03-16 snapshot in the Community AMIs. Where should I find it?

It may be worth mentioning that I build with CPUTYPE?=haswell. Could this cause such a fault?

Also, I did manage to mount the volume on another EC2 machine. Does anyone want me to send a file (kernel or log)?

/etc/make.conf
MALLOC_PRODUCTION=YES
WITH_CCACHE_BUILD=YES
CPUTYPE?=haswell
KERNCONF=GENERIC-NODEBUG

/etc/src.conf
MALLOC_PRODUCTION=YES
WITH_CCACHE_BUILD=YES
WITHOUT_TESTS=YES
WITHOUT_BHYVE=YES
WITHOUT_PROFILE=YES
WITHOUT_ZFS=YES
WITHOUT_SYSTEM_COMPILER=YES
Comment 10 Andriy Gapon freebsd_committer freebsd_triage 2017-03-22 09:46:43 UTC
(In reply to Sylvain Garrigues from comment #9)
I would be happy if you just run kgdb /path/to/that/kernel and then "disassemble native_lapic_setup" in kgdb.
Comment 11 Sylvain Garrigues 2017-03-22 09:49:31 UTC
(kgdb) disassemble native_lapic_setup
Dump of assembler code for function native_lapic_setup:
0xffffffff8107fd30 <native_lapic_setup+0>:	push   %rbp
0xffffffff8107fd31 <native_lapic_setup+1>:	mov    %rsp,%rbp
0xffffffff8107fd34 <native_lapic_setup+4>:	push   %r15
0xffffffff8107fd36 <native_lapic_setup+6>:	push   %r14
0xffffffff8107fd38 <native_lapic_setup+8>:	push   %r13
0xffffffff8107fd3a <native_lapic_setup+10>:	push   %r12
0xffffffff8107fd3c <native_lapic_setup+12>:	push   %rbx
0xffffffff8107fd3d <native_lapic_setup+13>:	sub    $0x38,%rsp
0xffffffff8107fd41 <native_lapic_setup+17>:	mov    %edi,%r14d
0xffffffff8107fd44 <native_lapic_setup+20>:	mov    0xffffffff81d6d320,%rax
0xffffffff8107fd4c <native_lapic_setup+28>:	mov    %rax,-0x30(%rbp)
0xffffffff8107fd50 <native_lapic_setup+32>:	pushfq 
0xffffffff8107fd51 <native_lapic_setup+33>:	pop    %rbx
0xffffffff8107fd52 <native_lapic_setup+34>:	cli    
0xffffffff8107fd53 <native_lapic_setup+35>:	callq  *0xffffffff81a216d0
0xffffffff8107fd5a <native_lapic_setup+42>:	movslq %eax,%rsi
0xffffffff8107fd5d <native_lapic_setup+45>:	cmpl   $0x0,0xffffffff81edba40
0xffffffff8107fd65 <native_lapic_setup+53>:	je     0xffffffff8107fdaa <native_lapic_setup+122>
0xffffffff8107fd67 <native_lapic_setup+55>:	mov    $0x803,%ecx
0xffffffff8107fd6c <native_lapic_setup+60>:	rdmsr  
0xffffffff8107fd6e <native_lapic_setup+62>:	mov    $0x810,%ecx
0xffffffff8107fd73 <native_lapic_setup+67>:	(bad)  
0xffffffff8107fd74 <native_lapic_setup+68>:	(bad)  
0xffffffff8107fd75 <native_lapic_setup+69>:	jo     0xffffffff8107fd6e <native_lapic_setup+62>
0xffffffff8107fd77 <native_lapic_setup+71>:	loopne 0xffffffff8107fcfc <native_lapic_xapic_mode+28>
0xffffffff8107fd79 <native_lapic_setup+73>:	cmp    $0x25,%al
0xffffffff8107fd7b <native_lapic_setup+75>:	rex mov    $0x740081ed,%edx
0xffffffff8107fd81 <native_lapic_setup+81>:	cmp    0x808(%rcx),%edi
0xffffffff8107fd87 <native_lapic_setup+87>:	rdmsr  
0xffffffff8107fd89 <native_lapic_setup+89>:	and    $0xffffff00,%eax
0xffffffff8107fd8e <native_lapic_setup+94>:	cmpl   $0x0,0xffffffff81edba40
0xffffffff8107fd96 <native_lapic_setup+102>:	je     0xffffffff810800e2 <native_lapic_setup+946>
0xffffffff8107fd9c <native_lapic_setup+108>:	mfence 
0xffffffff8107fd9f <native_lapic_setup+111>:	xor    %edx,%edx
0xffffffff8107fda1 <native_lapic_setup+113>:	mov    $0x808,%ecx
0xffffffff8107fda6 <native_lapic_setup+118>:	wrmsr  
0xffffffff8107fda8 <native_lapic_setup+120>:	jmp    0xffffffff8107fdd6 <native_lapic_setup+166>
0xffffffff8107fdaa <native_lapic_setup+122>:	mov    0xffffffff81edba38,%rax
0xffffffff8107fdb2 <native_lapic_setup+130>:	mov    $0x810,%ecx
0xffffffff8107fdb7 <native_lapic_setup+135>:	(bad)  
0xffffffff8107fdb8 <native_lapic_setup+136>:	(bad)  
0xffffffff8107fdb9 <native_lapic_setup+137>:	jo     0xffffffff8107fdb2 <native_lapic_setup+130>
0xffffffff8107fdbb <native_lapic_setup+139>:	(bad)  
0xffffffff8107fdbc <native_lapic_setup+140>:	xor    %cl,-0x75(%rax)
0xffffffff8107fdbf <native_lapic_setup+143>:	or     $0x25,%al
0xffffffff8107fdc1 <native_lapic_setup+145>:	cmp    %bh,0xb881ed(%rdx)
0xffffffff8107fdc7 <native_lapic_setup+151>:	(bad)  
0xffffffff8107fdc8 <native_lapic_setup+152>:	(bad)  
0xffffffff8107fdc9 <native_lapic_setup+153>:	jmpq   *(%rbx)
0xffffffff8107fdcb <native_lapic_setup+155>:	addl   $0x8081,-0x77000000(%rax)
0xffffffff8107fdd5 <native_lapic_setup+165>:	add    %cl,-0x12(%rcx,%rbp,2)
0xffffffff8107fdd9 <native_lapic_setup+169>:	push   %rax
0xffffffff8107fdda <native_lapic_setup+170>:	add    (%rax),%eax
0xffffffff8107fddc <native_lapic_setup+172>:	add    %al,-0x45bfdac4(%rbx)
Comment 12 Andriy Gapon freebsd_committer freebsd_triage 2017-03-22 09:54:35 UTC
(In reply to Sylvain Garrigues from comment #11)
Those "(bad)" instructions don't look good.
Could you please install gdb from packages (pkg install gdb) and repeat the same procedure using kgdb7xxx (kgdb7121)?
Perhaps the CPU type override is indeed the problem here.
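
A likely reason for the "(bad)" entries in comment 11 is that the older kgdb does not decode VEX-encoded instructions. As a rough sketch (not something posted in this report), the Python below decodes a three-byte VEX prefix. The byte sequence c4 62 70 f7 e0 is an assumption: it is reconstructed from how the old listing misread offsets +67 to +71 (two bad bytes, then `jo` = 0x70 0xf7, then `loopne` = 0xe0), not dumped from the kernel file. `decode_vex3` is a hypothetical helper name.

```python
def decode_vex3(code: bytes) -> dict:
    """Decode a three-byte VEX prefix, the opcode and a ModRM byte."""
    if len(code) < 5 or code[0] != 0xC4:
        raise ValueError("expected a three-byte VEX prefix (0xC4)")
    b1, b2, opcode, modrm = code[1], code[2], code[3], code[4]
    # R, B and vvvv are stored inverted inside the prefix.
    r = ((b1 >> 7) & 1) ^ 1
    b = ((b1 >> 5) & 1) ^ 1
    opcode_map = {1: "0F", 2: "0F38", 3: "0F3A"}.get(b1 & 0x1F, "unknown")
    return {
        "map": opcode_map,
        "opcode": opcode,
        "w": (b2 >> 7) & 1,
        "vvvv": ((b2 >> 3) & 0xF) ^ 0xF,       # extra source register
        "mod": modrm >> 6,
        "reg": ((modrm >> 3) & 7) | (r << 3),  # ModRM.reg extended by R
        "rm": (modrm & 7) | (b << 3),          # ModRM.rm extended by B
    }


# Assumed bytes at native_lapic_setup+67 (inferred, see the note above).
fields = decode_vex3(bytes.fromhex("c46270f7e0"))
print(fields)
```

With these assumed bytes the fields come out as map 0F38, opcode 0xF7, vvvv = 1 (%ecx), reg = 12 (%r12d) and rm = 0 (%eax), which matches the `bextr %ecx,%eax,%r12d` that the newer kgdb prints in comment 13.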
Comment 13 Sylvain Garrigues 2017-03-22 09:58:22 UTC
(In reply to Andriy Gapon from comment #12)

root@ip-172-31-17-21:~ # kgdb7121 /mnt/boot/kernel/kernel
GNU gdb (GDB) 7.12.1 [GDB v7.12.1 for FreeBSD]
Copyright (C) 2017 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-portbld-freebsd12.0".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /mnt/boot/kernel/kernel...(no debugging symbols found)...done.
(kgdb) disassemble native_lapic_setup
Dump of assembler code for function native_lapic_setup:
   0xffffffff8107fd30 <+0>:	push   %rbp
   0xffffffff8107fd31 <+1>:	mov    %rsp,%rbp
   0xffffffff8107fd34 <+4>:	push   %r15
   0xffffffff8107fd36 <+6>:	push   %r14
   0xffffffff8107fd38 <+8>:	push   %r13
   0xffffffff8107fd3a <+10>:	push   %r12
   0xffffffff8107fd3c <+12>:	push   %rbx
   0xffffffff8107fd3d <+13>:	sub    $0x38,%rsp
   0xffffffff8107fd41 <+17>:	mov    %edi,%r14d
   0xffffffff8107fd44 <+20>:	mov    0xffffffff81d6d320,%rax
   0xffffffff8107fd4c <+28>:	mov    %rax,-0x30(%rbp)
   0xffffffff8107fd50 <+32>:	pushfq 
   0xffffffff8107fd51 <+33>:	pop    %rbx
   0xffffffff8107fd52 <+34>:	cli    
   0xffffffff8107fd53 <+35>:	callq  *0xffffffff81a216d0
   0xffffffff8107fd5a <+42>:	movslq %eax,%rsi
   0xffffffff8107fd5d <+45>:	cmpl   $0x0,0xffffffff81edba40
   0xffffffff8107fd65 <+53>:	je     0xffffffff8107fdaa <native_lapic_setup+122>
   0xffffffff8107fd67 <+55>:	mov    $0x803,%ecx
   0xffffffff8107fd6c <+60>:	rdmsr  
   0xffffffff8107fd6e <+62>:	mov    $0x810,%ecx
   0xffffffff8107fd73 <+67>:	bextr  %ecx,%eax,%r12d
   0xffffffff8107fd78 <+72>:	cmpl   $0x0,0xffffffff81edba40
   0xffffffff8107fd80 <+80>:	je     0xffffffff8107fdbd <native_lapic_setup+141>
   0xffffffff8107fd82 <+82>:	mov    $0x808,%ecx
   0xffffffff8107fd87 <+87>:	rdmsr  
   0xffffffff8107fd89 <+89>:	and    $0xffffff00,%eax
   0xffffffff8107fd8e <+94>:	cmpl   $0x0,0xffffffff81edba40
   0xffffffff8107fd96 <+102>:	je     0xffffffff810800e2 <native_lapic_setup+946>
   0xffffffff8107fd9c <+108>:	mfence 
   0xffffffff8107fd9f <+111>:	xor    %edx,%edx
   0xffffffff8107fda1 <+113>:	mov    $0x808,%ecx
   0xffffffff8107fda6 <+118>:	wrmsr  
   0xffffffff8107fda8 <+120>:	jmp    0xffffffff8107fdd6 <native_lapic_setup+166>
   0xffffffff8107fdaa <+122>:	mov    0xffffffff81edba38,%rax
   0xffffffff8107fdb2 <+130>:	mov    $0x810,%ecx
   0xffffffff8107fdb7 <+135>:	bextr  %ecx,0x30(%rax),%r12d
   0xffffffff8107fdbd <+141>:	mov    0xffffffff81edba38,%rcx
   0xffffffff8107fdc5 <+149>:	mov    $0xffffff00,%eax
   0xffffffff8107fdca <+154>:	and    0x80(%rcx),%eax
   0xffffffff8107fdd0 <+160>:	mov    %eax,0x80(%rcx)
   0xffffffff8107fdd6 <+166>:	imul   $0x350,%rsi,%r13
   0xffffffff8107fddd <+173>:	cmpl   $0x0,0xffffffff81edba40
   0xffffffff8107fde5 <+181>:	mov    %rbx,-0x58(%rbp)
   0xffffffff8107fde9 <+185>:	je     0xffffffff8107fe29 <native_lapic_setup+249>
   0xffffffff8107fdeb <+187>:	mov    $0x80f,%ecx
   0xffffffff8107fdf0 <+192>:	rdmsr  
   0xffffffff8107fdf2 <+194>:	mov    %eax,%ecx
   0xffffffff8107fdf4 <+196>:	and    $0xfffffc00,%ecx
   0xffffffff8107fdfa <+202>:	cmpl   $0x0,0xffffffff81edba28
   0xffffffff8107fe02 <+210>:	mov    $0x1ff,%edx
   0xffffffff8107fe07 <+215>:	mov    $0x11ff,%eax
   0xffffffff8107fe0c <+220>:	cmove  %edx,%eax
   0xffffffff8107fe0f <+223>:	or     %ecx,%eax
Comment 14 Andriy Gapon freebsd_committer freebsd_triage 2017-03-22 10:09:57 UTC
(In reply to Sylvain Garrigues from comment #13)
Thanks! I suspect that the problem is with the BEXTR instruction, which is available on Haswell but does not seem to be provided by (at least some) EC2 instances. I am not sure whether it requires any support from the hypervisor or whether the underlying hardware does not support the instruction.
So, I bet on your guess from comment #9.
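
For readers unfamiliar with BEXTR, here is a minimal Python sketch of what the 32-bit form computes. In the listing above the control value in %ecx is 0x810, meaning start bit 0x10 and length 8, so the instruction extracts bits 16..23 of the value just read from the LAPIC version register. The function name `bextr32` and the sample register value are assumptions made for illustration only.

```python
def bextr32(src: int, control: int) -> int:
    """Emulate 32-bit BEXTR: extract `length` bits of src starting at `start`.

    control bits 7:0 hold the start position, bits 15:8 hold the length.
    """
    start = control & 0xFF
    length = (control >> 8) & 0xFF
    return ((src & 0xFFFFFFFF) >> start) & ((1 << length) - 1)


# Hypothetical LAPIC version register value, for illustration only.
version = 0x01060015
print(hex(bextr32(version, 0x810)))  # same as (version >> 16) & 0xff
```

A compiler targeting a CPU without BMI1 emits a shift and a mask for the same expression, which is why the kernel built without CPUTYPE boots on the same instance.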
Comment 15 Sylvain Garrigues 2017-03-22 10:12:56 UTC
(In reply to Andriy Gapon from comment #14)
I'm recompiling the kernel with GENERIC (instead of GENERIC-NODEBUG) and no CPUTYPE line. Hold your breath and expect the end of the suspense in a couple of hours.
Comment 16 Sylvain Garrigues 2017-03-22 15:28:46 UTC
Guys, I recompiled without CPUTYPE and now the kernel boots. I am so ashamed to have disturbed you for this.

Though, my CPU *is* Haswell:
# sysctl -a | egrep -i 'hw.machine|hw.ncpu|hw.model'
hw.machine: amd64
hw.model: Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz
hw.ncpu: 1
hw.machine_arch: amd64

So I'm pissed I can't use CPUTYPE here. Don't know if I should blame KVM or Amazon. I have another older EC2 machine where CPUTYPE=ivybridge works BTW.

Nevertheless, thanks so much for quick and accurate feedback, you guys are the reason I wouldn't dream of using any other OS than FreeBSD. THANKS!

Bug can be closed.
Comment 17 Andriy Gapon freebsd_committer freebsd_triage 2017-03-22 16:12:51 UTC
FWIW, here is a report of a similar problem, with a similar resolution:
https://lists.freebsd.org/pipermail/freebsd-stable/2016-July/084960.html

You might want to ask Amazon support if you are really curious.
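
One way to check up front whether a guest advertises BMI1 (the extension that provides BEXTR) is to look at the CPU feature lines in the boot messages rather than at the model name. The sketch below parses such a line in Python. It assumes the "Structured Extended Features=0x...<...>" format that FreeBSD/amd64 kernels normally print at boot; the sample strings are made up for illustration and are not taken from the reporter's machine, and `stdext_features` is a hypothetical helper name.

```python
import re


def stdext_features(dmesg_text: str) -> set:
    """Return the flag names from the 'Structured Extended Features' line."""
    match = re.search(
        r"Structured Extended Features=0x[0-9a-fA-F]+<([^>]*)>", dmesg_text
    )
    if match is None:
        return set()
    return {name for name in match.group(1).split(",") if name}


# Hypothetical boot message fragments, not output from this bug report.
with_bmi1 = "  Structured Extended Features=0x7a9<FSGSBASE,BMI1,AVX2,SMEP,BMI2,ERMS,INVPCID>"
without_bmi1 = "  Structured Extended Features=0x281<FSGSBASE,SMEP,ERMS>"

print("BMI1" in stdext_features(with_bmi1))
print("BMI1" in stdext_features(without_bmi1))
```

On a running system the text would typically come from /var/run/dmesg.boot. If BMI1 is missing there, a CPUTYPE that lets the compiler emit BMI1 instructions is not safe for that guest, whatever hw.model reports.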