Created attachment 181046: Kernel panic
EC2 machines cannot boot anymore.
The EC2 machine booted fine with the official EC2 snapshot dated March 10. I rebuilt CURRENT from a src tree that dates from a few days after that.
What do you mean by 'anymore'? Which revision causes the breakage for you? Show the source line number for the faulted instruction. It might also be useful to provide a disassembly of native_lapic_setup for the panicked kernel, done using objdump or gdb, since the ddb disassembly shows something that is highly unlikely to be valid. And this looks very strange, since Xen should not use the native lapic at all.
I don't have access anymore to that machine, it is "in the cloud" and since it cannot boot I cannot login anymore. I just got the kernel fault screenshot from EC2 admin interface. I did install it with an official EC2 snapshot from 12-CURRENT-2017-03-10, logged in, updated src last week to HEAD, built and installed world and kernel this week-end, installed it this morning, and boom, kernel fault this morning. For what it's worth, this is a t2.micro instance.
To be extra clear, I got the kernel fault at boot time (right after rebooting the machine). I rebooted the same machine several times (on 2017-03-10) with the CURRENT kernel from the 2017-03-10 EC2 snapshot and did not experience this issue. So I expect the regression to have been introduced after that date.
Should native_lapic_setup be called on Xen at all? I thought that it overrides LAPIC methods... Are you sure that you are using the right kernel config? But then I don't know much about xen and ec2.
I don't have time to dig into this right now (it's 2AM here), but FreeBSD/EC2 now uses GENERIC kernels since EC2 is running Xen in HVM mode. Sylvain, can you check if the 2017-03-16 snapshot works?
(In reply to Sylvain Garrigues from comment #3) If you still have the image / kernel somewhere local (or on another cloud machine), then you can open the kernel in kgdb and disassemble the function.
(In reply to Colin Percival from comment #6) Ah, okay, I keep forgetting about PVM vs HVM vs something else. Sorry for the noise.
@Colin: I cannot see the 2017-03-16 snapshot in the Community AMIs, where should I find it? It may be worth mentioning that I set CPUTYPE?=haswell; could this cause such a fault? Also, I did manage to mount the volume on another EC2 machine. Does anyone want me to send a file (kernel or log)?

/etc/make.conf
MALLOC_PRODUCTION=YES
WITH_CCACHE_BUILD=YES
CPUTYPE?=haswell
KERNCONF=GENERIC-NODEBUG

/etc/src.conf
MALLOC_PRODUCTION=YES
WITH_CCACHE_BUILD=YES
WITHOUT_TESTS=YES
WITHOUT_BHYVE=YES
WITHOUT_PROFILE=YES
WITHOUT_ZFS=YES
WITHOUT_SYSTEM_COMPILER=YES
(In reply to Sylvain Garrigues from comment #9) I would be happy if you just run kgdb /path/to/that/kernel and then "disassemble native_lapic_setup" in kgdb.
(kgdb) disassemble native_lapic_setup
Dump of assembler code for function native_lapic_setup:
0xffffffff8107fd30 <native_lapic_setup+0>: push %rbp
0xffffffff8107fd31 <native_lapic_setup+1>: mov %rsp,%rbp
0xffffffff8107fd34 <native_lapic_setup+4>: push %r15
0xffffffff8107fd36 <native_lapic_setup+6>: push %r14
0xffffffff8107fd38 <native_lapic_setup+8>: push %r13
0xffffffff8107fd3a <native_lapic_setup+10>: push %r12
0xffffffff8107fd3c <native_lapic_setup+12>: push %rbx
0xffffffff8107fd3d <native_lapic_setup+13>: sub $0x38,%rsp
0xffffffff8107fd41 <native_lapic_setup+17>: mov %edi,%r14d
0xffffffff8107fd44 <native_lapic_setup+20>: mov 0xffffffff81d6d320,%rax
0xffffffff8107fd4c <native_lapic_setup+28>: mov %rax,-0x30(%rbp)
0xffffffff8107fd50 <native_lapic_setup+32>: pushfq
0xffffffff8107fd51 <native_lapic_setup+33>: pop %rbx
0xffffffff8107fd52 <native_lapic_setup+34>: cli
0xffffffff8107fd53 <native_lapic_setup+35>: callq *0xffffffff81a216d0
0xffffffff8107fd5a <native_lapic_setup+42>: movslq %eax,%rsi
0xffffffff8107fd5d <native_lapic_setup+45>: cmpl $0x0,0xffffffff81edba40
0xffffffff8107fd65 <native_lapic_setup+53>: je 0xffffffff8107fdaa <native_lapic_setup+122>
0xffffffff8107fd67 <native_lapic_setup+55>: mov $0x803,%ecx
0xffffffff8107fd6c <native_lapic_setup+60>: rdmsr
0xffffffff8107fd6e <native_lapic_setup+62>: mov $0x810,%ecx
0xffffffff8107fd73 <native_lapic_setup+67>: (bad)
0xffffffff8107fd74 <native_lapic_setup+68>: (bad)
0xffffffff8107fd75 <native_lapic_setup+69>: jo 0xffffffff8107fd6e <native_lapic_setup+62>
0xffffffff8107fd77 <native_lapic_setup+71>: loopne 0xffffffff8107fcfc <native_lapic_xapic_mode+28>
0xffffffff8107fd79 <native_lapic_setup+73>: cmp $0x25,%al
0xffffffff8107fd7b <native_lapic_setup+75>: rex mov $0x740081ed,%edx
0xffffffff8107fd81 <native_lapic_setup+81>: cmp 0x808(%rcx),%edi
0xffffffff8107fd87 <native_lapic_setup+87>: rdmsr
0xffffffff8107fd89 <native_lapic_setup+89>: and $0xffffff00,%eax
0xffffffff8107fd8e <native_lapic_setup+94>: cmpl $0x0,0xffffffff81edba40
0xffffffff8107fd96 <native_lapic_setup+102>: je 0xffffffff810800e2 <native_lapic_setup+946>
0xffffffff8107fd9c <native_lapic_setup+108>: mfence
0xffffffff8107fd9f <native_lapic_setup+111>: xor %edx,%edx
0xffffffff8107fda1 <native_lapic_setup+113>: mov $0x808,%ecx
0xffffffff8107fda6 <native_lapic_setup+118>: wrmsr
0xffffffff8107fda8 <native_lapic_setup+120>: jmp 0xffffffff8107fdd6 <native_lapic_setup+166>
0xffffffff8107fdaa <native_lapic_setup+122>: mov 0xffffffff81edba38,%rax
0xffffffff8107fdb2 <native_lapic_setup+130>: mov $0x810,%ecx
0xffffffff8107fdb7 <native_lapic_setup+135>: (bad)
0xffffffff8107fdb8 <native_lapic_setup+136>: (bad)
0xffffffff8107fdb9 <native_lapic_setup+137>: jo 0xffffffff8107fdb2 <native_lapic_setup+130>
0xffffffff8107fdbb <native_lapic_setup+139>: (bad)
0xffffffff8107fdbc <native_lapic_setup+140>: xor %cl,-0x75(%rax)
0xffffffff8107fdbf <native_lapic_setup+143>: or $0x25,%al
0xffffffff8107fdc1 <native_lapic_setup+145>: cmp %bh,0xb881ed(%rdx)
0xffffffff8107fdc7 <native_lapic_setup+151>: (bad)
0xffffffff8107fdc8 <native_lapic_setup+152>: (bad)
0xffffffff8107fdc9 <native_lapic_setup+153>: jmpq *(%rbx)
0xffffffff8107fdcb <native_lapic_setup+155>: addl $0x8081,-0x77000000(%rax)
0xffffffff8107fdd5 <native_lapic_setup+165>: add %cl,-0x12(%rcx,%rbp,2)
0xffffffff8107fdd9 <native_lapic_setup+169>: push %rax
0xffffffff8107fdda <native_lapic_setup+170>: add (%rax),%eax
0xffffffff8107fddc <native_lapic_setup+172>: add %al,-0x45bfdac4(%rbx)
(In reply to Sylvain Garrigues from comment #11) Those "(bad)" instructions don't look good. Could you please install gdb from packages (pkg install gdb) and repeat the same procedure using kgdb7xxx (kgdb7121)? Perhaps the CPU type override is indeed the problem here.
(In reply to Andriy Gapon from comment #12)
root@ip-172-31-17-21:~ # kgdb7121 /mnt/boot/kernel/kernel
GNU gdb (GDB) 7.12.1 [GDB v7.12.1 for FreeBSD]
Copyright (C) 2017 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-portbld-freebsd12.0".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /mnt/boot/kernel/kernel...(no debugging symbols found)...done.
(kgdb) disassemble native_lapic_setup
Dump of assembler code for function native_lapic_setup:
0xffffffff8107fd30 <+0>: push %rbp
0xffffffff8107fd31 <+1>: mov %rsp,%rbp
0xffffffff8107fd34 <+4>: push %r15
0xffffffff8107fd36 <+6>: push %r14
0xffffffff8107fd38 <+8>: push %r13
0xffffffff8107fd3a <+10>: push %r12
0xffffffff8107fd3c <+12>: push %rbx
0xffffffff8107fd3d <+13>: sub $0x38,%rsp
0xffffffff8107fd41 <+17>: mov %edi,%r14d
0xffffffff8107fd44 <+20>: mov 0xffffffff81d6d320,%rax
0xffffffff8107fd4c <+28>: mov %rax,-0x30(%rbp)
0xffffffff8107fd50 <+32>: pushfq
0xffffffff8107fd51 <+33>: pop %rbx
0xffffffff8107fd52 <+34>: cli
0xffffffff8107fd53 <+35>: callq *0xffffffff81a216d0
0xffffffff8107fd5a <+42>: movslq %eax,%rsi
0xffffffff8107fd5d <+45>: cmpl $0x0,0xffffffff81edba40
0xffffffff8107fd65 <+53>: je 0xffffffff8107fdaa <native_lapic_setup+122>
0xffffffff8107fd67 <+55>: mov $0x803,%ecx
0xffffffff8107fd6c <+60>: rdmsr
0xffffffff8107fd6e <+62>: mov $0x810,%ecx
0xffffffff8107fd73 <+67>: bextr %ecx,%eax,%r12d
0xffffffff8107fd78 <+72>: cmpl $0x0,0xffffffff81edba40
0xffffffff8107fd80 <+80>: je 0xffffffff8107fdbd <native_lapic_setup+141>
0xffffffff8107fd82 <+82>: mov $0x808,%ecx
0xffffffff8107fd87 <+87>: rdmsr
0xffffffff8107fd89 <+89>: and $0xffffff00,%eax
0xffffffff8107fd8e <+94>: cmpl $0x0,0xffffffff81edba40
0xffffffff8107fd96 <+102>: je 0xffffffff810800e2 <native_lapic_setup+946>
0xffffffff8107fd9c <+108>: mfence
0xffffffff8107fd9f <+111>: xor %edx,%edx
0xffffffff8107fda1 <+113>: mov $0x808,%ecx
0xffffffff8107fda6 <+118>: wrmsr
0xffffffff8107fda8 <+120>: jmp 0xffffffff8107fdd6 <native_lapic_setup+166>
0xffffffff8107fdaa <+122>: mov 0xffffffff81edba38,%rax
0xffffffff8107fdb2 <+130>: mov $0x810,%ecx
0xffffffff8107fdb7 <+135>: bextr %ecx,0x30(%rax),%r12d
0xffffffff8107fdbd <+141>: mov 0xffffffff81edba38,%rcx
0xffffffff8107fdc5 <+149>: mov $0xffffff00,%eax
0xffffffff8107fdca <+154>: and 0x80(%rcx),%eax
0xffffffff8107fdd0 <+160>: mov %eax,0x80(%rcx)
0xffffffff8107fdd6 <+166>: imul $0x350,%rsi,%r13
0xffffffff8107fddd <+173>: cmpl $0x0,0xffffffff81edba40
0xffffffff8107fde5 <+181>: mov %rbx,-0x58(%rbp)
0xffffffff8107fde9 <+185>: je 0xffffffff8107fe29 <native_lapic_setup+249>
0xffffffff8107fdeb <+187>: mov $0x80f,%ecx
0xffffffff8107fdf0 <+192>: rdmsr
0xffffffff8107fdf2 <+194>: mov %eax,%ecx
0xffffffff8107fdf4 <+196>: and $0xfffffc00,%ecx
0xffffffff8107fdfa <+202>: cmpl $0x0,0xffffffff81edba28
0xffffffff8107fe02 <+210>: mov $0x1ff,%edx
0xffffffff8107fe07 <+215>: mov $0x11ff,%eax
0xffffffff8107fe0c <+220>: cmove %edx,%eax
0xffffffff8107fe0f <+223>: or %ecx,%eax
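Note on the two listings: the older kgdb prints "(bad)" at +67 where kgdb7121 prints a single bextr. That is consistent with the older disassembler not knowing the VEX prefix that BMI1 instructions use. The sketch below decodes the five bytes at +67 by hand. The byte values are an assumption: they are not shown anywhere in this report and were reconstructed from the addresses and branch targets of the mis-decoded "jo" and "loopne" lines. The function name is made up for illustration, and only the register-to-register 32-bit form is handled.

```python
# Bytes at native_lapic_setup+67..+71. ASSUMPTION: reconstructed from the
# old kgdb's mis-decode ("(bad) (bad) jo ... loopne ..."), not taken from a dump.
CODE = bytes([0xC4, 0x62, 0x70, 0xF7, 0xE0])

REGS32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"] + [
    f"r{i}d" for i in range(8, 16)
]


def decode_vex3_bextr(code):
    """Decode a 32-bit register-form BEXTR with a 3-byte VEX prefix (AT&T syntax)."""
    if len(code) != 5 or code[0] != 0xC4:
        raise ValueError("expected 5 bytes starting with the 0xC4 VEX escape")
    p1, p2, opcode, modrm = code[1], code[2], code[3], code[4]

    r = ((p1 >> 7) & 1) ^ 1      # VEX.R is stored inverted; extends ModRM.reg
    b = ((p1 >> 5) & 1) ^ 1      # VEX.B is stored inverted; extends ModRM.rm
    opcode_map = p1 & 0x1F       # 2 selects the 0F 38 opcode map
    w = p2 >> 7                  # 0 selects 32-bit operands
    vvvv = (~(p2 >> 3)) & 0xF    # extra source register, stored inverted
    vec_len = (p2 >> 2) & 1
    pp = p2 & 3

    if (opcode_map, opcode, pp, vec_len, w) != (2, 0xF7, 0, 0, 0):
        raise ValueError("not a 32-bit BEXTR encoding")
    if modrm >> 6 != 3:
        raise ValueError("memory operands are not handled in this sketch")

    dest = REGS32[((modrm >> 3) & 7) | (r << 3)]
    src = REGS32[(modrm & 7) | (b << 3)]
    control = REGS32[vvvv]
    return f"bextr %{control},%{src},%{dest}"


print(decode_vex3_bextr(CODE))
```

In 64-bit mode the bytes 0xC4 and 0x62 are not valid one-byte opcodes on their own, which would explain the two "(bad)" entries followed by a bogus "jo" (0x70) in the first listing.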
(In reply to Sylvain Garrigues from comment #13) Thanks! I suspect that the problem is with the BEXTR instruction, which is available on Haswell but does not seem to be provided by (at least some) EC2 instances. I'm not sure whether it requires any support from the hypervisor or whether the underlying hardware does not support the instruction. So, I bet on your guess from comment #9.
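For readers unfamiliar with BEXTR: it extracts a bit field from its source, with the start bit taken from bits 7:0 of the control register and the field length from bits 15:8. With the control value 0x810 loaded into %ecx in the listing above, that is start 16 and length 8, which appears to correspond to the maximum-LVT field of the LAPIC version register. A minimal sketch of the semantics follows; the function name and the sample register value are made up for illustration.

```python
def bextr(src, control, width=32):
    """Emulate BEXTR: extract `length` bits of `src` starting at bit `start`.

    start  = control bits 7:0
    length = control bits 15:8
    """
    start = control & 0xFF
    length = (control >> 8) & 0xFF
    src &= (1 << width) - 1          # treat src as an unsigned `width`-bit value
    if start >= width:
        return 0                     # the field lies entirely past the operand
    return (src >> start) & ((1 << length) - 1)


# Control value from the disassembly: mov $0x810,%ecx
# Hypothetical version-register value, for illustration only.
sample = 0x00050014
print(hex(bextr(sample, 0x810)))     # same as (sample >> 16) & 0xff
```

Without BMI1 the compiler emits a shift and a mask for the same expression, which is why a kernel built without the haswell CPUTYPE does not contain this instruction.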
(In reply to Andriy Gapon from comment #14) I'm recompiling the kernel with GENERIC (instead of GENERIC-NODEBUG) and no CPUTYPE line. Hold your breath and expect the end of the suspense in a couple of hours.
Guys, I recompiled without CPUTYPE and now the kernel boots. I am so ashamed to have disturbed you for this. Though, my CPU *is* Haswell:

# sysctl -a | egrep -i 'hw.machine|hw.ncpu|hw.model'
hw.machine: amd64
hw.model: Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz
hw.ncpu: 1
hw.machine_arch: amd64

So I'm pissed I can't use CPUTYPE here. Don't know if I should blame KVM or Amazon. I have another older EC2 machine where CPUTYPE=ivybridge works BTW.

Nevertheless, thanks so much for quick and accurate feedback, you guys are the reason I wouldn't dream of using any other OS than FreeBSD. THANKS! Bug can be closed.
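A note on the model string: hw.model only reports the CPU brand string, not which instruction-set extensions the guest is actually offered. BEXTR belongs to BMI1, which is advertised in CPUID leaf 7 (subleaf 0), EBX bit 3; AVX2 is bit 5 and BMI2 is bit 8 of the same register. Whether these bits are cleared on this instance was not checked in this thread, so the EBX values below are hypothetical and the helper names are made up. The sketch only shows how such a register value would be interpreted.

```python
# Feature bits in CPUID.(EAX=7,ECX=0):EBX that -march=haswell relies on.
LEAF7_EBX_BITS = {"BMI1": 3, "AVX2": 5, "BMI2": 8}


def leaf7_features(ebx):
    """Return the names of the known feature bits set in a leaf-7 EBX value."""
    return {name for name, bit in LEAF7_EBX_BITS.items() if (ebx >> bit) & 1}


def supports_bextr(ebx):
    """BEXTR is part of BMI1, so it needs EBX bit 3."""
    return "BMI1" in leaf7_features(ebx)


# HYPOTHETICAL register values, not read from the machine in this report.
all_three = (1 << 3) | (1 << 5) | (1 << 8)
none_set = 0

print(sorted(leaf7_features(all_three)), supports_bextr(all_three))
print(sorted(leaf7_features(none_set)), supports_bextr(none_set))
```

If BMI1 is not reported, a kernel or world built with CPUTYPE?=haswell can fault with an illegal instruction as soon as it reaches a BEXTR, regardless of what the brand string says.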
FWIW, here is a similar problem reported, with a similar resolution: https://lists.freebsd.org/pipermail/freebsd-stable/2016-July/084960.html You might want to ask Amazon support if you are really curious.