SuperMicro SuperServer 6022L-6 will not fully boot RELENG_7 unless I boot with ACPI disabled. RELENG_7_0 does not crash on the same hardware with the same config. Crash is as follows:

Fatal trap 12: page fault while in kernel mode
cpuid = 3; apic id = 03
fault virtual address = 0x2043455c
fault code = supervisor read, page not present
instruction pointer = 0x20:0xc0742c86
stack pointer = 0x28:0xe8cada0c
frame pointer = 0x28:0xe8cada38
code segment = base rx0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, def32 1, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 68 (sysctl)
trap number = 12
panic: page fault
cpuid = 3
Uptime: 6s
Physical memory: 2035 MB
Dumping 65 MB: 50 34 18 2

The crash happens just after the "Entropy harvesting..." line, before swap is started. As you can see in the crash output, the offending process is sysctl. I can boot to single user mode, but if I issue sysctl -a while there, it also crashes.

When sysctl -a is run in single user mode, the last three lines before the crash are (transcribed by hand, no serial console available):

dev.pcib.3.%location: handle=\_SB_.PCI3
dev.pcib.3.%pnpinfo: _HID=PNP0A03 UID=3
dev.pcib.3.%parent: acpi0

With a working RELENG_7_0 the lines immediately following this are:

dev.pcib.4.%desc: ACPI Host-PCI bridge
dev.pcib.4.%driver: pcib
dev.pcib.4.%location: handle=\_SB_.PCI4
dev.pcib.4.%pnpinfo: _HID=PNP0A03 _UID=4
dev.pcib.4.%parent: acpi0

I tried a binary search of the source tree to narrow down the crash. I found that one possible vector for the crash was introduced between 2007/12/19 20:00:00 (booted OK) and 2007/12/19 23:59:00 (crashed), which left me with only a handful of files to test. By process of elimination, I found that if I backed some changes out in src/sys/i386/i386/machdep.c, the crash stopped.

src/sys/i386/i386/machdep.c v1.658 2007/08/09 njl - Boots OK
src/sys/i386/i386/machdep.c v1.658.2.1 2007/12/19 rpaulo - Crashes

The confusing part (to me) is that my next step was to update all the way to RELENG_7 as of yesterday, then back out those same changes, but the crash still happened. So either I misidentified the cause of the crash -- which is quite possible -- or it was reintroduced in some other change (or both!).

kgdb output from vmcore.0:

Unread portion of the kernel message buffer:
Copyright (c) 1992-2008 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 7.0-PRERELEASE #0: Mon Feb 25 15:22:54 EST 2008
root@test1.hpcisp.com:/usr/obj/usr/src/sys/GENERIC
Timecounter "i8254" frequency 1193182 Hz quality 0
CPU: Intel(R) XEON(TM) CPU 2.00GHz (1999.94-MHz 686-class CPU)
Origin = "GenuineIntel" Id = 0xf24 Stepping = 4
Features=0x3febfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM>
Logical CPUs per core: 2
real memory = 2147418112 (2047 MB)
avail memory = 2091872256 (1994 MB)
ACPI APIC Table: <RCC GCHE >
FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs
cpu0 (BSP): APIC ID: 0
cpu1 (AP): APIC ID: 1
cpu2 (AP): APIC ID: 2
cpu3 (AP): APIC ID: 3
ACPI Warning (tbfadt-0505): Optional field "Gpe1Block" has zero address or length: 0 0/8 [20070320]
MADT: Forcing active-low polarity and level trigger for SCI
ioapic0 <Version 1.1> irqs 0-15 on motherboard
ioapic1 <Version 1.1> irqs 16-31 on motherboard
ioapic2 <Version 1.1> irqs 32-47 on motherboard
kbd1 at kbdmux0
ath_hal: 0.9.20.3 (AR5210, AR5211, AR5212, RF5111, RF5112, RF2413, RF5413)
hptrr: HPT RocketRAID controller driver v1.1 (Feb 25 2008 15:20:56)
acpi0: <RCC GCHE> on motherboard
ACPI Warning (dswload-0794): Type override - [DEB_] had invalid type (Integer) for Scope operator, changed to (Scope) [20070320]
ACPI Warning (dswload-0794): Type override - [MLIB] had invalid type (Integer) for Scope operator, changed to (Scope) [20070320]
ACPI Warning (dswload-0794): Type override - [IO__] had invalid type (Integer) for Scope operator, changed to (Scope) [20070320]
ACPI Warning (dswload-0794): Type override - [DATA] had invalid type (String) for Scope operator, changed to (Scope) [20070320]
ACPI Warning (dswload-0794): Type override - [SIO_] had invalid type (String) for Scope operator, changed to (Scope) [20070320]
ACPI Warning (dswload-0794): Type override - [SB__] had invalid type (String) for Scope operator, changed to (Scope) [20070320]
ACPI Warning (dswload-0794): Type override - [PM__] had invalid type (String) for Scope operator, changed to (Scope) [20070320]
ACPI Warning (dswload-0794): Type override - [ICNT] had invalid type (String) for Scope operator, changed to (Scope) [20070320]
ACPI Warning (dswload-0794): Type override - [ACPI] had invalid type (String) for Scope operator, changed to (Scope) [20070320]
ACPI Warning (dswload-0794): Type override - [IORG] had invalid type (String) for Scope operator, changed to (Scope) [20070320]
ACPI Warning (dswload-0794): Type override - [SB__] had invalid type (String) for Scope operator, changed to (Scope) [20070320]
ACPI Warning (dswload-0794): Type override - [PM__] had invalid type (String) for Scope operator, changed to (Scope) [20070320]
ACPI Warning (dswload-0794): Type override - [SIO_] had invalid type (String) for Scope operator, changed to (Scope) [20070320]
ACPI Warning (dswload-0794): Type override - [PM__] had invalid type (String) for Scope operator, changed to (Scope) [20070320]
ACPI Warning (dswload-0794): Type override - [BIOS] had invalid type (Integer) for Scope operator, changed to (Scope) [20070320]
ACPI Warning (dswload-0794): Type override - [CMOS] had invalid type (Integer) for Scope operator, changed to (Scope) [20070320]
ACPI Warning (dswload-0794): Type override - [KBC_] had invalid type (Integer) for Scope operator, changed to (Scope) [20070320]
ACPI Warning (dswload-0794): Type override - [OEM_] had invalid type (Integer) for Scope operator, changed to (Scope) [20070320]
acpi0: [ITHREAD]
acpi0: Power Button (fixed)
acpi0: reservation of 0, a0000 (3) failed
acpi0: reservation of 100000, 7ff00000 (3) failed
Timecounter "ACPI-safe" frequency 3579545 Hz quality 850
acpi_timer0: <32-bit timer at 3.579545MHz> port 0x508-0x50b on acpi0
cpu0: <ACPI CPU> on acpi0
p4tcc0: <CPU Frequency Thermal Control> on cpu0
cpu1: <ACPI CPU> on acpi0
p4tcc1: <CPU Frequency Thermal Control> on cpu1
cpu2: <ACPI CPU> on acpi0
p4tcc2: <CPU Frequency Thermal Control> on cpu2
cpu3: <ACPI CPU> on acpi0
p4tcc3: <CPU Frequency Thermal Control> on cpu3
acpi_button0: <Sleep Button> on acpi0
pcib0: <ACPI Host-PCI bridge> port 0xcf8-0xcff on acpi0
pci0: <ACPI PCI bus> on pcib0
vgapci0: <VGA-compatible display> port 0xa800-0xa8ff mem 0xfd000000-0xfdffffff,0xfe5ff000-0xfe5fffff irq 18 at device 2.0 on pci0
fxp0: <Intel 82550 Pro/100 Ethernet> port 0xae80-0xaebf mem 0xfe5fc000-0xfe5fcfff,0xfe580000-0xfe59ffff irq 17 at device 4.0 on pci0
miibus0: <MII bus> on fxp0
inphy0: <i82555 10/100 media interface> PHY 1 on miibus0
inphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
fxp0: Ethernet address: 00:30:48:20:a3:9e
fxp0: [ITHREAD]
fxp1: <Intel 82550 Pro/100 Ethernet> port 0xaf00-0xaf3f mem 0xfe5fd000-0xfe5fdfff,0xfe5a0000-0xfe5bffff irq 19 at device 5.0 on pci0
miibus1: <MII bus> on fxp1
inphy1: <i82555 10/100 media interface> PHY 1 on miibus1
inphy1: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
fxp1: Ethernet address: 00:30:48:20:a3:9f
fxp1: [ITHREAD]
isab0: <PCI-ISA bridge> at device 15.0 on pci0
isa0: <ISA bus> on isab0
atapci0: <ServerWorks CSB5 UDMA100 controller> port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xffa0-0xffaf at device 15.1 on pci0
ata0: <ATA channel 0> on atapci0
ata0: [ITHREAD]
ata1: <ATA channel 1> on atapci0
ata1: [ITHREAD]
ohci0: <OHCI (generic) USB controller> mem 0xfe5fe000-0xfe5fefff irq 10 at device 15.2 on pci0
ohci0: [GIANT-LOCKED]
ohci0: [ITHREAD]
usb0: OHCI version 1.0, legacy support
usb0: SMM does not respond, resetting
usb0: <OHCI (generic) USB controller> on ohci0
usb0: USB revision 1.0
uhub0: <(0x1166) OHCI root hub, class 9/0, rev 1.00/1.00, addr 1> on usb0
uhub0: 4 ports with 4 removable, self powered
pcib1: <ACPI Host-PCI bridge> on acpi0
pci1: <ACPI PCI bus> on pcib1
pcib2: <ACPI Host-PCI bridge> on acpi0
pci2: <ACPI PCI bus> on pcib2
pcib3: <ACPI Host-PCI bridge> on acpi0
pci3: <ACPI PCI bus> on pcib3
pcib4: <ACPI Host-PCI bridge> on acpi0
pci4: <ACPI PCI bus> on pcib4
asr0: <Adaptec Caching SCSI RAID> mem 0xfeb00000-0xfebfffff,0xfb000000-0xfbffffff,0xf8000000-0xf9ffffff irq 29 at device 3.0 on pci4
asr0: [GIANT-LOCKED]
asr0: [ITHREAD]
asr0: ADAPTEC 2005S FW Rev. 380E, 2 channel, 2000 CCBs, Protocol I2O
atkbdc0: <Keyboard controller (i8042)> port 0x60,0x64 irq 1 on acpi0
atkbd0: <AT Keyboard> irq 1 on atkbdc0
kbd0 at atkbd0
atkbd0: [GIANT-LOCKED]
atkbd0: [ITHREAD]
psm0: <PS/2 Mouse> irq 12 on atkbdc0
psm0: [GIANT-LOCKED]
psm0: [ITHREAD]
psm0: model NetMouse/NetScroll Optical, device ID 0
fdc0: <floppy drive controller (FDE)> port 0x3f2-0x3f3,0x3f4-0x3f5,0x3f7 irq 6 drq 2 on acpi0
fdc0: [FILTER]
fd0: <1440-KB 3.5" drive> on fdc0 drive 0
sio0: <16550A-compatible COM port> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0
sio0: type 16550A
sio0: [FILTER]
sio1: <16550A-compatible COM port> port 0x2f8-0x2ff irq 3 on acpi0
sio1: type 16550A
sio1: [FILTER]
pmtimer0 on isa0
orm0: <ISA Option ROMs> at iomem 0xc0000-0xc7fff,0xc8000-0xcdfff,0xce000-0xcefff,0xcf000-0xcffff pnpid ORM0000 on isa0
ppc0: <Parallel port> at port 0x378-0x37f irq 7 on isa0
ppc0: Generic chipset (ECP/PS2/NIBBLE) in COMPATIBLE mode
ppc0: FIFO with 16/16/8 bytes threshold
ppbus0: <Parallel port bus> on ppc0
ppbus0: [ITHREAD]
plip0: <PLIP network interface> on ppbus0
lpt0: <Printer> on ppbus0
lpt0: Interrupt-driven port
ppi0: <Parallel I/O> on ppbus0
ppc0: [GIANT-LOCKED]
ppc0: [ITHREAD]
sc0: <System console> at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x300>
vga0: <Generic ISA VGA> at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
Timecounters tick every 1.000 msec
hptrr: no controller detected.
acd0: CDROM <MATSHITA CR-177/7T0D> at ata1-master UDMA33
da0 at asr0 bus 0 target 0 lun 0
da0: <ADAPTEC RAID-5 380E> Fixed Direct Access SCSI-2 device
ses0 at asr0 bus 0 target 6 lun 0
ses0: <SUPER GEM318 0> Fixed Processor SCSI-2 device
SMP: AP CPU #3 Launched!
SMP: AP CPU #2 Launched!
SMP: AP CPU #1 Launched!
Trying to mount root from ufs:/dev/da0s1a
<118>Loading configuration files.
<118>kernel dumps on /dev/da0s1b
<118>Entropy harvesting: <118> interrupts <118> ethernet <118> point_to_point

Fatal trap 12: page fault while in kernel mode
cpuid = 3; apic id = 03
fault virtual address = 0x2043455c
fault code = supervisor read, page not present
instruction pointer = 0x20:0xc0742c86
stack pointer = 0x28:0xe8cada0c
frame pointer = 0x28:0xe8cada38
code segment = base rx0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, def32 1, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 68 (sysctl)
trap number = 12
panic: page fault
cpuid = 3
Uptime: 6s
Physical memory: 2035 MB
Dumping 65 MB: 50 34 18 2

#0 doadump () at pcpu.h:195
195 pcpu.h: No such file or directory.
in pcpu.h
(kgdb) bt
#0 doadump () at pcpu.h:195
#1 0xc073a688 in boot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:409
#2 0xc073a941 in panic (fmt=Variable "fmt" is not available.
) at /usr/src/sys/kern/kern_shutdown.c:563
#3 0xc0a19dc0 in trap_fatal (frame=0xe8cad9cc, eva=541279580) at /usr/src/sys/i386/i386/trap.c:899
#4 0xc0a1a030 in trap_pfault (frame=0xe8cad9cc, usermode=0, eva=541279580) at /usr/src/sys/i386/i386/trap.c:812
#5 0xc0a1a9ad in trap (frame=0xe8cad9cc) at /usr/src/sys/i386/i386/trap.c:490
#6 0xc0a01cab in calltrap () at /usr/src/sys/i386/i386/exception.s:139
#7 0xc0742c86 in sysctl_sysctl_next_ls (lsp=Variable "lsp" is not available.
) at /usr/src/sys/kern/kern_sysctl.c:630
#8 0xc0742d46 in sysctl_sysctl_next_ls (lsp=Variable "lsp" is not available.
) at /usr/src/sys/kern/kern_sysctl.c:618
#9 0xc0742d83 in sysctl_sysctl_next_ls (lsp=Variable "lsp" is not available.
) at /usr/src/sys/kern/kern_sysctl.c:630
#10 0xc0742d83 in sysctl_sysctl_next_ls (lsp=Variable "lsp" is not available.
) at /usr/src/sys/kern/kern_sysctl.c:630
#11 0xc0742de6 in sysctl_sysctl_next (oidp=0xc0b4c940, arg1=0xe8cadc1c, arg2=4, req=0xe8cadba4) at /usr/src/sys/kern/kern_sysctl.c:651
#12 0xc07436f2 in sysctl_root (oidp=Variable "oidp" is not available.
) at /usr/src/sys/kern/kern_sysctl.c:1306
#13 0xc074382e in userland_sysctl (td=0xc5574210, name=0xe8cadc14, namelen=6, old=0xbfbfe4e8, oldlenp=0xbfbfe598, inkernel=0, new=0x0, newlen=0, retval=0xe8cadc10, flags=0) at /usr/src/sys/kern/kern_sysctl.c:1401
#14 0xc0744462 in __sysctl (td=0xc5574210, uap=0xe8cadcfc) at /usr/src/sys/kern/kern_sysctl.c:1336
#15 0xc0a1a378 in syscall (frame=0xe8cadd38) at /usr/src/sys/i386/i386/trap.c:1035
#16 0xc0a01d10 in Xint0x80_syscall () at /usr/src/sys/i386/i386/exception.s:196
#17 0x00000033 in ?? ()
Previous frame inner to this frame (corrupt stack?)

This is a testing machine that is only being used to evaluate 7.0 for use on similar hardware. I can take whatever debugging steps are needed; just let me know what information is necessary to help resolve the issue. I tried posting this information to the -STABLE list, but received no replies.
System is running with the most current BIOS available from the OEM. RAM tested OK with memtest86+ left running for a day or so.

Fix:
Workaround is to run with ACPI disabled, but that is not desired. One part of the crash was possibly introduced with rev v1.658.2.1 of src/sys/i386/i386/machdep.c, but I am unable to repeat that fix on recent RELENG_7 sources.

How-To-Repeat:
Attempt to boot with a RELENG_7 world/kernel on a SuperMicro SuperServer 6022L-6 with ACPI enabled. Alternatively, boot to single user mode and issue "sysctl -a". Crashes every time in the exact same place.
State Changed
From-To: open->feedback

To submitter: Firstly, it looks to me like the commit that you narrowed the panic down to is not actually responsible for the problem you are seeing - my suspicion is that it actually just moves the layout of memory around enough to avoid seeing the problem.

From single user mode, can you determine which of the following panic:

sysctl dev.pcib.3
sysctl dev.pcib.4

(I'm guessing it's the latter, but it's worth checking).

Secondly, I wonder if you could test setting debug.acpi.disabled="ec" from the loader, and see if that makes any difference? I notice that the "fault virtual address" is 0x2043455c, or " CE", but this may be a coincidence...

Lastly, are you able to recompile the kernel with debugging support (options KDB and DDB), and also add printf's to /usr/src/sys/kern/kern_sysctl.c at lines 618 and 630 (between the setting of lsp and calling sysctl_sysctl_next_ls()) to show the value of the various variables? Something like this line should work:

printf("lsp=%p, oidp=%p, oidpp=%p\n", lsp, oidp, oidpp);

If you can still recreate the panic with these printf's and the debugger compiled in, hopefully we can get more information out of your system as to exactly what is happening.
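
The reading of the fault virtual address as text can be checked mechanically. Below is a minimal sketch in C, assuming the address is simply split into its four bytes, most significant first, and each byte is shown as a character when it is printable ASCII. The helper name `addr_to_ascii` is hypothetical and is not part of the FreeBSD sources; it only illustrates the observation and says nothing about whether the match is a coincidence.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a FreeBSD API): render a 32-bit value as
 * four characters, most significant byte first. Bytes outside the
 * printable ASCII range 0x20..0x7e are shown as '.'.
 * The caller supplies a buffer of at least 5 bytes.
 */
static void
addr_to_ascii(uint32_t addr, char out[5])
{
	assert(out != NULL);

	for (int i = 0; i < 4; i++) {
		/* Shift the wanted byte down: bits 31..24 first, 7..0 last. */
		unsigned char c = (unsigned char)((addr >> (24 - 8 * i)) & 0xff);

		out[i] = (c >= 0x20 && c < 0x7f) ? (char)c : '.';
	}
	out[4] = '\0';
}
```

Called with 0x2043455c, the buffer holds a space, 'C', 'E', and a backslash (0x5c). An address that looks like text in this way is often a hint that a pointer field was overwritten with string data, but that is only a heuristic.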
Responsible Changed
From-To: freebsd-i386->gavin

Track
> my suspicion is that it actually just moves the layout of memory
> around enough to avoid seeing the problem.

I have no doubt that you are correct in that. The code changes in that file seemed very unrelated to anything near the crash, but I thought it was worth mentioning anyhow.

> From single user mode, can you determine which of the following panic:
> sysctl dev.pcib.3

Crashes at the end. I had a debug kernel built already, and I was also able to add the printfs with no problem. Here is the last bit of output:

dev.pcib.3.%parent: acpi0
lsp=0xc0bcf314 oidp=0xc0b678e0 iodpp=0xe8c76b4c
lsp=0xc525c7d0 oidp=0xc5262040 iodpp=0xe8c76b4c
lsp=0xc52d9700 oidp=0xc52ed4c0 iodpp=0xe8c76b4c
lsp=0xc52ee140 oidp=0xc52ed140 iodpp=0xe8c76b4c
[crashes here]

kdb says the crash happened at:

sysctl_sysctl_next_ls+0x32 movl 0x8(%esi),%eax
> print %esi
c074b9ba
> print %eax
c074b9ba

> sysctl dev.pcib.4

Yields: unknown oid 'dev.pcib.4'

> I wonder if you could test setting debug.acpi.disabled="ec"
> from the loader

This made it get farther along in the boot sequence, but it then crashed in devd, also a fatal trap 12, with a fault virtual address of 0x108. I can get the whole copy of the crash output if you'd like.

> If you can still recreate the panic with these printf's and the debugger
> compiled in, hopefully we can get more information out of your system as to
> exactly what is happening.

It still crashes in the same place, perfectly repeatable, and as far as I can tell the addresses are the same each time.

Let me know how you'd like me to proceed. Thanks for the quick response, and all the help/ideas.

Jim
State Changed
From-To: feedback->open

Note that feedback was received.
For bugs matching the following criteria: Status: In Progress Changed: (is less than) 2014-06-01 Reset to default assignee and clear in-progress tags. Mail being skipped
Given the huge amount of change in the ACPI code, I'm going to close this as OBE. The info here is about useless in tracking things down with the latest code. If this problem persists on newer versions of FreeBSD (11 or 12), please file a new bug with updated info.