Bug 213856

Summary: Fatal trap12: page fault while in kernel mode; Supervisor read data, page not present
Product: Base System
Reporter: IPTRACE <arkadiusz.majewski>
Component: kern
Assignee: freebsd-bugs (Nobody) <bugs>
Status: Closed Works As Intended
Severity: Affects Only Me
CC: markj, theraven
Priority: ---
Version: 11.0-STABLE   
Hardware: amd64   
OS: Any   

Description IPTRACE 2016-10-28 15:36:05 UTC
I recently installed 11.0-RELEASE-p2.
After 2d14h25m33s of uptime the system crashed. The panic message follows.

Fatal trap12: page fault while in kernel mode
cpuid = 2 apic id = 02
fault virtual address = 0x48
fault code = supervisor read data, page not present
instruction pointer = 0x28:0xffffffff80e186b0
stack pointer = 0x28:0xfffffe3fc9de37e0
frame pointer = 0x28:0xfffffe3fc9de3840
code segment = base rx0, limit 0xfffff, type 0x1b
             = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 17 (pagedaemon)
trap number = 12
panic: page fault
cpuid = 2
KDB: stack backtrace:
Comment 1 IPTRACE 2016-10-31 17:39:28 UTC
The system crashed a second time.
Uptime: 3d0h23m19s

Fatal trap 9: general protection fault while in kernel mode
cpuid = 0; apic id = 00
instruction pointer	= 0x20:0xffffffff80e18717
stack pointer		= 0x28:0xfffffe3fc9de37e0
frame pointer		= 0x28:0xfffffe3fc9de37e0
code segment		= base rx0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 17 (pagedaemon)
trap number		= 9
panic: general protection fault
cpuid = 0
KDB: stack backtrace:
#0 0xffffffff80b24747 at kdb_backtrace+0x67
#1 0xffffffff80ad9ab2 at vpanic+0x182
#2 0xffffffff80ad9923 at panic+0x43
#3 0xffffffff80fa9d51 at trap_fatal+0x351
#4 0xffffffff80fa99e8 at trap+0x768
#5 0xffffffff80f8d101 at calltrap+0x8
#6 0xffffffff80e17dc6 at bucket_cache_drain+0x136
#7 0xffffffff80e1150d at zone_drain_wait+0xed
#8 0xffffffff80e1638d at uma_reclaim_locked+0x7d
#9 0xffffffff80e16297 at uma_reclaim+0x77
#10 0xffffffff80e38042 at vm_pageout+0x502
#11 0xffffffff80e90725 at fork_exit+0x85
#12 0xffffffff80f8d63e at fork_trampoline+0xe
Comment 2 Mark Johnston freebsd_committer freebsd_triage 2016-10-31 19:11:38 UTC
Are kernel dumps getting saved to /var/crash after the panics? If so, could you
open one in kgdb and obtain the backtrace?
Comment 3 IPTRACE 2016-10-31 19:23:39 UTC
Unfortunately, there is nothing there.
Some details that may help identify the cause:

1. I upgraded from 10.3-RELEASE-p11 to 11.0-RELEASE-p2, and the kernel crashes started after that.
2. This system is used as a virtual machine host (several dozen FreeBSD guests and one Windows 10 guest).
3. 2x CPU Xeon 20 cores with HT and 256 GB RAM.
Comment 4 Mark Johnston freebsd_committer freebsd_triage 2016-10-31 19:27:58 UTC
Do you have a dumpdev configured in rc.conf? If not, the kernel will have nowhere to dump core when it panics.
Comment 5 IPTRACE 2016-10-31 19:41:24 UTC
No.
I've set it now.

dumpdev="AUTO"
dumpdir="/var/crash"

How can I enable the dump device without rebooting?
Can I force the system to restart automatically after a kernel crash?
Comment 6 Mark Johnston freebsd_committer freebsd_triage 2016-10-31 19:45:15 UTC
Run:

# service dumpon start

and

# dumpon -l

to verify that it configured the dump device correctly.

The kernel should reboot automatically once it's finished dumping core.
Comment 7 IPTRACE 2016-10-31 20:13:28 UTC
% service dumpon start
No suitable dump device was found.
% dumpon -l
/dev/null
Comment 8 Mark Johnston freebsd_committer freebsd_triage 2016-10-31 23:15:28 UTC
dumpdev=AUTO will cause the dumpon script to select a swap partition for use as a kerneldump device. In general, there needs to be an unused partition available somewhere for kernel dumps to work.
Comment 9 IPTRACE 2016-11-01 07:37:32 UTC
Because the machine has a lot of RAM, I don't use a swap partition.
Is it possible to use the /var/crash directory instead?
/var is a separate UFS partition.
Comment 10 David Chisnall freebsd_committer freebsd_triage 2016-11-01 09:44:08 UTC
This does not appear to have anything to do with standards compliance, so assigning it to standards@ is inappropriate.  Resetting the assignee: it looks like a VM bug and should be assigned to someone with virtual memory expertise.
Comment 11 David Chisnall freebsd_committer freebsd_triage 2016-11-01 09:45:07 UTC
Okay, apparently I'm not because FreeBSD bugzilla thinks that standards@ is the appropriate default assignee for irrelevant bugs.
Comment 12 Mark Johnston freebsd_committer freebsd_triage 2016-11-01 15:47:12 UTC
Will follow up by email.
Comment 13 Mark Johnston freebsd_committer freebsd_triage 2016-11-05 22:07:20 UTC
In both cases, we crashed in bucket_drain() when resetting bucket->ub_cnt to 0:

   0xffffffff80e17d90 <+256>:   movslq %r13d,%r13
   0xffffffff80e17d93 <+259>:   mov    0x18(%rbx,%r13,8),%rdi
   0xffffffff80e17d98 <+264>:   mov    0x10c(%r14),%esi
   0xffffffff80e17d9f <+271>:   callq  *0xe8(%r14)
   0xffffffff80e17da6 <+278>:   inc    %r13d
   0xffffffff80e17da9 <+281>:   movswl 0x10(%rbx),%eax
   0xffffffff80e17dad <+285>:   cmp    %eax,%r13d
   0xffffffff80e17db0 <+288>:   jl     0xffffffff80e17d90 <bucket_cache_drain+256>
   0xffffffff80e17db2 <+290>:   mov    0x100(%r14),%rdi
   0xffffffff80e17db9 <+297>:   movswl %ax,%edx
   0xffffffff80e17dbc <+300>:   mov    %r12,%rsi
   0xffffffff80e17dbf <+303>:   callq  *0xf8(%r14)
   0xffffffff80e17dc6 <+310>:   movw   $0x0,0x10(%rbx) <--

rbx is a callee-saved register that is dereferenced after every call to uz_fini,
so it seems as though the uz_release function for the zone is somehow corrupting
its frame. Because this is happening in the context of uma_reclaim(), we know
that this can't be a cache zone, so uz_release is zone_release().
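For readers following the disassembly, here is a simplified userland model of the drain logic it appears to correspond to. It is a sketch only: the model_* types, the count_* callbacks and the fixed-size item array are hypothetical stand-ins, not the kernel's definitions, and the real UMA structures may differ in detail. The field layout is chosen so that, on an LP64 platform, ub_cnt and ub_bucket land at the 0x10 and 0x18 displacements seen above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the bucket; not the kernel's definition. */
struct model_bucket {
	void	*ub_link[2];	/* stands in for a two-pointer list entry */
	int16_t	 ub_cnt;	/* number of items currently cached */
	int16_t	 ub_entries;	/* capacity of the bucket */
	void	*ub_bucket[4];	/* item pointers (fixed size in this model) */
};

/* Hypothetical stand-in for the zone fields used while draining. */
struct model_zone {
	void	(*uz_fini)(void *item, int size);
	void	(*uz_release)(void *arg, void **store, int cnt);
	void	 *uz_arg;
	int	  uz_size;
};

/* On LP64 these offsets match 0x10(%rbx) and 0x18(%rbx,%r13,8). */
_Static_assert(sizeof(void *) != 8 ||
    offsetof(struct model_bucket, ub_cnt) == 0x10, "ub_cnt offset");
_Static_assert(sizeof(void *) != 8 ||
    offsetof(struct model_bucket, ub_bucket) == 0x18, "ub_bucket offset");

static void
model_bucket_drain(struct model_zone *zone, struct model_bucket *bucket)
{
	int i;

	if (bucket == NULL)
		return;
	if (zone->uz_fini != NULL)
		/* ub_cnt is re-read through the bucket pointer on every pass. */
		for (i = 0; i < bucket->ub_cnt; i++)
			zone->uz_fini(bucket->ub_bucket[i], zone->uz_size);
	zone->uz_release(zone->uz_arg, bucket->ub_bucket, bucket->ub_cnt);
	/* Corresponds to the marked store: movw $0x0,0x10(%rbx). */
	bucket->ub_cnt = 0;
}

/* Counting callbacks, used only to exercise the model. */
static int fini_calls, released;

static void
count_fini(void *item, int size)
{
	(void)item;
	(void)size;
	fini_calls++;
}

static void
count_release(void *arg, void **store, int cnt)
{
	(void)arg;
	(void)store;
	released += cnt;
}
```

In this model the bucket pointer is used successfully after each uz_fini call and is next used only after uz_release returns, which is why a bad pointer at the final store points at the release callback rather than at the fini loop.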
Comment 14 IPTRACE 2016-11-18 16:11:12 UTC
The PCIe network card (QUAD PORT INTEL PRO1000ET PCI-E 0HM9JY) stopped working.
While I was waiting for the replacement (INTEL GIGABIT ET2 QUAD PORT SERVER ADAPTER E1G44ET), the system ran fine for 7 days.
Then I installed the new card, and after a dozen or so hours the system crashed.

Fatal trap 9: general protection fault while in kernel mode
cpuid = 33; apic id = 33
instruction pointer = 0x20:0xffffffff80b6a89e
stack pointer = 0x28:0xffffffe3fcab4e7d0
frame pointer = 0x28:ffffffe3fcab4e820
code segment = base rx0, limit 0xfffff, type 0x1b
                               = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 5639 (vtnet-2:0 tx)
trap number = 9
panic: general protection fault
cpuid = 33
KDB: stack backtrace:
#0 0xffffffff80b24747 at kdb_backtrace+0x67
#1 0xffffffff80ad9ab2 at vpanic+0x182
#2 0xffffffff80ad9923 at panic+0x43
#3 0xffffffff80fa9d51 at trap_fatal+0x351
#4 0xffffffff80fa94ec at trap+0x26c
#5 0xffffffff80f8d101 at calltrap+0x8
#6 0xffffffff821e4e1e at tapwrite+0x9e
#7 0xffffffff80986677 at devfs_write_f+0xe7
#8 0xffffffff80b419a7 at dofilewrite+0x87
#9 0xffffffff80b41688 at kern_writev+0x68
#10 0xffffffff80b418f6 at sys_writev+0x36
#11 0xffffffff80faa6ae at amd64_syscall+0x4ce
#12 0xffffffff80f8d3eb at Xfast_syscall+0xfb
Uptime: 15h45m2s


Please look at the differences between the dmesg output on 10.3-RELEASE and 11.0-RELEASE.
Is there a problem with the PCI bus, or something similar, on 11.0-RELEASE?

11.0-RELEASE:
pcib0: <ACPI Host-PCI bridge> on acpi0
pcib0: _OSC returned error 0x10
pci0: <ACPI PCI bus> on pcib0
pcib1: <ACPI Host-PCI bridge> on acpi0
pcib1: _OSC returned error 0x10
pci1: <ACPI PCI bus> on pcib1
pcib2: <ACPI Host-PCI bridge> port 0xcf8-0xcff numa-domain 0 on acpi0
pcib2: _OSC returned error 0x10
pci2: <ACPI PCI bus> numa-domain 0 on pcib2
pcib3: <ACPI PCI-PCI bridge> irq 26 at device 1.0 numa-domain 0 on pci2
pci3: <ACPI PCI bus> numa-domain 0 on pcib3
pcib4: <ACPI PCI-PCI bridge> irq 32 at device 2.0 numa-domain 0 on pci2
pci4: <ACPI PCI bus> numa-domain 0 on pcib4
pcib5: <ACPI PCI-PCI bridge> irq 32 at device 2.2 numa-domain 0 on pci2
pci5: <ACPI PCI bus> numa-domain 0 on pcib5
pcib6: <ACPI PCI-PCI bridge> irq 40 at device 3.0 numa-domain 0 on pci2
pci6: <ACPI PCI bus> numa-domain 0 on pcib6
pcib7: <ACPI PCI-PCI bridge> irq 40 at device 3.2 numa-domain 0 on pci2
pci7: <ACPI PCI bus> numa-domain 0 on pcib7
pci2: <unknown> at device 17.0 (no driver attached)
xhci0: <Intel Wellsburg USB 3.0 controller> mem 0xc7200000-0xc720ffff irq 19 at device 20.0 numa-domain 0 on pci2
pci2: <simple comms> at device 22.0 (no driver attached)
pci2: <simple comms> at device 22.1 (no driver attached)
ehci0: <Intel Wellsburg USB 2.0 controller> mem 0xc7214000-0xc72143ff irq 18 at device 26.0 numa-domain 0 on pci2
pcib8: <ACPI PCI-PCI bridge> irq 16 at device 28.0 numa-domain 0 on pci2
pci8: <ACPI PCI bus> numa-domain 0 on pcib8
pcib9: <ACPI PCI-PCI bridge> irq 18 at device 28.2 numa-domain 0 on pci2
pci9: <ACPI PCI bus> numa-domain 0 on pcib9
pcib10: <ACPI PCI-PCI bridge> at device 0.0 numa-domain 0 on pci9
pci10: <ACPI PCI bus> numa-domain 0 on pcib10


10.3-RELEASE:
pcib0: <ACPI Host-PCI bridge> on acpi0
pci255: <ACPI PCI bus> on pcib0
pcib1: <ACPI Host-PCI bridge> on acpi0
pci127: <ACPI PCI bus> on pcib1
pcib2: <ACPI Host-PCI bridge> port 0xcf8-0xcff on acpi0
pci0: <ACPI PCI bus> on pcib2
pcib3: <ACPI PCI-PCI bridge> irq 26 at device 1.0 on pci0
pci1: <ACPI PCI bus> on pcib3
pcib4: <ACPI PCI-PCI bridge> irq 32 at device 2.0 on pci0
pci2: <ACPI PCI bus> on pcib4
pcib5: <ACPI PCI-PCI bridge> irq 32 at device 2.2 on pci0
pci3: <ACPI PCI bus> on pcib5
pcib6: <ACPI PCI-PCI bridge> irq 40 at device 3.0 on pci0
pci4: <ACPI PCI bus> on pcib6
pcib7: <ACPI PCI-PCI bridge> irq 40 at device 3.2 on pci0
pci5: <ACPI PCI bus> on pcib7
pci0: <unknown> at device 17.0 (no driver attached)
Comment 15 IPTRACE 2016-11-29 00:30:43 UTC
I've upgraded the OS to FreeBSD 11.0-RELEASE-p3 and compiled the kernel without ALTQ.
The system seems to work fine.

As Mark Johnston mentioned by email, ALTQ may be what is causing the kernel crashes.

Is it possible to compile ALTQ back in without running into these problems?
I didn't have any problems with ALTQ on 10.3-RELEASE.