Bug 233088 - Kernel Panic, mainly during high I/O stress (poudriere bulk)
Summary: Kernel Panic, mainly during high I/O stress (poudriere bulk)
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: arm
Version: CURRENT
Hardware: arm64 Any
Importance: --- Affects Some People
Assignee: Mark Johnston
URL:
Keywords: patch
Depends on:
Blocks: 228911
Reported: 2018-11-09 15:57 UTC by Sean Bruno
Modified: 2018-11-20 17:46 UTC (History)

See Also:


Attachments
proposed patch (1.75 KB, patch)
2018-11-09 17:04 UTC, Mark Johnston
proposed patch (2.31 KB, patch)
2018-11-13 01:00 UTC, Mark Johnston
proposed patch (2.45 KB, patch)
2018-11-15 17:29 UTC, Mark Johnston

Description Sean Bruno freebsd_committer 2018-11-09 15:57:33 UTC
Port Manager is experiencing repeatable panics on -CURRENT when trying to build packages on the Cavium ThunderX platform:

FreeBSD thunderx1.nyi.freebsd.org 13.0-CURRENT FreeBSD 13.0-CURRENT r340050 GENERIC-NODEBUG  arm64

Fatal data abort:
  x0:                e
  x1: ffff000000d8c118
  x2:               21
  x3:              103
  x4: fffffd1f296aa000
  x5: ffff0001780e0418
  x6: ffff0001780e051c
  x7: ffff0001780e050c
  x8:                1
  x9:                1
 x10:               21
 x11:                0
 x12:                0
 x13:         b5400703
 x14:     ffffffffa160
 x15:         4062b344
 x16:         401b3f94
 x17:     ffffffffa4f0
 x18: ffff0001780e0360
 x19: ffff0000011bd9a8
 x20: fffffd0000000000
 x21: fffffd1f296aa000
 x22: ffff0000011bd000
 x23:                0
 x24: ffff000062000000
 x25:                0
 x26:                e
 x27:        407fc0001
 x28:                1
 x29: ffff0001780e03d0
  sp: ffff0001780e0360
  lr: ffff0000006f77a4
 elr: ffff0000006f77a8
spsr:         60400345
 far:               5e
 esr:         96000007
panic: vm_fault failed: ffff0000006f77a8
cpuid = 8
time = 1541768414
KDB: stack backtrace:
db_trace_self() at db_trace_self_wrapper+0x28
         pc = 0xffff0000006e4b3c  lr = 0xffff0000000f70b8
         sp = 0xffff0001780dfd50  fp = 0xffff0001780dff60

db_trace_self_wrapper() at vpanic+0x1a8
         pc = 0xffff0000000f70b8  lr = 0xffff0000003b0ffc
         sp = 0xffff0001780dff70  fp = 0xffff0001780e0020

vpanic() at panic+0x44
         pc = 0xffff0000003b0ffc  lr = 0xffff0000003b0e50
         sp = 0xffff0001780e0030  fp = 0xffff0001780e00b0

panic() at data_abort+0x1d8
         pc = 0xffff0000003b0e50  lr = 0xffff0000006fd93c
         sp = 0xffff0001780e00c0  fp = 0xffff0001780e0170

data_abort() at do_el1h_sync+0x11c
         pc = 0xffff0000006fd93c  lr = 0xffff0000006fd660
         sp = 0xffff0001780e0180  fp = 0xffff0001780e01b0

do_el1h_sync() at handle_el1h_sync+0x74
         pc = 0xffff0000006fd660  lr = 0xffff0000006e7074
         sp = 0xffff0001780e01c0  fp = 0xffff0001780e02d0

handle_el1h_sync() at pmap_enter_l2+0x128
         pc = 0xffff0000006e7074  lr = 0xffff0000006f77a0
         sp = 0xffff0001780e02e0  fp = 0xffff0001780e03d0

pmap_enter_l2() at pmap_enter+0x104
         pc = 0xffff0000006f77a0  lr = 0xffff0000006f6830
         sp = 0xffff0001780e03e0  fp = 0xffff0001780e0470

pmap_enter() at vm_fault_hold+0xecc
         pc = 0xffff0000006f6830  lr = 0xffff000000691174
         sp = 0xffff0001780e0480  fp = 0xffff0001780e05f0

vm_fault_hold() at vm_fault+0x60
         pc = 0xffff000000691174  lr = 0xffff000000690250
         sp = 0xffff0001780e0600  fp = 0xffff0001780e0630

vm_fault() at data_abort+0xa0
         pc = 0xffff000000690250  lr = 0xffff0000006fd804
         sp = 0xffff0001780e0640  fp = 0xffff0001780e06f0

data_abort() at do_el1h_sync+0x11c
         pc = 0xffff0000006fd804  lr = 0xffff0000006fd660
         sp = 0xffff0001780e0700  fp = 0xffff0001780e0730

do_el1h_sync() at handle_el1h_sync+0x74
         pc = 0xffff0000006fd660  lr = 0xffff0000006e7074
         sp = 0xffff0001780e0740  fp = 0xffff0001780e0850

handle_el1h_sync() at exec_copyin_args+0x88
         pc = 0xffff0000006e7074  lr = 0xffff0000003683bc
         sp = 0xffff0001780e0860  fp = 0xffff0001780e0920

exec_copyin_args() at sys_execve+0x3c
         pc = 0xffff0000003683bc  lr = 0xffff00000036821c
         sp = 0xffff0001780e0930  fp = 0xffff0001780e09a0

sys_execve() at do_el0_sync+0x4f8
         pc = 0xffff00000036821c  lr = 0xffff0000006fdeac
         sp = 0xffff0001780e09b0  fp = 0xffff0001780e0a70

do_el0_sync() at handle_el0_sync+0x84
         pc = 0xffff0000006fdeac  lr = 0xffff0000006e7200
         sp = 0xffff0001780e0a80  fp = 0xffff0001780e0b90

handle_el0_sync() at 0x24940
         pc = 0xffff0000006e7200  lr = 0x0000000000024940
         sp = 0xffff0001780e0ba0  fp = 0x0000ffffffffa650
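
For readers of the register dump above, the esr value can be decoded by hand. The following is a minimal sketch, not FreeBSD code: decode_esr and its tables are hypothetical names introduced here for illustration, and the field layout (EC in bits 31:26, IL in bit 25, ISS in bits 24:0, with WnR in bit 6 and DFSC in bits 5:0 for data aborts) is taken from the Armv8-A definition of ESR_ELx. Only the three fault classes listed are named; anything else is reported as "other".

```python
# Hypothetical helper (not part of FreeBSD): decode an AArch64 ESR_ELx value.
# The WnR and DFSC fields are only meaningful when EC indicates a data abort.

EC_NAMES = {
    0x24: "data abort from a lower exception level",
    0x25: "data abort without a change in exception level",
}

# Upper four bits of DFSC select the fault class; the low two bits are the
# translation table level at which the fault was detected.
FAULT_CLASSES = {
    0b0001: "translation fault",
    0b0010: "access flag fault",
    0b0011: "permission fault",
}


def decode_esr(esr):
    """Split an ESR_ELx value into the fields relevant to a data abort."""
    ec = (esr >> 26) & 0x3F      # exception class, bits 31:26
    il = (esr >> 25) & 0x1       # instruction length bit, bit 25
    iss = esr & 0x1FFFFFF        # instruction specific syndrome, bits 24:0
    wnr = (iss >> 6) & 0x1       # 1 = write, 0 = read
    dfsc = iss & 0x3F            # data fault status code, bits 5:0
    fault_class = dfsc >> 2
    known = fault_class in FAULT_CLASSES
    return {
        "ec": ec,
        "ec_name": EC_NAMES.get(ec, "other"),
        "il": il,
        "write": bool(wnr),
        "dfsc": dfsc,
        "fault": FAULT_CLASSES[fault_class] if known else "other",
        "level": (dfsc & 0x3) if known else None,
    }


if __name__ == "__main__":
    # esr value copied from the panic message in this report.
    for field, value in decode_esr(0x96000007).items():
        print(field, value)
```

Applied to esr = 96000007, this gives exception class 0x25 (a data abort taken in the kernel itself), a read access, and a level 3 translation fault. The far register, 5e, holds the faulting virtual address.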
Comment 1 Rodney W. Grimes freebsd_committer 2018-11-09 16:35:27 UTC
Could the kernel be changed from GENERIC-NODEBUG to GENERIC to see if any INVARIANTS/WITNESS is triggered?
Comment 2 Mark Johnston freebsd_committer 2018-11-09 17:04:44 UTC
Created attachment 199094
proposed patch

Sean, if you're able to test kernel patches on this system, please give the attached patch a try.
Comment 3 Sean Bruno freebsd_committer 2018-11-09 18:18:39 UTC
(In reply to Mark Johnston from comment #2)
Patch applied and bulk build restarted.  If this survives ~24 hours of the bulk run, I'd mark this as fixed.
Comment 4 Sean Bruno freebsd_committer 2018-11-10 14:26:19 UTC
(In reply to Mark Johnston from comment #2)
Still up and running, building packages.  I think you got it.
Comment 5 Mark Johnston freebsd_committer 2018-11-10 20:56:10 UTC
(In reply to Sean Bruno from comment #4)
Thanks.  In a review Alan noted that the real problem actually appears to be in arm64-specific code, so I'll have to write an alternate patch.  I'll update this PR once that's ready.
Comment 6 Mark Johnston freebsd_committer 2018-11-13 01:00:19 UTC
Created attachment 199188
proposed patch

Could you please try this patch instead of the old one?
Comment 7 Ed Maste freebsd_committer 2018-11-13 15:18:35 UTC
markj@ is working to have this fixed in 12.0.
Comment 8 Mark Johnston freebsd_committer 2018-11-15 17:29:17 UTC
Created attachment 199259
proposed patch

(In reply to Mark Johnston from comment #6)
Assuming you haven't already started testing the new patch, please try this one instead.  It fixes a flaw in the first version.
Comment 9 Sean Bruno freebsd_committer 2018-11-19 13:08:28 UTC
(In reply to Mark Johnston from comment #8)
We've applied this to the package builder this morning.  I'll report back after it runs for a day or two.
Comment 10 Sean Bruno freebsd_committer 2018-11-20 14:19:49 UTC
(In reply to Sean Bruno from comment #9)
The Cavium box is running flat out and seems to be super stable.  Thank you!

http://thunderx1.nyi.freebsd.org/index.html
Comment 11 commit-hook freebsd_committer 2018-11-20 15:13:00 UTC
A commit references this bug:

Author: markj
Date: Tue Nov 20 15:12:37 UTC 2018
New revision: 340678
URL: https://svnweb.freebsd.org/changeset/base/340678

Log:
  Handle kernel superpage mappings in pmap_remove_l2().

  PR:		233088
  Reviewed by:	alc, andrew, kib
  Tested by:	sbruno
  MFC after:	3 days
  Sponsored by:	The FreeBSD Foundation
  Differential Revision:	https://reviews.freebsd.org/D17981

Changes:
  head/sys/arm64/arm64/pmap.c
Comment 12 commit-hook freebsd_committer 2018-11-20 17:43:58 UTC
A commit references this bug:

Author: markj
Date: Tue Nov 20 17:43:24 UTC 2018
New revision: 340685
URL: https://svnweb.freebsd.org/changeset/base/340685

Log:
  MFstable/12 r340680:
  Handle kernel superpage mappings in pmap_remove_l2().

  PR:		233088
  Approved by:	re (gjb)

Changes:
_U  releng/12.0/
  releng/12.0/sys/arm64/arm64/pmap.c