Bug 228056 - powerpc64: MCE on POWER9 machine (AC922)
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: CURRENT
Hardware: powerpc Any
Importance: --- Affects Only Me
Assignee: freebsd-ppc (Nobody)
Depends on:
Reported: 2018-05-08 00:52 UTC by Breno Leitao
Modified: 2020-05-29 01:11 UTC
Description Breno Leitao 2018-05-08 00:52:47 UTC
I am creating this bug to track my progress investigating the bootstrap of FreeBSD on an AC922 (POWER9) machine.

When I boot HEAD, I hit the following machine check exception (MCE):

 KDB: debugger backends: ddb
 KDB: current backend: ddb
 Copyright (c) 1992-2018 The FreeBSD Project.
 Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
 	The Regents of the University of California. All rights reserved.
 FreeBSD is a registered trademark of The FreeBSD Foundation.
 FreeBSD 12.0-CURRENT #152 66f063557f2(master)-dirty: Tue May  8 01:17:52 CET  2018
    root@free8:/usr/obj/root/kernel/freebsd/powerpc.powerpc64/sys/BRENO powerpc
 gcc version 4.2.1 20070831 patched [FreeBSD]
 WARNING: WITNESS option enabled, expect reduced performance.
 WARNING: DIAGNOSTIC option enabled, expect reduced performance.
 Entering uma_startup with 44 boot pages configured
 startup_alloc from "UMA Kegs", 41 boot pages left
 startup_alloc from "UMA Zones", 40 boot pages left
 startup_alloc from "UMA Zones", 38 boot pages left
 startup_alloc from "UMA Zones", 36 boot pages left
 start at c000000001e30100
 KERNEL BASE at 100100
 sum is  c000000001d30000

 fatal kernel trap:

   exception       = 0x200 (machine check)
   srr0            = 0xc00000000255d284 (0x82d284)
   srr1            = 0x9000000000201032
   current msr     = 0x9000000000000032
   lr              = 0xc00000000255d278 (0x82d278)
   curthread       = 0xc000000002e2bbc0
          pid = 0, comm = 

 [ thread pid 0 tid 0 ]
 Stopped at      0xc00000000255d284
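
The trap output pairs each runtime address with an offset in parentheses, and the two appear to differ by the "sum" value printed during early boot. A minimal sketch of that arithmetic, assuming "sum" is the load bias (start minus KERNEL BASE); the helper name `image_offset` is hypothetical and not part of the kernel:

```c
#include <assert.h>
#include <stdint.h>

/* Load bias as printed in the boot log ("sum is c000000001d30000"). */
static const uint64_t LOAD_BIAS = 0xc000000001d30000ULL;

/*
 * Hypothetical helper: translate a runtime kernel address (srr0, lr)
 * into the offset that shows up in the disassembly listing.
 */
static uint64_t image_offset(uint64_t runtime_addr)
{
    assert(runtime_addr >= LOAD_BIAS);  /* must lie inside the loaded image */
    return runtime_addr - LOAD_BIAS;
}
```

With the values from the log, image_offset(0xc00000000255d284) gives 0x82d284 for srr0 and image_offset(0xc00000000255d278) gives 0x82d278 for lr, which is how the faulting instruction can be located in the listing.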

Digging further, this is the instruction where it faults:

     82d264:       7f c3 f3 78     mr      r3,r30
     82d268:       7e e4 bb 78     mr      r4,r23
     82d26c:       7f 65 db 78     mr      r5,r27
     82d270:       7f 86 e3 78     mr      r6,r28
     82d274:       4b ff f8 49     bl      82cabc <.keg_alloc_slab>                                                   
     82d278:       7c 7d 1b 79     mr.     r29,r3
     82d27c:       41 a2 00 94     beq+    82d310 <.keg_fetch_slab+0x2cc>
     82d280:       7f bc eb 78     mr      r28,r29
->>  82d284:       e8 1d 00 00     ld      r0,0(r29)
     82d288:       7f a0 f0 00     cmpd    cr7,r0,r30     
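
To double-check which register is the base of the faulting load, the instruction word e8 1d 00 00 can be decoded by hand. The following is only a sketch of DS-form field extraction as I understand the Power ISA encoding; the struct and function names are made up for illustration:

```c
#include <stdint.h>

/* Fields of a DS-form instruction word (used by ld/std). */
struct ds_form {
    unsigned opcode;  /* primary opcode, top 6 bits */
    unsigned rt;      /* target register */
    unsigned ra;      /* base register */
    int32_t  disp;    /* signed byte displacement (DS field times 4) */
    unsigned xo;      /* 2-bit extended opcode */
};

/* Hypothetical helper: split a 32-bit instruction word into DS-form fields. */
static struct ds_form decode_ds(uint32_t insn)
{
    struct ds_form f;
    int32_t d = (int32_t)(insn & 0xfffc);  /* low two bits belong to xo */

    if (d & 0x8000)                        /* sign-extend the 16-bit value */
        d -= 0x10000;
    f.opcode = insn >> 26;
    f.rt = (insn >> 21) & 0x1f;
    f.ra = (insn >> 16) & 0x1f;
    f.disp = d;
    f.xo = insn & 0x3;
    return f;
}
```

For 0xe81d0000 this yields opcode 58, xo 0 (ld), rt 0, ra 29 and displacement 0, matching the disassembler's "ld r0,0(r29)".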

At this point, r29 contains:

  db> print $r29

Looking at that code, I think we are here:

               slab = keg_alloc_slab(keg, zone, domain, allocflags);
               /*
                * If we got a slab here it's safe to mark it partially used
                * and return.  We assume that the caller is going to remove
                * at least one item.
                */
               if (slab) {
       ->>             MPASS(slab->us_keg == keg);

where 'slab' is in r29 and 'us_keg' should be the very first field (offset 0). 'keg' should be in r30:

  db> print $r30

The problem seems to occur when the code dereferences slab (r29), which appears to be what triggers the MCE.
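
To illustrate why the MPASS check turns into a load from 0(r29), here is a small sketch with stand-in types. The struct layouts below are assumptions for illustration only, not the real UMA definitions; the only property taken from the analysis above is that us_keg is the first member:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the keg type; contents are irrelevant here. */
struct keg_sketch {
    int placeholder;
};

/* Stand-in slab with us_keg as its first member, so it sits at offset 0. */
struct slab_sketch {
    struct keg_sketch *us_keg;
    void *us_data;              /* hypothetical further member */
};

/*
 * Same comparison as MPASS(slab->us_keg == keg). Reading us_keg is a
 * load from slab + 0, which corresponds to "ld r0,0(r29)" followed by
 * "cmpd cr7,r0,r30" in the listing.
 */
static bool slab_belongs_to(const struct slab_sketch *slab,
    const struct keg_sketch *keg)
{
    assert(slab != NULL);
    return slab->us_keg == keg;
}
```

So the very first touch of the freshly returned slab memory is this read, which would explain why the machine check shows up exactly at that instruction.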

This is the memory that r30 points to:

  db> x $r30      
  0xc00003fffffd7000:     c0000000
  0xc00003fffffd7004:     2af8bb8

But I am not able to dereference r29:

  db> x $r29 
  0xc00003fffffddf90: (machine halts)

I am wondering why accessing this page is causing this problem.
Comment 1 Breno Leitao 2018-05-08 01:25:16 UTC
I was also able to reproduce this problem at a different address:

5870f0:       7c 69 58 2a     ldx     r3,r9,r11

db> print $r9
db> print $r11

I am able to access the memory at the address in r11, but not at r11+r9:

db> x 0xc00003ffffd90000
0xc00003ffffd90000:     c00003ff

db> x 0xC00003FFFFD92500
0xc00003ffffd92500 (halts)
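
For reference, ldx computes its effective address as the sum of the two source registers. The register values are not shown above, so the sketch below assumes r11 holds 0xc00003ffffd90000 (the address that reads fine), which would make r9 equal to 0x2500. The helper names and the page sizes tried are assumptions for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Effective address of "ldx rt,ra,rb" when ra is not register 0. */
static uint64_t ldx_effective_address(uint64_t ra_val, uint64_t rb_val)
{
    return ra_val + rb_val;
}

/* True if both addresses fall in the same page of size 1 << page_shift. */
static bool same_page(uint64_t a, uint64_t b, unsigned page_shift)
{
    assert(page_shift < 64);
    return (a >> page_shift) == (b >> page_shift);
}
```

Under that assumption, ldx_effective_address(0x2500, 0xc00003ffffd90000) is 0xc00003ffffd92500. The working and the failing address are on different pages if the page size is 4 KB (shift 12), but on the same page if it is 64 KB (shift 16), which may be worth keeping in mind when looking at the mapping.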
Comment 2 Justin Hibbits freebsd_committer 2020-05-29 01:11:00 UTC
I think this was fixed shortly after the PR was submitted.  Can someone confirm, so that it can be either closed or addressed?