Bug 237590 - powerpc64 PowerMac11,2 and 7,2 context, -r330614 and later (including -r345758): "ofwdump -ap" crashes the system; probable -r330610 "cause"
Summary: powerpc64 PowerMac11,2 and 7,2 context, -r330614 and later (including -r34575...
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: CURRENT
Hardware: powerpc Any
: --- Affects Some People
Assignee: freebsd-bugs mailing list
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-04-26 20:53 UTC by Mark Millard
Modified: 2019-07-20 07:37 UTC (History)
0 users

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Mark Millard 2019-04-26 20:53:36 UTC
The biggest issue may be the inability to sufficiently sleep CPUs
on powerpc64 in general, with ofwdump on old PowerMac G5's just
being a good way to test that.

This was originally observed on head -r345758.

But "bisecting" based on:

https://artifact.ci.freebsd.org/snapshot/head/r*

I found that for the likes of "ofwdump -ap > /dev/null" :

-r330572: does not crash the system (or program).
-r330614: crashes the system: timeout trying to sleep cpus.

There are no other https://artifact.ci.freebsd.org/snapshot/head/r*
between -r330572 and -r330614 with powerpc64 present. So I stopped
at this range.

Turns out that between those two versions is:

Revision 330610 . . .
Modified Wed Mar 7 17:08:07 2018 UTC . . . by nwhitehorn 
. . .
Move the powerpc64 direct map base address from zero to high memory. This
accomplishes a few things:
- Makes NULL an invalid address in the kernel, which is useful for catching
 bugs.
. . .

(It may be that -r330610 exposed another problem that was
accidentally avoided before that.)


So far it appears likely that every bootable/usable build from
-r330610 on crashes for the likes of "ofwdump -ap" when openfirmware
is used live and, so, the cpus need to avoid multi-threading the
openfirmware use. (So, for PowerMacs, when booted without usefdt mode
for  sufficiently recent builds).

I've not seen the problem with 32-bit powerpc FreeBSD on old PowerMacs,
including when used to boot the same machines. Only powerpc64 FreeBSD.

The only powerpc64 test contexts that I have access to sometimes are
old G5 PowerMacs: 2-socket/2-core-each PowerMac11,2 and
2-socket/1-core-each PowerMac7,2. So I've no direct evidence for
any other powerpc64 context.
Comment 1 Mark Millard 2019-04-26 21:35:19 UTC
(In reply to Mark Millard from comment #0)

I suppose that a possibility is that:

A) It may be ddb related code that can not sleep some
   CPU(s) (because some already are sleeping?).

B) It may be openfirmware tried to use an address that
   is invalid in the kernel on or after -r330610 .

(A) may prevent seeing a notice that would point to (B)
as a possibility, thus hiding the true cause.
Comment 2 Mark Millard 2019-04-27 02:32:25 UTC
(In reply to Mark Millard from comment #1)

From looking at the code the message about timeout
stopping cpus is from kdbtrap and does not of itself
seem to be part of the original problem.

Attempting a bt at the ddb> prompt gets its own
fatal kernel trap but some of the exception information
is possibly of interest. Typed from a screen picture
(this is from -r330614):

exception      = 0x300 (data storage interrupt)
virtual address= 0xffffffffffb7f400
dsisr          = 0x40000000
srr0           = 0x9b2194 (0x9b2194)
srr1           = 0x9000000000001032
lr             = 0x9b254c (0x9b254c)
curthread      = 0xc0000000024fb000
       pid = 12, comm = swi4: clock (0)

The stack backtrace itself listed:

0x0000000f95460: kernel DSI read trap @ 0xffffffffffb7f400
by .db_backtrace+0x40

[Multiple attempts got the same as the above.]

That matches of 0x9b2194 for srr0:

00000000009b2154 <.db_backtrace> mflr    r0
00000000009b2158 <.db_backtrace+0x4> std     r28,-32(r1)
00000000009b215c <.db_backtrace+0x8> std     r29,-24(r1)
00000000009b2160 <.db_backtrace+0xc> std     r30,-16(r1)
00000000009b2164 <.db_backtrace+0x10> std     r31,-8(r1)
00000000009b2168 <.db_backtrace+0x14> std     r0,16(r1)
00000000009b216c <.db_backtrace+0x18> stdu    r1,-144(r1)
00000000009b2170 <.db_backtrace+0x1c> mr      r31,r1
00000000009b2174 <.db_backtrace+0x20> mr      r28,r5
00000000009b2178 <.db_backtrace+0x24> ld      r9,-20736(r2)
00000000009b217c <.db_backtrace+0x28> lwz     r0,0(r9)
00000000009b2180 <.db_backtrace+0x2c> cmpwi   cr7,r0,0
00000000009b2184 <.db_backtrace+0x30> bne     cr7,00000000009b24e0 <.db_backtrace+0x38c>
00000000009b2188 <.db_backtrace+0x34> cmpldi  cr7,r4,4095
00000000009b218c <.db_backtrace+0x38> ble     cr7,00000000009b24e0 <.db_backtrace+0x38c>
00000000009b2190 <.db_backtrace+0x3c> mr      r30,r4
00000000009b2194 <.db_backtrace+0x40> ld      r30,0(r30)
. . .

Which is from:

static int
db_backtrace(struct thread *td, db_addr_t fp, int count)
{
	db_addr_t stackframe, lr, *args;
	boolean_t kernel_only = TRUE;
	boolean_t full = FALSE;

#if 0
. . .
#endif

	stackframe = fp;

	while (!db_pager_quit) {
		if (stackframe < PAGE_SIZE)
			break;

		/*
		 * Locate the next frame by grabbing the backchain ptr
		 * from frame[0]
		 */
		stackframe = *(db_addr_t *)stackframe;
. . .

That establishes that the 0xffffffffffb7f400 value is
from what the code expects was a stack location holding
a backchain pointer.
Comment 3 Mark Millard 2019-05-18 20:13:24 UTC
[The following is from a head -r347549 based
context.]

I have isolated the call into openfirmware that
leads to the crash. Running

( truss ofwdump -ap ) 2>&1

on the console left me able to take a picture at
the end. Then under (a patched-up) usefdt mode I
was able to what would normally be next.

The crash was shown after (from the usefdt
mode context were I could save the output):

      name:
write(1,"      name:\n",12)                      = 12 (0xc)
        6f 62 70 2d 74 66 74 70 00 
write(1,"        6f 62 70 2d 74 66 74 70 "...,36) = 36 (0x24)
        'obp-tftp'
write(1,"        'obp-tftp'\n",19)               = 19 (0x13)
ioctl(3,OFIOCNEXTPROP,0x3fffffffffffd648)        = 0 (0x0)
ioctl(3,OFIOCGETPROPLEN,0x3fffffffffffd528)      = 0 (0x0)
ioctl(3,OFIOCGET,0x3fffffffffffd508)             = 0 (0x0)
      stats:
write(1,"      stats:\n",13)                     = 13 (0xd)
        00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
write(1,"        00 00 00 00 00 00 00 00 "...,69) = 69 (0x45)
        00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
write(1,"        00 00 00 00 00 00 00 00 "...,69) = 69 (0x45)
ioctl(3,OFIOCNEXTPROP,0x3fffffffffffd648)        = 0 (0x0)
ioctl(3,OFIOCGETPROPLEN,0x3fffffffffffd528)      = 0 (0x0)

Then was:

timeput stopping cpus
[ thread pid 11 tid 100003 ]
Stopped at k_trap_0x28: stdu r31,r2,0xfd50
db>

(I'll note that the last two addresses in the last two
ioctl's shows addresses ending in fd678 and fd558 in the
crashing context [non-usefdt mode]. In general the
address are not exact matches for the two boot modes.)

But note that truss reports commands after they finish.
(See the writes above.) So the processor was likely
working on the command that followed.

That was the OFIOCGET for the fairly large log property
(text from non-usefdt mode output, no crash):

ioctl(3,OFIOCGET,0x3fffffffffffd508)             = 0 (0x0)
      log:
write(1,"      log:\n",11)                       = 11 (0xb)
        00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
write(1,"        00 00 00 00 00 00 00 00 "...,69) = 69 (0x45)
        00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
. . .

(I'll not list the large block of output for reporting
the log area.)
Comment 4 Mark Millard 2019-05-18 20:51:00 UTC
(In reply to Mark Millard from comment #3)

Based on register values values it looks like the
openfirmware code got something like, say, a Data
Storage Exception, that in turn went to generictrap,
that in turn went to k_trap:

<k_trap> mfsprg  r31,0
<k_trap+0x4> mfsrr0  r30
<k_trap+0x8> std     r30,288(r31)
<k_trap+0xc> mfsrr1  r30
<k_trap+0x10> std     r30,296(r31)
<k_trap+0x14> mfsprg  r31,1
<k_trap+0x18> mfmsr   r30
<k_trap+0x1c> ori     r30,r30,50
<k_trap+0x20> mtmsr   r30
<k_trap+0x24> isync
<k_trap+0x28> stdu    r31,-688(r1)

It failed at the last because the value r1 held
0xffff'ffff'ffb7'f400, apparently an openfirmware
stack address being handled by a FreeBSD exception
handler under the FreeBSD memory management.

(So far, I can not be sure of what, specifically,
initiated getting to k_trap.)
Comment 5 Mark Millard 2019-07-20 07:37:32 UTC
(In reply to Mark Millard from comment #3)

The ofwdump command that directly tries to report
the log information that leads to the crash is:

ofwdump -Plog /packages/obp-tftp

(This is easier to follow than the prior truss
output showing the matching OFIOCGET activity.