Bug 238730 - r349985 on ppc64 IBM 970MP PowerMac G5 sys/dev/bge/if_bge.c must move the device_get_devclass(bus)
Summary: r349985 on ppc64 IBM 970MP PowerMac G5 sys/dev/bge/if_bge.c must move the dev...
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: CURRENT
Hardware: powerpc Any
: --- Affects Some People
Assignee: freebsd-bugs mailing list
URL:
Keywords: regression
Depends on:
Blocks:
 
Reported: 2019-06-21 04:51 UTC by Dennis Clarke
Modified: 2019-07-20 04:25 UTC (History)
6 users (show)

See Also:


Attachments
r349985 ppc64 panic at boot on PowerMac G5 quad (572.64 KB, image/jpeg)
2019-07-15 01:07 UTC, Dennis Clarke
no flags Details
r350009 panic with if_bge.c patch (972.59 KB, image/png)
2019-07-15 21:12 UTC, Francis Little
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Dennis Clarke 2019-06-21 04:51:44 UTC
Immediately after telling the loader : 
    
    set debug.rman_debug=1
    set debug.verbose_sysinit=1
    set usefdt=1

boot -v


.
.
.
FreeBSD 13.0-CURRENT r349241 GENERIC powerpc
.
.
.
WARNING: WITNESS option enabled, expect reduced performance.
subsystem 1000000
   vm_mem_init(0)... panic: segind 91 m 0xc000000273832bf0
cpuid = 0
time = 1
KDB: stack backtrace:
0xe0000000000000008500: at .kdb_backtrace+0x5c
0xe0000000000000008630: at .vpanic+0x1b4
0xe00000000000000086f0: at .panic+0x38
0xe0000000000000008780: at .vm_phys_enqueue_contig+0x64
0xe0000000000000008850: at .vm_page_startup+0x7b4
0xe0000000000000008920: at .vm_mem_init+0x30
0xe00000000000000089b0: at .mi_startup+0x308
0xe0000000000000008a50: at .btext+0xc4
KDB: enter: panic
[ thread pid 0 tid 0 ]
Stopped at      .kdb_enter+0x60:            ld      r2, r1, 0x28
db> 

So that was fast. Looks like the vmhat layer is at issue.


-- 
Dennis Clarke
RISC-V/SPARC/PPC/ARM/CISC
UNIX and Linux spoken
GreyBeard and suspenders optional
Comment 1 Mark Johnston freebsd_committer 2019-06-21 20:33:56 UTC
Have you booted FreeBSD successfully on this hardware in the past?

segind 91 is way bigger the VM_PHYSSEG_MAX for powerpc (16), which suggests that vm_page_startup() is adding uninitialized page structures to the physical memory allocator.
Comment 2 Dennis Clarke 2019-06-27 19:26:01 UTC
(In reply to Mark Johnston from comment #1)

You asked "Have you booted FreeBSD successfully on this
hardware in the past?"

Booted ?  Ran for years and I have compiled the sources and 
many things from ports over and over and over. 

So this is very annoyying.
Comment 3 Mark Johnston freebsd_committer 2019-06-27 19:49:43 UTC
(In reply to Dennis Clarke from comment #2)
Which revision were you last able to boot successfully?
Comment 4 Mark Linimon freebsd_committer freebsd_triage 2019-06-29 00:24:32 UTC
(In reply to Dennis Clarke from comment #2)

While I understand that it's annoying, when I teach people to triage PRs, Triage 101 is to set a baseline.  Sometime that requires asking obvious questions to rule things out.

FreeBSD is a large enough community now that not everyone knows everyone, sorry.
Comment 5 Dennis Clarke 2019-06-29 00:37:07 UTC
(In reply to Mark Linimon from comment #4)

I don't mean the question is annoying. Not at all.

I mean the machine won't boot. That is annoying. 

As for last good revision on this boxen?  Good question
and for that I will need to boot the 11.2 DVD and then
mount the /dev/ada3 ( I think ) and have a look into 
ye /boot and see what I was running. Running well in fact.
Comment 6 Dennis Clarke 2019-07-15 00:39:54 UTC

Last known kernel to boot and allow me to compile head/current was
r345425 and I used that today to buildworld and kernel for r349985
which instantly panics.

So has anyone seen anything recent build and boot and run for the
PowerMac G5 quad or its brethren ?
Comment 7 Justin Hibbits freebsd_committer 2019-07-15 00:43:14 UTC
Can you try bisecting the kernel?  I don't have an applicable G5 to test with.  You have a known-good and a known-bad.  Bisecting between them should only be ~13 builds (and you might be able to narrow it down further by looking at the kernel changes between the revisions).
Comment 8 Dennis Clarke 2019-07-15 01:00:53 UTC
(In reply to Justin Hibbits from comment #7)

Excellent idea however I was going to attempt a more blunt force approach. My idea after seeing another instant panic would be to wipe the machine clean and reinstall the 12 release clean. Then checkout r345425 and buildworld and kernel and verify that I have a stable working system. If so then we have a baseline. I have done too many builds and ports builds on the current messy machine and feel a clean slate would be a good idea. 

I will then look into the bisect from r345425 up to today and try to sift through whatever I get. 

Dennis
Comment 9 Dennis Clarke 2019-07-15 01:07:42 UTC
Created attachment 205784 [details]
r349985 ppc64 panic at boot on PowerMac G5 quad

r349985 panic at boot.
Comment 10 Mark Millard 2019-07-15 03:43:37 UTC
(In reply to Dennis Clarke from comment #0)

. . .
>     set usefdt=1
. . .

Avoid usefdt=1 . It is known to be wrong and not work.
Unless something roughly equivalent to my patches that
were listed in:

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=233863

are in place, the following apply for usefdt=1 :

A) hrowpic.c uninorth.c openpic.c all need to be
   corrected for at least one of OF_xref_from_node
   or OF_node_from_xref use. (Pure FreeBSD issue.)

B) FreeBSD choose to truncate large nodes but Apple
   has nodes that are truncated and so information
   is not preserved in such cases. (Pure FreeBSD
   issue as far as I can tell. I do not claim to
   know the detailed conequences of the truncations:
   I choose to preserver information.)

C) The translation to fdt avoided /memory0/available
   information being used (and respected). The need to
   resepct the information was demonstrated on the
   iMac G3. (Pure FreeBSD issue as far as I can tell.)

D) The FreeBSD code for the powerpc64 Macs expects
   fixed child/sibling relationships in the openfirmware
   nodes (instead of always searching) but usefdt
   reverses the ordering of the siblings (and so changes
   what is the the first child of a node that has
   childern). (Either side could be changed to avoid
   the mismatch. I choose the conversion to fdt
   preserving order.)

E) Apple made some of its own rules in its supposed
   OpenFirmware implemenation, violating the
   standard. Avoiding information loss for such (without
   messing up handling of standards-conformant contexts),
   requires changes. (There is no way Apple will fix
   things now. So handling the oddities vs. not is a
   matter of choice for FreeBSD.)

F) There was also an issue for filling translations[]
   from trans_cells[] that I ran into, but I no longer
   remember if the Macs would actually see the problem
   if other things were corrected and made tolerant
   of Apple's OpenFirmware node oddities.

[None of (A)-(F) addresses the stuck-sleeping processes
that lead to the fans going fast and the like. I've Mac
specific changes for that (that is only active on Macs).
But the changes are not bugzilla anywhere.]

I expect that Justin and company are only intending on
the default lack of usefdt as what they are enabling, at
least for now. So if you specify a context with usefdt=1
I do not expect it to get much attention (for now).
Comment 11 Dennis Clarke 2019-07-15 13:30:56 UTC
(In reply to Mark Millard from comment #10)


I did try to boot -sv without "usefdt=1" and was given a new type of
panic so this may be progress. 

.
.
.
vgapci0: Boot video device
bge0: <Broadcom BCM5714 B3, ASIC rev. 0x008003> mem 0xfa530000-0xfa53ffff,0xfa520000-0xfa52ffff irq 66 at device 4.0 on pci3
ofw_pci mapdev: start fa530000, len 65536
panic: pci_get_vendor failed for pcib1 on bus ofwbus0, error = 2
cpuid = 0
time = 1
KDB: stack backtrace:
0xe000000000007f00: at .kdb_backtrace+0x5c
0xe000000000008030: at .vpanic+0x1b4
0xe0000000000080f0: at .panic+0x38
0xe000000000008180: at .bge_attach+0x2e08
0xe0000000000082c0: at .device_attach+0x420
0xe0000000000083a0: at .device_probe_and_attach+0xb4
0xe000000000008430: at .bus_generic_new_pass+0x12c
0xe0000000000084c0: at .bus_generic_new_pass+0x114
0xe000000000008550: at .bus_generic_new_pass+0x114
0xe0000000000085e0: at .bus_generic_new_pass+0x114
0xe000000000008670: at .bus_generic_new_pass+0x114
0xe000000000008700: at .bus_generic_new_pass+0x114
0xe000000000008790: at .bus_generic_new_pass+0x114
0xe000000000008820: at .bus_set_pass+0xc4
0xe0000000000088b0: at .root_bus_configure+0x1c
0xe000000000008930: at .configure+0x14
0xe0000000000089b0: at .mi_startup+0x1f8
0xe000000000008a50: at btext+0xc4
KDB: enter: panic
[ thread pid 0 tid 100000 ]
Stopped at      .kdb_enter+0x60:       ld      r2, r1, 0x28
db>

Now this is interesting and I am making a wild guess that it may be 
related to the bge0 abd bge1 interface re-order symptom which is not
really a problem at all. At least not in a practical sense as I don't
care which interface gets the lowest number mac address from the pair
of interfaces available. However I see what looks like a loop there in
bus_generic_new_pass which goes down a fark rabbit hole. Just a guess.

So this is what happens if and only if I do not suggest "usefdt=1" to 
the boot loader as that results in an instant panic.  See picture that
I attached yesterday. 

If I do nothing other than suggest "boot -sv" then we get the above
panic and that seems to be a better place to look. 

It may be useful to go hunting into sys/powerpc/powermac/macgpio.c and
drop a pile of printf's and just see where this bug actually is. 

Also I want to change the title of this bug. 

Dennis
Comment 12 Andriy Gapon freebsd_committer 2019-07-15 14:11:53 UTC
(In reply to Dennis Clarke from comment #11)
I suspect that the problem is in bge_mbox_reorder().
Could you please try to move the device_get_devclass(bus) check to be just after device_get_devclass(dev)  check?
Comment 13 Dennis Clarke 2019-07-15 14:46:24 UTC
Will do : 

# diff -c if_bge.c.orig if_bge.c
*** if_bge.c.orig       Sun Jul 14 12:00:04 2019
--- if_bge.c    Mon Jul 15 14:45:26 2019
***************
*** 3272,3279 ****
        dev = sc->bge_dev;
        bus = device_get_parent(dev);
        for (;;) {
!               dev = device_get_parent(bus);
                bus = device_get_parent(dev);
                if (device_get_devclass(dev) != pcib)
                        break;
                for (i = 0; i < nitems(mbox_reorder_lists); i++) {
--- 3272,3282 ----
        dev = sc->bge_dev;
        bus = device_get_parent(dev);
        for (;;) {
!               /* [Bug 238730] suggestion from Andriy Gapon <avg@FreeBSD.org> 
!                * is to move dev = device_get_parent(bus) just after the
!                  * call to device_get_devclass(dev) */
                bus = device_get_parent(dev);
+               dev = device_get_parent(bus);
                if (device_get_devclass(dev) != pcib)
                        break;
                for (i = 0; i < nitems(mbox_reorder_lists); i++) {
# (In reply to Andriy Gapon from comment #12)

Let's see what that does. 

Dennis
Comment 14 Dennis Clarke 2019-07-15 18:06:02 UTC
That small change in if_bge.c resulted in the exact same panic. 

So I just did a re-install of 12.0-RELEASE r341666 on that machine and
will try to work forwards from there.
Comment 15 Francis Little 2019-07-15 19:20:44 UTC
(In reply to Dennis Clarke from comment #14)

Hi, I have a PM G5 Quad thats trash-able. I have taken it to r345245.

Do you want to pick a rev number and can work forward with some builds here to share the load?
Comment 16 Andriy Gapon freebsd_committer 2019-07-15 19:54:35 UTC
(In reply to Dennis Clarke from comment #13)
It's been a while since I last saw a context diff, unified diffs are all rage now :) I think that the diff does not do what I asked in comment #12.  It moves an assignment not the check.

This is what I meant:

Index: sys/dev/bge/if_bge.c
===================================================================
--- sys/dev/bge/if_bge.c	(revision 349883)
+++ sys/dev/bge/if_bge.c	(working copy)
@@ -3276,6 +3276,8 @@ bge_mbox_reorder(struct bge_softc *sc)
 		bus = device_get_parent(dev);
 		if (device_get_devclass(dev) != pcib)
 			break;
+		if (device_get_devclass(bus) != pci)
+			break;
 		for (i = 0; i < nitems(mbox_reorder_lists); i++) {
 			if (pci_get_vendor(dev) ==
 			    mbox_reorder_lists[i].vendor &&
@@ -3287,8 +3289,6 @@ bge_mbox_reorder(struct bge_softc *sc)
 				return (1);
 			}
 		}
-		if (device_get_devclass(bus) != pci)
-			break;
 	}
 	return (0);
 }
Comment 17 Francis Little 2019-07-15 21:08:48 UTC
(In reply to Andriy Gapon from comment #16)

Hi, I tried patching if_bge.c as in your diff, against a r350009 Kernel on my G5 Quad and it panics instantly.
Comment 18 Francis Little 2019-07-15 21:12:28 UTC
Created attachment 205809 [details]
r350009 panic with if_bge.c patch
Comment 19 Dennis Clarke 2019-07-15 23:24:00 UTC
(In reply to Francis Little from comment #18)

I will give it a whirl on r350018 :

hydra# uname -a 
FreeBSD hydra 12.0-RELEASE FreeBSD 12.0-RELEASE r341666 GENERIC  powerpc

hydra# pwd
/usr/src/r350018

hydra# diff -u sys/dev/bge/if_bge.c.orig sys/dev/bge/if_bge.c
--- sys/dev/bge/if_bge.c.orig   2019-07-15 19:02:10.169287000 +0000
+++ sys/dev/bge/if_bge.c        2019-07-15 23:10:50.382654000 +0000
@@ -3276,6 +3276,8 @@
                bus = device_get_parent(dev);
                if (device_get_devclass(dev) != pcib)
                        break;
+               if (device_get_devclass(bus) != pci)
+                       break;
                for (i = 0; i < nitems(mbox_reorder_lists); i++) {
                        if (pci_get_vendor(dev) ==
                            mbox_reorder_lists[i].vendor &&
@@ -3287,8 +3289,6 @@
                                return (1);
                        }
                }
-               if (device_get_devclass(bus) != pci)
-                       break;
        }
        return (0);
 }
hydra# 


Fairly sure one must interrupt the boot loader and utter the following :
    "set kern.smp.disabled=1" 

Surely ye old PowerMac G5 quad will just panic instantly. I have not
seen mine boot reliably in a while unless I restrict it to a single cpu
core. Also for a long long while I was also using a flat device tree
with "set usefdt=1".  However this is deemed to be fixed : 

    Bug 233863 - Various PowerMac G5 models may require kern.smp.disabled=1 and must set usefdt=1 which causes net interface reorder
    
    https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=233863
    

Well clearly the net interface re-order is still here. 

-- 
Dennis Clarke
RISC-V/SPARC/PPC/ARM/CISC
UNIX and Linux spoken
GreyBeard and suspenders optional
Comment 20 Mark Millard 2019-07-15 23:49:31 UTC
(In reply to Dennis Clarke from comment #19)

> Well clearly the net interface re-order is still here. 

usefdt=1 reverses the order of sibling nodes in
OpenFirmware (item (D) of my list of known usefdt
issues). The net interface re-order is one of the
consequences. Others are worse (as I remember):
wrong nodes found and used via the likes of just
looking up the first child of first child in the
FreeBSD code. (This would be fine if there were no
siblings at either level but I no longer remember
such detail for what prompted the example.)
Comment 21 Mark Millard 2019-07-16 00:48:47 UTC
(In reply to Dennis Clarke from comment #19)

An FYI on approximately finding where things broke
in world/kernel for booting (debug builds):

For searching for approximately where a problem started
in FreeBSD builds made on FreeBSD servers, I use the
material from the likes of (using head and powerpc64 as
an example context):

https://artifact.ci.freebsd.org/snapshot/head/r*/powerpc/powerpc64/

(for such r* as have powerpc64 built). This avoids
having to build and having my patches involved in such
tests (if I tried to use my normal context).

I use distinct media from my own builds. I new the file
system in the partition that I use for world and kernel
on such media. This is before expanding .txz files from
artifact.ci into the partition.
Comment 22 Dennis Clarke 2019-07-16 00:56:55 UTC
(In reply to Andriy Gapon from comment #16)

With a small change :

hydra# diff -u sys/dev/bge/if_bge.c.orig sys/dev/bge/if_bge.c
--- sys/dev/bge/if_bge.c.orig   2019-07-15 19:02:10.169287000 +0000
+++ sys/dev/bge/if_bge.c        2019-07-15 23:10:50.382654000 +0000
@@ -3276,6 +3276,8 @@
                bus = device_get_parent(dev);
                if (device_get_devclass(dev) != pcib)
                        break;
+               if (device_get_devclass(bus) != pci)
+                       break;
                for (i = 0; i < nitems(mbox_reorder_lists); i++) {
                        if (pci_get_vendor(dev) ==
                            mbox_reorder_lists[i].vendor &&
@@ -3287,8 +3289,6 @@
                                return (1);
                        }
                }
-               if (device_get_devclass(bus) != pci)
-                       break;
        }
        return (0);
 }
hydra# 

.
.
.
--------------------------------------------------------------
>>> Installing kernel GENERIC64 completed on Tue Jul 16 00:37:19 GMT 2019
--------------------------------------------------------------
real 5065.02
user 2645.07
sys 2252.04
hydra#

Reboot and tell the boot loader : 

    set kern.smp.disabled=1 

All boots fine.  No ethernet re-order seen either which is excellent.


hydra# 
hydra# uname -a 
FreeBSD hydra 13.0-CURRENT FreeBSD 13.0-CURRENT #0 r350018M: Mon Jul 15 23:32:27 GMT 2019     root@hydra:/usr/obj/usr/src/r350018/powerpc.powerpc64/sys/GENERIC64  powerpc
hydra# sysctl -a | grep 'smp'
kern.smp.maxid: 3
kern.smp.maxcpus: 256
kern.smp.active: 0
kern.smp.disabled: 1
kern.smp.cpus: 1
kern.smp.threads_per_core: 1
kern.smp.cores: 1
kern.smp.topology: 0
kern.smp.forward_signal_enabled: 1
"devfs","crossmp"
hydra# 

I will try a boot wherein I say nothing to the boot loader.
Comment 23 Dennis Clarke 2019-07-16 01:07:42 UTC
With the slight patch to sys/dev/bge/if_bge.c all is working and that
includes smp and there is no network interface re-order seen : 

hydra# diff -u sys/dev/bge/if_bge.c.orig sys/dev/bge/if_bge.c
--- sys/dev/bge/if_bge.c.orig   2019-07-15 19:02:10.169287000 +0000
+++ sys/dev/bge/if_bge.c        2019-07-15 23:10:50.382654000 +0000
@@ -3276,6 +3276,8 @@
                bus = device_get_parent(dev);
                if (device_get_devclass(dev) != pcib)
                        break;
+               if (device_get_devclass(bus) != pci)
+                       break;
                for (i = 0; i < nitems(mbox_reorder_lists); i++) {
                        if (pci_get_vendor(dev) ==
                            mbox_reorder_lists[i].vendor &&
@@ -3287,8 +3289,6 @@
                                return (1);
                        }
                }
-               if (device_get_devclass(bus) != pci)
-                       break;
        }
        return (0);
 }
hydra# 
hydra# 
hydra# uptime 
 1:04AM  up 2 mins, 1 user, load averages: 0.45, 0.40, 0.17
hydra# uname -a 
FreeBSD hydra 13.0-CURRENT FreeBSD 13.0-CURRENT #0 r350018M: Mon Jul 15 23:32:27 GMT 2019     root@hydra:/usr/obj/usr/src/r350018/powerpc.powerpc64/sys/GENERIC64  powerpc
hydra# 
hydra# sysctl -a | grep 'smp'
kern.smp.maxid: 3
kern.smp.maxcpus: 256
kern.smp.active: 1
kern.smp.disabled: 0
kern.smp.cpus: 4
kern.smp.threads_per_core: 1
kern.smp.cores: 4
kern.smp.topology: 0
kern.smp.forward_signal_enabled: 1
"devfs","crossmp"
hydra# 

This is excellent. 

I will now stress the machine and try ports building as well as 
a buildworld. 

-- 
Dennis Clarke
RISC-V/SPARC/PPC/ARM/CISC
UNIX and Linux spoken
GreyBeard and suspenders optional
Comment 24 Andriy Gapon freebsd_committer 2019-07-16 08:25:36 UTC
(In reply to Francis Little from comment #17)
I do not see any connection between my patch and your panic.
Comment 25 commit-hook freebsd_committer 2019-07-16 08:37:42 UTC
A commit references this bug:

Author: avg
Date: Tue Jul 16 08:36:50 UTC 2019
New revision: 350025
URL: https://svnweb.freebsd.org/changeset/base/350025

Log:
  bge: check that the bus is a pci bus before using it as such

  This fixes the following panic on powerpc:
    pci_get_vendor failed for pcib1 on bus ofwbus0, error = 2

  PR:		238730
  Reported by:	Dennis Clarke <dclarke@blastwave.org>
  Tested by:	Dennis Clarke <dclarke@blastwave.org>
  MFC after:	2 weeks

Changes:
  head/sys/dev/bge/if_bge.c
Comment 26 Francis Little 2019-07-16 09:07:00 UTC
(In reply to Andriy Gapon from comment #24)

May be my machine, I'm rebuilding from 12-R up and will try current @ 350025
Comment 27 Dennis Clarke 2019-07-16 10:22:48 UTC
(In reply to Francis Little from comment #18)

I have the exact same panic this morning with my r350018M and
it is exactly the same as yours right down to the address 
data. 

Therefore the bge issue may be fixed however the real problem
is not. 

I will open a new bug report. 

Please see : 

    r350018 will panic on ppc64 PowerMac G5 in vm_phys_enqueue_contig
    https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=239245

I have reverted back to 12.0-RELEASE r341666 which seems to work if
and only if one also sets kern.smp.disabled=1 at the boot loader.


-- 
Dennis Clarke
RISC-V/SPARC/PPC/ARM/CISC
UNIX and Linux spoken
GreyBeard and suspenders optional
Comment 28 Dennis Clarke 2019-07-16 11:35:40 UTC
(In reply to commit-hook from comment #25)

Looking at https://svnweb.freebsd.org/base/head/sys/dev/bge/if_bge.c?r1=350025&r2=350024&pathrev=350025

Thank you for the commit and I will try out r350026 this morning 
however no promiseI can get past the current panic which seems to
happen for myself and Francis Little.

Dennis
Comment 29 Dennis Clarke 2019-07-19 23:44:24 UTC
(In reply to Francis Little from comment #26)

Let's close this bug as the real work is happening in 239245

    r350074 will panic on ppc64 PowerMac G5 in vm_phys_enqueue_contig
    https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=239245

If there are boot related panic's then we will hit them there. Let's
call this a "closed as DUPLICATE" which is the best I can
come up with to mean "duplicate work happening elsewhere".


-- 
Dennis Clarke
RISC-V/SPARC/PPC/ARM/CISC
UNIX and Linux spoken
GreyBeard and suspenders optional

*** This bug has been marked as a duplicate of bug 239245 ***
Comment 30 Dennis Clarke 2019-07-20 04:24:40 UTC
Not a duplicate. This is a specific fix to sys/dev/bge/if_bge.c 
wherein we must move the device_get_devclass(bus) check to be just
after device_get_devclass(dev)