Immediately after telling the loader : set debug.rman_debug=1 set debug.verbose_sysinit=1 set usefdt=1 boot -v . . . FreeBSD 13.0-CURRENT r349241 GENERIC powerpc . . . WARNING: WITNESS option enabled, expect reduced performance. subsystem 1000000 vm_mem_init(0)... panic: segind 91 m 0xc000000273832bf0 cpuid = 0 time = 1 KDB: stack backtrace: 0xe0000000000000008500: at .kdb_backtrace+0x5c 0xe0000000000000008630: at .vpanic+0x1b4 0xe00000000000000086f0: at .panic+0x38 0xe0000000000000008780: at .vm_phys_enqueue_contig+0x64 0xe0000000000000008850: at .vm_page_startup+0x7b4 0xe0000000000000008920: at .vm_mem_init+0x30 0xe00000000000000089b0: at .mi_startup+0x308 0xe0000000000000008a50: at .btext+0xc4 KDB: enter: panic [ thread pid 0 tid 0 ] Stopped at .kdb_enter+0x60: ld r2, r1, 0x28 db> So that was fast. Looks like the vmhat layer is at issue. -- Dennis Clarke RISC-V/SPARC/PPC/ARM/CISC UNIX and Linux spoken GreyBeard and suspenders optional
Have you booted FreeBSD successfully on this hardware in the past? segind 91 is way bigger the VM_PHYSSEG_MAX for powerpc (16), which suggests that vm_page_startup() is adding uninitialized page structures to the physical memory allocator.
(In reply to Mark Johnston from comment #1) You asked "Have you booted FreeBSD successfully on this hardware in the past?" Booted ? Ran for years and I have compiled the sources and many things from ports over and over and over. So this is very annoyying.
(In reply to Dennis Clarke from comment #2) Which revision were you last able to boot successfully?
(In reply to Dennis Clarke from comment #2) While I understand that it's annoying, when I teach people to triage PRs, Triage 101 is to set a baseline. Sometime that requires asking obvious questions to rule things out. FreeBSD is a large enough community now that not everyone knows everyone, sorry.
(In reply to Mark Linimon from comment #4) I don't mean the question is annoying. Not at all. I mean the machine won't boot. That is annoying. As for last good revision on this boxen? Good question and for that I will need to boot the 11.2 DVD and then mount the /dev/ada3 ( I think ) and have a look into ye /boot and see what I was running. Running well in fact.
Last known kernel to boot and allow me to compile head/current was r345425 and I used that today to buildworld and kernel for r349985 which instantly panics. So has anyone seen anything recent build and boot and run for the PowerMac G5 quad or its brethren ?
Can you try bisecting the kernel? I don't have an applicable G5 to test with. You have a known-good and a known-bad. Bisecting between them should only be ~13 builds (and you might be able to narrow it down further by looking at the kernel changes between the revisions).
(In reply to Justin Hibbits from comment #7) Excellent idea however I was going to attempt a more blunt force approach. My idea after seeing another instant panic would be to wipe the machine clean and reinstall the 12 release clean. Then checkout r345425 and buildworld and kernel and verify that I have a stable working system. If so then we have a baseline. I have done too many builds and ports builds on the current messy machine and feel a clean slate would be a good idea. I will then look into the bisect from r345425 up to today and try to sift through whatever I get. Dennis
Created attachment 205784 [details] r349985 ppc64 panic at boot on PowerMac G5 quad r349985 panic at boot.
(In reply to Dennis Clarke from comment #0) . . . > set usefdt=1 . . . Avoid usefdt=1 . It is known to be wrong and not work. Unless something roughly equivalent to my patches that were listed in: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=233863 are in place, the following apply for usefdt=1 : A) hrowpic.c uninorth.c openpic.c all need to be corrected for at least one of OF_xref_from_node or OF_node_from_xref use. (Pure FreeBSD issue.) B) FreeBSD choose to truncate large nodes but Apple has nodes that are truncated and so information is not preserved in such cases. (Pure FreeBSD issue as far as I can tell. I do not claim to know the detailed conequences of the truncations: I choose to preserver information.) C) The translation to fdt avoided /memory0/available information being used (and respected). The need to resepct the information was demonstrated on the iMac G3. (Pure FreeBSD issue as far as I can tell.) D) The FreeBSD code for the powerpc64 Macs expects fixed child/sibling relationships in the openfirmware nodes (instead of always searching) but usefdt reverses the ordering of the siblings (and so changes what is the the first child of a node that has childern). (Either side could be changed to avoid the mismatch. I choose the conversion to fdt preserving order.) E) Apple made some of its own rules in its supposed OpenFirmware implemenation, violating the standard. Avoiding information loss for such (without messing up handling of standards-conformant contexts), requires changes. (There is no way Apple will fix things now. So handling the oddities vs. not is a matter of choice for FreeBSD.) F) There was also an issue for filling translations[] from trans_cells[] that I ran into, but I no longer remember if the Macs would actually see the problem if other things were corrected and made tolerant of Apple's OpenFirmware node oddities. [None of (A)-(F) addresses the stuck-sleeping processes that lead to the fans going fast and the like. I've Mac specific changes for that (that is only active on Macs). But the changes are not bugzilla anywhere.] I expect that Justin and company are only intending on the default lack of usefdt as what they are enabling, at least for now. So if you specify a context with usefdt=1 I do not expect it to get much attention (for now).
(In reply to Mark Millard from comment #10) I did try to boot -sv without "usefdt=1" and was given a new type of panic so this may be progress. . . . vgapci0: Boot video device bge0: <Broadcom BCM5714 B3, ASIC rev. 0x008003> mem 0xfa530000-0xfa53ffff,0xfa520000-0xfa52ffff irq 66 at device 4.0 on pci3 ofw_pci mapdev: start fa530000, len 65536 panic: pci_get_vendor failed for pcib1 on bus ofwbus0, error = 2 cpuid = 0 time = 1 KDB: stack backtrace: 0xe000000000007f00: at .kdb_backtrace+0x5c 0xe000000000008030: at .vpanic+0x1b4 0xe0000000000080f0: at .panic+0x38 0xe000000000008180: at .bge_attach+0x2e08 0xe0000000000082c0: at .device_attach+0x420 0xe0000000000083a0: at .device_probe_and_attach+0xb4 0xe000000000008430: at .bus_generic_new_pass+0x12c 0xe0000000000084c0: at .bus_generic_new_pass+0x114 0xe000000000008550: at .bus_generic_new_pass+0x114 0xe0000000000085e0: at .bus_generic_new_pass+0x114 0xe000000000008670: at .bus_generic_new_pass+0x114 0xe000000000008700: at .bus_generic_new_pass+0x114 0xe000000000008790: at .bus_generic_new_pass+0x114 0xe000000000008820: at .bus_set_pass+0xc4 0xe0000000000088b0: at .root_bus_configure+0x1c 0xe000000000008930: at .configure+0x14 0xe0000000000089b0: at .mi_startup+0x1f8 0xe000000000008a50: at btext+0xc4 KDB: enter: panic [ thread pid 0 tid 100000 ] Stopped at .kdb_enter+0x60: ld r2, r1, 0x28 db> Now this is interesting and I am making a wild guess that it may be related to the bge0 abd bge1 interface re-order symptom which is not really a problem at all. At least not in a practical sense as I don't care which interface gets the lowest number mac address from the pair of interfaces available. However I see what looks like a loop there in bus_generic_new_pass which goes down a fark rabbit hole. Just a guess. So this is what happens if and only if I do not suggest "usefdt=1" to the boot loader as that results in an instant panic. See picture that I attached yesterday. If I do nothing other than suggest "boot -sv" then we get the above panic and that seems to be a better place to look. It may be useful to go hunting into sys/powerpc/powermac/macgpio.c and drop a pile of printf's and just see where this bug actually is. Also I want to change the title of this bug. Dennis
(In reply to Dennis Clarke from comment #11) I suspect that the problem is in bge_mbox_reorder(). Could you please try to move the device_get_devclass(bus) check to be just after device_get_devclass(dev) check?
Will do : # diff -c if_bge.c.orig if_bge.c *** if_bge.c.orig Sun Jul 14 12:00:04 2019 --- if_bge.c Mon Jul 15 14:45:26 2019 *************** *** 3272,3279 **** dev = sc->bge_dev; bus = device_get_parent(dev); for (;;) { ! dev = device_get_parent(bus); bus = device_get_parent(dev); if (device_get_devclass(dev) != pcib) break; for (i = 0; i < nitems(mbox_reorder_lists); i++) { --- 3272,3282 ---- dev = sc->bge_dev; bus = device_get_parent(dev); for (;;) { ! /* [Bug 238730] suggestion from Andriy Gapon <avg@FreeBSD.org> ! * is to move dev = device_get_parent(bus) just after the ! * call to device_get_devclass(dev) */ bus = device_get_parent(dev); + dev = device_get_parent(bus); if (device_get_devclass(dev) != pcib) break; for (i = 0; i < nitems(mbox_reorder_lists); i++) { # (In reply to Andriy Gapon from comment #12) Let's see what that does. Dennis
That small change in if_bge.c resulted in the exact same panic. So I just did a re-install of 12.0-RELEASE r341666 on that machine and will try to work forwards from there.
(In reply to Dennis Clarke from comment #14) Hi, I have a PM G5 Quad thats trash-able. I have taken it to r345245. Do you want to pick a rev number and can work forward with some builds here to share the load?
(In reply to Dennis Clarke from comment #13) It's been a while since I last saw a context diff, unified diffs are all rage now :) I think that the diff does not do what I asked in comment #12. It moves an assignment not the check. This is what I meant: Index: sys/dev/bge/if_bge.c =================================================================== --- sys/dev/bge/if_bge.c (revision 349883) +++ sys/dev/bge/if_bge.c (working copy) @@ -3276,6 +3276,8 @@ bge_mbox_reorder(struct bge_softc *sc) bus = device_get_parent(dev); if (device_get_devclass(dev) != pcib) break; + if (device_get_devclass(bus) != pci) + break; for (i = 0; i < nitems(mbox_reorder_lists); i++) { if (pci_get_vendor(dev) == mbox_reorder_lists[i].vendor && @@ -3287,8 +3289,6 @@ bge_mbox_reorder(struct bge_softc *sc) return (1); } } - if (device_get_devclass(bus) != pci) - break; } return (0); }
(In reply to Andriy Gapon from comment #16) Hi, I tried patching if_bge.c as in your diff, against a r350009 Kernel on my G5 Quad and it panics instantly.
Created attachment 205809 [details] r350009 panic with if_bge.c patch
(In reply to Francis Little from comment #18) I will give it a whirl on r350018 : hydra# uname -a FreeBSD hydra 12.0-RELEASE FreeBSD 12.0-RELEASE r341666 GENERIC powerpc hydra# pwd /usr/src/r350018 hydra# diff -u sys/dev/bge/if_bge.c.orig sys/dev/bge/if_bge.c --- sys/dev/bge/if_bge.c.orig 2019-07-15 19:02:10.169287000 +0000 +++ sys/dev/bge/if_bge.c 2019-07-15 23:10:50.382654000 +0000 @@ -3276,6 +3276,8 @@ bus = device_get_parent(dev); if (device_get_devclass(dev) != pcib) break; + if (device_get_devclass(bus) != pci) + break; for (i = 0; i < nitems(mbox_reorder_lists); i++) { if (pci_get_vendor(dev) == mbox_reorder_lists[i].vendor && @@ -3287,8 +3289,6 @@ return (1); } } - if (device_get_devclass(bus) != pci) - break; } return (0); } hydra# Fairly sure one must interrupt the boot loader and utter the following : "set kern.smp.disabled=1" Surely ye old PowerMac G5 quad will just panic instantly. I have not seen mine boot reliably in a while unless I restrict it to a single cpu core. Also for a long long while I was also using a flat device tree with "set usefdt=1". However this is deemed to be fixed : Bug 233863 - Various PowerMac G5 models may require kern.smp.disabled=1 and must set usefdt=1 which causes net interface reorder https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=233863 Well clearly the net interface re-order is still here. -- Dennis Clarke RISC-V/SPARC/PPC/ARM/CISC UNIX and Linux spoken GreyBeard and suspenders optional
(In reply to Dennis Clarke from comment #19) > Well clearly the net interface re-order is still here. usefdt=1 reverses the order of sibling nodes in OpenFirmware (item (D) of my list of known usefdt issues). The net interface re-order is one of the consequences. Others are worse (as I remember): wrong nodes found and used via the likes of just looking up the first child of first child in the FreeBSD code. (This would be fine if there were no siblings at either level but I no longer remember such detail for what prompted the example.)
(In reply to Dennis Clarke from comment #19) An FYI on approximately finding where things broke in world/kernel for booting (debug builds): For searching for approximately where a problem started in FreeBSD builds made on FreeBSD servers, I use the material from the likes of (using head and powerpc64 as an example context): https://artifact.ci.freebsd.org/snapshot/head/r*/powerpc/powerpc64/ (for such r* as have powerpc64 built). This avoids having to build and having my patches involved in such tests (if I tried to use my normal context). I use distinct media from my own builds. I new the file system in the partition that I use for world and kernel on such media. This is before expanding .txz files from artifact.ci into the partition.
(In reply to Andriy Gapon from comment #16) With a small change : hydra# diff -u sys/dev/bge/if_bge.c.orig sys/dev/bge/if_bge.c --- sys/dev/bge/if_bge.c.orig 2019-07-15 19:02:10.169287000 +0000 +++ sys/dev/bge/if_bge.c 2019-07-15 23:10:50.382654000 +0000 @@ -3276,6 +3276,8 @@ bus = device_get_parent(dev); if (device_get_devclass(dev) != pcib) break; + if (device_get_devclass(bus) != pci) + break; for (i = 0; i < nitems(mbox_reorder_lists); i++) { if (pci_get_vendor(dev) == mbox_reorder_lists[i].vendor && @@ -3287,8 +3289,6 @@ return (1); } } - if (device_get_devclass(bus) != pci) - break; } return (0); } hydra# . . . -------------------------------------------------------------- >>> Installing kernel GENERIC64 completed on Tue Jul 16 00:37:19 GMT 2019 -------------------------------------------------------------- real 5065.02 user 2645.07 sys 2252.04 hydra# Reboot and tell the boot loader : set kern.smp.disabled=1 All boots fine. No ethernet re-order seen either which is excellent. hydra# hydra# uname -a FreeBSD hydra 13.0-CURRENT FreeBSD 13.0-CURRENT #0 r350018M: Mon Jul 15 23:32:27 GMT 2019 root@hydra:/usr/obj/usr/src/r350018/powerpc.powerpc64/sys/GENERIC64 powerpc hydra# sysctl -a | grep 'smp' kern.smp.maxid: 3 kern.smp.maxcpus: 256 kern.smp.active: 0 kern.smp.disabled: 1 kern.smp.cpus: 1 kern.smp.threads_per_core: 1 kern.smp.cores: 1 kern.smp.topology: 0 kern.smp.forward_signal_enabled: 1 "devfs","crossmp" hydra# I will try a boot wherein I say nothing to the boot loader.
With the slight patch to sys/dev/bge/if_bge.c all is working and that includes smp and there is no network interface re-order seen : hydra# diff -u sys/dev/bge/if_bge.c.orig sys/dev/bge/if_bge.c --- sys/dev/bge/if_bge.c.orig 2019-07-15 19:02:10.169287000 +0000 +++ sys/dev/bge/if_bge.c 2019-07-15 23:10:50.382654000 +0000 @@ -3276,6 +3276,8 @@ bus = device_get_parent(dev); if (device_get_devclass(dev) != pcib) break; + if (device_get_devclass(bus) != pci) + break; for (i = 0; i < nitems(mbox_reorder_lists); i++) { if (pci_get_vendor(dev) == mbox_reorder_lists[i].vendor && @@ -3287,8 +3289,6 @@ return (1); } } - if (device_get_devclass(bus) != pci) - break; } return (0); } hydra# hydra# hydra# uptime 1:04AM up 2 mins, 1 user, load averages: 0.45, 0.40, 0.17 hydra# uname -a FreeBSD hydra 13.0-CURRENT FreeBSD 13.0-CURRENT #0 r350018M: Mon Jul 15 23:32:27 GMT 2019 root@hydra:/usr/obj/usr/src/r350018/powerpc.powerpc64/sys/GENERIC64 powerpc hydra# hydra# sysctl -a | grep 'smp' kern.smp.maxid: 3 kern.smp.maxcpus: 256 kern.smp.active: 1 kern.smp.disabled: 0 kern.smp.cpus: 4 kern.smp.threads_per_core: 1 kern.smp.cores: 4 kern.smp.topology: 0 kern.smp.forward_signal_enabled: 1 "devfs","crossmp" hydra# This is excellent. I will now stress the machine and try ports building as well as a buildworld. -- Dennis Clarke RISC-V/SPARC/PPC/ARM/CISC UNIX and Linux spoken GreyBeard and suspenders optional
(In reply to Francis Little from comment #17) I do not see any connection between my patch and your panic.
A commit references this bug: Author: avg Date: Tue Jul 16 08:36:50 UTC 2019 New revision: 350025 URL: https://svnweb.freebsd.org/changeset/base/350025 Log: bge: check that the bus is a pci bus before using it as such This fixes the following panic on powerpc: pci_get_vendor failed for pcib1 on bus ofwbus0, error = 2 PR: 238730 Reported by: Dennis Clarke <dclarke@blastwave.org> Tested by: Dennis Clarke <dclarke@blastwave.org> MFC after: 2 weeks Changes: head/sys/dev/bge/if_bge.c
(In reply to Andriy Gapon from comment #24) May be my machine, I'm rebuilding from 12-R up and will try current @ 350025
(In reply to Francis Little from comment #18) I have the exact same panic this morning with my r350018M and it is exactly the same as yours right down to the address data. Therefore the bge issue may be fixed however the real problem is not. I will open a new bug report. Please see : r350018 will panic on ppc64 PowerMac G5 in vm_phys_enqueue_contig https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=239245 I have reverted back to 12.0-RELEASE r341666 which seems to work if and only if one also sets kern.smp.disabled=1 at the boot loader. -- Dennis Clarke RISC-V/SPARC/PPC/ARM/CISC UNIX and Linux spoken GreyBeard and suspenders optional
(In reply to commit-hook from comment #25) Looking at https://svnweb.freebsd.org/base/head/sys/dev/bge/if_bge.c?r1=350025&r2=350024&pathrev=350025 Thank you for the commit and I will try out r350026 this morning however no promiseI can get past the current panic which seems to happen for myself and Francis Little. Dennis
(In reply to Francis Little from comment #26) Let's close this bug as the real work is happening in 239245 r350074 will panic on ppc64 PowerMac G5 in vm_phys_enqueue_contig https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=239245 If there are boot related panic's then we will hit them there. Let's call this a "closed as DUPLICATE" which is the best I can come up with to mean "duplicate work happening elsewhere". -- Dennis Clarke RISC-V/SPARC/PPC/ARM/CISC UNIX and Linux spoken GreyBeard and suspenders optional *** This bug has been marked as a duplicate of bug 239245 ***
Not a duplicate. This is a specific fix to sys/dev/bge/if_bge.c wherein we must move the device_get_devclass(bus) check to be just after device_get_devclass(dev)