Bug 224841 - Power64 regression: Machine does not boot anymore
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: CURRENT
Hardware: powerpc Any
Importance: --- Affects Only Me
Assignee: freebsd-ppc (Nobody)
URL:
Keywords: regression
Depends on:
Blocks:
 
Reported: 2018-01-02 17:00 UTC by Breno Leitao
Modified: 2018-02-05 13:08 UTC

See Also:


Attachments
Debug log (97.02 KB, text/x-log)
2018-01-02 20:17 UTC, Breno Leitao
Draft patch to speed up max CPU identification (6.59 KB, patch)
2018-01-03 17:35 UTC, Breno Leitao

Description Breno Leitao 2018-01-02 17:00:23 UTC
Hi,

My FreeBSD POWER9 machine just stopped booting; it only shows:


 Booting [/boot/kernel/kernel]...               
 Kernel entry at 0x102620 ...
 KDB: debugger backends: ddb
 KDB: current backend: ddb



I did a bisect and found that the following commit id caused the problem:

commit 9e4c7607a7c39325daea1d92ade15787adad72b2 ('Enhance the CHRP/pSeries platform layer')
Comment 1 Breno Leitao 2018-01-02 17:16:34 UTC
➜  /dev ofwdump -a
Node 0x7e728430: /
  Node 0x7e728840: hypervisor
  Node 0x7e728958: chosen
  Node 0x7e728da0: rtas
  Node 0x7e729f90: event-sources
    Node 0x7e72a148: epow-events
  Node 0x7e72a258: cpus
    Node 0x7e72a3a8: PowerPC,POWER9
    Node 0x7e72ac50: PowerPC,POWER9
    Node 0x7e72b4f8: PowerPC,POWER9
    Node 0x7e72bda0: PowerPC,POWER9
    Node 0x7e72c648: PowerPC,POWER9
    Node 0x7e72cef0: PowerPC,POWER9
    Node 0x7e72d798: PowerPC,POWER9
    Node 0x7e72e040: PowerPC,POWER9
    Node 0x7e72e8e8: PowerPC,POWER9
    Node 0x7e72f190: PowerPC,POWER9
    Node 0x7e72fa38: PowerPC,POWER9
    Node 0x7e7302e0: PowerPC,POWER9
    Node 0x7e730b88: PowerPC,POWER9
    Node 0x7e731430: PowerPC,POWER9
    Node 0x7e731cd8: PowerPC,POWER9
    Node 0x7e732580: PowerPC,POWER9
    Node 0x7e732e28: PowerPC,POWER9
    Node 0x7e7336d0: PowerPC,POWER9
    Node 0x7e733f78: PowerPC,POWER9
    Node 0x7e734820: PowerPC,POWER9
    Node 0x7e7350c8: PowerPC,POWER9
    Node 0x7e735970: PowerPC,POWER9
    Node 0x7e736218: PowerPC,POWER9
    Node 0x7e736ac0: PowerPC,POWER9
    Node 0x7e737368: PowerPC,POWER9
    Node 0x7e737c10: PowerPC,POWER9
    Node 0x7e7384b8: PowerPC,POWER9
    Node 0x7e738d60: PowerPC,POWER9
    Node 0x7e739608: PowerPC,POWER9
    Node 0x7e739eb0: PowerPC,POWER9
    Node 0x7e73a758: PowerPC,POWER9
    Node 0x7e73b000: PowerPC,POWER9
    Node 0x7e73b8a8: PowerPC,POWER9
    Node 0x7e73c150: PowerPC,POWER9
    Node 0x7e73c9f8: PowerPC,POWER9
    Node 0x7e73d2a0: PowerPC,POWER9
    Node 0x7e73db48: PowerPC,POWER9
    Node 0x7e73e3f0: PowerPC,POWER9
    Node 0x7e73ec98: PowerPC,POWER9
    Node 0x7e73f540: PowerPC,POWER9
    Node 0x7e73fde8: PowerPC,POWER9
    Node 0x7e740690: PowerPC,POWER9
    Node 0x7e740f38: PowerPC,POWER9
    Node 0x7e7417e0: PowerPC,POWER9
    Node 0x7e742088: PowerPC,POWER9
    Node 0x7e742930: PowerPC,POWER9
    Node 0x7e7431d8: PowerPC,POWER9
    Node 0x7e743a80: PowerPC,POWER9
    Node 0x7e744328: PowerPC,POWER9
    Node 0x7e744bd0: PowerPC,POWER9
    Node 0x7e745478: PowerPC,POWER9
    Node 0x7e745d20: PowerPC,POWER9
    Node 0x7e7465c8: PowerPC,POWER9
    Node 0x7e746e70: PowerPC,POWER9
    Node 0x7e747718: PowerPC,POWER9
    Node 0x7e747fc0: PowerPC,POWER9
    Node 0x7e748868: PowerPC,POWER9
    Node 0x7e749110: PowerPC,POWER9
    Node 0x7e7499b8: PowerPC,POWER9
    Node 0x7e74a260: PowerPC,POWER9
    Node 0x7e74ab08: PowerPC,POWER9
    Node 0x7e74b3b0: PowerPC,POWER9
    Node 0x7e74bc58: PowerPC,POWER9
    Node 0x7e74c500: PowerPC,POWER9
    Node 0x7e74cda8: PowerPC,POWER9
    Node 0x7e74d650: PowerPC,POWER9
    Node 0x7e74def8: PowerPC,POWER9
    Node 0x7e74e7a0: PowerPC,POWER9
    Node 0x7e74f048: PowerPC,POWER9
    Node 0x7e74f8f0: PowerPC,POWER9
    Node 0x7e750198: PowerPC,POWER9
    Node 0x7e750a40: PowerPC,POWER9
    Node 0x7e7512e8: PowerPC,POWER9
    Node 0x7e751b90: PowerPC,POWER9
    Node 0x7e752438: PowerPC,POWER9
    Node 0x7e752ce0: PowerPC,POWER9
    Node 0x7e753588: PowerPC,POWER9
    Node 0x7e753e30: PowerPC,POWER9
    Node 0x7e7546d8: PowerPC,POWER9
    Node 0x7e754f80: PowerPC,POWER9
    Node 0x7e755828: PowerPC,POWER9
    Node 0x7e7560d0: PowerPC,POWER9
    Node 0x7e756978: PowerPC,POWER9
    Node 0x7e757220: PowerPC,POWER9
    Node 0x7e757ac8: PowerPC,POWER9
    Node 0x7e758370: PowerPC,POWER9
    Node 0x7e758c18: PowerPC,POWER9
    Node 0x7e7594c0: PowerPC,POWER9
    Node 0x7e759d68: PowerPC,POWER9
    Node 0x7e75a610: PowerPC,POWER9
    Node 0x7e75aeb8: PowerPC,POWER9
    Node 0x7e75b760: PowerPC,POWER9
    Node 0x7e75c008: PowerPC,POWER9
    Node 0x7e75c8b0: PowerPC,POWER9
    Node 0x7e75d158: PowerPC,POWER9
    Node 0x7e75da00: PowerPC,POWER9
    Node 0x7e75e2a8: PowerPC,POWER9
    Node 0x7e75eb50: PowerPC,POWER9
    Node 0x7e75f3f8: PowerPC,POWER9
    Node 0x7e75fca0: PowerPC,POWER9
    Node 0x7e760548: PowerPC,POWER9
    Node 0x7e760df0: PowerPC,POWER9
    Node 0x7e761698: PowerPC,POWER9
    Node 0x7e761f40: PowerPC,POWER9
    Node 0x7e7627e8: PowerPC,POWER9
    Node 0x7e763090: PowerPC,POWER9
    Node 0x7e763938: PowerPC,POWER9
    Node 0x7e7641e0: PowerPC,POWER9
    Node 0x7e764a88: PowerPC,POWER9
    Node 0x7e765330: PowerPC,POWER9
    Node 0x7e765bd8: PowerPC,POWER9
    Node 0x7e766480: PowerPC,POWER9
    Node 0x7e766d28: PowerPC,POWER9
    Node 0x7e7675d0: PowerPC,POWER9
    Node 0x7e767e78: PowerPC,POWER9
    Node 0x7e768720: PowerPC,POWER9
    Node 0x7e768fc8: PowerPC,POWER9
    Node 0x7e769870: PowerPC,POWER9
    Node 0x7e76a118: PowerPC,POWER9
    Node 0x7e76a9c0: PowerPC,POWER9
    Node 0x7e76b268: PowerPC,POWER9
    Node 0x7e76bb10: PowerPC,POWER9
    Node 0x7e76c3b8: PowerPC,POWER9
    Node 0x7e76cc60: PowerPC,POWER9
    Node 0x7e76d508: PowerPC,POWER9
    Node 0x7e76ddb0: PowerPC,POWER9
    Node 0x7e76e658: PowerPC,POWER9
    Node 0x7e76ef00: PowerPC,POWER9
  Node 0x7e76f808: pci
    Node 0x7e770d38: pci
    Node 0x7e771300: usb
    Node 0x7e7719c0: unknown-legacy-device
  Node 0x7e7720a0: vdevice
    Node 0x7e772320: l-lan
    Node 0x7e7726a8: v-scsi
      Node 0x7e794f88: disk
    Node 0x7e772a78: vty
    Node 0x7e772c50: nvram
  Node 0x7e772ed8: memory
  Node 0x7e773088: interrupt-controller
  Node 0x7e7748d8: aliases
  Node 0x7e7749e0: options
  Node 0x7e774aa0: openprom
  Node 0x7e774ba8: packages
    Node 0x7e774c70: deblocker
    Node 0x7e775710: disk-label
    Node 0x7e779bf0: fat-files
    Node 0x7e77c368: rom-files
    Node 0x7e77cd88: ext2-files
    Node 0x7e77ef40: obp-tftp
    Node 0x7e77f3c0: iso-9660
  Node 0x7e7824a8: mmu
Comment 2 Nathan Whitehorn freebsd_committer freebsd_triage 2018-01-02 17:27:42 UTC
Thanks for the report -- I will investigate. Are you 100% sure it is that specific commit? It changed (almost) nothing except for SMP code, which your kernel does not seem to have gotten to yet.
Comment 3 Breno Leitao 2018-01-02 18:23:29 UTC
>  Are you 100% sure it is that specific commit? 

I am not, and now (after some system changes) upstream seems to be working again.
Comment 4 Breno Leitao 2018-01-02 20:16:54 UTC
I finally got my mess fixed and I am able to reproduce it with your fix.

I created the following patch to dump what is happening:

@@ -340,12 +344,14 @@ chrp_cpuref_for_server(struct cpuref *cpuref, int cpu_n, int server)
 
        i = 0;
        for (cpu = OF_child(dev); cpu != 0; cpu = OF_peer(cpu)) {
+               printf("CPU=%d and i = %d --- ", cpu, i);
                res = OF_getprop(cpu, "device_type", buf, sizeof(buf));
                if (res <= 0 || strcmp(buf, "cpu") != 0)
                        continue;
 
                res = OF_getproplen(cpu, "ibm,ppc-interrupt-server#s");
                if (res > 0) {
+                       printf("res = %d\n", res);
                        cell_t interrupt_servers[res/sizeof(cell_t)];
                        OF_getencprop(cpu, "ibm,ppc-interrupt-server#s",
                            interrupt_servers, res);


And this line is being printed all the time, so it seems to be stuck in a loop.

It is not actually stuck in this one loop, though, because i returns to 0. Anyway, let me attach the full log here.
Comment 5 Breno Leitao 2018-01-02 20:17:24 UTC
Created attachment 189345
Debug log
Comment 6 Nathan Whitehorn freebsd_committer freebsd_triage 2018-01-02 20:41:35 UTC
That output actually looks normal. CPU enumeration happens by repeatedly descending through the tree -- you will note that the largest value attained by i gets larger every loop. This breaks down at the last loop where it would have gotten to CPU 71 but instead stalls. This is confusing to me since it had already gotten past CPU 66 a few times...

If you back out that specific revision, do things work again? I couldn't quite follow.
Comment 7 Breno Leitao 2018-01-03 17:35:53 UTC
Created attachment 189375
Draft patch to speed up max CPU identification
Comment 8 Breno Leitao 2018-01-03 17:36:38 UTC
After some debugging, I found that the machine is not hanging; it is looping in the CPU detection code for a long time. This involves the changes in chrp_cpuref_for_server(), which is called many times.

In fact, chrp_cpuref_for_server(), via chrp_smp_next_cpu() (platform_smp_next_cpu), is called n times, where n is the number of CPUs, just to count the CPUs available in the machine.

The delay grows with the number of CPUs available. On a machine with 100+ CPUs, it takes around 30 minutes before the boot process starts, and the loop runs again after the initial headers are printed. This long loop is executed twice per boot.

That explains why I could not reproduce it yesterday: I tested on a machine with few CPUs (e.g. 2), where the delay is unnoticeable.

That said, I created a draft patch that adds a platform function which counts the maximum number of CPUs available on the machine via the device tree. This gets mp_ncpus much faster than the current implementation.

PS: This patch was barely tested and is not ready for consumption.
Comment 9 Gustavo Romero 2018-01-03 21:59:28 UTC
Hi, 

Boot on POWER8 also got stuck after:

"Trying to mount root from ufs:/dev/da0s2a [rw]..."

Enabling the rc.d debug option did not show anything more, so it probably got stuck in kernel space.

Reverting commit 9e4c7607a7c39325daea1d92ade15787adad72b2 ('Enhance the CHRP/pSeries platform layer') solves the issue and POWER8 boot works fine.

Applying Breno's draft did not solve the issue.

I'm wondering whether the mapping between the new contiguous CPU IDs and '{pc,cr}_hwref' is correct, since previously 'dev' was a very high number:

cpu0: dev=7e72a3a0 (BSP)
cpu8: dev=7e72ac20
cpu16: dev=7e72b4a0
cpu24: dev=7e72bd20
cpu32: dev=7e72c5a0
cpu40: dev=7e72ce20
cpu48: dev=7e72d6a0
cpu56: dev=7e72df20

and now:

cpu0: dev=0 (BSP)
cpu1: dev=8
cpu2: dev=10
cpu3: dev=18
cpu4: dev=20
cpu5: dev=28
cpu6: dev=30
cpu7: dev=38

For the record, here is the full boot log that halts after "Trying to mount...":

>> FreeBSD/powerpc Open Firmware boot block
   Boot path:   /vdevice/v-scsi@2000/disk@8000000000000000
   Boot loader: /boot/loader
   Boot volume:   /vdevice/v-scsi@2000/disk@8000000000000000:2
Consoles: Open Firmware console  

FreeBSD/powerpc64 Open Firmware loader, Revision 0.1
(Wed Jan  3 14:26:49 -02 2018 root@gromero31)
Memory: 8388608KB
Booted from: /vdevice/v-scsi@2000/disk@8000000000000000

Loading /boot/defaults/loader.conf
/boot/kernel/kernel data=0xf1d038+0x4a6a08 syms=[0x8+0x155a78+0x8+0x154aac]
/boot/entropy size=0x1000

Hit [Enter] to boot immediately, or any other key for command prompt.
Booting [/boot/kernel/kernel]...               
Kernel entry at 0x102620 ...
KDB: debugger backends: ddb
KDB: current backend: ddb
Copyright (c) 1992-2017 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
	The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 12.0-CURRENT #3 9e4c7607a7c(HEAD): Wed Jan  3 17:21:09 -02 2018
    root@gromero31:/usr/obj/root/git/freebsd/powerpc.powerpc64/sys/GENERIC64 powerpc
gcc version 4.2.1 20070831 patched [FreeBSD]
WARNING: WITNESS option enabled, expect reduced performance.
VT: init without driver.
cpu0: IBM POWER8E revision 2.1, 3425.00 MHz
cpu0: Features dc005180<PPC32,PPC64,ALTIVEC,FPU,MMU,SMT,ARCH205,ARCH206,VSX>
cpu0: Features2 ca000000<ARCH207,HTM,ISEL,VCRYPTO>
real memory  = 8552837120 (8156 MB)
avail memory = 8264536064 (7881 MB)
FreeBSD/SMP: Multiprocessor System Detected: 8 CPUs
random: unblocking device.
random: entropy device external interface
kbd0 at kbdmux0
ofwbus0: <Open Firmware Device Tree> on nexus0
xicp0: <PAPR virtual interrupt controller> on ofwbus0
rtas0: <Run-Time Abstraction Services> on ofwbus0
rtas0: registered as a time-of-day clock, resolution 0.002000s
cpulist0: <Open Firmware CPU Group> on ofwbus0
pcib0: <RTAS Host-PCI bridge> on ofwbus0
pci0: <POWER Hypervisor PCI bus> on pcib0
pcib1: <OFW PCI-PCI bridge> mem 0x90100000-0x901000ff irq 4099 at device 1.0 on pci0
pci1: <POWER Hypervisor PCI bus> on pcib1
pci0: <serial bus, USB> at device 2.0 (no driver attached)
pci0: <old> at device 4.0 (no driver attached)
vdevice0: <POWER Hypervisor Virtual Device Root> on ofwbus0
llan0: <POWER Hypervisor Virtual Ethernet> irq 16781320 on vdevice0
llan0: Ethernet address: 00:1f:36:04:00:00
vscsi0: <POWER Hypervisor Virtual SCSI Bus> irq 16781319 on vdevice0
vscsi0: Queue depth 22 commands
uart0: <POWER Hypervisor Virtual Serial Port> irq 16781321 on vdevice0
Timecounter "timebase" frequency 512000000 Hz quality 0
Event timer "decrementer" frequency 512000000 Hz quality 1000
Timecounters tick every 1.000 msec
llan0: link state changed to UP
usb_needs_explore_all: no devclass
cd0 at vscsi0 bus 0 scbus0 target 0 lun 1
cd0: <QEMU QEMU CD-ROM 2.5+> Removable CD-ROM SPC-3 SCSI device
cd0: 150.000MB/s transfers
cd0: 878MB (449960 2048 byte sectors)
da0 at vscsi0 bus 0 scbus0 target 0 lun 0
da0: <QEMU QEMU HARDDISK 2.5+> Fixed Direct Access SPC-3 SCSI device
da0: 150.000MB/s transfers
da0: 153600MB (314572800 512 byte sectors)
SMP: AP CPU #4 launched
SMP: AP CPU #7 launched
SMP: AP CPU #1 launched
SMP: AP CPU #2 launched
SMP: AP CPU #3 launched
SMP: AP CPU #6 launched
SMP: AP CPU #5 launched
WARNING: WITNESS option enabled, expect reduced performance.
Trying to mount root from ufs:/dev/da0s2a [rw]...
Comment 10 Mark Millard 2018-01-04 08:24:25 UTC
(In reply to Gustavo Romero from comment #9)

Having FreeBSD hang around:

Trying to mount root from ufs:/dev/da0s2a [rw]...

might be tied to the following. . .

Hanging here is something addressed by the
"UMA limit" part of head -r327485 . Bugzilla
224330 has material about it. As I understand
-r326347 is what broke that specific aspect
of things. (It had other issues as well,
and apparently causes some other problem to
show up that was previously hidden.)

As for -r327485:

Author: jeff
Date: Tue Jan  2 04:35:56 2018
New Revision: 327485
URL: https://svnweb.freebsd.org/changeset/base/327485


Log:
 Fix arc after r326347 broke various memory limit queries.  Use UMA features
 rather than kmem arena size to determine available memory.

 Initialize the UMA limit to LONG_MAX to avoid spurious wakeups on boot before
 the real limit is set.

 PR:		224330 (partial), 224080
 Reviewed by:	markj, avg
 Sponsored by:	Netflix / Dell EMC Isilon
 Differential Revision:	https://reviews.freebsd.org/D13494
. . .


The "arc" reference is a ZFS file system handling issue.
Comment 11 Gustavo Romero 2018-01-04 12:15:00 UTC
Hi Mark,

Thanks a lot for the helpful comments.

Commit "c17fd15c00f Fix arc after r326347 broke various memory limit queries.  Use UMA features rather than kmem arena size to determine available memory." [1] is actually already present in my build, so this is probably a different root cause with the same symptom?

But talking about UMA, one thing I forgot to mention is that when I boot with only 1 CPU it works fine. Do you know whether the issue solved by fix [1] would also go away (i.e. the machine would boot normally) with only one CPU?

[1] https://github.com/freebsd/freebsd/commit/c17fd15c00f
Comment 12 Nathan Whitehorn freebsd_committer freebsd_triage 2018-01-04 15:19:37 UTC
(In reply to Gustavo Romero from comment #9)

OK, obviously there is something questionable about that patch. I will revert it later today and try to figure out what went wrong before reapplication.

Thank you so much for the debugging and detailed information.
Comment 13 Gustavo Romero 2018-01-04 15:48:41 UTC
(In reply to Nathan Whitehorn from comment #12)

Thanks Nathan!

If you need any help testing it, count on me.

Cheers.
Comment 14 Breno Leitao 2018-01-04 16:06:45 UTC
Thanks Nathan. I can also help to address the issue you were trying to address, if you need. Count me in also.
Comment 15 Mark Millard 2018-01-04 18:49:48 UTC
(In reply to Gustavo Romero from comment #11)

I do not know what the "only one hardware thread"
implications would be.

When it is hung-up, can you get to the db> prompt
to try commands such as:

print *uma_kmem_limit
print *uma_kmem_total
print *uma_reclaim_needed
show vmem *kernel_arena
show page
show pageq
show allpcpu
ps
acttrace
show allchains

It could give appropriate folks an idea what is
going on at the time. (There may be other, better
or additional alternatives to what I list.)

The first 4 of that list of commands are intended
as a cross-check on whether the LONG_MAX part of -r327485
might not be sufficient.
Comment 16 commit-hook freebsd_committer freebsd_triage 2018-01-04 23:08:19 UTC
A commit references this bug:

Author: nwhitehorn
Date: Thu Jan  4 23:07:51 UTC 2018
New revision: 327566
URL: https://svnweb.freebsd.org/changeset/base/327566

Log:
  Revert r327360, which can cause boot problems on high-CPU-count (>60)
  POWER8 and POWER9 systems, pending further analysis.

  PR:		224841

Changes:
  head/sys/powerpc/pseries/platform_chrp.c
Comment 17 Mark Millard 2018-01-04 23:24:18 UTC
(In reply to commit-hook from comment #16)

>  Revert r327360, which can cause boot problems on high-CPU-count (>60)
>  POWER8 and POWER9 systems, pending further analysis.

Note that comment #9's report is for only
"8 CPUs", quoting:

cpu0: IBM POWER8E revision 2.1, 3425.00 MHz
cpu0: Features dc005180<PPC32,PPC64,ALTIVEC,FPU,MMU,SMT,ARCH205,ARCH206,VSX>
cpu0: Features2 ca000000<ARCH207,HTM,ISEL,VCRYPTO>
real memory  = 8552837120 (8156 MB)
avail memory = 8264536064 (7881 MB)
FreeBSD/SMP: Multiprocessor System Detected: 8 CPUs

That is one of the reasons that I asked about the
"UMA limit" issue for what comment #9 reported.
Comment 18 Gustavo Romero 2018-01-05 20:30:36 UTC
(In reply to Mark Millard from comment #15)

Hi Mark,

I'm unable to get to the ddb prompt after the halt. I'm pretty sure DDB is enabled: on a kernel with the commit in question reverted, hitting Ctrl-Alt-ESC triggers it normally. But after the halt even that sequence does not work. So no luck getting that information, sorry.
Comment 19 Gustavo Romero 2018-01-05 20:33:09 UTC
(In reply to Mark Millard from comment #17)

That's correct. Looks like the commit in question affects POWER9 due to the huge number of CPUs but on POWER8 it looks like a different root cause. If I find something worth noting I'll update this BZ. After Nathan reverted the commit all is working again on POWER8 boot (and I guess on POWER9 as well). Thanks.
Comment 20 Wojciech Macek 2018-01-23 06:58:16 UTC
Is there a chance you could try out this version?
https://reviews.freebsd.org/D14011

I made some optimizations to cpuid/hwref mapping - the search now has O(1) complexity so should be pretty fast.

Also, this change must be applied on top of HEAD, as HEAD contains some fixes for AP startup, which I suspect might help with the hang after "Trying to mount...".
Comment 21 Gustavo Romero 2018-01-23 13:28:43 UTC
(In reply to Wojciech Macek from comment #20)

Hi,

Using the current HEAD:

commit 46822a0e45dc8e37132231ed6a4f95fe6fdf6d68 (HEAD -> master, origin/master, origin/HEAD)
Author: wma <wma@FreeBSD.org>
Date:   Tue Jan 23 08:07:00 2018 +0000

    PowerNV: send MSI_EOI always after MSI unmask
    
    MSI/MSI-x interrupts are edge-triggered. If an interrupt
    arrives when IRQ line is masked, it will be lost and will
    never recover. Perform MSI_EOI always after unmask to give
    a chance for PHB/XICS to send an interrupt again if MSI/MSI-x
    pending bit is set in MSI/MSI-x BAR space.
    
    Submitted by:          Wojciech Macek <wma@semihalf.org>
    Obtained from:         Semihalf
    Sponsored by:          IBM, QCM Technologies

I'm not able to boot on my P8 VM:

>> FreeBSD/powerpc Open Firmware boot block
   Boot path:   /vdevice/v-scsi@2000/disk@8000000000000000
   Boot loader: /boot/loader
   Boot volume:   /vdevice/v-scsi@2000/disk@8000000000000000:2
Consoles: Open Firmware console  

FreeBSD/powerpc64 Open Firmware loader, Revision 0.1
(Wed Jan  3 14:26:49 -02 2018 root@gromero31)
Memory: 8388608KB
Booted from: /vdevice/v-scsi@2000/disk@8000000000000000

Loading /boot/defaults/loader.conf
/boot/kernel/kernel data=0x11ee978+0x4c6e48 syms=[0x8+0x1684e0|
elf64_loadimage: could not read symbols - skipped! (495616 != 1475808)
panic: free: guard1 fail @ 0x182e3c0 from unknown:0
--> Press a key on the console to reboot <--
Rebooting...

W3411: Client application returned.

E3406: Client application returned an error.

Trying to load:  from: /vdevice/v-scsi@2000/disk@8001000000000000 ... Memory for ELF file is already in use!
E3402: Aborting boot, internal error

  Type 'boot' and press return to continue booting the system.
  Type 'reset-all' and press return to reboot the system.


Ready! 
0 > 

I'll try to bisect and debug it. So I'm not able to test your change on top of current HEAD yet.
Comment 22 Wojciech Macek 2018-01-24 07:29:42 UTC
Hm... very weird. It looks like the loader fails to load the image...

Were you able to bisect which change is causing this?


Regards,
Wojtek
Comment 23 Gustavo Romero 2018-01-24 22:19:43 UTC
(In reply to Wojciech Macek from comment #22)

Not yet. But I just pulled in the most recent HEAD and now I'm consistently getting the same issue as reported in https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225182, and it looks like it happens even before the kernel is fully loaded:

Trying to load:  from: /vdevice/v-scsi@2000/disk@8000000000000000 ...   Successfully loaded
 
>> FreeBSD/powerpc Open Firmware boot block
   Boot path:   /vdevice/v-scsi@2000/disk@8000000000000000
   Boot loader: /boot/loader
   Boot volume:   /vdevice/v-scsi@2000/disk@8000000000000000:2
Consoles: Open Firmware console  

FreeBSD/powerpc64 Open Firmware loader, Revision 0.1
(Wed Jan  3 14:26:49 -02 2018 root@gromero31)
Memory: 8388608KB
Booted from: /vdevice/v-scsi@2000/disk@8000000000000000

Loading /boot/defaults/loader.conf
/boot/kernel/kernel data=0x11ee988+0x4c6eb8 syms=[0x8+0x1684e0 

( 300 ) Data Storage Exception [ 7dc4af00 ]


    R0 .. R7           R8 .. R15         R16 .. R23         R24 .. R31
8000000000001000   000000007dc4aef0   0000000000000000   000000007dc05100   
000000007e665fe0   616e64730a09727f   0000000000000000   000000007dc09648   
000000007dc0f200   000000007dc4aef0   000000007e462010   0000000000000003   
000000007dc4c000   0000000000000000   000000007dc09818   000000000000f001   
0000000000000000   0000000000000000   0000000000008000   000000007e666060   
000000007dc4af00   0000000008da6000   000000000000f003   ffffffffffffffff   
000000007e462008   0000000000000000   0000000000000006   000000007dbe4b60   
0000000000000020   0000000000000000   000000007e666050   000000007e708fb0   

    CR / XER           LR / CTR          SRR0 / SRR1        DAR / DSISR
        8000f882   000000007dbe34d4   000000007dbe42c8   616e64730a09727f   
0000000020000000   000000007dbe4b60   8000000000001000           40000000   


1df > 

I'll continue to investigate. Thanks.
Comment 24 Breno Leitao 2018-02-05 13:08:46 UTC
Issue closed since r327360 was reverted.