Bug 243464 - rpi 3 freezes while booting - caused by r356776
Summary: rpi 3 freezes while booting - caused by r356776
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: arm (show other bugs)
Version: CURRENT
Hardware: arm64 Any
: --- Affects Some People
Assignee: freebsd-arm (Nobody)
URL:
Keywords: regression
Depends on:
Blocks:
 
Reported: 2020-01-20 09:12 UTC by iz-rpi03
Modified: 2020-02-14 05:14 UTC (History)
4 users (show)

See Also:


Attachments
Verbose boot of with ntpd disabled till multiuser login and failed CPU reset. (29.03 KB, text/plain)
2020-01-24 20:55 UTC, iz-rpi03
no flags Details
Some debugger outputs after waiting more than 2 hours for somenthing (may be a panic) to happen. (42.80 KB, text/plain)
2020-02-05 22:14 UTC, iz-rpi03
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description iz-rpi03 2020-01-20 09:12:08 UTC
Hello,

after updating a RPi 3 from r356417 to r356826 the RPi freezes while booting.
It does not respond to USB keyboard or serial console any more, so I am unable
to generate a crashdump or a core.

The last console output is always:

 Creating and/or trimming log files.
 Starting syslogd.
 NFS access cache time=60
 No core dumps found.
 Clearing /tmp (X related).
 Recovering vi editor sessions.
 Updating motd:.
 Mounting late filesystems:.

Those RPi is booting from a ZFS mirrored pool.

By bisecting I found that base r356776 is the first release which that behavior.
r356767 is the last one which booted fine.

Ralf
Comment 1 iz-rpi03 2020-01-24 20:55:11 UTC
Created attachment 211016 [details]
Verbose boot of with ntpd disabled till multiuser login and failed CPU reset.

Doing some additional digging I learned that disabeling ntpd via /etc/rc.conf gets the system to boot till multiuser login with the r356776 kernel.

Loading the mac_ntpd kernel module immediately freezes the system.
I tried to load it without the -q option, but there are no additional messages shown before the system freezes.

I also learned that rebooting via shutdown(8) -r or CTRL-ALT-DEL does not work any more. The last console message is always "cpu_reset failed".

The tests were done with a r356776 and a r357073 kernel. Both give the same results.

The attached console log is made from the newer r357073 kernel.
Comment 2 iz-rpi03 2020-01-25 13:31:25 UTC
(In addition to my comment #1)
> Loading the mac_ntpd kernel module immediately freezes the system.

Preloading, via the loader(8), does not.
Comment 3 iz-rpi03 2020-01-26 10:21:28 UTC
(In reply to iz-rpi03 from comment #2)
Letting the system work with a preloaded mac_ntpd kernel module
seems not to be a smart idea. Because during a make installworld the
kernel panics:

# panic: deadlres_td_sleep_q: possible deadlock detected for 0xfffffd0000f33560, blocked for 1802833 ticks

cpuid = 0
time = 1579967871
KDB: stack backtrace:

db_trace_self() at db_trace_self_wrapper+0x28
         pc = 0xffff00000073b34c  lr = 0xffff000000106814
         sp = 0xffff00005214b580  fp = 0xffff00005214b790

vpanic() at panic+0x44
         pc = 0xffff000000409838  lr = 0xffff0000004095e0
         sp = 0xffff00005214b860  fp = 0xffff00005214b8e0

panic() at deadlkres+0x2f8
         pc = 0xffff0000004095e0  lr = 0xffff0000003a46f4
         sp = 0xffff00005214b8f0  fp = 0xffff00005214b940

KDB: enter: panic
[ thread pid 0 tid 100083 ]
Stopped at      0
Comment 4 Bob Prohaska 2020-01-31 02:38:32 UTC
(In reply to iz-rpi03 from comment #3)
It's not a good idea to try 

kldload mac_ntpd
either 8-)

On an RPI3 running FreeBSD 13.0-CURRENT (GENERIC) #0 r357296M: Thu Jan 30 13:39:42 PST 2020 the result was a silent hang, with backtrace reporting

> KDB: enter: Break to debugger
> [ thread pid 1116 tid 100085 ]
> Stopped at      0x4031ed44
> db> bt
> Tracing pid 1116 tid 100085 td 0xfffffd0001020000
> db_trace_self() at db_stack_trace+0xf8
>         pc = 0xffff00000073c55c  lr = 0xffff000000103c58
>         sp = 0xffff00005a4bc830  fp = 0xffff00005a4bc860
> 
> db_stack_trace() at db_command+0x228
>         pc = 0xffff000000103c58  lr = 0xffff0000001038d0
>         sp = 0xffff00005a4bc870  fp = 0xffff00005a4bc950
> 
> db_command() at db_command_loop+0x58
>         pc = 0xffff0000001038d0  lr = 0xffff000000103678
>         sp = 0xffff00005a4bc960  fp = 0xffff00005a4bc980
> 
> db_command_loop() at db_trap+0xf4
>         pc = 0xffff000000103678  lr = 0xffff00000010697c
>         sp = 0xffff00005a4bc990  fp = 0xffff00005a4bcbb0
> 
> db_trap() at kdb_trap+0x1d8
>         pc = 0xffff00000010697c  lr = 0xffff0000004510a0
>         sp = 0xffff00005a4bcbc0  fp = 0xffff00005a4bcc70
> 
> kdb_trap() at do_el1h_sync+0xf4
>         pc = 0xffff0000004510a0  lr = 0xffff000000759568
>         sp = 0xffff00005a4bcc80  fp = 0xffff00005a4bccb0
> 
> do_el1h_sync() at handle_el1h_sync+0x78
>         pc = 0xffff000000759568  lr = 0xffff00000073f078
>         sp = 0xffff00005a4bccc0  fp = 0xffff00005a4bcdd0
> 
> handle_el1h_sync() at kdb_alt_break_internal+0x130
>         pc = 0xffff00000073f078  lr = 0xffff000000450858
>         sp = 0xffff00005a4bcde0  fp = 0xffff00005a4bce80
> 
> kdb_alt_break_internal() at kdb_alt_break+0xc
>         pc = 0xffff000000450858  lr = 0xffff000000450718
>         sp = 0xffff00005a4bce90  fp = 0xffff00005a4bce90
> 
> kdb_alt_break() at uart_intr_rxready+0x88
>         pc = 0xffff000000450718  lr = 0xffff000000263124
>         sp = 0xffff00005a4bcea0  fp = 0xffff00005a4bcec0
> 
> uart_intr_rxready() at uart_intr+0xec
>         pc = 0xffff000000263124  lr = 0xffff000000263e0c
>         sp = 0xffff00005a4bced0  fp = 0xffff00005a4bcf00
> 
> uart_intr() at intr_event_handle+0xc8
>         pc = 0xffff000000263e0c  lr = 0xffff0000003cbaf4
>         sp = 0xffff00005a4bcf10  fp = 0xffff00005a4bcf50
> 
> intr_event_handle() at intr_isrc_dispatch+0x34
>         pc = 0xffff0000003cbaf4  lr = 0xffff00000078faa4
>         sp = 0xffff00005a4bcf60  fp = 0xffff00005a4bcf70
> 
> intr_isrc_dispatch() at bcm2835_intc_intr+0xa8
>         pc = 0xffff00000078faa4  lr = 0xffff000000729a50
>         sp = 0xffff00005a4bcf80  fp = 0xffff00005a4bcfc0
> 
> bcm2835_intc_intr() at intr_event_handle+0xc8
>         pc = 0xffff000000729a50  lr = 0xffff0000003cbaf4
>         sp = 0xffff00005a4bcfd0  fp = 0xffff00005a4bd010
> 
> intr_event_handle() at intr_isrc_dispatch+0x34
>         pc = 0xffff0000003cbaf4  lr = 0xffff00000078faa4
>         sp = 0xffff00005a4bd020  fp = 0xffff00005a4bd030
> 
> intr_isrc_dispatch() at bcm_lintc_intr+0x1e8
>         pc = 0xffff00000078faa4  lr = 0xffff00000072fc88
>         sp = 0xffff00005a4bd040  fp = 0xffff00005a4bd0c0
> 
> bcm_lintc_intr() at intr_irq_handler+0x74
>         pc = 0xffff00000072fc88  lr = 0xffff00000078f904
>         sp = 0xffff00005a4bd0d0  fp = 0xffff00005a4bd0f0
> 
> intr_irq_handler() at handle_el1h_irq+0x74
>         pc = 0xffff00000078f904  lr = 0xffff00000073f140
>         sp = 0xffff00005a4bd100  fp = 0xffff00005a4bd210
> 
> handle_el1h_irq() at rms_wlock+0x124
>         pc = 0xffff00000073f140  lr = 0xffff000000403fc4
>         sp = 0xffff00005a4bd220  fp = 0xffff00005a4bd2b0
> 
> rms_wlock() at rms_wlock+0x124
>         pc = 0xffff000000403fc4  lr = 0xffff000000403fc4
>         sp = 0xffff00005a4bd2c0  fp = 0xffff00005a4bd360
> 
> rms_wlock() at mac_policy_modevent+0xc0
>         pc = 0xffff000000403fc4  lr = 0xffff00000066ba44
>         sp = 0xffff00005a4bd370  fp = 0xffff00005a4bd3c0
> 
> mac_policy_modevent() at module_register_init+0xc4
>         pc = 0xffff00000066ba44  lr = 0xffff0000003e783c
>         sp = 0xffff00005a4bd3d0  fp = 0xffff00005a4bd400
> 
> module_register_init() at linker_load_module+0xab8
>         pc = 0xffff0000003e783c  lr = 0xffff0000003d8670
>         sp = 0xffff00005a4bd410  fp = 0xffff00005a4bd740
> 
> linker_load_module() at kern_kldload+0xec
>         pc = 0xffff0000003d8670  lr = 0xffff0000003d9e40
>         sp = 0xffff00005a4bd750  fp = 0xffff00005a4bd780
> 
> kern_kldload() at sys_kldload+0x64
>         pc = 0xffff0000003d9e40  lr = 0xffff0000003d9f98
>         sp = 0xffff00005a4bd790  fp = 0xffff00005a4bd7b0
> 
> sys_kldload() at do_el0_sync+0x514
>         pc = 0xffff0000003d9f98  lr = 0xffff000000759bf4
>         sp = 0xffff00005a4bd7c0  fp = 0xffff00005a4bd860
> 
> do_el0_sync() at handle_el0_sync+0x90
>         pc = 0xffff000000759bf4  lr = 0xffff00000073f224
>         sp = 0xffff00005a4bd870  fp = 0xffff00005a4bd980
> 
> handle_el0_sync() at 0x210368
>         pc = 0xffff00000073f224  lr = 0x0000000000210368
>         sp = 0xffff00005a4bd990  fp = 0x0000ffffffffeaf0
> 
> db> 

HTH,

bob prohaska
Comment 5 Bob Prohaska 2020-01-31 02:57:48 UTC
(In reply to Bob Prohaska from comment #4)
Sorry, forgot to add this machine is running ufs.
Comment 6 iz-rpi03 2020-01-31 11:00:11 UTC
To answer a question form the ARM mailing-list if my system also silent hangs
whenit kldloads other modules than mac_ntp: No, or better none I use regularly.

Tested with r357112 kernel and world, immediately after a fresh boot:

# kldstat
Id Refs Address                Size Name
 1   21 0xffff000000000000  14f81e8 kernel
 2    1 0xffff0000014f9000   2eed00 zfs.ko
 3    1 0xffff0000017e8000    214a8 mac_ntpd.ko
 4    2 0xffff00000180a000    284d8 opensolaris.ko
 5    1 0xffff00005c400000    21000 uhid.ko
 6    1 0xffff00005c421000    21000 wmt.ko
# 

While kernel, zfs, opensolaris and mac_ntp are loaded by the loader,
uhid and wmt are loaded later. From /var/run/dmesg.boot

...
Configuring vt: keymap.
Autoloading module: uhid.ko
Autoloading module: wmt.ko
uhid0 on uhub1
...

And after mounting a smbfs:

# kldstat
Id Refs Address                Size Name
 1   30 0xffff000000000000  14f81e8 kernel
 2    1 0xffff0000014f9000   2eed00 zfs.ko
 3    1 0xffff0000017e8000    214a8 mac_ntpd.ko
 4    2 0xffff00000180a000    284d8 opensolaris.ko
 5    1 0xffff00005c400000    21000 uhid.ko
 6    1 0xffff00005c421000    21000 wmt.ko
 7    1 0xffff00005c442000    32000 smbfs.ko
 8    2 0xffff00005c474000    21000 libiconv.ko
 9    2 0xffff00005c495000    21000 libmchain.ko
#
Comment 7 iz-rpi03 2020-02-02 19:45:29 UTC
Just an update:
Tried r357408 because of the bug fix made in r357392.
System still get unresponsive when kldloading mac_ntpd.
"cpu_reset failed" still happens when trying to reboot.
Regardless how.
Comment 8 iz-rpi03 2020-02-05 22:14:39 UTC
Created attachment 211398 [details]
Some debugger outputs after waiting more than 2 hours for somenthing (may be a panic) to happen.

Inspired by the debugger commands in a syscaller crash report.
It is made with a r356776 kernel.
Trying to create a dump finally freezes the debugger as well.
Comment 9 Mark Millard 2020-02-06 04:50:41 UTC
(In reply to iz-rpi03 from comment #8)

FYI: Your verbose boot console output (or dmesg -a
outuput) shows multiple:

psci0: PSCI version number mismatched with DT
device_attach: psci0 attach returned 6
psci0: <ARM Power State Co-ordination Interface Driver> on ofwbus0

but none of:

psci0: PSCI version 0.2 compatible

So starting the extra cores and doing reset/reboot
and such are not likely to work: The "Power State
Co-ordination Interface" is apparently part of
how such activities are done. From what I can tell
not all that much should be expected to be well
behaved after it fails to produce the "PSCI version
0.2 compatible" notice (and not produce the other
type of notices).

For example, the extra cores (APs) end up with the
notice:

Release APs...APs not started

That seems to be an expected result given the earlier
notices.

At least on a RPi4B, the only earlier verbose boot
output difference vs. an older kernel that worked
was a difference for one range of "Excluded memory",
10 fewer pages excluded for the working kernel
( -r356767 ) for that specific range. (I've no
information on if this is significant to the
problems or not.)

(I ignores the lines that list the rdddddd version and
build time and such: expected to be different.)
Comment 10 Mark Millard 2020-02-07 02:59:59 UTC
(In reply to Mark Millard from comment #9)

For the artifact.ci's kernel for -r356776 the PSCI problem
still ooccurs but the "Excluded memory" was the smaller
size ( like for the working -r356767 ).

So, somehow, -r356776 makes the difference that leads to
not identifying the PSCI version on an RPi4B (and likely
on any RPi3). That in turn means much is messed up from
lack of PSCI activity.
Comment 11 Mark Millard 2020-02-14 05:06:58 UTC
A couple of work arounds are now known.

One updates the applicable sys_util/u-boot-rpi* .
The example was for RPI4B's in :

https://lists.freebsd.org/pipermail/freebsd-arm/2020-February/021258.html

The other modified FreeBSD, as shown in the
svnlite diff in:

https://lists.freebsd.org/pipermail/freebsd-arm/2020-February/021240.html

Both updates cause the FreeBSD kernel to be told
to leave alone the first 2 pages of the address
space. Without any changes, u-boot-rpi* is causing
the kernel to be told to avoid only the first page.
The FreeBSD specific armstub8*.bin is bigger than
one page but does not fill 2 pages (yet?).

The long term solution for an official update
may be more involved, such as having the number of
pages auto-adjust for the number of bytes of content
in the armstub8*.bin involved. Nor is it obvious to
me which of multiple ways to get the "avoid these
pages" information to the kernel would be used.
Comment 12 Mark Millard 2020-02-14 05:14:07 UTC
(In reply to Mark Millard from comment #11)

FYI: It is not Jeff's defect to solve. His changes
just exposed that the kernel is not being told the
correct information about what pages to avoid allocating
or touching.

Who would deal with this is unclear to me. How it is
dealt with may or may not involve kernel or other
FreeBSD changes.