Bug 225819 - base/head is currently unusable on arm64
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: arm
Version: CURRENT
Hardware: arm64 Any
Importance: --- Affects Some People
Assignee: freebsd-arm (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-02-11 11:35 UTC by Antoine Brodin
Modified: 2019-02-22 07:05 UTC

Attachments
Only enter panic once (833 bytes, patch)
2018-03-07 18:34 UTC, Andrew Turner
ddb log output (20.59 KB, application/x-gzip)
2018-04-28 11:46 UTC, Ryan Steinmetz

Description Antoine Brodin freebsd_committer freebsd_triage 2018-02-11 11:35:19 UTC
base/head is currently unusable on arm64 with several panics a day.

We had to downgrade thunderx1 to base/head@325426 to be able to produce packages.
Comment 1 Antoine Brodin freebsd_committer freebsd_triage 2018-02-11 11:40:36 UTC
One of the last panics:

deadlkres() at fork_exit+0x7c
         pc = 0xffff0000002be944  lr = 0xffff0000002e1800
         sp = 0xffff00013230c920  fp = 0xffff00013230c950

kdb_trap() at do_el1h_sync+0xf0
         pc = 0xffff00000035ecb0  lr = 0xffff000000614af4
         sp = 0xffff00013230c5a0  fp = 0xffff00013230c5d0

do_el1h_sync() at handle_el1h_sync+0x74
         pc = 0xffff000000614af4  lr = 0xffff0000005fd074
         sp = 0xffff00013230c5e0  fp = 0xffff00013230c6f0

handle_el1h_sync() at kdb_enter+0x34
         pc = 0xffff0000005fd074  lr = 0xffff00000035e38c
         sp = 0xffff00013230c700  fp = 0xffff00013230c790
Comment 2 Andrew Turner freebsd_committer freebsd_triage 2018-02-19 12:17:26 UTC
Are you able to get any useful debug information from ddb when it panics? Output from ps, show alllocks, and show threads would be a start, as I haven't been able to reproduce this on the ThunderX servers in the netperf cluster.
Comment 3 Antoine Brodin freebsd_committer freebsd_triage 2018-02-19 19:28:09 UTC
The console is hung when the problem occurs.
Comment 4 Andrew Turner freebsd_committer freebsd_triage 2018-02-26 13:49:07 UTC
Can you try with a kernel with base r330018? Olivier and I tracked down an issue with interactions between signals and interrupts that is fixed there.
Comment 5 Antoine Brodin freebsd_committer freebsd_triage 2018-02-28 20:08:44 UTC
% uname -a
FreeBSD thunderx1.nyi.freebsd.org 12.0-CURRENT FreeBSD 12.0-CURRENT #2 r330018: Wed Feb 28 20:00:30 UTC 2018     root@thunderx1.nyi.freebsd.org:/usr/obj/usr/src/arm64.aarch64/sys/GENERIC  arm64

Let's see how it works.
Comment 6 Sean Bruno freebsd_committer freebsd_triage 2018-03-03 16:05:47 UTC
Looks like a deadlock?

Mar  3 00:41:14 thunderx1 smartd[11029]: In the system's table of devices NO devices found to scan
timeout stopping cpus
panic: deadlkres: possible deadlock detected for 0xfffffd1027a6aa80, blocked for 900366 ticks

cpuid = 21
time = 1520038656
KDB: stack backtrace:
db_trace_self() at db_trace_self_wrapper+0x28
         pc = 0xffff00000062bb60  lr = 0xffff0000000b741c
         sp = 0xffff0000e39e9580  fp = 0xffff0000e39e9790

db_trace_self_wrapper() at vpanic+0x184
         pc = 0xffff0000000b741c  lr = 0xffff00000034d188
         sp = 0xffff0000e39e97a0  fp = 0xffff0000e39e9820

vpanic() at panic+0x44
         pc = 0xffff00000034d188  lr = 0xffff00000034d238
         sp = 0xffff0000e39e9830  fp = 0xffff0000e39e98b0

panic() at deadlkres+0x2d8
         pc = 0xffff00000034d238  lr = 0xffff0000002ebd48
         sp = 0xffff0000e39e98c0  fp = 0xffff0000e39e9910

deadlkres() at fork_exit+0x7c
         pc = 0xffff0000002ebd48  lr = 0xffff0000003105a4
         sp = 0xffff0000e39e9920  fp = 0xffff0000e39e9950

fork_exit() at fork_trampoline+0x10
         pc = 0xffff0000003105a4  lr = 0xffff0000006460bc
         sp = 0xffff0000e39e9960  fp = 0x0000000000000000

KDB: enter: panic
[ thread pid 0 tid 100236 ]
Stopped at      0
db> timeout stopping cpus
panic: acquiring blockable sleep lock with spinlock or critical section held (sleep mutex) pmap @ /usr/src/sys/arm64/arm64/pmap.c:4778
cpuid = 27
time = 1520038656
Uptime: 2d4h52m33s
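
A rough check of the "blocked for 900366 ticks" figure in the panic above, as a minimal C sketch. It assumes hz is 1000 and that the deadlock resolver's blocked-thread threshold is 900 seconds; neither value is stated in this report, and the helper names are hypothetical, not kernel functions.

```c
#include <stdbool.h>

/* Assumed values, not taken from this report. */
#define ASSUMED_HZ              1000	/* clock ticks per second */
#define ASSUMED_BLKTIME_SECONDS 900	/* blocked-thread threshold */

/* Convert a tick count to whole seconds for a given tick rate. */
static long
ticks_to_seconds(long ticks, int hz)
{
	return (ticks / hz);
}

/* True if a thread blocked for blocked_ticks has exceeded the threshold. */
static bool
blocked_too_long(long blocked_ticks, int hz, int threshold_seconds)
{
	return (blocked_ticks > (long)threshold_seconds * hz);
}
```

Under those assumptions, ticks_to_seconds(900366, ASSUMED_HZ) is 900 seconds, so the thread had been blocked on a lock for about 15 minutes, just past the threshold.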
Comment 7 Andrew Turner freebsd_committer freebsd_triage 2018-03-07 18:34:59 UTC
Created attachment 191284
Only enter panic once

There's a bug where we reboot if we enter panic twice. I've attached a patch that stops this by having any second core that calls panic enter a busy wait instead.

When in ddb can you run:

show alllocks
show allchains
ps
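
A userland sketch of the "only enter panic once" idea described in this comment, assuming C11 atomics. This is not the attached patch: panic_cpu, NO_CPU, and panic_claim() are illustrative names, and a kernel would spin in the losing path rather than return a value.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical sentinel meaning "no CPU has panicked yet". */
#define NO_CPU (-1)

/* ID of the CPU that owns the panic, or NO_CPU. */
static atomic_int panic_cpu = NO_CPU;

/*
 * Try to claim the panic for cpuid.  Returns true for the first CPU to
 * arrive (and for that same CPU re-entering); returns false for any
 * other CPU, which in a kernel would busy-wait instead of panicking.
 */
static bool
panic_claim(int cpuid)
{
	int expected = NO_CPU;

	/* Atomically set panic_cpu to cpuid only if nobody owns it yet. */
	if (atomic_compare_exchange_strong(&panic_cpu, &expected, cpuid))
		return (true);

	/* On failure, expected holds the current owner. */
	return (expected == cpuid);
}
```

In the log from comment 6, cpuid 21 panicked first and a second panic then appeared on cpuid 27. With a guard like this, panic_claim(27) would return false and that core would wait instead of starting a second panic.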
Comment 8 Antoine Brodin freebsd_committer freebsd_triage 2018-03-16 13:58:15 UTC
thunderx1 has been up for 8 days with r330633 and the attached patch.
Comment 9 Antoine Brodin freebsd_committer freebsd_triage 2018-04-23 12:41:47 UTC
Another panic with head@r331373:

https://people.freebsd.org/~gjb/thunderx1/20180423_panic.txt
Comment 10 Andrew Turner freebsd_committer freebsd_triage 2018-04-24 11:02:59 UTC
I'm not able to tell anything from the log other than the deadlock detection fired. If it happens again can you run the following?

show alllocks
show allchains
ps
Comment 11 Ryan Steinmetz freebsd_committer freebsd_triage 2018-04-28 11:46:54 UTC
Created attachment 192870
ddb log output

show alllocks
show allchains
ps
Comment 12 Andrew Turner freebsd_committer freebsd_triage 2018-04-29 11:16:02 UTC
It appears the swi4 thread is in the hung code in iflib_timer. I'm not sure where in the code if_config_tqg_0 is blocked.

When this happens again, can you get a backtrace of the two threads?
Comment 13 Antoine Brodin freebsd_committer freebsd_triage 2019-02-22 07:05:11 UTC
Problems seem fixed with a new network card.