base/head is currently unusable on arm64 with several panics a day. We had to downgrade thunderx1 to base/head@325426 to be able to produce packages.
One of the last panics:
deadlkres() at fork_exit+0x7c
         pc = 0xffff0000002be944  lr = 0xffff0000002e1800
         sp = 0xffff00013230c920  fp = 0xffff00013230c950
kdb_trap() at do_el1h_sync+0xf0
         pc = 0xffff00000035ecb0  lr = 0xffff000000614af4
         sp = 0xffff00013230c5a0  fp = 0xffff00013230c5d0
do_el1h_sync() at handle_el1h_sync+0x74
         pc = 0xffff000000614af4  lr = 0xffff0000005fd074
         sp = 0xffff00013230c5e0  fp = 0xffff00013230c6f0
handle_el1h_sync() at kdb_enter+0x34
         pc = 0xffff0000005fd074  lr = 0xffff00000035e38c
         sp = 0xffff00013230c700  fp = 0xffff00013230c790
Are you able to get any useful debug information from ddb when it panics? Things like ps, show alllocks, and show threads would be a start as I haven't been able to reproduce this on the ThunderX servers in the netperf cluster.
The console is hung when the problem occurs.
Can you try with a kernel with base r330018? Olivier and I tracked down an issue with interactions between signals and interrupts that is fixed there.
% uname -a
FreeBSD thunderx1.nyi.freebsd.org 12.0-CURRENT FreeBSD 12.0-CURRENT #2 r330018: Wed Feb 28 20:00:30 UTC 2018 root@thunderx1.nyi.freebsd.org:/usr/obj/usr/src/arm64.aarch64/sys/GENERIC arm64

Let's see how it works.
Looks like a deadlock?

Mar 3 00:41:14 thunderx1 smartd[11029]: In the system's table of devices NO devices found to scan
timeout stopping cpus
panic: deadlkres: possible deadlock detected for 0xfffffd1027a6aa80, blocked for 900366 ticks
cpuid = 21
time = 1520038656
KDB: stack backtrace:
db_trace_self() at db_trace_self_wrapper+0x28
         pc = 0xffff00000062bb60  lr = 0xffff0000000b741c
         sp = 0xffff0000e39e9580  fp = 0xffff0000e39e9790
db_trace_self_wrapper() at vpanic+0x184
         pc = 0xffff0000000b741c  lr = 0xffff00000034d188
         sp = 0xffff0000e39e97a0  fp = 0xffff0000e39e9820
vpanic() at panic+0x44
         pc = 0xffff00000034d188  lr = 0xffff00000034d238
         sp = 0xffff0000e39e9830  fp = 0xffff0000e39e98b0
panic() at deadlkres+0x2d8
         pc = 0xffff00000034d238  lr = 0xffff0000002ebd48
         sp = 0xffff0000e39e98c0  fp = 0xffff0000e39e9910
deadlkres() at fork_exit+0x7c
         pc = 0xffff0000002ebd48  lr = 0xffff0000003105a4
         sp = 0xffff0000e39e9920  fp = 0xffff0000e39e9950
fork_exit() at fork_trampoline+0x10
         pc = 0xffff0000003105a4  lr = 0xffff0000006460bc
         sp = 0xffff0000e39e9960  fp = 0x0000000000000000
KDB: enter: panic
[ thread pid 0 tid 100236 ]
Stopped at 0
db> timeout stopping cpus
panic: acquiring blockable sleep lock with spinlock or critical section held (sleep mutex) pmap @ /usr/src/sys/arm64/arm64/pmap.c:4778
cpuid = 27
time = 1520038656
Uptime: 2d4h52m33s
Created attachment 191284
Only enter panic once

There's a bug where we reboot if we enter panic twice. I've attached a patch to stop that by making the second core that calls panic enter a busy wait. When in ddb, can you run:
show alllocks
show allchains
ps
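For readers following along, the idea described above (only the first core to call panic proceeds, and any later caller busy-waits) can be sketched in user-space C11. This is not the contents of attachment 191284 and not the kernel's actual code; panic_owner, panic_try_claim, panic_enter and NO_CPU are hypothetical names chosen for illustration, and the real kernel would use its own atomic primitives and CPU identifiers.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define NO_CPU (-1)	/* hypothetical sentinel: nobody has panicked yet */

/* Hypothetical: id of the CPU that owns the panic, or NO_CPU. */
atomic_int panic_owner = NO_CPU;

/*
 * Try to become the panicking CPU.  Returns true for the first caller
 * (and for that same CPU if it re-enters), false for any other CPU.
 */
bool
panic_try_claim(int cpu)
{
	int expected = NO_CPU;

	/* Only one CPU can swap NO_CPU for its own id. */
	if (atomic_compare_exchange_strong(&panic_owner, &expected, cpu))
		return (true);
	/* On failure, expected holds the current owner. */
	return (expected == cpu);
}

/* Assumed entry point: a second CPU spins here instead of rebooting. */
void
panic_enter(int cpu)
{
	if (!panic_try_claim(cpu)) {
		for (;;)
			;	/* busy wait; the owning CPU handles the panic */
	}
	/* The first caller would continue into the debugger from here. */
}
```

With the log above, cpuid 21 would claim the panic and stay in ddb, while cpuid 27 would spin rather than trigger a second panic and the reboot.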
thunderx1 has been up for 8 days with r330633 and the attached patch.
Another panic with head@r331373: https://people.freebsd.org/~gjb/thunderx1/20180423_panic.txt
I'm not able to tell anything from the log other than that the deadlock detection fired. If it happens again, can you run the following?
show alllocks
show allchains
ps
Created attachment 192870
ddb log output

show alllocks
show allchains
ps
It appears the swi4 thread is in the hung code in iflib_timer. I'm not sure where in the code if_config_tqg_0 is blocked. When this happens again, can you get a backtrace of the two threads?
Problems seem fixed with a new network card.