Bug 256934 - Panic in ena driver on a t3.large ec2 instance after upgrading to 13.0
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 13.0-RELEASE
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: freebsd-virtualization (Nobody)
URL:
Keywords: panic, regression
Depends on:
Blocks:
 
Reported: 2021-07-02 07:38 UTC by Alex Dupre
Modified: 2021-08-10 14:40 UTC

Description Alex Dupre freebsd_committer 2021-07-02 07:38:50 UTC
The instance has been running fine for years with FreeBSD 12.x.

15 days after upgrading it to FreeBSD 13.0 with freebsd-update I got the following kernel panic:

```
spin lock 0xffffffff81ce5e80 (smp rendezvous) held by 0xfffffe008e637000 (tid 100108) too long
panic: spin lock held too long
cpuid = 1
time = 1625166932
KDB: stack backtrace:
#0 0xffffffff80c57515 at kdb_backtrace+0x65
#1 0xffffffff80c09ef1 at vpanic+0x181
#2 0xffffffff80c09d63 at panic+0x43
#3 0xffffffff80be8694 at _mtx_lock_indefinite_check+0x64
#4 0xffffffff80be8245 at _mtx_lock_spin_cookie+0xd5
#5 0xffffffff80c68312 at smp_rendezvous_cpus+0x202
#6 0xffffffff80c6837c at smp_rendezvous+0x2c
#7 0xffffffff8211e734 at ena_keep_alive_wd+0x24
#8 0xffffffff821162e4 at ena_com_aenq_intr_handler+0xc4
#9 0xffffffff80bcb02d at ithread_loop+0x24d
#10 0xffffffff80bc7e2e at fork_exit+0x7e
#11 0xffffffff810629fe at fork_trampoline+0xe
Uptime: 15d7h58m56s
Rebooting...
```
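The `time` and `Uptime` fields of the panic output can be turned into a panic time and an approximate boot time, which helps when correlating the crash with instance-side events. The following is a minimal sketch, assuming `time` is in Unix epoch seconds; the helper name `uptime_to_seconds` is hypothetical and not part of any FreeBSD tooling.

```python
import re
from datetime import datetime, timezone


def uptime_to_seconds(text: str) -> int:
    """Convert an uptime string such as '15d7h58m56s' into seconds."""
    match = re.fullmatch(r"(?:(\d+)d)?(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?", text)
    if not text or match is None:
        raise ValueError(f"unrecognized uptime format: {text!r}")
    days, hours, minutes, seconds = (int(g or 0) for g in match.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


# Values copied from the panic output above.
panic_epoch = 1625166932          # "time = ..." (assumed to be Unix epoch seconds)
uptime = uptime_to_seconds("15d7h58m56s")

panic_time = datetime.fromtimestamp(panic_epoch, tz=timezone.utc)
boot_time = datetime.fromtimestamp(panic_epoch - uptime, tz=timezone.utc)

print("panic:", panic_time.isoformat())
print("boot: ", boot_time.isoformat())
```

Under that assumption the panic falls on 2021-07-01 at 19:15:32 UTC, and the preceding boot on 2021-06-16 at 11:16:36 UTC.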
Comment 1 Mark Linimon freebsd_committer freebsd_triage 2021-07-02 07:48:28 UTC
^Triage: reflect the fact that this involves AWS.
Comment 2 Artur Rojek 2021-07-08 10:53:19 UTC
(In reply to Alex Dupre from comment #0)
Hi Alex,

I looked into this issue, and there are no significant changes between 12.2 and 13.0 with regard to your backtrace. I don't see any connection to the ena driver either, as `ena_keep_alive_wd+0x24` is a simple call to `counter_u64_zero`. The EC2 T3 instance type you are using is a burstable performance instance: depending on your configuration, you can spend CPU resources above the baseline for periods of time, but eventually the instance is forced back down to baseline CPU utilization. Perhaps you hit this limit just when your machine needed more CPU "oomph"?
Have you been able to reproduce this issue on 13.0 since, or was this a one-time occurrence?
One thing I can suggest at the moment is to try adjusting `debug.lock.delay_max`, which has been configurable since 13.0.
Comment 3 Alex Dupre freebsd_committer 2021-07-09 16:26:21 UTC
(In reply to Artur Rojek from comment #2)

The machine had plenty of CPU credits when it happened, so that possibility can be ruled out.

So far this has been the only occurrence, but it happened after 15 days of uptime, and only another 7 days have passed since then. I have other machines that were upgraded at the same time and haven't hit the issue. But before upgrading to 13.0 they never suffered any kernel panic, FWIW.

What does the `debug.lock.delay_max` param do and what value should I try? Currently I see it at 1024.
Comment 4 Artur Rojek 2021-08-10 14:40:29 UTC
(In reply to Alex Dupre from comment #3)
> What does the `debug.lock.delay_max` param do and what value should I try?
This tunable defines how long to spin before the "spin lock held too long" panic is thrown, with 1024 being the default for your instance type (`roundup2(ncpus) * 256`). I suggested it thinking you were being starved of CPU resources. However, your machine clearly had plenty of CPU credits, so adjusting this param would not have helped.
It could be something like temporary fluctuations on the instance side, in which case it would be very difficult to replicate the issue.
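
For reference, the default quoted above (`roundup2(ncpus) * 256`) can be checked against the value of 1024 reported in comment 3. The following is a rough sketch rather than the kernel's actual code: the function names are hypothetical, and it assumes the rounding step returns the smallest power of two strictly greater than the CPU count, which is the reading under which a 2-vCPU t3.large yields 1024.

```python
def roundup_pow2_strict(val: int) -> int:
    """Smallest power of two strictly greater than val (assumed rounding rule)."""
    res = 1
    while res <= val:
        res <<= 1
    return res


def default_delay_max(ncpus: int) -> int:
    """Default for debug.lock.delay_max under the assumed formula."""
    return roundup_pow2_strict(ncpus) * 256


# A t3.large exposes 2 vCPUs.
print(default_delay_max(2))
```

If the rounding were instead to the nearest power of two greater than or equal to the CPU count, 2 vCPUs would give 512, which does not match the value observed on the instance.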