Bug 271983 - mi_switch hang in KVM on Sapphire Rapids running NET_EPOCH_DRAIN_CALLBACKS
Summary: mi_switch hang in KVM on Sapphire Rapids running NET_EPOCH_DRAIN_CALLBACKS
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: CURRENT
Hardware: Any Any
Importance: --- Affects Only Me
Assignee: freebsd-bugs (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-06-13 16:45 UTC by R. Christian McDonald
Modified: 2023-06-19 01:27 UTC
CC List: 3 users

See Also:


Attachments
procstat -kka (82.56 KB, text/plain)
2023-06-13 19:35 UTC, R. Christian McDonald

Description R. Christian McDonald 2023-06-13 16:45:54 UTC
I am seeing a strange issue running CURRENT on KVM on Sapphire Rapids.

If I create an interface like `ifconfig wg create` and then destroy the interface via `ifconfig wg0 destroy`, the process hangs deep within the kernel.


load: 1.55  cmd: ifconfig 2277 [runnable] 31.14r 0.00u 0.00s 0% 3276k
mi_switch+0x175 sched_bind+0xbc epoch_drain_callbacks+0x179 wg_clone_destroy+0x25c if_clone_destroyif_flags+0x69 if_clone_destroy+0xff ifioctl+0x8d3 kern_ioctl+0x1fe sys_ioctl+0x154 amd64_syscall+0x140 fast_syscall_common+0xf8
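
For reference, a minimal repro sketch based on the commands above; the exact script is illustrative, not taken verbatim from the report, and assumes the wg cloner is available as described:

#!/bin/sh
# Illustrative repro: create a WireGuard interface, then destroy it.
# On the affected KVM guest the destroy step never returns.
wgif=$(ifconfig wg create)        # prints the new interface name, e.g. wg0
echo "created ${wgif}, destroying..."
ifconfig "${wgif}" destroy        # hangs in epoch_drain_callbacks() per the
                                  # backtrace above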
Comment 1 Mark Johnston freebsd_committer freebsd_triage 2023-06-13 16:54:20 UTC
sched_bind() moves the calling thread to a different CPU, so here it's switched off the old CPU and waiting for a chance to run on the new one.  If that's not happening quickly, it's probably because some other, higher-priority thread is monopolizing the target CPU.  top -H ought to be able to confirm whether that's the case.
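
For reference, a hedged example of that check (generic FreeBSD usage, not specific to this report):

# Show individual threads (-H) as well as system/kernel threads (-S);
# a thread monopolizing a CPU will sit at ~100% WCPU in this view.
top -SH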
Comment 2 R. Christian McDonald 2023-06-13 19:25:19 UTC
@markj

Thanks for the clarification there.

This only happens while running in a KVM guest (Ubuntu 22.04, 5.19 kernel); it does not happen when running directly on the same hardware.

Additionally, this happens on an otherwise idle box.
Comment 3 Mark Johnston freebsd_committer freebsd_triage 2023-06-13 19:27:46 UTC
(In reply to R. Christian McDonald from comment #2)
That is pretty weird.  Could you share the output of "procstat -kka" while the hang is occurring?

Do you have INVARIANTS enabled?
Comment 4 R. Christian McDonald 2023-06-13 19:35:33 UTC
Created attachment 242766 [details]
procstat -kka
Comment 5 R. Christian McDonald 2023-06-13 19:37:10 UTC
(In reply to Mark Johnston from comment #3)

Yep, INVARIANTS is enabled.

# sysctl kern.conftxt | grep INVAR
options	INVARIANT_SUPPORT
options	INVARIANTS

# uname -a
FreeBSD SRVM-DUT-4 14.0-CURRENT FreeBSD 14.0-CURRENT #0 main-n263002-743516d51fa7: Thu May 18 08:06:33 UTC 2023     root@releng1.nyi.freebsd.org:/usr/obj/usr/src/amd64.amd64/sys/GENERIC amd64
Comment 6 Mark Johnston freebsd_committer freebsd_triage 2023-06-13 19:39:12 UTC
Comment on attachment 242766 [details]
procstat -kka

Oh, you also need to pass -S to top. I see one thread that looks to be spinning:

7 100267 rand_harvestq       -                   vtrnd_read+0xa4 random_kthread+0x174 fork_exit+0x80 fork_trampoline+0xe
Comment 7 R. Christian McDonald 2023-06-13 19:42:57 UTC
(In reply to Mark Johnston from comment #6)

last pid:  2028;  load averages:  2.00,  2.01,  1.83                                                                                                 up 0+00:39:17  14:42:30
515 threads:   21 running, 450 sleeping, 1 stopped, 43 waiting
CPU:  0.0% user,  0.0% nice,  5.6% system,  0.0% interrupt, 94.3% idle
Mem: 17M Active, 16M Inact, 271M Wired, 56K Buf, 7570M Free
ARC: 35M Total, 6297K MFU, 27M MRU, 359K Header, 1734K Other
     15M Compressed, 43M Uncompressed, 2.84:1 Ratio
Swap: 2048M Total, 2048M Free

  PID USERNAME    PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
   11 root        187 ki31     0B   288K RUN      2  39:16 100.00% idle{idle: cpu2}
   11 root        187 ki31     0B   288K CPU13   13  39:16 100.00% idle{idle: cpu13}
   11 root        187 ki31     0B   288K CPU8     8  39:15 100.00% idle{idle: cpu8}
   11 root        187 ki31     0B   288K CPU7     7  39:15 100.00% idle{idle: cpu7}
   11 root        187 ki31     0B   288K CPU15   15  39:15 100.00% idle{idle: cpu15}
   11 root        187 ki31     0B   288K CPU10   10  39:15 100.00% idle{idle: cpu10}
   11 root        187 ki31     0B   288K CPU14   14  39:13 100.00% idle{idle: cpu14}
   11 root        187 ki31     0B   288K CPU9     9  39:12 100.00% idle{idle: cpu9}
    7 root        -16    -     0B    16K CPU4     4  39:10 100.00% rand_harvestq
   11 root        187 ki31     0B   288K CPU17   17  39:09 100.00% idle{idle: cpu17}
   11 root        187 ki31     0B   288K CPU5     5  39:17  98.97% idle{idle: cpu5}
   11 root        187 ki31     0B   288K CPU6     6  39:16  98.97% idle{idle: cpu6}
   11 root        187 ki31     0B   288K CPU3     3  39:16  98.97% idle{idle: cpu3}
   11 root        187 ki31     0B   288K CPU11   11  39:16  98.97% idle{idle: cpu11}
   11 root        187 ki31     0B   288K CPU16   16  39:15  98.97% idle{idle: cpu16}
   11 root        187 ki31     0B   288K CPU12   12  39:14  98.97% idle{idle: cpu12}
   11 root        187 ki31     0B   288K CPU1     1  39:14  98.97% idle{idle: cpu1}
   11 root        187 ki31     0B   288K CPU0     0  39:15  98.00% idle{idle: cpu0}
   12 root        -60    -     0B   400K WAIT    17   0:05   0.98% intr{swi0: uart}
    0 root        -16    -     0B  3664K swapin   0   6:02   0.00% kernel{swapper}
   11 root        187 ki31     0B   288K RUN      4   0:07   0.00% idle{idle: cpu4}
   15 root        -60    -     0B    80K -       10   0:02   0.00% usb{usbus0}
    6 root         -8    -     0B  2464K tx->tx  16   0:02   0.00% zfskern{txg_thread_enter}
   12 root        -64    -     0B   400K WAIT     9   0:01   0.00% intr{irq31: virtio_pci2}
    2 root        -60    -     0B   288K WAIT     0   0:00   0.00% clock{clock (0)}
 1974 root         21    0    17M  4896K STOP    14   0:00   0.00% top
    8 root        -16    -     0B    64K psleep  13   0:00   0.00% pagedaemon{dom0}
   12 root        -64    -     0B   400K WAIT     7   0:00   0.00% intr{irq29: virtio_pci1}
  813 ntpd         20    0    21M  5940K select   7   0:00   0.00% ntpd{ntpd}
 1597 root         20    0    13M  3476K wait    11   0:00   0.00% sh
    0 root        -12    -     0B  3664K -       15   0:00   0.00% kernel{z_wr_iss_13}
    0 root        -12    -     0B  3664K -       14   0:00   0.00% kernel{z_wr_iss_3}
    0 root        -12    -     0B  3664K -        8   0:00   0.00% kernel{z_wr_iss_9}
    0 root        -12    -     0B  3664K -       12   0:00   0.00% kernel{z_wr_iss_5}
    0 root        -12    -     0B  3664K -       16   0:00   0.00% kernel{z_wr_iss_2}
 1794 root         21    0    21M  9088K select   9   0:00   0.00% sshd
    0 root        -12    -     0B  3664K -       11   0:00   0.00% kernel{z_wr_iss_10}
  749 root         20    0    13M  3032K select  12   0:00   0.00% syslogd
    0 root        -12    -     0B  3664K -        0   0:00   0.00% kernel{z_wr_iss_11}
    0 root        -12    -     0B  3664K -       10   0:00   0.00% kernel{z_wr_iss_7}
    0 root        -12    -     0B  3664K -       12   0:00   0.00% kernel{z_wr_iss_6}
    0 root        -16    -     0B  3664K -       11   0:00   0.00% kernel{z_wr_int_2_1}
    0 root        -12    -     0B  3664K -       16   0:00   0.00% kernel{z_wr_iss_8}
    0 root        -12    -     0B  3664K -       17   0:00   0.00% kernel{z_wr_iss_12}
Comment 8 Mark Johnston freebsd_committer freebsd_triage 2023-06-13 19:45:10 UTC
So this is a problem where the virtio random driver is constantly polling the host for entropy but not getting any, for some reason. I'm not sure why that would be: either a bug in the virtio driver or some kind of restrictive configuration by the hypervisor.
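
For reference, a quick way to confirm which PCI function backs the guest RNG and whether the vtrnd driver attached to it (standard FreeBSD tools; the driver name is inferred from the vtrnd_read frame above, not stated in this report):

# Locate the virtio RNG function as seen by the guest.
pciconf -lv | grep -B 4 -i rng
# Check whether a vtrnd driver instance is attached in the device tree.
devinfo | grep vtrnd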
Comment 9 R. Christian McDonald 2023-06-13 19:50:46 UTC
(In reply to Mark Johnston from comment #8)

Cool.

Blacklisting virtio_random works around the hang.
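
For reference, one way to do that persistently; this assumes the driver instance is named vtrnd0 (per the backtrace above) and that it honors the standard per-device "disabled" hint, neither of which is confirmed in this report:

# Keep the VirtIO entropy driver from attaching at the next boot.
# Assumption: vtrnd honors the generic "disabled" device hint.
echo 'hint.vtrnd.0.disabled="1"' >> /boot/loader.conf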
Comment 10 R. Christian McDonald 2023-06-14 15:15:50 UTC
I discovered today that it is possible to reproduce this bug on non-Sapphire Rapids systems. The difference is the use of the legacy vs. the modern virtio-rng device.

The legacy virtio-rng doesn't exhibit this problem; it's the modern virtio-rng that does.

I tested on a non-Sapphire Rapids PVE host this morning, forcing virtio-rng to explicitly use the modern device (virtio-rng-pci-non-transitional) instead of the legacy (transitional) device that PVE sets up by default, and the problem manifests there as well.

Conversely, if I explicitly use the legacy device on the Sapphire Rapids box, the problem goes away there too.
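
For anyone reproducing this with plain QEMU rather than PVE, something along these lines selects the two device flavors; the device names are the standard QEMU 4.0+ virtio-rng PCI variants and the command lines are illustrative, not taken from this report:

# Modern-only virtio-rng (reproduces the hang per this report):
qemu-system-x86_64 ... \
    -object rng-random,filename=/dev/urandom,id=rng0 \
    -device virtio-rng-pci-non-transitional,rng=rng0

# Legacy/transitional virtio-rng (works):
qemu-system-x86_64 ... \
    -object rng-random,filename=/dev/urandom,id=rng0 \
    -device virtio-rng-pci-transitional,rng=rng0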