Bug 271983 - mi_switch hang in KVM on Sapphire Rapids running NET_EPOCH_DRAIN_CALLBACKS
Summary: mi_switch hang in KVM on Sapphire Rapids running NET_EPOCH_DRAIN_CALLBACKS
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: CURRENT
Hardware: Any Any
Importance: --- Affects Only Me
Assignee: freebsd-bugs (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-06-13 16:45 UTC by R. Christian McDonald
Modified: 2023-06-19 01:27 UTC
CC List: 3 users

See Also:


Attachments
procstat -kka (82.56 KB, text/plain)
2023-06-13 19:35 UTC, R. Christian McDonald

Description R. Christian McDonald 2023-06-13 16:45:54 UTC
I am seeing a strange issue running CURRENT on KVM on Sapphire Rapids.

If I create an interface like `ifconfig wg create` and then destroy the interface via `ifconfig wg0 destroy`, the process hangs deep within the kernel.


load: 1.55  cmd: ifconfig 2277 [runnable] 31.14r 0.00u 0.00s 0% 3276k
mi_switch+0x175 sched_bind+0xbc epoch_drain_callbacks+0x179 wg_clone_destroy+0x25c if_clone_destroyif_flags+0x69 if_clone_destroy+0xff ifioctl+0x8d3 kern_ioctl+0x1fe sys_ioctl+0x154 amd64_syscall+0x140 fast_syscall_common+0xf8
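
For reference, a minimal repro sketch based on the commands above; the exact script is illustrative, not taken verbatim from the report, and assumes the wg cloner is available as described:

#!/bin/sh
# Illustrative repro: create a WireGuard interface, then destroy it.
# On the affected KVM guest the destroy step never returns.
wgif=$(ifconfig wg create)        # prints the new interface name, e.g. wg0
echo "created ${wgif}, destroying..."
ifconfig "${wgif}" destroy        # hangs in epoch_drain_callbacks() per the
                                  # backtrace above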
Comment 1 Mark Johnston freebsd_committer freebsd_triage 2023-06-13 16:54:20 UTC
sched_bind() moves the calling thread to a different CPU, so here it's switched off the old CPU and waiting for a chance to run on the new one.  If that's not happening quickly, it's probably because some other, higher-priority thread is monopolizing the target CPU.  top -H ought to be able to confirm whether that's the case.
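
For reference, a hedged example of that check (generic FreeBSD usage, not specific to this report):

# Show individual threads (-H) as well as system/kernel threads (-S);
# a thread monopolizing a CPU will sit at ~100% WCPU in this view.
top -SH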
Comment 2 R. Christian McDonald 2023-06-13 19:25:19 UTC
@markj

Thanks for the clarification there.

This only happens while running in a KVM guest (Ubuntu 22.04, 5.19 kernel); it does not happen when running directly on the same hardware.

Additionally, this happens on an otherwise idle box.
Comment 3 Mark Johnston freebsd_committer freebsd_triage 2023-06-13 19:27:46 UTC
(In reply to R. Christian McDonald from comment #2)
That is pretty weird.  Could you share the output of "procstat -kka" while the hang is occurring?

Do you have INVARIANTS enabled?
Comment 4 R. Christian McDonald 2023-06-13 19:35:33 UTC
Created attachment 242766 [details]
procstat -kka
Comment 5 R. Christian McDonald 2023-06-13 19:37:10 UTC
(In reply to Mark Johnston from comment #3)

Yep, INVARIANTS is enabled.

# sysctl kern.conftxt | grep INVAR
options	INVARIANT_SUPPORT
options	INVARIANTS

# uname -a
FreeBSD SRVM-DUT-4 14.0-CURRENT FreeBSD 14.0-CURRENT #0 main-n263002-743516d51fa7: Thu May 18 08:06:33 UTC 2023     root@releng1.nyi.freebsd.org:/usr/obj/usr/src/amd64.amd64/sys/GENERIC amd64
Comment 6 Mark Johnston freebsd_committer freebsd_triage 2023-06-13 19:39:12 UTC
Comment on attachment 242766 [details]
procstat -kka

Oh, you also need to pass -S to top. I see one thread that looks to be spinning:

7 100267 rand_harvestq       -                   vtrnd_read+0xa4 random_kthread+0x174 fork_exit+0x80 fork_trampoline+0xe
Comment 7 R. Christian McDonald 2023-06-13 19:42:57 UTC
(In reply to Mark Johnston from comment #6)

last pid:  2028;  load averages:  2.00,  2.01,  1.83                                                                                                 up 0+00:39:17  14:42:30
515 threads:   21 running, 450 sleeping, 1 stopped, 43 waiting
CPU:  0.0% user,  0.0% nice,  5.6% system,  0.0% interrupt, 94.3% idle
Mem: 17M Active, 16M Inact, 271M Wired, 56K Buf, 7570M Free
ARC: 35M Total, 6297K MFU, 27M MRU, 359K Header, 1734K Other
     15M Compressed, 43M Uncompressed, 2.84:1 Ratio
Swap: 2048M Total, 2048M Free

  PID USERNAME    PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
   11 root        187 ki31     0B   288K RUN      2  39:16 100.00% idle{idle: cpu2}
   11 root        187 ki31     0B   288K CPU13   13  39:16 100.00% idle{idle: cpu13}
   11 root        187 ki31     0B   288K CPU8     8  39:15 100.00% idle{idle: cpu8}
   11 root        187 ki31     0B   288K CPU7     7  39:15 100.00% idle{idle: cpu7}
   11 root        187 ki31     0B   288K CPU15   15  39:15 100.00% idle{idle: cpu15}
   11 root        187 ki31     0B   288K CPU10   10  39:15 100.00% idle{idle: cpu10}
   11 root        187 ki31     0B   288K CPU14   14  39:13 100.00% idle{idle: cpu14}
   11 root        187 ki31     0B   288K CPU9     9  39:12 100.00% idle{idle: cpu9}
    7 root        -16    -     0B    16K CPU4     4  39:10 100.00% rand_harvestq
   11 root        187 ki31     0B   288K CPU17   17  39:09 100.00% idle{idle: cpu17}
   11 root        187 ki31     0B   288K CPU5     5  39:17  98.97% idle{idle: cpu5}
   11 root        187 ki31     0B   288K CPU6     6  39:16  98.97% idle{idle: cpu6}
   11 root        187 ki31     0B   288K CPU3     3  39:16  98.97% idle{idle: cpu3}
   11 root        187 ki31     0B   288K CPU11   11  39:16  98.97% idle{idle: cpu11}
   11 root        187 ki31     0B   288K CPU16   16  39:15  98.97% idle{idle: cpu16}
   11 root        187 ki31     0B   288K CPU12   12  39:14  98.97% idle{idle: cpu12}
   11 root        187 ki31     0B   288K CPU1     1  39:14  98.97% idle{idle: cpu1}
   11 root        187 ki31     0B   288K CPU0     0  39:15  98.00% idle{idle: cpu0}
   12 root        -60    -     0B   400K WAIT    17   0:05   0.98% intr{swi0: uart}
    0 root        -16    -     0B  3664K swapin   0   6:02   0.00% kernel{swapper}
   11 root        187 ki31     0B   288K RUN      4   0:07   0.00% idle{idle: cpu4}
   15 root        -60    -     0B    80K -       10   0:02   0.00% usb{usbus0}
    6 root         -8    -     0B  2464K tx->tx  16   0:02   0.00% zfskern{txg_thread_enter}
   12 root        -64    -     0B   400K WAIT     9   0:01   0.00% intr{irq31: virtio_pci2}
    2 root        -60    -     0B   288K WAIT     0   0:00   0.00% clock{clock (0)}
 1974 root         21    0    17M  4896K STOP    14   0:00   0.00% top
    8 root        -16    -     0B    64K psleep  13   0:00   0.00% pagedaemon{dom0}
   12 root        -64    -     0B   400K WAIT     7   0:00   0.00% intr{irq29: virtio_pci1}
  813 ntpd         20    0    21M  5940K select   7   0:00   0.00% ntpd{ntpd}
 1597 root         20    0    13M  3476K wait    11   0:00   0.00% sh
    0 root        -12    -     0B  3664K -       15   0:00   0.00% kernel{z_wr_iss_13}
    0 root        -12    -     0B  3664K -       14   0:00   0.00% kernel{z_wr_iss_3}
    0 root        -12    -     0B  3664K -        8   0:00   0.00% kernel{z_wr_iss_9}
    0 root        -12    -     0B  3664K -       12   0:00   0.00% kernel{z_wr_iss_5}
    0 root        -12    -     0B  3664K -       16   0:00   0.00% kernel{z_wr_iss_2}
 1794 root         21    0    21M  9088K select   9   0:00   0.00% sshd
    0 root        -12    -     0B  3664K -       11   0:00   0.00% kernel{z_wr_iss_10}
  749 root         20    0    13M  3032K select  12   0:00   0.00% syslogd
    0 root        -12    -     0B  3664K -        0   0:00   0.00% kernel{z_wr_iss_11}
    0 root        -12    -     0B  3664K -       10   0:00   0.00% kernel{z_wr_iss_7}
    0 root        -12    -     0B  3664K -       12   0:00   0.00% kernel{z_wr_iss_6}
    0 root        -16    -     0B  3664K -       11   0:00   0.00% kernel{z_wr_int_2_1}
    0 root        -12    -     0B  3664K -       16   0:00   0.00% kernel{z_wr_iss_8}
    0 root        -12    -     0B  3664K -       17   0:00   0.00% kernel{z_wr_iss_12}
Comment 8 Mark Johnston freebsd_committer freebsd_triage 2023-06-13 19:45:10 UTC
So this is a problem where the virtio random driver is constantly polling the host for entropy but not getting any, for some reason. I'm not sure why that would be: either a bug in the virtio driver or some kind of restrictive configuration by the hypervisor.
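
For reference, a quick way to confirm which PCI function backs the guest RNG and whether the vtrnd driver attached to it (standard FreeBSD tools; the driver name is inferred from the vtrnd_read frame above, not stated in this report):

# Locate the virtio RNG function as seen by the guest.
pciconf -lv | grep -B 4 -i rng
# Check whether a vtrnd driver instance is attached in the device tree.
devinfo | grep vtrnd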
Comment 9 R. Christian McDonald 2023-06-13 19:50:46 UTC
(In reply to Mark Johnston from comment #8)

Cool.

Blacklisting virtio_random works around the hang.
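
For reference, one way to do that persistently; this assumes the driver instance is named vtrnd0 (per the backtrace above) and that it honors the standard per-device "disabled" hint, neither of which is confirmed in this report:

# Keep the VirtIO entropy driver from attaching at the next boot.
# Assumption: vtrnd honors the generic "disabled" device hint.
echo 'hint.vtrnd.0.disabled="1"' >> /boot/loader.conf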
Comment 10 R. Christian McDonald 2023-06-14 15:15:50 UTC
I discovered today that it is possible to reproduce this bug on non-Sapphire Rapids systems. The difference is the use of the legacy vs. the modern virtio-rng device.

The legacy virtio-rng doesn't exhibit this problem; it's the modern virtio-rng that does.

I tested on a non-Sapphire Rapids PVE host this morning, forcing virtio-rng to explicitly use the modern device (virtio-rng-pci-non-transitional) instead of the legacy (transitional) device that PVE sets up by default, and the problem manifests there as well.

Conversely, if I explicitly use the legacy device on the Sapphire Rapids box, the problem goes away there too.
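
For anyone reproducing this with plain QEMU rather than PVE, something along these lines selects the two device flavors; the device names are the standard QEMU 4.0+ virtio-rng PCI variants and the command lines are illustrative, not taken from this report:

# Modern-only virtio-rng (reproduces the hang per this report):
qemu-system-x86_64 ... \
    -object rng-random,filename=/dev/urandom,id=rng0 \
    -device virtio-rng-pci-non-transitional,rng=rng0

# Legacy/transitional virtio-rng (works):
qemu-system-x86_64 ... \
    -object rng-random,filename=/dev/urandom,id=rng0 \
    -device virtio-rng-pci-transitional,rng=rng0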