The fact that it reboots just fine suggests that https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=253175 is indeed fixed, but some other underlying issue is not. Here's `hotkernel` output, until I can find something more useful:

Sampling... Hit Ctrl-C to end.
^C
FUNCTION                                COUNT   PCNT
dtrace.ko`dtrace_dynvar_clean               3   0.0%
kernel`0xffffffff81                         4   0.0%
kernel`pagezero_erms                        8   0.0%
kernel`hpet_get_timecount                  18   0.0%
kernel`lock_delay                          22   0.0%
kernel`spinlock_exit                      575   0.9%
kernel`kobj_error_method                  817   1.2%
kernel`virtqueue_poll                   32208  48.8%
kernel`acpi_cpu_c1                      32371  49.0%

Half the samples are spent spinning in virtqueue_poll(); the other half are the idle loop (acpi_cpu_c1).
One more data point: trying to `kldunload virtio_random.ko` hangs indefinitely. That is in line with the nature of the fix for https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=253175: it moved de-initialization of the module as far towards the end as possible.
Not much different or unexpected from procstat kstack either:

root@freebsd:~ # procstat -w 1 kstack -v 6
  PID    TID COMM                TDNAME              KSTACK
    6 100074 rand_harvestq       -                   vtrnd_read+0xa4 random_kthread+0x174 fork_exit+0x80 fork_trampoline+0xe
    6 100074 rand_harvestq       -                   vtrnd_read+0xa4 random_kthread+0x174 fork_exit+0x80 fork_trampoline+0xe
    6 100074 rand_harvestq       -                   vtrnd_read+0xa4 random_kthread+0x174 fork_exit+0x80 fork_trampoline+0xe
[the same stack repeats on every 1-second sample]
More info on the hardware, before I try to dive into what's actually going wrong in the code.

Works:

# devinfo -p vtrnd0 -v
vtrnd0 pnpinfo vendor=0x00001af4 device=0x1005 subvendor=0x1af4 device_type=0x00000004
  virtio_pci0 pnpinfo vendor=0x1af4 device=0x1005 subvendor=0x1af4 subdevice=0x0004 class=0x00ff00 at slot=2 function=0 dbsf=pci0:0:2:0 handle=\_SB_.PCI0.S10_
    pci0
      pcib0 pnpinfo _HID=PNP0A03 _UID=1 _CID=none at handle=\_SB_.PCI0
        acpi0
          nexus0

# pciconf -lv | grep -B2 -A1 RNG
virtio_pci0@pci0:0:2:0: class=0x00ff00 rev=0x00 hdr=0x00 vendor=0x1af4 device=0x1005 subvendor=0x1af4 subdevice=0x0004
    vendor     = 'Red Hat, Inc.'
    device     = 'Virtio RNG'
    class      = old

# sysctl kern.random
kern.random.fortuna.concurrent_read: 1
kern.random.fortuna.minpoolsize: 64
kern.random.rdrand.rdrand_independent_seed: 0
kern.random.use_chacha20_cipher: 1
kern.random.block_seeded_status: 0
kern.random.random_sources: 'VirtIO Entropy Adapter','Intel Secure Key RNG'
kern.random.harvest.mask_symbolic: PURE_VIRTIO,PURE_RDRAND,[CALLOUT],[UMA],[FS_ATIME],SWI,INTERRUPT,NET_NG,[NET_ETHER],NET_TUN,MOUSE,KEYBOARD,ATTACH,CACHED
kern.random.harvest.mask_bin: 00000010010000000111011111
kern.random.harvest.mask: 590303
kern.random.initial_seeding.disable_bypass_warnings: 0
kern.random.initial_seeding.arc4random_bypassed_before_seeding: 0
kern.random.initial_seeding.read_random_bypassed_before_seeding: 0
kern.random.initial_seeding.bypass_before_seeding: 1

Misbehaves:

# devinfo -p vtrnd0 -v
vtrnd0 pnpinfo vendor=0x00001af4 device=0x1044 subvendor=0x1af4 device_type=0x00000004
  virtio_pci1 pnpinfo vendor=0x1af4 device=0x1044 subvendor=0x1af4 subdevice=0x1100 class=0x00ff00 at slot=0 function=1 dbsf=pci0:1:0:1
    pci1
      pcib1 pnpinfo vendor=0x1b36 device=0x000c subvendor=0x1b36 subdevice=0x0000 class=0x060400 at slot=1 function=0 dbsf=pci0:0:1:0 handle=\_SB_.PCI0.S08_
        pci0
          pcib0 pnpinfo _HID=PNP0A08 _UID=0 _CID=PNP0A03 at handle=\_SB_.PCI0
            acpi0
              nexus0

# pciconf -lv | grep -B2 -A1 RNG
virtio_pci1@pci0:1:0:1: class=0x00ff00 rev=0x01 hdr=0x00 vendor=0x1af4 device=0x1044 subvendor=0x1af4 subdevice=0x1100
    vendor     = 'Red Hat, Inc.'
    device     = 'Virtio RNG'
    class      = old

# sysctl kern.random
[identical to the sysctl kern.random output above]

The only difference I can see is in the devinfo output: the working device (device=0x1005, a transitional virtio device ID) sits directly on pci0, while the misbehaving one (device=0x1044, a modern virtio device ID) sits behind a PCIe bridge.
Red herring from pciconf -lv, thanks to this bug: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=269874
Been debugging this on and off for the past two days. At first I thought the `dtrace -n 'kinst::virtqueue_poll:'` output was not very useful, until I realized that it constantly repeats:

  1  73355  virtqueue_poll:112
  1  73356  virtqueue_poll:115
  1  73357  virtqueue_poll:118
  1  73358  virtqueue_poll:124
  1  73359  virtqueue_poll:129
  1  73360  virtqueue_poll:133
  1  73361  virtqueue_poll:135
  1  73362  virtqueue_poll:137
  1  73363  virtqueue_poll:141
  1  73364  virtqueue_poll:144
  1  73365  virtqueue_poll:151
  1  73366  virtqueue_poll:155
  1  73367  virtqueue_poll:158
  1  73355  virtqueue_poll:112
[the same sequence of offsets repeats, forever]

This function never returns; we don't even get (back) into vtrnd_read:

dtrace -n 'kinst::vtrnd_read:'
dtrace: description 'kinst::vtrnd_read:' matched 88 probes
… # this never prints anything on this machine

So let's look at those instructions:

0xffffffff80a113e0 <+112>: 48 89 df                mov    %rbx,%rdi
0xffffffff80a113e3 <+115>: ff 50 08                call   *0x8(%rax)
0xffffffff80a113e6 <+118>: 41 0f b7 4c 24 56       movzwl 0x56(%r12),%ecx
0xffffffff80a113ec <+124>: 49 8b 44 24 48          mov    0x48(%r12),%rax
0xffffffff80a113f1 <+129>: 66 3b 48 02             cmp    0x2(%rax),%cx
0xffffffff80a113f5 <+133>: 75 32                   jne    0xffffffff80a11429 <virtqueue_poll+185>
0xffffffff80a113f7 <+135>: f3 90                   pause
0xffffffff80a113f9 <+137>: 49 8b 1c 24             mov    (%r12),%rbx
0xffffffff80a113fd <+141>: 48 8b 0b                mov    (%rbx),%rcx
0xffffffff80a11400 <+144>: 0f b6 15 59 c0 0c 01    movzbl 0x10cc059(%rip),%edx  # 0xffffffff81add460 <virtio_bus_poll_desc>
0xffffffff80a11407 <+151>: 48 8b 04 d1             mov    (%rcx,%rdx,8),%rax
0xffffffff80a1140b <+155>: 4c 39 38                cmp    %r15,(%rax)
0xffffffff80a1140e <+158>: 74 d0                   je     0xffffffff80a113e0 <virtqueue_poll+112>

In other (C) words:

void *
virtqueue_poll(struct virtqueue *vq, uint32_t *len)
{
	void *cookie;

	VIRTIO_BUS_POLL(vq->vq_dev);
	while ((cookie = virtqueue_dequeue(vq, len)) == NULL) {
		cpu_spinwait();
		VIRTIO_BUS_POLL(vq->vq_dev);
	}

	return (cookie);
}

(if *0x8(%rax) is an obfuscated call to virtqueue_dequeue(), then…) This is literally just the inner loop here. That means virtqueue_dequeue() never returns anything other than NULL. Why, I don't know yet.

There are several virtqueue_dequeue()s running on the system, according to dtrace, but none of them seem to be running on CPU1, like our virtqueue_poll(). My assumption here is:

  rand_harvestq running on CPU1 (true)
  → virtqueue_poll() running on CPU1 (true)

Therefore, our virtqueue_dequeue() should be running on CPU1 as well. But I'm not seeing it on CPU1; in fact, it appears to be running on every CPU except CPU1. Either that, or I don't know how to filter for it, because this doesn't work:

root@freebsd:~ # dtrace -n 'kinst::virtqueue_dequeue: /cpu == 1/'
dtrace: invalid probe specifier kinst::virtqueue_dequeue: /cpu == 1/: syntax error near end of input
N.B.: virtio_random is the only call-site of virtqueue_poll() that passes something other than NULL for the *len parameter:

sys/dev/virtio/console/virtio_console.c
988:		virtqueue_poll(vq, NULL);
1382:		virtqueue_poll(vq, NULL);
sys/dev/virtio/random/virtio_random.c
288:		virtqueue_poll(vq, &rdlen);
sys/dev/virtio/block/virtio_blk.c
1442:		virtqueue_poll(vq, NULL);
sys/dev/virtio/network/if_vtnet.c
3526:		virtqueue_poll(vq, NULL);

Though I don't see how that would influence it.
Re the DTrace stumbling block: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=269906. When done "right", it shows no virtqueue_dequeue() on CPU1.
see also this review: https://reviews.freebsd.org/D38898
A commit towards fixing this issue has landed. See:

base: f1c5a2e3a625053e2b70d5b1777d849a4d9328f2
review: D41656

This just leaves the question of MFC; hence, adding the author and committer of the above to this bug.
Forgot to tag this bug in the merges to stable/13 (c31927eb866f18b87cc196f14cbbcf658e10d38c) and stable/14 (b7f5e993476f6e445895383e73aed01741f8b9cf). I might try to get this into 14.0 as well.
(In reply to John Baldwin from comment #10) I think a merge to 14.0 would be a good idea, given that this bug's been reported in several contexts.
Merged to 14.0 in commit c1e05a5e1a3e94633a050026e26211aaae59cabc.
Perhaps it's a good idea to MFC this to 13.x as well, then.
Has this been merged to 13.x already, or will it be merged?
Yeah, it was merged to stable/13 in base c31927eb866f18b87cc196f14cbbcf658e10d38c, but I don't think that has made it into a release yet.
(In reply to Mina Galić from comment #15) Thanks for your reply. Can I see somewhere when, and in which version, this will be released? We are currently using OPNsense, which is based on FreeBSD 13.2, and we are affected by this bug.
(In reply to sparkblast from comment #16) The fix will appear in 13.3, which has not been officially announced but will likely be released within a couple of months.
(In reply to Mark Johnston from comment #17) The schedule's just recently been announced: https://www.freebsd.org/releases/13.3R/schedule/ Expected around March officially (so no contradiction)
(In reply to Kyle Evans from comment #18) Oh, hmm. I was going off of https://www.freebsd.org/releng/, which looks like it needs an update.