Bug 215740

Summary: [bhyve] utilizing passthru breaks raw device usage with virtio-blk | ahci-hd
Product: Base System
Reporter: Harald Schmalzbauer <bugzilla.freebsd>
Component: misc
Assignee: freebsd-virtualization (Nobody) <virtualization>
Status: New
Severity: Affects Some People
CC: grehan
Priority: ---    
Version: 11.0-STABLE   
Hardware: amd64   
OS: Any   
See Also: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=260178
Attachments:
Description | Flags
Verbose boot log part 1, listing ACPI+CPU messages | none
Verbose boot log part 2, listing device probe messages | none
Verbose boot log part 3, listing rest (msi assignment + consumer attaching messages) | none
Verbose boot of ppt corrupting /dev/ada via bhyve-ahci | none

Description Harald Schmalzbauer 2017-01-03 16:47:31 UTC
Using a passthru device with bhyve(8) in a guest whose storage backend is a physical device (regardless of whether it is accessed through virtio-blk or ahci-hd)
corrupts guest disk access, while file-backed ahci-hd (or virtio-blk) doesn't show that problem with passthru.

Steps to reproduce:

Use any hard drive containing any installed OS.
On the host: 'hd /dev/ada6 | less'
See MBR/PMBR code.

Use the same device (ada6 in this example) and connect it to a FreeBSD Live DVD guest with a passthru device involved
(e.g.
bhyveload -d ./releases/ISO-IMAGES/11.0/FreeBSD-11.0-RELEASE-amd64-disc1.iso -S -m 2G ppttest && bhyve -u -A -H -P -s 0,hostbridge -s 3,ahci,cd:./releases/ISO-IMAGES/11.0/FreeBSD-11.0-RELEASE-amd64-disc1.iso,hd:/dev/ada6 -s 5,passthru,0/25/0 -s 31,lpc -l com1,stdio -S -m 2G -c 4 ppttest
)

Inside the guest, 'hd /dev/ada0 | less' doesn't work anymore (endless I/O)
Using 'dd if=/dev/ada0 count=1 | hd' shows only 0x0 instead of the output you saw on the host!

Simply repeating this without the passthru device in place solves the problem: you see exactly the same bytes inside the guest as on the host.
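
The host/guest comparison in the steps above can also be done mechanically. The following is a minimal sketch, assuming a Python 3 interpreter is available in both environments (the live DVD may not provide one) and assuming a 512-byte logical sector. The function name first_sector_report and the SECTOR_SIZE constant are hypothetical and not part of any FreeBSD tool; 'hd' and 'dd' as used above show the same information.

```python
import hashlib

SECTOR_SIZE = 512  # assumption: 512-byte logical sectors


def first_sector_report(path, sector_size=SECTOR_SIZE):
    """Read the first sector of `path` and summarize what came back."""
    with open(path, "rb") as dev:
        data = dev.read(sector_size)
    return {
        # Number of bytes actually returned by the read.
        "bytes_read": len(data),
        # True only if something was read and every byte is 0x00.
        "all_zero": len(data) > 0 and data.count(0) == len(data),
        # Digest that can be compared between host and guest.
        "sha256": hashlib.sha256(data).hexdigest(),
    }
```

Calling first_sector_report("/dev/ada6") on the host and first_sector_report("/dev/ada0") in the guest should give the same sha256 when the disk is exported correctly. In the failing case described here, the guest result would be expected to report all_zero as true.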
Comment 1 Peter Grehan freebsd_committer freebsd_triage 2017-01-04 20:37:20 UTC
Would you be able to post a verbose dmesg (boot -v) ?
Comment 2 Harald Schmalzbauer 2017-01-05 09:25:49 UTC
Created attachment 178537 [details]
Verbose boot log part 1, listing ACPI+CPU messages
Comment 3 Harald Schmalzbauer 2017-01-05 09:26:38 UTC
Created attachment 178538 [details]
Verbose boot log part 2, listing device probe messages
Comment 4 Harald Schmalzbauer 2017-01-05 09:28:00 UTC
Created attachment 178539 [details]
Verbose boot log part 3, listing rest (msi assignment + consumer attaching messages)
Comment 5 Harald Schmalzbauer 2017-01-05 09:28:53 UTC
(In reply to Peter Grehan from comment #1)

Thanks for your attention!
Please find them attached, I hope my 3-part separation doesn't confuse anybody...

-harry
Comment 6 Harald Schmalzbauer 2017-05-24 18:46:25 UTC
Created attachment 182869 [details]
Verbose boot of ppt corrupting /dev/ada via bhyve-ahci

I tried to investigate further.
I can confirm that the same procedure also breaks UEFI booting:
X64 Exception Type - 000000000000000D     CPU Apic ID - 00000000 !!!!
RIP  - 000000007FB00FF5, CS  - 0000000000000028, RFLAGS - 0000000000010002
ExceptionData - 0000000000000000
RAX  - 0000000000000000, RCX - 0000000000000008, RDX - 0000000000000408
RBX  - 0000000000000001, RSP - 000000007FBEF468, RBP - 000000007FBEF7C8
RSI  - 000000007E549B2E, RDI - 000000007FBEF468
R8   - 000000007FBEF97C, R9  - 000000007FC16A9F, R10 - 00000000000003F8
R11  - 0000000000000040, R12 - 0000000000000000, R13 - 0000000000000000
R14  - 0000000000000000, R15 - 0000000000000000
DS   - 0000000000000008, ES  - 0000000000000008, FS  - 0000000000000008
GS   - 0000000000000008, SS  - 0000000000000008
CR0  - 0000000080000033, CR2 - 0000000000000000, CR3 - 000000007FB8E000
CR4  - 0000000000000668, CR8 - 0000000000000000
DR0  - 0000000000000000, DR1 - 0000000000000000, DR2 - 0000000000000000
DR3  - 0000000000000000, DR6 - 00000000FFFF0FF0, DR7 - 0000000000000400
GDTR - 000000007FB78E98 000000000000003F, LDTR - 0000000000000000
IDTR - 000000007F711018 0000000000000FFF,   TR - 0000000000000000
FXSAVE_STATE - 000000007FBEF0C0

This happens as soon as I add a passthru device.
Attached is a verbose boot of an install ISO with bhyve-ahci. The disk is responsive, and dd to /dev/null leads to _real_ disk activity, but unfortunately it returns NULs only, not the disk's data.
One thing I noticed is that I always get the message "pcib0: no PRT entry for 0.5.INTA" for any passthru device, regardless of which slot I use.

Any help highly appreciated! How do others use passthru?

-harry
Comment 7 Harald Schmalzbauer 2017-06-11 10:56:48 UTC
Is there anybody who has checked whether the steps to reproduce show the reported results? Meaning, is there anybody who can confirm correct behaviour in that case?

I have observed many more strange errors that at first sight look completely unrelated, but all of them show up as soon as one condition is true: shutting down a bhyve guest which had ppt in use.

Latest example:
panic: Memory modified after free 0xfffff8002486a030(48) val=0 @ 0xfffff8002486a030

cpuid = 5
KDB: stack backtrace:
#0 0xffffffff805bf327 at kdb_backtrace+0x67
#1 0xffffffff8057f266 at vpanic+0x186
#2 0xffffffff8057f2e3 at panic+0x43
#3 0xffffffff8082eaeb at trash_ctor+0x4b
#4 0xffffffff8082aaec at uma_zalloc_arg+0x52c
#5 0xffffffff813b54a6 at zio_add_child+0x26
#6 0xffffffff813b5a05 at zio_create+0x385
#7 0xffffffff813b6de2 at zio_vdev_child_io+0x232
#8 0xffffffff81396be0 at vdev_mirror_io_start+0x370
#9 0xffffffff813bc629 at zio_vdev_io_start+0x4a9
#10 0xffffffff813b76bc at zio_execute+0x36c
#11 0xffffffff813b6868 at zio_nowait+0xb8
#12 0xffffffff81396bec at vdev_mirror_io_start+0x37c
#13 0xffffffff813bc383 at zio_vdev_io_start+0x203
#14 0xffffffff813b76bc at zio_execute+0x36c
#15 0xffffffff805d10dd at taskqueue_run_locked+0x13d
#16 0xffffffff805d1e78 at taskqueue_thread_loop+0x88
#17 0xffffffff80543844 at fork_exit+0x84

#0  doadump (textdump=<value optimized out>) at pcpu.h:222
#1  0xffffffff8057ece0 in kern_reboot (howto=260) at /usr/local/share/deploy-tools/RELENG_11/src/sys/kern/kern_shutdown.c:366
#2  0xffffffff8057f2a0 in vpanic (fmt=<value optimized out>, ap=<value optimized out>)
    at /usr/local/share/deploy-tools/RELENG_11/src/sys/kern/kern_shutdown.c:759
#3  0xffffffff8057f2e3 in panic (fmt=<value optimized out>) at /usr/local/share/deploy-tools/RELENG_11/src/sys/kern/kern_shutdown.c:690
#4  0xffffffff8082eaeb in trash_ctor (mem=<value optimized out>, size=<value optimized out>, arg=<value optimized out>, flags=<value optimized out>)
    at /usr/local/share/deploy-tools/RELENG_11/src/sys/vm/uma_dbg.c:80
#5  0xffffffff8082aaec in uma_zalloc_arg (zone=0xfffff8001febc680, udata=0xfffff8001ad5f340, flags=<value optimized out>)
    at /usr/local/share/deploy-tools/RELENG_11/src/sys/vm/uma_core.c:2152
#6  0xffffffff813b54a6 in zio_add_child (pio=0xfffff8026f350b88, cio=0xfffff8002478b7b0)
    at /usr/local/share/deploy-tools/RELENG_11/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c:460
#7  0xffffffff813b5a05 in zio_create (pio=0xfffff8026f350b88, spa=<value optimized out>, txg=433989, bp=<value optimized out>, data=0xfffffe0058afa000, 
    size=1024, type=<value optimized out>, priority=ZIO_PRIORITY_ASYNC_WRITE, flags=<value optimized out>, vd=<value optimized out>, 
    offset=<value optimized out>, zb=<value optimized out>, pipeline=<value optimized out>)
    at /usr/local/share/deploy-tools/RELENG_11/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c:690
#8  0xffffffff813b6de2 in zio_vdev_child_io (pio=0xfffff8026f350b88, bp=<value optimized out>, vd=<value optimized out>, offset=325398016, 
    data=<value optimized out>, size=1024, type=<value optimized out>, flags=1048704, done=<value optimized out>)
    at /usr/local/share/deploy-tools/RELENG_11/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c:1141
#9  0xffffffff81396be0 in vdev_mirror_io_start (zio=0xfffff8026f350b88)
    at /usr/local/share/deploy-tools/RELENG_11/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_mirror.c:488
#10 0xffffffff813bc629 in zio_vdev_io_start (zio=0xfffff8026f350b88)
    at /usr/local/share/deploy-tools/RELENG_11/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c:3143
#11 0xffffffff813b76bc in zio_execute (zio=<value optimized out>)
    at /usr/local/share/deploy-tools/RELENG_11/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c:1681
#12 0xffffffff813b6868 in zio_nowait (zio=0xfffff8026f350b88)
    at /usr/local/share/deploy-tools/RELENG_11/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c:1739
#13 0xffffffff81396bec in vdev_mirror_io_start (zio=0xfffff8026f7a7b88)
    at /usr/local/share/deploy-tools/RELENG_11/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_mirror.c:488
#14 0xffffffff813bc383 in zio_vdev_io_start (zio=0xfffff8026f7a7b88)
    at /usr/local/share/deploy-tools/RELENG_11/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c:3021
#15 0xffffffff813b76bc in zio_execute (zio=<value optimized out>)
    at /usr/local/share/deploy-tools/RELENG_11/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c:1681
#16 0xffffffff805d10dd in taskqueue_run_locked (queue=0xfffff8001ab5a700) at /usr/local/share/deploy-tools/RELENG_11/src/sys/kern/subr_taskqueue.c:454
#17 0xffffffff805d1e78 in taskqueue_thread_loop (arg=<value optimized out>) at /usr/local/share/deploy-tools/RELENG_11/src/sys/kern/subr_taskqueue.c:741
#18 0xffffffff80543844 in fork_exit (callout=0xffffffff805d1df0 <taskqueue_thread_loop>, arg=0xfffff8001aa90720, frame=0xfffffe043f609ac0)
    at /usr/local/share/deploy-tools/RELENG_11/src/sys/kern/kern_fork.c:1042
#19 0xffffffff808598ae in fork_trampoline () at /usr/local/share/deploy-tools/RELENG_11/src/sys/amd64/amd64/exception.S:611
#20 0x0000000000000000 in ?? ()

I consider this a severe problem, which shouldn't exist in 11.1-RELEASE.
If nobody can prove my findings wrong, passthru should be disabled in RELENG_11_1 until it can be ruled out as the source of these strange problems (some form of memory corruption).

Thanks,

-harry
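
The panic string "Memory modified after free ... val=0" together with the trash_ctor frame points to a poison-on-free check: freed memory is filled with a known pattern, and the pattern is verified when the item is handed out again. The following is only a toy model of that general technique, not the kernel's code. The class PoisoningPool, the POISON value and the error text are assumptions made for illustration.

```python
POISON = bytes.fromhex("deadc0de")  # assumption: illustrative junk pattern


class PoisoningPool:
    """Toy fixed-size allocator that poisons items on free and checks on alloc."""

    def __init__(self, item_size, count):
        if item_size % len(POISON):
            raise ValueError("item_size must be a multiple of the poison length")
        self.item_size = item_size
        self.pattern = POISON * (item_size // len(POISON))
        # All items start out free, so they start out poisoned.
        self.free_items = [bytearray(self.pattern) for _ in range(count)]

    def free(self, item):
        # Overwrite the item with the poison pattern before putting it back.
        item[:] = self.pattern
        self.free_items.append(item)

    def alloc(self):
        item = self.free_items.pop()
        # Any byte that differs from the pattern was written after free.
        for offset, (got, want) in enumerate(zip(item, self.pattern)):
            if got != want:
                raise RuntimeError(
                    f"memory modified after free at offset {offset}, val={got:#x}"
                )
        return item
```

In this model, a stale writer that stores zeros into an item after free() makes the next alloc() fail with val=0x0. The model cannot say who the stale writer is; a misdirected DMA write would be one possible candidate, but the backtrace alone does not establish that.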
Comment 8 Harald Schmalzbauer 2022-03-25 11:23:51 UTC
Seems to be fixed in
https://cgit.freebsd.org/src/commit/?id=246c398145674e4a9337fd933a6e6da7f160118e

Will close as soon as I have had the opportunity to do a real-world check; anybody else is welcome to check and close.
Comment 9 commit-hook freebsd_committer freebsd_triage 2022-03-27 20:15:14 UTC
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=dd113f67dfb5bdaf5d8b3a87bb19924ad447494c

commit dd113f67dfb5bdaf5d8b3a87bb19924ad447494c
Author:     Bjoern A. Zeeb <bz@FreeBSD.org>
AuthorDate: 2022-03-18 20:39:06 +0000
Commit:     Bjoern A. Zeeb <bz@FreeBSD.org>
CommitDate: 2022-03-27 17:57:28 +0000

    bhyve: Do not remove guest physical addresses from IOMMU host domain

    This permits I/O devices on the host to directly access wired memory
    dedicated to guests using passthru devices.  Note that wired memory
    belonging to guests that do not use passthru devices has always been
    accessible by I/O devices on the host.

    bhyve maps guest physical addresses into the user address space of
    the bhyve process by mmap'ing /dev/vmm/<vmname>.  Device models pass
    pointers derived from this mapping directly to system calls such as
    preadv() to minimize copies when emulating DMA.  If the backing store
    for a device model is a raw host device (e.g. when exporting a raw disk
    device such as /dev/ada<n> as a drive in the guest), the host device
    driver (e.g. ahci for /dev/ada<n>) can itself use DMA on the host
    directly to the guest's memory.  However, if the guest's memory is
    not present in the host IOMMU domain, these DMA requests by the host
    device will fail without raising an error visible to the host device
    driver or to the guest resulting in non-working I/O in the guest.

    It is unclear why guest addresses were removed from the IOMMU host domain
    initially, especially only for VM's with a passthru device as the
    host IOMMU domain does not affect the permissions of passthru devices,
    only devices on the host.

    A considered alternative was using bounce buffers instead (D34535
    is a proof of concept), but that adds additional overhead for unclear
    benefit.

    This solves a long-standing problem when using passthru devices and
    physical disks in the same VM.

    Thanks to:      grehan (patience and help)
    Thanks to:      jhb (for improving the commit message)
    PR:             260178, 215740
    Reviewed by:    grehan, jhb
    Differential Revision: https://reviews.freebsd.org/D34607

    (cherry picked from commit 246c398145674e4a9337fd933a6e6da7f160118e)

 sys/amd64/vmm/vmm.c | 2 --
 1 file changed, 2 deletions(-)
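
The zero-copy pattern described in the commit message above (device models passing pointers into the mapped guest memory straight to preadv()) can be sketched in user space. This is a minimal sketch under stated assumptions: an anonymous mmap stands in for the mapping of /dev/vmm/<vmname>, an ordinary file stands in for the raw disk, and the function name dma_read_into_guest is hypothetical. It does not model the IOMMU; it only shows why the read target is guest memory itself rather than an intermediate buffer.

```python
import mmap
import os


def dma_read_into_guest(fd, guest_mem, guest_addr, length, disk_offset):
    """Read `length` bytes at `disk_offset` directly into the guest mapping.

    `guest_mem` is a writable mmap standing in for guest RAM and
    `guest_addr` is an offset into it. No intermediate buffer is used.
    Requires os.preadv (Linux, FreeBSD and similar; not Windows).
    """
    with memoryview(guest_mem) as whole:
        with whole[guest_addr:guest_addr + length] as window:
            # The kernel fills `window`, i.e. the guest memory itself.
            return os.preadv(fd, [window], disk_offset)
```

A caller would create the stand-in guest memory with mmap.mmap(-1, size), open the backing store with os.open(path, os.O_RDONLY), and pass both to dma_read_into_guest. With a raw disk as the backing store, the buffer handed to the host driver is that same guest memory, which is why the commit message says the guest's memory has to be present in the host IOMMU domain.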
Comment 10 commit-hook freebsd_committer freebsd_triage 2022-03-30 15:50:30 UTC
A commit in branch releng/13.1 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=1c6abf864ecd3bbf07ace2018f9aab45b6406ce2

commit 1c6abf864ecd3bbf07ace2018f9aab45b6406ce2
Author:     Bjoern A. Zeeb <bz@FreeBSD.org>
AuthorDate: 2022-03-18 20:39:06 +0000
Commit:     Bjoern A. Zeeb <bz@FreeBSD.org>
CommitDate: 2022-03-30 15:33:47 +0000

    bhyve: Do not remove guest physical addresses from IOMMU host domain

    This permits I/O devices on the host to directly access wired memory
    dedicated to guests using passthru devices.  Note that wired memory
    belonging to guests that do not use passthru devices has always been
    accessible by I/O devices on the host.

    bhyve maps guest physical addresses into the user address space of
    the bhyve process by mmap'ing /dev/vmm/<vmname>.  Device models pass
    pointers derived from this mapping directly to system calls such as
    preadv() to minimize copies when emulating DMA.  If the backing store
    for a device model is a raw host device (e.g. when exporting a raw disk
    device such as /dev/ada<n> as a drive in the guest), the host device
    driver (e.g. ahci for /dev/ada<n>) can itself use DMA on the host
    directly to the guest's memory.  However, if the guest's memory is
    not present in the host IOMMU domain, these DMA requests by the host
    device will fail without raising an error visible to the host device
    driver or to the guest resulting in non-working I/O in the guest.

    It is unclear why guest addresses were removed from the IOMMU host domain
    initially, especially only for VM's with a passthru device as the
    host IOMMU domain does not affect the permissions of passthru devices,
    only devices on the host.

    A considered alternative was using bounce buffers instead (D34535
    is a proof of concept), but that adds additional overhead for unclear
    benefit.

    This solves a long-standing problem when using passthru devices and
    physical disks in the same VM.

    Approved by:    re (gjb)
    Thanks to:      grehan (patience and help)
    Thanks to:      jhb (for improving the commit message)
    PR:             260178, 215740
    Reviewed by:    grehan, jhb
    Differential Revision: https://reviews.freebsd.org/D34607

    (cherry picked from commit 246c398145674e4a9337fd933a6e6da7f160118e)
    (cherry picked from commit dd113f67dfb5bdaf5d8b3a87bb19924ad447494c)

 sys/amd64/vmm/vmm.c | 2 --
 1 file changed, 2 deletions(-)