Bug 212681 - I/O is slow for FreeBSD DOMu on XenServer
Summary: I/O is slow for FreeBSD DOMu on XenServer
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 10.3-RELEASE
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: freebsd-xen (Nobody)
URL:
Keywords: performance
Depends on:
Blocks:
 
Reported: 2016-09-14 10:28 UTC by rainer
Modified: 2017-03-25 21:32 UTC

Attachments
flamegraph dtrace (62.46 KB, image/svg+xml)
2017-02-07 17:25 UTC, rainer
Initial debug patch (531 bytes, patch)
2017-02-07 18:20 UTC, Roger Pau Monné
vmstat -ai before running dc3dd (7.87 KB, text/plain)
2017-02-10 16:35 UTC, rainer
vmstat -ai while running dc3dd (7.87 KB, text/plain)
2017-02-10 16:35 UTC, rainer
Disable all the Xen enlightments (296 bytes, patch)
2017-02-10 17:56 UTC, Roger Pau Monné
Screenshot of panic (157.62 KB, image/tiff)
2017-02-10 23:11 UTC, rainer
Disable all the Xen enlightments (608 bytes, patch)
2017-02-11 09:50 UTC, Roger Pau Monné
Selectively disable PV optimizations (1.93 KB, patch)
2017-02-13 10:15 UTC, Roger Pau Monné

Description rainer 2016-09-14 10:28:39 UTC
Hi,

I run several FreeBSD VMs on XenServer 6.5 SP2.

I/O is at least 5 times faster on Linux VMs than on the FreeBSD VMs.
This is also the case with FreeBSD 11-RC2.

Unfortunately, this is very noticeable in some workloads (MySQL esp.) and as a result, I will not be able to use FreeBSD on our cloud-setup (we run our own Apache CloudStack "cloud") except maybe for stuff like reverse-proxies etc.

An easy test to spot this is with dc3dd.

With dc3dd's wipe mode, I get over 100 MB/s in a Linux VM, and maybe 20 MB/s in a FreeBSD VM.

Tools etc. are installed, NICs show up as "xn", drives show up as "ada", for what it's worth.

I've already posted on the freebsd-xen mailing list, but apparently nobody there uses XenServer.
I also posted on Citrix's forum, but nobody responded there, either.

I currently run about 100 FreeBSD servers in various forms (physical and virtual) and if nothing changes, this will probably be the end of it :-(

If anyone needs a VM to test, I can provide that, too (and access to the CloudStack GUI for that particular VM).
Comment 1 karl 2016-09-14 13:01:57 UTC
Hi,

We run XenServer 6.5 and 7 here - with a range of FreeBSD versions (mostly 10.x now). We've not noticed any really low I/O performance.

How are you running dc3dd? - So I can try and install / replicate this here.

We have seen that I/O behaves differently under XenServer than it does on bare metal, which is expected (e.g. local SATA SSD vs. multipath iSCSI or similar). You will see differences even with "Local Storage" that is simply mapped through XenServer.

It could be that dc3dd, or whatever else you're running, is a particularly 'bad case' for the I/O. For example, in an equally 'synthetic' test, a regular "dd if=/dev/zero of=test.dat bs=64k count=10240" on our test pool here (which has two iSCSI paths over Gigabit) gets around 220 MB/s to a FreeBSD DomU.

That may well be a best case - and you may well have hit some worst cases.

-Karl
Comment 2 Daniel Ylitalo 2016-09-14 13:13:12 UTC
We use XenServer 7 over here and unfortunately we can't run FreeBSD VMs on it. (We tried both 11-RC1 and 10.3, on local storage.)

Just a regular "portsnap fetch extract" takes about 10 times as long as it does on a bare-metal server.

So when someone requests a FreeBSD server they get a bare-metal one; all variants of Linux get a VM.

I'm happy to provide a VM if someone is eager to troubleshoot this. I've tried all the solutions out there in the Xen I/O mail threads and forum posts, but none did the trick.
Comment 3 rainer 2016-09-14 13:36:08 UTC
Our hardware is local disks (not SSDs) networked via ScaleIO.

I've run a dd from one disk of a VM to another one and it was very, very slow.

OS: "Other (64bit)" (which is actually a little bit faster than choosing "FreeBSD 10")

(freebsd11 </root>) 0 # dc3dd wipe=/dev/ada1

dc3dd 7.2.641 started at 2016-08-18 13:24:02 +0200
compiled options:
command line: dc3dd wipe=/dev/ada1
device size: 104857600 sectors (probed), 53,687,091,200 bytes
sector size: 512 bytes (probed)
53687091200 bytes ( 50 G ) copied ( 100% ), 3084 s, 17 M/s

input results for pattern `00':
104857600 sectors in

output results for device `/dev/ada1':
104857600 sectors out

dc3dd completed at 2016-08-18 14:15:26 +0200
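
As a quick sanity check on the figure dc3dd reports, the average rate can be recomputed from the byte count and elapsed time in the output above. This is only a sketch; the function name is illustrative, and whether dc3dd's "M/s" means decimal megabytes or mebibytes is not stated in the output, so both are shown.

```python
def average_rate(total_bytes: int, seconds: float, unit: int) -> float:
    """Return the average transfer rate in multiples of `unit` bytes per second."""
    return total_bytes / seconds / unit

# Values copied from the dc3dd run above.
TOTAL_BYTES = 53_687_091_200
ELAPSED_S = 3084

rate_mb = average_rate(TOTAL_BYTES, ELAPSED_S, 1_000_000)   # decimal megabytes
rate_mib = average_rate(TOTAL_BYTES, ELAPSED_S, 1024 ** 2)  # mebibytes

print(f"{rate_mb:.1f} MB/s, {rate_mib:.1f} MiB/s")
```

Either way the result is in the range of the 17 M/s that dc3dd printed.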


The question remains: why is this and what can one do?

One of our customers has an application workload (PHP + MySQL) that takes 3s to process on an Ubuntu 16 VM with 2 vCPUs and 8GB RAM.
The FreeBSD VM is completely unusable for this because it grinds to a halt, regardless of how many vCPUs and how much RAM I give it.
Comment 4 karl 2016-09-14 14:28:59 UTC
(In reply to rainer from comment #3)

Please try a 'like for like' comparison - i.e. run:

  dd if=/dev/zero of=test.dat bs=64k count=20480

(Will consume ~1.25 GiB of disk space) and see what that comes back with. It's still not a 'proper' test but will at least give us a comparison between your system and ours (both running the same command).
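
For reference, the amount of data that dd command writes follows directly from bs and count. The helper below is a small illustrative sketch (its name and the suffix table are assumptions, covering only the k/m/g suffixes used in this thread):

```python
def dd_total_bytes(bs: str, count: int) -> int:
    """Total bytes written by dd for a block size such as '64k' and a block count."""
    units = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
    suffix = bs[-1].lower()
    block = int(bs[:-1]) * units[suffix] if suffix in units else int(bs)
    return block * count

total = dd_total_bytes("64k", 20480)
print(total, "bytes =", total / 1024 ** 3, "GiB")
```

With bs=64k and count=20480 this gives 1342177280 bytes, i.e. 1.25 GiB.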

-Karl
Comment 5 rainer 2016-09-14 15:04:18 UTC
10.3-RELEASE-p5:

(server </srv>) 0 # time dd if=/dev/zero of=test.dat bs=64k count=20480
20480+0 records in
20480+0 records out
1342177280 bytes transferred in 1.769942 secs (758317078 bytes/sec)
dd if=/dev/zero of=test.dat bs=64k count=20480  0.02s user 1.75s system 99% cpu 1.775 total
(server </srv>) 0 # time dd if=/dev/zero of=test.dat bs=64k count=204800
204800+0 records in
204800+0 records out
13421772800 bytes transferred in 17.266468 secs (777331701 bytes/sec)
dd if=/dev/zero of=test.dat bs=64k count=204800  0.15s user 17.06s system 99% cpu 17.271 total

This is ZFS.
The high throughput is probably due to compression being on.
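
To illustrate why a /dev/zero test says little on a compressed dataset: a block of zeros shrinks to almost nothing, so very little actually reaches the disk. The sketch below uses zlib purely as a stand-in; which compressor the ZFS dataset here uses is not stated in this report.

```python
import os
import zlib

BLOCK = 64 * 1024  # same block size as the dd runs (bs=64k)

zeros = bytes(BLOCK)             # what /dev/zero delivers
random_data = os.urandom(BLOCK)  # incompressible data for comparison

zero_ratio = len(zlib.compress(zeros)) / BLOCK
random_ratio = len(zlib.compress(random_data)) / BLOCK

print(f"zeros compress to {zero_ratio:.4f} of their size")
print(f"random data compresses to {random_ratio:.4f} of its size")
```

A fairer file-level test on such a dataset would write incompressible data, or be run on a dataset with compression turned off.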


root@other-server:/srv# lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 14.04.4 LTS
Release:	14.04
Codename:	trusty
root@other-server:/srv# time dd if=/dev/zero of=test.dat bs=64k count=20480
20480+0 records in
20480+0 records out
1342177280 bytes (1.3 GB) copied, 2.5559 s, 525 MB/s

real	0m2.571s
user	0m0.041s
sys	0m2.125s

root@other-server:/srv# time dd if=/dev/zero of=test.dat bs=64k count=204800
204800+0 records in
204800+0 records out
13421772800 bytes (13 GB) copied, 93.6892 s, 143 MB/s

real	1m33.940s
user	0m0.219s
sys	0m26.136s


root@yet-another-server:/srv# lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 16.04.1 LTS
Release:	16.04
Codename:	xenial
root@yet-another-server:/srv# time dd if=/dev/zero of=test.dat bs=64k count=20480
20480+0 records in
20480+0 records out
1342177280 bytes (1.3 GB, 1.2 GiB) copied, 1.62124 s, 828 MB/s

real	0m1.652s
user	0m0.004s
sys	0m1.616s
root@yet-another-server:/srv# time dd if=/dev/zero of=test.dat bs=64k count=204800
204800+0 records in
204800+0 records out
13421772800 bytes (13 GB, 12 GiB) copied, 100.348 s, 134 MB/s

real	1m40.711s
user	0m0.172s
sys	0m25.004s


So, in this particular test, it's actually faster.
But I can assure you, in practical use, it's not.


10.3-RELEASE-p7, UFS:

(freebsd-srv2 </srv>) 0 # time dd if=/dev/zero of=test.dat bs=64k count=20480
20480+0 records in
20480+0 records out
1342177280 bytes transferred in 8.746548 secs (153452229 bytes/sec)
dd if=/dev/zero of=test.dat bs=64k count=20480  0.02s user 1.65s system 19% cpu 8.756 total
(freebsd-srv2 </srv>) 0 # time dd if=/dev/zero of=test.dat bs=64k count=204800
204800+0 records in
204800+0 records out
13421772800 bytes transferred in 99.078364 secs (135466233 bytes/sec)
dd if=/dev/zero of=test.dat bs=64k count=204800  0.22s user 18.20s system 18% cpu 1:39.30 total
Comment 6 karl 2016-09-14 16:42:48 UTC
(In reply to rainer from comment #5)

OK, so the 'like for like' test sees better performance.

I'm not overly familiar with dc3dd - maybe it's using a "really small block size" and, on your setup, this is causing an issue (maybe look at the ssz / bufsz options).

Do you have any 'local storage' on XenServer (or can you create some) - i.e. something you can map to a FreeBSD VM - and repeat both the dc3dd and dd test on - that is backed by a local SATA / SAS disk on XenServer (i.e. not going through ScaleIO?)

-Karl
Comment 7 rainer 2016-09-15 07:53:19 UTC
The hardware is HP DL380 Gen 8 servers with 600 or 900 GB SAS disks, running off a HW RAID controller.

On local storage, the realworld-test is even slower.

I can't run the dc3dd test here, and it has apparently become a bit more complicated to create VMs from a template on local storage now that we have completely eliminated it from our offerings.

(server </srv>) 0 # time dd if=/dev/zero of=test.dat bs=64k count=20480 
20480+0 records in
20480+0 records out
1342177280 bytes transferred in 22.239869 secs (60350053 bytes/sec)
dd if=/dev/zero of=test.dat bs=64k count=20480  0.04s user 6.20s system 28% cpu 22.255 total
(server </srv>) 0 # time dd if=/dev/zero of=test.dat bs=64k count=20480
20480+0 records in
20480+0 records out
1342177280 bytes transferred in 38.072567 secs (35253133 bytes/sec)
dd if=/dev/zero of=test.dat bs=64k count=20480  0.05s user 24.30s system 63% cpu 38.374 total
(server </srv>) 0 # time dd if=/dev/zero of=test.dat bs=64k count=20480
20480+0 records in
20480+0 records out
1342177280 bytes transferred in 5.782933 secs (232092820 bytes/sec)
dd if=/dev/zero of=test.dat bs=64k count=20480  0.05s user 2.27s system 38% cpu 6.021 total
(server </srv>) 0 # 
(server </srv>) 0 # 
(server </srv>) 0 # 
(server </srv>) 0 # time dd if=/dev/zero of=test.dat bs=64k count=20480
20480+0 records in
20480+0 records out
1342177280 bytes transferred in 7.891797 secs (170072452 bytes/sec)
dd if=/dev/zero of=test.dat bs=64k count=20480  0.00s user 2.26s system 27% cpu 8.141 total
(server </srv>) 0 # 
(server </srv>) 0 # time dd if=/dev/zero of=test.dat bs=64k count=20480
20480+0 records in
20480+0 records out
1342177280 bytes transferred in 12.598706 secs (106532947 bytes/sec)
dd if=/dev/zero of=test.dat bs=64k count=20480  0.02s user 2.37s system 18% cpu 12.845 total
(server </srv>) 0 # 
(server </srv>) 0 # time dd if=/dev/zero of=test.dat bs=64k count=20480
20480+0 records in
20480+0 records out
1342177280 bytes transferred in 7.917661 secs (169516892 bytes/sec)
dd if=/dev/zero of=test.dat bs=64k count=20480  0.03s user 2.23s system 27% cpu 8.144 total

(server </srv>) 0 # time dd if=/dev/zero of=test.dat bs=64k count=204800
204800+0 records in
204800+0 records out
13421772800 bytes transferred in 147.423047 secs (91042568 bytes/sec)
dd if=/dev/zero of=test.dat bs=64k count=204800  0.18s user 22.60s system 15% cpu 2:27.69 total


dc3dd is just a tool to securely wipe a disk.
I found, because it takes the filesystem out of the equation, that it's a nice benchmarking tool.
Also, the results of dc3dd correlate directly with the results of the testcase (in PHP) that the customer has built.
Comment 8 karl 2016-09-15 08:23:30 UTC
(In reply to rainer from comment #7)

Our config here is very similar - HP Proliant DL380 Gen -9- though, with local SAS disks - which we use for 'Local Storage' for XenServer, and then iSCSI off to a Synology NAS we use for primary storage of our VM's.

>On local storage, the realworld-test is even slower.

This troubles me, as we don't see that - but we are running the test on un-contended local storage (you don't say if you are).

The times you show for that last dd run - seem to vary quite a lot:

1342177280 bytes transferred in 22.239869 secs (60350053 bytes/sec)
1342177280 bytes transferred in 38.072567 secs (35253133 bytes/sec)
1342177280 bytes transferred in 5.782933 secs (232092820 bytes/sec)
1342177280 bytes transferred in 7.891797 secs (170072452 bytes/sec)
1342177280 bytes transferred in 12.598706 secs (106532947 bytes/sec)
1342177280 bytes transferred in 7.917661 secs (169516892 bytes/sec)
13421772800 bytes transferred in 147.423047 secs (91042568 bytes/sec)

Especially that last one. How busy is the node? / local disks?
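
To put a number on that spread, the six 1.3 GB runs listed above can be summarised as follows. This is only a rough sketch using the bytes/sec figures exactly as reported; the variable names are illustrative.

```python
import statistics

# bytes/sec for the six runs with count=20480, copied from the list above
rates = [60350053, 35253133, 232092820, 170072452, 106532947, 169516892]

mean_rate = statistics.mean(rates)
spread = max(rates) / min(rates)       # fastest run relative to slowest run
stdev_rate = statistics.stdev(rates)   # sample standard deviation

print(f"mean   {mean_rate / 1e6:.0f} MB/s")
print(f"stdev  {stdev_rate / 1e6:.0f} MB/s")
print(f"spread {spread:.1f}x between slowest and fastest run")
```

The fastest run is more than six times quicker than the slowest, which points at contention or caching effects rather than a steady bottleneck.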

At this stage I'd be tempted to re-run the tests while on FreeBSD looking at the output of something like:

  iostat -x 1

(which will show what FreeBSD thinks the disk service time, % busy etc. are) - and on Dom0 providing the local storage running:

  iostat -x 1 | egrep -e "^(Device|sd)"

Which will do the equivalent for Xen.

Things to look out for are high % busy, and service time / queue depths.

At this point I'm not sure filling the ticket with reams of debug / command output is helping much - it might be better to close this ticket - and revert to email to see if we can narrow things down to a possible cause, before re-opening the ticket again with more specific information.

-Karl
Comment 9 rainer 2016-09-15 21:50:56 UTC
I noticed that the local values are very unstable.
Also, I don't really have access to the Xen side (yet).

I will look into how I can debug this further. I'm merely a consumer of it at this point.

CloudStack does not support Xen-Server 7, unfortunately. And it looks like it's going to be a while before that happens.

At your request, I will contact you via eMail once I have more debugging information.
Comment 10 Sydney Meyer 2016-09-15 21:58:39 UTC
Your disk shows up as using the ada driver. Aren't you using some type of emulated disk device instead of the paravirtualized xbd block device?
Comment 11 rainer 2016-09-15 22:16:20 UTC
Well, I chose "FreeBSD 10 64bit" as OS-type.

In dmesg, I see:

xbd0: attaching as ada0
xbd0: features: write_barrier
xbd0: synchronize cache commands enabled.

sysctl -a |grep xen
kern.vm_guest: xen
device	xenpci
vfs.pfs.vncache.maxentries: 0
dev.xctrl.0.%parent: xenstore0
dev.xenbusb_back.0.%parent: xenstore0
dev.xenbusb_back.0.%pnpinfo: 
dev.xenbusb_back.0.%location: 
dev.xenbusb_back.0.%driver: xenbusb_back
dev.xenbusb_back.0.%desc: Xen Backend Devices
dev.xenbusb_back.%parent: 
dev.xn.0.xenstore_peer_path: /local/domain/0/backend/vif/109/0
dev.xn.0.xenbus_peer_domid: 0
dev.xn.0.xenbus_connection_state: Connected
dev.xn.0.xenbus_dev_type: vif
dev.xn.0.xenstore_path: device/vif/0
dev.xn.0.%parent: xenbusb_front0
dev.xbd.0.xenstore_peer_path: /local/domain/0/backend/vbd3/109/768
dev.xbd.0.xenbus_peer_domid: 0
dev.xbd.0.xenbus_connection_state: Connected
dev.xbd.0.xenbus_dev_type: vbd
dev.xbd.0.xenstore_path: device/vbd/768
dev.xbd.0.%parent: xenbusb_front0
dev.xenbusb_front.0.%parent: xenstore0
dev.xenbusb_front.0.%pnpinfo: 
dev.xenbusb_front.0.%location: 
dev.xenbusb_front.0.%driver: xenbusb_front
dev.xenbusb_front.0.%desc: Xen Frontend Devices
dev.xenbusb_front.%parent: 
dev.xenstore.0.%parent: xenpci0
dev.xenstore.0.%pnpinfo: 
dev.xenstore.0.%location: 
dev.xenstore.0.%driver: xenstore
dev.xenstore.0.%desc: XenStore
dev.xenstore.%parent: 
dev.xenpci.0.%parent: pci0
dev.xenpci.0.%pnpinfo: vendor=0x5853 device=0x0001 subvendor=0x5853 subdevice=0x0001 class=0x010000
dev.xenpci.0.%location: pci0:0:3:0 handle=\_SB_.PCI0.S18_
dev.xenpci.0.%driver: xenpci
dev.xenpci.0.%desc: Xen Platform Device
dev.xenpci.%parent: 
dev.xen_et.0.%parent: nexus0
dev.xen_et.0.%pnpinfo: 
dev.xen_et.0.%location: 
dev.xen_et.0.%driver: xen_et
dev.xen_et.0.%desc: Xen PV Clock
dev.xen_et.%parent: 
dev.xen.xsd_kva: 18446735281894703104
dev.xen.xsd_port: 17
dev.xen.balloon.high_mem: 0
dev.xen.balloon.low_mem: 0
dev.xen.balloon.hard_limit: 18446744073709551615
dev.xen.balloon.driver_pages: 0
dev.xen.balloon.target: 1048576
dev.xen.balloon.current: 1047552


If I choose "Other 64bit", I get the same (at least in FreeBSD 11RC2, which is what I could quickly switch-over).

xenbusb_back0: <Xen Backend Devices> on xenstore0
xbd0: 51200MB <Virtual Block Device> at device/vbd/832 on xenbusb_front0
xbd0: attaching as ada1
xbd0: features: write_barrier
xbd0: synchronize cache commands enabled.
xn0: backend features: feature-sg feature-gso-tcp4


What would you expect?
Comment 12 Sydney Meyer 2016-09-15 22:28:46 UTC
I have some 10.3 VMs running on Xen 4.4 with a Debian Linux 4.6 Dom0 and they give me:

dmesg

xenbusb_back0: <Xen Backend Devices> on xenstore0
xbd0: 5120MB <Virtual Block Device> at device/vbd/51712 on xenbusb_front0
xbd0: features: flush, write_barrier
xbd0: synchronize cache commands enabled.
xn0: backend features: feature-sg feature-gso-tcp4

sysctl -a | grep xen

kern.vm_guest: xen
device	xenpci
vfs.pfs.vncache.maxentries: 0
dev.xenbusb_back.0.%parent: xenstore0
dev.xenbusb_back.0.%pnpinfo: 
dev.xenbusb_back.0.%location: 
dev.xenbusb_back.0.%driver: xenbusb_back
dev.xenbusb_back.0.%desc: Xen Backend Devices
dev.xenbusb_back.%parent: 
dev.xn.0.xenstore_peer_path: /local/domain/0/backend/vif/3/0
dev.xn.0.xenbus_peer_domid: 0
dev.xn.0.xenbus_connection_state: Connected
dev.xn.0.xenbus_dev_type: vif
dev.xn.0.xenstore_path: device/vif/0
dev.xn.0.%parent: xenbusb_front0
dev.xbd.0.xenstore_peer_path: /local/domain/0/backend/vbd/3/51712
dev.xbd.0.xenbus_peer_domid: 0
dev.xbd.0.xenbus_connection_state: Connected
dev.xbd.0.xenbus_dev_type: vbd
dev.xbd.0.xenstore_path: device/vbd/51712
dev.xbd.0.%parent: xenbusb_front0
dev.xenbusb_front.0.%parent: xenstore0
dev.xenbusb_front.0.%pnpinfo: 
dev.xenbusb_front.0.%location: 
dev.xenbusb_front.0.%driver: xenbusb_front
dev.xenbusb_front.0.%desc: Xen Frontend Devices
dev.xenbusb_front.%parent: 
dev.xctrl.0.%parent: xenstore0
dev.xenstore.0.%parent: xenpci0
dev.xenstore.0.%pnpinfo: 
dev.xenstore.0.%location: 
dev.xenstore.0.%driver: xenstore
dev.xenstore.0.%desc: XenStore
dev.xenstore.%parent: 
dev.xenpci.0.%parent: pci0
dev.xenpci.0.%pnpinfo: vendor=0x5853 device=0x0001 subvendor=0x5853 subdevice=0x0001 class=0xff8000
dev.xenpci.0.%location: pci0:0:2:0 handle=\_SB_.PCI0.S2__
dev.xenpci.0.%driver: xenpci
dev.xenpci.0.%desc: Xen Platform Device
dev.xenpci.%parent: 
dev.xen_et.0.%parent: nexus0
dev.xen_et.0.%pnpinfo: 
dev.xen_et.0.%location: 
dev.xen_et.0.%driver: xen_et
dev.xen_et.0.%desc: Xen PV Clock
dev.xen_et.%parent: 
dev.xen.xsd_kva: 18446735281894703104
dev.xen.xsd_port: 3
dev.xen.balloon.high_mem: 0
dev.xen.balloon.low_mem: 0
dev.xen.balloon.hard_limit: 18446744073709551615
dev.xen.balloon.driver_pages: 0
dev.xen.balloon.target: 129024
dev.xen.balloon.current: 129024
Comment 13 Sydney Meyer 2016-09-15 22:35:10 UTC
I'm no expert in CloudStack, but perhaps something is off with the VM template. Did you try to install FreeBSD with some Linux template, or is there a possibility to do so?
Comment 14 rainer 2016-09-15 22:38:25 UTC
Well, I also get the "ada" thing if I use "Other 64 bit" - which is what we use for Linux installations and is supposed to be the "optimal" setting (HVM).
Comment 15 Roger Pau Monné freebsd_committer 2016-09-16 08:10:24 UTC
Whether you get "ada" or "xbd" devices depends on what you put in the guest configuration file. If the disk is attached as "xvd" it will show up as "xbd" in FreeBSD, and if it's attached as "hd" it will show up as "ada". As long as you see something like:

xbd0: 51200MB <Virtual Block Device> at device/vbd/832 on xenbusb_front0
xbd0: attaching as adaX

in dmesg, it means it's using the PV disks.
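
A small sketch of how that check could be automated: scan dmesg text for the "attaching as" lines and map each ada alias back to its xbd device. The sample text is taken from the lines quoted above; the function name is illustrative, and the regular expression assumes the message format shown in this thread.

```python
import re

DMESG_SAMPLE = """\
xbd0: 51200MB <Virtual Block Device> at device/vbd/832 on xenbusb_front0
xbd0: attaching as ada1
xbd0: features: write_barrier
"""

def pv_disk_aliases(dmesg_text: str) -> dict:
    """Map each alias (e.g. 'ada1') to the PV block device behind it (e.g. 'xbd0')."""
    pattern = re.compile(r"^(xbd\d+): attaching as (\w+)$", re.MULTILINE)
    return {alias: xbd for xbd, alias in pattern.findall(dmesg_text)}

aliases = pv_disk_aliases(DMESG_SAMPLE)
print(aliases)
```

Any ada device that does not appear in the result would not be backed by the PV driver according to this dmesg text.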

I will try to look into this, but it's not going to be now (I hope I will be able to get to it by the end of the month).
Comment 16 rainer 2016-09-16 08:17:29 UTC
OK, thanks for the clarification.
Comment 17 Roger Pau Monné freebsd_committer 2016-09-21 09:09:51 UTC
Hello,

Just a wild guess, but could you try to disable indirect descriptors to see if that makes a difference? AFAIK XenServer block backends don't implement it, but it doesn't hurt to try. Just add:

hw.xbd.xbd_enable_indirect="0"

To your /boot/loader.conf.

Thanks, Roger.
Comment 18 rainer 2016-09-21 12:44:53 UTC
Hi,

Thanks, but it does not make a noticeable difference, neither for the dc3dd wipe test nor for my real-world test case.

I can create a tenant on CloudStack, so you can try it yourself - if you want.
But I have no problem running tests, commands etc. for you, either - I can devote (almost) as much time on this as it takes.
Comment 19 rainer 2016-11-16 14:11:01 UTC
Updating to ScaleIO 2.0.3 (and all the latest Hotfixes of XenServer 6.5) doesn't make a difference.

How would one debug this problem?
Comment 20 rainer 2017-02-03 16:16:32 UTC
Interestingly enough, even when the backend storage is an SSD-backed ScaleIO volume (PCIe NVMe), it's not faster.

Linux is faster on SSDs.
Comment 21 rainer 2017-02-06 09:15:13 UTC
Still a problem on FreeBSD 12:

root@f12test:~ # dc3dd wipe=/dev/ada1

dc3dd 7.2.641 started at 2017-02-06 10:12:31 +0100
compiled options:
command line: dc3dd wipe=/dev/ada1
device size: 104857600 sectors (probed),   53,687,091,200 bytes
sector size: 512 bytes (probed)
  1153433600 bytes ( 1.1 G ) copied (  2% ),  131 s, 8.4 M/s                    

input results for pattern `00':
   2252800 sectors in

output results for device `/dev/ada1':
   2252800 sectors out

dc3dd aborted at 2017-02-06 10:14:42 +0100

root@f12test:~ # uname -a
FreeBSD f12test 12.0-CURRENT FreeBSD 12.0-CURRENT #0 r313113: Fri Feb  3 01:47:24 UTC 2017     root@releng3.nyi.freebsd.org:/usr/obj/usr/src/sys/GENERIC  amd64
Comment 22 Roger Pau Monné freebsd_committer 2017-02-07 15:37:09 UTC
Hello,

I don't have much time to look into this right now. Could you try to create a flamegraph [0] of this workload? That way we might be able to identify the bottleneck(s). If possible you should create the flamegraph with pmcstat instead of dtrace, since the result is going to be much more accurate (especially when running inside of a VM).

[0] http://www.brendangregg.com/blog/2015-03-10/freebsd-flame-graphs.html
Comment 23 rainer 2017-02-07 17:01:13 UTC
Which event specifier should I use?

I can't even run the sample:

(freebsd11 </root>) 0 # pmcstat –S RESOURCE_STALLS.ANY -O out.pmcstat sleep 10
pmcstat: [options] [commandline]
         Measure process and/or system performance using hardware
         performance monitoring counters.
         Options include:
         -C              (toggle) show cumulative counts
         -D path         create profiles in directory "path"
         -E              (toggle) show counts at process exit
         -F file         write a system-wide callgraph (Kcachegrind format) to "file"
         -G file         write a system-wide callgraph to "file"
         -M file         print executable/gmon file map to "file"
         -N              (toggle) capture callchains
         -O file         send log output to "file"
         -P spec         allocate a process-private sampling PMC
         -R file         read events from "file"
         -S spec         allocate a system-wide sampling PMC
         -T              start in top mode
         -W              (toggle) show counts per context switch
         -a file         print sampled PCs and callgraph to "file"
         -c cpu-list     set cpus for subsequent system-wide PMCs
         -d              (toggle) track descendants
         -e              use wide history counter for gprof(1) output
         -f spec         pass "spec" to as plugin option
         -g              produce gprof(1) compatible profiles
         -k dir          set the path to the kernel
         -l secs         set duration time
         -m file         print sampled PCs to "file"
         -n rate         set sampling rate
         -o file         send print output to "file"
         -p spec         allocate a process-private counting PMC
         -q              suppress verbosity
         -r fsroot       specify FS root directory
         -s spec         allocate a system-wide counting PMC
         -t process-spec attach to running processes matching "process-spec"
         -v              increase verbosity
         -w secs         set printing time interval
         -z depth        limit callchain display depth
Comment 24 rainer 2017-02-07 17:15:02 UTC
ok,

(freebsd11 </root>) 64 # pmccontrol -L
SOFT
        CLOCK.PROF
        CLOCK.HARD
        CLOCK.STAT
        LOCK.FAILED
        PAGE_FAULT.ALL
        PAGE_FAULT.READ
        PAGE_FAULT.WRITE


Is there a way to get virtualized hardware performance counters in a DomU?
Comment 25 Roger Pau Monné freebsd_committer 2017-02-07 17:19:20 UTC
IIRC this was working fine last time I tried. Have you loaded the pmc module (kldload pmc), and which CPU are you using? Note that you also need to enable the PMU support in Xen [0] by passing vpmu=1 on the Xen command line.

PMC.HASWELL(3) lists RESOURCE_STALLS.ANY as a valid event. If that doesn't work (or is too complicated to set up) I would try with dtrace, and let's see what we get.

[0] http://xenbits.xenproject.org/docs/unstable/misc/xen-command-line.html
Comment 26 rainer 2017-02-07 17:25:45 UTC
Created attachment 179716
flamegraph dtrace

This is the result of running the example on Brendan's page with dtrace while the VM was processing a dc3dd wipe command.
Comment 27 rainer 2017-02-07 17:30:12 UTC
(In reply to Roger Pau Monné from comment #25)
(freebsd11 <FlameGraph>) 0 # kldload pmc
kldload: can't load pmc: module already loaded or in kernel

(freebsd11 <FlameGraph>) 1 # kldstat 
Id Refs Address            Size     Name
 1   49 0xffffffff80200000 1fa7c38  kernel
 2    1 0xffffffff821a9000 30aec0   zfs.ko
 3   11 0xffffffff824b4000 adc0     opensolaris.ko
 4    1 0xffffffff824bf000 1620     accf_data.ko
 5    1 0xffffffff824c1000 2710     accf_http.ko
 6    1 0xffffffff824c4000 3a78     cc_htcp.ko
 7    1 0xffffffff82619000 587b     fdescfs.ko
 8    1 0xffffffff8261f000 3710     ums.ko
 9    1 0xffffffff82623000 665d     nullfs.ko
10    1 0xffffffff8262a000 adec     tmpfs.ko
11    1 0xffffffff82635000 1bb42    hwpmc.ko
12    1 0xffffffff82651000 848      dtraceall.ko
13    9 0xffffffff82652000 3d890    dtrace.ko
14    1 0xffffffff82690000 4860     dtmalloc.ko
15    1 0xffffffff82695000 5aef     dtnfscl.ko
16    1 0xffffffff8269b000 6832     fbt.ko
17    1 0xffffffff826a2000 585be    fasttrap.ko
18    1 0xffffffff826fb000 172e     sdt.ko
19    1 0xffffffff826fd000 cf3d     systrace.ko
20    1 0xffffffff8270a000 cd44     systrace_freebsd32.ko
21    1 0xffffffff82717000 535e     profile.ko

Do you have an idea what has to be done on XenServer to enable PMU?

In any case, I cannot do any configuration on Dom0. My coworker would have to do that tomorrow (CET)


Best Regards
Rainer
Comment 28 Roger Pau Monné freebsd_committer 2017-02-07 18:19:46 UTC
(In reply to rainer from comment #26)
Thanks!

This shows that the guest is mostly inactive (low CPU load), is this correct?

I'm attaching a patch to add some debug output to blkfront. Please be aware that things might get noisy. Can you post the dmesg after running your workload with this patch?

Thanks, Roger.
Comment 29 Roger Pau Monné freebsd_committer 2017-02-07 18:20:36 UTC
Created attachment 179717
Initial debug patch
Comment 30 rainer 2017-02-08 08:28:18 UTC
Hi,

I compiled a new kernel with this.
Where would the messages show up?

Anything special I need to add to GENERIC?
Or a flag at booting?

Sorry to sound so dumb. I stopped paying attention to building FreeBSD once binary patches became available...
Comment 31 Roger Pau Monné freebsd_committer 2017-02-08 09:41:48 UTC
You should see the messages in dmesg (if any), just execute:

# dmesg

As root from the console after having run your workload.
Comment 32 rainer 2017-02-08 12:05:19 UTC
No, can't see anything.

I believe I compiled the kernel correctly:

(freebsd11 </root>) 0 # strings /boot/kernel/kernel|grep Freez
Sequencer On QFreeze and Complete list: 
Chan %d Freeze simq (loopdown)
%s: Freezing devq for target ID %d
Freezing with flag: %d count: %d used ring slots: %u


from dmesg

xn0: <Virtual Network Interface> at device/vif/0 on xenbusb_front0
xn0: bpf attached
xn0: Ethernet address: 02:00:2b:42:00:12
random: harvesting attach, 8 bytes (4 bits) from xn0
random: harvesting attach, 8 bytes (4 bits) from xenbusb_front0
xenbusb_back0: <Xen Backend Devices> on xenstore0
random: harvesting attach, 8 bytes (4 bits) from xenbusb_back0
xbd1: 51200MB <Virtual Block Device> at device/vbd/832 on xenbusb_front0
xbd1: attaching as ada1
xbd1: features: write_barrier
xbd1: synchronize cache commands enabled.
xbd2: 25600MB <Virtual Block Device> at device/vbd/768 on xenbusb_front0
xbd2: attaching as ada0
xbd2: features: write_barrier
xbd2: synchronize cache commands enabled.
xn0: backend features: feature-sg feature-gso-tcp4
lapic2: CMCI unmasked
SMP: AP CPU #1 Launched!
cpu1 AP:
     ID: 0x02000000   VER: 0x00050014 LDR: 0x00000000 DFR: 0xffffffff x2APIC: 0
  lint0: 0x00010700 lint1: 0x00000400 TPR: 0x00000000 SVR: 0x000001ff
  timer: 0x000100ef therm: 0x00010000 err: 0x000000f0 pmc: 0x00010400
ioapic0: routing intpin 1 (ISA IRQ 1) to lapic 2 vector 48
ioapic0: routing intpin 7 (ISA IRQ 7) to lapic 2 vector 49
ioapic0: routing intpin 12 (ISA IRQ 12) to lapic 2 vector 50
ioapic0: routing intpin 15 (ISA IRQ 15) to lapic 2 vector 51
TSC timecounter discards lower 1 bit(s)
Timecounter "TSC-low" frequency 1296782624 Hz quality -100
Trying to mount root from zfs:zroot/ROOT/default []...
GEOM: new disk ada1
GEOM: new disk ada0
ugen0.2: <QEMU 0.10.2> at usbus0
start_init: trying /sbin/init
xn0: 2 link states coalesced
xn0: link state changed to UP
ums0: <Endpoint1 Interrupt Pipe> on usbus0
ums0: 3 buttons and [Z] coordinates ID=0
random: harvesting attach, 8 bytes (4 bits) from ums0
hwpc_core: unknown PMC architecture: 0
hwpmc: SOFT/16/64/0x67<INT,USR,SYS,REA,WRI>
Comment 33 Roger Pau Monné freebsd_committer 2017-02-10 14:18:00 UTC
(In reply to rainer from comment #32)
That was a possible outcome. I have a box half set up for this; I will try to reproduce it tomorrow (Saturday) and see if I can get any useful data. As a last thing, could you post the output of `vmstat -ai` just after running your workload?
Comment 34 rainer 2017-02-10 16:35:07 UTC
Created attachment 179837
vmstat -ai before running dc3dd
Comment 35 rainer 2017-02-10 16:35:34 UTC
Created attachment 179838
vmstat -ai while running dc3dd
Comment 36 rainer 2017-02-10 16:36:36 UTC
Hi,

I added the output.

As I said, I could give ssh access to the box, if you want.
I would need your ssh key.
Comment 37 Roger Pau Monné freebsd_committer 2017-02-10 16:42:00 UTC
(In reply to Roger Pau Monné from comment #33)
So I've run the dc3dd test on a FreeBSD VM, with 4 vCPUs and 4GB of RAM, against a block device on a spinning disk, nothing fancy, and this is what I get:

(FreeBSD guest) # dc3dd wipe=/dev/ada1

dc3dd 7.2.641 started at 2017-02-10 16:19:54 +0000
compiled options:
command line: dc3dd wipe=/dev/ada1
device size: 20971520 sectors (probed),   10,737,418,240 bytes
sector size: 512 bytes (probed)
 10737418240 bytes ( 10 G ) copied ( 100% ),  125 s, 82 M/s

input results for pattern `00':
   20971520 sectors in

output results for device `/dev/ada1':
   20971520 sectors out

dc3dd completed at 2017-02-10 16:21:58 +0000

Then I shut down the guest, and ran the same test from Dom0 (Debian Linux 3.16) against the same exact block device, and this is what I get:

(Linux Dom0) # dc3dd wipe=/dev/dt51-vg/test

dc3dd 7.2.641 started at 2017-02-10 16:31:34 +0000
compiled options:
command line: dc3dd wipe=/dev/dt51-vg/test
device size: 20971520 sectors (probed),   10,737,418,240 bytes
sector size: 512 bytes (probed)
 10737418240 bytes ( 10 G ) copied ( 100% ),  114 s, 90 M/s

input results for pattern `00':
   20971520 sectors in

output results for device `/dev/dt51-vg/test':
   20971520 sectors out

dc3dd completed at 2017-02-10 16:33:28 +0000

So there's a difference of less than 10 M/s, which I think is fine (in the end there's always some overhead).
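
Expressed as a percentage, the overhead between the two runs above works out as follows. This is a minimal sketch using the rounded M/s figures dc3dd printed; the function name is illustrative.

```python
def overhead_pct(guest_rate: float, host_rate: float) -> float:
    """Throughput lost in the guest, as a percentage of the host (Dom0) rate."""
    return (host_rate - guest_rate) / host_rate * 100

# 82 M/s in the FreeBSD guest, 90 M/s from Dom0 against the same block device
loss = overhead_pct(82, 90)
print(f"{loss:.1f}% slower in the guest")
```

That is a loss of a bit under 10%, far from the factor of five in the original report.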

I cannot really explain the results that you get, but I also cannot reproduce your entire setup here. This is using the 12.0-CURRENT snapshot from 20170203. I'm afraid that unless we find a way for me to reproduce this, there's no way that I can try to fix it.
Comment 38 rainer 2017-02-10 17:01:19 UTC
Well, for some it works, for some it doesn't.

The 10% I also see when writing to a RAM-disk.

I'd just like to know how I can determine where all the performance is lost.
Comment 39 Roger Pau Monné freebsd_committer 2017-02-10 17:05:19 UTC
(In reply to rainer from comment #36)
There's clearly something wrong there: you are not receiving as many interrupts as you should be. This is what I usually see when running dc3dd:

irq808: xen_et0:c0                 17545         93
irq809: xen_et0:c1                 73460        391
irq810: xen_et0:c2                 65527        349
irq811: xen_et0:c3                 73980        394
irq814: xbd1                      314436       1674

(note that xbd1 is the disk against which the dc3dd is run)

In your case this is:

irq768: xen_et0:c0               4061306         20
irq769: xen_et0:c1               1951430         10
irq773: xbd0                     1652038          8
irq774: xbd1                          29          0
irq775: xbd2                      159503          1

Note the difference in the rate of interrupts (from 1674 in my case).

Can you also post the results of running `sysctl -a | grep xbd` inside the FreeBSD guest?

Thanks!
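To make this kind of comparison less error-prone, the interrupt lines can be parsed instead of read by eye. The following is a rough sketch that assumes the three-column layout shown above (name, total, rate); the helper name and the sample constants are illustrative only. Note that, as far as I know, the rate column of vmstat -i is an average since boot, so a long uptime dilutes it; comparing the totals from a "before" and a "while running" capture is probably more telling.

```python
def parse_vmstat_ai(text: str) -> dict:
    """Map interrupt name -> (total, rate) from `vmstat -ai` style lines."""
    stats = {}
    for line in text.splitlines():
        parts = line.rsplit(None, 2)  # name may contain spaces; counters are last
        if len(parts) != 3 or not (parts[1].isdigit() and parts[2].isdigit()):
            continue  # skip headers, blank lines and anything unexpected
        stats[parts[0].strip()] = (int(parts[1]), int(parts[2]))
    return stats


# Lines copied from the two listings quoted in this comment.
SAMPLE_OK = """\
irq808: xen_et0:c0                 17545         93
irq814: xbd1                      314436       1674
"""
SAMPLE_SLOW = """\
irq773: xbd0                     1652038          8
irq774: xbd1                          29          0
irq775: xbd2                      159503          1
"""

for label, sample in (("reference", SAMPLE_OK), ("reported", SAMPLE_SLOW)):
    for name, (total, rate) in parse_vmstat_ai(sample).items():
        if "xbd" in name:
            print(f"{label:9} {name:14} total={total:8} rate={rate}/s")
```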
Comment 40 rainer 2017-02-10 17:12:39 UTC
Here:

(freebsd11 </root>) 1 # sysctl -a |grep xbd
hw.xbd.xbd_enable_indirect: 0
dev.xbd.2.xenstore_peer_path: /local/domain/0/backend/vbd3/13/768
dev.xbd.2.xenbus_peer_domid: 0
dev.xbd.2.xenbus_connection_state: Connected
dev.xbd.2.xenbus_dev_type: vbd
dev.xbd.2.xenstore_path: device/vbd/768
dev.xbd.2.features: write_barrier
dev.xbd.2.ring_pages: 1
dev.xbd.2.max_request_size: 40960
dev.xbd.2.max_request_segments: 11
dev.xbd.2.max_requests: 32
dev.xbd.2.%parent: xenbusb_front0
dev.xbd.2.%pnpinfo: 
dev.xbd.2.%location: 
dev.xbd.2.%driver: xbd
dev.xbd.2.%desc: Virtual Block Device
dev.xbd.1.xenstore_peer_path: /local/domain/0/backend/vbd3/13/832
dev.xbd.1.xenbus_peer_domid: 0
dev.xbd.1.xenbus_connection_state: Connected
dev.xbd.1.xenbus_dev_type: vbd
dev.xbd.1.xenstore_path: device/vbd/832
dev.xbd.1.features: write_barrier
dev.xbd.1.ring_pages: 1
dev.xbd.1.max_request_size: 40960
dev.xbd.1.max_request_segments: 11
dev.xbd.1.max_requests: 32
dev.xbd.1.%parent: xenbusb_front0
dev.xbd.1.%pnpinfo: 
dev.xbd.1.%location: 
dev.xbd.1.%driver: xbd
dev.xbd.1.%desc: Virtual Block Device
dev.xbd.0.xenstore_peer_path: /local/domain/0/backend/vbd3/13/5632
dev.xbd.0.xenbus_peer_domid: 0
dev.xbd.0.xenbus_connection_state: Connected
dev.xbd.0.xenbus_dev_type: vbd
dev.xbd.0.xenstore_path: device/vbd/5632
dev.xbd.0.features: write_barrier
dev.xbd.0.ring_pages: 1
dev.xbd.0.max_request_size: 40960
dev.xbd.0.max_request_segments: 11
dev.xbd.0.max_requests: 32
dev.xbd.0.%parent: xenbusb_front0
dev.xbd.0.%pnpinfo: 
dev.xbd.0.%location: 
dev.xbd.0.%driver: xbd
dev.xbd.0.%desc: Virtual Block Device
dev.xbd.%parent:
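For anyone reading these values, a small sketch of how they could be turned into a rough per-device ceiling on outstanding I/O. It assumes that max_requests times max_request_size bounds the data in flight on one ring; that is plain arithmetic on the reported numbers, not a diagnosis. The helper name is made up, and the sample is a subset of the output above.

```python
def parse_sysctl(text: str) -> dict:
    """Parse `sysctl` output of the form 'name: value' into a dict of strings."""
    values = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # ignore lines without a colon
            values[key.strip()] = value.strip()
    return values


SAMPLE = """\
hw.xbd.xbd_enable_indirect: 0
dev.xbd.1.ring_pages: 1
dev.xbd.1.max_request_size: 40960
dev.xbd.1.max_request_segments: 11
dev.xbd.1.max_requests: 32
"""

values = parse_sysctl(SAMPLE)
in_flight = (int(values["dev.xbd.1.max_requests"])
             * int(values["dev.xbd.1.max_request_size"]))
print(f"xbd1: at most {in_flight} bytes ({in_flight / 2**20:.2f} MiB) outstanding")
```

With 32 requests of 40960 bytes each this comes to 1.25 MiB per device, and all three devices report the same limits.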
Comment 41 Roger Pau Monné freebsd_committer 2017-02-10 17:56:21 UTC
Can you try changing the event timer and the time counter to something other than the Xen ones:

# sysctl -w kern.timecounter.hardware=ACPI-fast
# sysctl -w kern.eventtimer.timer=LAPIC

And finally I'm also attaching a patch that disables all the fancy PV stuff completely. If the changes above don't make a difference, could you also patch your kernel with it and see whether that helps?

Thanks, Roger.
Comment 42 Roger Pau Monné freebsd_committer 2017-02-10 17:56:57 UTC
Created attachment 179844 [details]
Disable all the Xen enlightments
Comment 43 rainer 2017-02-10 23:10:20 UTC
# sysctl -w kern.timecounter.hardware=ACPI-fast

Already had that.
IIRC, it's mentioned in the bug about moving a VM freezing it...

# sysctl -w kern.eventtimer.timer=LAPIC


and that freezes the machine.

The patch - it leads to the machine freezing rather early in the boot-process.
Comment 44 rainer 2017-02-10 23:11:30 UTC
Created attachment 179858 [details]
Screenshot of panic
Comment 45 Roger Pau Monné freebsd_committer 2017-02-11 08:55:53 UTC
(In reply to rainer from comment #43)
Hm, that's certainly not good. Switching to the LAPIC timer shouldn't cause the VM to freeze; I've tried it and it works just fine. Do you see anything in the console when the VM freezes?
Comment 46 Roger Pau Monné freebsd_committer 2017-02-11 09:11:49 UTC
(In reply to rainer from comment #44)
This panic trace is very disturbing, I'm a little bit confused. Which kind of guest are you running?

The trace shows xen_start -> hammer_time_xen and this path should _never_ be used by an HVM guest; it is only used by PVH guests. Can you install addr2line from ports (it's in binutils) and run the following inside of the VM:

# /usr/local/bin/addr2line -e /usr/lib/debug/boot/kernel/kernel.debug 0xffffffff81109cbb

Is there also any chance that you could get the full boot output from the VM by connecting to its serial console (and setting up the console inside of the guest)? See https://www.freebsd.org/doc/handbook/serialconsole-setup.html

Thanks.
Comment 47 Roger Pau Monné freebsd_committer 2017-02-11 09:50:28 UTC
Created attachment 179869 [details]
Disable all the Xen enlightments

Patch to disable all Xen enlightenments, this time tested.
Comment 48 Roger Pau Monné freebsd_committer 2017-02-11 09:51:25 UTC
(In reply to Roger Pau Monné from comment #46)
OK, I've re-done the patch to disable the Xen enlightenments, could you please try it again? Although the LAPIC timer issue is also concerning.
Comment 49 rainer 2017-02-11 14:29:57 UTC
I switched back the OS-type to FreeBSD 10 64bit.
I also booted back into a stock kernel and then the XENTIMER-LAPIC change went through without a freeze.


I recompiled (a clean source-tree) with your patch and now I get about 26MB/s.

dmesg:
Copyright (c) 1992-2016 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
	The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 11.0-RELEASE-p7 #2: Sat Feb 11 14:46:26 CET 2017
    root@freebsd11:/usr/obj/usr/src/sys/GENERIC amd64
FreeBSD clang version 3.8.0 (tags/RELEASE_380/final 262564) (based on LLVM 3.8.0)
VT(vga): text 80x25
CPU: Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz (2593.55-MHz K8-class CPU)
  Origin="GenuineIntel"  Id=0x306e4  Family=0x6  Model=0x3e  Stepping=4
  Features=0x783fbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,MMX,FXSR,SSE,SSE2>
  Features2=0xc3ba2203<SSE3,PCLMULQDQ,SSSE3,CX16,PCID,SSE4.1,SSE4.2,x2APIC,POPCNT,TSCDLT,AESNI,RDRAND,HV>
  AMD Features=0x28100800<SYSCALL,NX,RDTSCP,LM>
  AMD Features2=0x1<LAHF>
  Structured Extended Features=0x200<ERMS>
Hypervisor: Origin = "XenVMMXenVMM"
real memory  = 8585740288 (8188 MB)
avail memory = 8265371648 (7882 MB)
Event timer "LAPIC" quality 400
ACPI APIC Table: <Xen HVM>
FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
FreeBSD/SMP: 2 package(s)
random: unblocking device.
ioapic0: Changing APIC ID to 1
MADT: Forcing active-low polarity and level trigger for SCI
ioapic0 <Version 1.1> irqs 0-47 on motherboard
random: entropy device external interface
kbd1 at kbdmux0
netmap: loaded module
module_register_init: MOD_LOAD (vesa, 0xffffffff8101c970, 0) error 19
random: registering fast source Intel Secure Key RNG
random: fast provider: "Intel Secure Key RNG"
vtvga0: <VT VGA driver> on motherboard
cryptosoft0: <software crypto> on motherboard
acpi0: <Xen> on motherboard
acpi0: Power Button (fixed)
acpi0: Sleep Button (fixed)
cpu0: <ACPI CPU> on acpi0
cpu1: <ACPI CPU> on acpi0
hpet0: <High Precision Event Timer> iomem 0xfed00000-0xfed003ff on acpi0
Timecounter "HPET" frequency 62500000 Hz quality 950
attimer0: <AT timer> port 0x40-0x43 irq 0 on acpi0
Timecounter "i8254" frequency 1193182 Hz quality 0
Event timer "i8254" frequency 1193182 Hz quality 100
atrtc0: <AT realtime clock> port 0x70-0x71 irq 8 on acpi0
Event timer "RTC" frequency 32768 Hz quality 0
Timecounter "ACPI-fast" frequency 3579545 Hz quality 900
acpi_timer0: <32-bit timer at 3.579545MHz> port 0xb008-0xb00b on acpi0
pcib0: <ACPI Host-PCI bridge> port 0xcf8-0xcff on acpi0
pci0: <ACPI PCI bus> on pcib0
isab0: <PCI-ISA bridge> at device 1.0 on pci0
isa0: <ISA bus> on isab0
atapci0: <Intel PIIX3 WDMA2 controller> port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xc220-0xc22f at device 1.1 on pci0
ata0: <ATA channel> at channel 0 on atapci0
ata1: <ATA channel> at channel 1 on atapci0
uhci0: <Intel 82371SB (PIIX3) USB controller> port 0xc200-0xc21f irq 23 at device 1.2 on pci0
usbus0: controller did not stop
usbus0 on uhci0
pci0: <bridge> at device 1.3 (no driver attached)
vgapci0: <VGA-compatible display> mem 0xf0000000-0xf1ffffff,0xf3000000-0xf3000fff irq 24 at device 2.0 on pci0
vgapci0: Boot video device
pci0: <mass storage, SCSI> at device 3.0 (no driver attached)
re0: <RealTek 8139C+ 10/100BaseTX> port 0xc100-0xc1ff mem 0xf3001000-0xf30010ff irq 32 at device 4.0 on pci0
re0: Chip rev. 0x74800000
re0: MAC rev. 0x00000000
miibus0: <MII bus> on re0
rlphy0: <RealTek internal media interface> PHY 0 on miibus0
rlphy0:  10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto, auto-flow
re0: Using defaults for TSO: 65518/35/2048
re0: Ethernet address: 02:00:2b:42:00:12
re0: netmap queues/slots: TX 1/64, RX 1/64
atkbdc0: <Keyboard controller (i8042)> port 0x60,0x64 irq 1 on acpi0
atkbd0: <AT Keyboard> irq 1 on atkbdc0
kbd0 at atkbd0
atkbd0: [GIANT-LOCKED]
psm0: <PS/2 Mouse> irq 12 on atkbdc0
psm0: [GIANT-LOCKED]
psm0: model IntelliMouse Explorer, device ID 4
fdc0: <floppy drive controller> port 0x3f0-0x3f5,0x3f7 irq 6 drq 2 on acpi0
fdc0: does not respond
device_attach: fdc0 attach returned 6
uart0: <16550 or compatible> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0
ppc0: <Parallel port> port 0x378-0x37f irq 7 on acpi0
ppc0: Generic chipset (NIBBLE-only) in COMPATIBLE mode
ppbus0: <Parallel port bus> on ppc0
lpt0: <Printer> on ppbus0
lpt0: Interrupt-driven port
ppi0: <Parallel I/O> on ppbus0
vga0: <Generic ISA VGA> at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
fdc0: No FDOUT register!
ZFS filesystem version: 5
ZFS storage pool version: features support (5000)
Timecounters tick every 1.000 msec
nvme cam probe device init
usbus0: 12Mbps Full Speed USB v1.0
ugen0.1: <Intel> at usbus0
uhub0: <Intel UHCI root HUB, class 9/0, rev 1.00/1.00, addr 1> on usbus0
ada0 at ata0 bus 0 scbus0 target 0 lun 0
ada0: <QEMU HARDDISK 0.10.2> ATA-7 device
ada0: Serial Number QM00001
ada0: 16.700MB/s transfers (WDMA2, PIO 8192bytes)
ada0: 25600MB (52428800 512 byte sectors)
uhub0: 2 ports with 2 removable, self powered
ada1 at ata0 bus 0 scbus0 target 1 lun 0
ada1: <QEMU HARDDISK 0.10.2> ATA-7 device
ada1: Serial Number QM00002
ada1: 16.700MB/s transfers (WDMA2, PIO 8192bytes)
ada1: 51200MB (104857600 512 byte sectors)
ada2 at ata1 bus 0 scbus1 target 0 lun 0
ada2: <QEMU HARDDISK 0.10.2> ATA-7 device
ada2: Serial Number QM00003
ada2: 16.700MB/s transfers (WDMA2, PIO 8192bytes)
ada2: 51200MB (104857600 512 byte sectors)
cd0 at ata1 bus 0 scbus1 target 1 lun 0
cd0: <QEMU QEMU DVD-ROM 0.10> Removable CD-ROM SCSI device
SMP: AP CPU #1 Launched!
cd0: Serial Number QM00004
cd0: 16.700MB/s transfers (WDMA2, ATAPI 12bytes, PIO 65534bytes)
cd0: Attempt to query device size failed: NOT READY, Medium not present
Trying to mount root from zfs:zroot/ROOT/default []...
Root mount waiting for: usbus0
ugen0.2: <QEMU 0.10.2> at usbus0
re0: link state changed to UP
ums0: <Endpoint1 Interrupt Pipe> on usbus0
ums0: 3 buttons and [Z] coordinates ID=0

From what I know, there is no configuration-file for the guests in XenServer - it's stored in a DB. Is there still a way to extract how the VM is configured on the Dom0-side?
Comment 50 rainer 2017-02-13 07:58:24 UTC
Output from the xen-server:

 xe vm-list name-label=i-129-1591-VM params=all
uuid ( RO)                          : 30f354c0-6a14-0dea-2be2-070a38ca2fc0
                    name-label ( RW): i-129-1591-VM
              name-description ( RW): Template which allows VM installation from install media
                  user-version ( RW): 1
                 is-a-template ( RW): false
                 is-a-snapshot ( RO): false
                   snapshot-of ( RO): <not in database>
                     snapshots ( RO): 
                 snapshot-time ( RO): 19700101T00:00:00Z
                 snapshot-info ( RO): 
                        parent ( RO): <not in database>
                      children ( RO): 
             is-control-domain ( RO): false
                   power-state ( RO): running
                 memory-actual ( RO): 8589938688
                 memory-target ( RO): <expensive field>
               memory-overhead ( RO): 71303168
             memory-static-max ( RW): 8589934592
            memory-dynamic-max ( RW): 8589934592
            memory-dynamic-min ( RW): 8589934592
             memory-static-min ( RW): 8589934592
              suspend-VDI-uuid ( RW): <not in database>
               suspend-SR-uuid ( RW): <not in database>
                  VCPUs-params (MRW): weight: 51; cap: 160
                     VCPUs-max ( RW): 2
              VCPUs-at-startup ( RW): 2
        actions-after-shutdown ( RW): Destroy
          actions-after-reboot ( RW): Restart
           actions-after-crash ( RW): Destroy
                 console-uuids (SRO): 502503d4-ace2-f4c5-d6f7-5b7750ab5b18
                      platform (MRW): timeoffset: -2; viridian: false; acpi: 1; apic: true; pae: true; nx: true
            allowed-operations (SRO): changing_dynamic_range; hard_reboot; hard_shutdown; pause; snapshot
            current-operations (SRO): 
            blocked-operations (MRW): 
           allowed-VBD-devices (SRO): <expensive field>
           allowed-VIF-devices (SRO): <expensive field>
                possible-hosts ( RO): <expensive field>
               HVM-boot-policy ( RW): BIOS order
               HVM-boot-params (MRW): order: dc
         HVM-shadow-multiplier ( RW): 1.000
                     PV-kernel ( RW): 
                    PV-ramdisk ( RW): 
                       PV-args ( RW): 
                PV-legacy-args ( RW): 
                 PV-bootloader ( RW): 
            PV-bootloader-args ( RW): 
           last-boot-CPU-flags ( RO): vendor: GenuineIntel; features: 77bee3ff-bfebfbff-00000001-2c100800
              last-boot-record ( RO): <expensive field>
                   resident-on ( RO): b2615dd7-9308-4e9c-912b-1f302920ea63
                      affinity ( RW): b2615dd7-9308-4e9c-912b-1f302920ea63
                  other-config (MRW): vgpu_pci: ; mac_seed: ed5148a2-d9cd-b180-697d-e9998e496ee1; install-methods: cdrom; vm_uuid: d61de64b-8c67-4c9c-82ab-71a197ff1530
                        dom-id ( RO): 52
               recommendations ( RO): <restrictions><restriction field="memory-static-max" max="137438953472" /><restriction field="vcpus-max" max="16" /><restriction property="number-of-vbds" max="16" /><restriction property="number-of-vifs" max="7" /></restrictions>
                 xenstore-data (MRW): vm-data: 
    ha-always-run ( RW) [DEPRECATED]: false
           ha-restart-priority ( RW): 
                         blobs ( RO): 
                    start-time ( RO): 20170211T14:23:10Z
                  install-time ( RO): 19700101T00:00:00Z
                  VCPUs-number ( RO): 2
             VCPUs-utilisation (MRO): <expensive field>
                    os-version (MRO): <not in database>
            PV-drivers-version (MRO): <not in database>
         PV-drivers-up-to-date ( RO): <not in database>
                        memory (MRO): <not in database>
                         disks (MRO): <not in database>
                      networks (MRO): <not in database>
                         other (MRO): <not in database>
                          live ( RO): <not in database>
    guest-metrics-last-updated ( RO): <not in database>
      cooperative ( RO) [DEPRECATED]: <expensive field>
                          tags (SRW): 
                     appliance ( RW): <not in database>
                   start-delay ( RW): 0
                shutdown-delay ( RW): 0
                         order ( RW): 0
                       version ( RO): 0
                 generation-id ( RO): 
     hardware-platform-version ( RO): 0
Comment 51 Roger Pau Monné freebsd_committer 2017-02-13 09:49:43 UTC
(In reply to rainer from comment #49)
So performance is slightly better with this patch? (IIRC you were getting 17M/s and with the patch you get 26M/s)
Comment 52 rainer 2017-02-13 09:59:24 UTC
Yes.
So, 60%-70% increase.
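For reference, the relative gain can be worked out from the two figures quoted in comment 51 (17 M/s recalled from memory, 26 M/s measured). A small sketch, with an illustrative function name:

```python
def percent_increase(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


gain = percent_increase(17, 26)
print(f"{gain:.0f}% increase")
```

With exactly 17 and 26 this gives about 53%; an increase of 60%-70% would correspond to a baseline nearer 15-16 M/s, so the figures are in the same ballpark.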
Comment 53 Roger Pau Monné freebsd_committer 2017-02-13 10:14:50 UTC
I'm attaching another patch that allows some PV optimizations to be disabled selectively. You will have to play with the following tunables and see if you can find which one(s) cause the VM to go faster:

hw.xen.disable_pv_ipi
hw.xen.disable_pv_et
hw.xen.disable_pv_tc
hw.xen.disable_pv_clk
hw.xen.disable_pv_disks
hw.xen.disable_pv_nics

Those tunables should be set in /boot/loader.conf, like:

hw.xen.disable_pv_ipi=1

Thanks, Roger.
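One way to work through the tunables systematically is to test them one at a time, rebooting between runs. The sketch below only generates the /boot/loader.conf fragments for such a sweep; it assumes the patch from this bug is applied (the tunables do not exist otherwise), and the function name is made up for illustration.

```python
TUNABLES = [
    "hw.xen.disable_pv_ipi",
    "hw.xen.disable_pv_et",
    "hw.xen.disable_pv_tc",
    "hw.xen.disable_pv_clk",
    "hw.xen.disable_pv_disks",
    "hw.xen.disable_pv_nics",
]


def loader_conf_variants(tunables):
    """Yield (tunable, fragment) pairs with exactly one tunable enabled.

    The other tunables are kept as comments so each fragment documents
    what was left untouched in that run.
    """
    for active in tunables:
        lines = [
            f"{name}=1" if name == active else f"# {name}=1"
            for name in tunables
        ]
        yield active, "\n".join(lines) + "\n"


for name, fragment in loader_conf_variants(TUNABLES):
    print(f"--- run with {name} ---")
    print(fragment)
```

Each fragment would be appended to /boot/loader.conf for one boot, the benchmark run, and the result noted next to the tunable name.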
Comment 54 Roger Pau Monné freebsd_committer 2017-02-13 10:15:23 UTC
Created attachment 179940 [details]
Selectively disable PV optimizations
Comment 55 rainer 2017-02-13 13:19:51 UTC
I can't see these tunables in sysctl.

But:

hw.xen.disable_pv_disks=1

is responsible for the slight increases in disk-performance.
Comment 56 Roger Pau Monné freebsd_committer 2017-02-13 13:54:57 UTC
(In reply to rainer from comment #55)
Yes, you won't see those tunables in sysctl.

Then again I'm quite lost, because you did test a plain dd, and that was actually working fine (and yielding results in line with Linux). Can you try a more complete benchmark, like unixbench (available in ports), without any tunables and report its results?
Comment 57 rainer 2017-02-13 15:58:55 UTC
Well, I did do dd tests, but they only wrote to a filesystem.

It was (back then) most likely on ZFS, with compression etc. that changed the results.
Esp. if I just write zeros from /dev/zero.

That's why I switched to dc3dd because it does the same thing on Linux and on FreeBSD and completely eliminates the filesystem (and caching) layer as well as any other kinds of write-optimization.


I've now created an empty UFS filesystem on my 50G volume and run unixbench on it.
Comment 58 Roger Pau Monné freebsd_committer 2017-02-13 16:15:46 UTC
(In reply to rainer from comment #57)
You can also use plain dd to write to a block device, just like you do with dc3dd. Can you also check whether plain dd shows the same slowness when writing to a block device directly?
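If a third data point besides dd and dc3dd is useful, a sequential zero-fill can also be timed with a few lines of Python. This is only a sketch of the idea (roughly what `dd if=/dev/zero of=<device> bs=1m` does), not a replacement for either tool: the function name is made up, the demo writes to a scratch file rather than a disk, and pointing it at a real device is DESTRUCTIVE.

```python
import os
import tempfile
import time

MIB = 1024 * 1024


def timed_zero_write(path: str, total_bytes: int, block_size: int = MIB):
    """Sequentially write zeros to an existing path; return (bytes, seconds).

    DESTRUCTIVE: overwrites whatever is stored at `path`.
    """
    block = bytes(block_size)  # a buffer of zero bytes
    written = 0
    start = time.monotonic()
    fd = os.open(path, os.O_WRONLY)
    try:
        while written < total_bytes:
            chunk = block[: min(block_size, total_bytes - written)]
            written += os.write(fd, chunk)  # may be a short write
        os.fsync(fd)  # include the flush in the measured time
    finally:
        os.close(fd)
    return written, max(time.monotonic() - start, 1e-9)


# Demo against a scratch file; replace the path with a device only on purpose.
with tempfile.NamedTemporaryFile() as scratch:
    written, seconds = timed_zero_write(scratch.name, 8 * MIB)

print(f"{written / MIB:.0f} MiB in {seconds:.3f} s = {written / MIB / seconds:.1f} MiB/s")
```

Writes to a file go through the filesystem and cache, so the demo figure says nothing about the disk; only a run against the raw device would be comparable to the dc3dd numbers.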
Comment 59 rainer 2017-02-13 17:46:35 UTC
BYTE UNIX Benchmarks (Version 4.1.0)
  System -- freebsd11
  Start Benchmark Run: Mon Feb 13 17:28:56 CET 2017
   3 interactive users.
   5:28PM  up  3:06, 3 users, load averages: 0.62, 0.69, 0.71
  -r-xr-xr-x  1 root  wheel  153744 Sep 27 18:03 /bin/sh
  /bin/sh: ELF 64-bit LSB executable, x86-64, version 1 (FreeBSD), dynamically linked, interpreter /libexec/ld-elf.so.1, for FreeBSD 11.0 (1100122), FreeBSD-style, stripped
  zroot/ROOT/default    11262956 6353528 4909428    56%    /
File Read 4096 bufsize 8000 maxblocks    2313874.0 KBps  (30.0 secs, 3 samples)
File Write 4096 bufsize 8000 maxblocks   120869.0 KBps  (30.0 secs, 3 samples)
File Copy 4096 bufsize 8000 maxblocks    126921.0 KBps  (30.0 secs, 3 samples)
Shell Scripts (1 concurrent)               2137.7 lpm   (60.0 secs, 3 samples)
Shell Scripts (8 concurrent)                314.1 lpm   (60.0 secs, 3 samples)
Shell Scripts (16 concurrent)               154.5 lpm   (60.0 secs, 3 samples)
Arithmetic Test (type = short)                1.0 lps   (0.0 secs, 3 samples)
Arithmetic Test (type = int)                  1.0 lps   (0.0 secs, 3 samples)
Arithmetic Test (type = long)                 1.0 lps   (0.0 secs, 3 samples)
Arithmetic Test (type = float)                1.0 lps   (0.0 secs, 3 samples)
Arithmetic Test (type = double)               1.0 lps   (0.0 secs, 3 samples)
Arithoh                                       1.0 lps   (0.0 secs, 3 samples)
C Compiler Throughput                       499.1 lpm   (60.0 secs, 3 samples)
Dc: sqrt(2) to 99 decimal places          78631.1 lpm   (30.0 secs, 3 samples)
Recursion Test--Tower of Hanoi           192648.9 lps   (20.0 secs, 3 samples)


                     INDEX VALUES            
TEST                                        BASELINE     RESULT      INDEX

File Copy 4096 bufsize 8000 maxblocks         5800.0   126921.0      218.8
Shell Scripts (8 concurrent)                     6.0      314.1      523.5
                                                                 =========
     FINAL SCORE                                                     338.5



on a 50G UFS partition.
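For reading the INDEX VALUES table: the numbers are consistent with each index being RESULT / BASELINE * 10 and the final score being the geometric mean of the indices. The sketch below checks that against the two rows above; the formula is inferred from the table, not taken from the unixbench sources, and the function names are illustrative.

```python
import math


def index(result: float, baseline: float) -> float:
    """Per-test index: result relative to the baseline, scaled by 10."""
    return result / baseline * 10.0


def final_score(indices) -> float:
    """Geometric mean of the per-test indices."""
    indices = list(indices)
    return math.exp(sum(math.log(i) for i in indices) / len(indices))


file_copy = index(126921.0, 5800.0)  # File Copy 4096 bufsize 8000 maxblocks
shell8 = index(314.1, 6.0)           # Shell Scripts (8 concurrent)
score = final_score([file_copy, shell8])

print(f"file copy index: {file_copy:.1f}")
print(f"shell (8) index: {shell8:.1f}")
print(f"final score:     {score:.1f}")
```

Only two tests feed the score in this run, so a single fast file-copy result shifts it a lot.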