When live migrating a running FreeBSD 10.3 VM under XenServer 6.5 - the VM panics when it "lands" on the other node.
This works under FreeBSD 10.1-R and 10.2-R (with 'freebsd-update' patches) - but fails for both 10.3-R and 10-STABLE.
The VM starts to migrate in XenCenter - 'lands' on the destination node (and the XenServer console tab is connected) - but the guest then immediately panics with the output below.
I can't get a dump from this (fails, as shown below).
Fatal trap 30: reserved (unknown) fault while in kernel mode
cpuid = 0; apic id = 00
instruction pointer = 0x20:0xffffffff80e970b1
stack pointer = 0x28:0xfffffe002b9b0ac0
frame pointer = 0x28:0xfffffe002b9b0af0
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, IOPL = 0
current process = 16 (xenwatch)
[ thread pid 16 tid 100033 ]
stopped at lapic_ipi_vectored+0xf1: addq $0x8,%rsp
Tracing pid 16 tid 100033 td 0xfffff8000242e000
lapic_ipi_vectored() at lapic_ipi_vectored+0xf1/frame 0xfffffe002b9b0af0
xctrl_suspend() at xctrl_suspend+0x317/frame 0xfffffe002b9b0b40
xctrl_on_watch_event() at xctrl_on_watch_event+0x5e/frame 0xfffffe002b9b0b70
xenwatch_thread() at xenwatch_thread+0x1cf/frame 0xfffffe002b9b0bb0
fork_exit() at fork_exit+0x9a/frame 0xfffffe002b9b0bf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe002b9b0bf0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
Dumping 71 out of 483 MB: xbd0: dump: no more commands?
** DUMP FAILED (ERROR 16) **
Automatic reboot in 15 seconds - press a key on the console to abort
Sorry - should have added - to reproduce:
- Install XenServer 6.5 + Hotfixes
- Install FreeBSD 10.3-RELEASE (or 10-STABLE)
- Do 'pkg install xe-guest-utilities'
- Set 'xenguest_enable="YES"' in /etc/rc.conf
- Attempt live migration from one XenServer in a pool, to another.
'xe-guest-utilities' needs to be running in order for XenServer to 'see' that the guest is agile / able to live migrate.
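The reproduction steps above can be sketched as two shell functions, one for the FreeBSD guest and one for a XenServer pool member. This is a rough sketch, not the exact commands used in the report: the function names, the `-y` flag, the use of `sysrc` instead of editing /etc/rc.conf by hand, the `xenguest` service name (inferred from the rc.conf variable) and the `xe vm-migrate` call (the report used XenCenter) are all assumptions. Nothing runs until a function is called.

```shell
# Run as root on the FreeBSD 10.3 guest.
prepare_guest() {
    pkg install -y xe-guest-utilities || return 1
    # Equivalent to adding xenguest_enable="YES" to /etc/rc.conf.
    sysrc xenguest_enable="YES" || return 1
    # Assumption: the rc script is named after the rc.conf variable.
    service xenguest start
}

# Run on a XenServer host in the pool.
# $1 = VM name label, $2 = destination host name (both placeholders).
live_migrate() {
    [ $# -eq 2 ] || { echo "usage: live_migrate VM HOST" >&2; return 2; }
    xe vm-migrate vm="$1" host="$2" live=true
}
```

For example, `live_migrate freebsd103-test xs-node2` on a pool member, where both names are placeholders for your own VM and host.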
Sadly I don't seem to be able to reproduce this using Open Source Xen on my hardware.
I've sent a patch to update the FreeBSD version that's used for testing with Open Source Xen to 10.3. The test cluster contains quite a lot of different hardware, so it might be able to trigger it.
The crash itself looks quite weird: it seems to be caused by the IPIs that are sent to CPUs in order to set up the PV timers (xentimer_resume). Also, vector 30 is used to deliver security exceptions, and I don't even know how those are triggered...
Anyway, will get back when I have more information about this.
In your test - did you use shared storage?
Having looked at this yesterday, and this morning - I tried setting up a completely separate pool - and found:
- Live migration with local storage in XenCenter works.
- Live migration with shared (iSCSI) storage panics.
- Live migration from shared (iSCSI) to local storage panics.
- Live migration from local storage to shared (iSCSI) panics.
All of the above complete fine with FreeBSD 10.2. I've also tested this on both our production XenServer 6.5 / HP Proliant Gen8 pool - and test 6.5 / Proliant Gen9 pool - with the same results.
I don't know if your test used shared, or local storage.
(In reply to kpielorz from comment #3)
That's even more weird. Could it be that the VM with local storage only has 1 vCPU, while the VMs with shared storage have more than 1 vCPU?
Could you try to migrate a VM with only 1 vCPU and see if the same happens?
Created attachment 170177
Could you also try the following patch on a VM with vCPUs > 1?
It should apply cleanly against either 10.3-RELEASE source or the stable/10 branch, and you should only need to recompile the kernel. FWIW, I would just download 10.3 sources, apply the patch and recompile the kernel.
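A minimal sketch of that apply-and-rebuild cycle, assuming the 10.3-RELEASE sources are already unpacked in /usr/src and the stock GENERIC kernel configuration is in use. The function name, the `-p0` strip level and the patch path are assumptions; adjust them to match the attachment.

```shell
# Run as root on the FreeBSD guest. $1 = path to the downloaded patch.
# The body runs in a subshell so the caller's working directory is kept.
rebuild_kernel_with_patch() (
    patch_file="$1"
    [ -r "$patch_file" ] || { echo "cannot read patch: $patch_file" >&2; return 1; }
    cd /usr/src || return 1
    # Assumption: paths in the patch are relative to /usr/src.
    patch -p0 < "$patch_file" || return 1
    make buildkernel KERNCONF=GENERIC || return 1
    make installkernel KERNCONF=GENERIC || return 1
    echo "Reboot into the new kernel, then retry the live migration."
)
```

Only the kernel is rebuilt here, which matches the note above that a full world build should not be needed.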
(In reply to Roger Pau Monné from comment #4)
The original machines have 2 vCPUs - I've just re-tested with a VM with only 1 vCPU - and it has the same issues :(
Apologies - I double checked the vCPU counts on the VMs - it appears the issue *only* affects 1 vCPU VMs.
This is regardless of storage type. Some of the early test VMs I was using had 2 vCPUs - and that's on a pool with only local storage.
So - even without your patch, >1 vCPU VMs will migrate OK with stock 10.3-RELEASE.
Uniprocessor (1 vCPU) systems panic 'after' the live migrate. Sorry for the red herring.
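Since the mix-up came from not knowing how many vCPUs each test VM had, a quick check before each migration attempt may help. The helper below is a hypothetical sketch (the function name is made up); it only encodes the finding above that exactly 1 vCPU is the failing case.

```shell
# Hypothetical helper: succeeds when the vCPU count matches the
# configuration reported here as panicking (exactly 1 vCPU).
is_affected_vcpu_count() {
    [ "$1" -eq 1 ]
}

# On the FreeBSD guest, hw.ncpu is the number of CPUs the kernel sees:
#   is_affected_vcpu_count "$(sysctl -n hw.ncpu)" && echo "1 vCPU: expect the panic on stock 10.3-RELEASE"
# On the XenServer host, the configured count can be read with:
#   xe vm-param-get uuid=<vm-uuid> param-name=VCPUs-max
```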
I've found the cause of the issue: it's the MFC of r291024, which needed some adjustments, due to code differences in stable/10, that I didn't take into account when backporting.
I will hopefully have a fix today, and I'm planning to request an EN for 10.3 in order to have binary fixes for this issue available using freebsd-update.
(In reply to Roger Pau Monné from comment #8)
Thanks - that's great! - And sorry again for the confusion (too many 'test' VMs - and too many 'test' pools floating around here!).
Created attachment 170204
This patch fixes the issue on my side. Can you please confirm it also fixes yours?
(In reply to Roger Pau Monné from comment #10)
Yes - the patch applied to a 10.3-R system stops it panicking when a 1 vCPU VM is live migrated. Also tested with >1 vCPU - and works fine.