Summary: | Under XenServer 6.5 - Live Migration results in a Kernel Panic when the VM 'Lands' on the other Node | |
---|---|---|---
Product: | Base System | Reporter: | karl
Component: | kern | Assignee: | Roger Pau Monné <royger>
Status: | Closed FIXED | |
Severity: | Affects Only Me | CC: | rainer, royger
Priority: | --- | Keywords: | patch
Version: | 10.3-RELEASE | |
Hardware: | amd64 | |
OS: | Any | |
Attachments: | Proposed fix (attachment 170177); Fix (attachment 170204) | |

Description
karl
2016-05-06 11:06:47 UTC
Sorry - should have added - to reproduce:

- Install XenServer 6.5 + Hotfixes
- Install FreeBSD 10.3-RELEASE (or 10-STABLE)
- Do 'pkg install xe-guest-utilities'
- Set 'xenguest_enable="YES"' in /etc/rc.conf
- Attempt live migration from one XenServer in a pool to another.

'xe-guest-utilities' needs to be running in order for XenServer to 'see' that the guest is agile / able to live migrate.

-Karl

Sadly I don't seem to be able to reproduce this using Open Source Xen on my hardware. I've sent a patch to update the FreeBSD version that's used for testing with Open Source Xen to 10.3. The test cluster contains quite a lot of different hardware, so it might be able to trigger it.

The crash itself looks quite weird: it seems to be caused by the IPIs that are sent to CPUs in order to set up the PV timers (xentimer_resume). Also, vector 30 is used to deliver security exceptions [0], and I don't even know how those are triggered...

Anyway, will get back when I have more information about this.

[0] http://wiki.osdev.org/Exceptions

Hi,

In your test - did you use shared storage?

Having looked at this yesterday, and this morning - I tried setting up a completely separate pool - and found:

- Live migration with local storage in XenCenter works.
- Live migration with shared (iSCSI) storage panics.
- Live migration from shared (iSCSI) to local storage panics.
- Live migration from local storage to shared (iSCSI) panics.

All of the above complete fine with FreeBSD 10.2.

I've also tested this on both our production XenServer 6.5 / HP Proliant Gen8 pool - and test 6.5 / Proliant Gen9 pool - with the same results.

I don't know if your test used shared or local storage.

-Karl

(In reply to kpielorz from comment #3)
That's even weirder. Could it be that the VM with local storage only has 1 vCPU, while the VMs with shared storage have more than 1 vCPU?

Could you try to migrate a VM with only 1 vCPU and see if the same happens?

Created attachment 170177 [details]
Proposed fix
Could you also try the following patch on a VM with vCPUs > 1?
It should apply cleanly against either 10.3-RELEASE source or the stable/10 branch, and you should only need to recompile the kernel. FWIW, I would just download 10.3 sources, apply the patch and recompile the kernel.
Thanks.
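For anyone following along, the rebuild described above might look roughly like the sketch below. It assumes the 10.3 sources are already unpacked in /usr/src, that the attachment was saved as /root/xen-fix.patch (a hypothetical path), that it applies with -p0 from the top of the source tree, and that the GENERIC kernel config is in use; none of these details are stated in this report. By default the script only prints the commands; set RUN=1 to actually execute them.

```shell
#!/bin/sh
# Sketch only: apply a kernel patch and rebuild/install the kernel.
# All three values below are assumptions, not taken from the bug report.
SRC_DIR="${SRC_DIR:-/usr/src}"                   # where the 10.3 sources live
PATCH_FILE="${PATCH_FILE:-/root/xen-fix.patch}"  # hypothetical saved attachment
KERNCONF="${KERNCONF:-GENERIC}"                  # kernel config to build

# Print each command unless RUN=1 is set, so the sketch is safe to try.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "+ $*"
    fi
}

apply_and_rebuild() {
    run cd "$SRC_DIR" &&
    run patch -p0 -i "$PATCH_FILE" &&   # -p0 is an assumption about the patch paths
    run make buildkernel KERNCONF="$KERNCONF" &&
    run make installkernel KERNCONF="$KERNCONF"
}

apply_and_rebuild
```

After installkernel completes, the VM needs a reboot into the new kernel before the live migration is retried.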
(In reply to Roger Pau Monné from comment #4)
Hi,

The original machines have 2 vCPUs - I've just re-tested with a VM with only 1 vCPU - and it has the same issues :(

Hi,

Apologies - I double checked the vCPU counts on the VMs - it appears the issue *only* affects 1 vCPU VMs. This is regardless of storage type.

Some of the early test VMs I was using had 2 vCPUs - and that's on a pool with only local storage.

So - even without your patch, >1 vCPU VMs will migrate OK with stock 10.3-RELEASE. Uniprocessor (1 vCPU) systems panic 'after' the live migrate.

Sorry for the red herring.

-Karl

Hello,

I've found the cause of the issue: it's the MFC of r291024, which needed some adjustments due to code differences when backported to stable/10 that I didn't take into account.

I will hopefully have a fix today, and I'm planning to request an EN for 10.3 in order to have binary fixes for this issue available using freebsd-update.

(In reply to Roger Pau Monné from comment #8)
Hi,

Thanks - that's great! - And sorry again for the confusion (too many 'test' VMs - and too many 'test' pools floating around here!).

-Karl

Created attachment 170204 [details]
Fix
This patch fixes the issue on my side, can you please confirm it also fixes yours?
(In reply to Roger Pau Monné from comment #10)
Hi,

Yes - the patch applied to a 10.3-R system stops it panicking when a 1 vCPU VM is live migrated. Also tested with >1 vCPU - and works fine.

Many thanks,

-Karl