Bug 209327 - Under XenServer 6.5 - Live Migration results in a Kernel Panic when the VM 'Lands' on the other Node
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 10.3-RELEASE
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: Roger Pau Monné
URL:
Keywords: patch
Depends on:
Blocks:
 
Reported: 2016-05-06 11:06 UTC by karl
Modified: 2017-12-01 09:52 UTC (History)
2 users

See Also:


Attachments
Proposed fix (705 bytes, patch)
2016-05-10 10:42 UTC, Roger Pau Monné
no flags
Fix (904 bytes, patch)
2016-05-11 09:49 UTC, Roger Pau Monné
no flags

Description karl 2016-05-06 11:06:47 UTC
When live migrating a VM running FreeBSD 10.3 under XenServer 6.5, the VM panics when it "lands" on the other node.

This works under FreeBSD 10.1-R and 10.2-R (with 'freebsd-update' patches) - but fails for both 10.3-R and 10-STABLE.

The VM starts to migrate in XenCenter and 'lands' on the destination node (the XenServer console tab connects) - but the VM then immediately panics with the output below.

I can't get a crash dump from this (the dump fails, as shown below).

-Karl


Fatal trap 30: reserved (unknown) fault while in kernel mode
cpuid = 0; apic id = 00
instruction pointer      = 0x20:0xffffffff80e970b1
stack pointer            = 0x28:0xfffffe002b9b0ac0
frame pointer            = 0x28:0xfffffe002b9b0af0
code segment             = base rx0, limit 0xfffff, type 0x1b
                         = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags         = interrupt enabled, IOPL = 0
current process          = 16 (xenwatch)
[ thread pid 16 tid 100033 ]
stopped at        lapic_ipi_vectored+0xf1:    addq  $0x8,%rsp
db> trace
Tracing pid 16 tid 100033 td 0xfffff8000242e000
lapic_ipi_vectored() at lapic_ipi_vectored+0xf1/frame 0xfffffe002b9b0af0
xctrl_suspend() at xctrl_suspend+0x317/frame 0xfffffe002b9b0b40
xctrl_on_watch_event() at xctrl_on_watch_event+0x5e/frame 0xfffffe002b9b0b70
xenwatch_thread() at xenwatch_thread+0x1cf/frame 0xfffffe002b9b0bb0
fork_exit() at fork_exit+0x9a/frame 0xfffffe002b9b0bf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe002b9b0bf0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
db> panic
Uptime 2m22s
Dumping 71 out of 483 MB: xbd0: dump: no more commands?

** DUMP FAILED (ERROR 16) **
Automatic reboot in 15 seconds - press a key on the console to abort
Comment 1 karl 2016-05-06 11:09:13 UTC
Sorry - I should have added the reproduction steps:

 - Install XenServer 6.5 + Hotfixes
 - Install FreeBSD 10.3-RELEASE (or 10-STABLE)
 - Do 'pkg install xe-guest-utilities'
 - Set 'xenguest_enable="YES"' in /etc/rc.conf
 - Attempt live migration from one XenServer in a pool, to another.

'xe-guest-utilities' needs to be running in order for XenServer to 'see' that the guest is agile / able to live migrate.
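For reference, the guest-side part of the setup above boils down to the following commands (the 'xenguest' service name is inferred from the rc.conf knob and is an assumption):

```shell
# On the FreeBSD 10.3-RELEASE guest, as root:
pkg install xe-guest-utilities   # XenServer guest agent
sysrc xenguest_enable="YES"      # persist the knob in /etc/rc.conf
service xenguest start           # start without rebooting (service name assumed)
```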

-Karl
Comment 2 Roger Pau Monné freebsd_committer 2016-05-09 11:12:31 UTC
Sadly I don't seem to be able to reproduce this using Open Source Xen on my hardware.

I've sent a patch to update the FreeBSD version used for testing with Open Source Xen to 10.3; the test cluster contains quite a lot of different hardware, so it might be able to trigger the issue.

The crash itself looks quite weird: it seems to be caused by the IPIs that are sent to the CPUs in order to set up the PV timers (xentimer_resume). Also, vector 30 is used to deliver security exceptions [0], and I don't even know how those are triggered...

Anyway, I'll get back when I have more information about this.

[0] http://wiki.osdev.org/Exceptions
Comment 3 karl 2016-05-10 09:01:42 UTC
Hi,

In your test - did you use shared storage?

Having looked at this yesterday and this morning, I tried setting up a completely separate pool - and found:

  - Live migration with local storage in XenCenter Works.

  - Live migration with shared (iSCSI) storage panics.

  - Live migration from shared (iSCSI) to local storage panics.

  - Live migration from local storage to shared (iSCSI) panics.

All of the above complete fine with FreeBSD 10.2. I've also tested this on both our production XenServer 6.5 / HP Proliant Gen8 pool - and test 6.5 / Proliant Gen9 pool - with the same results.

I don't know if your test used shared, or local storage.

-Karl
Comment 4 Roger Pau Monné freebsd_committer 2016-05-10 10:38:33 UTC
(In reply to kpielorz from comment #3)
That's even weirder. Could it be that the VM with local storage has only 1 vCPU, while the VMs with shared storage have more than 1 vCPU?

Could you try to migrate a VM with only 1 vCPU and see if the same happens?
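For a quick check, the vCPU count can also be changed from the XenServer host with the xe CLI (a hedged sketch; `<vm-uuid>` is a placeholder, and the VM must be halted before its vCPU topology can be changed):

```shell
xe vm-shutdown uuid=<vm-uuid>
xe vm-param-set uuid=<vm-uuid> VCPUs-max=1 VCPUs-at-startup=1
xe vm-start uuid=<vm-uuid>
# ...then retry the live migration and watch the guest console.
```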
Comment 5 Roger Pau Monné freebsd_committer 2016-05-10 10:42:31 UTC
Created attachment 170177 [details]
Proposed fix

Could you also try the following patch on a VM with more than one vCPU?

It should apply cleanly against either the 10.3-RELEASE sources or the stable/10 branch, and you should only need to recompile the kernel. FWIW, I would just download the 10.3 sources, apply the patch and recompile the kernel.
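A minimal sketch of that procedure on the guest, assuming the 10.3 sources are checked out in /usr/src and a GENERIC kernel config (both assumptions; the patch path is a placeholder):

```shell
cd /usr/src
patch < /path/to/proposed-fix.patch    # attachment 170177
make -j4 buildkernel KERNCONF=GENERIC
make installkernel KERNCONF=GENERIC
shutdown -r now
```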

Thanks.
Comment 6 karl 2016-05-10 14:13:25 UTC
(In reply to Roger Pau Monné from comment #4)

Hi,

The original machines have 2 vCPUs - I've just re-tested with a VM with only 1 vCPU - and it has the same issue :(
Comment 7 karl 2016-05-10 15:24:16 UTC
Hi,

Apologies - I double-checked the vCPU counts on the VMs - it appears the issue *only* affects 1 vCPU VMs.

This is regardless of storage type. Some of the early test VMs I was using had 2 vCPUs - and that was on a pool with only local storage.

So even without your patch, VMs with more than one vCPU will migrate OK on stock 10.3-RELEASE.

Uniprocessor (1 vCPU) systems panic after the live migration. Sorry for the red herring.

-Karl
Comment 8 Roger Pau Monné freebsd_committer 2016-05-11 08:35:22 UTC
Hello,

I've found the cause of the issue: it's the MFC of r291024, which needed some adjustments due to code differences when backported to stable/10 that I didn't take into account.

I will hopefully have a fix today, and I'm planning to request an EN (Errata Notice) for 10.3 so that binary fixes for this issue become available via freebsd-update.
Comment 9 karl 2016-05-11 09:11:08 UTC
(In reply to Roger Pau Monné from comment #8)

Hi,

Thanks - that's great! And sorry again for the confusion (too many 'test' VMs - and too many 'test' pools floating around here!).

-Karl
Comment 10 Roger Pau Monné freebsd_committer 2016-05-11 09:49:33 UTC
Created attachment 170204 [details]
Fix

This patch fixes the issue on my side; can you please confirm it also fixes it on yours?
Comment 11 karl 2016-05-11 10:56:20 UTC
(In reply to Roger Pau Monné from comment #10)

Hi,

Yes - the patch applied to a 10.3-R system stops it panicking when a 1 vCPU VM is live migrated. Also tested with >1 vCPU - that works fine too.

Many thanks,

-Karl