Bug 228982 - [panic] page fault in mld_v2_cancel_link_timers() on boot
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: CURRENT
Hardware: Any Any
Importance: --- Affects Only Me
Assignee: freebsd-net mailing list
Reported: 2018-06-13 12:43 UTC by Andrey V. Elsukov
Modified: 2018-06-14 09:39 UTC (History)

Description Andrey V. Elsukov freebsd_committer 2018-06-13 12:43:59 UTC
It seems there are some cases that were not properly covered when IF_ADDR_LOCK was converted to epoch+mutex.

I have seen this panic several times. It is not 100% reproducible, but it appears to be related to lagg(4) and the assignment of IPv6 link-local addresses.

When lagg is created, it removes the IPv6 link-local addresses (LLAs) from its parent interfaces, and the panic sometimes happens during that step.

<118>Created clone interfaces: lagg0.
<6>lo0: link state changed to UP
<6>re0: link state changed to DOWN
<6>lagg0: IPv6 addresses on em0 have been removed before adding it as a member to prevent IPv6 address scope violation.
<6>lagg0: link state changed to DOWN
<6>lagg0: IPv6 addresses on re0 have been removed before adding it as a member to prevent IPv6 address scope violation.
<6>re0: link state changed to UP
<6>lagg0: link state changed to UP
Kernel page fault with the following non-sleepable locks held:
exclusive sleep mutex if_addr_lock (if_addr_lock) r = 0 (0xfffff800122f2188) locked @ /home/devel/freebsd/base/head/sys/netinet6/mld6.c:1679
exclusive sleep mutex mld_mtx (mld_mtx) r = 0 (0xffffffff81fa9938) locked @ /home/devel/freebsd/base/head/sys/netinet6/mld6.c:684
exclusive sleep mutex in6_multi_list_mtx (in6_multi_list_mtx) r = 0 (0xffffffff8201f390) locked @ /home/devel/freebsd/base/head/sys/netinet6/mld6.c:683
stack backtrace:
#0 0xffffffff80bef103 at witness_debugger+0x73
#1 0xffffffff80bf04e1 at witness_warn+0x461
#2 0xffffffff8105e763 at trap_pfault+0x53
#3 0xffffffff8105dd7a at trap+0x2ba
#4 0xffffffff81038c6c at calltrap+0x8
#5 0xffffffff80de6b9f at mld_input+0x2ff
#6 0xffffffff80dc516d at icmp6_input+0x43d
#7 0xffffffff80ddfac8 at ip6_input+0xdd8
#8 0xffffffff80cae552 at netisr_dispatch_src+0xa2
#9 0xffffffff80c9181e at ether_demux+0x16e
#10 0xffffffff80c92cb2 at ether_nh_input+0x402
#11 0xffffffff80cae552 at netisr_dispatch_src+0xa2
#12 0xffffffff80c91cdf at ether_input+0x8f
#13 0xffffffff808b282b at re_rxeof+0x60b
#14 0xffffffff808afb60 at re_int_task+0x80
#15 0xffffffff80be192c at taskqueue_run_locked+0x14c
#16 0xffffffff80be179a at taskqueue_run+0x4a
#17 0xffffffff80b46699 at intr_event_execute_handlers+0x99


Fatal trap 12: page fault while in kernel mode
cpuid = 2; apic id = 04
fault virtual address	= 0x24
fault code		= supervisor read data, page not present
instruction pointer	= 0x20:0xffffffff80de90d6
stack pointer	        = 0x28:0xfffffe0077b873a0
frame pointer	        = 0x28:0xfffffe0077b873e0
code segment		= base rx0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 12 (swi5: fast taskq)

__curthread () at ./machine/pcpu.h:231
231		__asm("movq %%gs:%1,%0" : "=r" (td)
(kgdb) l *0xffffffff80de90d6
0xffffffff80de90d6 is in mld_set_version (/home/devel/freebsd/base/head/sys/netinet6/mld6.c:1685).
1680	 restart:
1681		CK_STAILQ_FOREACH_SAFE(ifma, &ifp->if_multiaddrs, ifma_link, next) {
1682			if (ifma->ifma_addr->sa_family != AF_INET6)
1683				continue;
1684			inm = (struct in6_multi *)ifma->ifma_protospec;
1685			switch (inm->in6m_state) {
1686			case MLD_NOT_MEMBER:
1687			case MLD_SILENT_MEMBER:
1688			case MLD_IDLE_MEMBER:
1689			case MLD_LAZY_MEMBER:
Comment 1 Matthew Macy 2018-06-13 18:16:26 UTC
This looks a lot more like it's tied to my deferred deletion of multicast addresses. Could you test with a kernel prior to my epoch changes? Also, could you give me a specific configuration that is most likely to produce the issue?
Thanks
Comment 2 Andrey V. Elsukov freebsd_committer 2018-06-13 18:54:24 UTC
(In reply to Matthew Macy from comment #1)
> This looks a lot more like it's tied to my deferred deletion of multicast
> addresses. Could you test with a kernel prior to my epoch changes? Also,
> could you give me a specific configuration that is most likely to produce
> the issue?

I update this host periodically and had never seen such panics before the epoch changes.
I can test an earlier revision, but as I said, it happens rarely.

I think relevant settings are:

cloned_interfaces="lagg0"
ifconfig_lagg0="laggproto failover laggport em0 laggport re0 DHCP"
ifconfig_em0="up" # this port is unplugged
ifconfig_re0="up"
ipv6_activate_all_interfaces="YES"

And my local network usually has some IPv6 activity; at the very least, IPv6 is enabled.
Comment 3 commit-hook freebsd_committer 2018-06-14 09:36:53 UTC
A commit references this bug:

Author: ae
Date: Thu Jun 14 09:36:25 UTC 2018
New revision: 335129
URL: https://svnweb.freebsd.org/changeset/base/335129

Log:
  Add NULL check like the rest of code has.

  It is possible that ifma_protospec becomes NULL in this function for
  some entry, but it is still referenced and thus it will not unlinked
  from the list. Then "restart" condition triggers and this entry with
  NULL ifma_protospec will lead to page fault.

  PR:		228982

Changes:
  head/sys/netinet6/mld6.c
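
For readers following along, here is a minimal userland sketch of the pattern the commit log describes: skip list entries whose protocol-specific pointer has already been cleared before dereferencing it. This is not the actual change in r335129. All types and names below (ifmultiaddr_sketch, in6_multi_sketch, count_reporting, sketch_demo, the SKETCH_* constants) are hypothetical stand-ins for the kernel structures, and a plain linked list replaces CK_STAILQ. The reported fault address of 0x24 is consistent with reading a field at a small offset from a NULL pointer, which is the case the check guards against.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel constants and types (not the real ones). */
#define SKETCH_AF_INET   2
#define SKETCH_AF_INET6  28

enum sketch_state { SKETCH_NOT_MEMBER, SKETCH_REPORTING };

struct in6_multi_sketch {
	enum sketch_state state;	/* plays the role of in6m_state */
};

struct ifmultiaddr_sketch {
	int family;			/* plays the role of ifma_addr->sa_family */
	void *protospec;		/* plays the role of ifma_protospec */
	struct ifmultiaddr_sketch *next;
};

/*
 * Walk the list and count IPv6 entries in the reporting state.
 * Entries whose protospec is NULL are still linked but no longer
 * have protocol state, so they are skipped rather than dereferenced.
 */
static int
count_reporting(const struct ifmultiaddr_sketch *head)
{
	int n = 0;

	for (const struct ifmultiaddr_sketch *ifma = head; ifma != NULL;
	    ifma = ifma->next) {
		if (ifma->family != SKETCH_AF_INET6)
			continue;
		const struct in6_multi_sketch *inm = ifma->protospec;
		if (inm == NULL)	/* the guard: without it, inm->state faults */
			continue;
		if (inm->state == SKETCH_REPORTING)
			n++;
	}
	return (n);
}

/* Build a three-entry list: non-IPv6, IPv6 with NULL protospec, IPv6 live. */
static int
sketch_demo(void)
{
	struct in6_multi_sketch m = { SKETCH_REPORTING };
	struct ifmultiaddr_sketch c = { SKETCH_AF_INET6, &m, NULL };
	struct ifmultiaddr_sketch b = { SKETCH_AF_INET6, NULL, &c };
	struct ifmultiaddr_sketch a = { SKETCH_AF_INET, NULL, &b };
	int n = count_reporting(&a);

	assert(n == 1);
	return (n);
}
```

In this sketch, removing the NULL guard would make the walk dereference a NULL pointer on the second entry, which mirrors the failure mode described in the log above.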