Summary: | mld_v2 listener report does not report all active groups to the router | |
---|---|---|---
Product: | Base System | Reporter: | scheffler
Component: | kern | Assignee: | Andrey V. Elsukov <ae>
Status: | Closed FIXED | |
Severity: | Affects Some People | CC: | ae, bms, scheffler
Priority: | --- | |
Version: | 9.3-STABLE | |
Hardware: | amd64 | |
OS: | Any | |
Attachments: | | |

It looks like the router that sent the general query uses a very small "Maximum Response Delay" for such a number of groups - 125 milliseconds. Is it possible to increase it? What system and software are used on this router?

FreeBSD limits the burst of replies to 4 packets; after that, packets not yet sent will be sent only after a ~500 ms delay. Since the response delay is very small, perhaps the router just doesn't wait for all packets?

Andrey,

are you sure you are reading the traces correctly?

1.) Maximum Response Code (Delay) is set to 10000 (10s) by the router, which is the default value given by RFC3810.

2.) The Query Interval (QQIC) is set to the (default) value of 125, but the unit of this value is seconds.

The router is a Cisco 2811 running IOS 15.1-4.M10, the latest IOS for this platform supported by Cisco. The router has a basic MC configuration; no timer values have been changed from the default. The behaviour starts as soon as I enable 'IPv6 multicast-routing'. I also reproduced the behaviour on a 2901.

Your description of 4-packet bursts makes sense - I was wondering about the 500ms delay between packet groups. The 510 groups need 8 packets to report. So, the kernel should stop after the second packet group (having reported all 510 groups to the router for this reporting period). However, it does not!

The trace clearly shows that it keeps on reporting the same groups over and over again until it suddenly starts losing groups from the report.

So to me it looks like 2 bugs:
1.) Reporting should stop after having reported all 510 groups.
2.) We should not lose groups from the report which are still active.

In the meantime I found a Linux box to run my MC code on and connected it to the very same router. Here the behaviour is very different. The router sends a General Query every 125 seconds. Linux reports the 510 groups (using 8 packets) and stays silent until it receives the next General Query. It also never reports less than the full 510 groups.
If you think it helps, I can also attach the Linux trace.

Thomas

(In reply to scheffler from comment #2)
> are you sure you are reading the traces correctly?
>
> 1.) Maximum Response Code (Delay) is set to 10000 (10s) by the router. Which
> is the default value given by RFC3810.

Ah, yes, you are right. I read it incorrectly.

> 2.) The Query Interval (QQIC) is set to the (default) value of 125, but the
> unit of this value is seconds.
>
> The router is a Cisco 2811 running IOS 15.1-4.M10, the latest IOS for this
> platform supported by Cisco. The router has a basic MC configuration, no
> timer values have been changed from the default. The behaviour starts as
> soon as I enable 'IPv6 multicast-routing'. I also reproduced the behaviour
> on a 2901.
>
> Your description of 4-packet bursts makes sense - I was wondering about the
> 500ms delay between packet groups. The 510 groups need 8 packets to report.
> So, the kernel should stop after the second packet group (having reported
> all 510 groups to the router for this reporting period). However, it does
> not!

Yes, it looks strange to me too. But I think it should repeat these packets once again due to QRV equal to 2.

> The trace clearly shows that it keeps on reporting the same groups over and
> over again until it suddenly starts losing groups from the report.
> So to me it looks like 2 bugs:
> 1.) Reporting should stop after having reported all 510 groups.
> 2.) We should not lose groups from the report which are still active.
>
> In the meantime I found a Linux-Box to run my MC code on and connected it to
> the very same router. Here the behaviour is very different. The router sends
> a General Query every 125 seconds. Linux reports the 510 groups (using 8
> packets) and stays silent until it receives the next General Query. It also
> never reports less than the full 510 groups.
> If you think it helps, I can also attach the Linux-trace.

Yes, it is interesting. Thanks.
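For reference, the two query fields discussed here can be decoded as in the sketch below. It follows the field encodings described in RFC 3810, section 5.1 (values below the threshold are used directly, larger values use a mantissa/exponent form). The function names are made up for illustration and are not taken from the FreeBSD sources.

```python
def mld_max_resp_delay_ms(code: int) -> int:
    """Decode the 16-bit Maximum Response Code into milliseconds."""
    if not 0 <= code <= 0xFFFF:
        raise ValueError("Maximum Response Code is a 16-bit field")
    if code < 32768:
        # Small values are the delay in milliseconds, used as-is.
        return code
    # Otherwise: 1 flag bit, 3-bit exponent, 12-bit mantissa.
    exp = (code >> 12) & 0x7
    mant = code & 0x0FFF
    return (mant | 0x1000) << (exp + 3)


def mld_qqi_seconds(qqic: int) -> int:
    """Decode the 8-bit QQIC into the Querier's Query Interval in seconds."""
    if not 0 <= qqic <= 0xFF:
        raise ValueError("QQIC is an 8-bit field")
    if qqic < 128:
        # Small values are the interval in seconds, used as-is.
        return qqic
    # Otherwise: 1 flag bit, 3-bit exponent, 4-bit mantissa.
    exp = (qqic >> 4) & 0x7
    mant = qqic & 0x0F
    return (mant | 0x10) << (exp + 3)


if __name__ == "__main__":
    print(mld_max_resp_delay_ms(10000))  # milliseconds
    print(mld_qqi_seconds(125))          # seconds
```

With the values from the trace, a Maximum Response Code of 10000 decodes to 10000 ms (10 s) and a QQIC of 125 decodes to 125 s, which matches the reading given in comment #2.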
It looks like I was wrong again:

"Therefore, to enhance protocol robustness, in spite of the possible unreliability of message exchanges, messages are retransmitted several times. Furthermore, timers are set so as to take into account the possible message losses, and to wait for retransmissions. Periodical General Queries and Current State Reports do not apply this rule, in order not to overload the link; it is assumed that in general these messages do not generate state changes, their main purpose being to refresh existing state."

So, it shouldn't repeat these packets.

(In reply to Andrey V. Elsukov from comment #4)

Correct. Repeated transmission is only expected for 'state change reports'. Otherwise section 9.4 applies and takes care of lost reports:
http://tools.ietf.org/html/rfc3810#section-9.4

I will attach the Linux trace in a minute...

Created attachment 163679 [details]
Wireshark trace from Linux system
This trace shows the MLDv2 traffic between a Linux system (running the same MC code) and a similarly configured Cisco router.
(It is not exactly the same router as in the FreeBSD case, because I am not in the lab right now. But this should not matter...).
Created attachment 163681 [details]
Proposed patch for stable/9 (untested)
I found the problem.
Since mld_fasttimo_vnet() always calls mld_v2_dispatch_general_query() to handle a general query, each call queues new replies. So, in your case, after 4 packets were sent, mld_v2_dispatch_general_query() was called again to send the packets not yet sent, and this queued 8 new packets. After each call the number of queued packets grows, then the queue probably overflows and some of the groups start to get lost.
This patch adds a check for whether the queue still has packets to send. If it does, no new packets are queued; instead the pending ones are sent to finish the reply.
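The queue growth described above can be illustrated with a toy model. This is only a sketch, not the kernel code: `run_ticks`, the queue capacity of 16 and the drop-on-full behaviour are assumptions made for illustration. Only the burst limit of 4 and the 8 packets per reply come from this report.

```python
from collections import deque

MAX_RESPONSE_BURST = 4   # burst limit named in this report
PACKETS_PER_REPLY = 8    # 510 groups needed 8 packets here
QUEUE_LIMIT = 16         # assumed queue capacity, illustration only


def run_ticks(ticks, check_pending):
    """Simulate timer ticks; return (packets sent, packets dropped).

    check_pending=False models the reported behaviour: every tick
    queues a complete new reply before sending a burst.
    check_pending=True models the proposed check: while packets are
    still pending, only drain the queue.
    """
    queue = deque()
    sent = dropped = 0
    for _ in range(ticks):
        if not (check_pending and queue):
            # Queue a full reply; packets that do not fit are dropped.
            for seq in range(PACKETS_PER_REPLY):
                if len(queue) < QUEUE_LIMIT:
                    queue.append(seq)
                else:
                    dropped += 1
        # Send at most one burst per tick.
        for _ in range(min(MAX_RESPONSE_BURST, len(queue))):
            queue.popleft()
            sent += 1
        if not queue:
            break  # reply complete, timer is not re-armed
    return sent, dropped


if __name__ == "__main__":
    print("without check:", run_ticks(6, check_pending=False))
    print("with check:   ", run_ticks(6, check_pending=True))
```

Under these assumptions the model without the check never drains its queue and starts dropping packets on the fourth tick, while the model with the check sends the 8 packets in two bursts and then stops. Which groups get lost in the real kernel depends on details this model does not capture.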
A commit references this bug:

Author: ae
Date: Tue Dec 1 11:17:42 UTC 2015
New revision: 291578
URL: https://svnweb.freebsd.org/changeset/base/291578

Log:
mld_v2_dispatch_general_query() is used by mld_fasttimo_vnet() to send a reply to the MLDv2 General Query. In case when router has a lot of multicast groups, the reply can take several packets due to MTU limitation. Also we have a limit MLD_MAX_RESPONSE_BURST == 4, that limits the number of packets we send in one shot. Then we recalculate the timer value and schedule the remaining packets for sending. The problem is that when we call mld_v2_dispatch_general_query() to send remaining packets, we queue new reply in the same mbuf queue. And when number of packets is bigger than MLD_MAX_RESPONSE_BURST, we get endless reply of MLDv2 reports. To fix this, add the check for remaining packets in the queue.

PR: 204831
MFC after: 1 week
Sponsored by: Yandex LLC

Changes:
head/sys/netinet6/mld6.c

A commit references this bug:

Author: ae
Date: Tue Dec 8 07:26:16 UTC 2015
New revision: 291986
URL: https://svnweb.freebsd.org/changeset/base/291986

Log:
MFC r291578:
mld_v2_dispatch_general_query() is used by mld_fasttimo_vnet() to send a reply to the MLDv2 General Query. In case when router has a lot of multicast groups, the reply can take several packets due to MTU limitation. Also we have a limit MLD_MAX_RESPONSE_BURST == 4, that limits the number of packets we send in one shot. Then we recalculate the timer value and schedule the remaining packets for sending. The problem is that when we call mld_v2_dispatch_general_query() to send remaining packets, we queue new reply in the same mbuf queue. And when number of packets is bigger than MLD_MAX_RESPONSE_BURST, we get endless reply of MLDv2 reports. To fix this, add the check for remaining packets in the queue.

PR: 204831

Changes:
_U stable/10/
stable/10/sys/netinet6/mld6.c

A commit references this bug:

Author: ae
Date: Tue Dec 8 07:36:26 UTC 2015
New revision: 291988
URL: https://svnweb.freebsd.org/changeset/base/291988

Log:
MFC r291578:
mld_v2_dispatch_general_query() is used by mld_fasttimo_vnet() to send a reply to the MLDv2 General Query. In case when router has a lot of multicast groups, the reply can take several packets due to MTU limitation. Also we have a limit MLD_MAX_RESPONSE_BURST == 4, that limits the number of packets we send in one shot. Then we recalculate the timer value and schedule the remaining packets for sending. The problem is that when we call mld_v2_dispatch_general_query() to send remaining packets, we queue new reply in the same mbuf queue. And when number of packets is bigger than MLD_MAX_RESPONSE_BURST, we get endless reply of MLDv2 reports. To fix this, add the check for remaining packets in the queue.

PR: 204831

Changes:
_U stable/9/sys/
stable/9/sys/netinet6/mld6.c

Committed in head/, stable/10 and stable/9. Thanks!
Created attachment 163558 [details]
Wireshark trace

When using applications that receive traffic from a rather large number (510) of multicast groups, traffic from some groups is lost after some time. Reception for all groups starts fine, but after a Multicast General Query has been received from the on-link router, FreeBSD starts to lose lower-numbered groups in the Multicast Listener Report message. 'ifmcstat' shows that all groups are still active, but they are no longer reported to the router and therefore traffic is no longer forwarded to the receiver.

This behaviour was first observed on a FreeNAS using vnet, but could be replicated on FreeBSD 9.3 with the latest patches (FreeBSD 9.3-RELEASE-p24).

I include a Wireshark trace that shows the behaviour. Starting with packet 35, all 510 groups are reported. However, in packet 42 the reporting does not stop at the last group and all groups are reported again. This continues until we see in packet 62 that the first multicast groups are lost: ff15::1001:39 - ff15::1001:1

The loss continues to grow until we have lost the first 221 groups, where it stays stable.

In order to replicate the behaviour, an active multicast router is required on the link (it seems to be triggered by the reception of the General Query message from the router). The behaviour cannot be seen when no router is present.

I used mgen and a simple C program to verify the behaviour. If needed, I can supply the test program.
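As a rough check on how many Report packets 510 groups need, the sketch below estimates an upper bound on the number of address records per packet. It assumes a 1500-byte MTU, an 8-byte Hop-by-Hop header carrying the Router Alert option, records with no source addresses or auxiliary data, and a sender that fills every packet completely. A real implementation may pack fewer records per packet, so treat the result as an estimate, not as what the kernel does.

```python
import math

IPV6_HEADER = 40    # fixed IPv6 header, bytes
HOP_BY_HOP = 8      # assumed Hop-by-Hop header with Router Alert, bytes
REPORT_HEADER = 8   # MLDv2 Report header before the first record, bytes
RECORD_SIZE = 20    # record with no sources: 4-byte header + 16-byte address


def records_per_packet(mtu=1500):
    """Upper bound on source-less address records in one Report packet."""
    return (mtu - IPV6_HEADER - HOP_BY_HOP - REPORT_HEADER) // RECORD_SIZE


def packets_needed(groups, mtu=1500):
    """Minimum number of Report packets needed for `groups` records."""
    return math.ceil(groups / records_per_packet(mtu))


if __name__ == "__main__":
    print(records_per_packet())  # 72
    print(packets_needed(510))   # 8
```

Under these assumptions one packet holds at most 72 records, so 510 groups need at least 8 packets.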