Bug 197059 - network locks up with IPv6 udp traffic
Summary: network locks up with IPv6 udp traffic
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 10.0-STABLE
Hardware: Any Any
Importance: --- Affects Only Me
Assignee: Andrey V. Elsukov
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-01-24 23:16 UTC by Dmitry Sivachenko
Modified: 2015-03-12 09:21 UTC

See Also:


Attachments
On output path send IPV6_PATHMTU ancillary data only to the socket, that had initiated an error (1.47 KB, patch)
2015-02-19 16:16 UTC, Andrey V. Elsukov
Description Dmitry Sivachenko freebsd_committer freebsd_triage 2015-01-24 23:16:23 UTC
Hello!

I am using FreeBSD-10/stable.  We have a program at work that transmits data via UDP.
When I run several instances of this program simultaneously, the network stops working after a few seconds.
If I log in from the console, I see that some network daemons, such as ntpd and snmpd, are in the "*udp" state.

If I try to work with a network interface (ifconfig igb0, for instance), the ifconfig utility gets stuck in the "L" state (which marks a process that is waiting to acquire a lock).
The only fix I have found is a reboot.

What could cause this behaviour?

lock order reversal:
1st 0xffffffff80ea5588 pcbinfohash (pcbinfohash) @ /opt/WRK/src/sys/netinet6/udp6_usrreq.c:1202
2nd 0xffffffff80ea5530 udp (udp) @ /opt/WRK/src/sys/netinet6/in6_pcb.c:614
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0c581c1270
kdb_backtrace() at kdb_backtrace+0x39/frame 0xfffffe0c581c1320
witness_checkorder() at witness_checkorder+0xc04/frame 0xfffffe0c581c13a0
_rw_wlock_cookie() at _rw_wlock_cookie+0x45/frame 0xfffffe0c581c13e0
in6_pcbnotify() at in6_pcbnotify+0x12e/frame 0xfffffe0c581c1480
udp6_common_ctlinput() at udp6_common_ctlinput+0x111/frame 0xfffffe0c581c14e0
pfctlinput2() at pfctlinput2+0x7d/frame 0xfffffe0c581c1520
ip6_output() at ip6_output+0x15b8/frame 0xfffffe0c581c1780
udp6_send() at udp6_send+0x75c/frame 0xfffffe0c581c1910
sosend_dgram() at sosend_dgram+0x30b/frame 0xfffffe0c581c1980
kern_sendit() at kern_sendit+0x191/frame 0xfffffe0c581c1a30
sendit() at sendit+0x225/frame 0xfffffe0c581c1a80
sys_sendmsg() at sys_sendmsg+0x68/frame 0xfffffe0c581c1ae0
amd64_syscall() at amd64_syscall+0x244/frame 0xfffffe0c581c1bf0
Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe0c581c1bf0
--- syscall (28, FreeBSD ELF64, sys_sendmsg), rip = 0x803ec5eaa, rsp = 0x7fffdf5f6848, rbp = 0x7fffdf5f6880 ---


lock order reversal:
1st 0xffffffff80ea5588 pcbinfohash (pcbinfohash) @ /opt/WRK/src/sys/netinet6/udp6_usrreq.c:1202
2nd 0xffffffff80ea52d8 tcp (tcp) @ /opt/WRK/src/sys/netinet6/in6_pcb.c:614
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0c581c1240
kdb_backtrace() at kdb_backtrace+0x39/frame 0xfffffe0c581c12f0
witness_checkorder() at witness_checkorder+0xc04/frame 0xfffffe0c581c1370
_rw_wlock_cookie() at _rw_wlock_cookie+0x45/frame 0xfffffe0c581c13b0
in6_pcbnotify() at in6_pcbnotify+0x12e/frame 0xfffffe0c581c1450
tcp6_ctlinput() at tcp6_ctlinput+0x1a5/frame 0xfffffe0c581c14e0
pfctlinput2() at pfctlinput2+0x7d/frame 0xfffffe0c581c1520
ip6_output() at ip6_output+0x15b8/frame 0xfffffe0c581c1780
udp6_send() at udp6_send+0x75c/frame 0xfffffe0c581c1910
sosend_dgram() at sosend_dgram+0x30b/frame 0xfffffe0c581c1980
kern_sendit() at kern_sendit+0x191/frame 0xfffffe0c581c1a30
sendit() at sendit+0x225/frame 0xfffffe0c581c1a80
sys_sendmsg() at sys_sendmsg+0x68/frame 0xfffffe0c581c1ae0
amd64_syscall() at amd64_syscall+0x244/frame 0xfffffe0c581c1bf0
Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe0c581c1bf0
--- syscall (28, FreeBSD ELF64, sys_sendmsg), rip = 0x803ec5eaa, rsp = 0x7fffdf5f6848, rbp = 0x7fffdf5f6880 ---



lock order reversal:
1st 0xffffffff80ea5588 pcbinfohash (pcbinfohash) @ /opt/WRK/src/sys/netinet6/udp6_usrreq.c:1202
2nd 0xffffffff80ea4740 rip (rip) @ /opt/WRK/src/sys/netinet6/in6_pcb.c:614
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0c581c12b0
kdb_backtrace() at kdb_backtrace+0x39/frame 0xfffffe0c581c1360
witness_checkorder() at witness_checkorder+0xc04/frame 0xfffffe0c581c13e0
_rw_wlock_cookie() at _rw_wlock_cookie+0x45/frame 0xfffffe0c581c1420
in6_pcbnotify() at in6_pcbnotify+0x12e/frame 0xfffffe0c581c14c0
rip6_ctlinput() at rip6_ctlinput+0x70/frame 0xfffffe0c581c14e0
pfctlinput2() at pfctlinput2+0x7d/frame 0xfffffe0c581c1520
ip6_output() at ip6_output+0x15b8/frame 0xfffffe0c581c1780
udp6_send() at udp6_send+0x75c/frame 0xfffffe0c581c1910
sosend_dgram() at sosend_dgram+0x30b/frame 0xfffffe0c581c1980
kern_sendit() at kern_sendit+0x191/frame 0xfffffe0c581c1a30
sendit() at sendit+0x225/frame 0xfffffe0c581c1a80
sys_sendmsg() at sys_sendmsg+0x68/frame 0xfffffe0c581c1ae0
amd64_syscall() at amd64_syscall+0x244/frame 0xfffffe0c581c1bf0
Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe0c581c1bf0
--- syscall (28, FreeBSD ELF64, sys_sendmsg), rip = 0x803ec5eaa, rsp = 0x7fffdf5f6848, rbp = 0x7fffdf5f6880 ---
Comment 1 Dmitry Sivachenko freebsd_committer freebsd_triage 2015-01-30 09:56:43 UTC
For us this is a rather severe problem (it takes about 10 seconds to leave the machine without a working network).

If these LORs are not enough to debug this issue, I am more than willing to provide any necessary info, please ask.
Comment 2 Robert Watson freebsd_committer freebsd_triage 2015-02-01 09:55:10 UTC
Some notes from an e-mail from myself to Andrey about this problem:

Basically, the general rule, with respect to lock order, is that the network-stack input path can call the output path (e.g., an inbound TCP segment triggers an immediate send of a TCP ACK), but you can't directly call in the other direction or you would violate the lock order. In this sort of situation, the output path needs to reinject packets via the netisr, rather than directly invoking the input path. This is how we handle, for example, routing-socket packets triggered by send events -- they are enqueued to the netisr for processing asynchronously, providing a context where transmit-path locks can be safely acquired.

(Basically, pfctlinput() is never safe to call from the transmit path.)
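
The rule above can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not kernel code: `pcb_lock`, `send_datagram()`, `output_path_report()` and `dispatch_pending()` are hypothetical names, and a plain array stands in for the netisr queue. The point is that the sending side, which already holds the PCB lock, only enqueues an event, and the notification handler runs later in a context where no PCB lock is held.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define QLEN 16                      /* capacity of the hypothetical queue */

struct event { unsigned mtu; };      /* "message too big" notification */

static struct event queue[QLEN];
static size_t q_head, q_tail;
static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t pcb_lock = PTHREAD_MUTEX_INITIALIZER; /* stands in for a PCB lock */
static unsigned notified_mtu;        /* state protected by pcb_lock */

/* Output path: called with pcb_lock held, so it only queues the event. */
static int
output_path_report(unsigned mtu)
{
    int ok = 0;

    pthread_mutex_lock(&q_lock);
    if (q_tail - q_head < QLEN) {
        queue[q_tail % QLEN].mtu = mtu;
        q_tail++;
        ok = 1;
    }
    pthread_mutex_unlock(&q_lock);
    return ok;
}

/* Deferred context: holds no PCB lock on entry, so it may take one safely. */
static int
dispatch_pending(void)
{
    int handled = 0;

    for (;;) {
        struct event ev;

        pthread_mutex_lock(&q_lock);
        if (q_head == q_tail) {
            pthread_mutex_unlock(&q_lock);
            break;
        }
        ev = queue[q_head % QLEN];
        q_head++;
        pthread_mutex_unlock(&q_lock);

        pthread_mutex_lock(&pcb_lock);
        notified_mtu = ev.mtu;
        pthread_mutex_unlock(&pcb_lock);
        handled++;
    }
    return handled;
}

/* Sender: takes the PCB lock, detects an oversized datagram, defers the notice. */
static int
send_datagram(unsigned len, unsigned link_mtu)
{
    int queued = 0;

    pthread_mutex_lock(&pcb_lock);
    if (len > link_mtu)
        queued = output_path_report(link_mtu);
    pthread_mutex_unlock(&pcb_lock);
    return queued;
}
```

Calling `send_datagram(2000, 1500)` only queues the event; `notified_mtu` changes once `dispatch_pending()` runs. Had the sender invoked the handler directly while holding `pcb_lock`, the handler's own attempt to take that lock would block forever with a default (non-recursive) mutex.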
Comment 3 Robert Watson freebsd_committer freebsd_triage 2015-02-01 09:56:45 UTC
A further note on the problem:

A good question is whether the current behaviour actually makes sense: do we really need to notify all sockets of a change in MTU discovered by one socket on transmit? Or can we just let the other sockets discover the change on demand as they next try to transmit?

(I don't take a strong view on the answer, except to point out that it would be simpler if, as in IPv4, we didn't try to notify all sockets of the event.)
Comment 4 Andrey V. Elsukov freebsd_committer freebsd_triage 2015-02-19 16:16:23 UTC
Created attachment 153179 [details]
On output path send IPV6_PATHMTU ancillary data only to the socket, that had initiated an error

(In reply to Robert Watson from comment #3)
> A further note on the problem:
> 
> A good question is whether the current behaviour actually makes sense: do we
> really need to notify all sockets of a change in MTU discovered by one
> socket on transmit? Or can we just let the other sockets discover the
> change on demand as they next try to transmit?
> 
> (I don't take a strong view on the answer, except to point out that it would
> be simpler if, as in IPv4, we didn't try to notify all sockets of the event.)

I think this was implemented according to what RFC 3542 says (section 11.3):"
   Note that this also means an application that sets the option may
   receive an IPV6_MTU ancillary data item for each ICMP too big error
   the node receives, including such ICMP errors caused by other
   applications on the node."

But this doesn't mean we should send this ancillary data when the message size exceeds the link MTU. So I propose the attached patch for testing.
Comment 5 Dmitry Sivachenko freebsd_committer freebsd_triage 2015-02-19 16:33:19 UTC
I can confirm this patch fixes my problem.
Comment 6 commit-hook freebsd_committer freebsd_triage 2015-03-04 11:20:26 UTC
A commit references this bug:

Author: ae
Date: Wed Mar  4 11:20:03 UTC 2015
New revision: 279588
URL: https://svnweb.freebsd.org/changeset/base/279588

Log:
  Fix deadlock in IPv6 PCB code.

  When several threads are trying to send datagram to the same destination,
  but fragmentation is disabled and datagram size exceeds link MTU,
  ip6_output() calls pfctlinput2(PRC_MSGSIZE). It does notify all
  sockets wanted to know MTU to this destination. And since all threads
  hold PCB lock while sending, taking the lock for each PCB in the
  in6_pcbnotify() leads to deadlock.

  RFC 3542 p.11.3 suggests notify all application wanted to receive
  IPV6_PATHMTU ancillary data for each ICMPv6 packet too big message.
  But it doesn't require this, when we don't receive ICMPv6 message.

  Change ip6_notify_pmtu() function to be able use it directly from
  ip6_output() to notify only one socket, and to notify all sockets
  when ICMPv6 packet too big message received.

  PR:		197059
  Differential Revision:	https://reviews.freebsd.org/D1949
  Reviewed by:	no objection from #network
  Obtained from:	Yandex LLC
  MFC after:	1 week
  Sponsored by:	Yandex LLC

Changes:
  head/sys/netinet6/in6_pcb.c
  head/sys/netinet6/ip6_input.c
  head/sys/netinet6/ip6_output.c
  head/sys/netinet6/ip6_var.h
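
The notification scoping described in the log above can be modelled in a few lines. This is a rough sketch, not the committed change: `struct sock_model`, `enum pmtu_source` and `notify_pmtu()` are hypothetical names, and the `wants_pathmtu` flag is an assumption standing in for a socket that asked to receive IPV6_PATHMTU ancillary data. A locally detected oversize datagram notifies only the originating socket, while a received ICMPv6 packet too big message notifies every interested socket.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Where the MTU information came from (hypothetical model, not kernel types). */
enum pmtu_source {
    PMTU_LOCAL_SEND,        /* datagram exceeded the link MTU on output */
    PMTU_ICMP6_TOO_BIG      /* ICMPv6 packet too big message was received */
};

struct sock_model {
    bool wants_pathmtu;     /* assumed opt-in for path MTU ancillary data */
    unsigned last_mtu;      /* last MTU value delivered to this socket */
};

/*
 * Deliver an MTU notification and return how many sockets received it.
 * For PMTU_LOCAL_SEND only socks[origin] is considered; for
 * PMTU_ICMP6_TOO_BIG every socket that opted in is notified.
 */
static size_t
notify_pmtu(struct sock_model *socks, size_t n, size_t origin,
    enum pmtu_source src, unsigned mtu)
{
    size_t notified = 0;

    if (src == PMTU_LOCAL_SEND)
        assert(origin < n);

    for (size_t i = 0; i < n; i++) {
        if (src == PMTU_LOCAL_SEND && i != origin)
            continue;       /* other sockets are left alone on output */
        if (!socks[i].wants_pathmtu)
            continue;
        socks[i].last_mtu = mtu;
        notified++;
    }
    return notified;
}
```

In this model the output path touches a single socket, the one whose lock the sending thread already holds, so it never needs to walk and lock the other PCBs.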
Comment 7 commit-hook freebsd_committer freebsd_triage 2015-03-12 09:04:30 UTC
A commit references this bug:

Author: ae
Date: Thu Mar 12 09:04:21 UTC 2015
New revision: 279911
URL: https://svnweb.freebsd.org/changeset/base/279911

Log:
  MFC r279588:
    Fix deadlock in IPv6 PCB code.

    When several threads are trying to send datagram to the same destination,
    but fragmentation is disabled and datagram size exceeds link MTU,
    ip6_output() calls pfctlinput2(PRC_MSGSIZE). It does notify all
    sockets wanted to know MTU to this destination. And since all threads
    hold PCB lock while sending, taking the lock for each PCB in the
    in6_pcbnotify() leads to deadlock.

    RFC 3542 p.11.3 suggests notify all application wanted to receive
    IPV6_PATHMTU ancillary data for each ICMPv6 packet too big message.
    But it doesn't require this, when we don't receive ICMPv6 message.

    Change ip6_notify_pmtu() function to be able use it directly from
    ip6_output() to notify only one socket, and to notify all sockets
    when ICMPv6 packet too big message received.

  MFC r279684:
    tcp6_ctlinput() doesn't pass MTU value to in6_pcbnotify().
    Check cmdarg isn't NULL before dereference, this check was in the
    ip6_notify_pmtu() before r279588.

  PR:		197059
  Sponsored by:	Yandex LLC

Changes:
_U  stable/10/
  stable/10/sys/netinet6/in6_pcb.c
  stable/10/sys/netinet6/ip6_input.c
  stable/10/sys/netinet6/ip6_output.c
  stable/10/sys/netinet6/ip6_var.h
Comment 8 commit-hook freebsd_committer freebsd_triage 2015-03-12 09:17:33 UTC
A commit references this bug:

Author: ae
Date: Thu Mar 12 09:16:52 UTC 2015
New revision: 279912
URL: https://svnweb.freebsd.org/changeset/base/279912

Log:
  MFC r279588:
    Fix deadlock in IPv6 PCB code.

    When several threads are trying to send datagram to the same destination,
    but fragmentation is disabled and datagram size exceeds link MTU,
    ip6_output() calls pfctlinput2(PRC_MSGSIZE). It does notify all
    sockets wanted to know MTU to this destination. And since all threads
    hold PCB lock while sending, taking the lock for each PCB in the
    in6_pcbnotify() leads to deadlock.

    RFC 3542 p.11.3 suggests notify all application wanted to receive
    IPV6_PATHMTU ancillary data for each ICMPv6 packet too big message.
    But it doesn't require this, when we don't receive ICMPv6 message.

    Change ip6_notify_pmtu() function to be able use it directly from
    ip6_output() to notify only one socket, and to notify all sockets
    when ICMPv6 packet too big message received.

  MFC r279684:
    tcp6_ctlinput() doesn't pass MTU value to in6_pcbnotify().
    Check cmdarg isn't NULL before dereference, this check was in the
    ip6_notify_pmtu() before r279588.

  PR:		197059
  Sponsored by:	Yandex LLC

Changes:
_U  stable/9/sys/
  stable/9/sys/netinet6/in6_pcb.c
  stable/9/sys/netinet6/ip6_input.c
  stable/9/sys/netinet6/ip6_output.c
  stable/9/sys/netinet6/ip6_var.h