Hello! I am using FreeBSD-10/stable. We have a program at work that transmits data via UDP. When I run several instances of this program simultaneously, the network stops working after a few seconds. If I log in from the console, I see that some network daemons, such as ntpd and snmpd, are in the "*udp" state. If I try to touch a network interface (ifconfig igb0, for instance), the ifconfig utility gets stuck in the "L" state (which marks a process that is waiting to acquire a lock). The only way I have found to recover is to reboot. What could be the cause of this behaviour?

lock order reversal:
 1st 0xffffffff80ea5588 pcbinfohash (pcbinfohash) @ /opt/WRK/src/sys/netinet6/udp6_usrreq.c:1202
 2nd 0xffffffff80ea5530 udp (udp) @ /opt/WRK/src/sys/netinet6/in6_pcb.c:614
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0c581c1270
kdb_backtrace() at kdb_backtrace+0x39/frame 0xfffffe0c581c1320
witness_checkorder() at witness_checkorder+0xc04/frame 0xfffffe0c581c13a0
_rw_wlock_cookie() at _rw_wlock_cookie+0x45/frame 0xfffffe0c581c13e0
in6_pcbnotify() at in6_pcbnotify+0x12e/frame 0xfffffe0c581c1480
udp6_common_ctlinput() at udp6_common_ctlinput+0x111/frame 0xfffffe0c581c14e0
pfctlinput2() at pfctlinput2+0x7d/frame 0xfffffe0c581c1520
ip6_output() at ip6_output+0x15b8/frame 0xfffffe0c581c1780
udp6_send() at udp6_send+0x75c/frame 0xfffffe0c581c1910
sosend_dgram() at sosend_dgram+0x30b/frame 0xfffffe0c581c1980
kern_sendit() at kern_sendit+0x191/frame 0xfffffe0c581c1a30
sendit() at sendit+0x225/frame 0xfffffe0c581c1a80
sys_sendmsg() at sys_sendmsg+0x68/frame 0xfffffe0c581c1ae0
amd64_syscall() at amd64_syscall+0x244/frame 0xfffffe0c581c1bf0
Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe0c581c1bf0
--- syscall (28, FreeBSD ELF64, sys_sendmsg), rip = 0x803ec5eaa, rsp = 0x7fffdf5f6848, rbp = 0x7fffdf5f6880 ---

lock order reversal:
 1st 0xffffffff80ea5588 pcbinfohash (pcbinfohash) @ /opt/WRK/src/sys/netinet6/udp6_usrreq.c:1202
 2nd 0xffffffff80ea52d8 tcp (tcp) @ /opt/WRK/src/sys/netinet6/in6_pcb.c:614
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0c581c1240
kdb_backtrace() at kdb_backtrace+0x39/frame 0xfffffe0c581c12f0
witness_checkorder() at witness_checkorder+0xc04/frame 0xfffffe0c581c1370
_rw_wlock_cookie() at _rw_wlock_cookie+0x45/frame 0xfffffe0c581c13b0
in6_pcbnotify() at in6_pcbnotify+0x12e/frame 0xfffffe0c581c1450
tcp6_ctlinput() at tcp6_ctlinput+0x1a5/frame 0xfffffe0c581c14e0
pfctlinput2() at pfctlinput2+0x7d/frame 0xfffffe0c581c1520
ip6_output() at ip6_output+0x15b8/frame 0xfffffe0c581c1780
udp6_send() at udp6_send+0x75c/frame 0xfffffe0c581c1910
sosend_dgram() at sosend_dgram+0x30b/frame 0xfffffe0c581c1980
kern_sendit() at kern_sendit+0x191/frame 0xfffffe0c581c1a30
sendit() at sendit+0x225/frame 0xfffffe0c581c1a80
sys_sendmsg() at sys_sendmsg+0x68/frame 0xfffffe0c581c1ae0
amd64_syscall() at amd64_syscall+0x244/frame 0xfffffe0c581c1bf0
Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe0c581c1bf0
--- syscall (28, FreeBSD ELF64, sys_sendmsg), rip = 0x803ec5eaa, rsp = 0x7fffdf5f6848, rbp = 0x7fffdf5f6880 ---

lock order reversal:
 1st 0xffffffff80ea5588 pcbinfohash (pcbinfohash) @ /opt/WRK/src/sys/netinet6/udp6_usrreq.c:1202
 2nd 0xffffffff80ea4740 rip (rip) @ /opt/WRK/src/sys/netinet6/in6_pcb.c:614
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0c581c12b0
kdb_backtrace() at kdb_backtrace+0x39/frame 0xfffffe0c581c1360
witness_checkorder() at witness_checkorder+0xc04/frame 0xfffffe0c581c13e0
_rw_wlock_cookie() at _rw_wlock_cookie+0x45/frame 0xfffffe0c581c1420
in6_pcbnotify() at in6_pcbnotify+0x12e/frame 0xfffffe0c581c14c0
rip6_ctlinput() at rip6_ctlinput+0x70/frame 0xfffffe0c581c14e0
pfctlinput2() at pfctlinput2+0x7d/frame 0xfffffe0c581c1520
ip6_output() at ip6_output+0x15b8/frame 0xfffffe0c581c1780
udp6_send() at udp6_send+0x75c/frame 0xfffffe0c581c1910
sosend_dgram() at sosend_dgram+0x30b/frame 0xfffffe0c581c1980
kern_sendit() at kern_sendit+0x191/frame 0xfffffe0c581c1a30
sendit() at sendit+0x225/frame 0xfffffe0c581c1a80
sys_sendmsg() at sys_sendmsg+0x68/frame 0xfffffe0c581c1ae0
amd64_syscall() at amd64_syscall+0x244/frame 0xfffffe0c581c1bf0
Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe0c581c1bf0
--- syscall (28, FreeBSD ELF64, sys_sendmsg), rip = 0x803ec5eaa, rsp = 0x7fffdf5f6848, rbp = 0x7fffdf5f6880 ---

For us this is a rather severe problem (it takes about 10 seconds to leave the machine without a working network). If these LORs are not enough to debug this issue, I am more than willing to provide any necessary info; please ask.
Some notes from an e-mail I sent to Andrey about this problem:

Basically, the general rule, with respect to lock order, is that the network-stack input path can call the output path (e.g., an inbound TCP segment triggers an immediate send of a TCP ACK), but you can't directly call in the other direction, or you would violate the lock order. In this sort of situation, the output path needs to reinject packets via the netisr rather than directly invoking the input path. This is how we handle, for example, routing-socket packets triggered by send events -- they are enqueued to the netisr for asynchronous processing, which provides a context where transmit-path locks can be safely acquired. (Basically, pfctlinput() is never safe to call from the transmit path.)
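
The rule above can be illustrated with a small user-space toy model. This is only a sketch, not kernel code: `OrderChecker` is a hypothetical stand-in for WITNESS, the two lock names are borrowed from the LOR report ("udp" and "pcbinfohash"), and the functions `input_path`, `output_direct`, `output_deferred` and `drain` are invented for the illustration. The deque plays the role of the netisr queue.

```python
from collections import deque


class OrderChecker:
    """Toy stand-in for WITNESS: remembers which lock was taken before which."""

    def __init__(self):
        self.held = []       # locks currently held, in acquisition order
        self.order = set()   # (first, second) pairs observed so far
        self.reversals = []  # (1st, 2nd) pairs that contradict an earlier order

    def acquire(self, name):
        for h in self.held:
            if (name, h) in self.order:
                # We hold h and now want name, but name -> h was seen before.
                self.reversals.append((h, name))
            self.order.add((h, name))
        self.held.append(name)

    def release(self, name):
        self.held.remove(name)


def input_path(w):
    # Input-side processing in this model: "udp" first, then "pcbinfohash".
    w.acquire("udp")
    w.acquire("pcbinfohash")
    w.release("pcbinfohash")
    w.release("udp")


def output_direct(w):
    # Transmit path holds "pcbinfohash" and calls the notifier directly,
    # which then wants "udp": the opposite order.
    w.acquire("pcbinfohash")
    w.acquire("udp")
    w.release("udp")
    w.release("pcbinfohash")


def output_deferred(w, queue):
    # Transmit path only records the event while holding its lock.
    w.acquire("pcbinfohash")
    queue.append("msgsize")
    w.release("pcbinfohash")


def drain(w, queue):
    # Later, with no transmit-path locks held, process the queued events
    # in the normal input-side lock order.
    while queue:
        queue.popleft()
        input_path(w)
```

In this model, running `input_path` followed by `output_direct` records a reversal with "pcbinfohash" held first and "udp" requested second, the same shape as the first report above, while `output_deferred` followed by `drain` records none.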
A further note on the problem:

A good question is whether the current behaviour actually makes sense: do we really need to notify all sockets of a change in MTU discovered by one socket on transmit? Or can we just let the other sockets discover the change on demand as they next try to transmit?

(I don't take a strong view on the answer, except to point out that it would be simpler if, as in IPv4, we didn't try to notify all sockets of the event.)
Created attachment 153179
On output path send IPV6_PATHMTU ancillary data only to the socket, that had initiated an error

(In reply to Robert Watson from comment #3)
> A further note on the problem:
>
> A good question is whether the current behaviour actually makes sense: do we
> really need to notify all sockets of a change in MTU discovered by one
> socket on transmit? Or can we just let the other sockets discover the
> change on demand as they next try to transmit?
>
> (I don't take a strong view on the answer, except to point out that it would
> be simpler if, as in IPv4, we didn't try to notify all sockets of the event.)

I think this was implemented according to what RFC 3542 says (section 11.3):

"Note that this also means an application that sets the option may receive an IPV6_MTU ancillary data item for each ICMP too big error the node receives, including such ICMP errors caused by other applications on the node."

But this doesn't mean we should send this ancillary data when the message size exceeds the link MTU. So I propose the attached patch for testing.
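
The patch itself is in the attachment. Separately, for readers who have not seen the ancillary data being discussed: RFC 3542 delivers the path MTU to the application as a `struct ip6_mtuinfo`, which is a `struct sockaddr_in6` (`ip6m_addr`) followed by a 32-bit MTU in host byte order (`ip6m_mtu`). The following is a minimal sketch of decoding such a payload in user space. The function name `parse_ip6_mtuinfo` is hypothetical, and the sketch assumes a 28-byte `sockaddr_in6` with the address at bytes 8..23 and the scope ID at bytes 24..27, which holds on FreeBSD and Linux but should be checked on other platforms.

```python
import ipaddress
import struct

# Assumption: sizeof(struct sockaddr_in6) == 28 on the target platform.
SOCKADDR_IN6_LEN = 28


def parse_ip6_mtuinfo(data):
    """Decode the bytes of a struct ip6_mtuinfo into (address, scope_id, mtu)."""
    if len(data) < SOCKADDR_IN6_LEN + 4:
        raise ValueError("buffer too short for struct ip6_mtuinfo")
    # sin6_addr: 16 bytes in network order at offset 8.
    addr = ipaddress.IPv6Address(data[8:24])
    # sin6_scope_id: 32-bit value in host byte order at offset 24.
    (scope_id,) = struct.unpack_from("=I", data, 24)
    # ip6m_mtu: 32-bit value in host byte order right after the sockaddr.
    (mtu,) = struct.unpack_from("=I", data, SOCKADDR_IN6_LEN)
    return str(addr), scope_id, mtu
```

An application would obtain these bytes from the control message of type IPV6_PATHMTU returned by recvmsg() after enabling IPV6_RECVPATHMTU on the socket; that part is omitted here because it depends on the platform exposing those constants.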
I can confirm this patch fixes my problem.
A commit references this bug:

Author: ae
Date: Wed Mar 4 11:20:03 UTC 2015
New revision: 279588
URL: https://svnweb.freebsd.org/changeset/base/279588

Log:
  Fix deadlock in IPv6 PCB code.

  When several threads are trying to send datagram to the same destination,
  but fragmentation is disabled and datagram size exceeds link MTU,
  ip6_output() calls pfctlinput2(PRC_MSGSIZE). It does notify all
  sockets wanted to know MTU to this destination. And since all threads
  hold PCB lock while sending, taking the lock for each PCB in the
  in6_pcbnotify() leads to deadlock.

  RFC 3542 p.11.3 suggests notify all application wanted to receive
  IPV6_PATHMTU ancillary data for each ICMPv6 packet too big message.
  But it doesn't require this, when we don't receive ICMPv6 message.

  Change ip6_notify_pmtu() function to be able use it directly from
  ip6_output() to notify only one socket, and to notify all sockets
  when ICMPv6 packet too big message received.

  PR: 197059
  Differential Revision: https://reviews.freebsd.org/D1949
  Reviewed by: no objection from #network
  Obtained from: Yandex LLC
  MFC after: 1 week
  Sponsored by: Yandex LLC

Changes:
  head/sys/netinet6/in6_pcb.c
  head/sys/netinet6/ip6_input.c
  head/sys/netinet6/ip6_output.c
  head/sys/netinet6/ip6_var.h
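
The difference between notifying every socket and notifying only the sender, as described in the log above, can be sketched with a user-space toy model. This is an illustration under stated assumptions, not the kernel implementation: `Pcb`, `notify_all` and `notify_one` are hypothetical names, each `Pcb` carries an ordinary `threading.Lock`, and a non-blocking acquire is used to report where a real thread would have had to wait.

```python
import threading


class Pcb:
    """Toy protocol control block: a lock plus a list of delivered MTU events."""

    def __init__(self, name):
        self.name = name
        self.lock = threading.Lock()
        self.mtu_events = []


def notify_all(sender, pcbs, mtu):
    """Model of the old behaviour: visit every PCB and lock each one.

    The caller is assumed to hold sender.lock already. Returns the names of
    PCBs whose lock could not be taken, i.e. where a thread would block.
    """
    blocked = []
    for p in pcbs:
        if p is sender:
            p.mtu_events.append(mtu)
        elif p.lock.acquire(blocking=False):
            p.mtu_events.append(mtu)
            p.lock.release()
        else:
            blocked.append(p.name)
    return blocked


def notify_one(sender, mtu):
    """Model of the new behaviour: deliver only to the sending PCB.

    The caller is assumed to hold sender.lock, so no other lock is needed.
    """
    sender.mtu_events.append(mtu)
    return []
```

If two senders `a` and `b` are both in the middle of a send (both locks held), `notify_all(a, [a, b], mtu)` reports `b` as blocked, and the symmetric call from `b` would report `a`; with real blocking locks each thread would wait on the other. `notify_one` never touches another PCB's lock, so that wait cannot occur in the model.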
A commit references this bug:

Author: ae
Date: Thu Mar 12 09:04:21 UTC 2015
New revision: 279911
URL: https://svnweb.freebsd.org/changeset/base/279911

Log:
  MFC r279588:
  Fix deadlock in IPv6 PCB code.

  When several threads are trying to send datagram to the same destination,
  but fragmentation is disabled and datagram size exceeds link MTU,
  ip6_output() calls pfctlinput2(PRC_MSGSIZE). It does notify all
  sockets wanted to know MTU to this destination. And since all threads
  hold PCB lock while sending, taking the lock for each PCB in the
  in6_pcbnotify() leads to deadlock.

  RFC 3542 p.11.3 suggests notify all application wanted to receive
  IPV6_PATHMTU ancillary data for each ICMPv6 packet too big message.
  But it doesn't require this, when we don't receive ICMPv6 message.

  Change ip6_notify_pmtu() function to be able use it directly from
  ip6_output() to notify only one socket, and to notify all sockets
  when ICMPv6 packet too big message received.

  MFC r279684:
  tcp6_ctlinput() doesn't pass MTU value to in6_pcbnotify().
  Check cmdarg isn't NULL before dereference, this check was in the
  ip6_notify_pmtu() before r279588.

  PR: 197059
  Sponsored by: Yandex LLC

Changes:
_U  stable/10/
  stable/10/sys/netinet6/in6_pcb.c
  stable/10/sys/netinet6/ip6_input.c
  stable/10/sys/netinet6/ip6_output.c
  stable/10/sys/netinet6/ip6_var.h
A commit references this bug:

Author: ae
Date: Thu Mar 12 09:16:52 UTC 2015
New revision: 279912
URL: https://svnweb.freebsd.org/changeset/base/279912

Log:
  MFC r279588:
  Fix deadlock in IPv6 PCB code.

  When several threads are trying to send datagram to the same destination,
  but fragmentation is disabled and datagram size exceeds link MTU,
  ip6_output() calls pfctlinput2(PRC_MSGSIZE). It does notify all
  sockets wanted to know MTU to this destination. And since all threads
  hold PCB lock while sending, taking the lock for each PCB in the
  in6_pcbnotify() leads to deadlock.

  RFC 3542 p.11.3 suggests notify all application wanted to receive
  IPV6_PATHMTU ancillary data for each ICMPv6 packet too big message.
  But it doesn't require this, when we don't receive ICMPv6 message.

  Change ip6_notify_pmtu() function to be able use it directly from
  ip6_output() to notify only one socket, and to notify all sockets
  when ICMPv6 packet too big message received.

  MFC r279684:
  tcp6_ctlinput() doesn't pass MTU value to in6_pcbnotify().
  Check cmdarg isn't NULL before dereference, this check was in the
  ip6_notify_pmtu() before r279588.

  PR: 197059
  Sponsored by: Yandex LLC

Changes:
_U  stable/9/sys/
  stable/9/sys/netinet6/in6_pcb.c
  stable/9/sys/netinet6/ip6_input.c
  stable/9/sys/netinet6/ip6_output.c
  stable/9/sys/netinet6/ip6_var.h