Hi,

I experienced a SIGSEGV in unbound today after upgrading from 1.12.0 to 1.13.0 yesterday.

> Dec 13 16:25:43 meterkast5 kernel: pid 22051 (unbound), jid 0, uid 59: exited on signal 11

This was also reported by Tatsuki Makino as a comment in #251563.
As reported by Tatsuki:

It is a little unstable in my environment. When I ran the following command, the following failure occurred.

command:

    lldb -o r -- /usr/local/sbin/unbound -d -c /usr/local/etc/unbound/unbound.conf

error:

    * thread #2, name = 'unbound', stop reason = signal SIGSEGV: invalid address (fault address: 0x8)
        frame #0: 0x000000080042397f libevent-2.1.so.7`evmap_io_del_ + 63
    libevent-2.1.so.7`evmap_io_del_:
    ->  0x80042397f <+63>: movzwl 0x8(%r14), %r15d
        0x800423984 <+68>: xorl   %eax, %eax
        0x800423986 <+70>: testl  %r15d, %r15d
        0x800423989 <+73>: setne  %al

If there are others who experience this problem, we can continue elsewhere :)
I upgraded to unbound 1.13.0 a few days ago:

    Dec 4 17:35:53 localhost pkg-static[54682]: unbound-1.13.0 installed

This morning I had the same issue:

    Dec 13 09:46:17 localhost unbound[92443]: [92443:0] warning: did not exit gracefully last time (2919)

Uptime is about five days (after upgrading stable/12 from r368274 to r368463).
Probably still a guess, but the workaround is to run "pkg delete -f unbound" before building unbound-1.13.0.

The reason I think so is that the problem cannot be reproduced with a package built by poudriere, and poudriere leaves the following in its log:

    ./libtool --tag=CC --mode=compile cc -I. -I/usr/local/include -I/usr/local/include -I/usr/local/include -I/usr/local/include -DSRCDIR=. -O2 -pipe -fstack-protector-strong -fno-strict-aliasing -D_THREAD_SAFE -pthread -o context.lo -c ./libunbound/context.c
    libtool: compile: cc -I. -I/usr/local/include -I/usr/local/include -I/usr/local/include -I/usr/local/include -DSRCDIR=. -O2 -pipe -fstack-protector-strong -fno-strict-aliasing -D_THREAD_SAFE -pthread -c ./libunbound/context.c -fPIC -DPIC -o .libs/context.o
    libtool: compile: cc -I. -I/usr/local/include -I/usr/local/include -I/usr/local/include -I/usr/local/include -DSRCDIR=. -O2 -pipe -fstack-protector-strong -fno-strict-aliasing -D_THREAD_SAFE -pthread -c ./libunbound/context.c -o context.o >/dev/null 2>&1

Judging from the layout of the distfile, the working directory is ${WRKSRC}, and the include search order is "." first and then /usr/local/include. If unbound.h is needed at that point, the copy in /usr/local/include (i.e. the header installed by the previously installed unbound package) is likely the one that gets used. But I don't know for sure; I haven't even looked at the source :)

If you already have 1.13.0 installed, you may be able to fix it by building it again. But I haven't tried that yet :)
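A minimal sketch of the suggested workaround, assuming the port is built from the standard ports tree location (/usr/ports/dns/unbound); adjust paths for your own setup:

    # Remove the old package first so its stale /usr/local/include/unbound.h
    # cannot leak into the new build.
    pkg delete -f unbound

    # Rebuild and reinstall the port from a clean work directory.
    cd /usr/ports/dns/unbound
    make clean
    make install clean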
(In reply to Tatsuki Makino from comment #3)

I was able to test this prediction now. unbound-1.13.0 built without deinstalling unbound-1.12.0 does cause this problem, while unbound-1.13.0 built in an environment where unbound-1.13.0 is already installed does not. This is probably because, in the latter case, the /usr/local/include/unbound.h installed by unbound-1.13.0 is the one that gets used.
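If it helps to verify which header a build would pick up, one possibility is the following sketch (it assumes the installed unbound.h exposes UNBOUND_VERSION_* macros, which recent releases do):

    # Which package installed the header currently in /usr/local/include?
    pkg which /usr/local/include/unbound.h

    # What version does that header claim to be?
    grep 'UNBOUND_VERSION' /usr/local/include/unbound.h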
(In reply to Tatsuki Makino from comment #4)

Comment #4 is not correct :) Something in 1.13.0 still fails. Here is the backtrace from a debug build:

    (lldb) bt
    * thread #2, name = 'unbound', stop reason = signal SIGABRT
      * frame #0: 0x000000080098dfda libc.so.7`__sys_thr_kill + 10
        frame #1: 0x000000080098c424 libc.so.7`__raise + 52
        frame #2: 0x0000000800900389 libc.so.7`abort + 73
        frame #3: 0x000000080048a511 libevent-2.1.so.7`___lldb_unnamed_symbol83$$libevent-2.1.so.7 + 33
        frame #4: 0x000000080048a821 libevent-2.1.so.7`event_errx + 161
        frame #5: 0x0000000800484adf libevent-2.1.so.7`evmap_io_del_ + 415
        frame #6: 0x000000080047e050 libevent-2.1.so.7`event_del_nolock_ + 640
        frame #7: 0x0000000800480dd8 libevent-2.1.so.7`event_del + 56
        frame #8: 0x0000000000340998 unbound`ub_event_del(ev=0x00000008015b2600) at ub_event.c:395:9
        frame #9: 0x0000000000329d14 unbound`comm_point_close(c=0x00000008015ab800) at netevent.c:3819:6
        frame #10: 0x00000000003399e0 unbound`decommission_pending_tcp(outnet=0x00000008015f8000, pend=0x00000008018c9140) at outside_network.c:916:2
        frame #11: 0x00000000003377d1 unbound`reuse_cb_and_decommission(outnet=0x00000008015f8000, pend=0x00000008018c9140, error=-2) at outside_network.c:956:2
        frame #12: 0x0000000000339fae unbound`outnet_tcptimer(arg=0x00000008015f8f00) at outside_network.c:1965:3
        frame #13: 0x000000000032de90 unbound`comm_timer_callback(fd=-1, event=1, arg=0x0000000801488b00) at netevent.c:4129:2
        frame #14: 0x0000000800483fa1 libevent-2.1.so.7`___lldb_unnamed_symbol65$$libevent-2.1.so.7 + 785
        frame #15: 0x000000080048011c libevent-2.1.so.7`event_base_loop + 1404
        frame #16: 0x0000000000340718 unbound`ub_event_base_dispatch(base=0x000000080148b000) at ub_event.c:280:9
        frame #17: 0x0000000000327bdb unbound`comm_base_dispatch(b=0x0000000801486000) at netevent.c:246:11
        frame #18: 0x000000000026d279 unbound`worker_work(worker=0x0000000800ded800) at worker.c:1941:2
        frame #19: 0x0000000000257eef unbound`thread_start(arg=0x0000000800ded800) at daemon.c:540:2
        frame #20: 0x00000008007abfda libthr.so.3`___lldb_unnamed_symbol1$$libthr.so.3 + 346
Backtrace stopped by SIGSEGV:

    * thread #2, name = 'unbound', stop reason = signal SIGSEGV: invalid address (fault address: 0x8)
        frame #0: 0x000000080048497f libevent-2.1.so.7`evmap_io_del_ + 63
    libevent-2.1.so.7`evmap_io_del_:
    ->  0x80048497f <+63>: movzwl 0x8(%r14), %r15d
        0x800484984 <+68>: xorl   %eax, %eax
        0x800484986 <+70>: testl  %r15d, %r15d
        0x800484989 <+73>: setne  %al
    (lldb) bt
    * thread #2, name = 'unbound', stop reason = signal SIGSEGV: invalid address (fault address: 0x8)
      * frame #0: 0x000000080048497f libevent-2.1.so.7`evmap_io_del_ + 63
        frame #1: 0x000000080047e050 libevent-2.1.so.7`event_del_nolock_ + 640
        frame #2: 0x0000000800480dd8 libevent-2.1.so.7`event_del + 56
        frame #3: 0x0000000000340998 unbound`ub_event_del(ev=0x00000008015b1600) at ub_event.c:395:9
        frame #4: 0x0000000000329d14 unbound`comm_point_close(c=0x00000008015ab800) at netevent.c:3819:6
        frame #5: 0x00000000003399e0 unbound`decommission_pending_tcp(outnet=0x00000008015f7000, pend=0x00000008018c9140) at outside_network.c:916:2
        frame #6: 0x00000000003377d1 unbound`reuse_cb_and_decommission(outnet=0x00000008015f7000, pend=0x00000008018c9140, error=-2) at outside_network.c:956:2
        frame #7: 0x0000000000339fae unbound`outnet_tcptimer(arg=0x0000000800d49d00) at outside_network.c:1965:3
        frame #8: 0x000000000032de90 unbound`comm_timer_callback(fd=-1, event=1, arg=0x0000000801489100) at netevent.c:4129:2
        frame #9: 0x0000000800483fa1 libevent-2.1.so.7`___lldb_unnamed_symbol65$$libevent-2.1.so.7 + 785
        frame #10: 0x000000080048011c libevent-2.1.so.7`event_base_loop + 1404
        frame #11: 0x0000000000340718 unbound`ub_event_base_dispatch(base=0x000000080148b000) at ub_event.c:280:9
        frame #12: 0x0000000000327bdb unbound`comm_base_dispatch(b=0x0000000801486000) at netevent.c:246:11
        frame #13: 0x000000000026d279 unbound`worker_work(worker=0x0000000800ded800) at worker.c:1941:2
        frame #14: 0x0000000000257eef unbound`thread_start(arg=0x0000000800ded800) at daemon.c:540:2
        frame #15: 0x00000008007abfda libthr.so.3`___lldb_unnamed_symbol1$$libthr.so.3 + 346
I'm beginning to understand the conditions that cause this :)

When unbound is restricted to TCP with do-udp: no and similar settings, it creates a large number of TCP connections in the TIME_WAIT state. If "service unbound restart" is run at that point, the crash is more likely to occur. If unbound is started only after the TCP connections in the TIME_WAIT state are gone, it is unlikely to occur.
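A minimal sketch of one way to check for lingering TIME_WAIT connections before restarting (plain FreeBSD commands):

    # Count TCP connections still in TIME_WAIT; restart only when this is low or zero.
    netstat -an -p tcp | grep -c TIME_WAIT

    # Then restart the service.
    service unbound restart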
I'm seeing multiple SIGSEGVs per day, but with a different backtrace:

    Program received signal SIGSEGV, Segmentation fault.
    log_addr (v=VERB_OPS, str=0x541f2b "remote address is", addr=0x0, addrlen=0) at net_help.c:198
    198     net_help.c: No such file or directory.
            in net_help.c
    Current language:  auto; currently minimal
    (gdb) bt
    #0  log_addr (v=VERB_OPS, str=0x541f2b "remote address is", addr=0x0, addrlen=0) at net_help.c:198
    #1  0x00000000004e6dc7 in comm_point_send_udp_msg (c=0x802c64500, packet=0x80227e810, addr=0x0, addrlen=0) at netevent.c:382
    #2  0x00000000004fbe86 in randomize_and_send_udp (pend=0x8030b3940, packet=0x80227e810, timeout=275) at outside_network.c:1874
    #3  0x00000000004fbd16 in pending_udp_query (sq=0x8030da780, packet=0x80227e810, timeout=275, cb=0x4fe170 <serviced_udp_callback>, cb_arg=0x8030da780) at outside_network.c:1944
    #4  0x00000000004feae6 in serviced_udp_send (sq=0x8030da780, buff=0x80227e810) at outside_network.c:2519
    #5  0x00000000004ff046 in outnet_serviced_query (outnet=0x80221f900, qinfo=0x802ef8980, flags=0, dnssec=32784, want_dnssec=0, nocaps=0, tcp_upstream=0, ssl_upstream=0, tls_auth_name=0x0, addr=0x802ef9188, addrlen=28, zone=0x802ef89f8 "\006google\003com", zonelen=12, qstate=0x802ef8510, callback=0x421600 <worker_handle_service_reply>, callback_arg=0x802ef98d0, buff=0x80227e810, env=0x8023ce530) at outside_network.c:3014
    #6  0x000000000042712a in worker_send_query (qinfo=0x802ef8980, flags=0, dnssec=32784, want_dnssec=0, nocaps=0, addr=0x802ef9188, addrlen=28, zone=0x802ef89f8 "\006google\003com", zonelen=12, ssl_upstream=0, tls_auth_name=0x0, q=0x802ef8510) at worker.c:2001
    #7  0x000000000044358a in processQueryTargets (qstate=0x802ef8510, iq=0x802ef8880, ie=0x80221b100, id=1) at iterator.c:2600
    #8  0x000000000043fb38 in iter_handle (qstate=0x802ef8510, iq=0x802ef8880, ie=0x80221b100, id=1) at iterator.c:3634
    #9  0x000000000043fa53 in process_request (qstate=0x802ef8510, iq=0x802ef8880, ie=0x80221b100, id=1) at iterator.c:3677
    #10 0x000000000043f625 in iter_operate (qstate=0x802ef8510, event=module_event_pass, id=1, outbound=0x0) at iterator.c:3889
    #11 0x0000000000461902 in mesh_run (mesh=0x802229a00, mstate=0x802ef84c0, ev=module_event_pass, e=0x0) at mesh.c:1699
    #12 0x0000000000460c32 in mesh_new_client (mesh=0x802229a00, qinfo=0x7fffffffe340, cinfo=0x0, qflags=256, edns=0x7fffffffe328, rep=0x7fffffffe5f0, qid=29987) at mesh.c:585
    #13 0x0000000000423c93 in worker_handle_request (c=0x802274800, arg=0x8023cd000, error=0, repinfo=0x7fffffffe5f0) at worker.c:1565
    #14 0x00000000004e7d26 in comm_point_udp_callback (fd=7, event=2, arg=0x802274800) at netevent.c:749
    #15 0x000000000048ca42 in handle_select (base=0x802272500, wait=0x7fffffffe858) at mini_event.c:220
    #16 0x000000000048c3de in minievent_base_dispatch (base=0x802272500) at mini_event.c:242
    #17 0x0000000000502788 in ub_event_base_dispatch (base=0x802272500) at ub_event.c:280
    #18 0x00000000004e698b in comm_base_dispatch (b=0x8023c1e60) at netevent.c:246
    #19 0x0000000000427189 in worker_work (worker=0x8023cd000) at worker.c:1941
    #20 0x00000000004105a5 in daemon_fork (daemon=0x80221f000) at daemon.c:700
    #21 0x000000000041fbfc in run_daemon (cfgfile=0x7fffffffed21 "/unbound.conf", cmdline_verbose=0, debug_mode=0, need_pidfile=1) at unbound.c:707
    #22 0x000000000041f6f8 in main (argc=0, argv=0x7fffffffea20) at unbound.c:808
Crash for me too; built from source with the TFOCL, TFOSE, and THREADS options. Very custom config.
https://github.com/NLnetLabs/unbound/issues/376
A fix proposed by delphij is in https://github.com/NLnetLabs/unbound/issues/376. We are a bit thin in the unbound department currently, but they will take a look later today. If anybody wants to test this, do let me know and I'll do an emergency patch to the port ASAP.
I rebooted FreeBSD and unbound was started cleanly by /etc/rc. In that case, the process started at boot keeps working.

If we use "sockstat -4 -6 -c -P tcp,udp -s" to confirm that there are many sockets created by unbound (including ones unbound has already given up) and then restart unbound, this problem is likely to occur.

By the way, services/outside_network.c gained a number of reuse_* functions :) between 1.12.0 and 1.13.0. How does a new process know that a socket created by an old process is still unusable?
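A minimal sketch of that check-then-restart sequence, using only commands already mentioned in this thread:

    # List sockets; look for entries created by unbound and for leftover
    # connections in TIME_WAIT (which no longer have an owning process).
    sockstat -4 -6 -c -P tcp,udp -s

    # Restarting while many such sockets remain appears to trigger the crash.
    service unbound restart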
(In reply to Jaap Akkerhuis from comment #11)

I also experienced an unbound crash this morning:

    Dec 17 06:19:08 phlox kernel: pid 1711 (unbound), jid 0, uid 59: exited on signal 11

I am keenly interested in testing a patch for this issue, and I second Jaap's suggestion to do an emergency patch to the port.
Created attachment 220659 [details] svn diff from upstream Issue 376
Comment on attachment 220659 [details]
svn diff from upstream Issue 376

I was working on the same thing, but you seem to have beaten me to it. Yes, go ahead.
A commit references this bug:

Author: brnrd
Date: Thu Dec 17 09:38:41 UTC 2020
New revision: 558269
URL: https://svnweb.freebsd.org/changeset/ports/558269

Log:
  dns/unbound: SIGSEGV fix

  PR:             251821
  Submitted by:   delphij
  Approved by:    Jaap Akkerhuis (maintainer)
  Obtained from:  https://github.com/NLnetLabs/unbound/issues/376
  MFH:            2020Q4

Changes:
  head/dns/unbound/Makefile
  head/dns/unbound/files/patch-Issue376
A commit references this bug:

Author: brnrd
Date: Thu Dec 17 09:41:33 UTC 2020
New revision: 558271
URL: https://svnweb.freebsd.org/changeset/ports/558271

Log:
  MFH: r558269

  dns/unbound: SIGSEGV fix

  PR:             251821
  Submitted by:   delphij
  Approved by:    Jaap Akkerhuis (maintainer)
  Obtained from:  https://github.com/NLnetLabs/unbound/issues/376

Changes:
_U  branches/2020Q4/
  branches/2020Q4/dns/unbound/Makefile
  branches/2020Q4/dns/unbound/files/patch-Issue376
patch-Issue376 seems to have no effect on servers configured with tcp-upstream: yes or do-udp: no.
(In reply to Tatsuki Makino from comment #18)

Hmm... your issue seems to be different from what I observed on my server. Let me try to reproduce it and find out why.
I can't seem to be able to reproduce your issue. Basically, what I have done is set up unbound.conf with:

    # POTENTIALLY DANGEROUS SETTINGS - THIS IS BEHIND FIREWALL
    interface: 0.0.0.0/0
    interface-automatic: yes
    do-udp: no
    do-tcp: yes
    tcp-upstream: yes
    access-control: 0.0.0.0/0 allow
    log-queries: yes
    log-replies: yes

along with a forward-zone for ".", configured to query my own DNS server on the local network.

I issued multiple queries to the local host:

    dig foo.example.com @127.0.0.1 +tcp

etc., and I did see TIME_WAIT connections to 127.0.0.1:53, but I was unable to crash unbound.

Could you please file a new bug upstream with steps to reproduce so that they can take a look?
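For reference, the forward-zone stanza described above would look something like this (the forward-addr is a placeholder for the local DNS server, not the address actually used):

    forward-zone:
        name: "."
        forward-addr: 192.168.1.1    # hypothetical local DNS server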
(In reply to Xin LI from comment #20)

Yes, upstream also thinks this is a different problem but couldn't reproduce it (yet), so more information is needed and a new GitHub issue with that information would be appreciated.
(In reply to Xin LI from comment #20)

I have an unbound.conf with which I can reproduce it relatively reliably:

    server:
        verbosity: 5
        num-threads: 2
        port: 65353
        interface: 127.0.0.1@65353
        outgoing-range: 64
        outgoing-num-tcp: 2
        #so-reuseport: yes
        cache-max-ttl: 1
        cache-max-negative-ttl: 1
        do-udp: yes
        do-tcp: yes
        tcp-upstream: yes
        #do-daemonize: no
        chroot: ""
        username: ""
        directory: .
        use-syslog: no
        pidfile: unbound.pid

Then run it as "cd /tmp && unbound -d -c /tmp/unbound.conf". It stops as soon as "dig @127.0.0.1 a -p 65353 +tcp example.com." is executed.

In addition, there is a way to prevent it from occurring. First, check "pkg info -r libevent". I have the following:

    libevent-2.1.12:
        thunderbird-78.6.0_1
        unbound-1.13.0_1
        avahi-app-0.8
        nsd-4.3.4
        seamonkey-2.49.4_27
        tmux-3.1c

Next, stop all the programs in the packages listed above. Then use "sockstat -4 -6 -c -P tcp,udp -s" to confirm that all their connections are gone. Finally, once no connections remain, start unbound.
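A minimal sketch of the reproduction sequence just described, assuming the config above has been saved as /tmp/unbound.conf:

    # In one terminal: run unbound in the foreground with the test config.
    cd /tmp && unbound -d -c /tmp/unbound.conf

    # In another terminal: a single TCP query is enough to make it stop.
    dig @127.0.0.1 a -p 65353 +tcp example.com.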
(In reply to Tatsuki Makino from comment #22)

Hi, I have followed the steps:

1. create the unbound.conf in #21
2. confirm that no unbound instance is running
3. run: unbound -d -c /tmp/unbound.conf
4. in a different console, run: "dig @127.0.0.1 a -p 65353 +tcp example.com."

The unbound instance didn't crash. This is the latest FreeBSD main running unbound-1.13.0_1; unbound and libevent were built with WITH_DEBUG. Any idea?
(In reply to Xin LI from comment #23)

When the unbound on port 65353 stops, there is another unbound still running on a different port, and I am using the seamonkey web browser (which also links against libevent) to view other sites. Also, I'm running 12.2-STABLE r368787.