251821 – dns/unbound: SIGSEGV in 1.13.0

Bug 251821 - dns/unbound: SIGSEGV in 1.13.0

Summary: dns/unbound: SIGSEGV in 1.13.0

Status:	Closed FIXED

Alias:	None

Product:	Ports & Packages
Classification:	Unclassified
Component:	Individual Port(s) (show other bugs)
Version:	Latest
Hardware:	Any Any

Importance:	--- Affects Only Me
Assignee:	freebsd-ports-bugs (Nobody)

URL:	https://github.com/NLnetLabs/unbound/...
Keywords:

Depends on:
Blocks:

Reported:	2020-12-13 18:06 UTC by Bernard Spil
Modified:	2020-12-27 22:48 UTC (History)
CC List:	10 users (show)

See Also:

Flags:	bugzilla: maintainer-feedback? (jaap)

Attachments
svn diff from upstream Issue 376 (6.97 KB, patch) 2020-12-17 08:43 UTC, Bernard Spil	brnrd: maintainer-approval?	Details \| Diff
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Bernard Spil freebsd_committer

2020-12-13 18:06:47 UTC

Hi,

Experienced a SIGSEGV in unbound today after upgrading from 1.12.0 to 1.13.0 yesterday.

> Dec 13 16:25:43 meterkast5 kernel: pid 22051 (unbound), jid 0, uid 59: exited on signal 11

This was also reported by Tatsuki Makino as a comment in #251563

Comment 1 Bernard Spil freebsd_committer

2020-12-13 18:07:28 UTC

As reported by Tatsuki:

It is a little unstable in my environment.
When I ran the following command, the following failure occurred.

command:
lldb -o r -- /usr/local/sbin/unbound -d -c /usr/local/etc/unbound/unbound.conf

error:
* thread #2, name = 'unbound', stop reason = signal SIGSEGV: invalid address (fault address: 0x8)
    frame #0: 0x000000080042397f libevent-2.1.so.7`evmap_io_del_ + 63
libevent-2.1.so.7`evmap_io_del_:
->  0x80042397f <+63>: movzwl 0x8(%r14), %r15d
    0x800423984 <+68>: xorl   %eax, %eax
    0x800423986 <+70>: testl  %r15d, %r15d
    0x800423989 <+73>: setne  %al

If there are others who experience this problem, continue elsewhere :)

Comment 2 Herbert J. Skuhra 2020-12-13 18:30:12 UTC

I upgraded to unbound 1.13.0 a few days ago:

Dec  4 17:35:53 localhost pkg-static[54682]: unbound-1.13.0 installed

This morning I had the same issue:

Dec 13 09:46:17 localhost unbound[92443]: [92443:0] warning: did not exit gracefully last time (2919)

Uptime is about five days (after upgrading stable/12 from r368274 to r368463.

Comment 3 Tatsuki Makino 2020-12-14 00:15:51 UTC

Probably still a guess, but the solution is to run pkg delete -f unbound before building unbound-1.13.0.

The reason for this is that it cannot be reproduced in a package made by poudriere.
And poudriere leaves the following log.

./libtool --tag=CC --mode=compile cc -I. -I/usr/local/include  -I/usr/local/include -I/usr/local/include -I/usr/local/include -DSRCDIR=. -O2 -pipe  -fstack-protector-strong -fno-strict-aliasing  -D_THREAD_SAFE -pthread -o context.lo -c ./libunbound/context.c
libtool: compile:  cc -I. -I/usr/local/include -I/usr/local/include -I/usr/local/include -I/usr/local/include -DSRCDIR=. -O2 -pipe -fstack-protector-strong -fno-strict-aliasing -D_THREAD_SAFE -pthread -c ./libunbound/context.c  -fPIC -DPIC -o .libs/context.o
libtool: compile:  cc -I. -I/usr/local/include -I/usr/local/include -I/usr/local/include -I/usr/local/include -DSRCDIR=. -O2 -pipe -fstack-protector-strong -fno-strict-aliasing -D_THREAD_SAFE -pthread -c ./libunbound/context.c -o context.o >/dev/null 2>&1

According to the structure of the contents of the distfile, the working directory is ${WRKSRC}.
The search order for the include file is . , then /usr/local/include.
At this point, if unbound.h is needed, it is likely that the one in /usr/local/include will be used.
But I don't know, I haven't even looked at the source :)

If you already have 1.13.0 installed, you may be able to fix it by building it again. But I haven't tried it yet :)

Comment 4 Tatsuki Makino 2020-12-14 00:38:31 UTC

(In reply to Tatsuki Makino from comment #3)

I was able to test this prediction now.

unbound-1.13.0, built without deinstalling unbound-1.12.0, will cause this problem.

unbound-1.13.0 built in the environment where unbound-1.13.0 is installed does not cause this problem. This is probably because the /usr/local/include/unbound.h installed by unbound-1.13.0 is used.

Comment 5 Tatsuki Makino 2020-12-14 02:09:48 UTC

(In reply to Tatsuki Makino from comment #4)
Comment #4 is not correct :)
1.13.0 still fails something.

Here is the backtrace in the debug version.

(lldb) bt
* thread #2, name = 'unbound', stop reason = signal SIGABRT
  * frame #0: 0x000000080098dfda libc.so.7`__sys_thr_kill + 10
    frame #1: 0x000000080098c424 libc.so.7`__raise + 52
    frame #2: 0x0000000800900389 libc.so.7`abort + 73
    frame #3: 0x000000080048a511 libevent-2.1.so.7`___lldb_unnamed_symbol83$$libevent-2.1.so.7 + 33
    frame #4: 0x000000080048a821 libevent-2.1.so.7`event_errx + 161
    frame #5: 0x0000000800484adf libevent-2.1.so.7`evmap_io_del_ + 415
    frame #6: 0x000000080047e050 libevent-2.1.so.7`event_del_nolock_ + 640
    frame #7: 0x0000000800480dd8 libevent-2.1.so.7`event_del + 56
    frame #8: 0x0000000000340998 unbound`ub_event_del(ev=0x00000008015b2600) at ub_event.c:395:9
    frame #9: 0x0000000000329d14 unbound`comm_point_close(c=0x00000008015ab800) at netevent.c:3819:6
    frame #10: 0x00000000003399e0 unbound`decommission_pending_tcp(outnet=0x00000008015f8000, pend=0x00000008018c9140) at outside_network.c:916:2
    frame #11: 0x00000000003377d1 unbound`reuse_cb_and_decommission(outnet=0x00000008015f8000, pend=0x00000008018c9140, error=-2) at outside_network.c:956:2
    frame #12: 0x0000000000339fae unbound`outnet_tcptimer(arg=0x00000008015f8f00) at outside_network.c:1965:3
    frame #13: 0x000000000032de90 unbound`comm_timer_callback(fd=-1, event=1, arg=0x0000000801488b00) at netevent.c:4129:2
    frame #14: 0x0000000800483fa1 libevent-2.1.so.7`___lldb_unnamed_symbol65$$libevent-2.1.so.7 + 785
    frame #15: 0x000000080048011c libevent-2.1.so.7`event_base_loop + 1404
    frame #16: 0x0000000000340718 unbound`ub_event_base_dispatch(base=0x000000080148b000) at ub_event.c:280:9
    frame #17: 0x0000000000327bdb unbound`comm_base_dispatch(b=0x0000000801486000) at netevent.c:246:11
    frame #18: 0x000000000026d279 unbound`worker_work(worker=0x0000000800ded800) at worker.c:1941:2
    frame #19: 0x0000000000257eef unbound`thread_start(arg=0x0000000800ded800) at daemon.c:540:2
    frame #20: 0x00000008007abfda libthr.so.3`___lldb_unnamed_symbol1$$libthr.so.3 + 346

Comment 6 Tatsuki Makino 2020-12-14 05:39:37 UTC

Backtrace stopped by SIGSEGV.

* thread #2, name = 'unbound', stop reason = signal SIGSEGV: invalid address (fault address: 0x8)                                                               
    frame #0: 0x000000080048497f libevent-2.1.so.7`evmap_io_del_ + 63
libevent-2.1.so.7`evmap_io_del_:
->  0x80048497f <+63>: movzwl 0x8(%r14), %r15d
    0x800484984 <+68>: xorl   %eax, %eax
    0x800484986 <+70>: testl  %r15d, %r15d
    0x800484989 <+73>: setne  %al

(lldb) bt
* thread #2, name = 'unbound', stop reason = signal SIGSEGV: invalid address (fault address: 0x8)
  * frame #0: 0x000000080048497f libevent-2.1.so.7`evmap_io_del_ + 63
    frame #1: 0x000000080047e050 libevent-2.1.so.7`event_del_nolock_ + 640
    frame #2: 0x0000000800480dd8 libevent-2.1.so.7`event_del + 56
    frame #3: 0x0000000000340998 unbound`ub_event_del(ev=0x00000008015b1600) at ub_event.c:395:9
    frame #4: 0x0000000000329d14 unbound`comm_point_close(c=0x00000008015ab800) at netevent.c:3819:6
    frame #5: 0x00000000003399e0 unbound`decommission_pending_tcp(outnet=0x00000008015f7000, pend=0x00000008018c9140) at outside_network.c:916:2
    frame #6: 0x00000000003377d1 unbound`reuse_cb_and_decommission(outnet=0x00000008015f7000, pend=0x00000008018c9140, error=-2) at outside_network.c:956:2
    frame #7: 0x0000000000339fae unbound`outnet_tcptimer(arg=0x0000000800d49d00) at outside_network.c:1965:3
    frame #8: 0x000000000032de90 unbound`comm_timer_callback(fd=-1, event=1, arg=0x0000000801489100) at netevent.c:4129:2
    frame #9: 0x0000000800483fa1 libevent-2.1.so.7`___lldb_unnamed_symbol65$$libevent-2.1.so.7 + 785
    frame #10: 0x000000080048011c libevent-2.1.so.7`event_base_loop + 1404
    frame #11: 0x0000000000340718 unbound`ub_event_base_dispatch(base=0x000000080148b000) at ub_event.c:280:9
    frame #12: 0x0000000000327bdb unbound`comm_base_dispatch(b=0x0000000801486000) at netevent.c:246:11
    frame #13: 0x000000000026d279 unbound`worker_work(worker=0x0000000800ded800) at worker.c:1941:2
    frame #14: 0x0000000000257eef unbound`thread_start(arg=0x0000000800ded800) at daemon.c:540:2
    frame #15: 0x00000008007abfda libthr.so.3`___lldb_unnamed_symbol1$$libthr.so.3 + 346

Comment 7 Tatsuki Makino 2020-12-14 09:14:55 UTC

I'm beginning to understand the conditions that cause this :)

Restricted to TCP with do-udp: no, etc.
unbound creates a large number of TCP connections in the TIME_WAIT state. At that time, run "service unbound restart".

It is more likely to occur during that procedure.
If unbound is started after the TCP connection in TIME_WAIT state is gone, it is unlikely to occur.

Comment 8 emikulic 2020-12-14 13:01:42 UTC

I'm seeing multiple SIGSEGVs per day but a different backtrace:

Program received signal SIGSEGV, Segmentation fault.
log_addr (v=VERB_OPS, str=0x541f2b "remote address is", addr=0x0, addrlen=0) at net_help.c:198
198     net_help.c: No such file or directory.
        in net_help.c
Current language:  auto; currently minimal
(gdb) bt
#0  log_addr (v=VERB_OPS, str=0x541f2b "remote address is", addr=0x0, addrlen=0) at net_help.c:198
#1  0x00000000004e6dc7 in comm_point_send_udp_msg (c=0x802c64500, packet=0x80227e810, addr=0x0, addrlen=0) at netevent.c:382
#2  0x00000000004fbe86 in randomize_and_send_udp (pend=0x8030b3940, packet=0x80227e810, timeout=275) at outside_network.c:1874
#3  0x00000000004fbd16 in pending_udp_query (sq=0x8030da780, packet=0x80227e810, timeout=275, cb=0x4fe170 <serviced_udp_callback>, cb_arg=0x8030da780)
    at outside_network.c:1944
#4  0x00000000004feae6 in serviced_udp_send (sq=0x8030da780, buff=0x80227e810) at outside_network.c:2519
#5  0x00000000004ff046 in outnet_serviced_query (outnet=0x80221f900, qinfo=0x802ef8980, flags=0, dnssec=32784, want_dnssec=0, nocaps=0, tcp_upstream=0, 
    ssl_upstream=0, tls_auth_name=0x0, addr=0x802ef9188, addrlen=28, zone=0x802ef89f8 "\006google\003com", zonelen=12, qstate=0x802ef8510, 
    callback=0x421600 <worker_handle_service_reply>, callback_arg=0x802ef98d0, buff=0x80227e810, env=0x8023ce530) at outside_network.c:3014
#6  0x000000000042712a in worker_send_query (qinfo=0x802ef8980, flags=0, dnssec=32784, want_dnssec=0, nocaps=0, addr=0x802ef9188, addrlen=28, 
    zone=0x802ef89f8 "\006google\003com", zonelen=12, ssl_upstream=0, tls_auth_name=0x0, q=0x802ef8510) at worker.c:2001
#7  0x000000000044358a in processQueryTargets (qstate=0x802ef8510, iq=0x802ef8880, ie=0x80221b100, id=1) at iterator.c:2600
#8  0x000000000043fb38 in iter_handle (qstate=0x802ef8510, iq=0x802ef8880, ie=0x80221b100, id=1) at iterator.c:3634
#9  0x000000000043fa53 in process_request (qstate=0x802ef8510, iq=0x802ef8880, ie=0x80221b100, id=1) at iterator.c:3677
#10 0x000000000043f625 in iter_operate (qstate=0x802ef8510, event=module_event_pass, id=1, outbound=0x0) at iterator.c:3889
#11 0x0000000000461902 in mesh_run (mesh=0x802229a00, mstate=0x802ef84c0, ev=module_event_pass, e=0x0) at mesh.c:1699
#12 0x0000000000460c32 in mesh_new_client (mesh=0x802229a00, qinfo=0x7fffffffe340, cinfo=0x0, qflags=256, edns=0x7fffffffe328, rep=0x7fffffffe5f0, qid=29987)
    at mesh.c:585
#13 0x0000000000423c93 in worker_handle_request (c=0x802274800, arg=0x8023cd000, error=0, repinfo=0x7fffffffe5f0) at worker.c:1565
#14 0x00000000004e7d26 in comm_point_udp_callback (fd=7, event=2, arg=0x802274800) at netevent.c:749
#15 0x000000000048ca42 in handle_select (base=0x802272500, wait=0x7fffffffe858) at mini_event.c:220
#16 0x000000000048c3de in minievent_base_dispatch (base=0x802272500) at mini_event.c:242
#17 0x0000000000502788 in ub_event_base_dispatch (base=0x802272500) at ub_event.c:280
#18 0x00000000004e698b in comm_base_dispatch (b=0x8023c1e60) at netevent.c:246
#19 0x0000000000427189 in worker_work (worker=0x8023cd000) at worker.c:1941
#20 0x00000000004105a5 in daemon_fork (daemon=0x80221f000) at daemon.c:700
#21 0x000000000041fbfc in run_daemon (cfgfile=0x7fffffffed21 "/unbound.conf", cmdline_verbose=0, debug_mode=0, need_pidfile=1) at unbound.c:707
#22 0x000000000041f6f8 in main (argc=0, argv=0x7fffffffea20) at unbound.c:808

Comment 9 Ivan Rozhuk 2020-12-14 19:46:14 UTC

Crash for me too, build from sources: TFOCL, TFOSE, THREADS.
Very custom config.

Comment 10 Ivan Rozhuk 2020-12-14 20:04:46 UTC

https://github.com/NLnetLabs/unbound/issues/376

Comment 11 Jaap Akkerhuis 2020-12-15 11:21:17 UTC

In https://github.com/NLnetLabs/unbound/issues/376 is a fix proposed by delphij. We are a bit thin in the unbound department currently but they will take a look later today.

But if anybody wants to test this do let me know so I'll do an emergency patch to the port ASAP.

Comment 12 Tatsuki Makino 2020-12-15 22:25:25 UTC

I rebooted FreeBSD and unbound was started cleanly by /etc/rc.
In this case, the process that was started at that time is still working.

If we use sockstat -4 -6 -c -P tcp,udp -s to check that there are many sockets created by unbound (including the one that unbound gave away) and then restart unbound, we are likely to have this problem.

By the way, services/outside_network.c has added reuse_foobar :) between 1.12.0 and 1.13.0.
How does a new process know that a socket created by an old process is still unusable?

Comment 13 Tod McQuillin 2020-12-17 00:39:58 UTC

(In reply to Jaap Akkerhuis from comment #11)
I also experienced an unbound crash this morning:

Dec 17 06:19:08 phlox kernel: pid 1711 (unbound), jid 0, uid 59: exited on signal 11

I am keenly interested in testing a patch for this issue and I would encourage Jaap's suggestion to do an emergency patch to the port.

Comment 14 Bernard Spil freebsd_committer

2020-12-17 08:43:37 UTC

Created attachment 220659 [details]
svn diff from upstream Issue 376

Comment 15 Jaap Akkerhuis 2020-12-17 09:18:28 UTC

Comment on attachment 220659 [details]
svn diff from upstream Issue 376

I was working on the same but you seem to have beaten me on it. Yes, go ahead

Comment 16 commit-hook freebsd_committer

2020-12-17 09:39:37 UTC

A commit references this bug:

Author: brnrd
Date: Thu Dec 17 09:38:41 UTC 2020
New revision: 558269
URL: https://svnweb.freebsd.org/changeset/ports/558269

Log:
  dns/unbound: SIGSEGV fix

  PR:		251821
  Submitted by:	delphij
  Approved by:	Jaap Akkerhuis (maintainer)
  Obtained from:	https://github.com/NLnetLabs/unbound/issues/376
  MFH:		2020Q4

Changes:
  head/dns/unbound/Makefile
  head/dns/unbound/files/patch-Issue376

Comment 17 commit-hook freebsd_committer

2020-12-17 09:41:40 UTC

A commit references this bug:

Author: brnrd
Date: Thu Dec 17 09:41:33 UTC 2020
New revision: 558271
URL: https://svnweb.freebsd.org/changeset/ports/558271

Log:
  MFH: r558269

  dns/unbound: SIGSEGV fix

  PR:		251821
  Submitted by:	delphij
  Approved by:	Jaap Akkerhuis (maintainer)
  Obtained from:	https://github.com/NLnetLabs/unbound/issues/376

Changes:
_U  branches/2020Q4/
  branches/2020Q4/dns/unbound/Makefile
  branches/2020Q4/dns/unbound/files/patch-Issue376

Comment 18 Tatsuki Makino 2020-12-18 05:00:24 UTC

patch-Issue376 seems to has no effect on tcp-upstream:yes or do-udp:no server.

Comment 19 Xin LI freebsd_committer

2020-12-27 05:49:23 UTC

(In reply to Tatsuki Makino from comment #18)
Hmm...  Your issue was different from what I have observed on my server.  Let me try to reproduce it and find out why.

Comment 20 Xin LI freebsd_committer

2020-12-27 10:20:12 UTC

I can't seem to be able to reproduce your issue, basically what I have done was that I have set up unbound.conf with:

# POTENTIALLY DANGEROUS SETTINGS - THIS IS BEHIND FIREWALL
interface: 0.0.0.0/0
interface-automatic: yes
do-udp: no
do-tcp: yes
tcp-upstream: yes
access-control: 0.0.0.0/0 allow
log-queries: yes
log-replies: yes

Along with a forward-zone of '.', configured to query my own DNS server on the local network.

I've issued multiple queries to local host:

dig foo.example.com @127.0.0.1 +tcp

etc. and I did see TIME_WAIT connections to 127.0.0.1:53, but were unable to crash unbound.

Could you please file a new bug upstream with steps to reproduce so that they can take a look?

Comment 21 Jaap Akkerhuis 2020-12-27 12:24:01 UTC

(In reply to Xin LI from comment #20)

Yes, upstreams also thinks that this is a different problem but couldn't reproduce this (yet) so more information is needed and a new GitHub issue with more information would be appreciated.

Comment 22 Tatsuki Makino 2020-12-27 22:12:22 UTC

(In reply to Xin LI from comment #20)

I have a unbound.conf that I can reproduce relatively well.

server:
	verbosity: 5
	num-threads: 2
	port: 65353
	interface: 127.0.0.1@65353
	outgoing-range: 64
	outgoing-num-tcp: 2
	#so-reuseport: yes
	cache-max-ttl: 1
	cache-max-negative-ttl: 1
	do-udp: yes
	do-tcp: yes
	tcp-upstream: yes
	#do-daemonize: no
	chroot: ""
	username: ""
	directory: .
	use-syslog: no
	pidfile: unbound.pid

Then run this as cd /tmp && unbound -d -c /tmp/unbound.conf.
Then it stops as soon as dig @127.0.0.1 a -p 65353 +tcp example.com. is executed.

In addition, there are ways to prevent them from occurring.
First, check pkg info -r libevent.
I have the following.

libevent-2.1.12:
        thunderbird-78.6.0_1
        unbound-1.13.0_1
        avahi-app-0.8
        nsd-4.3.4
        seamonkey-2.49.4_27
        tmux-3.1c

Next, stop all programs in the package displayed above.
Then, use sockstat -4 -6 -c -P tcp,udp -s to confirm that all connections are gone.
Finally, if all connections do not exist, start unbound.

Comment 23 Xin LI freebsd_committer

2020-12-27 22:32:05 UTC

(In reply to Tatsuki Makino from comment #22)
Hi,

I have followed the steps:

1. create unbound.conf in #21

2. confirm that no unbound instance is running, then:

3. unbound -d -c /tmp/unbound.conf

4. In a different console, run: "dig @127.0.0.1 a -p 65353 +tcp example.com."

The unbound instance didn't crash.  This is latest FreeBSD main running unbound-1.13.0_1; unbound and libevent were built with WITH_DEBUG.

Any idea?

Comment 24 Tatsuki Makino 2020-12-27 22:48:07 UTC

(In reply to Xin LI from comment #23)

When the unbound on port 65353 stops, there is another unbound running on the other port, and I am using the seamonkey web browser with libevent to view other sites.

Also, I'm using 12.2-STABLE r368787.