Bug 270912 - dns/unbound: issues with ASLR
Summary: dns/unbound: issues with ASLR
Status: Closed Overcome By Events
Alias: None
Product: Ports & Packages
Classification: Unclassified
Component: Individual Port(s)
Version: Latest
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: Fernando Apesteguía
URL:
Keywords: regression
Depends on:
Blocks: 259968
Reported: 2023-04-18 13:33 UTC by Wout Decré
Modified: 2023-09-04 11:30 UTC
CC List: 13 users

See Also:
bugzilla: maintainer-feedback? (jaap)
grahamperrin: maintainer-feedback? (net)


Attachments
dns/unbound: introduce NOASLR port option (1.95 KB, patch)
2023-05-15 18:20 UTC, R. Christian McDonald

Description Wout Decré 2023-04-18 13:33:03 UTC
After upgrading to FreeBSD 13.2, I notice issues with the Unbound DNS resolver. There seem to be a lot of timeouts to the forward servers:

Apr 18 14:32:59 canobox0 unbound[98053]: [98053:0] error: SERVFAIL <www.google.com. AAAA IN>: all the configured stub or forward servers failed, at zone . from 9.9.9.9 upstream server timeout

Disabling ASLR seems to resolve the issue.

The Unbound server is using DNS over TLS with the following configuration:

forward-zone:
	name: .
	forward-tls-upstream: yes
	forward-addr: 9.9.9.9@853		# Quad9
	forward-addr: 149.112.112.112@853	# Quad9
	forward-addr: 2620:fe::fe@853		# Quad9
	forward-addr: 2620:fe::9@853		# Quad9
Comment 1 Wout Decré 2023-04-19 08:09:11 UTC
Apart from the timeouts to the upstream servers, the issue does not crash the Unbound server. Unbound is able to recover, but the timeouts keep recurring every so often.

Workarounds:

# elfctl -e +noaslr /usr/local/sbin/unbound

or

# sysctl kern.elf64.aslr.enable=0

or

Disable TLS support for the upstream servers.
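
A flag set with elfctl is stored in the installed binary itself, so it is presumably lost whenever the package is reinstalled or upgraded and the file is replaced. A minimal sketch of a helper that re-applies it, using the same elfctl invocation as above. The function name apply_noaslr and its default path handling are illustrative only and are not part of the port or the package:

```shell
# Hypothetical helper: re-apply the noaslr ELF feature flag to a binary.
# Wraps the elfctl command quoted in this comment; must be run as root.
apply_noaslr() {
    bin="${1:-/usr/local/sbin/unbound}"
    if [ ! -f "$bin" ]; then
        echo "apply_noaslr: $bin not found" >&2
        return 1
    fi
    # Setting the flag again on an already flagged binary is harmless.
    elfctl -e +noaslr "$bin"
}
```

After an upgrade this could be run as "apply_noaslr && service unbound restart", since the flag only takes effect for newly started processes.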
Comment 2 Joshua Kinard 2023-04-19 22:39:53 UTC
It appears to be failing in the SSL/TLS handshake to the upstream forwarders after a period of time.  Almost seems like a queue is getting backed up somewhere and things start overflowing.  I've spent the last hour trying to debug it, thinking it was a problem w/ Quad9 itself.

The key error in log files from Unbound is this:
> unbound[19]: [19:2] error: SSL_handshake syscall: Connection reset by peer

I've turned up the verbosity, but the additional detail does not shed any additional light on why the connection was reset (no SSL errors/reasons, etc).  It seems like everything just works with each connection until it doesn't.
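
To compare runs with and without ASLR, it may help to tally the two failure signatures quoted in this report. A minimal sketch, assuming Unbound logs to a plain text file; the default path /var/log/messages and the function name are assumptions, not something stated in this report:

```shell
# Count the failure signatures seen in this bug report in a log file.
# The default log path is an assumption; pass the file Unbound really logs to.
count_unbound_failures() {
    log="${1:-/var/log/messages}"
    if [ ! -r "$log" ]; then
        echo "count_unbound_failures: cannot read $log" >&2
        return 1
    fi
    # grep -c exits non-zero when the count is 0, hence the "|| true".
    timeouts=$(grep -c 'upstream server timeout' "$log" || true)
    resets=$(grep -c 'SSL_handshake syscall' "$log" || true)
    echo "timeouts=$timeouts handshake_errors=$resets"
}
```

Running it once before and once after applying a workaround, over comparable periods, gives a rough indication of whether the workaround changed anything.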

I started noticing these issues when my Squid proxy would start returning read errors after launching a browser on one of my desktops.  Waiting about 30s-1m after launching the browser seemed to let the dust settle, and then things would seemingly work fine for hours before they started to hiccup again.  Websites that do a *lot* of background chatter, like Twitter, Facebook, etc, seemed to trip the issue the most because they pile the queries up and seem to overload Unbound, causing the SSL errors to appear.

A good way to reproduce: first, quiet outbound traffic on the network, or at least the traffic that will hit a particular Unbound DNS server.  Edit the config, set verbosity to '2', and restart the daemon.  tail -f the log file and, once it's loaded up, find a Windows box and launch MS Edge.  Edge makes a *ton* of queries at once when it loads, and on my end the first few got resolved fine against Quad9, then the SSL errors would start to appear, causing Unbound to try other forwarders and eventually give up and return SERVFAIL.
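
Without a Windows box at hand, a burst of parallel lookups might serve as a rough stand-in for the browser start-up. This is a substitute for the Edge procedure described above, not the same test, and whether it triggers the issue is untested. A minimal sketch using drill(1) from the FreeBSD base system; the function name, resolver address, and host names are placeholders:

```shell
# Fire a burst of parallel AAAA queries at a resolver.
# Usage: query_burst <resolver-address> <name> [<name> ...]
query_burst() {
    if [ "$#" -lt 2 ]; then
        echo "usage: query_burst resolver name [name ...]" >&2
        return 2
    fi
    resolver="$1"
    shift
    for name in "$@"; do
        # One background query per name; answers are discarded because
        # the interesting output is in the Unbound log.
        drill "$name" "@$resolver" AAAA >/dev/null 2>&1 &
    done
    wait
}
```

For example, "query_burst 127.0.0.1 www.google.com www.freebsd.org www.example.com" while following the Unbound log with tail -f. A longer list of distinct, uncached names is likely needed to approximate what a browser generates.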

Can confirm, though, that turning ASLR off for the Unbound binary appears to make things smooth again.
Comment 3 Graham Perrin freebsd_committer freebsd_triage 2023-04-20 18:59:21 UTC
Cross-reference: <https://old.reddit.com/r/freebsd/comments/12prz7k/-/>.
Comment 4 Jaap Akkerhuis 2023-05-08 13:42:31 UTC
This doesn't seem like a port issue to me. Reported to the developers as an issue https://github.com/NLnetLabs/unbound/issues/887
Comment 5 lumiwa 2023-05-13 22:10:14 UTC
I had ASLR enabled on FreeBSD 13.1-RELEASE and didn't have problems. After the upgrade to 13.2-RELEASE it worked without problems until Tuesday this week, when it stopped. I disabled ASLR and the problem still exists.
Comment 6 lumiwa 2023-05-14 21:22:05 UTC
Is it possible that in my case the problem is related to gnutls/libretls?
I am using openntpd, which installs libretls, and I think the problem started after an update of openntpd.
Thank you.
Comment 7 R. Christian McDonald 2023-05-15 16:42:04 UTC
We should probably use the `USES= elfctl` facility to disable ASLR on unbound (and maybe all the unbound binaries) until this can be fixed upstream.

Something like this should do:

-USES=		autoreconf cpe libtool pkgconfig ssl
+USES=		autoreconf cpe elfctl libtool pkgconfig ssl
 CPE_VENDOR=	nlnetlabs
+ELF_FEATURES=	+noaslr:${PORTNAME}
Comment 8 Marek Zarychta 2023-05-15 18:00:27 UTC
Perhaps you can add a "NOASLR" build option to the port instead of disabling ASLR for everyone? I am running a couple of dns/unbound servers with ASLR and PIE enabled as standalone recursive DNS servers for a few hundred clients, and both run fine for weeks or months; there is no need to restart or poke them between upgrades.
Comment 9 R. Christian McDonald 2023-05-15 18:20:11 UTC
Created attachment 242199 [details]
dns/unbound: introduce NOASLR port option

Good idea. I have attached a patch that adds "NOASLR" as a make option.
Comment 10 Jaap Akkerhuis 2023-06-01 12:41:18 UTC
A fix has been developed upstream. There will be a new release with this fix within weeks. For the impatient among us, a prerelease has been made available <https://github.com/NLnetLabs/unbound/issues/887#issuecomment-1570136710>.
Comment 11 Jaap Akkerhuis 2023-09-04 11:22:47 UTC
(In reply to Jaap Akkerhuis from comment #10)
There is a new release (1.18.0, see bug #73456) which promises to fix the problem.

We can close this, as it has been overtaken by events.