Bug 253062 - dns/bind916: Hang on service stop in uwait state starting in v9.16.11
Status: Closed Overcome By Events
Alias: None
Product: Ports & Packages
Classification: Unclassified
Component: Individual Port(s)
Version: Latest
Hardware: Any Any
Importance: --- Affects Only Me
Assignee: Mathieu Arnold
Reported: 2021-01-28 13:16 UTC by John W. O'Brien
Modified: 2021-11-07 22:11 UTC (History)
bugzilla: maintainer-feedback? (mat)
john: maintainer-feedback?


Description John W. O'Brien 2021-01-28 13:16:43 UTC
After upgrading from 9.16.10 to 9.16.11, the named service does not stop properly.

$ sudo service named stop
Stopping named.
Waiting for PIDS: %PID% <-- hangs here indefinitely

$ procstat %PID%
  PID  PPID  PGID   SID  TSID THR LOGIN    WCHAN     EMUL          COMM
%PID%     1 %PID% %PID%     0  26 obrienjw uwait     FreeBSD ELF64 named


(lldb) bt
* thread #1, name = 'named'
  * frame #0: 0x0000000800cc768c libthr.so.3`_umtx_op_err at _umtx_op_err.S:37
    frame #1: 0x0000000800cc3591 libthr.so.3`join_common(pthread=0x0000000803585200, thread_return=0x0000000000000000, abstime=0x0000000000000000, peek=<unavailable>) at thr_join.c:147:9
    frame #2: 0x000000000050dbef named`isc_thread_join + 31
    frame #3: 0x00000000004fb49f named`isc_taskmgr_destroy + 607
    frame #4: 0x00000000002dae31 named`main + 5841
    frame #5: 0x00000000002d16b0 named`_start + 256

This is reproducible on two different 12.2R machines on which named is configured to act as a resolver. The hang does not occur on two other hosts on which named is configured to act as authoritative only. On the hosts where the hang occurs, the named logs contain lines of the form:

Jan 28 08:03:12 hostname named[%PID%]: creating IPv6 interface igb0 failed; interface ignored

These errors are absent on the hosts where the hang does not occur.
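
To check whether a given host shows the same log signature, something like the following could be used. This is only a sketch: the helper name has_ipv6_iface_error is made up for illustration, and the default log path /var/log/messages is an assumption about the local syslog setup.

```shell
# Hypothetical helper: succeed if the given syslog file contains the
# "creating IPv6 interface ... failed" message logged by named.
# The default path is an assumption; pass the real log file as $1.
has_ipv6_iface_error() {
    logfile="${1:-/var/log/messages}"
    grep -Eq 'named\[[0-9]+\]: creating IPv6 interface .* failed; interface ignored' "$logfile"
}
```

For example, `has_ipv6_iface_error && echo "this host logs the IPv6 interface error"`. Based on the observations above, a match suggests the host may be affected, but it does not prove it.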

The 9.16.11 changelog includes a number of items related to threads and netmgr.

5557.   [bug]           Prevent RBTDB instances from being destroyed by multiple
                        threads at the same time. [GL #2317]

5545.   [func]          OS support for load-balanced sockets is no longer
                        required to receive incoming queries in multiple netmgr
                        threads. [GL #2137]

5543.   [bug]           Fix UDP performance issues caused by making netmgr
                        callbacks asynchronous-only. [GL #2320]

5542.   [bug]           Refactor netmgr. [GL #1920] [GL #2034] [GL #2061]
                        [GL #2194] [GL #2221] [GL #2266] [GL #2283] [GL #2318]
                        [GL #2321]
Comment 1 Mathieu Arnold 2021-02-02 14:55:46 UTC
Reported upstream https://gitlab.isc.org/isc-projects/bind9/-/issues/2465
Comment 2 Morgan Davis 2021-02-08 06:38:19 UTC
I've had this same thing happen and could only kill the hung named with -9.
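
For anyone who needs to automate that, here is a rough sketch of "ask nicely, then kill -9". The function name stop_or_kill and the default timeout are made up for illustration and are not part of the port's rc script. Note that SIGKILL skips named's normal shutdown entirely.

```shell
# Hypothetical helper (not part of the rc script): send SIGTERM, wait up
# to $2 seconds for the process to go away, then fall back to SIGKILL.
# POSIX sh has no "local", so pid and timeout are plain globals.
stop_or_kill() {
    pid="$1"
    timeout="${2:-30}"
    kill -TERM "$pid" 2>/dev/null || return 0   # nothing to stop
    while [ "$timeout" -gt 0 ]; do
        kill -0 "$pid" 2>/dev/null || return 0  # process has exited
        sleep 1
        timeout=$((timeout - 1))
    done
    echo "PID $pid still running; sending SIGKILL" >&2
    kill -KILL "$pid" 2>/dev/null
}
```

Usage would be along the lines of `stop_or_kill "$(pgrep -x named)" 60`, run as root and assuming there is exactly one named process on the host.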

I was able to work around this by UNcommenting this line in named.conf to enable listening on IPv6 localhost:

    listen-on-v6    { ::1; };

This got rid of the "creating IPv6 interface" log errors and also allowed 'service named restart' to work again without hanging.
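
A quick way to see whether a config already has that line active (not commented out with // or #) might look like this. It is only a sketch: the function name is made up, the default path /usr/local/etc/namedb/named.conf is an assumption about where the port installs its config, and the check does not understand /* ... */ block comments.

```shell
# Hypothetical helper: succeed if named.conf has a listen-on-v6 statement
# that starts a line (i.e. is not prefixed by // or #).
# The default path is an assumption; pass the real config file as $1.
# Limitation: /* ... */ block comments are not detected.
has_listen_on_v6() {
    conf="${1:-/usr/local/etc/namedb/named.conf}"
    grep -Eq '^[[:space:]]*listen-on-v6[[:space:]]*\{' "$conf"
}
```

For example, `has_listen_on_v6 || echo "listen-on-v6 is not enabled"`.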

Unfortunately, 9.16.11 is creating a more critical issue for me which is why I came here to see if anyone else was experiencing this. Ever since I upgraded from 9.16.10 on January 28, the named process will just bail out randomly. No log entries. No core files.

This happens on two FreeBSD 12.2-RELEASE-p3 systems (VMs on DigitalOcean). They're handling primary and secondary DNS publicly. Until this version of bind, they've been rock-solid for years. Now, both will stop running named after a day or so and I've had to manually restart them.

I also have a third system, lightly loaded, on bare metal hardware running 9.16.11 privately that hasn't had any of these issues (yet). Maybe it's due to some condition after a certain number of requests (guessing here).

Hope this helps.
Comment 3 John W. O'Brien 2021-09-04 18:44:13 UTC
I am no longer able to reproduce this issue in 9.16.20. I believe the hang on service stop has been resolved as of 9.16.15.