Auditdistd does not handle transient errors gracefully when TLS is involved. Without TLS, it retries the connection; with a tls:// server, it just stops. The last log messages are:

Jul 28 01:03:33 stream auditdistd[36170]: [audit.xxx] (sender) Unable to receive reply: Operation timed out.
Jul 28 01:03:33 stream auditdistd[36170]: [audit.xxx] (sender) Disconnected from tls://audit.xxx:7878.

And then it's game over: /var/audit/dist starts filling up.
For what it's worth, we still see this every now and then.
Just to cross-reference the two sets of bug reports:
https://github.com/openbsm/openbsm/issues/3
https://github.com/openbsm/openbsm/issues/2
Filed by brueffer.
A commit references this bug:

Author: pjd
Date: Thu Oct 4 05:54:58 UTC 2018
New revision: 339177
URL: https://svnweb.freebsd.org/changeset/base/339177

Log:
  When the adist_free list is empty and we lose connection to the receiver, we move all elements from the adist_send and adist_recv lists back onto the adist_free list, but we don't wake consumers waiting for the adist_free list to become non-empty. This can lead to the sender process stopping audit trail file distribution and waiting forever.

  Fix the problem by adding the missing wakeup.

  While here, slow down spinning on the CPU in case of a short race in sender_disconnect() and add an explanation of when it can occur.

  PR: 201953
  Reported by: peter
  Approved by: re (kib)

Changes:
  head/contrib/openbsm/bin/auditdistd/auditdistd.h
  head/contrib/openbsm/bin/auditdistd/sender.c
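The failure mode in the commit log is a classic lost wakeup on a condition variable. Below is a minimal sketch of the pattern, assuming pthreads. It is not the actual sender.c code: `struct queues`, `take_free()`, `disconnect_requeue()` and `run_disconnect_demo()` are hypothetical names, and plain counters stand in for the adist_free, adist_send and adist_recv lists. The point is that whoever refills an empty free list must wake the threads sleeping on it.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/*
 * Hypothetical, simplified model of the sender's queues: plain counters
 * stand in for the adist_free, adist_send and adist_recv lists.
 */
struct queues {
	pthread_mutex_t	lock;
	pthread_cond_t	free_cond;	/* waited on while nfree == 0 */
	int		nfree;
	int		nsend;
	int		nrecv;
};

static struct queues q = {
	PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 0, 0
};

/* Consumer: take one element off the free list, blocking while it is empty. */
static void *
take_free(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&q.lock);
	while (q.nfree == 0)
		pthread_cond_wait(&q.free_cond, &q.lock);
	assert(q.nfree > 0);
	q.nfree--;
	q.nsend++;
	pthread_mutex_unlock(&q.lock);
	return NULL;
}

/* Disconnect path: move every in-flight element back onto the free list. */
static void
disconnect_requeue(void)
{
	int was_empty;

	pthread_mutex_lock(&q.lock);
	was_empty = (q.nfree == 0);
	q.nfree += q.nsend + q.nrecv;
	q.nsend = 0;
	q.nrecv = 0;
	/*
	 * The missing wakeup: without this broadcast, a consumer already
	 * sleeping in take_free() is never told the list is non-empty.
	 */
	if (was_empty && q.nfree > 0)
		pthread_cond_broadcast(&q.free_cond);
	pthread_mutex_unlock(&q.lock);
}

/*
 * Hypothetical driver: the free list starts empty, two elements are
 * "being sent" and one is "awaiting a reply" when the connection drops.
 * Returns the final free count, or -1 if the thread could not be created.
 */
int
run_disconnect_demo(void)
{
	pthread_t t;

	q.nfree = 0;
	q.nsend = 2;
	q.nrecv = 1;
	if (pthread_create(&t, NULL, take_free, NULL) != 0)
		return -1;
	disconnect_requeue();
	/* Would block forever here if the consumer slept without a wakeup. */
	pthread_join(t, NULL);
	return q.nfree;
}
```

Calling `run_disconnect_demo()` from a `main()` (linked with `-pthread`) should return 2 whichever thread runs first. If you delete the `pthread_cond_broadcast()` call, the join hangs whenever the consumer reaches `pthread_cond_wait()` before the requeue. That is timing dependent, which would fit the intermittent stall reported above.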
There is a commit referencing this PR, but it's still not closed and has been inactive for some time. Closing the PR as fixed, but feel free to reopen it if the issue hasn't been completely resolved. Thanks!