Bug 31029

Summary: syslogd remote logging back down
Product: Base System Reporter: AnarCat <AnarCat>
Component: binAssignee: Crist J. Clark <cjc>
Status: Closed FIXED    
Severity: Affects Only Me    
Priority: Normal    
Version: 4.4-STABLE   
Hardware: Any   
OS: Any   
Attachments:
Description Flags
file.diff none

Description AnarCat 2001-10-04 09:30:01 UTC
From -questions:

On Tue, Oct 02, 2001 at 11:57:08AM -0400, The Anarcat wrote:
> Hi.
>  
> I think I noticed what seems to me undesirable (and undocumented?)
> behavior in syslogd. When a remote logging host (@host) is
> unreachable:
> 
> syslogd: sendto: Host is down
> 
> syslogd *never* tries to reach it again, unless it receives a HUP.
> Shouldn't it try to reach it again, from time to time?
> 
> The @host was indeed down, but when it was brought back up, remote
> logging wasn't resumed.

Fix: This is a draft of what I would call "approximate exponential backoff
algorithm". :)

There's a lot of debugging code that can be removed, but they help
seeing what's going on.

There's probably a better way to do this too. :)
How-To-Repeat: 
*.*			@host

where host is down or unreachable.
Comment 1 AnarCat 2001-10-05 20:41:27 UTC
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

[cc'ed to -hackers since it is what I think is the relevant list, please
correct me if I'm wrong]

[[note that I welcome style and function comments]]

Note that this fix "works" in the sense that syslogd will eventually be
able to reach to remote server again, but the delays of the checks are
dependent on the frequency of the outgoing log lines. 

For example, say the initial delay is 30 seconds. If the algorithm was
"clean", there would be some kind of check made on the remote server at
30, 30+60=90, 90+120=210, etc, seconds after the failure. But the check
is only made when a log has to be sent out. So if no log is sent out for
(say) 5 minutes, the delay (f_delay) won't be changed to take into
account the delay taken to have a new log to send. 

This might not be an undesirable behavior, BTW. :) That way, if a remote
host doesn't usually receive much logs, it has more chances of being
checked each time a new log is sent. On the other hand, if a host
receives much logs, most will be discarded, and the host will be checked
at regular, exponentially expanding intervals, until it becomes
available again.

Also, note that the patch inserts 2 more time_t fields in the filed
struct.

I don't know exactly how the struct filed* pointers are initialized in
the fprintlog function. Actually, one must assume that f_unreach is
initialized to 0. If the field is filled with arbitrary value, the
algorithm leads to undefined behavior. The worst that can happen is that
syslogd will not *try* to send the log to this host, even if it is
reachable, as the f_unreach field is also a "flag".

Also, all this won't work very well if something like this isn't applied
after:

(warning: cut'n'paste)

- --- /usr/src/usr.sbin/syslogd/syslogd.c Thu Oct  4 00:06:49 2001
+++ syslogd.c   Fri Oct  5 15:31:28 2001
@@ -1064,6 +1064,8 @@
                                        f->f_type = F_UNUSED;
                                        break;
                                }
+                       } else { /* no error from sendto */
+                               f->f_unreach = 0; /* clear unreach error
flag */
                        }
                }
                break;

I could swear I saw my router's patched syslogd recover from "host down"
even without that extra patch (using only the orinigal pr's patch), but
anyways..

For convenience, I also put here a full patch that has the dprintf's
removed:

- --- /usr/src/usr.sbin/syslogd/syslogd.c.orig	Wed Oct  3 15:56:32 2001
+++ syslogd.c	Fri Oct  5 15:37:57 2001
@@ -142,6 +142,9 @@
 #define MARK		0x008	/* this message is a mark */
 #define ISKERNEL	0x010	/* kernel generated message */
 
+#define DELAY_MUL	2       /* delay multiplier */
+#define DELAY_INIT	30	/* initial delay in seconds */
+
 /*
  * This structure represents the files that will have log
  * copies printed.
@@ -159,6 +162,9 @@
 #define PRI_EQ	0x2
 #define PRI_GT	0x4
 	char	*f_program;		/* program this applies to */
+	/* should this be part of the union? */
+	time_t  f_unreach;	      /* time since last unreach */
+	time_t  f_delay;		/* backoff time */
 	union {
 		char	f_uname[MAXUNAMES][UT_NAMESIZE+1];
 		struct {
@@ -999,6 +1005,11 @@
 			l = MAXLINE;
 
 		if (finet) {
+			/* XXX: must make sure this is initialized to 0 */
+			if ((f->f_unreach) && 
+			   ((now - f->f_unreach) < f->f_delay)) {
+				break; /* do not send */
+			}
 			for (r = f->f_un.f_forw.f_addr; r; r = r->ai_next) {
 				for (i = 0; i < *finet; i++) {
 #if 0 
@@ -1019,10 +1030,34 @@
 			}
 			if (lsent != l) {
 				int e = errno;
- -				(void)close(f->f_file);
- -				errno = e;
- -				f->f_type = F_UNUSED;
 				logerror("sendto");
+				errno = e;
+				switch (errno) {
+				case EHOSTUNREACH:
+				case EHOSTDOWN:
+					if (f->f_unreach)
+						f->f_delay *= DELAY_MUL;
+					else {
+						f->f_unreach = now;
+						f->f_delay = DELAY_INIT;
+					}
+					break;
+				/* case EBADF: */
+				/* case EACCES: */
+				/* case ENOTSOCK: */
+				/* case EFAULT: */
+				/* case EMSGSIZE: */
+				/* case EAGAIN: */
+				/* case ENOBUFS: */
+				/* case ECONNREFUSED: */
+				default:
+					(void)close(f->f_file);
+					errno = e;
+					f->f_type = F_UNUSED;
+					break;
+				}
+			} else { /* no error from sendto */
+				f->f_unreach = 0; /* clear unreach error flag */
 			}
 		}
 		break;

A.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.6 (FreeBSD)
Comment: Pour information voir http://www.gnupg.org

iEYEARECAAYFAju+DOUACgkQttcWHAnWiGcsVgCeI7L2H3xD5GRN65mDdW4ZLvwe
sfkAnAjESC9zhuC7wlobnXpm14MZ00Ik
=h9I5
-----END PGP SIGNATURE-----

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-bugs" in the body of the message
Comment 2 Crist J. Clark freebsd_committer freebsd_triage 2001-11-29 10:04:07 UTC
Responsible Changed
From-To: freebsd-bugs->cjc

I have been testing these patches.
Comment 3 Crist J. Clark freebsd_committer freebsd_triage 2002-08-25 07:05:38 UTC
State Changed
From-To: open->patched

I hesitated doing this because the back-off method seemed too 
complicated. So I simplifed. Don't back-off at all. Just keep trying 
and ignore (but log) EHOSTUNREACH and EHOSTDOWN. The failures really 
don't consume more resources than a success would, so why not just 
keep trying? The fact that we can keep getting "sendto" errors in the 
logs helps to point out the failure and are throttled by the usual 
methods.
Comment 4 Crist J. Clark freebsd_committer freebsd_triage 2002-09-05 18:54:45 UTC
State Changed
From-To: patched->closed

MFC to -STABLE.