At boot time, dhclient is supposed to fall back to /var/db/dhclient.leases when it reaches 'timeout' (60s) without getting a DHCPOFFER response. For 3 years I've been observing that my freebsd router no longer started its services properly after a power outage, when previously it all worked fine and there were no relevant configuration changes, aside from gradual OS upgrades. The core reason was that dhclient's behavior changed in 12.1-RELEASE from "No DHCPOFFERS received. Trying recorded lease X. bound: renewal in Y seconds." to "No working leases in persistent database - sleeping." and leaving the WAN interface with no ip address. The supporting reason is that this only happens when the ISP's network is slow to recover, and when the server has no battery backup. If there is no reboot, the already running dhclient most likely just retains the active lease and ip binding. I have reproduced this issue in a VM by letting it boot once to get the lease, then firewalling off the VM on the hypervisor level and rebooting (cannot simply disconnect link; cannot use on-host pf because it starts too late). I have tested 15.0-CURRENT-20240620, 14.0-RELEASE, 12.4-RELEASE, 12.1-RELEASE and all trivially reproduce it. 12.0-RELEASE doesn't. Via tedious buildworld bisection, I have identified the cause as b6c2f6eb (base r344237) from 2019-02-17. In main branch these were 95f237c2 (base r343896) followed by 3b08e0fc (base r343922) from 2019-02-08. https://cgit.freebsd.org/src/commit/sbin/dhclient/dhclient.c?id=b6c2f6eb29f4ca183a8877ccedd97ee98d583df3 https://cgit.freebsd.org/src/commit/sbin/dhclient/dhclient.c?id=95f237c2f65130b6567e69df06c393586e3969a3 https://cgit.freebsd.org/src/commit/sbin/dhclient/dhclient.c?id=3b08e0fcf357c1a905c5e59731930528fb94a0b1 MFC r343896,r343922: dhclient: Pass through exit status from script @@ -2353,2 +2353,3 @@ priv_script_go(void) - return (wstatus & 0xff); + return (WIFEXITED(wstatus) ? + WEXITSTATUS(wstatus) : 128 + WTERMSIG(wstatus)); } /usr/include/sys/wait.h:#define>_WSTATUS(x) (_W_INT(x) & 0177) /usr/include/sys/wait.h:#define WTERMSIG(x) (_WSTATUS(x)) /usr/include/sys/wait.h:#define WIFEXITED(x) (_WSTATUS(x) == 0) /usr/include/sys/wait.h:#define WEXITSTATUS(x) (_W_INT(x) >> 8) A bunch of internal dhclient logic isn't handled directly by C code; instead, the code sets up parameters for a 'script' and then runs it via script_go(). I found 12 such places in dhclient.c. At least 5 places treat the return value as a boolean 0=success, !0=error. Crucially, the lease reuse logic in dhclient.c::state_panic() does this, invoking dhclient-script 'TIMEOUT' (add new address) operation. note("Trying recorded lease %s", piaddr(ip->client->active->address)); /* Run the client script with the existing parameters. */ script_init("TIMEOUT", ip->client->active->medium); script_write_params("new_", ip->client->active); if (ip->client->alias) script_write_params("alias_", ip->client->alias); /* If the old lease is still good and doesn't yet need renewal, go into BOUND state and timeout at the renewal time. */ if (!script_go()) { if (cur_time < ip->client->active->renewal) { ip->client->state = S_BOUND; note("bound: renewal in %d seconds.", Adding debug logging code reveals that wstatus is 256, meaning that originally the return value was 0, but after the change it became 1. This is because instead of just returning the wait() outcome of the forked subprocess (exited / signaled / coredumped), it now instead provides the script's programmatic exit code. And the way dhclient-script's TIMEOUT is structured, it exits with 0 in some specific cases, and 1 in the default path. The abovementioned change thus shows a fundamental lack of understanding how dhclient's code operates, resulting in a 'return type confusion' programming mistake. The immediate fix would be to review each call to script_go() that checks the result, and ensure that the check aligns with the exit codes in dhclient-script. Curiously, one of the commits notes that it's identical to openbsd's 1.117 from 2008 ( https://cvsweb.openbsd.org/src/sbin/dhclient/dhclient.c ). The code structure is nearly identical, so I presume that change also introduced this same issue to openbsd. From the revision log I see that it seems to also have went unnoticed, for 4 years, until 1.159 'nuked' the entire script integration framework and thus unintentionally fixed it.
I have been made aware of a detail that changes the conclusion of the previous findings. The source code change itself is reasonable; what it in fact did is uncover an issue in dhclient-script. You see, currently, all other invokable commands in dhclient-script have no individual exit points (so no error checking and/or no failure states) and just fall through to a catch-all 'exit_with_hooks 0'. TIMEOUT does do checking, and as per its documentation, should exit with 0 if it thinks the saved lease is valid. Considering this, the 2019 change of the meaning of script_go()'s return value still makes 'zero' the expected value for success, and the nonzero wait() process failure code continues to be returned as a nonzero bitmasked value. So there's no issue there. In addition, I have misunderstood the layout of dhclient-script's TIMEOUT section. The last part that exits 1 is not the default expected path, it is a cleanup step, and the ass-backwards if-statement structure is to avoid copy-pasting, 'goto fail;' or dummy scope constructs. It makes it difficult to see the positive execution path, especially since it includes a negated AND condition, split into two if statements. Ultimately, what the script is doing is: 0. TIMEOUT) - so the DHCP server is not available, but we have saved leases 1. add_new_address() 2. lease must have 'routers' 3. first router on the list must be pingable 4. add_new_alias() 5. add_new_routes() 6. add_new_resolv_conf() Up until 2019-02, step 3 would fail and the script would exit, but dhclient proceeded as if it were successful because it wasn't checking the script's exit code. This also meant that steps 4 5 and 6 weren't performed, but in my case there were no visible consequences. Now, the requirement for the gateway to be pingable is enforced, and my host is no longer able to reuse the saved lease during an outage scenario. Unfortunately at this point I conclude that everything is now 'working as programmed', and any further discussion moves from "it's not supposed to do X!" to "should it be doing X?" or "can it be improved", which is more about design than bugfixing. I haven't thought much about how to improve this. My guess is that the ping test acts as some sort of primitive NLA / sanity check for when the host moves around between networks or when the network suddenly changes ip ranges, all while the DHCP server is absent. This makes it a very strange scenario. I think the power going out and the server booting up faster than the router that's also the DHCP server, is more likely. Anyway, dhclient-script and that ping test was implemented by openbsd's krw@ who also maintains dhclient and who eventually deleted the whole thing. Their current code only checks addressinuse and nothing else, and they're supposedly transitioning to their dhcpleased.
^Triage: notify committer of 95f237c2 .