Bug 260628 - FreeBSD 12.3-Release got stuck during the boot process after the update (maybe nsswitch issue?)
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: misc
Version: 12.3-RELEASE
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: freebsd-bugs (Nobody)
Reported: 2021-12-23 01:20 UTC by Yoshisato Yanagisawa
Modified: 2023-03-28 19:33 UTC
Description Yoshisato Yanagisawa 2021-12-23 01:20:37 UTC
What are the issues?
1. FreeBSD 12.3R got stuck during boot.
2. The password database does not seem to be consulted after I changed it to skip the compat entries.

What is the expected behavior?
1. FreeBSD 12.3R should boot without getting stuck even if LDAP is configured.
2. When I type a user's password, FreeBSD should recognize it and allow the user to log in. When I make the user's password blank, FreeBSD should allow them to log in without typing a password. That should also be true for root.

What happened?
1. I used freebsd-update to upgrade from 12.2-Release to 12.3-Release. I typed the following commands to upgrade.
# freebsd-update -r 12.3-RELEASE upgrade
(typed 'y' twice because the shown message looked reasonable)
# freebsd-update install
As the message instructed, I rebooted the computer.
# reboot

FreeBSD booted with the 12.3R kernel.
To make all packages use the latest libraries, I forced an upgrade of all packages.
# portmaster -a -f
Then I ran freebsd-update again to finalize the install.
# freebsd-update install
# reboot

During the reboot, the boot got stuck after starting devd. Since it did not respond for about 10 minutes, I pressed the power button. The computer recognized the power button and started the shutdown procedure, so ACPI seemed to be working.

I booted the computer again and, at the first countdown menu, typed 's' to enter single-user mode. I eventually got a single-user shell.
I saw this message:
2021-12-22T17:13:00.321721+09:00 init 28 - - NSSWITCH(_nsdispatch): ldap, passwd_compat, endpwent, not found, and no fallback provided.

I suspected something was wrong with nsswitch. I typed the following command to remount everything read-write.
# mount -u -a
However, as far as I could tell from mount, / was still read-only, so I explicitly remounted '/' read-write.
# mount -u /

Since I suspected the boot process got stuck because it failed to access LDAP, I removed the +::: entries from the password and group files.
# vipw
# vi /etc/group
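
For reference, the compat entries removed here are the lines beginning with '+' (or '-') in the password and group files. A minimal sketch for listing them before editing follows. The function name list_compat_entries is hypothetical, and /etc/master.passwd is assumed to be the file that vipw edits on FreeBSD:

```shell
#!/bin/sh
# list_compat_entries FILE...
# Print "file:line-number:entry" for each compat-style entry, i.e. each
# line that starts with '+' or '-'. (Hypothetical helper, not a FreeBSD tool.)
list_compat_entries() {
    for f in "$@"; do
        # Skip files that are missing or unreadable.
        [ -r "$f" ] || continue
        # /dev/null as a second operand makes grep print the file name.
        # "|| :" keeps a file without matches from counting as an error.
        grep -n '^[+-]' "$f" /dev/null || :
    done
}

# Example (as root on the affected host):
#   list_compat_entries /etc/master.passwd /etc/group
```

An empty result would suggest that no compat entries remain in the files that were checked.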

Then I exited the shell to continue into multi-user mode. During rc execution, I saw several suspicious messages like:
/etc/mail/submit.cf: 441: readcf: option RunAsUser: unknown user smmsp
(snip)
Mail submission program must have RunAsUser set to non root user
/etc/rc: WARNING: failed to start sendmail_msp_queue
(snip)
Performing sanity check on sshd configuration.
Privilege separation user sshd does not exist
/etc/rc: WARNING: failed precmd routine for sshd
(snip)
Starting jails: jail: <jname>: unknown uid 0
(snip)

It seems that the password database cannot be read on 12.3R. Moreover, nsswitch might have some issues on 12.3R.
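
One way to check which sources nsswitch is configured to use is to print the relevant lines of /etc/nsswitch.conf. The sketch below assumes the usual "database: source ..." layout of that file; the function name nss_sources is hypothetical. Read against the log line quoted earlier, the fields appear to be the source (ldap), the database (passwd_compat), and the lookup method (endpwent), though that reading is an interpretation and is not confirmed anywhere in this report.

```shell
#!/bin/sh
# nss_sources DATABASE [FILE]
# Print the source list configured for DATABASE in an nsswitch.conf-style
# file (default: /etc/nsswitch.conf). Hypothetical helper for illustration.
nss_sources() {
    db=$1
    file=${2:-/etc/nsswitch.conf}
    # Drop comments, then print what follows "DATABASE:" on an exact match,
    # so that "passwd" does not also match "passwd_compat".
    sed -e 's/#.*//' "$file" |
        awk -F: -v db="$db" '$1 == db { sub(/^[^:]*:[ \t]*/, ""); print }'
}

# Example:
#   nss_sources passwd
#   nss_sources passwd_compat
#   nss_sources group
```

If passwd_compat lists ldap, lookups through the compat entries depend on the LDAP module being loadable and the LDAP server being reachable.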

I gave up on updating and chose to roll back, i.e. I rebooted into single-user mode, remounted / read-write, and executed the following command.
# freebsd-update rollback
Here, the rollback got stuck (no progress for 30 minutes), but I decided to reboot anyway. The system booted, and I ran the command again to roll back and rebooted. It seems to work well.

Anything special about your config?
The machine uses LDAP for extra password and group entries. The LDAP server runs inside a jail on the same machine, and to avoid getting stuck on LDAP lookups during boot, the machine does not consult LDAP for groups.

However, since the issue continued after I removed the +:::... lines from the password file, LDAP might not be related.
Comment 1 Graham Perrin freebsd_committer freebsd_triage 2021-12-27 14:11:12 UTC
(In reply to Yoshisato Yanagisawa from comment #0)

> from 12.2-Release

Which patch level? 

From your use of rollback, I assume UFS i.e. no ZFS boot environments. True?
Comment 2 Yoshisato Yanagisawa 2021-12-27 22:02:13 UTC
> Which patch level? 

I assume you are asking about the patch level of 12.2R. If so:
# uname -v
FreeBSD 12.2-RELEASE-p7 GENERIC
# freebsd-update fetch
Looking up update.FreeBSD.org mirrors... 2 mirrors found.
Fetching metadata signature for 12.2-RELEASE from update2.freebsd.org... done.
Fetching metadata index... done.
Inspecting system... done.
Preparing to download files... done.

No updates needed to update system to 12.2-RELEASE-p11.


I am not sure why uname and freebsd-update show different values.

Also, I use ZFS during the boot, and the root file system is also ZFS.
# cat /boot/loader.conf
kern.geom.label.disk_ident.enable="0"
kern.geom.label.gptid.enable="0"
opensolaris_load="YES"
zfs_load="YES"
# grep zfs /etc/rc.conf
zfs_enable="YES"

Is there any special procedure I should follow when updating such an environment?
As far as I understand from https://docs.freebsd.org/en/books/handbook/cutting-edge/#updating-upgrading-freebsdupdate, I should call nextboot only if I use a custom kernel, which I do not.
I usually postpone `zfs upgrade` until I confirm the new version is stable enough because I do not know how to "downgrade" ZFS safely.  Note that I assume there is no ZFS update between 12.2R and 12.3R.
Comment 3 Graham Perrin freebsd_committer freebsd_triage 2021-12-28 04:09:53 UTC
(In reply to Yoshisato Yanagisawa from comment #2)

For myself, I often (quietly) run: 

freebsd-version -kru

I find it particularly useful to run at various stages of an _upgrade_ routine, to help me not lose sight of the point at which a restart is appropriate, and so on. 

freebsd-version(1)
<https://www.freebsd.org/cgi/man.cgi?query=freebsd-version&sektion=1&manpath=FreeBSD>

It's not unusual to find the patch level of the installed userland different from the patch level of the running kernel. Some patches apply to userland but not the kernel. The unofficial table at <https://bokut.in/freebsd-patch-level-table/#releng/12.2> might help to understand things such as this.

> … I assume there is no ZFS update between 12.2R and 12.3R. …

<https://cgit.freebsd.org/src/log/?h=releng%2F12.3&qt=grep&q=zfs> shows some ZFS-related changes. 

As orientation: less specifically, we currently have the amber 'release/12.3.0' at the head of <https://cgit.freebsd.org/src/log/?h=releng%2F12.3>.
Comment 4 Remco 2021-12-28 13:22:24 UTC
I might have related, similar problems...

I upgraded my FreeBSD VM from 12.2 to 12.3 by using freebsd-update.

I seem to remember that it initially worked fine and survived the reboot after the upgrade, but I'm not 100% sure. A few weeks later, after updating my packages, it got stuck on boot with the last text visible on the console: "devd" ...

Single user mode works, multi-user mode not.

I gave up, as I couldn't fix it, and tried to install a completely new machine using the FreeBSD 13.0-RELEASE ISO.
But... it had the same issue !! Also stuck at "devd" 

I figured out that using a traditional E1000 network adapter in VMware (ESXi 5.5.0, yes rather old) I could install FreeBSD 13. But when using the VMXNET3 driver, it crashed.

I went back to the other FreeBSD 12.3 server, changed the interface in VMware to an E1000 (em0) interface, adjusted my rc.conf and ... voila... the whole machine worked again.

... if the above is related, it might help with fixing the bug.

Regards, Remco
Comment 5 Yoshisato Yanagisawa 2021-12-28 21:49:23 UTC
(In reply to Graham Perrin from comment #3)

Thank you for the information.
I did not know about the freebsd-version command. It should help me follow the individual steps of freebsd-update. Currently, the output looks like this:
$ freebsd-version -kru
12.2-RELEASE-p7
12.2-RELEASE-p7
12.2-RELEASE-p11
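
According to freebsd-version(1), the three lines are always printed in the order installed kernel, running kernel, userland, regardless of the order of the options. A small sketch that labels them and points out mismatches follows; label_versions is a hypothetical helper name, not part of FreeBSD:

```shell
#!/bin/sh
# label_versions
# Read the three lines printed by `freebsd-version -kru` on stdin and
# label them. (Hypothetical helper for illustration.)
label_versions() {
    read -r k    # installed kernel
    read -r r    # running kernel
    read -r u    # installed userland
    echo "installed kernel: $k"
    echo "running kernel: $r"
    echo "userland: $u"
    if [ "$k" != "$r" ]; then
        echo "note: running kernel differs from installed kernel"
    fi
    if [ "$k" != "$u" ]; then
        echo "note: kernel and userland patch levels differ"
    fi
}

# Example (on FreeBSD):
#   freebsd-version -kru | label_versions
```

A difference between the first two lines usually means a new kernel has been installed but not yet booted; a difference between kernel and userland can simply mean that later patches touched only userland.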

I suppose linux_futex, bhyve, and vmci are kernel modules, and that might be why my kernel kept running at p7.
https://bokut.in/freebsd-patch-level-table/#releng/12.2

From my experience updating FreeBSD without freebsd-update, I suppose freebsd-update updates in this order: kernel first, then userland. I might see it boot fine with the new kernel and get stuck with the new userland, but let me see.

There seem to be updates in ZFS, but how do they relate to this issue? If this were caused by ZFS, I guess it would get stuck at an earlier stage, such as mounting '/', which does not appear to be the case.

(In reply to Remco from comment #4)

Thank you for the information, but unfortunately, it might have nothing to do with this issue because I do not use a virtual machine. It is a PRIMERGY TX1310M3. Note that it has Intel(R) PRO/1000 network adapters.

Here, I am considering whether or not devd is the cause. In your case, if netif is started after devd, I suppose devd is fine and the step just after devd might be the cause. So I should also confirm what comes after devd.
Comment 6 gerrit.kuehn 2022-04-22 08:05:15 UTC
I see exactly the same behaviour on my system after updating to 12.3. It worked fine when booting just with the new kernel (coming from latest 12.2), but broke after rebooting with the new userland installed.

A few more observations from my side:
This is definitely related to nsswitch/ldap relying on network resources. The issue on my system clearly is that network is not fully up and working at the time devd starts. I can set devd_enable="NO" in rc.conf, and the system will just boot fine. devd can be started manually after the system is up with no issue.

Another workaround I found is setting the network interface to "SYNCDHCP" instead of a static setup. This causes the boot process to wait until networking is online and the IP address has been obtained. devd is started after that step and then starts up fine. However, I don't know of a similar switch for a static IP setup. Is there any?
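
Expressed as /etc/rc.conf settings, the two workarounds above might look like the following fragment. The interface name em0 is an assumption (the comment does not name the interface), and only one of the two workarounds would normally be needed:

```shell
# /etc/rc.conf fragment (sketch, not a tested configuration)

# Workaround 1: do not start devd during boot.
devd_enable="NO"

# Workaround 2 (alternative to 1): replace the static address with
# synchronous DHCP so that boot waits for the lease. "em0" is an
# assumed interface name; adjust it to the actual interface.
# ifconfig_em0="SYNCDHCP"
```

With devd_enable="NO", devd can presumably be started by hand once the system is up, for example with `/etc/rc.d/devd onestart`.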

The issue appears to be that the static IP setup has already happened, but the network interface has not acquired its link yet. I see kernel messages about this around devd startup, but apparently that is too late for whatever devd is trying to do here. Strangely enough, I cannot quit it with CTRL-C; it just sits there waiting for something that obviously never happens.
Comment 7 Yoshisato Yanagisawa 2022-06-03 22:39:35 UTC
Thanks, Gerrit, for the comment. Unfortunately, I could not test your suggestion because, through an operating mistake of my own, I made my FreeBSD machine unbootable.

I did a clean install of 13.0-Release on the same machine (with a different HDD) and upgraded to 13.1-Release, and I have not seen this issue again. I suppose it happened only temporarily, on the 12.2-Release to 12.3-Release upgrade.
Comment 8 Dylan 2023-03-28 19:17:33 UTC
Looks the same or similar to:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=262170

I recently upgraded a FreeBSD 10.4 LDAP auth client system to 12.4 using "freebsd-update upgrade -r 12.4-RELEASE" and the usual install (Kernel stuff), reboot, install (Userland stuff), steps after that.  I also ran a "pkg update", which didn't work, got the 'version 2' error, so I ran "pkg bootstrap -f", then "pkg update", then "pkg upgrade -y", which bumped bash, nss_ldap, etc.

On the next reboot I discovered 2 things:

#1. The system hung after the "ifconfig" output from netif and the "Starting devd" message.  Hitting ctrl+t listed:

load: 0.13  cmd: env 340 [nanslp] 15.57r 0.00u 0.02s 0% 6192k

Same thing if I waited a bit and hit it again, incrementing the #r runtime value of course.  NOTE: Manually typed since this was on an ESXi video console.


----
I resolved #1 using the info in the bug link above, reboot, hit 2 for single-user mode, then /bin/sh, then 'mount -at ufs', then 'vi /usr/sbin/service' and change the exec line back to not having -L 0/daemon for now.  I left a comment and the original line for my future self to patch it:

$ less -N /usr/sbin/service
...
    168                 # Open bug PR#262170, hang on boot in [env]
    169                 #exec env -i -L 0/daemon HOME=/ PATH=/sbin:/bin:/usr/sbin:/usr/bin "$dir/$script" "$@"
    170                 exec env -i HOME=/ PATH=/sbin:/bin:/usr/sbin:/usr/bin "$dir/$script" "$@"
    171         fi
    172 done
...
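
To see whether a given system still has the unpatched line, one can look for the -L 0/daemon argument on a non-comment line of the script. A minimal sketch follows; the helper name service_uses_login_class is hypothetical. The hang in env would be consistent with the login class lookup for uid 0 going through nsswitch and LDAP before the network is usable, but that is an inference from this thread and PR 262170, not a confirmed diagnosis.

```shell
#!/bin/sh
# service_uses_login_class [FILE]
# Exit 0 if FILE (default: /usr/sbin/service) contains "-L 0/daemon" on a
# line that is not commented out, non-zero otherwise.
# (Hypothetical helper for illustration.)
service_uses_login_class() {
    file=${1:-/usr/sbin/service}
    # Drop comment lines first so a commented-out original is not counted.
    grep -v '^[[:space:]]*#' "$file" | grep -q -e '-L 0/daemon'
}

# Example:
#   if service_uses_login_class; then
#       echo "service(8) still runs rc scripts via env -L 0/daemon"
#   fi
```

Note that a later freebsd-update run may reinstall /usr/sbin/service and undo a local edit, so the check is worth repeating after each update.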
Comment 9 Dylan 2023-03-28 19:33:42 UTC
Apologies, the '2nd' thing was that 12.4-RELEASE-p1 and bash-5.2.15 don't seem to agree.  Bash dies with this, which is, at a glance, something that showed up in bash 4-ish, but FreeBSD 10.4-p3 had bash 4.4.23 and worked:

    $ uname -a
    FreeBSD <hostname here> 12.4-RELEASE-p1 FreeBSD 12.4-RELEASE-p1 GENERIC  amd64

    $ pkg info | grep bash
    bash-5.2.15                    GNU Project's Bourne Again SHell

    $ /usr/local/bin/bash
    ld-elf.so.1: Undefined symbol "rl_sort_completion_matches" referenced from COPY relocation in /usr/local/bin/bash

That symbol was added here:
https://github.com/akamai/bash/blob/no-function-forward/CHANGES#L2249


Why is this error relevant here?  Switching to 'bash-static' works, but doesn't play nice with nss_ldap, so prompts for remote (Pulled from LDAP) users result in:

    [I have no name!@<your server name> <Current path>]$

A 'whoami' shows them correctly mapped, so I just switched the default shell to /bin/sh for now until I can find a fixed version of bash.

Example I found of someone else noting bash-static didn't work with LDAP mapping:
https://lists.freebsd.org/pipermail/freebsd-stable/2014-March/077852.html

So, not relevant if your LDAP client doesn't use bash for root or users, but heads up if they do that it may not play nice with 12.4-p1.