Environment: amd64 host running multiple i386 and amd64 jails on a Xeon motherboard.
# uname -a
FreeBSD hathor 12.0-STABLE FreeBSD 12.0-STABLE #0 r351834M: Sat Sep 7 17:20:37 AEST 2019 amd64
- Upgraded a long running amd64 host from 11.3S to 12Stable from source.
- All i386 jails have a common environment; as do the amd64's.
- "jexec -U root b1 tcsh" works for one jail but not another; prior to upgrade to 12Stable everything worked reliably and consistently.
- One jail performs "ls -l" successfully; the other responds with "Bad system call"
- "ls" works on both (??)
Prior to the 12Stable upgrade from 11.3S, everything worked as expected. We'll focus on two i386 jails (b1 & b3) on this amd64 host. The host has been running in this way for a couple of years and is reliable. There are no hardware issues.
Jail b1 and jail b3 share common files for / and /usr as shown below:
~# ls -l /usr/jails/b1/|grep ^l
lrwxr-xr-x 1 root wheel 7 Jun 21 2014 bin -> /bj/bin
lrwxr-xr-x 1 root wheel 8 Apr 7 2015 boot -> /bj/boot
lrwxr-xr-x 1 root wheel 7 Jun 21 2014 lib -> /bj/lib
lrwxr-xr-x 1 root wheel 11 Jun 21 2014 libexec -> /bj/libexec
lrwxr-xr-x 1 root wheel 10 Jun 21 2014 rescue -> /bj/rescue
lrwxr-xr-x 1 root wheel 8 Jun 21 2014 sbin -> /bj/sbin
lrwxr-xr-x 1 root wheel 15 Jun 21 2014 sys -> /bj/usr/src/sys
~# ls -l /usr/jails/b1/usr | grep ^l
lrwxr-xr-x 1 root wheel 11 Sep 24 2014 bin -> /bj/usr/bin
lrwxr-xr-x 1 root wheel 15 Jun 21 2014 include -> /bj/usr/include
lrwxr-xr-x 1 root wheel 11 Sep 24 2014 lib -> /bj/usr/lib
lrwxr-xr-x 1 root wheel 15 Jun 21 2014 libdata -> /bj/usr/libdata
lrwxr-xr-x 1 root wheel 15 Sep 24 2014 libexec -> /bj/usr/libexec
lrwxr-xr-x 1 root wheel 12 Sep 24 2014 sbin -> /bj/usr/sbin
lrwxr-xr-x 1 root wheel 13 Jun 21 2014 share -> /bj/usr/share
bj is a mount point for a directory that contains the shared / and /usr trees:
~# df -h | grep bj1
/usr/jails/bj1 204G 113G 75G 60% /usr/jails/b1/bj
/usr/jails/bj1 204G 113G 75G 60% /usr/jails/b3/bj
It's reasonable to expect that any command would have the same outcome in both jails, which is normally the case.
After upgrading the host and the jail "roots" (bj1 in this case), things were no longer consistent.
A lot of things that worked in b1 no longer worked in b3. For example, in b1:
ls and ls -l / produced the expected output.
In b3:
ls worked correctly but "ls -l" produced a "Bad system call". This jail is no longer functional: pkg fails, tar fails. Fortunately the /rescue commands all function as expected.
To reduce the problem to its simplest terms, after starting the respective jails I made some basic connections to each jail from the host environment:
From host to b1 - the working i386 jail
~# jexec b1 tcsh
~# jexec b1 /bin/tcsh
~# jexec b1 sh
From host to b3 - the sick puppy
~# jexec b3 tcsh
Bad system call
~# jexec b3 /bin/tcsh
Bad system call
~# /usr/sbin/jexec -U root b3 tcsh ;# An interesting case as the default user in all tests is also root
~# jexec b3 /bin/sh
I hope this provides sufficient information about the problem, and that someone can possibly reproduce it.
I'm unable to test whether the order in which the jails start is significant (perhaps tomorrow).
Without yet trying to duplicate anything, I see a possibility: PAM. The difference between "ls" and "ls -l" is the latter looks up users. The difference between "jexec" and "jexec -U root" is that the former (the default) looks up users inside the jail. So it seems at first glance that looking up a user gives you a bad system call.
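One way to check this hypothesis is sketched below. It relies on "ls -n" printing numeric IDs instead of names, so it should not need the user lookup that "ls -l" performs. The helper name probe is made up for illustration, and the jail name b3 is the one from the report.

```shell
#!/bin/sh
# probe: run a command, discard its output, and report its exit status.
# (Hypothetical helper for illustration, not part of any FreeBSD tool.)
probe() {
    "$@" >/dev/null 2>&1
    echo "exit $?: $*"
}

# On the affected host, run the checks inside the jail, e.g.:
#   probe jexec b3 ls -ln /    # numeric IDs, no name lookup expected
#   probe jexec b3 ls -l /     # resolves user and group names
#   probe jexec b3 id root     # direct user lookup
# Run locally here so the script works anywhere:
probe ls -ln /
probe ls -l /
```

In sh, an exit status above 128 means the command was killed by a signal, which is what a "Bad system call" would look like here. If only the commands that resolve names fail, the lookup path is the likely culprit.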
While your libraries are necessarily all the same, your configurations may not be. Is it possible that the PAM config on b3 is calling some library that b1 isn't?
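A rough way to compare the two jails' lookup-related configuration is sketched here. It assumes each jail keeps its own /etc under the jail root (for example /usr/jails/b1/etc), which the symlink listing suggests but does not confirm. The function name compare_conf is hypothetical.

```shell
#!/bin/sh
# compare_conf ROOT_A ROOT_B: report whether lookup-related config
# differs between two jail roots. (Hypothetical helper.)
compare_conf() {
    for rel in etc/nsswitch.conf etc/pam.d; do
        # diff exits non-zero if the paths differ or one is missing
        if diff -r "$1/$rel" "$2/$rel" >/dev/null 2>&1; then
            echo "same: $rel"
        else
            echo "DIFFERENT: $rel"
        fi
    done
}

# Example, assuming the jail roots from the report:
#   compare_conf /usr/jails/b1 /usr/jails/b3
```

Anything reported as DIFFERENT is worth reading with a plain "diff -ru" to see which module or source b3 uses that b1 does not.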
(In reply to Jamie Gritton from comment #1)
Jamie, you have no idea how helpful that suggestion was. I've had a stable environment for so long, I'd forgotten about /etc/nsswitch.conf. Because the system calls have changed, winbind failed; it was the first entry on the passwd line. Now I can update my packages and hopefully be operational :)
Big hug of appreciation for sharing insight.
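For anyone hitting the same symptom, a small sketch for seeing which sources a jail consults for user lookups, and in what order. The function name nss_sources is made up, the path assumes the jail's /etc lives under the jail root, and the "winbind files" ordering in the comment is only an illustration, since the actual line was not posted.

```shell
#!/bin/sh
# nss_sources FILE DB: print the sources listed for database DB
# (e.g. passwd or group) in an nsswitch.conf-style file.
# (Hypothetical helper for illustration.)
nss_sources() {
    awk -v db="$2:" '$1 == db { $1 = ""; sub(/^[ \t]+/, ""); print }' "$1"
}

# Example, assuming the jail root from the report:
#   nss_sources /usr/jails/b3/etc/nsswitch.conf passwd
# A line such as "passwd: winbind files" would print "winbind files",
# meaning winbind is consulted before the local files.
```

If the first source listed comes from a package built for the old release, that is the first thing to move or rebuild.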