Bug 240443 - Upgrade from 11Stable to 12Stable jail behaviour inconsistency
Status: Closed Works As Intended
Alias: None
Product: Base System
Classification: Unclassified
Component: bin
Version: 12.0-STABLE
Hardware: Any Any
Importance: --- Affects Only Me
Assignee: freebsd-jail (Nobody)
Keywords: regression
Reported: 2019-09-09 12:29 UTC by Dewayne
Modified: 2019-09-11 00:54 UTC (History)

Description Dewayne 2019-09-09 12:29:08 UTC
Environment: amd64 running multiple i386 and amd64 jails on xeon mb.

# uname -a
FreeBSD hathor 12.0-STABLE FreeBSD 12.0-STABLE #0 r351834M: Sat Sep  7 17:20:37 AEST 2019  amd64

Summary:
- Upgraded a long running amd64 host from 11.3S to 12Stable from source.
- All i386 jails have a common environment; as do the amd64's.
- "jexec -U root b1 tcsh" works for one jail but not another; prior to upgrade to 12Stable everything worked reliably and consistently.
- One jail performs "ls -l" successfully; the other responds with "Bad system call"
- "ls" works on both (??)

Details
Prior to the 12Stable upgrade from 11.3S, everything worked as expected. We'll focus on two i386 jails (b1 & b3) on this amd64 host.  The host has been running in this way for a couple of years and is reliable.  There are no hardware issues.

Jail b1 and jail b3 share common files for / and /usr as shown below:

~# ls -l /usr/jails/b1/|grep ^l
lrwxr-xr-x   1 root  wheel       7 Jun 21  2014 bin -> /bj/bin
lrwxr-xr-x   1 root  wheel       8 Apr  7  2015 boot -> /bj/boot
lrwxr-xr-x   1 root  wheel       7 Jun 21  2014 lib -> /bj/lib
lrwxr-xr-x   1 root  wheel      11 Jun 21  2014 libexec -> /bj/libexec
lrwxr-xr-x   1 root  wheel      10 Jun 21  2014 rescue -> /bj/rescue
lrwxr-xr-x   1 root  wheel       8 Jun 21  2014 sbin -> /bj/sbin
lrwxr-xr-x   1 root  wheel      15 Jun 21  2014 sys -> /bj/usr/src/sys
~#
~# ls -l /usr/jails/b1/usr | grep ^l
lrwxr-xr-x   1 root  wheel    11 Sep 24  2014 bin -> /bj/usr/bin
lrwxr-xr-x   1 root  wheel    15 Jun 21  2014 include -> /bj/usr/include
lrwxr-xr-x   1 root  wheel    11 Sep 24  2014 lib -> /bj/usr/lib
lrwxr-xr-x   1 root  wheel    15 Jun 21  2014 libdata -> /bj/usr/libdata
lrwxr-xr-x   1 root  wheel    15 Sep 24  2014 libexec -> /bj/usr/libexec
lrwxr-xr-x   1 root  wheel    12 Sep 24  2014 sbin -> /bj/usr/sbin
lrwxr-xr-x   1 root  wheel    13 Jun 21  2014 share -> /bj/usr/share

bj is the mount point of a directory that contains the shared / and /usr files:
~# df -h | grep bj1
/usr/jails/bj1        204G    113G     75G    60%    /usr/jails/b1/bj
/usr/jails/bj1        204G    113G     75G    60%    /usr/jails/b3/bj

It's reasonable to expect that any command would have the same outcome in both jails, which is normally the case.

After upgrading the host and the jail "roots" (bj1 in this case), things were no longer consistent.

A lot of things that worked in b1 no longer worked in b3.  For example:
in b1, "ls" and "ls -l /" produced the expected output;
in b3, "ls" worked correctly but "ls -l" produced "Bad system call".
Jail b3 is no longer functional: pkg fails and tar fails.  Fortunately, the /rescue commands all function as expected.

To reduce the problem to its simplest terms, I started the respective jails and then ran some basic commands from the host against each jail:

From host to b1 - the working i386 jail
~#  jexec b1 tcsh
b1# exit

~# jexec b1 /bin/tcsh
# exit

~# jexec b1 sh
# exit
---
From host to b3 - the sick puppy
~# jexec b3 tcsh
Bad system call

~# jexec b3 /bin/tcsh
Bad system call

~# /usr/sbin/jexec -U root b3 tcsh ;# An interesting case as the default user in all tests is also root
# exit

~# jexec b3 /bin/sh
# exit

I hope this provides sufficient information about the problem.  Can someone possibly reproduce it?

I have not yet been able to test whether the order in which the jails start is significant (perhaps tomorrow).
Comment 1 Jamie Gritton freebsd_committer freebsd_triage 2019-09-10 20:51:51 UTC
Without yet trying to duplicate anything, I see a possibility: PAM.  The difference between "ls" and "ls -l" is the latter looks up users.  The difference between "jexec" and "jexec -U root" is that the former (the default) looks up users inside the jail.  So it seems at first glance that looking up a user gives you a bad system call.

While your libraries are necessarily all the same, your configurations may not be.  Is it possible that the PAM config on b3 is calling some library that b1 isn't?
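
A minimal sketch of how the reasoning above could be checked from a shell inside the affected jail (the report shows that /bin/sh still starts in b3).  The function name check_user_lookup is hypothetical, and the sketch assumes getent(1) is available in the jail.  It relies on "ls -n" printing numeric IDs, so a long listing can be produced without any user lookup; if that works while a direct lookup fails, the name lookup path is the likely culprit.

```shell
# Hypothetical helper: separates "long listing is broken" from
# "user name lookup is broken".
check_user_lookup() {
    # $1 = user name to resolve (root in the report)

    # ls -ln is a long listing with numeric IDs, so no name lookup is needed.
    if ! ls -ln / >/dev/null 2>&1; then
        echo "ls -ln failed: the problem is not limited to name lookup"
        return 2
    fi

    # getent resolves the user through the configured name services.
    if getent passwd "$1" >/dev/null 2>&1; then
        echo "lookup of $1 succeeded"
    else
        echo "lookup of $1 failed: inspect the name service configuration"
        return 1
    fi
}
```

Running "check_user_lookup root" in both b1 and b3 and comparing the results would show whether the two jails differ only in how users are resolved.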
Comment 2 Dewayne 2019-09-11 00:54:41 UTC
(In reply to Jamie Gritton from comment #1)
Jamie, you have no idea how helpful that suggestion was.  I've had a stable environment for so long that I'd forgotten about /etc/nsswitch.conf.  Because the system calls have changed, winbind failed, and it was the first source on the passwd line.  Now I can update my packages and hopefully be operational again :)

Big hug of appreciation for sharing insight.
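
For anyone who lands here with the same symptom, the difference found above can be surfaced by comparing the source order in each jail's nsswitch.conf.  The following is a sketch only: the function name nss_sources is hypothetical, and it assumes the usual "database: source source ..." layout with a space after the colon.

```shell
# Hypothetical helper: print, one per line, the sources configured for a
# database in an nsswitch.conf file, in lookup order.
nss_sources() {
    # $1 = database name (e.g. passwd), $2 = path to an nsswitch.conf file
    awk -v db="$1:" '
        { sub(/#.*/, "") }              # drop comments
        $1 == db {
            gsub(/\[[^]]*\]/, "")       # drop [STATUS=action] criteria
            for (i = 2; i <= NF; i++) print $i
        }
    ' "$2"
}
```

From the host, "nss_sources passwd /usr/jails/b1/etc/nsswitch.conf" and the same command for b3 can then be compared (the etc/nsswitch.conf location under each jail root is assumed here, not taken from the report).  A jail that lists winbind before files would match the failure described in comment 2.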