Bug 238785 - sysutils/py-google-compute-engine: Some GCE packages fill the logs with an error message
Summary: sysutils/py-google-compute-engine: Some GCE packages fill the logs with an er...
Status: Open
Alias: None
Product: Ports & Packages
Classification: Unclassified
Component: Individual Port(s) (show other bugs)
Version: Latest
Hardware: Any Any
: --- Affects Some People
Assignee: freebsd-ports-bugs mailing list
URL:
Keywords: needs-qa, regression
Depends on:
Blocks:
 
Reported: 2019-06-24 09:58 UTC by Mage
Modified: 2019-08-14 09:47 UTC (History)
4 users (show)

See Also:
koobs: maintainer-feedback? (lucas.kanashiro)


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Mage 2019-06-24 09:58:06 UTC
Hello,

I’ve been running FreeBSD servers in Compute Engine for years.

In the last few months, my servers’ logs are full of a specific error message:

Jun 12 09:34:21 xxx google-accounts: ERROR Exception calling the response handler. [Errno 2] No such file or directory. Traceback (most recent call last):   File "/usr/local/lib/python2.7/site-packages/google_compute_engine/metadata_watcher.py", line 200, in WatchMetadata     handler(response)   File "/usr/local/lib/python2.7/site-packages/google_compute_engine/accounts/accounts_daemon.py", line 283, in HandleAccounts     self._UpdateUsers(desired_users)   File "/usr/local/lib/python2.7/site-packages/google_compute_engine/accounts/accounts_daemon.py", line 216, in _UpdateUsers     if not self.utils.UpdateUser(user, ssh_keys):   File "/usr/local/lib/python2.7/site-packages/google_compute_engine/accounts/accounts_utils.py", line 349, in UpdateUser     if not self._UpdateSudoer(user, sudoer=True):   File "/usr/local/lib/python2.7/site-packages/google_compute_engine/accounts/accounts_utils.py", line 271, in _UpdateSudoer     subprocess.check_call(command.split(' '))   File "/usr/local/lib/python2.7/subprocess.py", line 185, in c
Jun 12 09:34:21 xxx kernel: ess.py", line 185, in c

It’s there in every second minute or so.

FreeBSD xxx 11.2-RELEASE-p10 FreeBSD 11.2-RELEASE-p10 #0: Mon May 13 21:20:50 UTC 2019     root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC  amd64

All packages are up to date.

I think it comes from this package/port

py27-google-compute-engine-2.8.13

I hope it helps.

Mage
Comment 1 Lucas Kanashiro 2019-06-26 12:27:51 UTC
(In reply to Mage from comment #0)

Hi Mage, First of all thanks for this bug report.

I am not sure if you updated your ports tree because the latest version we have is 2.8.16. Are you able to update the sysutils/py-google-compute-engine to the latest version and check if the error persists?
Comment 2 Mage 2019-06-26 13:17:15 UTC
(In reply to Lucas Kanashiro from comment #1)
Hello Lucas,

on my production servers, I use pkg only, and I have no ports tree checked out. I will do it now for testing.
Comment 3 Mage 2019-06-26 13:54:03 UTC
I didn’t use the ports in the last few years (since PostgreSQL got proper ICU support in the binary package). Maybe I was doing it wrong.

To me, it seems a bit confusing:

# portmaster --list-origins
[...]
net/google-cloud-sdk
[...]
sysutils/panicmail
ports-mgmt/pkg
ports-mgmt/portmaster
[...]
devel/py-boto
sysutils/py-google-compute-engine
[...]

# portmaster -n sysutils/py-google-compute-engine

[...]

===>>> sysutils/py-google-compute-engine >> (17)

===>>> The following actions will be taken if you choose to proceed:
	Install sysutils/py-google-compute-engine
	Install devel/py-boto@py36
	Install devel/py-setuptools@py36
	Install lang/python36
	Upgrade gettext-runtime-0.19.8.1_2 to gettext-runtime-0.20.1
	Upgrade pkg-1.10.5_5 to pkg-1.11.1
	Install devel/gettext-tools
	Install devel/libtextstyle
	Install devel/pkgconf
	Upgrade readline-7.0.5 to readline-8.0.0
	Install net/py-netaddr@py36
	Install net/py-netifaces@py36
	Upgrade google-compute-engine-oslogin-1.4.3 to google-compute-engine-oslogin-1.5.3_1
	Upgrade curl-7.65.0_2 to curl-7.65.1
	Upgrade perl5-5.28.1_1 to perl5-5.28.2
	Upgrade ca_root_nss-3.44 to ca_root_nss-3.44.1
	Upgrade libnghttp2-1.37.0 to libnghttp2-1.39.1
	Install sysutils/py-distro@py36

===>>> Proceed? y/n [y] n


# pkg install py-google-compute-engine
Updating FreeBSD repository catalogue...
FreeBSD repository is up to date.
All repositories are up to date.
pkg: No packages available to install matching 'py-google-compute-engine' have been found in the repositories

# pkg install py27-google-compute-engine
Updating FreeBSD repository catalogue...
FreeBSD repository is up to date.
All repositories are up to date.
Checking integrity... done (0 conflicting)
The most recent version of packages are already installed

# portmaster -n py27-google-compute-engine

[...]

===>>> py27-google-compute-engine-2.8.13_5 >> (15)

===>>> The following actions will be taken if you choose to proceed:
	Upgrade py27-google-compute-engine-2.8.13_5 to py27-google-compute-engine-2.8.16
	Upgrade py27-boto-2.49.0 to py27-boto-2.49.0_1
	Upgrade py27-setuptools-40.8.0 to py27-setuptools-41.0.0
	Upgrade python27-2.7.16 to python27-2.7.16_1
	Upgrade gettext-runtime-0.19.8.1_2 to gettext-runtime-0.20.1
	Upgrade pkg-1.10.5_5 to pkg-1.11.1
	Install devel/gettext-tools
	Install devel/libtextstyle
	Install devel/pkgconf
	Upgrade readline-7.0.5 to readline-8.0.0
	Upgrade py27-netaddr-0.7.19 to py27-netaddr-0.7.19_1
	Upgrade google-compute-engine-oslogin-1.4.3 to google-compute-engine-oslogin-1.5.3_1
	Upgrade curl-7.65.0_2 to curl-7.65.1
	Upgrade perl5-5.28.1_1 to perl5-5.28.2
	Upgrade ca_root_nss-3.44 to ca_root_nss-3.44.1
	Upgrade libnghttp2-1.37.0 to libnghttp2-1.39.1

===>>> Proceed? y/n [y] n


To me, it looks like that:

portmaster is telling me I have sysutils/py-google-compute-engine port installed.

When I try to upgrade it with portmaster, it turns out it’s not installed.

If I would install it with portmaster, it would not upgrade the sysutils/py27-google-compute-engine port.

pkg tells me sysutils/py27-google-compute-engine is the latest version as a package.  If I upgrade it via portmaster, I end up with a mixed system (pkg and port). In theory, I could do it as I did it in the past. In this case, it wouldn’t be fun because the readline upgrade would break my userland binaries that are running as the primary purpose of the servers.

Can we upgrade it via pkg?
Comment 4 Lucas Kanashiro 2019-06-28 18:01:24 UTC
(In reply to Mage from comment #3)
In fact, I do not know how those packages are released, someone from the FreeBSD team might know how these things work better than me. In my regular tests I've been managing my ports tree using 'portsnap', it has been working well so far.

With the latest version (2.8.16) I was not able to reproduce what you reported. Let's see if someone more FreeBSD knowledgeable can help you on getting the latest version.
Comment 5 Mage 2019-06-29 16:16:24 UTC
Well, most of the time, the package and the port names are similar. (I used to believe they always are.)

I know how to install it from ports. But it would upgrade readline-7.0.5 to readline-8.0.0 hence break my most important compiled binaries. 

I will test it a VM soon.
Comment 6 Mage 2019-07-08 11:32:21 UTC
It got interesting. 

First, I could not reproduce the issue in a VM on my home computer. I thought the reason was something like the daemon couldn’t connect and refused to run.

Then I created a test machine on Google cloud, using an image I created on 28.1.2019.

The issue also didn’t appear on this machine. That was normal as I started having it only this year, likely after January.

However, when I upgraded the test machine’s packages to the current version with pkg, the issue also did not appear.

Yet every machine that has been running in the last few months and that I kept upgrading on a regular basis has the issue. In addition that I know I didn’t edit any related config file (I don’t even know if py27-google-compute-engine has one in the first place), the machines are on two separated accounts. On one of the accounts, I hardly ever edit any config.

I thought that somewhere between 28 January and today, a specific version of py27-google-compute-engine must have created some file that causes the issue. This would explain why I don’t see it on either a fresh install (home VM) or on the server I upgraded from 28 January to today’s state in a single step.

By the way, the upgrade was:

py27-google-compute-engine: 2.8.6 -> 2.8.13_5

I uninstalled every package that had "google" in their name. I executed pkg autoremove as well.

There aren’t many files left in the system that have "google" in their path.

There is a folder called "/var/lib/google". That reminds me of the bad days of Linux. Shouldn’t it be /var/db/google?

/var/lib/google # ls -l
total 14
drwxr-xr-x  7 root  wheel   7 Nov 15  2016 per-instance
-rw-r--r--  1 root  wheel  20 Nov 15  2016 vm-instance-id
-rw-rw-rw-  1 root  wheel   0 Sep  9  2018 vm-instance-id.lock

The 'per-instance' directory contains empty subdirectories that are named as a 19 digits decimal number. There are five of them from November 2016. 

That’s probably when I created my first instance. All my servers in Google Compute Engine are a descendant or let’s say, a clone of my first instance that I created from the FreeBSD image on GCE. I have been upgrading them for years. 

The vm-instance-id is neither the same nor different on each of my servers. On some of the servers, it’s the same. On a few other servers, it’s unique. 

The file is from 2016 Movember from each of my servers. The lock is from 2018 September 9 everywhere. I’m not sure, but the latter might be when I migrated from the deprecated google-cloud-something-cant-remember to the py27-google-compute-engine.

I feel like deleting the /var/lib/google everywhere, or more like rm -rf /var/lib. Can I do that?

In addition to making things right, I’m trying to figure out the cause of the issue that’s the original subject of this ticket.
Comment 7 Mage 2019-07-08 14:37:07 UTC
There are multiple problems.

I think it might be dependencies. For example, I managed to uninstall py27-boto-2.49.0 using pkg autoremove.

I’m not sure, how. I think it happened after I removed google-cloud-sdk.

I removed py27-google-compute-engine, then reinstalled it with pkg. It did not install the py27-boto, but provided me an error message and quit at startup.

I tried to migrate to py36-google-compute-engine. It also does not install py36-boto, and quits after displaying this:


  File "/usr/local/bin/google_accounts_daemon-3.6", line 6, in <module>
    from pkg_resources import load_entry_point
  File "/usr/local/lib/python3.6/site-packages/pkg_resources/__init__.py", line 3191, in <module>
    @_call_aside
  File "/usr/local/lib/python3.6/site-packages/pkg_resources/__init__.py", line 3175, in _call_aside
    f(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/pkg_resources/__init__.py", line 3204, in _initialize_master_working_set
    working_set = WorkingSet._build_master()
  File "/usr/local/lib/python3.6/site-packages/pkg_resources/__init__.py", line 583, in _build_master
    ws.require(__requires__)
  File "/usr/local/lib/python3.6/site-packages/pkg_resources/__init__.py", line 900, in require
    needed = self.resolve(parse_requirements(requirements))
  File "/usr/local/lib/python3.6/site-packages/pkg_resources/__init__.py", line 786, in resolve
    raise DistributionNotFound(req, requirers)
pkg_resources.DistributionNotFound: The 'boto' distribution was not found and is required by google-compute-engine


I believe the dependencies of both py27-google-compute-engine and py36-google-compute-engine are messed up. Not requiring boto is another issue, not the one that made me open this ticket. 

Most likely, I removed another pkg with autoremove between January 2019 and now, wich breaks py27-google-compute-engine.

I think someone should review its dependencies.

The above was all about the pkg versions. Now I’m going to try fixing it with ports.

By the way, I upgraded the testing servers that I created for this issue to FreeBSD 12. It didn’t help.
Comment 8 Mage 2019-07-08 16:39:36 UTC
I did my best. The rest is up to you. Here is what I found:


I created two servers, test1 and test2. test1 is a clone or a running server. test2 was a saved image of that server in January 2019.

I booted up both servers. test1 put the error message into the log every minute. test2 did not put there anything.

I upgraded both servers using pkg upgrade. I compared the package list with diff. They were identical except for the PostgreSQL version (10 vs. 11). It was the only difference.

test1 kept adding the error to the logs. test2 did not do it.

(All my running servers behave the same as test1).

I ran diff -ur on /usr/local/etc and /etc to compare the both servers. I found almost nothing belpful. They were almost the same, except fot the apache config in /usr/local/etc.

There was in interesting difference though. On test2, my user was in the google-sudoers group. On test1, it was not.

google-sudoers:*:1003:
vs. 
google-sudoers:*:1003:mage

I don’t like sudo. On my non-google FreeBSD servers, I’ve never had it installed. On GCE, I use it solely for 'sudo su -'

It is sure I did not remove myself from that file. I didn’t even know I was in that group. I don’t even know what that group does because I can 'sudo su -' on any server on which I’m not a member of that group.

As the log was not helpful, I went throgh the three files, and I added a line to /usr/local/lib/python2.7/subprocess.py

Line 185:
print popenargs

Then, I killed the daemons (the service google-xxx-daemon stop doesn’t work by the way), and I ran it from console.

root@test1 ~ # /usr/local/bin/google_accounts_daemon-2.7
(['gpasswd', '-a', 'mage', 'google-sudoers'],)
(['gpasswd', '-a', 'mage', 'google-sudoers'],)
[...]

It tries to add me to that group in every minute using gpasswd. There is no gpasswd on the server, and I think it never was.

For this reason, test2 also couldn’t have gpasswd. I was surprised because it did the following:

root@test2 ~ # /usr/local/bin/google_accounts_daemon-2.7
(['pw', 'groupmod', 'google-sudoers', '-m', 'mage'],)

For some reason, test2 didn’t try to execute gpasswd (Linux) but the pw command, as it should on FreeBSD.

Another difference is that test2 executed the command only once, right after I started the daemon. It never tried it again. test1 tried to do it in every minute but with a different command.

I added myself to that group on test1.

pw usermod mage -G google-sudoers

It did not help. The daemon was trying to add me, with the wrong command.

Now more or less the only difference between the two servers is the FreeBSD version.

FreeBSD test1 11.2-RELEASE-p9 FreeBSD 11.2-RELEASE-p9 #0: Tue Feb  5 15:30:36 UTC 2019     root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC  amd64

FreeBSD test2 11.2-RELEASE-p8 FreeBSD 11.2-RELEASE-p8 #0: Tue Jan  8 21:35:12 UTC 2019     root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC  amd64

I’m not sure this causes them to run different commands for adding me to the group. I will run freebsd-update on both, and see what happens.
Comment 9 Mage 2019-07-08 17:16:21 UTC
Updating to the current 11.2 July 2 release didn’t change anything, but I found this:


diff -ur /usr/local/etc/instance_configs.cfg usr/instance_configs.cfg
--- /usr/local/etc/instance_configs.cfg 2019-07-08 19:06:47.267559000 +0200
+++ usr/instance_configs.cfg    2019-07-08 17:35:15.202994000 +0200
@@ -7,13 +7,13 @@

 [Accounts]
 deprovision_remove = false
-gpasswd_add_cmd = gpasswd -a {user} {group}
-gpasswd_remove_cmd = gpasswd -d {user} {group}
 groupadd_cmd = pw groupadd {group}
 groups = adm,dip,docker,lxd,plugdev,video
 useradd_cmd = pw useradd {user} -m
 userdel_cmd = pw userdel {user} -r
 usermod_cmd = pw usermod {user} -G {groups}
+gpasswd_add_cmd = pw groupmod {group} -m {user}
+gpasswd_remove_cmd = pw groupmod {group} -d {user}

 [Daemons]
 accounts_daemon = true

I guess there was a version that changed this file, and none of the updates corrected it.
Comment 10 Mage 2019-07-08 17:41:43 UTC
If I delete /usr/local/etc/instance_configs.cfg on test1, and reboot the machine, the problem gets solved. It’s weird for a few reasons:

1. 
The file says: 
"Do not edit this file directly. If you need to add
# items to this file, create or edit
# /usr/local/etc/instance_configs.cfg.template"

2. 
There is no /usr/local/etc/instance_configs.cfg.template file.

3. Before I deleted the file, I checked its modification date, and it seemed it was generated at the last boot.

-rw-r--r--  1 root   wheel   1299 Jul  8 19:30 instance_configs.cfg


4. 
Although there is no /usr/local/etc/instance_configs.cfg.template file, there are two others:

-rw-r--r--  1 root   wheel    522 Jul  5 08:48 instance_configs.cfg.distro
-rw-r--r--  1 root   wheel    522 Jul  5 08:48 instance_configs.cfg.distro.sample

And the content of these files is correct. 

test1 before delete and reboot:

root@test1 /usr/local/etc # md5 instance_configs.cfg*
MD5 (instance_configs.cfg) = 7ac0b2a92b082fde5ec55d5183472db1
MD5 (instance_configs.cfg.distro) = 1fe870481fd1c6568e1564408f620569
MD5 (instance_configs.cfg.distro.sample) = 1fe870481fd1c6568e1564408f620569


test2:
root@test2 /usr/local/etc # md5 instance_configs.cfg*
MD5 (instance_configs.cfg) = 921a4751023351ee709eeb7df53992f6
MD5 (instance_configs.cfg.distro) = 1fe870481fd1c6568e1564408f620569
MD5 (instance_configs.cfg.distro.sample) = 1fe870481fd1c6568e1564408f620569

test1: after delete and reboot:
MD5 (instance_configs.cfg) = 857d6d4a96ac7d577c9b66bc15156776
MD5 (instance_configs.cfg.distro) = 1fe870481fd1c6568e1564408f620569
MD5 (instance_configs.cfg.distro.sample) = 1fe870481fd1c6568e1564408f620569

I’m not sure where it comes from, and why it wasn’t properly overwritten without deleting it first.
Comment 11 MIHIRA Sanpei Yoshiro freebsd_committer 2019-08-14 09:47:17 UTC
I also had this issue on Google Could Platform/Compute Engine.

I solved below steps
1. remove /usr/local/etc/instance_configs.cfg
2. execute /usr/local/bin/google_instance_setup-2.7 command (and created instance_configs.cfg
3. /usr/local/etc/rc.d/google_accounts_daemon restart

Thank you for your information.