Bug 232350 - periodic: pkg-checksum and pkg-backup interfere with 'overnight' port builds
Summary: periodic: pkg-checksum and pkg-backup interfere with 'overnight' port builds
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: conf
Version: CURRENT
Hardware: Any Any
Importance: --- Affects Some People
Assignee: freebsd-bugs (Nobody)
URL:
Keywords:
Duplicates: 239488
Depends on:
Blocks:
 
Reported: 2018-10-17 11:34 UTC by Bob Frazier
Modified: 2020-05-30 11:18 UTC
CC List: 9 users

Description Bob Frazier 2018-10-17 11:34:04 UTC
When ports are being built at the time the nightly periodic jobs are scheduled, the 'pkg-checksum' and 'pkg-backup' jobs (and possibly others) lock the package database for an extended period of time. This prevents ports from installing in parallel, which breaks the common practice of leaving a time-consuming ports build running overnight.

These jobs appear to run as part of the daily periodic scheduled operations, at around 4AM.  As a result, the entire build process for a port [one with a lot of dependencies, in particular] halts on error at around the same time.


A typical error output might look like this:

===>  Installing for p5-XML-Parser-2.44
===>  Checking if p5-XML-Parser already installed
===>   Registering installation for p5-XML-Parser-2.44 as automatic
pkg-static: Cannot get an exclusive lock on a database, it is locked by another process
*** Error code 75

Stop.
make[1]: stopped in /usr/ports/textproc/p5-XML-Parser
*** Error code 1

Stop.


The only real solution to this problem is to somehow detect that a port build is in progress and thereby prevent these jobs (and others like them) from running automatically.

Typically it can take several DAYS to build and install a large number of ports, such as a mate desktop with a web browser, mail client, and an office suite, running unattended.  It has been possible to do this in the past, and I had done so many times. However, this new SNAFU prevents me from being able to leave a ports build/install unattended, as some minor dependency is likely to fail to install as a result of one or more periodic jobs running at 4AM.  And nobody wants to have to wake up at 4AM to re-start a build because of things like this.


uname output:
FreeBSD fbsd11.hack 11.2-STABLE FreeBSD 11.2-STABLE #1 r339273: Tue Oct  9 21:10:39 PDT 2018     root@hack.SFT.local:/usr/obj/usr/src/sys/GENERIC  amd64

pkg -v :  1.10.5
Comment 1 Brad Davis freebsd_committer freebsd_triage 2018-10-29 17:07:32 UTC
This is a race condition for sure, but in this situation I am not sure how we could easily and reliably tell whether people were building ports.

I would just recommend two options:

1) The preferred option, build in poudriere so that the builds are isolated from the rest of the system.
2) As part of the build command you are running, use sysrc to disable these jobs and at the end of the command use sysrc to reenable them.
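
Option 2 could be sketched roughly as follows. This is only an illustration under stated assumptions, not a tested recipe: the helper name with_pkg_periodic_disabled and the SYSRC and PERIODIC_CONF overrides are hypothetical, and the only knob it touches is security_status_pkgchecksum_enable, which is named elsewhere in this bug. Any other job (pkg-backup, for example) would need its own knob handled the same way. Note that it writes YES at the end regardless of the previous setting, and it does not handle the build being interrupted by a signal.

```shell
#!/bin/sh
# Sketch: disable the pkg-checksum periodic job while a command runs,
# then re-enable it. SYSRC and PERIODIC_CONF are hypothetical overrides.
SYSRC=${SYSRC:-sysrc}
PERIODIC_CONF=${PERIODIC_CONF:-/etc/periodic.conf}

with_pkg_periodic_disabled() {
    # Turn the job off before the build starts; give up if that fails.
    "$SYSRC" -f "$PERIODIC_CONF" security_status_pkgchecksum_enable=NO \
        >/dev/null || return 1

    # Run the build command and remember its exit status.
    build_rc=0
    "$@" || build_rc=$?

    # Turn the job back on whether or not the build succeeded.
    "$SYSRC" -f "$PERIODIC_CONF" security_status_pkgchecksum_enable=YES \
        >/dev/null

    return "$build_rc"
}
```

It would be used along the lines of "with_pkg_periodic_disabled make -C /usr/ports/textproc/p5-XML-Parser install clean", run as root.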
Comment 2 Walter Schwarzenfeld freebsd_triage 2018-10-31 00:14:00 UTC
The question is whether this is a problem in pkg, or whether it can be handled in the periodic scripts.
Comment 3 Ian Lepore freebsd_committer freebsd_triage 2018-11-08 20:31:53 UTC
IMO, the fix for this should strive to reduce failures due to concurrent access by the nightly periodic jobs to nearly zero, then ensure that if there is a failure, it is in the periodic jobs, not in the ports build process.

A way to accomplish that would be to change the behavior of the periodic jobs to be something like:

  set a retry counter and retry limit
  do
    copy all files needed by the periodic job to a dir in /tmp
    run the statistics/validation/whatever on the temp files
  while validation is not-successful and retry count < limit  
  clean up temp files

With that logic it's possible for the validation to fail if the copy happened to grab a file that was being updated at the instant of the copy, but a retry loop will reduce the chances of that happening again to almost nothing. Even if it does fail, what fails is the lower-priority periodic work, not the expensive and more-important ports building work.
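
The loop above might look roughly like this in sh. It is a minimal sketch, not what the periodic scripts do today: retry_on_snapshot and RETRY_DELAY are hypothetical names, the validator is whatever command the caller supplies, and it assumes the data to check is a single file that can be copied with cp. Since sh functions share one variable namespace, a validator written as a shell function should avoid the variable names used here.

```shell
#!/bin/sh
# retry_on_snapshot SRC LIMIT CMD [ARGS...]
# Copy SRC into a private temp dir, run CMD ARGS... <copy> on the copy,
# and retry the copy+validate step up to LIMIT times if it fails.
retry_on_snapshot() {
    src=$1
    limit=$2
    shift 2

    tmpdir=$(mktemp -d "${TMPDIR:-/tmp}/pkgsnap.XXXXXX") || return 1
    try=0
    rc=1

    while [ "$try" -lt "$limit" ]; do
        try=$((try + 1))
        # Copy first, then validate the copy; the live file is never locked.
        if cp -p "$src" "$tmpdir/snapshot" && "$@" "$tmpdir/snapshot"; then
            rc=0
            break
        fi
        # The copy may have caught the file mid-update; wait and try again.
        sleep "${RETRY_DELAY:-1}"
    done

    # Clean up the temp files whether or not validation succeeded.
    rm -rf "$tmpdir"
    return "$rc"
}
```

If every attempt fails, the function returns non-zero, so the failure lands in the periodic job rather than in the ports build.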
Comment 4 Walter Schwarzenfeld freebsd_triage 2018-11-14 19:54:33 UTC
See also bug #227952.
Comment 5 Alex Kozlov freebsd_committer freebsd_triage 2018-11-14 20:21:50 UTC
IMO pkg locking should be fixed. There is no need to hold the lock for the entire duration of a pkg-check run. The poudriere suggestion is good, but it is more of a workaround.
Comment 6 Ian Lepore freebsd_committer freebsd_triage 2018-11-14 20:29:26 UTC
(In reply to Alex Kozlov from comment #5)

"Should be fixed" is easy to say, but it's meaningless.  Provide details, what's your proposal?  It doesn't matter if the lock can be held for a shorter duration, the locking conflicts will still occur, it will just be less likely. But failing to handle a less-likely error is still a failure.

How specifically do you propose to handle the case where nightly periodic jobs are running at the same time as package building and/or installation and there is lock contention?  Will there be retries?  A limit to how many times or how long to keep retrying?  A way to configure those limits?
Comment 7 Walter Schwarzenfeld freebsd_triage 2018-11-14 20:41:18 UTC
If I set the following in /usr/local/etc/pkg.conf:

LOCK_WAIT = 100;
LOCK_RETRIES = 100;

it gets better, but that does not solve it.
Comment 8 Alex Kozlov freebsd_committer freebsd_triage 2018-11-14 20:43:28 UTC
The race window will shrink to negligible values. Also, there is no need to build and *install* packages at the same time. Even if the original poster doesn't want to use poudriere, which I strongly recommend, he can build packages overnight and install them later, manually.
Comment 9 Bob Frazier 2018-11-15 09:36:34 UTC
The thing is, if a port has a zillion run/build dependencies, and you simply type 'make' for the port, then the dependent packages will all be installed as part of the process.  In the case of build dependencies, it would not be possible to build a package until the build dependencies have been installed.  And the process does not appear to differentiate between build dependencies and run dependencies when you use 'make' (or 'make package' for that matter).

So it's back to 'Catch 22' on the build-dependency packages, if building packages "without installing" is the only option.

Why not just eliminate the problem by making a copy of the relevant files, rather than locking them at all, during the periodic jobs, and working with the copies?  

Ian's suggestion of copying them to /tmp first makes the most sense.

The alternative would be to shut off the periodic jobs somehow while ports are building, even if that means creating a /var/run/ports-are-building file (or similar) and locking it for the duration of the process.
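
A simplified variant of that idea, using a marker file that records a PID instead of a real lock, might look like the sketch below. Everything in it is an assumption for illustration: the function names are hypothetical, nothing in the ports tree or the periodic scripts currently creates or checks such a file, and it does not cope with several builds running at once (the last one to start overwrites the marker).

```shell
#!/bin/sh
# Hypothetical marker file; the path follows the suggestion above.
BUILD_MARKER=${BUILD_MARKER:-/var/run/ports-are-building}

# run_with_build_marker CMD [ARGS...]
# Record this shell's PID in the marker, run the command, remove the marker.
run_with_build_marker() {
    echo "$$" > "$BUILD_MARKER" || return 1
    build_rc=0
    "$@" || build_rc=$?
    rm -f "$BUILD_MARKER"
    return "$build_rc"
}

# ports_build_in_progress
# Returns 0 only if the marker exists and the PID stored in it is alive,
# so a marker left behind by a crashed build does not block the jobs forever.
ports_build_in_progress() {
    [ -f "$BUILD_MARKER" ] || return 1
    marker_pid=$(cat "$BUILD_MARKER" 2>/dev/null) || return 1
    [ -n "$marker_pid" ] && kill -0 "$marker_pid" 2>/dev/null
}
```

A periodic script could then skip its work when ports_build_in_progress succeeds. The kill -0 liveness test is only meaningful when the checker is allowed to signal the builder, which holds when both run as root.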
Comment 10 Alex Kozlov freebsd_committer freebsd_triage 2018-11-15 15:05:33 UTC
> the thing is, if a port has a zillion run/build dependencies, and you
> simply type 'make' for the port, then the dependent packages will all
> be installed as part of the process. 
Ah yes, if you use make, then while you can pre-build the build dependencies,
there are still problems with the other types of dependencies.
I guess there are only two choices: poudriere or disabling the pkg-* periodic
scripts. You can run them from cron at another time.

> Why not just eliminate the problem by making a copy of the relevant
> files, rather than locking them at all, during the periodic jobs,
> and working with the copies?
Well, if you work on a copy of the pkgdb, then you check against and back up a potentially stale version of the db.
Comment 11 Ian Lepore freebsd_committer freebsd_triage 2018-11-15 15:26:20 UTC
(In reply to Alex Kozlov from comment #10)

It is not an error or a problem to be working on a snapshot of the database taken at the beginning of the periodic run.  This is a standard axiom of fixing races with lockless (or lock minimization) techniques... it doesn't matter whether you capture the snapshot before or after some arbitrary action, it only matters that you capture and operate on a consistent snapshot.  If updates are happening to the live database while you run the validation on the snapshot, those updates will be validated in the run the next day, and that situation is exactly identical to the situation in which running with exclusionary locks prevented the update from happening until after the validation run completes.

These are not new problems, and the techniques for solving them are not new either.  I first ran into these problems and the snapshot-based solutions to them in the 1970s.

The main downside to such techniques is that it requires copying the data to be validated, and sometimes the size of that data makes that too expensive of an operation.  That could be the case here, I have no idea how much data would have to be copied in this instance.
Comment 12 Alex Kozlov freebsd_committer freebsd_triage 2018-11-15 16:04:21 UTC
pkg-checksum reads almost the entire $PREFIX, so if you don't want files to disappear or change checksum during validation, you need to copy the $(pkg info -la) files to /tmp.
It's easier with the pkgdb, which is just a few megabytes, but if e.g. pkg-audit works on a stale pkgdb, there is a possibility that you install a vulnerable port and only find out about it after the next periodic run. Granted, the chances of this are not very high, but IMO it's much easier and safer to move the pkg-* periodic scripts to another time or build ports in poudriere.
Comment 13 Ian Lepore freebsd_committer freebsd_triage 2018-11-15 16:10:09 UTC
(In reply to Alex Kozlov from comment #12)

> but if e.g. pkg-audit works on a stale pkgdb, there is a possibility that you
> install a vulnerable port and only find out about it after the next periodic run

You seem to have missed the most important point of my comment:  You CANNOT prevent that from happening with ANY technique or algorithm. If the port build/install completes before validation begins, it gets validated tonight.  If validation begins first and locks out the installation of the port while validation is running, it gets validated tomorrow night.  That's the exact same situation as validating against a snapshot.

"Use poudriere" may (or often may not) be good advice for any given user and situation, but IT IS NOT A SOLUTION.
Comment 14 Alex Kozlov freebsd_committer freebsd_triage 2018-11-15 16:33:31 UTC
> You CANNOT prevent that from happening with ANY technique or algorithm.
You cannot, but that does not mean you have to increase the duration of the vulnerability window.

> "Use poudriere" may (or often may not) be good advice for any given user and
> situation, but IT IS NOT A SOLUTION.
The solution is to improve db locking in at least pkg-check; all other suggestions are workarounds. And IMO poudriere is a better one than your idea of copying $PREFIX to /tmp.
Comment 15 Walter Schwarzenfeld freebsd_triage 2018-11-15 16:50:13 UTC
> have to increase the duration of the vulnerability window

The duration of the vuln window is always 24 hours.
Comment 16 Walter Schwarzenfeld freebsd_triage 2018-11-15 16:55:50 UTC
The simplest solution is
put in /etc/periodic.conf security_status_pkgchecksum_enable=YES|security_status_pkgchecksum_enable=NO

and at the end of your update-script
pkg check -s
Comment 17 Walter Schwarzenfeld freebsd_triage 2018-11-15 16:56:44 UTC
paste error:
should be
security_status_pkgchecksum_enable=NO
Comment 18 Alex Kozlov freebsd_committer freebsd_triage 2018-11-15 17:15:59 UTC
Sigh, I guess easiest workaround would be:
security_status_pkgaudit_enable=NO
security_status_pkgchecksum_enable=NO in periodic.conf
then
${PREFIX}/etc/periodic/security/410.pkg-audit in the beginning of port building
and
${PREFIX}/etc/periodic/security/460.pkg-checksum at the end
Oh, and LOCK_WAIT=100 in pkg.conf for pkg-backup and all.
Comment 19 Walter Schwarzenfeld freebsd_triage 2018-11-15 18:15:24 UTC
Additionally, it needs a check for whether pkg check is already running.
Comment 20 Walter Schwarzenfeld freebsd_triage 2018-11-15 18:20:57 UTC
No, not needed if it is set to NO.
Comment 21 Bob Frazier 2018-11-15 21:20:12 UTC
Having entries in pkg.conf is useful, but can they be automatically added/removed by the ports build system?  Having to manually fiddle with this over-complicates the ports build process.

And, it seems to me that building/installing poudriere on an embedded device (like Raspberry Pi) is probably not a good solution for simply building from ports on the same device.  In my opinion, that "breaks ports".


If I had unlimited bandwidth, unlimited CPU, unlimited RAM, and unlimited disk access speed/space, and unlimited time to read all of the documentation and get familiar with how to set it up correctly, then certainly 'install poudriere' would be a simple and adequate solution.  But for older and embedded hardware, and generally busy people, not so much.  I'd rather just go to the ports directory and type 'make install', thanks, and have the system work as intended.


Another possibility I suggested earlier is for pkg to have "a file someplace", in /var/run, in /tmp, doesn't really matter where all that much, but one that's locked in a way that allows for simultaneous locking by other processes, and prevents the periodic process from locking the entire pkg database.

Non-exclusive locks seem to be one of the best ways to make this work, with an exclusive lock on the same file from periodic processes.  It's a method I've used before for concurrency of this kind.

In a makefile context it may be possible to start a daemon process that tracks the PID of the parent process as part of the make environment.  When the parent process stops (complete or error), the daemon would exit.  The daemon would then keep the non-exclusive lock on the file.  If the 'pkg' utility were modified to do the locking, it could be something like 'pkg --lock' to do a non-exclusive lock (and run the daemon), and 'pkg --xlock' to do an exclusive one, with this command being called for the appropriate context [the difference would be for building 'pkg' itself, which would have to just work without doing this step].  The 'pkg' utility would return an error code if the lock fails, and whatever process would check for this and act appropriately.
Comment 22 Tatsuki Makino 2018-11-18 06:13:05 UTC
Below is a slightly hacky way for portmaster users.

1. Add the following shell script fragment to /etc/periodic.conf.

if ! lockf -s -t 0 /var/run/portmaster true ; then
  security_status_pkgchecksum_enable="NO"
fi

2. Run portmaster like "lockf /var/run/portmaster portmaster ...".
Comment 23 Kubilay Kocak freebsd_committer freebsd_triage 2019-07-28 11:55:23 UTC
*** Bug 239488 has been marked as a duplicate of this bug. ***
Comment 24 Kubilay Kocak freebsd_committer freebsd_triage 2019-07-28 12:03:48 UTC
This issue, as reported, is a base (periodic script) issue. Re-classify accordingly, keeping pkg mailing list on CC.

If changes are required, or determined to be desirable, in pkg to improve locking behaviour (if that's possible) to complement any changes or improvements to periodic scripts, then a separate, new "depends on" bug can be created to track those requested changes in pkg.
Comment 25 Tatsuki Makino 2019-11-26 05:18:01 UTC
pkg-1.12.0 sets busy_timeout to 5 seconds in libpkg/pkgdb.c.
Is it too short or not applied to all database handles?
Comment 26 Tatsuki Makino 2019-11-27 23:48:50 UTC
Should we adjust LOCK_RETRIES and LOCK_WAIT in /usr/local/etc/pkg.conf?

LOCK_WAIT = 10
LOCK_RETRIES = 500000
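
To put these numbers next to the ones from comment 7, here is a rough upper bound on how long pkg would keep waiting for the lock. It assumes pkg sleeps LOCK_WAIT seconds between each of LOCK_RETRIES attempts, which is how pkg.conf(5) reads to me but which I have not checked against the source; lock_wait_budget is just a hypothetical helper name.

```shell
#!/bin/sh
# lock_wait_budget WAIT RETRIES
# Print WAIT * RETRIES, the approximate worst-case number of seconds spent
# waiting for the lock, ASSUMING one sleep of WAIT seconds per retry.
lock_wait_budget() {
    echo $(( $1 * $2 ))
}

# Examples (not run automatically):
#   lock_wait_budget 100 100      prints 10000    (a bit under 3 hours)
#   lock_wait_budget 10 500000    prints 5000000  (roughly 58 days)
```

Under that assumption, the values above amount to "wait practically forever", while the values from comment 7 give up after a few hours.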
Comment 27 Tatsuki Makino 2020-01-10 05:16:41 UTC
By the way, /var/db/pkg/local.sqlite has a table named pkg_lock.
If another process is using "BEGIN", will that table be unreachable?
sqlite3 locks all tables in the file, doesn't it?
Comment 28 Matthew Seaman freebsd_committer freebsd_triage 2020-01-10 06:46:13 UTC
(In reply to Tatsuki Makino from comment #27)

The pkg_lock table is used by pkg-lock(8), not for the database locking used as a part of the routine operations of pkg(8) as a whole.

The sqlite3 locking used by pkg(8) is a global lock over the whole database -- sqlite3 is not a multiuser database and has essentially no support for concurrent access or MVCC.  If one instance of pkg(8) is active it will hold either a write lock -- giving it exclusive access -- or a read lock -- which could be shared with other consumers, but all are blocked from writing or obtaining a write lock themselves.  That's always going to be a potential cause of problems whenever multiple invocations of pkg(8) are run simultaneously.

This is unfortunately intrinsic to the use of sqlite3.  Possible alternatives might include much heavier-weight RDBMSes like mysql or postgresql, which would solve the concurrency problem but don't really seem practical for general use.
Comment 29 Tatsuki Makino 2020-01-10 07:15:02 UTC
(In reply to Matthew Seaman from comment #28)

Thank you for answering my sqlite3 questions.

However, I don't think the pkg_lock table is for pkg-lock.
I have locked seamonkey-2.49.4_27, but when I run sqlite3 /var/db/pkg/local.sqlite .dump , I only get the following:

CREATE TABLE pkg_lock (exclusive INTEGER(1),advisory INTEGER(1),read INTEGER(8));
INSERT INTO pkg_lock VALUES(0,0,0);

And there is a column for the locked flag in the packages table.

CREATE TABLE packages (id INTEGER PRIMARY KEY,origin TEXT NOT NULL,name TEXT NOT NULL,version TEXT NOT NULL,comment TEXT NOT NULL,desc TEXT NOT NULL,mtree_id INTEGER REFERENCES mtree(id) ON DELETE RESTRICT ON UPDATE CASCADE,message TEXT,arch TEXT NOT NULL,maintainer TEXT NOT NULL, www TEXT,prefix TEXT NOT NULL,flatsize INTEGER NOT NULL,automatic INTEGER NOT NULL,locked INTEGER NOT NULL DEFAULT 0, ...Omitted

The column names in the pkg_lock table seem to allow simultaneous invocation...