Bug 235657 - /usr/libexec/atrun race causes missed jobs
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: bin
Version: 12.0-STABLE
Hardware: Any Any
Importance: --- Affects Some People
Assignee: freebsd-bugs (Nobody)
Keywords: patch
Depends on:
Reported: 2019-02-11 05:46 UTC by karl
Modified: 2019-11-23 16:14 UTC
CC List: 3 users

See Also:

Diff against /usr/src/libexec/atrun directory (1.84 KB, patch)
2019-02-11 05:46 UTC, karl

Description karl 2019-02-11 05:46:11 UTC
Created attachment 201915
Diff against /usr/src/libexec/atrun directory

I have no idea why this hasn't bitten people before, or isn't biting people now, but it is biting me.

/usr/libexec/atrun is the "batch" job executor that cron invokes; by default it runs every 5 minutes.

The code has an unlink call that attempts to remove old jobs from the queue. Unfortunately, the queue code can select a job to run and call fork() to start it, and after the fork() the child can give up the CPU before it opens the file containing the job. The queue code, which runs in the parent, can then execute the unlink before the child process has the file open. If this happens you get a "file not found" error in the cron log and the job doesn't run.
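
For illustration only, here is a minimal sketch of that ordering. It is not taken from atrun.c: the helper name child_misses_job() and the pipe used as a gate are assumptions made for the example. The pipe forces the "parent unlinks first" interleaving that the scheduler only produces some of the time, so the failure shows up on every run.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of atrun.
 * Returns 1 if the child's open() failed with ENOENT,
 * 0 if it did not, and -1 if the demonstration could not be set up.
 */
int
child_misses_job(const char *path)
{
	int gate[2];
	int status;
	pid_t pid;
	char token = 'x';

	assert(path != NULL);
	if (pipe(gate) == -1)
		return (-1);

	pid = fork();
	if (pid == -1) {
		close(gate[0]);
		close(gate[1]);
		return (-1);
	}

	if (pid == 0) {
		/* Child: blocking on the pipe stands in for "gave up the CPU". */
		char c;
		int fd;

		close(gate[1]);
		if (read(gate[0], &c, 1) != 1)
			_exit(2);
		fd = open(path, O_RDONLY);
		if (fd == -1 && errno == ENOENT)
			_exit(0);	/* the job file is already gone */
		_exit(1);
	}

	/* Parent: unlink the job file before the child has opened it. */
	close(gate[0]);
	if (unlink(path) == -1 || write(gate[1], &token, 1) != 1) {
		close(gate[1]);		/* child sees EOF and exits with 2 */
		waitpid(pid, &status, 0);
		return (-1);
	}
	close(gate[1]);

	if (waitpid(pid, &status, 0) == -1)
		return (-1);
	return (WIFEXITED(status) && WEXITSTATUS(status) == 0);
}
```

Calling child_misses_job() on an existing file removes the file and reports whether the child lost the race. In the real program nothing enforces either ordering, which would explain why the failure is only intermittent.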

The attached patch fixes the potential race by moving the unlink into the child; it may not be the most elegant fix, but it works. Unfortunately, because of the code's structure (it performs multiple tests on the file to be run for security reasons), there are multiple error exits, and on any of those you must unlink the file as well or atrun will try to run it repeatedly. Yet you can't unlink the file immediately after it is opened, because some of the tests require that it still be on the filesystem.
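
The shape of that fix can be sketched roughly as follows. This is not the attached patch: the function name open_job_then_unlink() and the particular checks (regular file, lstat() and fstat() agreeing on device and inode, a single hard link) are assumptions chosen to show why the path must survive until the checks are done and why every exit path needs the unlink.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper, meant to be called in the child after fork().
 * Opens the job file, runs checks that need the path to still exist,
 * then unlinks the path whether or not the checks passed.
 * Returns an open descriptor on success, -1 on any failure.
 */
int
open_job_then_unlink(const char *path)
{
	struct stat lbuf, buf;
	int fd;
	int ok = 0;

	assert(path != NULL);

	fd = open(path, O_RDONLY);
	if (fd == -1)
		goto out;

	/* These checks compare the path with the open file, so no unlink yet. */
	if (lstat(path, &lbuf) == -1 || fstat(fd, &buf) == -1)
		goto out;
	if (!S_ISREG(lbuf.st_mode))
		goto out;
	if (lbuf.st_dev != buf.st_dev || lbuf.st_ino != buf.st_ino)
		goto out;
	if (buf.st_nlink > 1)
		goto out;
	ok = 1;

out:
	/* Single cleanup point: success and every error exit remove the job. */
	(void)unlink(path);
	if (!ok && fd != -1) {
		close(fd);
		fd = -1;
	}
	return (fd);
}
```

Funnelling every exit through one label is only one way to make sure no error path forgets the unlink; code that exits from several places would instead need the unlink repeated before each exit.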