Created attachment 201915: Diff against /usr/src/libexec/atrun directory. I have no idea why this hasn't bitten people before, or isn't biting people now, but it is biting me. /usr/libexec/atrun is the "batch" job executor; it is run from cron, by default every 5 minutes. The code has an unlink call that removes jobs from the queue once they have been started. Unfortunately the queue code can select a job to run and call fork() to start it, and after the fork() the child can give up the CPU before it opens the file containing the job. The queue code (which is in the parent) can then execute the unlink before the child process gets the file open. If this happens you get a "file not found" error in the cron log and the job doesn't run. The attached patch fixes the potential race by moving the unlink into the child; it may not be the most elegant fix, but it works. Unfortunately, due to the code's structure (it performs multiple tests on the file to be run for security reasons) there are multiple error exits, and on any of those you must unlink the file as well or the job will be retried on every run -- yet you can't unlink it immediately after it is opened, because some of the tests require that it still be on the filesystem.
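For anyone who wants the shape of the fix without reading the diff, here is a minimal sketch of the "unlink in the child" ordering. It is not the attached patch and not the real atrun source: `run_job` and `child_open_check_unlink` are hypothetical names, and the lstat()/fstat() comparison merely stands in for the kind of test that needs the file to still be on the filesystem.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (illustration only): called in the child.
 * Opens the job file, runs a check that needs the name on disk,
 * and unlinks on every path once the file has been opened.
 */
static int
child_open_check_unlink(const char *path)
{
	struct stat lst, fst;
	int fd;

	fd = open(path, O_RDONLY);
	if (fd == -1)
		return (-1);	/* nothing was opened, nothing to clean up */

	/* Stand-in check: the name must still refer to the file we opened. */
	if (lstat(path, &lst) == -1 || fstat(fd, &fst) == -1 ||
	    !S_ISREG(lst.st_mode) ||
	    lst.st_dev != fst.st_dev || lst.st_ino != fst.st_ino) {
		(void)unlink(path);	/* error exit: remove the job anyway */
		(void)close(fd);
		return (-1);
	}

	(void)unlink(path);	/* safe: this process already holds fd */
	return (fd);
}

/*
 * Hypothetical driver: the parent forks and never unlinks.
 * Returns the child's exit code, or -1 on fork/wait failure.
 */
int
run_job(const char *path)
{
	pid_t pid;
	int fd, status;

	assert(path != NULL);
	pid = fork();
	if (pid == -1)
		return (-1);
	if (pid == 0) {
		fd = child_open_check_unlink(path);
		if (fd == -1)
			_exit(1);
		/* A real executor would run the job from fd here. */
		(void)close(fd);
		_exit(0);
	}
	if (waitpid(pid, &status, 0) == -1)
		return (-1);
	return (WIFEXITED(status) ? WEXITSTATUS(status) : -1);
}
```

Because the unlink only ever happens in the process that has already opened the file, the parent can no longer remove the job before the child gets to it, regardless of how the scheduler orders the two after fork().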
This is still present in 12.2-STABLE as of today... updated a multi-core system and had it repeatedly drop submitted "batch" jobs due to it.
This is still "active" as of 12.2 as a problem; updated the version impacted to reflect that fact.
In my usage this is not an intermittent irritation on Stable-13; the "race" is hit every time. It would be nice to see this fix pulled into the current releases. It has been broken since Stable-12.
(In reply to Dave Baukus from comment #3) It has been "nearly every time" for me since I went to 12.2 on one specific box, but is intermittent on others. I suspect it has to do with how the scheduler interacts with the various cores in specific configurations. Nonetheless I agree with you -- this needs to be fixed, since when it happens the job disappears without a trace.
In case anyone is looking at this now... I had a problem with at jobs disappearing and traced it down to a conflict between my old /etc/crontab and the new /etc/cron.d/at way of doing things. Both files had a schedule entry for atrun, and I suspect cron was running it twice, which was causing my problem. When I removed the old atrun line from /etc/crontab, the at service began returning expected results. This problem began for me when I upgraded my system from 10.x-RELEASE to 12.x-RELEASE.
At some point in more recent builds this has disappeared; on 12.3 I can no longer reproduce it even without the patch applied, so I am closing this. My system did not have the "extra" (old) entry that bitbucket noted.