Bug 192490 - [build] race condition with multiple instances of cleandir in subdirectories; results in failure like "rm: fts_read: No such file or directory"
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: conf
Version: CURRENT
Hardware: Any Any
Importance: --- Affects Some People
Assignee: Ian Lepore
URL:
Keywords:
Duplicates: 193558
Depends on:
Blocks:
 
Reported: 2014-08-07 20:50 UTC by Enji Cooper
Modified: 2015-10-24 23:46 UTC (History)

See Also:


Attachments
Build log (205.82 KB, application/x-gzip)
2014-08-07 20:53 UTC, Enji Cooper

Description Enji Cooper freebsd_committer freebsd_triage 2014-08-07 20:50:06 UTC
I've seen builds (mostly on my VMware Fusion VM running FreeBSD CURRENT) fail spuriously with an error like the following (from http://kyua3.nyi.freebsd.org/head/data/0-LATEST/output.log):

===> usr.sbin/lpr (cleandir)
--- cleandir_subdir_lastlogin ---
--- cleanobj ---
--- usr.bin.cleandir__D ---
--- cleandir_subdir_limits ---
===> usr.bin/limits (cleandir)
--- cleandir_subdir_lex ---
--- cleanobj ---
--- usr.sbin.cleandir__D ---
--- cleandir_subdir_lmcconfig ---
--- cleanobj ---
--- usr.bin.cleandir__D ---
--- _sub.cleandir ---
===> usr.bin/lex/lib (cleandir)
--- usr.sbin.cleandir__D ---
--- cleandir_subdir_lpr ---
--- _sub.cleandir ---
===> usr.sbin/lpr/common_source (cleandir)
--- usr.bin.cleandir__D ---
--- cleandir_subdir_limits ---
--- cleanobj ---
--- usr.sbin.cleandir__D ---
--- cleandir_subdir_lptcontrol ---
===> usr.sbin/lptcontrol (cleandir)
--- usr.bin.cleandir__D ---
--- cleandir_subdir_lex ---
--- cleanobj ---
--- usr.sbin.cleandir__D ---
--- cleandir_subdir_mailstats ---
===> usr.sbin/mailstats (cleandir)
--- usr.bin.cleandir__D ---
--- cleanobj ---
rm: fts_read: No such file or directory
*** [cleanobj] Error code 1

The error shown makes sense as there are 3 instances of usr.bin.cleandir__D being run in parallel instead of one instance.

I've worked around this issue in the past by serializing the removal of ${MAKEOBJDIRPREFIX} in Makefile.inc1 when NO_CLEAN is not set. That isn't necessarily optimal, since rm -Rf /usr/obj is an O(n) operation in a single process, but it works 100% of the time.
Comment 1 Enji Cooper freebsd_committer freebsd_triage 2014-08-07 20:53:56 UTC
Created attachment 145486 [details]
Build log

From http://kyua3.nyi.freebsd.org/head/data/0-LATEST/output.log
Comment 2 Enji Cooper freebsd_committer freebsd_triage 2014-09-02 23:11:08 UTC
This may have already been addressed by imp@ in https://svnweb.freebsd.org/base?view=revision&revision=268376 . I can't verify what version of FreeBSD the jenkins server or kyua servers are running though, so I don't know if this bug happened before or after the enhancement was made to rm to ignore fts_read errors with rm -f.
Comment 3 Warner Losh freebsd_committer freebsd_triage 2014-09-02 23:16:19 UTC
Already fixed, as Garrett states.
Comment 4 Enji Cooper freebsd_committer freebsd_triage 2014-09-02 23:21:27 UTC
The fix hasn't been MFCed to stable/10 or stable/9 though... should it?
Comment 5 Warner Losh freebsd_committer freebsd_triage 2014-09-02 23:30:13 UTC
Yes. It should.
Comment 6 Enji Cooper freebsd_committer freebsd_triage 2014-10-02 06:20:52 UTC
*** Bug 193558 has been marked as a duplicate of this bug. ***
Comment 7 Glen Barber freebsd_committer freebsd_triage 2014-10-02 06:24:19 UTC
r272372
Comment 8 Enji Cooper freebsd_committer freebsd_triage 2014-10-04 04:44:22 UTC
Reopening the bug because in retrospect the underlying issue has not been resolved -- it has been worked around with r268376.

There are actually two bugs:
1. (Cause) rm -Rf is being run on a path and a subdirectory concurrently.
2. (Effect) rm -Rf is failing because fts_read isn't properly filtering out certain errors like EACCES, ENOENT and EPERM.

imp@ resolved 2, but the commit introduced other issues: according to the rm(1) manpage, it now ignores all errors from fts_*.
ian@ is looking into 1.
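
A minimal shell sketch of effect 2, with the race replaced by a fixed interleaving so that it is repeatable. The scratch tree and the file names (a.o, b.o) are made up for illustration, and a plain ls snapshot stands in for the fts(3) traversal inside rm, so this only approximates what rm does internally. It shows that acting on a stale listing fails without -f and succeeds with -f:

```sh
#!/bin/sh
# Hypothetical scratch tree; all names here are illustrative.
root=$(mktemp -d)
dir="$root/obj/lib"

# Remove every entry of a stale snapshot of $dir using the given rm flags.
# Returns 1 if any removal reported an error, 0 otherwise.
stale_walk() {
    flags=$1
    mkdir -p "$dir"
    touch "$dir/a.o" "$dir/b.o"
    snapshot=$(ls "$dir")        # walker A lists the directory
    rm -f "$dir/a.o"             # walker B deletes an entry behind its back
    status=0
    for f in $snapshot; do       # walker A acts on the stale listing
        rm $flags "$dir/$f" 2>/dev/null || status=1
    done
    return $status
}

stale_walk ""   && strict_status=0 || strict_status=1
stale_walk "-f" && force_status=0  || force_status=1

echo "without -f: $strict_status, with -f: $force_status"
rm -rf "$root"
```

The first call reports a failure because a.o is already gone when the walker reaches it; the second call tolerates the missing entry. This covers only the missing-operand case, not the wider set of errors mentioned above.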
Comment 9 Ian Lepore freebsd_committer freebsd_triage 2014-10-04 13:40:39 UTC
I've been digging into the actual build system problem, and I'm starting to think that all the reported failures that contain enough of the log to be useful show that the build failed in a directory that has subdirectories. That is, one of the failures appeared to be caused by rm -rf running concurrently in usr.bin/lex and usr.bin/lex/lib. Another failure involves modules/aic7xxx and modules/aic7xxx/ahc. In another log it appeared that ata/atapci/chipsets was being deleted simultaneously with ata/atapci/chipsets/ataacard and several other subdirs under chipsets/.

I didn't see any evidence that the exact same path was being deleted more than once at the same time. That is, there were no duplicated entries in SUBDIR lists, and nothing was accidentally processing the entire sys/modules hierarchy twice in parallel through two different parent paths or anything like that.
Comment 10 Warner Losh freebsd_committer freebsd_triage 2014-10-04 20:56:59 UTC
Ian: the problem is that directory foo has a subdir bar, so running rm -rf in both foo and foo/bar at once (especially when there are several bars) is what causes the issue. rm used to report an error when it tried to read an entry that some other agent had already removed. bde thinks this is a standards violation, but I'm not convinced by his reasoning. Adding actual debugging to rm shows that running in both directories is the issue, and that rm failed to properly ignore the ENOENT it got when reading the subdir.

This is caused, I think, by not waiting for all subdirs to finish in clean targets before doing the current directory. Or maybe using -Rf in the first place is the bug, and it masks the failure to wait.

I believe the -rf in question is in bsd.obj.mk, where it deletes the CANONICALOBJDIR, since CLEANDIRS isn't used in the sys/modules tree. It would appear that the for loop in bsd.subdir.mk that has the ${__target}: ${__subdir_targets} dependency isn't strong enough to force the recursion to finish before the current target is run. That might be a fruitful line to investigate. If you fix that, then the -r can likely go away, though we'd need to add a separate rmdir to get the same effect. That would have the added benefit of catching cases where the directory isn't empty due to missing CLEANFILES. Since our build system has opted for the "you must list everything" approach elsewhere, the -rf is a bit of an outlier in the current system.
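
The ordering described above can be sketched in plain shell rather than in make. This is only an illustration of "wait for every subdirectory, then handle the current directory"; it is not the bsd.subdir.mk logic, and the scratch tree and object file names are hypothetical. rmdir is used in place of rm -r so that leftover files would surface as an error:

```sh
#!/bin/sh
# Hypothetical object tree; directory and file names are illustrative.
root=$(mktemp -d)
parent="$root/obj/usr.bin"
mkdir -p "$parent/lex" "$parent/limits" "$parent/lastlogin"
touch "$parent/lex/scan.o" "$parent/limits/limits.o" \
    "$parent/lastlogin/lastlogin.o"

# Clean each subdirectory in its own background job. Every job touches
# only its own directory, so the jobs cannot interfere with each other.
for sub in "$parent"/*/; do
    ( rm -f "$sub"* && rmdir "$sub" ) &
done

# Barrier: the parent is not touched until every child has finished.
wait

# rmdir fails if anything is left behind, for example an unlisted file.
rmdir "$parent"
parent_status=$?
echo "parent rmdir status: $parent_status"

rm -rf "$root"
```

Because the parent never recurses into directories that a child job is still working on, no walker can see an entry vanish underneath it.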

So by all means, look at the build system, but it is my belief that the fix in rm is actually correct per the standard and fixes this bug. A build-system fix would help systems whose rm isn't quite standards compliant.
Comment 11 Warner Losh freebsd_committer freebsd_triage 2014-10-04 21:23:52 UTC
P.S. If you want to paper over it in a different way, you could add a '-' after the @ in the rm -rf ${CANONICALOBJDIR} line. That would cause errors to just be ignored.
Comment 12 Enji Cooper freebsd_committer freebsd_triage 2015-10-24 23:46:28 UTC
Closing bug.