Bug 165392 - [ufs] [patch] Multiple mkdir/rmdir fails with errno 31
Summary: [ufs] [patch] Multiple mkdir/rmdir fails with errno 31
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: 9.0-RELEASE
Hardware: Any Any
: Normal Affects Only Me
Assignee: freebsd-bugs mailing list
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-02-22 11:00 UTC by Vsevolod Volkov
Modified: 2017-12-31 22:27 UTC (History)
0 users

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Vsevolod Volkov 2012-02-22 11:00:22 UTC
Multiple sequence of mkdir and rmdir causes mkdir failure with errno 31. Usualy it happens on 32765 iteration.

How-To-Repeat: Compile and execute the following program:

#include <sys/stat.h>
#include <unistd.h>
#include <stdio.h>
#include <errno.h>

int main (void)
{
  int i;
  char dir[100];
  for (i = 0; i < 50000; i++)
  {
    snprintf (dir, sizeof(dir), "empty_dir/%d", i);
    printf ("%s\n", dir);
    if (mkdir (dir, 0700) == -1)
    {
      printf ("mkdir %s: (errno %d)\n", dir, errno);
      break;
    }
    if (rmdir (dir) == -1)
    {
      printf ("rmdir %s: (errno %d)\n", dir, errno);
      break;
    }
  }
  return 0;
}

gcc -o test1 test1.c
mkdir empty_dir
./test1
Comment 1 Andriy Gapon freebsd_committer 2012-02-22 21:23:15 UTC
Could you please provide full details of the tested filesystem?
-- 
Andriy Gapon
Comment 2 Eitan Adler freebsd_committer freebsd_triage 2012-02-23 05:05:19 UTC
Responsible Changed
From-To: freebsd-bugs->eadler

I'll take it.
Comment 3 Vsevolod Volkov 2012-02-23 07:59:42 UTC
I've tested 2 computers with 9.0-RELEASE (amd64 and i386). Filesystems
are UFS2 with soft updates:

tunefs: POSIX.1e ACLs: (-a)                                disabled
tunefs: NFSv4 ACLs: (-N)                                   disabled
tunefs: MAC multilabel: (-l)                               disabled
tunefs: soft updates: (-n)                                 enabled
tunefs: soft update journaling: (-j)                       disabled
tunefs: gjournal: (-J)                                     disabled
tunefs: trim: (-t)                                         disabled
tunefs: maximum blocks per file in a cylinder group: (-e)  2048
tunefs: average file size: (-f)                            16384
tunefs: average number of files in a directory: (-s)       64
tunefs: minimum percentage of free space: (-m)             8%
tunefs: optimization preference: (-o)                      time
tunefs: volume label: (-L)
Comment 4 Vsevolod Volkov 2012-02-23 08:06:03 UTC
There is no problem with zfs. Test passed with 200000 iterations.
Comment 5 Eitan Adler freebsd_committer freebsd_triage 2012-02-25 15:24:40 UTC
Responsible Changed
From-To: eadler->freebsd-fs

I'm not going to have time to look into this soon enough
Comment 6 Jilles Tjoelker freebsd_committer 2012-02-25 18:27:02 UTC
> [mkdir fails with [EMLINK], but link count < LINK_MAX]

I can reproduce this problem with UFS with soft updates (with or without
journaling).

A reproduction without C programs is:

cd empty_dir
mkdir `jot 32766 1`     # the last one will fail (correctly)
rmdir 1
mkdir a                 # will erroneously fail

The problem appears to be because the previous rmdir has not yet been
fully completed. It is still holding onto the link count until the
directory is written, which may take up to two minutes.

The same problem can occur with other calls that increase the link count
such as link() and rename().

A workaround is to call fsync() on the directory that contained the
deleted entries. It will then release its hold on the link count and
allow mkdir or other calls. If fsync() is only called when [EMLINK] is
returned, the performance impact should not be very bad, although it
still causes more I/O than necessary.

The book "The Design and Implementation of the FreeBSD Operating System"
contains a detailed description of soft updates in section 8.6 Soft
Updates. The subsection "File Removal Requirements for Soft Updates"
appears particularly relevant to this problem.

A possible solution is to check for the problematic situation
(i_effnlink < LINK_MAX && i_nlink >= LINK_MAX) and if so synchronously
write one or more deleted directory entries that pointed to the inode
with the link count problem. After that, i_nlink should be less than
LINK_MAX and the link count can be checked again (depending on whether
locks need to be dropped to do the write, it may or may not be possible
for another thread to use up the last link first).

For mkdir() and rename(), the directory that contains the deleted
entries is obvious (the directory that will contain the new directory)
while for link() it can (in the general case) only be found in soft
updates data structures. Soft updates must track this because (if the
link count became 0) it will not clear the inode before all directory
entries that pointed to it have been written.

Simply replacing the i_nlink < LINK_MAX check with i_effnlink < LINK_MAX
is unsafe because it will lead to overflow of the 16-bit signed i_nlink
field. If the field is made larger, I don't see how it is prevented that
the code commits such a set of changes that an inode on disk has more
than LINK_MAX links for some time (for example if a file in the new
directory is fsynced while the old directory entries are still on the
disk).

-- 
Jilles Tjoelker
Comment 7 Jaakko Heinonen freebsd_committer 2013-05-20 20:21:34 UTC
Hi!

>  A workaround is to call fsync() on the directory that contained the
>  deleted entries. It will then release its hold on the link count and
>  allow mkdir or other calls. If fsync() is only called when [EMLINK] is
>  returned, the performance impact should not be very bad, although it
>  still causes more I/O than necessary.

I tried to implement this with the following patch:

	http://people.freebsd.org/~jh/patches/ufs-check_linkcnt.diff

However, VOP_FSYNC(9) with the MNT_WAIT flag seems not to update the
i_nlink count for a reason unknown to me. I can verify that also by
taking your reproduction recipe above and adding "fsync ." between
"rmdir 1" and "mkdir a".

Does this mean that fsync(2) is broken for directories on softdep
enabled UFS?

I have cc'd Kirk in hope he could shed some light on this.

-- 
Jaakko
Comment 8 Jilles Tjoelker freebsd_committer 2013-05-27 17:53:28 UTC
fsync certainly helps but not as effectively as you'd want. Some
combination of sleeps, fsyncs and mkdir attempts appears to be needed. A
shell loop like
  rmdir 8; fsync .; \
  until mkdir h 2>/dev/null; do printf .; fsync .; sleep 1; done
takes two seconds.

However, in
  rmdir 13; mkdir m; fsync .; \
  until mkdir m 2>/dev/null; do printf .; sleep 1; done
the fsync is of no benefit. It is just as slow as omitting it (about
half a minute).

I must have taken long enough to type/recall the commands when I tried
this earlier. In my earlier experiments I gave the commands separately.

> Does this mean that fsync(2) is broken for directories on softdep
> enabled UFS?

I don't think fsync(2) has to sync the exact link count to disk, since
fsck will take care of that. However, it has to sync the timestamps,
permissions and directory entries.

> I have cc'd Kirk in hope he could shed some light on this.

I'm also interested in whether it is safe to call VOP_FSYNC at that
point, especially in the case of a rename where a lock on the source
directory vnode may be held at the same time.

-- 
Jilles Tjoelker
Comment 9 Jaakko Heinonen freebsd_committer 2013-05-29 17:53:11 UTC
On 2013-05-27, Jilles Tjoelker wrote:
> > However, VOP_FSYNC(9) with the MNT_WAIT flag seems not to update the
> > i_nlink count for a reason unknown to me. I can verify that also by
> > taking your reproduction recipe above and adding "fsync ." between
> > "rmdir 1" and "mkdir a".
> 
> fsync certainly helps but not as effectively as you'd want. Some
> combination of sleeps, fsyncs and mkdir attempts appears to be needed.

I have revised the patch and the following version _appears_ to work.

	http://people.freebsd.org/~jh/patches/ufs-check_linkcnt.2.diff

It's still experimental and doesn't handle link(2) or rename(2) at all.

In my testing debug.softdep.linkcnt_retries is increased by one with
your original reproduction recipe.

> I'm also interested in whether it is safe to call VOP_FSYNC at that
> point, especially in the case of a rename where a lock on the source
> directory vnode may be held at the same time.

I think your concern is valid because softdep_fsync() needs to lock
parent directories. Possibly you can work around the problem by
unlocking the vnodes, doing fsync and then restarting rename.
Unfortunately this makes rename even more complex.

-- 
Jaakko
Comment 10 Eitan Adler freebsd_committer freebsd_triage 2017-12-31 07:58:24 UTC
For bugs matching the following criteria:

Status: In Progress Changed: (is less than) 2014-06-01

Reset to default assignee and clear in-progress tags.

Mail being skipped