Bug 223348 - Inode number for directories change to inode number of directory two levels up causing hierarchy loop
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: gnu
Version: CURRENT
Hardware: Any Any
Importance: --- Affects Only Me
Assignee: freebsd-fs (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-10-31 18:58 UTC by rz-rpi03
Modified: 2017-11-01 12:34 UTC
CC List: 2 users

See Also:


Description rz-rpi03 2017-10-31 18:58:04 UTC
Hi,

I ran into this bug on two different systems.
- One uses an r324861 amd64 GENERIC-NODEBUG kernel and userland with ZFS,
- the other uses an r324918 arm RPI-B kernel and userland with UFS2.

The directory hierarchy loop occurs on both systems, so I think it is not related to the underlying filesystem.

After installing freebsd-uucp from ports and configuring it, there exists a new directory tree /var/spool/uucp/lyxys/[DX].
At that moment the content of lyxys/D. and lyxys/X. looked as expected.
But some time later, both looked to ls(1) like a copy of /var/spool/uucp.
And indeed ls(1) showed the same inode number for all three directories:

63747 drwxrwxr-x  10 uucp  uucp  512 Feb  7  2017 /var/spool/uucp
95493 drwxr-xr-x   4 uucp  uucp  512 Oct 24 23:53 /var/spool/uucp/lyxys
63747 drwxrwxr-x  10 uucp  uucp  512 Feb  7  2017 /var/spool/uucp/lyxys/D.
63747 drwxrwxr-x  10 uucp  uucp  512 Feb  7  2017 /var/spool/uucp/lyxys/X.

fsdb(8) showed me different inode numbers for the D. and X. directories than ls(1) did.
But this may be a mistake on my part, as it was my first use of fsdb.

fsck_ufs does not report any errors or try to fix anything.
As ZFS does not have a fsck, I tried "zpool scrub", which also did not report any errors.
So the internal structures of both filesystems seem to be OK.
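
For reference, the ZFS side of this check was along these lines (the pool name "tank" is a placeholder):

  # scrub the pool and inspect the result; pool name is a placeholder
  zpool scrub tank
  zpool status tank   # after the scrub finishes: "errors: No known data errors"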


Steps tried to reproduce the problem:

With ZFS all I can do is roll back to an earlier snapshot without the /var/spool/uucp directory and start again.
As it is easier for me to experiment with the arm system, I used the Raspberry Pi for the following tests.

Trying to get closer to the moment this happens, I set up a date(1)/ls(1) loop on a freshly set up[1] UFS.
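
The loop was essentially the following (a sketch: the hourly interval matches the timestamps below, the log file name is made up):

  # print a timestamp and the inode numbers once per hour
  while true; do
      date
      ls -lid /var /var/spool/ /var/spool/uucp \
          /var/spool/uucp/lyxys /var/spool/uucp/lyxys/D. /var/spool/uucp/lyxys/X.
      sleep 3600
  done >> /root/inode-watch.log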

It showed that the inode change in the directory information happened sometime between 02:57 and 03:57 CEST:

Wed Oct 25 02:57:07 CEST 2017
 95237 drwxr-xr-x  25 root  wheel  512 Oct 24 23:38 /var
102372 drwxr-xr-x  11 root  wheel  512 Mar  7  2017 /var/spool/
102701 drwxrwxr-x  10 uucp  uucp   512 Feb  7  2017 /var/spool/uucp
102711 drwxr-xr-x   4 uucp  uucp   512 Oct 24 23:53 /var/spool/uucp/lyxys
102684 drwxr-xr-x   2 uucp  uucp   512 Oct 24 23:53 /var/spool/uucp/lyxys/D.
102685 drwxr-xr-x   2 uucp  uucp   512 Oct 24 23:53 /var/spool/uucp/lyxys/X.
Wed Oct 25 03:57:08 CEST 2017
 95237 drwxr-xr-x  25 root  wheel  512 Oct 24 23:38 /var
102372 drwxr-xr-x  11 root  wheel  512 Mar  7  2017 /var/spool/
102701 drwxrwxr-x  10 uucp  uucp   512 Feb  7  2017 /var/spool/uucp
102711 drwxr-xr-x   4 uucp  uucp   512 Oct 24 23:53 /var/spool/uucp/lyxys
102701 drwxrwxr-x  10 uucp  uucp   512 Feb  7  2017 /var/spool/uucp/lyxys/D.
102701 drwxrwxr-x  10 uucp  uucp   512 Feb  7  2017 /var/spool/uucp/lyxys/X.

As the system was not in active use, the only difference between the hours before and after the change was cron running periodic daily at 03:01:00.

In the next step I looked for a faster way to set up the UFS for testing.
First I ran "dump -0a -f /usr/local/root_dump /" on the arm, then moved
the SD card to the amd64 and with
  newfs -UEt /dev/da0s2a
  mount /dev/da0s2a /mnt
  mount -r /dev/da0s3a /tmp/da0s3a/
  cd /mnt
  restore rf /tmp/da0s3a/root_dump
  rm restoresymtable
restored the filesystem. The output of ls(1) showed everything was OK:

# ls -lid var/spool/uucp var/spool/uucp/lyxys var/spool/uucp/lyxys/*
95491 drwxrwxr-x  10 uucp  uucp  512 Feb  7  2017 var/spool/uucp
  261 drwxr-xr-x   4 uucp  uucp  512 Oct 24 23:53 var/spool/uucp/lyxys
32005 drwxr-xr-x   2 uucp  uucp  512 Oct 24 23:53 var/spool/uucp/lyxys/D.
63746 drwxr-xr-x   2 uucp  uucp  512 Oct 24 23:53 var/spool/uucp/lyxys/X.
# 

But after booting this filesystem on the arm:
# cd /; ls -lid var/spool/uucp var/spool/uucp/lyxys var/spool/uucp/lyxys/*
95491 drwxrwxr-x  10 uucp  uucp  512 Feb  7  2017 var/spool/uucp
  261 drwxr-xr-x   4 uucp  uucp  512 Oct 24 23:53 var/spool/uucp/lyxys
95491 drwxrwxr-x  10 uucp  uucp  512 Feb  7  2017 var/spool/uucp/lyxys/D.
95491 drwxrwxr-x  10 uucp  uucp  512 Feb  7  2017 var/spool/uucp/lyxys/X.
# 

Mounted again on the amd64, it showed the same output.
Next I tried just restoring, unmounting, and remounting the UFS on the amd64, and it looks like the change happens immediately with the unmount:

# rm restoresymtable 
# ls -lid var/spool/uucp var/spool/uucp/lyxys var/spool/uucp/lyxys/*
63747 drwxrwxr-x  10 uucp  uucp  512 Feb  7  2017 var/spool/uucp
95493 drwxr-xr-x   4 uucp  uucp  512 Oct 24 23:53 var/spool/uucp/lyxys
  261 drwxr-xr-x   2 uucp  uucp  512 Oct 24 23:53 var/spool/uucp/lyxys/D.
32002 drwxr-xr-x   2 uucp  uucp  512 Oct 24 23:53 var/spool/uucp/lyxys/X.
# cd; umount /mnt; mount /dev/da0s2a /mnt; cd /mnt
# ls -lid var/spool/uucp var/spool/uucp/lyxys var/spool/uucp/lyxys/*
63747 drwxrwxr-x  10 uucp  uucp  512 Feb  7  2017 var/spool/uucp
95493 drwxr-xr-x   4 uucp  uucp  512 Oct 24 23:53 var/spool/uucp/lyxys
63747 drwxrwxr-x  10 uucp  uucp  512 Feb  7  2017 var/spool/uucp/lyxys/D.
63747 drwxrwxr-x  10 uucp  uucp  512 Feb  7  2017 var/spool/uucp/lyxys/X.
# 

So I have two ways of repeating this behaviour (condensed below):
- Waiting until periodic daily triggers it somehow, or calling it immediately.
- Restoring the filesystem from dump(8) and unmounting it.
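
Condensed, the two reproductions look like this (a sketch; device names as in the session above):

  # way 1: run periodic daily by hand instead of waiting for cron
  periodic daily
  ls -lid /var/spool/uucp /var/spool/uucp/lyxys /var/spool/uucp/lyxys/*

  # way 2: restore the dump, then unmount and remount
  newfs -UEt /dev/da0s2a
  mount /dev/da0s2a /mnt
  mount -r /dev/da0s3a /tmp/da0s3a/
  (cd /mnt && restore rf /tmp/da0s3a/root_dump && rm restoresymtable)
  umount /mnt && mount /dev/da0s2a /mnt
  ls -lid /mnt/var/spool/uucp /mnt/var/spool/uucp/lyxys /mnt/var/spool/uucp/lyxys/*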


Just to complete the information and to demonstrate the loop, this is what the ls(1) output looks like without -d:

# ls -li var/spool/uucp var/spool/uucp/lyxys var/spool/uucp/lyxys/*
var/spool/uucp:
total 44
95491 drwxr-xr-x  3 uucp  uucp   512 Sep  4  2013 .Failed
  259 drwxrwxr-x  2 uucp  uucp   512 Feb  7  2017 .Preserve
63748 drwxr-xr-x  2 uucp  uucp   512 Sep  4  2013 .Received
95492 drwxrwxr-x  2 uucp  uucp   512 Feb  7  2017 .Sequence
  260 drwxrwxr-x  2 uucp  uucp   512 Sep  4  2013 .Status
32001 drwxrwxr-x  2 uucp  uucp   512 Feb  7  2017 .Temp
63749 drwxrwxr-x  2 uucp  uucp   512 Feb  7  2017 .Xqtdir
67117 -rw-r--r--  1 uucp  uucp  5959 Nov 24  2014 Log
67118 -rw-r--r--  1 uucp  uucp  3728 Nov 24  2014 Stats
95493 drwxr-xr-x  4 uucp  uucp   512 Oct 24 23:53 lyxys

var/spool/uucp/lyxys:
total 8
63747 drwxrwxr-x  10 uucp  uucp  512 Feb  7  2017 D.
63747 drwxrwxr-x  10 uucp  uucp  512 Feb  7  2017 X.

var/spool/uucp/lyxys/D.:
total 44
95491 drwxr-xr-x  3 uucp  uucp   512 Sep  4  2013 .Failed
  259 drwxrwxr-x  2 uucp  uucp   512 Feb  7  2017 .Preserve
63748 drwxr-xr-x  2 uucp  uucp   512 Sep  4  2013 .Received
95492 drwxrwxr-x  2 uucp  uucp   512 Feb  7  2017 .Sequence
  260 drwxrwxr-x  2 uucp  uucp   512 Sep  4  2013 .Status
32001 drwxrwxr-x  2 uucp  uucp   512 Feb  7  2017 .Temp
63749 drwxrwxr-x  2 uucp  uucp   512 Feb  7  2017 .Xqtdir
67117 -rw-r--r--  1 uucp  uucp  5959 Nov 24  2014 Log
67118 -rw-r--r--  1 uucp  uucp  3728 Nov 24  2014 Stats
95493 drwxr-xr-x  4 uucp  uucp   512 Oct 24 23:53 lyxys

var/spool/uucp/lyxys/X.:
total 44
95491 drwxr-xr-x  3 uucp  uucp   512 Sep  4  2013 .Failed
  259 drwxrwxr-x  2 uucp  uucp   512 Feb  7  2017 .Preserve
63748 drwxr-xr-x  2 uucp  uucp   512 Sep  4  2013 .Received
95492 drwxrwxr-x  2 uucp  uucp   512 Feb  7  2017 .Sequence
  260 drwxrwxr-x  2 uucp  uucp   512 Sep  4  2013 .Status
32001 drwxrwxr-x  2 uucp  uucp   512 Feb  7  2017 .Temp
63749 drwxrwxr-x  2 uucp  uucp   512 Feb  7  2017 .Xqtdir
67117 -rw-r--r--  1 uucp  uucp  5959 Nov 24  2014 Log
67118 -rw-r--r--  1 uucp  uucp  3728 Nov 24  2014 Stats
95493 drwxr-xr-x  4 uucp  uucp   512 Oct 24 23:53 lyxys
# 

A friend pointed me to the 64-bit inode change as a possible cause, but as I do not have an old enough backup, I cannot easily verify his idea.


Ralf

[1] newfs, rsync from backup without /var/spool/uucp, port reinstalled, configured.
Comment 1 Andriy Gapon freebsd_committer 2017-10-31 19:41:04 UTC
Another hypothesis is that the name cache may have something to do with it.
Comment 2 Andriy Gapon freebsd_committer 2017-11-01 07:46:17 UTC
(In reply to Andriy Gapon from comment #1)
To expand on my previous comment: your report is the first of its kind, and the affected directories have rather peculiar names of the form "?.".  So my suspicion is that some generic lookup or cache code may have a flaw that misinterprets those names as "..".
Comment 3 Andriy Gapon freebsd_committer 2017-11-01 07:53:50 UTC
And when you know what you are looking for:

static __noinline int
cache_lookup_nomakeentry(struct vnode *dvp, struct vnode **vpp,
    struct componentname *cnp, struct timespec *tsp, int *ticksp)
{
        struct namecache *ncp;
        struct rwlock *blp;
        struct mtx *dvlp, *dvlp2;
        uint32_t hash;
        int error;

        if (cnp->cn_namelen == 2 && cnp->cn_nameptr[1] == '.') {

There is a similar condition in cache_lookup() as well.

I do not see any code to pre-check that cnp->cn_nameptr[0] == '.'

Seems like bad refactoring in r324378.
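
In shell-pattern terms, the committed check matches any two-character name ending in '.', not just ".." (a quick illustration, not kernel code):

  # every name matched by the shell pattern "?." is taken for ".." by the flawed check
  for name in .. D. X. ab a.b; do
      case "$name" in
      ?.) echo "$name: treated as dotdot" ;;
      *)  echo "$name: normal lookup" ;;
      esac
  done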
Comment 4 Mateusz Guzik freebsd_committer 2017-11-01 07:59:16 UTC
wow, that's a fucking retarded bug if I ever saw one. Sorry for the inconvenience.

So to be clear, does this solve the problem for you?

diff --git a/sys/kern/vfs_cache.c b/sys/kern/vfs_cache.c
index 3da1c97..23c80a9 100644
--- a/sys/kern/vfs_cache.c
+++ b/sys/kern/vfs_cache.c
@@ -1126,7 +1126,8 @@ cache_lookup_nomakeentry(struct vnode *dvp, struct vnode **vpp,
        uint32_t hash;
        int error;
 
-       if (cnp->cn_namelen == 2 && cnp->cn_nameptr[1] == '.') {
+       if (cnp->cn_namelen == 2 &&
+           cnp->cn_nameptr[0] == '.' && cnp->cn_nameptr[1] == '.') {
                counter_u64_add(dotdothits, 1);
                dvlp = VP2VNODELOCK(dvp);
                dvlp2 = NULL;
Comment 5 Mateusz Guzik freebsd_committer 2017-11-01 08:41:16 UTC
The obvious bug is fixed in https://svnweb.freebsd.org/changeset/base/325268
Comment 6 rz-rpi03 2017-11-01 12:34:27 UTC
Sorry for the wait.

I can confirm that base r325268 fixes the bug.
Tested with the faster restore-umount-mount case.
Compiling for the arm will take some more hours, but I do not expect any surprises there.
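
For completeness, the update/rebuild cycle is the usual routine, roughly (a sketch; KERNCONF per machine, GENERIC-NODEBUG or RPI-B as named in the report):

  # sketch of the update/rebuild cycle; revision number from comment #5
  svnlite update -r 325268 /usr/src
  cd /usr/src
  make buildkernel KERNCONF=GENERIC-NODEBUG
  make installkernel KERNCONF=GENERIC-NODEBUG
  shutdown -r now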

So I would like to close the bug as fixed.
Thank you all.

Ralf