Bug 204337

Summary: ZFS can have a non-empty directory, but the files don't exist on arm64.
Product: Base System
Component: kern
Status: Closed FIXED
Severity: Affects Some People
Priority: ---
Version: CURRENT
Hardware: arm64
OS: Any
Reporter: Andrew Turner <Andrew>
Assignee: Andrew Turner <Andrew>
CC: avg, emaste, peter
See Also: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=204037
Bug Depends on:
Bug Blocks: 203349

Description Andrew Turner freebsd_committer 2015-11-06 16:55:26 UTC
I can easily get ZFS into a state where a directory is non-empty, but the files within it don't exist. I can reproduce this by extracting base.txz from a weekly snapshot. When I extracted the tarball I got:

root@cavium:/tank/andrew # tar -xpf base.txz -C /tank/andrew/test/
./usr/share/man/man3/remainderl.3.gz: Can't create 'usr/share/man/man3/remainderl.3.gz'
./usr/share/man/man3/jdate.3.gz: Can't create 'usr/share/man/man3/jdate.3.gz'
./usr/share/man/man3/archive_write_add_filter_xz.3.gz: Can't create 'usr/share/man/man3/archive_write_add_filter_xz.3.gz'
./usr/share/man/man3/winnstr.3.gz: Can't create 'usr/share/man/man3/winnstr.3.gz'
./usr/share/man/man3/dwarf_func_cu_offset.3.gz: Can't create 'usr/share/man/man3/dwarf_func_cu_offset.3.gz'
./usr/share/man/man3/remainder.3.gz: Can't create 'usr/share/man/man3/remainder.3.gz'
tar: Error exit delayed from previous errors.

I've trimmed most of the errors as there were over 1000 lines. When I tried to remove the test directory I got:

root@cavium:/tank/andrew # rm -fr test
rm: test/usr/share/man/man3: Directory not empty
rm: test/usr/share/man: Directory not empty
rm: test/usr/share: Directory not empty
rm: test/usr: Directory not empty
rm: test: Directory not empty
root@cavium:/tank/andrew # ls -lh test/usr/share/man/man3
ls: MD4FileChunk.3.gz: No such file or directory
ls: catanf.3.gz: No such file or directory
ls: gelf_getmove.3.gz: No such file or directory
ls: krb5_config_vget_strings.3.gz: No such file or directory
ls: quota_read.3.gz: No such file or directory
ls: rpc_clnt_create.3.gz: No such file or directory
ls: ufs_disk_close.3.gz: No such file or directory
ls: wcslcpy.3.gz: No such file or directory
total 0

Neither remounting nor rebooting fixed the error, so it seems to be an issue on the disk.
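
The symptom above (names returned by a directory listing that then fail to resolve) can be checked for programmatically. The following is a minimal sketch, assuming a POSIX system; count_dangling_entries is a hypothetical helper name and is not part of any FreeBSD or ZFS tool. It counts entries that readdir() reports but that fail a by-name lookup with ENOENT.

```c
#define _POSIX_C_SOURCE 200809L

#include <dirent.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/*
 * Hypothetical helper: count entries that readdir() returns but that
 * cannot be looked up by name (lookup fails with ENOENT).
 * Returns -1 if the directory itself cannot be opened.
 */
static int
count_dangling_entries(const char *path)
{
	DIR *dir = opendir(path);
	struct dirent *de;
	struct stat sb;
	int dangling = 0;

	if (dir == NULL)
		return (-1);
	while ((de = readdir(dir)) != NULL) {
		/* "." and ".." always resolve; skip them. */
		if (strcmp(de->d_name, ".") == 0 ||
		    strcmp(de->d_name, "..") == 0)
			continue;
		/* Look the name up relative to the open directory. */
		if (fstatat(dirfd(dir), de->d_name, &sb,
		    AT_SYMLINK_NOFOLLOW) == -1 && errno == ENOENT) {
			printf("dangling: %s/%s\n", path, de->d_name);
			dangling++;
		}
	}
	closedir(dir);
	return (dangling);
}
```

On a healthy filesystem this should return 0 for any directory that is not being modified concurrently; in the state described above, a call such as count_dangling_entries("test/usr/share/man/man3") would be expected to report the same names that ls complains about.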
Comment 1 Andrew Turner freebsd_committer 2016-06-13 17:28:57 UTC
This seems to be fixed by adding memory barriers to the OpenSolaris atomic_cas_* functions. These are used as a locking primitive (among other uses).

It may be overkill to add barriers to both atomic_cas_* functions, as some code may not depend on the ordering and the barriers will slow it down, but refining this can be a later task, once testing has confirmed that these barriers are sufficient.
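
To illustrate why the ordering matters when a compare-and-swap is used as a lock, here is a minimal sketch in portable C11. It is not the committed change: sketch_cas_32, struct sketch_lock, sketch_trylock and sketch_unlock are hypothetical names, and C11 <stdatomic.h> stands in for the kernel's own atomic primitives. The CAS returns the previous value, as the atomic_cas_* family does, and uses sequentially consistent ordering so that accesses inside the critical section cannot be reordered ahead of the lock acquisition on a weakly ordered CPU such as arm64.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for an atomic_cas_32-style primitive.
 * Returns the value that was in *target before the operation.
 * seq_cst ordering acts as a full barrier around the swap; with
 * memory_order_relaxed instead, later loads and stores could be
 * observed before the swap on a weakly ordered machine.
 */
static uint32_t
sketch_cas_32(_Atomic uint32_t *target, uint32_t cmp, uint32_t newval)
{
	uint32_t expected = cmp;

	/* On failure, expected is overwritten with the current value. */
	atomic_compare_exchange_strong_explicit(target, &expected, newval,
	    memory_order_seq_cst, memory_order_seq_cst);
	return (expected);
}

/* Hypothetical lock built on the CAS: owner == 0 means unlocked. */
struct sketch_lock {
	_Atomic uint32_t owner;
};

/* Returns nonzero if the lock was free and is now owned by "me". */
static int
sketch_trylock(struct sketch_lock *l, uint32_t me)
{
	return (sketch_cas_32(&l->owner, 0, me) == 0);
}

static void
sketch_unlock(struct sketch_lock *l, uint32_t me)
{
	assert(atomic_load_explicit(&l->owner, memory_order_relaxed) == me);
	/* Release: writes made under the lock become visible first. */
	atomic_store_explicit(&l->owner, 0, memory_order_release);
}
```

A caller that only needs an atomic update, with no ordering guarantee for surrounding memory accesses, could use a weaker memory order. That is the trade-off mentioned above: barriers on every CAS are safe but may cost performance where the ordering is not needed.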
Comment 2 Peter Wemm freebsd_committer freebsd_triage 2016-08-27 19:32:04 UTC
Did the extra barriers get added? I've been experimentally running root-on-zfs on a machine in the cluster in a typical cluster configuration to see if I can break things, but it seems quite solid. I don't know if it matters, but it is running top-of-tree with avg@'s ZPL namespace layer fixes.
Comment 3 Andrew Turner freebsd_committer 2016-08-28 19:44:43 UTC
I think it has been fixed; I haven't been able to reproduce this on a more recent head. I'm running a ZFS root system in the netperf cluster, and another with poudriere building packages on ZFS, and haven't seen any ZFS issues.
Comment 4 Andrew Turner freebsd_committer 2016-10-28 17:51:12 UTC
This seems to be fixed. I've been using ZFS on ThunderX without seeing these issues.