|Summary:||ZFS can have a non-empty directory, but the files don't exist on arm64.|
|Product:||Base System||Reporter:||Andrew Turner <Andrew>|
|Component:||kern||Assignee:||Andrew Turner <Andrew>|
|Severity:||Affects Some People||CC:||avg, emaste, peter|
|Bug Depends on:|
Description Andrew Turner 2015-11-06 16:55:26 UTC
I can easily get ZFS into a state where a directory is not empty, but the files within it don't exist. I can reproduce this by extracting base.txz from a weekly snapshot. When I extracted the tarball I got:

root@cavium:/tank/andrew # tar -xpf base.txz -C /tank/andrew/test/
./usr/share/man/man3/remainderl.3.gz: Can't create 'usr/share/man/man3/remainderl.3.gz'
./usr/share/man/man3/jdate.3.gz: Can't create 'usr/share/man/man3/jdate.3.gz'
./usr/share/man/man3/archive_write_add_filter_xz.3.gz: Can't create 'usr/share/man/man3/archive_write_add_filter_xz.3.gz'
...
./usr/share/man/man3/winnstr.3.gz: Can't create 'usr/share/man/man3/winnstr.3.gz'
./usr/share/man/man3/dwarf_func_cu_offset.3.gz: Can't create 'usr/share/man/man3/dwarf_func_cu_offset.3.gz'
./usr/share/man/man3/remainder.3.gz: Can't create 'usr/share/man/man3/remainder.3.gz'
tar: Error exit delayed from previous errors.

I've trimmed most of the errors as there were over 1000 lines. When I tried to remove the test directory I got:

root@cavium:/tank/andrew # rm -fr test
rm: test/usr/share/man/man3: Directory not empty
rm: test/usr/share/man: Directory not empty
rm: test/usr/share: Directory not empty
rm: test/usr: Directory not empty
rm: test: Directory not empty
root@cavium:/tank/andrew # ls -lh test/usr/share/man/man3
ls: MD4FileChunk.3.gz: No such file or directory
ls: catanf.3.gz: No such file or directory
ls: gelf_getmove.3.gz: No such file or directory
ls: krb5_config_vget_strings.3.gz: No such file or directory
ls: quota_read.3.gz: No such file or directory
ls: rpc_clnt_create.3.gz: No such file or directory
ls: ufs_disk_close.3.gz: No such file or directory
ls: wcslcpy.3.gz: No such file or directory
total 0

Neither remounting nor rebooting fixed the error, so it seems to be an issue with the data on disk.
Comment 1 Andrew Turner 2016-06-13 17:28:57 UTC
This seems to be fixed by adding memory barriers to the opensolaris atomic_cas_* functions. These are used as a locking primitive (among other uses). It may be overkill to add barriers to both atomic_cas_* functions, as some code may not depend on the ordering and the barriers will slow it down, but removing any unneeded barriers can be a later task, after testing to check that these barriers are sufficient.
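
To illustrate why a compare-and-swap used as a locking primitive needs ordering on a weakly ordered CPU such as arm64, here is a minimal sketch in portable C11. It is not the FreeBSD or opensolaris compat code: sketch_lock_t, sketch_cas_acq, sketch_trylock and sketch_unlock are hypothetical names invented for this example, and the acquire/release pairing shown is one common way to get the ordering, not necessarily what the actual change does.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical lock word for illustration only: 0 = free, else owner id. */
typedef struct {
	_Atomic uint32_t owner;
} sketch_lock_t;

/*
 * Compare-and-swap that returns the value seen before the operation.
 * On success it has acquire semantics, so loads and stores that follow
 * it in program order cannot be reordered before the CAS. With relaxed
 * ordering instead, a weakly ordered CPU could let accesses to the
 * protected data happen before the lock is actually taken.
 */
static uint32_t
sketch_cas_acq(_Atomic uint32_t *p, uint32_t cmp, uint32_t newval)
{
	uint32_t expected = cmp;

	/* On failure, 'expected' is overwritten with the current value. */
	atomic_compare_exchange_strong_explicit(p, &expected, newval,
	    memory_order_acquire, memory_order_relaxed);
	return (expected);
}

/* Try to take the lock; returns 1 on success, 0 if it is already held. */
static int
sketch_trylock(sketch_lock_t *l, uint32_t id)
{
	return (sketch_cas_acq(&l->owner, 0, id) == 0);
}

/*
 * Release the lock with a release store, so every write made while
 * holding the lock is visible before the lock is seen as free.
 */
static void
sketch_unlock(sketch_lock_t *l, uint32_t id)
{
	assert(atomic_load_explicit(&l->owner, memory_order_relaxed) == id);
	atomic_store_explicit(&l->owner, 0, memory_order_release);
}
```

The trade-off mentioned above shows up here too: callers that only need atomicity, not ordering, would pay for the acquire semantics on every call, which is why separate relaxed and ordered variants are often provided.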
Comment 2 Peter Wemm 2016-08-27 19:32:04 UTC
Did the extra barriers get added? I've been experimentally running root-on-zfs on a machine in the cluster in a typical cluster configuration to see if I can break things, but it seems quite solid. I don't know if it matters, but it is running top-of-tree with avg@'s ZPL namespace layer fixes.
Comment 3 Andrew Turner 2016-08-28 19:44:43 UTC
I think it has been fixed, as I haven't been able to reproduce this on a more recent head. I'm running a ZFS root system in the netperf cluster, and another with poudriere building packages on ZFS, and haven't seen any ZFS issues.
Comment 4 Andrew Turner 2016-10-28 17:51:12 UTC
This seems to be fixed; I've been using ZFS on ThunderX without seeing these issues.