Created attachment 243566 [details] C test program to trigger the bug A read(2) system call on a file can read incorrect data when run concurrently with a write(2) that appends to the file. The attached test program demonstrates this: sudo mount -t tmpfs tmpfs /tmp cd /tmp ./test_concurrent_read_write2 ERROR! invalid value read 0x 0 at 343 of 387, pos 387 It's a race, seems to need a couple hundred iterations to provoke the bug (it defaults to 1000 iterations). When the bug happens, the read() sees the length of the file as including the data from the concurrent write(), but the read data is incorrect (all zeros in my tests). This was reproduced with FreeBSD 13.0, 13.2, and 14-current. It is reproducable on tmpfs with default settings. It can reportedly be reproduced on ufs filesystems by disabling vn_io_fault and enablint vn_io_pgcache_read_enable. A work-around is: sysctl debug.vn_io_pgcache_read_enable=0
I analyzed this one based on discussions on IRC with the original poster (and did the testing on 13.2-stable and 14-current). The short version is that vn_read thinks it can do the VOP_READ_PGCACHE path without doing any locking at all, which violates the consistency expected here (and the OP's test case is taken from an actual application, mariadb). The problem does not normally manifest with non-tmpfs filesystems because those are using range locks and the vn_io_fault code path (which tmpfs does not use). In more detail: vn_write calls tmpfs_write with the vnode exclusively locked, and on an appending write, tmpfs_write updates the file length before copying in the new data. vn_read calls tmpfs_read_pgcache without any locking, and that reads the file length (getting the new length). There is a comment here "size cannot become shorter due to rangelock.", which I think is bogus because there is no rangelock in effect at this point; further tests would be needed to verify whether a concurrent truncate could crash it. After getting the length (and in the race case, that'll be the new length), it copies the data (which is not yet written in the race case).
(In reply to Andrew "RhodiumToad" Gierth from comment #1) Concurrent truncate doesn't, so far in my testing, result in any crashes, but it can cause the reader to read data with 0x00 in place of the original bytes. i.e. file initially contains some non-zero bytes; read and ftruncate(fd,0) race; read might return non-zero bytes read based on the old length, but with the data zeroed. (I noticed another comment in uiomove_object that said that the tmpfs vnode lock was held, even though it might not be.)
Original MariaDB bug report for reference: https://jira.mariadb.org/browse/MDEV-28986
Thanks for the report. Analysis looks mostly right. Until the issue is resolved, it can be worked around with: sysctl debug.vn_io_pgcache_read_enable=0 (which you can add to sysctl.conf) To the bug, while it is correct that the size is updated first and causing trouble, another issue is that pages are only sbusied (as opposed to xbusied), thus even if file size remains unchanged, the read may get a transient state which violates the idea of reads and writes maintaining atomicity. truncate should also xbusy, which will sort it out.
https://reviews.freebsd.org/D41158
A commit in branch main references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=5b353925ff61b9ddb97bb453ba75278b578ed7d9 commit 5b353925ff61b9ddb97bb453ba75278b578ed7d9 Author: Konstantin Belousov <kib@FreeBSD.org> AuthorDate: 2023-07-23 15:55:50 +0000 Commit: Konstantin Belousov <kib@FreeBSD.org> CommitDate: 2023-07-24 22:02:59 +0000 vnode read(2)/write(2): acquire rangelock regardless of do_vn_io_fault() To ensure atomicity of reads against parallel writes and truncates, vnode lock was not enough at least since introduction of vn_io_fault(). That code only take rangelock when it was possible that vn_read() and vn_write() could drop the vnode lock. At least since the introduction of VOP_READ_PGCACHE() which generally does not lock the vnode at all, rangelocks become required even for filesystems that do not need vn_io_fault() workaround. For instance, tmpfs. PR: 272678 Analyzed and reviewed by: Andrew Gierth <andrew@tao11.riddles.org.uk> Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D41158 sys/kern/vfs_vnops.c | 21 ++++++++++++++------- 1 file changed, 14 insertions(+), 7 deletions(-)
A commit in branch main references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=821dec4d56a876db56a82cf1dc54a84905b74f5b commit 821dec4d56a876db56a82cf1dc54a84905b74f5b Author: Konstantin Belousov <kib@FreeBSD.org> AuthorDate: 2023-08-06 01:23:35 +0000 Commit: Konstantin Belousov <kib@FreeBSD.org> CommitDate: 2023-08-09 03:54:15 +0000 vnode io: request range-locking when pgcache reads are enabled PR: 272678 Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D41334 sys/kern/vfs_vnops.c | 20 +++++++++++++------- 1 file changed, 13 insertions(+), 7 deletions(-)
A commit in branch stable/13 references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=b9ccef49d02c9ad9e3b986347562408d80271b6c commit b9ccef49d02c9ad9e3b986347562408d80271b6c Author: Konstantin Belousov <kib@FreeBSD.org> AuthorDate: 2023-07-23 15:55:50 +0000 Commit: Konstantin Belousov <kib@FreeBSD.org> CommitDate: 2023-08-16 09:16:09 +0000 vnode read(2)/write(2): acquire rangelock regardless of do_vn_io_fault() PR: 272678 (cherry picked from commit 5b353925ff61b9ddb97bb453ba75278b578ed7d9) sys/kern/vfs_vnops.c | 21 ++++++++++++++------- 1 file changed, 14 insertions(+), 7 deletions(-)
A commit in branch stable/13 references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=ee3b3146293c0dd2e8a779ebb0a3ee65318e8c3b commit ee3b3146293c0dd2e8a779ebb0a3ee65318e8c3b Author: Konstantin Belousov <kib@FreeBSD.org> AuthorDate: 2023-08-06 01:23:35 +0000 Commit: Konstantin Belousov <kib@FreeBSD.org> CommitDate: 2023-08-16 09:16:09 +0000 vnode io: request range-locking when pgcache reads are enabled PR: 272678 (cherry picked from commit 821dec4d56a876db56a82cf1dc54a84905b74f5b) sys/kern/vfs_vnops.c | 20 +++++++++++++------- 1 file changed, 13 insertions(+), 7 deletions(-)
^Triage: assign to committer that resolved.