Bug 272678 - VFS: Incorrect data in read from concurrent write
Summary: VFS: Incorrect data in read from concurrent write
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: 13.2-RELEASE
Hardware: amd64 Any
: --- Affects Some People
Assignee: Konstantin Belousov
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-07-23 13:03 UTC by Kristian Nielsen
Modified: 2024-01-10 07:12 UTC (History)
7 users (show)

See Also:


Attachments
C test program to trigger the bug (3.60 KB, text/plain)
2023-07-23 13:03 UTC, Kristian Nielsen
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Kristian Nielsen 2023-07-23 13:03:32 UTC
Created attachment 243566 [details]
C test program to trigger the bug

A read(2) system call on a file can read incorrect data when run
concurrently with a write(2) that appends to the file.

The attached test program demonstrates this:

  sudo mount -t tmpfs tmpfs /tmp
  cd /tmp
  ./test_concurrent_read_write2
ERROR! invalid value read 0x 0 at 343 of 387, pos 387

It's a race, seems to need a couple hundred iterations to provoke the bug
(it defaults to 1000 iterations). When the bug happens, the read() sees the
length of the file as including the data from the concurrent write(), but
the read data is incorrect (all zeros in my tests).

This was reproduced with FreeBSD 13.0, 13.2, and 14-current.
It is reproducable on tmpfs with default settings.
It can reportedly be reproduced on ufs filesystems by disabling vn_io_fault
and enablint vn_io_pgcache_read_enable.
A work-around is: sysctl debug.vn_io_pgcache_read_enable=0
Comment 1 Andrew "RhodiumToad" Gierth 2023-07-23 13:32:58 UTC
I analyzed this one based on discussions on IRC with the original poster (and did the testing on 13.2-stable and 14-current).

The short version is that vn_read thinks it can do the VOP_READ_PGCACHE path without doing any locking at all, which violates the consistency expected here (and the OP's test case is taken from an actual application, mariadb). The problem does not normally manifest with non-tmpfs filesystems because those are using range locks and the vn_io_fault code path (which tmpfs does not use).

In more detail:

vn_write calls tmpfs_write with the vnode exclusively locked, and on an appending write, tmpfs_write updates the file length before copying in the new data.

vn_read calls tmpfs_read_pgcache without any locking, and that reads the file length (getting the new length). There is a comment here "size cannot become shorter due to rangelock.", which I think is bogus because there is no rangelock in effect at this point; further tests would be needed to verify whether a concurrent truncate could crash it. After getting the length (and in the race case, that'll be the new length), it copies the data (which is not yet written in the race case).
Comment 2 Andrew "RhodiumToad" Gierth 2023-07-23 14:18:25 UTC
(In reply to Andrew "RhodiumToad" Gierth from comment #1)

Concurrent truncate doesn't, so far in my testing, result in any crashes, but it can cause the reader to read data with 0x00 in place of the original bytes.

i.e. file initially contains some non-zero bytes; read and ftruncate(fd,0) race; read might return non-zero bytes read based on the old length, but with the data zeroed.

(I noticed another comment in uiomove_object that said that the tmpfs vnode lock was held, even though it might not be.)
Comment 3 Kristian Nielsen 2023-07-23 14:51:01 UTC
Original MariaDB bug report for reference:

    https://jira.mariadb.org/browse/MDEV-28986
Comment 4 Mateusz Guzik freebsd_committer freebsd_triage 2023-07-23 15:43:48 UTC
Thanks for the report.

Analysis looks mostly right.

Until the issue is resolved, it can be worked around with:
sysctl debug.vn_io_pgcache_read_enable=0 (which you can add to sysctl.conf)

To the bug, while it is correct that the size is updated first and causing trouble, another issue is that pages are only sbusied (as opposed to xbusied), thus even if file size remains unchanged, the read may get a transient state which violates the idea of reads and writes maintaining atomicity.

truncate should also xbusy, which will sort it out.
Comment 5 Konstantin Belousov freebsd_committer freebsd_triage 2023-07-23 16:02:52 UTC
https://reviews.freebsd.org/D41158
Comment 6 commit-hook freebsd_committer freebsd_triage 2023-07-24 22:04:12 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=5b353925ff61b9ddb97bb453ba75278b578ed7d9

commit 5b353925ff61b9ddb97bb453ba75278b578ed7d9
Author:     Konstantin Belousov <kib@FreeBSD.org>
AuthorDate: 2023-07-23 15:55:50 +0000
Commit:     Konstantin Belousov <kib@FreeBSD.org>
CommitDate: 2023-07-24 22:02:59 +0000

    vnode read(2)/write(2): acquire rangelock regardless of do_vn_io_fault()

    To ensure atomicity of reads against parallel writes and truncates,
    vnode lock was not enough at least since introduction of vn_io_fault().
    That code only take rangelock when it was possible that vn_read() and
    vn_write() could drop the vnode lock.

    At least since the introduction of VOP_READ_PGCACHE() which generally
    does not lock the vnode at all, rangelocks become required even
    for filesystems that do not need vn_io_fault() workaround.  For
    instance, tmpfs.

    PR:     272678
    Analyzed and reviewed by:       Andrew Gierth  <andrew@tao11.riddles.org.uk>
    Sponsored by:   The FreeBSD Foundation
    MFC after:      1 week
    Differential revision:  https://reviews.freebsd.org/D41158

 sys/kern/vfs_vnops.c | 21 ++++++++++++++-------
 1 file changed, 14 insertions(+), 7 deletions(-)
Comment 7 commit-hook freebsd_committer freebsd_triage 2023-08-09 04:03:52 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=821dec4d56a876db56a82cf1dc54a84905b74f5b

commit 821dec4d56a876db56a82cf1dc54a84905b74f5b
Author:     Konstantin Belousov <kib@FreeBSD.org>
AuthorDate: 2023-08-06 01:23:35 +0000
Commit:     Konstantin Belousov <kib@FreeBSD.org>
CommitDate: 2023-08-09 03:54:15 +0000

    vnode io: request range-locking when pgcache reads are enabled

    PR:     272678
    Sponsored by:   The FreeBSD Foundation
    MFC after:      1 week
    Differential revision:  https://reviews.freebsd.org/D41334

 sys/kern/vfs_vnops.c | 20 +++++++++++++-------
 1 file changed, 13 insertions(+), 7 deletions(-)
Comment 8 commit-hook freebsd_committer freebsd_triage 2023-08-16 09:18:18 UTC
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=b9ccef49d02c9ad9e3b986347562408d80271b6c

commit b9ccef49d02c9ad9e3b986347562408d80271b6c
Author:     Konstantin Belousov <kib@FreeBSD.org>
AuthorDate: 2023-07-23 15:55:50 +0000
Commit:     Konstantin Belousov <kib@FreeBSD.org>
CommitDate: 2023-08-16 09:16:09 +0000

    vnode read(2)/write(2): acquire rangelock regardless of do_vn_io_fault()

    PR:     272678

    (cherry picked from commit 5b353925ff61b9ddb97bb453ba75278b578ed7d9)

 sys/kern/vfs_vnops.c | 21 ++++++++++++++-------
 1 file changed, 14 insertions(+), 7 deletions(-)
Comment 9 commit-hook freebsd_committer freebsd_triage 2023-08-16 09:18:21 UTC
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=ee3b3146293c0dd2e8a779ebb0a3ee65318e8c3b

commit ee3b3146293c0dd2e8a779ebb0a3ee65318e8c3b
Author:     Konstantin Belousov <kib@FreeBSD.org>
AuthorDate: 2023-08-06 01:23:35 +0000
Commit:     Konstantin Belousov <kib@FreeBSD.org>
CommitDate: 2023-08-16 09:16:09 +0000

    vnode io: request range-locking when pgcache reads are enabled

    PR:     272678

    (cherry picked from commit 821dec4d56a876db56a82cf1dc54a84905b74f5b)

 sys/kern/vfs_vnops.c | 20 +++++++++++++-------
 1 file changed, 13 insertions(+), 7 deletions(-)
Comment 10 Mark Linimon freebsd_committer freebsd_triage 2024-01-10 07:12:57 UTC
^Triage: assign to committer that resolved.