Bug 68690 - [libc] write(2) returns wrong value when EFAULT
Summary: [libc] write(2) returns wrong value when EFAULT
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: unspecified
Hardware: Any Any
: Normal Affects Only Me
Assignee: freebsd-bugs mailing list
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2004-07-05 13:30 UTC by KOIE Hidetaka
Modified: 2017-11-07 16:45 UTC (History)
1 user (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description KOIE Hidetaka 2004-07-05 13:30:22 UTC
Invoking write(fd, buf, size), if buf has both validand invalid segment,
write return -1, but it's file pointer has been advanced.
The caller mistakes the pointer stays.

Fix: 

Sorry, I don't know.
How-To-Repeat: #include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <unistd.h>

int
main()
{
    const int PAGESIZE = sysconf(_SC_PAGESIZE);
    const int N = 5;
    const int SIZE = N * PAGESIZE;
    int ifd = open("in.dat", O_RDWR|O_CREAT|O_TRUNC, 0600);
    if (ifd < 0) {
        perror("open");
        goto out;
    }
    int should = 0;
    for (int i = 0; i < N - 2; i++) {
        char buf[PAGESIZE];
        if (write(ifd, buf, sizeof buf) != sizeof buf) {
            perror("write");
            goto out;
        }
        should += sizeof buf;
    }

    int ofd = open("out.dat", O_WRONLY|O_CREAT|O_TRUNC, 0600);
    if (ofd < 0) {
        perror("open");
        goto out;
    }

    void *addr = mmap(0, SIZE, PROT_READ, MAP_PRIVATE, ifd, 0);
    if (addr == MAP_FAILED) {
        perror("mmap");
        goto out;
    }
    fprintf(stderr, "pos=%ld\n", (long)lseek(ofd, 0, SEEK_CUR));
    int n = write(ofd, addr, SIZE);
    fprintf(stderr, "write(%d)->%d (should be %d)\n", SIZE, n, should);
    perror("write");
    fprintf(stderr, "pos=%ld (should be %d)\n", (long)lseek(ofd, 0, SEEK_CUR), should);
 out:
    exit(0);
}
Comment 1 Bruce Evans 2004-07-07 09:23:57 UTC
On Mon, 5 Jul 2004, KOIE Hidetaka wrote:

> >Description:
> Invoking write(fd, buf, size), if buf has both validand invalid segment,
> write return -1, but it's file pointer has been advanced.
> The caller mistakes the pointer stays.

This is a longstanding bug.  It is more of a problem for non-seekable
devices since physically written can't be undone, so the best possible
error handling is to return a short write as specified in all versions
of POSIX.

Do you actually see the file pointer advanced?  This may be file system
dependent.  ffs is supposed to back out of the write, and it does so
for me.  Output:

%%%
pos=0
write(20480)->-1 (should be 12288)
write: Bad address
pos=0 (should be 12288)
%%%

This is correct.  We tried to write 20K but the buffer size is only 12K
and the was an i/o error after 12K.  ffs_write() handled the error by
truncating the file to its original size and rerstoring the original
file system, and since the write was at the end of the file this has
the same result as if the write had failed with EFAULT after 0K.  I
think there is also no way for other processes to see the write while
it is in progress, since even if they have mmapped the file they can't
see beyond its original length.

To show an actual bug for ffs, modifiy the example slightly so that
it overwrites existing data.  Then truncating the file has no effect,
and it is just wrong for write() to return -1 after clobbering some
data.

> >Fix:

(1) Here is ffs_write()'s buggy error handling:

% 	for (error = 0; uio->uio_resid > 0;) {
% 		...
% 		if (uio->uio_offset + xfersize > ip->i_size)
% 			vnode_pager_setsize(vp, uio->uio_offset + xfersize);

Not error handling, but general setup.  Hopefully extending the size at the
vm level doesn't allow other other processes to see up to the new EOF yet.

%
%                 /*
% 		 * We must perform a read-before-write if the transfer size
% 		 * does not cover the entire buffer.
%                  */
% 		if (fs->fs_bsize > xfersize)
% 			flags |= BA_CLRBUF;
% 		else
% 			flags &= ~BA_CLRBUF;

Security-related code.  Other processes can see buffer contents via
mmap(), at least after completion of extension of the file at the vfs
level, so we must clear the buffer contents now.  This also avoids
potential leakage of unwritten buffer contents to the same process
after buggy EFAULT handling.

% 		...
% 		/*
% 		 * If the buffer is not valid we have to clear out any
% 		 * garbage data from the pages instantiated for the buffer.
% 		 * If we do not, a failed uiomove() during a write can leave
% 		 * the prior contents of the pages exposed to a userland
% 		 * mmap().  XXX deal with uiomove() errors a better way.
% 		 */
% 		if ((bp->b_flags & B_CACHE) == 0 && fs->fs_bsize <= xfersize)
% 			vfs_bio_clrbuf(bp);

More buffer clearing, for another case.  It's not so clear that this is
done before other processes can see the buffer.

% 	...
% 	/*
% 	 * If we successfully wrote any data, and we are not the superuser
% 	 * we clear the setuid and setgid bits as a precaution against
% 	 * tampering.
% 	 */
% 	if (resid > uio->uio_resid && ap->a_cred &&
% 	    suser_cred(ap->a_cred, PRISON_ROOT)) {
% 		ip->i_mode &= ~(ISUID | ISGID);
% 		DIP(ip, i_mode) = ip->i_mode;
% 	}

Misplaced code.  We are going to back out of the write in some cases, so
we don't know if we successefully wrote any data here.

% 	...
% 	if (error) {
% 		if (ioflag & IO_UNIT) {
% 			(void)UFS_TRUNCATE(vp, osize,
% 			    IO_NORMAL | (ioflag & IO_SYNC),
% 			    ap->a_cred, uio->uio_td);
% 			uio->uio_offset -= resid - uio->uio_resid;
% 			uio->uio_resid = resid;
% 		}
% 	} ...

Here is where we back out of the write in some cases.
(a) We only back out in the IO_UNIT case, but that is the usual case and
    is always the case for write(2).  (There is another bug suite
    involving IO_UNIT.  Most references to IO_UNIT in the kernel except
    those related to userland i/o don't actually give atomic writes
    since writes are made non-atomic using vfs_rdwr_inchunks() anyway.
    OTOH, there seems to be no reason for ffs to skip the backout in
    the !IO_UNIT case, so there is no need for the IO_UNIT flag to exist.)
(b) We cannot back out if anything was written before the original end of
    the file.  Truncation just increases the mess in that case, since it
    only reduces the file to its original size, but the file position
    (in uio->uio_offset) is reduced by the full amount.

Fixes for ffs_write(): ignore IO_UNIT, and don't rewind the file pointer
below osize.  This gives short for writes that wrote something before
osize and then failed.

Similarly for other file systems.

Here is write(2)'s buggy error handling (in dofilewrite()):

% 	if ((error = fo_write(fp, &auio, td->td_ucred, flags, td))) {
% 		if (auio.uio_resid != cnt && (error == ERESTART ||
% 		    error == EINTR || error == EWOULDBLOCK))
% 			error = 0;

(auio.uio_resid != cnt) means that the write was short.  It should be
returned to userland as a short write by setting error to 0 unconditionally.
LOwer layers mostly get this write and return a short write if and only
if they couldn't back out of the write, but they still return a nonzero
error code to indicate that there was a problem.  The userland interface
(write(2) in this case) just doesn't support returning both an error code
and a byte count, so one of them must be discarded, and it must be the
error code since it is useful and POSIX standard to return the byte count.

Since ffs attempts to backs out of failing writes, and the backup is
successful in most cases, this bug mainly affects devices.

Similarly for read(2) and writev(2), etc.

Bruce
Comment 2 KOIE Hidetaka 2004-07-08 04:07:56 UTC
Thanks for your inspection.

  Message-Id: <20040707172014.C3185@gamplex.bde.org>
  Date:       Wed, 7 Jul 2004 18:23:57 +1000 (EST)
  From:       Bruce Evans <bde@zeta.org.au>
  Subject:    Re: kern/68690: write(2) returns wrong vlalue when EFAUL..

  | Do you actually see the file pointer advanced?  This may be file system
  | dependent.  ffs is supposed to back out of the write, and it does so
  | for me.  Output:
  | 
  | %%%
  | pos=0
  | write(20480)->-1 (should be 12288)
  | write: Bad address
  | pos=0 (should be 12288)
  | %%%

I was unaware of examining on NFS.
%%%
pos=0
write(20480)->-1 (should be 12288)
write: Bad address
pos=8192 (should be 12288)
%%%

The file pointer is not backed.


  | > >Fix:

I understand some problems exist.

--
KOIE Hidetaka / koie@suri.co.jp / SURIGIKEN Co.,LTD.
Comment 3 Bruce Evans 2004-07-08 08:19:51 UTC
On Thu, 8 Jul 2004, KOIE Hidetaka wrote:

>   From:       Bruce Evans <bde@zeta.org.au>
>   | Do you actually see the file pointer advanced?  This may be file system
>   | dependent.  ffs is supposed to back out of the write, and it does so
>   | for me.  Output:
>   |
>   | %%%
>   | pos=0
>   | write(20480)->-1 (should be 12288)
>   | write: Bad address
>   | pos=0 (should be 12288)
>   | %%%
>
> I was unaware of examining on NFS.

I think you mean that you saw this error for nfs.

> %%%
> pos=0
> write(20480)->-1 (should be 12288)
> write: Bad address
> pos=8192 (should be 12288)
> %%%
>
> The file pointer is not backed.

I see this behaviour for nfs to.  It is because nfs_write() just doesn't
back out of the write like ffs_write does.  From nfs_bio.c rev.1.132:

% int
% nfs_write(struct vop_write_args *ap)
% {
% ...
% 	do {
% ...
% 		error = uiomove((char *)bp->b_data + on, n, uio);
%
% 		/*
% 		 * Since this block is being modified, it must be written
% 		 * again and not just committed.  Since write clustering does
% 		 * not work for the stage 1 data write, only the stage 2
% 		 * commit rpc, we have to clear B_CLUSTEROK as well.
% 		 */
% 		bp->b_flags &= ~(B_NEEDCOMMIT | B_CLUSTEROK);
%
% 		if (error) {
% 			bp->b_ioflags |= BIO_ERROR;
% 			brelse(bp);
% 			break;
% 		}
% 	} while (uio->uio_resid > 0 && n > 0);
%
% 	if (haverslock)
% 		nfs_rsunlock(np, td);
%
% 	return (error);
% }

After an error in uiomove(), nfs_write() just unlocks and returns, so
it provides no protection from the bug in dofilewrite() even if the
i/o was at the end of the file.  Only the case where the error occurs
before any i/o is done is handled correctly.

Bruce
Comment 4 Enji Cooper freebsd_committer 2015-11-16 06:22:03 UTC
Taking for the RETEST.