Bug 60313

Summary: data destruction: lseek(2) misalignment silently ignored by some block devices
Product: Base System Reporter: Matthias Andree <matthias.andree>
Component: kernAssignee: freebsd-bugs (Nobody) <bugs>
Status: Open ---    
Severity: Affects Only Me    
Priority: Normal    
Version: 4.9-STABLE   
Hardware: Any   
OS: Any   

Description Matthias Andree 2003-12-17 00:30:20 UTC
da(4) will happily return success on lseek() to some random position that is
not aligned at a block boundary and start writing from the beginning, byte 0,
of the block, ignoring the lseek().

In other words, if I have a drive with 512 bytes per block, seek to offset 256
and write 512 bytes, da(4) will scribble over the first 256 bytes as well.
Either the seek or the write should be rejected.

vn(4) will accept the seek but a subsequent write(2) will return the bogus
EINVAL in errno.

ad(4) status is unknown.

FreeBSD 5 status is unknown.

Fix: 

Let lseek return (off_t)-1 and errno==EINVAL when the device doesn't support
unaligned access.
How-To-Repeat: Save this test program as tryraw.c.

Compile this DESTRUCTIVE test program and run it on an UNUSED partition:

gcc -O -o tryraw tryraw.c
./tryraw /dev/da0s1b

WARNING: THIS TEST DESTROYS DATA IN THE GIVEN PARTITION!
USE IT ON AN INACTIVE SWAP PARTITION ONLY (i. e. comment the swap partition
out, reboot).

/* BEGIN tryraw.c */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
#include <string.h>
#include <sys/disklabel.h>

void barf(const char *t) __attribute__((noreturn));
void barf(const char *t) { perror(t); exit(EXIT_FAILURE); }

int main(int argc, char **argv) {
	int bs, fd;
	int want = 0x66, t;
	struct disklabel dl;
	char *x;

	if (argc < 2) {
		fprintf(stderr, "need a partition that I can clobber.\n");
		exit(EXIT_FAILURE); 
	}
	fd = open(argv[1], O_RDWR);
	if (fd < 0) barf("open");

	if (0 == ioctl(fd, DIOCGDINFO, &dl)) {
		printf("sector size: %lu\n", (unsigned long)(bs = dl.d_secsize));
	} else barf("ioctl DIOCGDINFO");

	/* this writes bs = blocksize times the byte 0x66 at offset #0
	 * then seeks to offset #bs/2 then writes bs times the byte 0x33.  */
	if (!(x = malloc(bs))) barf("malloc");
	memset(x, want, bs);
	if (lseek(fd, 0, SEEK_SET) == (off_t)-1) barf("lseek 0");
	if (write(fd, x, bs) < bs) barf("write");

	memset(x, 0x33, bs);
	t = lseek(fd, bs >> 1, SEEK_SET);
	if (t == -1) barf("lseek bs/2");
	printf("seeked to byte #%d\n", t);
	if (write(fd, x, bs) < bs) barf("write");

	if (lseek(fd, 0, SEEK_SET) == (off_t)-1) barf("lseek 0");
	if (read(fd, x, bs) < bs) barf("read");
	if (x[0] == want) puts("OK");
	else printf("KERNEL BUG: byte #0 is %x, should be %x\n", x[0], want);
	free(x);
	return 0;
}
/* END tryraw.c */
Comment 1 Bruce Evans 2003-12-17 02:13:55 UTC
On Wed, 17 Dec 2003, Matthias Andree wrote:

> >Description:
> da(4) will happily return success on lseek() to some random position that is
> not aligned at a block boundary and start writing from the beginning, byte 0,
> of the block, ignoring the lseek().
>
> In other words, if I have a drive with 512 bytes per block, seek to offset 256
> and write 512 bytes, da(4) will scribble over the first 256 bytes as well.
> Either the seek or the write should be rejected.

The boundary is checked in dscheck() in RELENG_4, but only after
forgetting it mod DEV_BSIZE, so only offsets that are multiples of
DEV_BSIZE but not a multiple of the sector size are detected.

The boundary is forgotten here in physio() (RELENG_4 version):

% 			blockno = bp->b_offset >> DEV_BSHIFT;
% 			if (chk_blockno && (daddr_t)blockno != blockno) {
% 				error = EINVAL; /* blockno overflow */
% 				goto doerror;
% 			}
% 			bp->b_blkno = blockno;

except bp->b_offset is still available so lower layers can check it.  None
do.  (The acd driver doesn't use dscheck() and uses b_offset to determine
its own idea of the block number; it has a subset of dscheck()'s other
boundary checks and has the same nonexistent boundary checking as physio()
for the offset.)

> vn(4) will accept the seek but a subsequent write(2) will return the bogus
> EINVAL in errno.

I don't see how vn(4) can be missing the bug.  In the "labels" case it
uses dscheck() which doesn't check b_offset, and in the non-labels case
it uses its own check which also doesn't check b_offset.

> ad(4) status is unknown.

I don't see how it can be missing the bug.

> FreeBSD 5 status is unknown.

-current is quite different.  It deprecates b_blkno and uses b_offset
more (mostly in its bio_offset form), and uses GEOM (except in my
version).  physio() still sets b_blkno without checking the boundary,
but b_blkno is mostly not used.

The problem seems to be fixed in GEOM (function g_io_check()).

I fixed it my version of dscheck() but didn't get around
to committing the fix before dscheck() was axed.  Here are my current
diffs relative to RELENG_4.  They don't apply directly.  RELENG_4 needs
mainly the "if (offset % (uoff_t)DEV_BSIZE) {" clause.  The patch also
fixes some slightly incorrect choices of types that become fatally
incorrect with larger daddr_t's in -current, and fixes the longstanding
brokenness of EOF handling.

%%%
Index: subr_diskslice.c
===================================================================
RCS file: /home/ncvs/src/sys/kern/Attic/subr_diskslice.c,v
retrieving revision 1.82.2.6
diff -u -2 -r1.82.2.6 subr_diskslice.c
--- subr_diskslice.c	24 Jul 2001 09:49:41 -0000	1.82.2.6
+++ subr_diskslice.c	7 Dec 2003 05:05:22 -0000
@@ -134,47 +284,55 @@
 int
 dscheck(bp, ssp)
-	struct buf *bp;
+	struct bio *bp;
 	struct diskslices *ssp;
 {
-	daddr_t	blkno;
-	u_long	endsecno;
-	daddr_t	labelsect;
+	daddr_t blkno;
+	u_long endsecno;
+	daddr_t labelsect;
 	struct disklabel *lp;
 	char *msg;
-	long	nsec;
+	long nsec;
+	off_t offset;
 	struct partition *pp;
-	daddr_t	secno;
-	daddr_t	slicerel_secno;
+	daddr_t secno;
+	daddr_t slicerel_secno;
 	struct diskslice *sp;
-	int s;

-	blkno = bp->b_blkno;
-	if (blkno < 0) {
-		printf("dscheck(%s): negative b_blkno %ld\n",
-		    devtoname(bp->b_dev), (long)blkno);
-		bp->b_error = EINVAL;
+	offset = bp->bio_offset;
+	if (offset < 0) {
+		printf("dscheck(%s): negative bio_offset %jd\n",
+		    devtoname(bp->bio_dev), (intmax_t)offset);
+		bp->bio_error = EINVAL;
+		goto bad;
+	}
+	if (offset % (uoff_t)DEV_BSIZE) {
+		printf(
+		"dscheck(%s): bio_offset %jd is not on a DEV_BSIZE boundary\n",
+		    devtoname(bp->bio_dev), (intmax_t)offset);
+		bp->bio_error = EINVAL;
 		goto bad;
 	}
-	sp = &ssp->dss_slices[dkslice(bp->b_dev)];
+	blkno = btodb(bp->bio_offset);
+	sp = &ssp->dss_slices[dkslice(bp->bio_dev)];
 	lp = sp->ds_label;
 	if (ssp->dss_secmult == 1) {
-		if (bp->b_bcount % (u_long)DEV_BSIZE)
+		if (bp->bio_bcount % (u_long)DEV_BSIZE)
 			goto bad_bcount;
 		secno = blkno;
-		nsec = bp->b_bcount >> DEV_BSHIFT;
+		nsec = bp->bio_bcount >> DEV_BSHIFT;
 	} else if (ssp->dss_secshift != -1) {
-		if (bp->b_bcount & (ssp->dss_secsize - 1))
+		if (bp->bio_bcount & (ssp->dss_secsize - 1))
 			goto bad_bcount;
 		if (blkno & (ssp->dss_secmult - 1))
-			goto bad_blkno;
+			goto bad_offset;
 		secno = blkno >> ssp->dss_secshift;
-		nsec = bp->b_bcount >> (DEV_BSHIFT + ssp->dss_secshift);
+		nsec = bp->bio_bcount >> (DEV_BSHIFT + ssp->dss_secshift);
 	} else {
-		if (bp->b_bcount % ssp->dss_secsize)
+		if (bp->bio_bcount % ssp->dss_secsize)
 			goto bad_bcount;
 		if (blkno % ssp->dss_secmult)
-			goto bad_blkno;
+			goto bad_offset;
 		secno = blkno / ssp->dss_secmult;
-		nsec = bp->b_bcount / ssp->dss_secsize;
+		nsec = bp->bio_bcount / ssp->dss_secsize;
 	}
 	if (lp == NULL) {
@@ -184,6 +342,6 @@
 	} else {
 		labelsect = lp->d_partitions[LABEL_PART].p_offset;
-if (labelsect != 0) Debugger("labelsect != 0 in dscheck()");
-		pp = &lp->d_partitions[dkpart(bp->b_dev)];
+		KASSERT(labelsect == 0, ("labelsect != 0 in dscheck()"));
+		pp = &lp->d_partitions[dkpart(bp->bio_dev)];
 		endsecno = pp->p_size;
 		slicerel_secno = pp->p_offset + secno;
@@ -196,6 +354,6 @@
 	    slicerel_secno + nsec > LABELSECTOR + labelsect &&
 #endif
-	    (bp->b_flags & B_READ) == 0 && sp->ds_wlabel == 0) {
-		bp->b_error = EROFS;
+	    bp->bio_cmd == BIO_WRITE && sp->ds_wlabel == 0) {
+		bp->bio_error = EROFS;
 		goto bad;
 	}
@@ -203,7 +361,7 @@
 #if defined(DOSBBSECTOR) && defined(notyet)
 	/* overwriting master boot record? */
-	if (slicerel_secno <= DOSBBSECTOR && (bp->b_flags & B_READ) == 0 &&
+	if (slicerel_secno <= DOSBBSECTOR && bp->bio_cmd == BIO_WRITE &&
 	    sp->ds_wlabel == 0) {
-		bp->b_error = EROFS;
+		bp->bio_error = EROFS;
 		goto bad;
 	}
@@ -211,20 +369,19 @@

 	/* beyond partition? */
-	if (secno + nsec > endsecno) {
-		/* if exactly at end of disk, return an EOF */
-		if (secno == endsecno) {
-			bp->b_resid = bp->b_bcount;
-			return (0);
-		}
-		/* or truncate if part of it fits */
-		nsec = endsecno - secno;
-		if (nsec <= 0) {
-			bp->b_error = EINVAL;
+	if ((uintmax_t)secno + nsec > endsecno) {
+		/* If at or beyond end of partition, return EOF. */
+		if (secno >= endsecno) {
+			if (bp->bio_cmd == BIO_READ) {
+				bp->bio_resid = bp->bio_bcount;
+				return (0);
+			}
+			bp->bio_error = ENOSPC;
 			goto bad;
 		}
-		bp->b_bcount = nsec * ssp->dss_secsize;
+		/* Truncate to the part that fits. */
+		bp->bio_bcount = (endsecno - secno) * ssp->dss_secsize;
 	}

-	bp->b_pblkno = sp->ds_offset + slicerel_secno;
+	bp->bio_pblkno = sp->ds_offset + slicerel_secno;

 	/*
@@ -237,18 +394,18 @@
 	    && slicerel_secno + nsec > LABELSECTOR + labelsect
 #endif
+	    /* XXX following seems to be broken for dedicated case: */
 	    && sp->ds_offset != 0) {
 		struct iodone_chain *ic;

 		ic = malloc(sizeof *ic , M_DEVBUF, M_WAITOK);
-		ic->ic_prev_flags = bp->b_flags;
-		ic->ic_prev_iodone = bp->b_iodone;
-		ic->ic_prev_iodone_chain = bp->b_iodone_chain;
+		ic->ic_prev_flags = bp->bio_flags;
+		ic->ic_prev_iodone = bp->bio_done;
+		ic->ic_prev_iodone_chain = bp->bio_done_chain;
 		ic->ic_args[0].ia_long = (LABELSECTOR + labelsect -
 		    slicerel_secno) * ssp->dss_secsize;
 		ic->ic_args[1].ia_ptr = sp;
-		bp->b_flags |= B_CALL;
-		bp->b_iodone = dsiodone;
-		bp->b_iodone_chain = ic;
-		if (!(bp->b_flags & B_READ)) {
+		bp->bio_done = dsiodone;
+		bp->bio_done_chain = ic;
+		if (bp->bio_cmd != BIO_READ) {
 			/*
 			 * XXX even disklabel(8) writes directly so we need
@@ -260,18 +417,13 @@
 			 * temporarily corrupting the in-core copy.
 			 */
-			if (bp->b_vp != NULL) {
-				s = splbio();
-				bp->b_vp->v_numoutput++;
-				splx(s);
-			}
 			/* XXX need name here. */
 			msg = fixlabel((char *)NULL, sp,
 				       (struct disklabel *)
-				       (bp->b_data + ic->ic_args[0].ia_long),
+				       (bp->bio_data + ic->ic_args[0].ia_long),
 				       TRUE);
 			if (msg != NULL) {
 				printf("dscheck(%s): %s\n",
-				    devtoname(bp->b_dev), msg);
-				bp->b_error = EROFS;
+				    devtoname(bp->bio_dev), msg);
+				bp->bio_error = EROFS;
 				goto bad;
 			}
@@ -282,19 +434,19 @@
 bad_bcount:
 	printf(
-	"dscheck(%s): b_bcount %ld is not on a sector boundary (ssize %d)\n",
-	    devtoname(bp->b_dev), bp->b_bcount, ssp->dss_secsize);
-	bp->b_error = EINVAL;
+	"dscheck(%s): bio_bcount %ld is not on a sector boundary (ssize %d)\n",
+	    devtoname(bp->bio_dev), bp->bio_bcount, ssp->dss_secsize);
+	bp->bio_error = EINVAL;
 	goto bad;

-bad_blkno:
+bad_offset:
 	printf(
-	"dscheck(%s): b_blkno %ld is not on a sector boundary (ssize %d)\n",
-	    devtoname(bp->b_dev), (long)blkno, ssp->dss_secsize);
-	bp->b_error = EINVAL;
+	"dscheck(%s): bio_offset %jd is not on a sector boundary (ssize %d)\n",
+	    devtoname(bp->bio_dev), (intmax_t)offset, ssp->dss_secsize);
+	bp->bio_error = EINVAL;
 	goto bad;

 bad:
-	bp->b_resid = bp->b_bcount;
-	bp->b_flags |= B_ERROR;
+	bp->bio_resid = bp->bio_bcount;
+	bp->bio_flags |= BIO_ERROR;
 	return (-1);
 }
%%%

Bruce
Comment 2 Hiten Pandya freebsd_committer freebsd_triage 2005-06-16 03:24:27 UTC
Bruce,

Given the impact of this bug, shouldn't we work towards committing 
it to RELENG_4 and various other related branches?

Regards,

Hiten Pandya
hmp at freebsd.org
Comment 3 Alexander Leidinger freebsd_committer freebsd_triage 2006-01-08 16:39:04 UTC
State Changed
From-To: open->closed

5.x and 6.x come with GEOM, where -- according to phk -- this is 
handled correctly. 

From phk: 
---snip--- 
Notice however that lseek(2) does not in fact instigate any I/O 
request so it cannot possibly know if the address is properly aligned 
or not.  Therefore lseek(2) will always succeed, even if the subsequent 
read/write will fail at that offset. 
---snip---
Comment 4 Alexander Best freebsd_committer freebsd_triage 2011-11-17 20:50:53 UTC
State Changed
From-To: closed->open

Open this problem report again. From a recent discussion on freebsd-hackers@ it 
has become apparent that the issues described in this problem report still 
exist in HEAD today.
Comment 5 Eitan Adler freebsd_committer freebsd_triage 2017-12-31 08:01:37 UTC
For bugs matching the following criteria:

Status: In Progress Changed: (is less than) 2014-06-01

Reset to default assignee and clear in-progress tags.

Mail being skipped