Bug 217065 - Memmove / bcopy optimization bug
Summary: Memmove / bcopy optimization bug
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: arm (show other bugs)
Version: CURRENT
Hardware: Any Any
: --- Affects Only Me
Assignee: freebsd-arm (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-02-13 10:51 UTC by Alexandre martins
Modified: 2017-06-17 20:48 UTC (History)
4 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Alexandre martins 2017-02-13 10:51:06 UTC
Hello,

During the optimization phase of our project on an Armada 380, I have see that an optimization on bcopy/memmove is broken.

https://svnweb.freebsd.org/base/head/sys/arm/arm/support.S?view=annotate#l403

The condition on the following line are reserved:

> subcc r3, r0, r1 /* if (dst > src) r3 = dst - src */
> subcs r3, r1, r0 /* if (src > dsr) r3 = src - dst */
> cmp r3, r2 /* if (r3 < len) we have an overlap */
> bcc PIC_SYM(_C_LABEL(memcpy), PLT)

Should be:

> subcs r3, r0, r1 /* if (dst > src) r3 = dst - src */
> subcc r3, r1, r0 /* if (src > dsr) r3 = src - dst */
> cmp r3, r2 /* if (r3 < len) we have an overlap */
> bcs PIC_SYM(_C_LABEL(memcpy), PLT)

This code may produce bugs if an unlikely copy of more that 2Gb is done (I don't see how).

I have already reported that on the mailling list:

https://docs.freebsd.org/cgi/getmsg.cgi?fetch=462567+0+archive/2017/freebsd-arm/20170212.freebsd-arm

Best regards,

Alexandre Martins
Comment 1 Andrew Turner freebsd_committer 2017-02-13 11:15:31 UTC
I don't see the bug.

The cmp on line 401 performs r0 - r1 (dst - src) storing the condition codes, but not the result.
Line 402 will return if the result is zero, i.e. src == dst.
Line 403 will execute if the carry bit is clear, i.e. the result is greater than or equal to zero, so r0 >= r1 (dst >= src).
Line 404 will execute if the carry bit is set, i.e. the result is less than zero, so r0 < r1 (src > dst).

The cmp on line 405 checks r3 - r2 (delta - length)
Line 406 will use memcpy if delta >= length, i.e. they are not overlapping.

This code should never be used for copying more than 2GB. on 32-bit ARM the kernel is limited to 1GB, and this should never be used to read userspace memory.
Comment 2 Mark Millard 2017-02-13 20:16:30 UTC
(In reply to Andrew Turner from comment #1)

> Line 403 will execute if the carry bit is clear, i.e. the result is greater than or equal to zero, so r0 >= r1 (dst >= src).
> Line 404 will execute if the carry bit is set, i.e. the result is less than zero, so r0 < r1 (src > dst).

That is not what I get from the arm documentation. . .

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0473e/CJAJIHAD.html

CC	Carry clear	Less than
CS	Carry set	Greater than or equal, or unordered

line                            instruction
403	 	 	        subcc   r3, r0, r1      /* if (dst > src) r3 = dst - src */
404	 	 	        subcs   r3, r1, r0      /* if (src > dsr) r3 = src - dst */

Line 403 has a less than test to enable its subtraction, not a "greater than  or equal to" test.

Line 404 has a greater than or equal test to enable its subtraction, not a less than test.


However there is a difference in the original and proposed code for when
r3==r2: the proposed code would jump to memcpy but the original code
would not.

This might bias the implementation. If so then the comments should be
of a form that explains why the code is as it is (not sticking normal
non-negative arithmetic and depending on 2's complement properties with
test reversals based on those properties).
Comment 3 Mark Millard 2017-02-13 20:58:04 UTC
An example of the code's activity where the
proposed and original to not match, but first
various quotes about the arm condition codes:

(from http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0473m/dom1359731162080.html )

C is set in one of the following ways:
. . .
For a subtraction, including the comparison instruction CMP,
C is set to 0 if the subtraction produced a borrow (that is,
an unsigned underflow), and to 1 otherwise.

(from http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0473m/dom1359731162080.html )

EQ	Z set	Equal
CC or LO	C clear	Lower (unsigned < )
CS or HS	C set	Higher or same (unsigned >= )


For the original code(small values for simplicity):

r0=0x10
r1=0x20
r2=0x10

NOTE: abs(r0-r1) == r2 so the distance between is equal to n (i.e. r2).

But. . .

cmp r0,r1 gives:

eq= (r0==r1) (i.e., zero)
cc= (r0<r1)  (i.e., one)
cs= (r0>=r1) (i.e., zero)

subcc r3,r0,r1 gives r3= 0x10-0x20
(bit pattern for 2's complement signed result)
(no condition code change)

subcs r3,r1,r0 is a no-op.

r3 is now large as an unsigned value

cmp r3,r2 gives: 

eq= (r3==r2) (i.e., zero)
cc= (r3<r2)  (i.e., zero)
cs= (r3>=r2) (i.e., one)

bcc PIC_SYM(_C_LABEL(memcpy), PLT): does not branch.

By contrast the proposed code would:

same cmp r0,r1 result

subcs r3,r0,r1 is a no-op

subcc r3,r1,r0 gives r3= 0x20-0x10 (i.e., 0x10)
(no condition code change)

cmp r3,r2 gives: 

eq= (r3==r2) (i.e., one)
cc= (r3<r2)  (i.e., zero)
cs= (r3>=r2) (i.e., one)

bcs PIC_SYM(_C_LABEL(memcpy), PLT): branches
Comment 4 Mark Millard 2017-02-14 01:58:40 UTC
In the list I claimed that there were cases of
picking the wrong routine relative to the
correctness of the copy. I was wrong, possibly
for multiple reasons but the following is sufficient:

https://www.freebsd.org/cgi/man.cgi?query=memcpy&sektion=3

says:

     In this implementation memcpy() is	implemented using bcopy(3), and	there-
     fore the strings may overlap.  On other systems, copying overlapping
     strings may produce surprises.  Programs intended to be portable should
     use memmove(3) when src and dst may overlap.

so the branch taken case for:

bcc PIC_SYM(_C_LABEL(memcpy), PLT)

also deals with overlaps since FreeBSD criteria is
that memcpy does so. (I had been thinking that it
did not deal with such.)


Notably the arm implementation of FreeBSD memcpy does not call
bcopy (that would be recursive in the arm implementation). It
just needs to have some properties that bcopy also has.

This suggests that memcpy vs. bcopy may have a performance
Principle of Least Astonishment violation since memcpy may well
perform differently than bcopy but memcpy is supposed to use
bcopy.
Comment 5 Mark Millard 2017-02-14 03:17:44 UTC
This is notes about the comments involved in the
code involved.

Point 0:

The comments are not explicit about which case
memcpy is supposed to handle vs. which case
flowing through to more memmove case is supposed
to handle.

It is not obvious since FreeBSD defines that both
handle copies with overlapping ranges. (memcpy
is even suppose to use bcopy --but does not
in arm.)

Point 1:

Some comments in the original code below do not
reflect the actual behavior of the code based on
what the arm documentation says about condition
codes. I'm not here indicating which of the two
should be corrected: I'm just noting various
mismatches.

(Earlier 217065 comments quote ARM about what cc
and cs mean: less than vs. greater-than-or-equal
respectively.)

In:

> cmp r0,r1
> RETeq           /* Bail now if src/dst are the same */
> subcc r3, r0, r1 /* if (dst > src) r3 = dst - src */
> subcs r3, r1, r0 /* if (src > dsr) r3 = src - dst */
> cmp r3, r2 /* if (r3 < len) we have an overlap */
> bcc PIC_SYM(_C_LABEL(memcpy), PLT)

the following material illustrate mismatches.

Contraints for dst and src (in r0/r1 terms):

0<=r0<=(0xffffffff-r2)+1
0<=r1<=(0xffffffff-r2)+1

To make a specific example:

r0=0x7fffffe (the dst)
r1=0x7ffffff (the src)
r2=0x2
(which fits the above constraints):

Note: abs(r0-r1)==0x1 and 0x1 < r2
(normal math with abs(r0-r1) being
a non-negative distance).

Since r0<r1:

cmp r0,r1
subcc r3,r0,r1

results in r3==0xffffffff (which is not
the actual distance between the r0 value
and the r1 value).

Note: /* if (dst > src) . . . */
does not match the actual test for cc
from that cmp: less than (r0 being dst
and r1 being src).

subcs r3,r1,r0 is a no-op for r0<r1.

Note: /* if (src > dsr) . . . */
does not match the actual test for cs
from that cmp: greater than or equal
(r0 being dst and r1 being src).

So then:

cmp r3,r2

results in eq=0;cc=0;cs=1 since r3>r2
[unsigned interpretation involved].

bcc PIC_SYM(_C_LABEL(memcpy), PLT)
[branch not taken since r3>r2 but
cc is for r3<r2]

But the comment on that cmp says:

/* if (r3 < len) we have an overlap */

despite r3 not being the (positive) distance
between r0's value and r1's value. This misleads
for how the code actually works currently.

But the bcc does test for r3<r2 so the comment
is correct in that respect but is wrong about
the implication given the prior code. (My
statements assume r2==len, in other words,
n==len.)
Comment 6 Alexandre martins 2017-02-14 08:23:04 UTC
Please note that the code is the same in userland:

https://svnweb.freebsd.org/base/head/lib/libc/arm/string/memmove.S?view=annotate#l57
Comment 7 Mark Millard 2017-02-15 20:27:05 UTC
(In reply to Andrew Turner from comment #1)

Does my interpretation of the arm code
sound right to you given the
content of the arm-documentation quotes
that I supplied (repeated below)?

Gathering the quotes together in a probably
better order for reading them:


From http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0473m/dom1359731162080.html :

C is set in one of the following ways:
. . .
For a subtraction, including the comparison instruction CMP,
C is set to 0 if the subtraction produced a borrow (that is,
an unsigned underflow), and to 1 otherwise.


From http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0473m/dom1359731162080.html :

EQ	Z set	Equal
CC or LO	C clear	Lower (unsigned < )
CS or HS	C set	Higher or same (unsigned >= )


http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0473e/CJAJIHAD.html :

CC	Carry clear	Less than
CS	Carry set	Greater than or equal, or unordered
Comment 8 Alexandre martins 2017-02-23 14:26:39 UTC
Just for notice, Netbsd fixed that code a couple of year ago:

http://cvsweb.netbsd.org/bsdweb.cgi/src/common/lib/libc/arch/arm/string/memmove.S.diff?r1=1.1&r2=1.2

But it still assume that memcpy don't handle overlapping.
Comment 9 andrew 2017-02-28 10:41:05 UTC
The comment in the memcpy manpage about overlaps is dangerously misleading; these days, memcpy() can be a compiler intrinsic (enabled by default in both clang and gcc with -O) and the compiler's version does NOT cope with overlaps.
Comment 10 Mark Millard 2017-02-28 12:28:28 UTC
(In reply to andrew from comment #9)

Wow. If memcpy compiler intrinsics are not (and have never
been) been adjusted by FreeBSD to follow its published
memcpy definition's rules, that would be dangerous for
using system compilers and devel/* and lang/* compilers.
I had assumed FreeBSD enforced its definitions, even in
interfaces implemented by ports.

If complers are left alone for memcpy sematics and the
man page is to be the thing to be fixed relative to
overlapped memory claims and also if FreeBSD's own
directly-used memcpy no longer deals with overlaps
for any TARGET_ARCH, then it would be critical that
bcopy/memmove calling memcpy avoid doing so for memory
overlaps.

In other words: the distance (in absolute value) between
the dst and src would need to be >= the len in the call
to allow to allow memcpy's use by bcopy/memmove.

That would seem to make an update like netbsd's fix
essential for arm64 (more than just being an
optimization).
Comment 11 Mark Millard 2017-02-28 19:49:51 UTC
(In reply to Alexandre martins from comment #8)

It is unfortunate that the netbsd update has an
incorrect comment. Quoting from the linked
material (with . . . for some omitted text):

Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
. . .
  * Copyright (c) 1997 The NetBSD Foundation, Inc.
@@ -53,10 +53,10 @@ ENTRY(bcopy)
 	/* Do the buffers overlap? */
 	cmp	r0, r1
 	RETc(eq)		/* Bail now if src/dst are the same */
-	subcc	r3, r0, r1	/* if (dst > src) r3 = dst - src */
-	subcs	r3, r1, r0	/* if (src > dsr) r3 = src - dst */
-	cmp	r3, r2		/* if (r3 < len) we have an overlap */
-	bcc	PIC_SYM(_C_LABEL(memcpy), PLT)
+	subhs	r3, r0, r1	/* if (dst > src) r3 = dst - src */
+	sublo	r3, r1, r0	/* if (src > dst) r3 = src - dst */
+	cmp	r3, r2		/* if (r3 >= len) we have an overlap */
+	bhs	PIC_SYM(_C_LABEL(memcpy), PLT)
 
 	/* Determine copy direction */
 	cmp	r1, r0


In the netbsd "+" code: r3 at the "cmp r3,r2"
is the distance (in absolute value)
between dst and src (r0 and r1 values).

So in the comment:

/* if (r3 >= len) we have an overlap */

the comment is wrong as written. It should
be more like:

/* if (r3 >= len) we have no overlap */

or even:

/* (abs(dst-src) >= len) is equivalent to
   the status of "there is no overlap" */

So the bhs to memcpy is for the no-overlap case,
as it should be for a classical definition of
memcpy's requirements --despite what the original
comment suggests.
Comment 12 Mark Millard 2017-04-06 22:30:56 UTC
(In reply to Andrew Turner from comment #1)

To be more explicit about netbsd's fix to memmove.S for arm
from back in 2007:

http://mail-index.netbsd.org/netbsd-bugs/2007/06/20/0002.html

reports:

Subject: port-arm/36512: memmove doesn't fall back to memcpy
To: None <port-arm-maintainer@netbsd.org, gnats-admin@netbsd.org,>
From: Hiroki Doshita <doshita@iij.ad.jp>
List: netbsd-bugs
Date: 06/20/2007 01:25:00
>Number:         36512
>Category:       port-arm
>Synopsis:       memmove doesn't fall back to memcpy
>Confidential:   no
>Severity:       non-critical
>Priority:       medium
>Responsible:    port-arm-maintainer
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Wed Jun 20 01:25:00 +0000 2007
>Originator:     Hiroki Doshita
>Release:        NetBSD 3.1
>Organization:
	Internet Initiative Japan, Inc.
>Environment:
Architecture: arm
Machine: armeb
>Description:
	the judgement of buffer overlap in xscale's memmove is wrong.

>How-To-Repeat:
>Fix:
Index: src/lib/libc/arch/arm/string/memmove.S
diff -u src/lib/libc/arch/arm/string/memmove.S:1.2 src/lib/libc/arch/arm/string/memmove.S:1.3
--- src/lib/libc/arch/arm/string/memmove.S:1.2	Mon Nov 17 15:25:34 2003
+++ src/lib/libc/arch/arm/string/memmove.S	Mon Jun 18 18:23:39 2007
@@ -53,10 +53,10 @@
 	/* Do the buffers overlap? */
 	cmp	r0, r1
 	moveq	pc, lr		/* Bail now if src/dst are the same */
-	subcc	r3, r0, r1	/* if (dst > src) r3 = dst - src */
-	subcs	r3, r1, r0	/* if (src > dsr) r3 = src - dst */
-	cmp	r3, r2		/* if (r3 < len) we have an overlap */
-	bcc	PIC_SYM(_C_LABEL(memcpy), PLT)
+	subhs	r3, r0, r1	/* if (dst > src) r3 = dst - src */
+	sublo	r3, r1, r0	/* if (src > dsr) r3 = src - dst */
+	cmp	r3, r2		/* if (r3 >= len) we have no overlap */
+	bhs	PIC_SYM(_C_LABEL(memcpy), PLT)
 
 	/* Determine copy direction */
 	cmp	r1, r0

(My notes Comment #11 about the "if (r3 >= len)" comment in the
above netbsd update applies but it is the code that matters here.)
Comment 13 Mark Millard 2017-04-06 22:44:33 UTC
(In reply to Mark Millard from comment #12)

I was unclear about the code comments:

/* if (r3 >= len) we have no overlap */
vs.
/* if (r3 >= len) we have an overlap */


The first id from:

http://mail-index.netbsd.org/netbsd-bugs/2007/06/20/0002.html

(the original patch) and is correct but the comment:

/* if (r3 >= len) we have an overlap */

as the code was official is wrong.

Somehow "Apply the patch supplied in PR/36512 to fix the
buffer overlap check" did not update the "an" to a "no".

The original patch tends to confirm my analysis of the
comment and Alexandre's analysis of the code (and my
concurrence).
Comment 14 Mark Millard 2017-04-06 23:33:23 UTC
(In reply to Mark Millard from comment #13)

I found were to get to the original netbsd problem report:

https://gnats.netbsd.org/cgi-bin/query-pr-single.pl?number=36512

It shows that netbsd's:

src/sys/lib/libkern/arch/arm/memmove.S

was also involved in the update.