Hello, During the optimization phase of our project on an Armada 380, I have see that an optimization on bcopy/memmove is broken. https://svnweb.freebsd.org/base/head/sys/arm/arm/support.S?view=annotate#l403 The condition on the following line are reserved: > subcc r3, r0, r1 /* if (dst > src) r3 = dst - src */ > subcs r3, r1, r0 /* if (src > dsr) r3 = src - dst */ > cmp r3, r2 /* if (r3 < len) we have an overlap */ > bcc PIC_SYM(_C_LABEL(memcpy), PLT) Should be: > subcs r3, r0, r1 /* if (dst > src) r3 = dst - src */ > subcc r3, r1, r0 /* if (src > dsr) r3 = src - dst */ > cmp r3, r2 /* if (r3 < len) we have an overlap */ > bcs PIC_SYM(_C_LABEL(memcpy), PLT) This code may produce bugs if an unlikely copy of more that 2Gb is done (I don't see how). I have already reported that on the mailling list: https://docs.freebsd.org/cgi/getmsg.cgi?fetch=462567+0+archive/2017/freebsd-arm/20170212.freebsd-arm Best regards, Alexandre Martins
I don't see the bug. The cmp on line 401 performs r0 - r1 (dst - src) storing the condition codes, but not the result. Line 402 will return if the result is zero, i.e. src == dst. Line 403 will execute if the carry bit is clear, i.e. the result is greater than or equal to zero, so r0 >= r1 (dst >= src). Line 404 will execute if the carry bit is set, i.e. the result is less than zero, so r0 < r1 (src > dst). The cmp on line 405 checks r3 - r2 (delta - length) Line 406 will use memcpy if delta >= length, i.e. they are not overlapping. This code should never be used for copying more than 2GB. on 32-bit ARM the kernel is limited to 1GB, and this should never be used to read userspace memory.
(In reply to Andrew Turner from comment #1) > Line 403 will execute if the carry bit is clear, i.e. the result is greater than or equal to zero, so r0 >= r1 (dst >= src). > Line 404 will execute if the carry bit is set, i.e. the result is less than zero, so r0 < r1 (src > dst). That is not what I get from the arm documentation. . . http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0473e/CJAJIHAD.html CC Carry clear Less than CS Carry set Greater than or equal, or unordered line instruction 403 subcc r3, r0, r1 /* if (dst > src) r3 = dst - src */ 404 subcs r3, r1, r0 /* if (src > dsr) r3 = src - dst */ Line 403 has a less than test to enable its subtraction, not a "greater than or equal to" test. Line 404 has a greater than or equal test to enable its subtraction, not a less than test. However there is a difference in the original and proposed code for when r3==r2: the proposed code would jump to memcpy but the original code would not. This might bias the implementation. If so then the comments should be of a form that explains why the code is as it is (not sticking normal non-negative arithmetic and depending on 2's complement properties with test reversals based on those properties).
An example of the code's activity where the proposed and original to not match, but first various quotes about the arm condition codes: (from http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0473m/dom1359731162080.html ) C is set in one of the following ways: . . . For a subtraction, including the comparison instruction CMP, C is set to 0 if the subtraction produced a borrow (that is, an unsigned underflow), and to 1 otherwise. (from http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0473m/dom1359731162080.html ) EQ Z set Equal CC or LO C clear Lower (unsigned < ) CS or HS C set Higher or same (unsigned >= ) For the original code(small values for simplicity): r0=0x10 r1=0x20 r2=0x10 NOTE: abs(r0-r1) == r2 so the distance between is equal to n (i.e. r2). But. . . cmp r0,r1 gives: eq= (r0==r1) (i.e., zero) cc= (r0<r1) (i.e., one) cs= (r0>=r1) (i.e., zero) subcc r3,r0,r1 gives r3= 0x10-0x20 (bit pattern for 2's complement signed result) (no condition code change) subcs r3,r1,r0 is a no-op. r3 is now large as an unsigned value cmp r3,r2 gives: eq= (r3==r2) (i.e., zero) cc= (r3<r2) (i.e., zero) cs= (r3>=r2) (i.e., one) bcc PIC_SYM(_C_LABEL(memcpy), PLT): does not branch. By contrast the proposed code would: same cmp r0,r1 result subcs r3,r0,r1 is a no-op subcc r3,r1,r0 gives r3= 0x20-0x10 (i.e., 0x10) (no condition code change) cmp r3,r2 gives: eq= (r3==r2) (i.e., one) cc= (r3<r2) (i.e., zero) cs= (r3>=r2) (i.e., one) bcs PIC_SYM(_C_LABEL(memcpy), PLT): branches
In the list I claimed that there were cases of picking the wrong routine relative to the correctness of the copy. I was wrong, possibly for multiple reasons but the following is sufficient: https://www.freebsd.org/cgi/man.cgi?query=memcpy&sektion=3 says: In this implementation memcpy() is implemented using bcopy(3), and there- fore the strings may overlap. On other systems, copying overlapping strings may produce surprises. Programs intended to be portable should use memmove(3) when src and dst may overlap. so the branch taken case for: bcc PIC_SYM(_C_LABEL(memcpy), PLT) also deals with overlaps since FreeBSD criteria is that memcpy does so. (I had been thinking that it did not deal with such.) Notably the arm implementation of FreeBSD memcpy does not call bcopy (that would be recursive in the arm implementation). It just needs to have some properties that bcopy also has. This suggests that memcpy vs. bcopy may have a performance Principle of Least Astonishment violation since memcpy may well perform differently than bcopy but memcpy is supposed to use bcopy.
This is notes about the comments involved in the code involved. Point 0: The comments are not explicit about which case memcpy is supposed to handle vs. which case flowing through to more memmove case is supposed to handle. It is not obvious since FreeBSD defines that both handle copies with overlapping ranges. (memcpy is even suppose to use bcopy --but does not in arm.) Point 1: Some comments in the original code below do not reflect the actual behavior of the code based on what the arm documentation says about condition codes. I'm not here indicating which of the two should be corrected: I'm just noting various mismatches. (Earlier 217065 comments quote ARM about what cc and cs mean: less than vs. greater-than-or-equal respectively.) In: > cmp r0,r1 > RETeq /* Bail now if src/dst are the same */ > subcc r3, r0, r1 /* if (dst > src) r3 = dst - src */ > subcs r3, r1, r0 /* if (src > dsr) r3 = src - dst */ > cmp r3, r2 /* if (r3 < len) we have an overlap */ > bcc PIC_SYM(_C_LABEL(memcpy), PLT) the following material illustrate mismatches. Contraints for dst and src (in r0/r1 terms): 0<=r0<=(0xffffffff-r2)+1 0<=r1<=(0xffffffff-r2)+1 To make a specific example: r0=0x7fffffe (the dst) r1=0x7ffffff (the src) r2=0x2 (which fits the above constraints): Note: abs(r0-r1)==0x1 and 0x1 < r2 (normal math with abs(r0-r1) being a non-negative distance). Since r0<r1: cmp r0,r1 subcc r3,r0,r1 results in r3==0xffffffff (which is not the actual distance between the r0 value and the r1 value). Note: /* if (dst > src) . . . */ does not match the actual test for cc from that cmp: less than (r0 being dst and r1 being src). subcs r3,r1,r0 is a no-op for r0<r1. Note: /* if (src > dsr) . . . */ does not match the actual test for cs from that cmp: greater than or equal (r0 being dst and r1 being src). So then: cmp r3,r2 results in eq=0;cc=0;cs=1 since r3>r2 [unsigned interpretation involved]. bcc PIC_SYM(_C_LABEL(memcpy), PLT) [branch not taken since r3>r2 but cc is for r3<r2] But the comment on that cmp says: /* if (r3 < len) we have an overlap */ despite r3 not being the (positive) distance between r0's value and r1's value. This misleads for how the code actually works currently. But the bcc does test for r3<r2 so the comment is correct in that respect but is wrong about the implication given the prior code. (My statements assume r2==len, in other words, n==len.)
Please note that the code is the same in userland: https://svnweb.freebsd.org/base/head/lib/libc/arm/string/memmove.S?view=annotate#l57
(In reply to Andrew Turner from comment #1) Does my interpretation of the arm code sound right to you given the content of the arm-documentation quotes that I supplied (repeated below)? Gathering the quotes together in a probably better order for reading them: From http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0473m/dom1359731162080.html : C is set in one of the following ways: . . . For a subtraction, including the comparison instruction CMP, C is set to 0 if the subtraction produced a borrow (that is, an unsigned underflow), and to 1 otherwise. From http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0473m/dom1359731162080.html : EQ Z set Equal CC or LO C clear Lower (unsigned < ) CS or HS C set Higher or same (unsigned >= ) http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0473e/CJAJIHAD.html : CC Carry clear Less than CS Carry set Greater than or equal, or unordered
Just for notice, Netbsd fixed that code a couple of year ago: http://cvsweb.netbsd.org/bsdweb.cgi/src/common/lib/libc/arch/arm/string/memmove.S.diff?r1=1.1&r2=1.2 But it still assume that memcpy don't handle overlapping.
The comment in the memcpy manpage about overlaps is dangerously misleading; these days, memcpy() can be a compiler intrinsic (enabled by default in both clang and gcc with -O) and the compiler's version does NOT cope with overlaps.
(In reply to andrew from comment #9) Wow. If memcpy compiler intrinsics are not (and have never been) been adjusted by FreeBSD to follow its published memcpy definition's rules, that would be dangerous for using system compilers and devel/* and lang/* compilers. I had assumed FreeBSD enforced its definitions, even in interfaces implemented by ports. If complers are left alone for memcpy sematics and the man page is to be the thing to be fixed relative to overlapped memory claims and also if FreeBSD's own directly-used memcpy no longer deals with overlaps for any TARGET_ARCH, then it would be critical that bcopy/memmove calling memcpy avoid doing so for memory overlaps. In other words: the distance (in absolute value) between the dst and src would need to be >= the len in the call to allow to allow memcpy's use by bcopy/memmove. That would seem to make an update like netbsd's fix essential for arm64 (more than just being an optimization).
(In reply to Alexandre martins from comment #8) It is unfortunate that the netbsd update has an incorrect comment. Quoting from the linked material (with . . . for some omitted text): Please note that diffs are not public domain; they are subject to the copyright notices on the relevant files. . . . * Copyright (c) 1997 The NetBSD Foundation, Inc. @@ -53,10 +53,10 @@ ENTRY(bcopy) /* Do the buffers overlap? */ cmp r0, r1 RETc(eq) /* Bail now if src/dst are the same */ - subcc r3, r0, r1 /* if (dst > src) r3 = dst - src */ - subcs r3, r1, r0 /* if (src > dsr) r3 = src - dst */ - cmp r3, r2 /* if (r3 < len) we have an overlap */ - bcc PIC_SYM(_C_LABEL(memcpy), PLT) + subhs r3, r0, r1 /* if (dst > src) r3 = dst - src */ + sublo r3, r1, r0 /* if (src > dst) r3 = src - dst */ + cmp r3, r2 /* if (r3 >= len) we have an overlap */ + bhs PIC_SYM(_C_LABEL(memcpy), PLT) /* Determine copy direction */ cmp r1, r0 In the netbsd "+" code: r3 at the "cmp r3,r2" is the distance (in absolute value) between dst and src (r0 and r1 values). So in the comment: /* if (r3 >= len) we have an overlap */ the comment is wrong as written. It should be more like: /* if (r3 >= len) we have no overlap */ or even: /* (abs(dst-src) >= len) is equivalent to the status of "there is no overlap" */ So the bhs to memcpy is for the no-overlap case, as it should be for a classical definition of memcpy's requirements --despite what the original comment suggests.
(In reply to Andrew Turner from comment #1) To be more explicit about netbsd's fix to memmove.S for arm from back in 2007: http://mail-index.netbsd.org/netbsd-bugs/2007/06/20/0002.html reports: Subject: port-arm/36512: memmove doesn't fall back to memcpy To: None <port-arm-maintainer@netbsd.org, gnats-admin@netbsd.org,> From: Hiroki Doshita <doshita@iij.ad.jp> List: netbsd-bugs Date: 06/20/2007 01:25:00 >Number: 36512 >Category: port-arm >Synopsis: memmove doesn't fall back to memcpy >Confidential: no >Severity: non-critical >Priority: medium >Responsible: port-arm-maintainer >State: open >Class: sw-bug >Submitter-Id: net >Arrival-Date: Wed Jun 20 01:25:00 +0000 2007 >Originator: Hiroki Doshita >Release: NetBSD 3.1 >Organization: Internet Initiative Japan, Inc. >Environment: Architecture: arm Machine: armeb >Description: the judgement of buffer overlap in xscale's memmove is wrong. >How-To-Repeat: >Fix: Index: src/lib/libc/arch/arm/string/memmove.S diff -u src/lib/libc/arch/arm/string/memmove.S:1.2 src/lib/libc/arch/arm/string/memmove.S:1.3 --- src/lib/libc/arch/arm/string/memmove.S:1.2 Mon Nov 17 15:25:34 2003 +++ src/lib/libc/arch/arm/string/memmove.S Mon Jun 18 18:23:39 2007 @@ -53,10 +53,10 @@ /* Do the buffers overlap? */ cmp r0, r1 moveq pc, lr /* Bail now if src/dst are the same */ - subcc r3, r0, r1 /* if (dst > src) r3 = dst - src */ - subcs r3, r1, r0 /* if (src > dsr) r3 = src - dst */ - cmp r3, r2 /* if (r3 < len) we have an overlap */ - bcc PIC_SYM(_C_LABEL(memcpy), PLT) + subhs r3, r0, r1 /* if (dst > src) r3 = dst - src */ + sublo r3, r1, r0 /* if (src > dsr) r3 = src - dst */ + cmp r3, r2 /* if (r3 >= len) we have no overlap */ + bhs PIC_SYM(_C_LABEL(memcpy), PLT) /* Determine copy direction */ cmp r1, r0 (My notes Comment #11 about the "if (r3 >= len)" comment in the above netbsd update applies but it is the code that matters here.)
(In reply to Mark Millard from comment #12) I was unclear about the code comments: /* if (r3 >= len) we have no overlap */ vs. /* if (r3 >= len) we have an overlap */ The first id from: http://mail-index.netbsd.org/netbsd-bugs/2007/06/20/0002.html (the original patch) and is correct but the comment: /* if (r3 >= len) we have an overlap */ as the code was official is wrong. Somehow "Apply the patch supplied in PR/36512 to fix the buffer overlap check" did not update the "an" to a "no". The original patch tends to confirm my analysis of the comment and Alexandre's analysis of the code (and my concurrence).
(In reply to Mark Millard from comment #13) I found were to get to the original netbsd problem report: https://gnats.netbsd.org/cgi-bin/query-pr-single.pl?number=36512 It shows that netbsd's: src/sys/lib/libkern/arch/arm/memmove.S was also involved in the update.