Summary: | ARM VFP/NEON regs not preserved in signal delivery | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | Base System | Reporter: | Andrew "RhodiumToad" Gierth <andrew> | ||||||
Component: | arm | Assignee: | Michal Meloun <mmel> | ||||||
Status: | Closed FIXED | ||||||||
Severity: | Affects Many People | CC: | emaste, marklmi26-fbsd, mmel, otacilio.neto, pstef, sylvain | ||||||
Priority: | --- | ||||||||
Version: | 11.0-STABLE | ||||||||
Hardware: | arm | ||||||||
OS: | Any | ||||||||
Attachments: |
|
(In reply to andrew from comment #0) FYI: # diff clang_armv7a_macros.txt clang_cortex_a7_macros.txt | more 7a8 > #define __ARM_ARCH_EXT_IDIV__ 1 13a15,16 > #define __ARM_FEATURE_FMA 1 > #define __ARM_FEATURE_IDIV 1 18c21 < #define __ARM_FP 0xC --- > #define __ARM_FP 0xE 22c25 < #define __ARM_NEON_FP 0x4 --- > #define __ARM_NEON_FP 0x6 28c31 < #define __ARM_VFPV3__ 1 --- > #define __ARM_VFPV4__ 1 So both have NEON and VFP and FP, but not exactly matching variations. # diff clang_default_macros.txt clang_armv7a_macros.txt | more 6,7c6,7 < #define __ARM_ARCH 6 < #define __ARM_ARCH_6KZ__ 1 --- > #define __ARM_ARCH 7 > #define __ARM_ARCH_7A__ 1 9c9,10 < #define __ARM_ARCH_ISA_THUMB 1 --- > #define __ARM_ARCH_ISA_THUMB 2 > #define __ARM_ARCH_PROFILE 'A' 13c14 < #define __ARM_FEATURE_LDREX 0x4 --- > #define __ARM_FEATURE_LDREX 0xF 19a21,23 > #define __ARM_NEON 1 > #define __ARM_NEON_FP 0x4 > #define __ARM_NEON__ 1 24c28 < #define __ARM_VFPV2__ 1 --- > #define __ARM_VFPV3__ 1 So armv6 and -march=armv7a have the same FP status. But armv6 has yet a different VFP version and no NEON at all. Note that -march=armv7r (not listed) does not have FP, VFP, or NEON: less than armv6 ni these areas. The same goes for -march=armv7m . Technically FreeBSD could declare that no option that would have the compiler use NEON is supported. Otherwise FreeBSD needs to be able to be built with NEON support, possibly auto-adjusting to the processor if enabled. Similarly for building ports or other code. Similar points go for VFP and for FP. It might get more detailed for versions/variations in NEON if more or less needs to be saved/restored. Similarly for VFP and for FP. Again: what the compilers might put to use matters. Uh, I think you're missing the point. Signal delivery isn't preserving ANY floating-point/vector registers on ANY vfp version; the test program fails when compiled with no cpu or arch options at all. Most programs don't do floating-point in signal handlers so this isn't immediately obvious. The use of -mcpu=cortex-a7 or similar option just broadens the problem to include every program that might have vector operations in a signal handler (which includes all programs with signal handlers if linked with libthr) as well as in the main program. Support for preserving the full vfp/neon register set over context switches is already present, it's simply that whoever added that didn't follow through on the requirement to preserve them for signals too. Worse, the mcontext_t structure -- which is user-visible, and used by libthr -- isn't even big enough to hold the registers, so fixing the problem requires breaking the ABI. (In reply to andrew from comment #2) I do not see how what I wrote contradicts your original or new material. I'd guess that FreeBSD will at some point choose to support VFP use and NEON use in some form for processors that have them. But for now the supported context requires that libthr (or its signal handling) be built without use of such. Similarly for any other signal-handling code. The notes about the required ABI change were a good thing to be explicit about as well. stable/11 may never get such since after release ABI changes are avoided. Fixing 11 by changing its ABI likely would require a general debate before potential approval. Thus stable/11 may literally never officially allow builds where libthr uses NEON in its signal handlers, for example. Nor openssl using NEON for another example. head (12) likely will eventually support such, or that would be my guess. (In reply to Mark Millard from comment #3) You're still missing the point. This isn't about NEON except tangentially, and it isn't even about compiling openssl or libthr with cortex-a7 or other CPU types any more. It's about the entire floating-point support being incomplete. The ABI *REQUIRES* use of the floating-point hardware (that's the "hf" at the end of "gnueabihf" - stands for "hard float", i.e. FP args go in FP registers). This isn't something that's just an option that programs can do without. (It might be possible to do a non-ABI-breaking fix for 11 by preserving d0-d15; that would work for programs that don't use -mcpu or other options that enable vfp3 or neon usage.) (In reply to andrew from comment #4) I'm still unclear where I've contradicted you up to this point. I think its more that I'm making a somewhat different set of points. NEON and its registers are not limited to floating point contexts. https://www.arm.com/products/processors/technologies/neon.php lists: It has 32 registers, 64-bits wide (dual view as 16 registers, 128-bits wide. . . . Data types can be: signed/unsigned 8-bit, 16-bit, 32-bit, 64-bit, single precision floating point And http://lists.llvm.org/pipermail/llvm-dev/2016-March/097613.html (2016-Mar-25) indicates that llvm/clang was intended to eventually have: > Examples: > > Works today: > -mfpu=soft -> Int (ALU), FP (LIB), no VFP/NEON instructions > -mfpu=softfp -> Int (ALU), FP (LIB), VFP/NEON instructions allowed > -mfpu=vfp -> Int (ALU), FP (VFP) > -mfpu=neon -> Int (NEON), FP (NEON) > > Change proposed: > -mfpmath=neon -mfpu=vfp -> Int (ALU), FP (NEON) > -mfpmath=vfp -mfpu=neon -> Int (NEON), FP (VFP) Thus even code not doing any floating point activity could be using NEON registers and have problems with signals not preserving those register values --even if no running code was using floating point in any process (unlikely combination). I think that this means that NEON deserves its own mention, not just tied to VFP or other floating point contexts. This is in addition to the floating point related points, not instead of them. Side note: For floating point use NEON is not IEEE-754 compliant but VFP is. Whatever. I have code under test now that seems to be working (with the world built for cortex-a7); though I need to dig around to find if I missed anything. I'll post the patch later. Created attachment 180669 [details]
patch I'm testing (breaks ABI)
Changing the size of mcontext_t changes the size of ucontext_t used by libthr, so everything linked with libthr (which notably includes sshd) will crash in the event of mismatch.
Both openssl speed and git clone now succeed when everything is built with CPUTYPE=cortex-a7, and the test program succeeds regardless of compilation flags; but I'm no expert on ARM (been playing with it only a few weeks) so it's quite likely I've overlooked something.
(In reply to andrew from comment #7) I expect that this sort of code should have a phabricator review (or equivalent) with FreeBSD and arm experts invited. One thing that I notice is that the Procedure Call Standard indicates that at calls to public interfaces (such as a signal handler) needs the fpscr length and stride fields to have been cleared. An implication is that when the signal handler returns the fpscr value needs to be restored, including the length and stride fields. As a cross check I looked at linux code and it does these things. It also showed that there is also code present for fpexc, fpinst and fpinst2. I'll avoid quoting code for fear of pollution. The VFP itself is temporarily disabled "to avoid corrupting new thread state" before some activity and re-enabled after that activity for the restore-user-hw-state code. fpexc has 3 or so bits manipulated. There is some (cpu specific) fpexc code on the preserve-user-clear-hw-state side of things as well. There is also fpinst and fpinst2 code. (In reply to Mark Millard from comment #8) I'm not touching fpexc and fpinst* because those are privileged, and therefore shouldn't be saved and restored via a user-controlled context. If anything from them does actually need to be preserved -- which I honestly don't know: fpexc seems to be used only to disable or enable the VFP, which requires no special attention here, and fpinst* seem to be related to exception handling, which would explain why they need to be preserved over kernel preemptions but wouldn't require them to be handled here -- then it would have to be done in some other way than storing them in user space. fpscr is saved and restored by the patch (though I admit to not having actually tested that part yet). (Incidentally, arm64 appears to have a different bug: it's not saving the fpcontext in get_mcontext/set_mcontext, but it is saving it in the signal code. This also isn't kosher, because it means that user-space getcontext(3)/swapcontext(3) etc. do not deliver all of their specified behavior, which may affect libthr) (In reply to Mark Millard from comment #8) This note may well be outside what FreeBSD chooses to target or be ready for since there are no existing examples as far as I can tell. Still I note it explicitly. http://infocenter.arm.com/help/topic/com.arm.doc.dht0002a/DHT0002A_introducing_neon.pdf says: There is no architectural requirement for a processor to implement both VFP and NEON, but the common features in the programmers models for these extensions mean an operating system that supports VFP requires little or no modifications to also support NEON. There is also: http://infocenter.arm.com/help/topic/com.arm.doc.ihi0053d/IHI0053D_acle_2_1.pdf __ARM_NEON is defined to a value indicating the Advanced SIMD (NEON) architecture supported. The only current value is 1. In principle, for AArch32, the NEON architecture can exist in an integer-only version. To test for the presence of NEON floating-point vector instructions, test __ARM_NEON_FP. When NEON does occur in an integer-only version, the VFP scalar instruction set is also not present. My guess is that FreeBSD would assume that NEON being present means that VFP is as well, expecting that Integer NEON by itself is unlikely to occur. (In reply to andrew from comment #9) You may well be right. I do not claim to understand all the justifications/requirements driving everything Linux does. I just note that it does it in case it prompts something about what does need to be done. fpexp usage for signals is involved in: A) disabling VFP and later enabling it B) invalidating fpinst2 C) clearing the fp exception flag for restore-user-hw-state (so on return from the signal handler). I would have guessed that the exception flag is to be made to appear to be per-thread (or per-process) specific and that a signal handler that causes the fp exception flag to be set should not have the flag value be visible as a side effect on other threads or processes. But I could be guessing wrong about what the fp exception flag indicates or its intended scope. (In reply to Mark Millard from comment #11) Looks like fpexc is processor family specific. For arm11 there are exception handling requirements. . . http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0290g/Ccdhcfga.html says for arm11: [31] EX Exception flag. When EX is set, the VFP11 coprocessor is in the exceptional state. EX must be cleared by the exception handling routine. [28] FP2V FPINST2 instruction valid flag. Set when FPINST2 contains a valid instruction. FP2V must be cleared by the exception handling routine. I guess this is why linux has such code. (In reply to Mark Millard from comment #12) [I should have looked this up as part of making the prior note:] http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0290g/Ccdhcfga.html lists arm11 as an armv6. So for FreeBSD to cover armv6's VFP it looks like the exception handling would be required to clear those fpexc fields (EX and FP2V). (In reply to Mark Millard from comment #13) Another piece of that document's text says: You must save and restore the FPEXC register whenever changing the context. If the EX flag, FPEXC[31], is set, then the VFP11 coprocessor is in the exceptional state, and you must also save and restore the FPINST and FPINST2 registers. You can write the context switch code to determine from the EX flag which registers to save and restore or to simply save all three. As I understand it signal delivery and return would be considered a context switch for the above. (But I'm no expert. Linux did bother to.) (In reply to Mark Millard from comment #11) The existing exception handling so far handles exactly two cases: 1. The VFP is disabled because the current thread might not own the hardware, in which case the exception handler clears the exception, restores the registers from the pcb to the hardware if necessary, and turns the VFP on; 2. The VFP is enabled, in which case the exception handler clears the exception and sends SIGFPE to the process In either event, the exception must be cleared before we return to user mode, and it's at that point that signal delivery happens. Hence, no need to preserve the exception state. The code in vfp.c that preserves fpexc/fpinst/fpinst2 has to do so because of the possibility that the thread gets preempted inside the kernel while handling an exception (i.e. before reaching the critical_enter() in vfp_bounce), This doesn't apply to signal delivery into user space. So unless I'm badly misunderstanding the existing code, what I've done should be sufficient. It might not be optimal - it would perhaps make sense to have (as other targets do) a flag to indicate that the VFP state is present, and use that to decide whether to restore the state to the registers rather than just to the pcb; my version currently always returns from signals with the VFP off and the state in the pcb rather than the hardware. (In reply to Mark Millard from comment #14) http://liris.cnrs.fr/~mmrissa/lib/exe/fetch.php?media=armv7-a-r-manual.pdf indicates that much of the FPEXC content and related things are "VFP subarchitecture defined" for armv7-a and armv7-r: they are not part of the armv7 architecture itself. So I appears to me that FreeBSD must decide what VFP subarchitectures it is going to support automatically --and what versions relative to the "Common VFP Subarchitecture". May be only "Common VFP Subarchitecture" versions are to be supported. [Note that Cortex-A8 has only one bit used but arm11 (an armv6) has many used.] The document says as well: Appendix F Common VFP Subarchitecture Specification Defines version 2 of the Common VFP Subarchitecture. Note This specification is not part of the ARM architecture specification. This sub-architectural information is included here as supplementary information, for the convenience of developers and users who might require this information. . . . Appendix F Common VFP Subarchitecture Specification includes examples of how a Floating-point subarchitecture might define additional registers, in the SUBARCHITECTURE DEFINED register space using addresses in the 0b1001 to 0b1111 range. Appendix F is not part of the ARMv7 architecture. It is included as an example of how a Floating-point subarchitecture might be defined. . . . [presuming a Common VFP Subarchitecture context:] . . . Floating-point support code A complete Floating-point implementation might require a software component, called the support code. For example, if an implementation includes VFPv3U or VFPv4U, support code must handle the trapped floating-point exceptions. The interface to the support code is called the VFP subarchitecture. ARM has defined a subarchitecture that is suitable for use with implementations of the ARM Floating-point Extension, see Appendix F Common VFP Subarchitecture Specification. Context switch software can be written to always save and restore the subarchitecture registers. In this case appropriate context switch software must be chosen based on the registers implemented, using the detection mechanism described in Detecting which VFP Common subarchitecture registers are implemented on page AppxF-2445. I confirm that this occurs on beaglebone black. % ./fptst breakme entering... h0: actual=-807739214198814.750000 expected=50000000.000000 h1: actual=-807739214198813.750000 expected=50000001.000000 h2: actual=-807739230607542.875000 expected=50000002.000000 h3: actual=-807739230607540.000000 expected=50000003.000000 h4: actual=-807739247016269.125000 expected=50000004.000000 h5: actual=-807739247016266.250000 expected=50000005.000000 h6: actual=-807739263424995.375000 expected=50000006.000000 h7: actual=-807739263424993.375000 expected=50000007.000000 A commit references this bug: Author: mmel Date: Sun Mar 26 08:36:56 UTC 2017 New revision: 315974 URL: https://svnweb.freebsd.org/changeset/base/315974 Log: Preserve VFP state across signal delivery. We don't have enouch space to store full VFP context within mcontext stucture. Due to this: - follow i386/amd64 way and store VFP state outside of the mcontext_t but point to it. Use the size of VFP state structure as an 'magic' indicator of the saved VFP state presence. - teach set_mcontext() about this external storage. - for signal delivery, store VFP state to expanded 'struct sigframe'. Submited by: Andrew Gierth (initial version) PR: 217611 MFC after: 2 weeks Changes: head/lib/libthread_db/arch/arm/libpthread_md.c head/sys/arm/arm/machdep.c head/sys/arm/include/frame.h head/sys/arm/include/ucontext.h (In reply to commit-hook from comment #18) Does the struct fpreg problem that was mentioned on the list need a separate bugzilla report? I remind what I ran into when I looked around based on the reference. (The original reference just named the struct.) On 2017-Mar-12, at 12:17 AM, Michal Meloun <melounmichal at gmail.com> wrote: The struct fpreg is also wrong and I'm not sure if or how we can to fix this in compatible way. And I later wrote: Looking up some details shows sys/arm/include/reg.h with: struct fpreg { unsigned int fpr_fpsr; fp_reg_t fpr[8]; }; [Covers only 8 floating point registers?] and shows sys/arm/include/fp.h with: typedef struct fp_extended_precision { u_int32_t fp_exponent; u_int32_t fp_mantissa_hi; u_int32_t fp_mantissa_lo; } fp_extended_precision_t; typedef struct fp_extended_precision fp_reg_t; . . . /* * Type for a saved FP context, if we want to translate the context to a * user-readable form */ typedef struct { u_int32_t fpsr; fp_extended_precision_t regs[8]; } fp_state_t; So each of: struct fpreg fp_state_t has room for 8 instances of 96 bits (beyond fpsr), not sufficient for 32 double precision (i.e., 64-bit) registers. The arm code also has: # grep -r "\<fpreg\>" /usr/src/sys/arm/ | more /usr/src/sys/arm/arm/machdep.c:fill_fpregs(struct thread *td, struct fpreg *regs) /usr/src/sys/arm/arm/machdep.c:set_fpregs(struct thread *td, struct fpreg *regs) /usr/src/sys/arm/include/reg.h:struct fpreg { /usr/src/sys/arm/include/reg.h:int fill_fpregs(struct thread *, struct fpreg *); /usr/src/sys/arm/include/reg.h:int set_fpregs(struct thread *, struct fpreg *); And the system has: /usr/src/sys/sys/procfs.h:typedef struct fpreg fpregset_t; /usr/src/sys/sys/procfs.h: size_t pr_fpregsetsz; /* sizeof(fpregset_t) (1) */ /usr/src/sys/sys/procfs.h:typedef fpregset_t prfpregset_t; /usr/src/sys/sys/ptrace.h:struct fpreg; /usr/src/sys/sys/ptrace.h:int proc_read_fpregs(struct thread *_td, struct fpreg *_fpreg); /usr/src/sys/sys/ptrace.h:int proc_write_fpregs(struct thread *_td, struct fpreg *_fpreg); and: /usr/src/sys/kern/sys_process.c: * Ptrace doesn't support fpregs at all, and there are no security holes /usr/src/sys/kern/sys_process.c: * or translations for fpregs, so we can just copy them. /usr/src/sys/kern/sys_process.c:proc_read_fpregs(struct thread *td, struct fpreg *fpregs) /usr/src/sys/kern/sys_process.c: PROC_ACTION(fill_fpregs(td, fpregs)); /usr/src/sys/kern/sys_process.c:proc_write_fpregs(struct thread *td, struct fpreg *fpregs) /usr/src/sys/kern/sys_process.c: PROC_ACTION(set_fpregs(td, fpregs)); /usr/src/sys/kern/sys_process.c: struct fpreg fpreg; /usr/src/sys/kern/sys_process.c: error = COPYIN(uap->addr, &r.fpreg, sizeof r.fpreg); /usr/src/sys/kern/sys_process.c: error = COPYOUT(&r.fpreg, uap->addr, sizeof r.fpreg); /usr/src/sys/kern/sys_process.c: error = PROC_WRITE(fpregs, td2, addr); /usr/src/sys/kern/sys_process.c: error = PROC_READ(fpregs, td2, addr); and there is use of some of the above in: /usr/src/sys/kern/sys_process.c: * proc_read_fpregs, proc_write_fpregs /usr/src/sys/kern/sys_process.c:proc_read_fpregs(struct thread *td, struct fpreg *fpregs) /usr/src/sys/kern/sys_process.c: PROC_ACTION(fill_fpregs(td, fpregs)); /usr/src/sys/kern/sys_process.c:proc_write_fpregs(struct thread *td, struct fpreg *fpregs) /usr/src/sys/kern/sys_process.c: PROC_ACTION(set_fpregs(td, fpregs)); It appears that fp_state_t is unused in /usr/src/sys/ . (In reply to Mark Millard from comment #19) No, it doesn't need a separate bug report. I only want to solve this puzzle step-by-step. At this time, I have prepared fix for pthread library (libthread) -> https://github.com/strejda/tegra/commit/fd7dce067adaef48fbd4ea37b7507f14c112fd4a but it needs more testing. The struct fpreg related problem is slightly more complicated and cannot be solved (probably) without ABI breakage. Fortunately, only real out-of-kernel consumer of struct fpreg is old KSE library (libthread_db), but I have not analyzed ABI change impact for it (and I'm not sure if it's used at all). Michal (In reply to Michal Meloun from comment #20) The pthread library fix is obviously inadequate since it is hardcoding the structure size in userspace. A commit references this bug: Author: mmel Date: Mon Oct 16 12:53:55 UTC 2017 New revision: 324660 URL: https://svnweb.freebsd.org/changeset/base/324660 Log: Save VFP state in getcontext(3) on ARM. This is a last followup of r315974, which fixes userland part of VFP save/restore problems described in PR 217611. PR: 217611 MFC after: 2 weeks Changes: head/lib/libc/arm/gen/Makefile.inc head/lib/libc/arm/gen/getcontextx.c head/sys/arm/arm/machdep.c head/sys/arm/arm/sys_machdep.c head/sys/arm/include/machdep.h head/sys/arm/include/sysarch.h |
Created attachment 180600 [details] FP registers vs. signals Signal delivery to process is not saving/restoring the VFP/NEON registers. I found this on stable/11, but it also exists on CURRENT. I tested on an RPI2 but the problem clearly exists on other ARM platforms. The bug is obvious by inspection, but I attach an example test program. (If you try this on a much faster platform than cortex-a7 @900MHz, make sure you increase the loop count to ensure runtime exceeds 1 second.) cc fptst.c -o fptst ./fptst entering... h0: actual=10000000.000000 expected=10000000.000000 h1: actual=10000001.000000 expected=10000001.000000 h2: actual=10000002.000000 expected=10000002.000000 h3: actual=10000003.000000 expected=10000003.000000 h4: actual=10000004.000000 expected=10000004.000000 h5: actual=10000005.000000 expected=10000005.000000 h6: actual=10000006.000000 expected=10000006.000000 h7: actual=10000007.000000 expected=10000007.000000 ./fptst breakme entering... h0: actual=-18199474333245.414062 expected=10000000.000000 h1: actual=-18199474333245.414062 expected=10000001.000000 h2: actual=-18199476796269.414062 expected=10000002.000000 h3: actual=-18199476796271.312500 expected=10000003.000000 h4: actual=-18199479259296.312500 expected=10000004.000000 h5: actual=-18199479259295.312500 expected=10000005.000000 h6: actual=-18199481722320.312500 expected=10000006.000000 h7: actual=-18199481722319.312500 expected=10000007.000000 (actual results will vary)