Summary: | VNET and DPCPU defined in modules panic on use on arm64 after LLVM 14 import due to new linker relaxation | ||||||
---|---|---|---|---|---|---|---|
Product: | Base System | Reporter: | Marek Zarychta <zarychtam> | ||||
Component: | kern | Assignee: | freebsd-toolchain (Nobody) <toolchain> | ||||
Status: | Closed FIXED | ||||||
Severity: | Affects Many People | CC: | chris, dch, dim, emaste, jfc, jrtc27, markj, net, toolchain, tuexen, zlei | ||||
Priority: | --- | Keywords: | crash, needs-qa | ||||
Version: | 13.2-RELEASE | Flags: | koobs:
maintainer-feedback?
(toolchain) koobs: mfc-stable13? koobs: mfc-stable12? |
||||
Hardware: | arm64 | ||||||
OS: | Any | ||||||
See Also: |
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=264021 https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=264115 |
||||||
Bug Depends on: | |||||||
Bug Blocks: | 271607 | ||||||
Attachments: |
|
I can reproduce the issue, will look into it. The panic happens on arm64, but not amd64. It does happen when using clang14 (most recent version in the main tree), it does not happen when using clang13. I also does not happen using clang14 when forcing htcp_recalc_beta() not to be inlined. The panic happens when accessing V_htcp_adaptive_backoff in https://cgit.freebsd.org/src/tree/sys/netinet/cc/cc_htcp.c#n471 I disassembled htcp_recalc_beta() when using clang14 and the function not being inlined. This is the relevant code: (kgdb) disassemble htcp_recalc_beta Dump of assembler code for function htcp_recalc_beta: 0x00000000000113cc <+0>: stp x29, x30, [sp, #-16]! 0x00000000000113d0 <+4>: mov x29, sp 0x00000000000113d4 <+8>: ldr x8, [x0] ; x8 = ccv 0x00000000000113d8 <+12>: ldr x9, [x18] ; x9 = curthread 0x00000000000113dc <+16>: adrp x10, 0x21000 ; x10 = ??? 0x00000000000113e0 <+20>: ldr x9, [x9, #1368] ; x9 = curthread->td_vnet 0x00000000000113e4 <+24>: ldr x10, [x10, #2168] ; x10 = ??? 0x00000000000113e8 <+28>: ldr x9, [x9, #40] ; x9 = curthread->td_vnet->vnet_data_base 0x00000000000113ec <+32>: ldr w9, [x9, x10] ; w9 = V_htcp_adaptive_backoff ??? 0x00000000000113f0 <+36>: cbz w9, 0x11428 <htcp_recalc_beta+92> I don't understand the computations in relation to x10, which is the offset used to get the relevant variable. However, this code works. Looking at the code generated by clang13 when htcp_recalc_beta() is inlined, one gets: 0xffff000150610f28 <+212>: ldr x10, [x0] ; x10 = ccv 0xffff000150610f2c <+216>: ldr x11, [x18] ; x11 = curthread 0xffff000150610f30 <+220>: ldr x11, [x11, #1368] ; x11 = curthread->td_vnet 0xffff000150610f34 <+224>: ldr x12, [x11, #40] ; x12 = curthread->td_vnet->vnet_data_base 0xffff000150610f38 <+228>: adrp x11, 0xffff000150621000 ; ??? 0xffff000150610f3c <+232>: ldr x11, [x11, #2256] ; ??? 0xffff000150610f40 <+236>: ldr w12, [x12, x11] 0xffff000150610f44 <+240>: cbz w12, 0xffff000150610f7c <htcp_ack_received+296> It looks similar and it does work. Now comes the inlined code from clang14: 0xffff0001016acf28 <+212>: ldr x10, [x0] ; x10 = ccv 0xffff0001016acf2c <+216>: ldr x11, [x18] ; x11 = curthread 0xffff0001016acf30 <+220>: ldr x12, [x11, #1368] ; x12 = curthread->td_vnet 0xffff0001016acf34 <+224>: nop 0xffff0001016acf38 <+228>: adr x11, 0xffff0001016bd520 <vnet_entry_htcp_adaptive_backoff> 0xffff0001016acf3c <+232>: ldr x12, [x12, #40] ; x12 = curthread->td_vnet->vnet_data_base ==>0xffff0001016acf40 <+236>: ldr w12, [x12, x11] 0xffff0001016acf44 <+240>: cbz w12, 0xffff0001016acf7c <htcp_ack_received+296> I reached out at arm-freebsd@freebsd.org for some help regarding the generated code. Thank you for digging into this. It looks like a clang14 flaw. Please see: bug 264021 and probably also bug 264115. This looks like it's caused by https://github.com/llvm/llvm-project/commit/8acc3b4ab0c76b9c2a54182e31a02f90ebb96329 given the loss of a LDR of the ADRP and the insertion of a NOP (ignoring the instruction rescheduling that changes register allocation slightly). Well, that composed with https://github.com/llvm/llvm-project/commit/4450a2a23df0e7081ca7fee3ec641774afedc2bc To confirm, you are using cc_htcp.ko rather than compiling it into your kernel? (In reply to Jessica Clarke from comment #6) >To confirm, you are using cc_htcp.ko rather than compiling it into your kernel? Of course. Good idea. I didn't know it is even possible to build the kernel with cc_htpc support. (In reply to Marek Zarychta from comment #7) In the kernel should work as it's prepared for those kinds of optimisations, but kernel modules should not, as they rely on the indirection persisting and being rewritten on load. Marking the special vnet and dpcpu symbols as weak might be sufficient to stop the relaxations, but I wouldn't like to say for sure, kernel modules get build with weird symbol interposition semantics that have always been asking for trouble. The problem is introduced in the process of turning the .o into a .ko. Pinpointing the bug requires study of the exact definition of the ARM relocation types, in assembly and in ELF. If you change -c to -S in the command for building cc_htcp.o you get two consecutive lines of assembly adrp x11, :got:vnet_entry_htcp_adaptive_backoff ldr x11, [x11, :got_lo12:vnet_entry_htcp_adaptive_backoff] In cc_htcp.o these become (objdump --disassemble --reloc) 200: 9000000b adrp x11, 0 <htcp_mod_init> 200: R_AARCH64_ADR_GOT_PAGE vnet_entry_htcp_adaptive_backoff 204: f940016b ldr x11, [x11] 204: R_AARCH64_LD64_GOT_LO12_NC vnet_entry_htcp_adaptive_backoff In cc_htcp.ko these are incorrectly optimized after the linker determines the variable is close to the instruction: 10f50: d503201f nop 10f54: 10082f6b adr x11, 21540 <vnet_entry_htcp_adaptive_backoff> This computes the address of the value instead of loading the value. The optimization would be correct if there were an adrp+adr pair but it is not correct for adrp+ld. So is :got_lo12: the wrong prefix, is R_AARCH64_LD64_GOT_LO12_NC the wrong binary relocation code, or did the linker forget to check the opcode of the instruction it rewrote? (In reply to John F. Carr from comment #9) The optimisation is correct for non-preemptible symbols. It's not correct for preemptible symbols (or hacks like vnet/dpcpu that rely on symbol and relocation abuse). (In reply to Jessica Clarke from comment #6) I just tested that the problem does not show up if you build the CC module in the kernel using the CC_HTCP kernel option. Resigning as assignee since it is a clang issue, not a networking issue. Just another data point: Disabling VNET also works around the problem. Yes, it's a problem with how VNET (and DPCPU) are designed (In reply to Michael Tuexen from comment #12) Any details on that isolation that could help toolchain@ with root causing and resolution Michael? (In reply to Kubilay Kocak from comment #15) I think I gave all information I can provide. I'm not familiar with the internals for clang... Fresh "FreeBSD 14.0-CURRENT #7 main-n258920-f0a15aafcb8: Sun Oct 30 09:44:53 CET 2022 arm64" built with "FreeBSD clang version 14.0.5 (https://github.com/llvm/llvm-project.git llvmorg-14.0.5-0-gc12386ae247c)" is not affected. At least I am no more able to reproduce this, so I am closing the bug. Too few tests were run. The panic is still reproducible. (In reply to Jessica Clarke from comment #10) Can you see offhand why the symbol (vnet_entry_htcp_adaptive_backoff) is preemptible in the first place? For arm64 KLDs, VNET_DEFINE_STATIC is hacked up to omit the static qualifier. But the linker changes the binding of the symbol: cc_htcp.o: 73: 0000000000000000 4 OBJECT GLOBAL DEFAULT 5 vnet_entry_htcp_adaptive_backoff cc_htcp.ko: 35: 0000000000021520 4 OBJECT LOCAL DEFAULT 11 vnet_entry_htcp_adaptive_backoff and local symbols have hidden visibility by default. I'm not sure offhand why it's changed. I guess -Bsymbolic would have that effect, but we don't specify it. Isn't that a feature of our EXPORT_SYMS awk kmod_syms.awk | xargs -J% objcopy % dance? (In reply to Jessica Clarke from comment #20) Oh, right. So kmod_syms.awk is defeating the hack in VNET_DEFINE_STATIC. That can be bandaged easily enough. I tried implementing your suggestion of adding an explicit offset with each VNET definition. It's a bit painful due to consumers doing things like: VNET_DEFINE(LIST_HEAD(, foo), n) = LIST_HEAD_INITIALIZER(); I don't think it's possible to implement VNET_DEFINE this way without either outlawing static initializers (or making them an argument to the macro itself instead of using assignment), or outlawing anonymous types, i.e., the above would have to be: LIST_HEAD(foo_list, foo); VNET_DEFINE(struct foo_list, n) = LIST_HEAD_INITIALIZER(); I would go with the second option, but it does require a bunch of consumers to be patched. I haven't yet compared the generated code with what we had before. Using __attribute__((used,section("set_vnet_entry"),alias("vnet_entry_"#_n))) static _t vnet_alias_##_n; as a hacky way to forward-declare the thing without the compiler complaining seems to work. Or you can do typedef _t vnet_type_#_n; and use that everywhere that follows so you can use a normal tentative definition to forward-declare. (In reply to Jessica Clarke from comment #23) I had tried the second suggestion, but that doesn't quite work since "n" can be an array, as in VNET_DEFINE(struct hhook_head *, ipsec_hhh_in[HHOOK_IPSEC_COUNT]). Of course we could add VNET_DEFINE_ARRAY and so on. Maybe that's better anyway. Do you see a way to work around the problem without introducing something like that? Oh arrays like that are a non-starter as you wouldn't want vnet_offset_foo[N]... so yeah probably the interface needs reworking, not just the implementation, if you want a "proper" approach. A commit in branch main references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=80e4ac2964a11edef456a15b77e43aadeaf273a2 commit 80e4ac2964a11edef456a15b77e43aadeaf273a2 Author: Dimitry Andric <dim@FreeBSD.org> AuthorDate: 2023-07-23 13:48:36 +0000 Commit: Dimitry Andric <dim@FreeBSD.org> CommitDate: 2023-07-23 22:35:04 +0000 Work around VNET and DPCPU related panics on aarch64 lld >= 14 and recent GNU ld can relax adrp+add and adrp+ldr instructions, which breaks VNET and DPCPU when used in modules. Until VNET and DPCPU can be fixed to deal with these relaxed instructions, disable linker relaxation for now. PR: 264094 Reviewed by: markj MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D41156 sys/conf/kmod.mk | 7 +++++++ 1 file changed, 7 insertions(+) A commit in branch stable/13 references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=f1f27dfa4ad9da53fcdfc6aae644fab83bda4e76 commit f1f27dfa4ad9da53fcdfc6aae644fab83bda4e76 Author: Dimitry Andric <dim@FreeBSD.org> AuthorDate: 2023-07-23 13:48:36 +0000 Commit: Dimitry Andric <dim@FreeBSD.org> CommitDate: 2023-07-26 18:02:37 +0000 Work around VNET and DPCPU related panics on aarch64 lld >= 14 and recent GNU ld can relax adrp+add and adrp+ldr instructions, which breaks VNET and DPCPU when used in modules. Until VNET and DPCPU can be fixed to deal with these relaxed instructions, disable linker relaxation for now. PR: 264094 Reviewed by: markj MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D41156 (cherry picked from commit 80e4ac2964a11edef456a15b77e43aadeaf273a2) sys/conf/kmod.mk | 7 +++++++ 1 file changed, 7 insertions(+) A commit in branch stable/12 references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=98e7f836e65ee413cd0b383371a6aca0115084ed commit 98e7f836e65ee413cd0b383371a6aca0115084ed Author: Dimitry Andric <dim@FreeBSD.org> AuthorDate: 2023-07-23 13:48:36 +0000 Commit: Dimitry Andric <dim@FreeBSD.org> CommitDate: 2023-07-26 18:03:46 +0000 Work around VNET and DPCPU related panics on aarch64 lld >= 14 and recent GNU ld can relax adrp+add and adrp+ldr instructions, which breaks VNET and DPCPU when used in modules. Until VNET and DPCPU can be fixed to deal with these relaxed instructions, disable linker relaxation for now. PR: 264094 Reviewed by: markj MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D41156 (cherry picked from commit 80e4ac2964a11edef456a15b77e43aadeaf273a2) sys/conf/kmod.mk | 7 +++++++ 1 file changed, 7 insertions(+) Shall we close this, or do you think an Erratum is necessary? From what I gather, 13.2-RELEASE images would be affected by this bug? 13.2-R definitely needs an EN and patch, yes I have sent an EN request to secteam@, and CC'd most of you on this bug here. A commit in branch releng/13.2 references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=e3e6fc3713229e5075473b9e0e9a66699e8a322b commit e3e6fc3713229e5075473b9e0e9a66699e8a322b Author: Dimitry Andric <dim@FreeBSD.org> AuthorDate: 2023-07-23 13:48:36 +0000 Commit: Mark Johnston <markj@FreeBSD.org> CommitDate: 2023-08-01 19:50:47 +0000 Work around VNET and DPCPU related panics on aarch64 lld >= 14 and recent GNU ld can relax adrp+add and adrp+ldr instructions, which breaks VNET and DPCPU when used in modules. Until VNET and DPCPU can be fixed to deal with these relaxed instructions, disable linker relaxation for now. PR: 264094 Reviewed by: markj MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D41156 (cherry picked from commit 80e4ac2964a11edef456a15b77e43aadeaf273a2) (cherry picked from commit f1f27dfa4ad9da53fcdfc6aae644fab83bda4e76) Approved by: so Security: FreeBSD-EN-23:08.vnet sys/conf/kmod.mk | 7 +++++++ 1 file changed, 7 insertions(+) errata issued |
Created attachment 234045 [details] stack backtrace After the upgrade to FreeBSD 14.0-CURRENT #3 main-n255718-f2ab9160844 using previosly working net.inet.tcp.cc.algorithm=htcp introduces panic triggered by sendmail starting up. Reverting to default net.inet.tcp.cc.algorithm=newreno setiing solves the issue. The affected architecture is ARM64, last working version was FreeBSD 14.0-CURRENT #1 main-n253944-b2cf3c2125b: Thu Mar 24 18:17:21 CET 2022.