It's very easy to panic an i386 current by just loading carp module. I believe there are still lot's of i386 setups (network appliance like Soekris or VMs) that will be impacted if this bug is still present when 12.0 will be released. Set to reproduce is very easy: 1. Download 12-ALPHA2 i386 image and start it 2. Login as root 3. Enter 'kldldoad carp' and system will panic Thread on the mailing list: https://lists.freebsd.org/pipermail/freebsd-current/2018-August/070780.html Panic message: root@freebsd:~ # kldload carp Fatal trap 12: page fault while in kernel mode cpuid = 0; apic id = 00 fault virtual address = 0x24ef548c fault code = supervisor write, page not present instruction pointer = 0x20:0x1541635c stack pointer = 0x28:0x14de57f4 frame pointer = 0x28:0x14de57fc code segment = base rx0, limit 0xfffff, type 0x1b = DPL 0, pres 1, def32 1, gran 1 processor eflags = interrupt enabled, resume, IOPL = 0 current process = 703 (kldload) [ thread pid 703 tid 100073 ] Stopped at vnet_carpstats_init+0x2c: movl %eax,__stop_set_vnet(%ecx,%esi,1) db> bt Tracing pid 703 tid 100073 td 0x15152700 vnet_carpstats_init(0,0,1820be1,1e2,17c0321,...) at vnet_carpstats_init+0x2c/frame 0x14de57fc vnet_register_sysinit(154182cc,0,1732cd3,e6,0,...) at vnet_register_sysinit+0xf6/frame 0x14de5828 linker_load_module(0,0,14de5a6c,42e,115a201,...) at linker_load_module+0xc64/frame 0x14de5a4c kern_kldload(15152700,11a0b800,14de5a98,0,0,...) at kern_kldload+0xf5/frame 0x14de5a80 sys_kldload(15152700,15152984,180ca39,4,14de5ad4,...) at sys_kldload+0x6e/frame 0x14de5aa8 syscall(14de5ba8,3b,3b,3b,ffbfee90,...) at syscall+0x33f/frame 0x14de5b9c Xint0x80_syscall() at PTDpde+0x43af/frame 0x14de5b9c --- syscall (304, FreeBSD ELF32, sys_kldload), eip = 0x200a076f, esp = 0xffbfe7c4, ebp = 0xffbfed18 --- carp_list() at 0x200a076f
Reproduced. I'll try and have a look.
Hi, just to let you know; working on this; just found a simplistic case to reproduce this without carp, which should make it easier to debug.
Looks like this is a toolchain issue that circumvents dpcpu/vnet magic. I'll explain in full details in the morning.
Ok, so the short explanation is that having a non-simple-type at the end of the dpcpu or vnet linker sets and an intelligent compiler/linker combination can result in the last symbol not being relocated. In the case of i386/carp this was the PCPU stats glebius introduced which is an array of 16 pointers. I've spent a day to think of possible work around and the only one was to add padding to the end of the section; with the help of arichardson managed to work my way around linker scripts and with an extra 8 hours I have a dual-stage linker-script solution which will only adjust the kernel modules which actually do have a vnet_set or pcpu_set section and not create one in every module with the size of 1 byte. I'll write the entire details up including sample code and the hacked up prototype solution and post it all here and in phab sometime the next days (possibly after the weekend). TODO: investigate which other architectures but i386 are possibly affected by this as well.
I'll start describing the problem from a reduced piece of code, which is not as big as carp, replicating carpstats, assuming VIMAGE is on: -------- #include <sys/param.h> #include <sys/kernel.h> #include <sys/systm.h> #include <sys/types.h> #include <sys/mbuf.h> #include <sys/counter.h> #include <net/vnet.h> struct xstats { uint64_t foo1; uint64_t bar1; uint64_t baz1; uint64_t mad1; uint64_t foo2; uint64_t bar2; uint64_t baz2; uint64_t mad2; uint64_t foo3; uint64_t bar3; uint64_t baz3; uint64_t mad3; uint64_t foo4; uint64_t bar4; uint64_t baz4; uint64_t mad4; }; VNET_PCPUSTAT_DEFINE(struct xstats, xstats); VNET_PCPUSTAT_SYSINIT(xstats); -------- This unrolls into: 1 struct _hack; 2 counter_u64_t vnet_entry_xstats[sizeof(struct xstats) / sizeof(uint64_t)] __attribute__((__section__("set_vnet"))) __attribute__((__used__)); 3 4 5 static void 6 vnet_xstats_init(const void *unused) 7 { 8 do { 9 for (int i = 0; i < (sizeof((*(__typeof(vnet_entry_xstats) *) (((((__curthread())->td_vnet))->vnet_data_base) + (uintptr_t) & vnet_entry_xstats))) / sizeof(counter_u64_t)); i++) 10 ((*(__typeof(vnet_entry_xstats) *) (((((__curthread())->td_vnet))->vnet_data_base) + (uintptr_t) & vnet_entry_xstats)))[i] = counter_u64_alloc((0x0002)); 11 } while (0); 12 } In essance what looks so complicated is (on a per vnet base): void *array[16]; for (i=0 ; i<16; i++) array[i] = alloc(); The above (with the vnet bits), on i386, is translated into: 00000340 <vnet_xstats_init>: 340: 55 push %ebp 341: 89 e5 mov %esp,%ebp 343: 56 push %esi 344: 50 push %eax 345: be c0 ff ff ff mov $0xffffffc0,%esi 34a: 90 nop 34b: 90 nop 34c: 90 nop 34d: 90 nop 34e: 90 nop 34f: 90 nop 350: c7 04 24 02 00 00 00 movl $0x2,(%esp) 357: e8 fc ff ff ff call 358 <vnet_xstats_init+0x18> 358: R_386_PC32 counter_u64_alloc 35c: 64 8b 0d 00 00 00 00 mov %fs:0x0,%ecx 363: 8b 89 1c 03 00 00 mov 0x31c(%ecx),%ecx 369: 8b 49 1c mov 0x1c(%ecx),%ecx 36c: 89 84 31 88 14 00 00 mov %eax,0x1488(%ecx,%esi,1) 36f: R_386_RELATIVE *ABS* 373: 83 c6 04 add $0x4,%esi 376: 75 d8 jne 350 <vnet_xstats_init+0x10> 378: 83 c4 04 add $0x4,%esp 37b: 5e pop %esi 37c: 5d pop %ebp 37d: c3 ret Now here's the problem: __start_set_vnet is 0x1448 __stop_set_vnet is 0x1488 The problem is that the code generated goes like this: %esi = -64 repeat: %eax = alloc() We do all the curthread->td_vnet->vnet_data_base in %ecx and then do mov %eax,0x1488(%ecx,%esi,1) Which is: move the alloc() result into curthread->td_vnet->vnet_data_base + 0x1488 + (1 * -64) Now 0x1488 - 64 gets us to the beginning of the array[] or array[0]. %esi += 4 So next iteration it'll be 0x1488 - 60 or array[1] ... and so on. while %esi != 0 goto repeat; It's an easy way to to the for(i=0; i<16; i++) loop. The problem is that 0x1488 was not relocated. When we are going over the relocations and calling into elf_relocaddr() the check for VIMAGE is: if (x >= ef->vnet_start && x < ef->vnet_stop) { In our case we have an *ABS* R_386_RELATIVE of 0x1488, which is == ef->vnet_stop but not < ef->vnet_stop. The real problem is that with non-simple-types the code generated with an absolute relocation might be just outside the range. We cannot adjust the check as there might be a simple-type following in the next section which would then be relocated. For CARP this showed up because the VNET_PCPUSTAT_DEFINE() went into the VNET section the last and hence the problem showed up. If there was any other, say int V_Foo after it, we'd never have noticed. We cannot fully control the order in which symbols go into the section, or at least not to the extend we'd like to, so we cannot make sure there's always a char at the end. The only solution arichardson and I agreed to would be to add 1 byte of padding to the end of the section. Using BYTE(1) in a linker script however would always create a set_vnet section in all kernel modules even if they do not have any virtualized objects. Using a https://sourceware.org/binutils/docs-2.31/ld/Output-Section-Discarding.html#Output-Section-Discarding . = . + <0|sym>; should not create the section. This leads to a multi-stage solution: (1) add a symbol based on SIZEOF(set_vnet) which is either 0 or 1. SIZEOF() does not create a section if it does not exist. (2) pass the *(set_vnet) through and then increment "." by the amount of the symbol from (1); which if there is a section it will increase the section by 1 byte and any non-simple-type objects resulting in absolute relocations on the boundary should also be relocated. If there is no section . = . + 0; will not create the empty section. The problem is that the symbol from (1) relies on the section size to be known which we don't seem to in the first pass. The second problem is that the symbol from (1) is not immediately visible (despite it should me; XXX sourceware reference for that) So we really have to do multiple linker invocations with different linker scripts. With ld.bfd an empty set_vnet section will not show up, only the symbol from (1) will be left behind and that's fine. With ld.lld 6 it will also leave an empty set_vnet section behind; emaste points me to https://reviews.llvm.org/rL325887 where this had been under discussion already with dim, him, and lld developers. It's not beautiful, but should not harm either (XXX to be tested). We need to exclude "firmware" from the dance and for now make it i386 specific. The entire problem described above for VNET also applies to DPCPU (set_pcpu). I'll attached a preliminary patch which seems to hack this together and once I added a bit more description etc to the files, I'll put it into Phab as well (if you have comments earlier let me know). I hope this all kind-of makes any sense ;-)
Created attachment 198004 [details] kmod.mk adjustments along with two new linker scripts
https://reviews.freebsd.org/D17512
Track the dependency problems; need to solve at least the link_elf one before we can do the BYTE(1) linker script and follow-up link_elf checks.
A commit references this bug: Author: bz Date: Thu Nov 1 17:26:18 UTC 2018 New revision: 340009 URL: https://svnweb.freebsd.org/changeset/base/340009 Log: carpstats are the last virtualised variable in the file and end up at the end of the vnet_set. The generated code uses an absolute relocation at one byte beyond the end of the carpstats array. This means the relocation for the vnet does not happen for carpstats initialisation and as a result the kernel panics on module load. This problem has only been observed with carp and only on i386. We considered various possible solutions including using linker scripts to add padding to all kernel modules for pcpu and vnet sections. While the symbols (by chance) stay in the order of appearance in the file adding an unused non-file-local variable at the end of the file will extend the size of set_vnet and hence make the absolute relocation for carpstats work (think of this as a single-module set_vnet padding). This is a (tmporary) hack. It is the least intrusive one as we need a timely solution for the upcoming release. We will revisit the problem in HEAD. For a lot more information and the possible alternate solutions please see the PR and the references therein. PR: 230857 MFC after: 3 days Changes: head/sys/netinet/ip_carp.c
As Alex R points out if we run into symbol reordering problems with the extra variable fix committed, we could add an extra .o file to the end of the list which (unless there is a linker-script doing magic) should always stay at the end of the list. We could add that to just problematic modules or to all modules (and then could still ignore the extra bytes). We'd use a static __used variable to not conflict with duplicate symbols or if that does not work, asm. I am just adding this to the PR as to write own more possible ways to fix this. He also mentions -fPIC would probably solve the initial problem (but that's a totally different can of worms on our i386).
A commit references this bug: Author: bz Date: Sun Nov 4 06:25:07 UTC 2018 New revision: 340109 URL: https://svnweb.freebsd.org/changeset/base/340109 Log: MFC r340009: carpstats are the last virtualised variable in the file and end up at the end of the vnet_set. The generated code uses an absolute relocation at one byte beyond the end of the carpstats array. This means the relocation for the vnet does not happen for carpstats initialisation and as a result the kernel panics on module load. This problem has only been observed with carp and only on i386. We considered various possible solutions including using linker scripts to add padding to all kernel modules for pcpu and vnet sections. While the symbols (by chance) stay in the order of appearance in the file adding an unused non-file-local variable at the end of the file will extend the size of set_vnet and hence make the absolute relocation for carpstats work (think of this as a single-module set_vnet padding). This is a (tmporary) hack. It is the least intrusive one as we need a timely solution for the upcoming release. We will revisit the problem in HEAD. For a lot more information and the possible alternate solutions please see the PR and the references therein. PR: 230857 Approved by: re (kib) Changes: _U stable/12/ stable/12/sys/netinet/ip_carp.c
A commit references this bug: Author: bz Date: Sat Jun 8 17:44:43 UTC 2019 New revision: 348808 URL: https://svnweb.freebsd.org/changeset/base/348808 Log: Fix dpcpu and vnet panics with complex types at the end of the section. Apply a linker script when linking i386 kernel modules to apply padding to a set_pcpu or set_vnet section. The padding value is kind-of random and is used to catch modules not compiled with the linker-script, so possibly still having problems leading to kernel panics. This is needed as the code generated on certain architectures for non-simple-types, e.g., an array can generate an absolute relocation on the edge (just outside) the section and thus will not be properly relocated. Adding the padding to the end of the section will ensure that even absolute relocations of complex types will be inside the section, if they are the last object in there and hence relocation will work properly and avoid panics such as observed with carp.ko or ipsec.ko. There is a rather lengthy discussion of various options to apply in the mentioned PRs and their depends/blocks, and the review. There seems no best solution working across multiple toolchains and multiple version of them, so I took the liberty of taking one, as currently our users (and our CI system) are hitting this on just i386 and we need some solution. I wish we would have a proper fix rather than another "hack". Also backout r340009 which manually, temporarily fixed CARP before 12.0-R "by chance" after a lead-up of various other link-elf.c and related fixes. PR: 230857,238012 With suggestions from: arichardson (originally last year) Tested by: lwhsu Event: Waterloo Hackathon 2019 Reported by: lwhsu, olivier MFC after: 6 weeks Differential Revision: https://reviews.freebsd.org/D17512 Changes: head/UPDATING head/sys/conf/kmod.mk head/sys/conf/ldscript.set_padding head/sys/kern/link_elf.c head/sys/netinet/ip_carp.c head/sys/sys/param.h
*** Bug 238012 has been marked as a duplicate of this bug. ***
Is this resolved now?
(In reply to Ed Maste from comment #14) In HEAD it is. I think it was never MFCed due to my personal events last year after returning from Canada. Sigh. Should I try to get this into 12.2-RC1?
r348808 predates stable/13 branch, so is in 13.0 and 13.1 r340009 (carp-specific hack) was MFC'd to stable/12 and is at least in 12.1.
^Triage: committed back in 2022. The mfc-stable* flags' values are now OBE.