Bug 230857

Summary: loading carp module panic i386 kernel (VIMAGE related)
Product: Base System Reporter: Olivier Cochard <olivier>
Component: kernAssignee: Bjoern A. Zeeb <bz>
Status: Closed FIXED    
Severity: Affects Some People CC: bz, dim, emaste, lwhsu, toolchain
Priority: --- Keywords: crash, toolchain, vimage
Version: CURRENT   
Hardware: i386   
OS: Any   
URL: https://reviews.freebsd.org/D17512
See Also: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=238012
Bug Depends on: 232289    
Bug Blocks:    
Attachments:
Description Flags
kmod.mk adjustments along with two new linker scripts none

Description Olivier Cochard freebsd_committer freebsd_triage 2018-08-24 05:43:07 UTC
It's very easy to panic an i386 current by just loading carp module.
I believe there are still lot's of i386 setups (network appliance like Soekris or VMs) that will be impacted if this bug is still present when 12.0 will be released.
Set to reproduce is very easy:
1. Download 12-ALPHA2 i386 image and start it
2. Login as root
3. Enter 'kldldoad carp' and system will panic

Thread on the mailing list:
https://lists.freebsd.org/pipermail/freebsd-current/2018-August/070780.html

Panic message:

root@freebsd:~ # kldload carp


Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address   = 0x24ef548c
fault code              = supervisor write, page not present
instruction pointer     = 0x20:0x1541635c
stack pointer           = 0x28:0x14de57f4
frame pointer           = 0x28:0x14de57fc
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, def32 1, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 703 (kldload)
[ thread pid 703 tid 100073 ]
Stopped at      vnet_carpstats_init+0x2c:       movl    %eax,__stop_set_vnet(%ecx,%esi,1)
db> bt
Tracing pid 703 tid 100073 td 0x15152700
vnet_carpstats_init(0,0,1820be1,1e2,17c0321,...) at vnet_carpstats_init+0x2c/frame 0x14de57fc
vnet_register_sysinit(154182cc,0,1732cd3,e6,0,...) at vnet_register_sysinit+0xf6/frame 0x14de5828
linker_load_module(0,0,14de5a6c,42e,115a201,...) at linker_load_module+0xc64/frame 0x14de5a4c
kern_kldload(15152700,11a0b800,14de5a98,0,0,...) at kern_kldload+0xf5/frame 0x14de5a80
sys_kldload(15152700,15152984,180ca39,4,14de5ad4,...) at sys_kldload+0x6e/frame 0x14de5aa8
syscall(14de5ba8,3b,3b,3b,ffbfee90,...) at syscall+0x33f/frame 0x14de5b9c
Xint0x80_syscall() at PTDpde+0x43af/frame 0x14de5b9c
--- syscall (304, FreeBSD ELF32, sys_kldload), eip = 0x200a076f, esp = 0xffbfe7c4, ebp = 0xffbfed18 ---
carp_list() at 0x200a076f
Comment 1 Bjoern A. Zeeb freebsd_committer freebsd_triage 2018-09-19 14:30:32 UTC
Reproduced.  I'll try and have a look.
Comment 2 Bjoern A. Zeeb freebsd_committer freebsd_triage 2018-10-02 22:34:53 UTC
Hi, just to let you know; working on this;  just found a simplistic case to reproduce this without carp, which should make it easier to debug.
Comment 3 Bjoern A. Zeeb freebsd_committer freebsd_triage 2018-10-05 02:16:28 UTC
Looks like this is a toolchain issue that circumvents dpcpu/vnet magic.
I'll explain in full details in the morning.
Comment 4 Bjoern A. Zeeb freebsd_committer freebsd_triage 2018-10-05 19:57:06 UTC
Ok, so the short explanation is that having a non-simple-type at the end of the dpcpu or vnet linker sets and an intelligent compiler/linker combination can result in the last symbol not being relocated.  In the case of i386/carp this was the PCPU stats glebius introduced which is an array of 16 pointers.

I've spent a day to think of possible work around and the only one was to add padding to the end of the section;  with the help of arichardson managed to work my way around linker scripts and with an extra 8 hours I have a dual-stage linker-script solution which will only adjust the kernel modules which actually do have a vnet_set or pcpu_set section and not create one in every module with the size of 1 byte.

I'll write the entire details up including sample code and the hacked up prototype solution and post it all here and in phab sometime the next days (possibly after the weekend).

TODO: investigate which other architectures but i386 are possibly affected by this as well.
Comment 5 Bjoern A. Zeeb freebsd_committer freebsd_triage 2018-10-10 15:11:56 UTC
I'll start describing the problem from a reduced piece of code, which is not as big as carp, replicating carpstats, assuming VIMAGE is on:

--------
#include <sys/param.h>
#include <sys/kernel.h>
#include <sys/systm.h>
#include <sys/types.h>
#include <sys/mbuf.h>
#include <sys/counter.h>
#include <net/vnet.h>

struct xstats {
        uint64_t        foo1;
        uint64_t        bar1;
        uint64_t        baz1;
        uint64_t        mad1;

        uint64_t        foo2;
        uint64_t        bar2;
        uint64_t        baz2;
        uint64_t        mad2;

        uint64_t        foo3;
        uint64_t        bar3;
        uint64_t        baz3;
        uint64_t        mad3;

        uint64_t        foo4;
        uint64_t        bar4;
        uint64_t        baz4;
        uint64_t        mad4;
};

VNET_PCPUSTAT_DEFINE(struct xstats, xstats);
VNET_PCPUSTAT_SYSINIT(xstats);
--------

This unrolls into:

 1 struct _hack;
 2 counter_u64_t   vnet_entry_xstats[sizeof(struct xstats) / sizeof(uint64_t)] __attribute__((__section__("set_vnet"))) __attribute__((__used__));
 3
 4
 5 static void
 6 vnet_xstats_init(const void *unused)
 7 {
 8         do {
 9                 for (int i = 0; i < (sizeof((*(__typeof(vnet_entry_xstats) *) (((((__curthread())->td_vnet))->vnet_data_base) + (uintptr_t) & vnet_entry_xstats))) / sizeof(counter_u64_t)); i++)
10                         ((*(__typeof(vnet_entry_xstats) *) (((((__curthread())->td_vnet))->vnet_data_base) + (uintptr_t) & vnet_entry_xstats)))[i] = counter_u64_alloc((0x0002));
11         } while (0);
12 }


In essance what looks so complicated is (on a per vnet base):
        void *array[16];
        for (i=0 ; i<16; i++)
                array[i] = alloc();


The above (with the vnet bits), on i386, is translated into:

00000340 <vnet_xstats_init>:
 340:   55                      push   %ebp
 341:   89 e5                   mov    %esp,%ebp
 343:   56                      push   %esi
 344:   50                      push   %eax
 345:   be c0 ff ff ff          mov    $0xffffffc0,%esi
 34a:   90                      nop
 34b:   90                      nop
 34c:   90                      nop
 34d:   90                      nop
 34e:   90                      nop
 34f:   90                      nop
 350:   c7 04 24 02 00 00 00    movl   $0x2,(%esp)
 357:   e8 fc ff ff ff          call   358 <vnet_xstats_init+0x18>
                        358: R_386_PC32 counter_u64_alloc
 35c:   64 8b 0d 00 00 00 00    mov    %fs:0x0,%ecx
 363:   8b 89 1c 03 00 00       mov    0x31c(%ecx),%ecx
 369:   8b 49 1c                mov    0x1c(%ecx),%ecx
 36c:   89 84 31 88 14 00 00    mov    %eax,0x1488(%ecx,%esi,1)
                        36f: R_386_RELATIVE     *ABS*
 373:   83 c6 04                add    $0x4,%esi
 376:   75 d8                   jne    350 <vnet_xstats_init+0x10>
 378:   83 c4 04                add    $0x4,%esp
 37b:   5e                      pop    %esi
 37c:   5d                      pop    %ebp
 37d:   c3                      ret


Now here's the problem:

__start_set_vnet is 0x1448
__stop_set_vnet is 0x1488


The problem is that the code generated goes like this:

%esi = -64
repeat:
%eax = alloc()
We do all the curthread->td_vnet->vnet_data_base in %ecx and then do
    mov    %eax,0x1488(%ecx,%esi,1)   
    Which is:
    move the alloc() result into curthread->td_vnet->vnet_data_base + 0x1488 + (1 * -64)
    Now 0x1488 - 64 gets us to the beginning of the array[] or array[0].

%esi += 4
    So next iteration it'll be 0x1488 - 60 or array[1] ... and so on.
while %esi != 0 goto repeat;


It's an easy way to to the for(i=0; i<16; i++) loop.

The problem is that 0x1488 was not relocated.

When we are going over the relocations and calling into elf_relocaddr() the check for VIMAGE is:

      if (x >= ef->vnet_start && x < ef->vnet_stop) {

In our case we have an *ABS* R_386_RELATIVE of 0x1488, which is == ef->vnet_stop but not < ef->vnet_stop.


The real problem is that with non-simple-types the code generated with an absolute relocation might be just outside the range.  We cannot adjust the check as there might be a simple-type following in the next section which would then be relocated.



For CARP this showed up because the VNET_PCPUSTAT_DEFINE() went into the VNET section the last and hence the problem showed up.  If there was any other, say int V_Foo after it, we'd never have noticed.   We cannot fully control the order in which symbols go into the section, or at least not to the extend we'd like to, so we cannot make sure there's always a char at the end.

The only solution arichardson and I agreed to would be to add 1 byte of padding to the end of the section.


Using BYTE(1) in a linker script however would always create a set_vnet section in all kernel modules even if they do not have any virtualized objects.

Using a https://sourceware.org/binutils/docs-2.31/ld/Output-Section-Discarding.html#Output-Section-Discarding . = . + <0|sym>; should not create the section.

This leads to a multi-stage solution:

(1) add a symbol based on SIZEOF(set_vnet) which is either 0 or 1.   SIZEOF() does not create a section if it does not exist.
(2) pass the *(set_vnet) through and then increment "." by the amount of the symbol from (1); which if there is a section it will increase the section by 1 byte and any non-simple-type objects resulting in absolute relocations on the boundary should also be relocated.  If there is no section . = . + 0; will not create the empty section.

The problem is that the symbol from (1) relies on the section size to be known which we don't seem to in the first pass.
The second problem is that the symbol from (1) is not immediately visible (despite it should me; XXX sourceware reference for that)

So we really have to do multiple linker invocations with different linker scripts.

With ld.bfd an empty set_vnet section will not show up, only the symbol from (1) will be left behind and that's fine.
With ld.lld 6 it will also leave an empty set_vnet section behind;  emaste points me to https://reviews.llvm.org/rL325887 where this had been under discussion already with dim, him, and lld developers.   It's not beautiful, but should not harm either (XXX to be tested).


We need to exclude "firmware" from the dance and for now make it i386 specific.

The entire problem described above for VNET also applies to DPCPU (set_pcpu).

I'll attached a preliminary patch which seems to hack this together and once I added a bit more description etc to the files, I'll put it into Phab as well (if you have comments earlier let me know).



I hope this all kind-of makes any sense ;-)
Comment 6 Bjoern A. Zeeb freebsd_committer freebsd_triage 2018-10-10 15:13:36 UTC
Created attachment 198004 [details]
kmod.mk adjustments along with two new linker scripts
Comment 7 Bjoern A. Zeeb freebsd_committer freebsd_triage 2018-10-10 22:49:43 UTC
https://reviews.freebsd.org/D17512
Comment 8 Bjoern A. Zeeb freebsd_committer freebsd_triage 2018-10-17 08:41:26 UTC
Track the dependency problems;  need to solve at least the link_elf one before we can do the BYTE(1) linker script and follow-up link_elf checks.
Comment 9 commit-hook freebsd_committer freebsd_triage 2018-11-01 17:27:12 UTC
A commit references this bug:

Author: bz
Date: Thu Nov  1 17:26:18 UTC 2018
New revision: 340009
URL: https://svnweb.freebsd.org/changeset/base/340009

Log:
  carpstats are the last virtualised variable in the file and end up at the
  end of the vnet_set.  The generated code uses an absolute relocation at
  one byte beyond the end of the carpstats array.  This means the relocation
  for the vnet does not happen for carpstats initialisation and as a result
  the kernel panics on module load.

  This problem has only been observed with carp and only on i386.
  We considered various possible solutions including using linker scripts
  to add padding to all kernel modules for pcpu and vnet sections.

  While the symbols (by chance) stay in the order of appearance in the file
  adding an unused non-file-local variable at the end of the file will extend
  the size of set_vnet and hence make the absolute relocation for carpstats
  work (think of this as a single-module set_vnet padding).

  This is a (tmporary) hack.  It is the least intrusive one as we need a
  timely solution for the upcoming release.  We will revisit the problem in
  HEAD.  For a lot more information and the possible alternate solutions
  please see the PR and the references therein.

  PR:			230857
  MFC after:		3 days

Changes:
  head/sys/netinet/ip_carp.c
Comment 10 Bjoern A. Zeeb freebsd_committer freebsd_triage 2018-11-02 13:52:44 UTC
As Alex R points out if we run into symbol reordering problems with the extra variable fix committed, we could add an extra .o file to the end of the list which (unless there is a linker-script doing magic) should always stay at the end of the list.
We could add that to just problematic modules or to all modules (and then could still ignore the extra bytes).
We'd use a static __used variable to not conflict with duplicate symbols or if that does not work, asm.

I am just adding this to the PR as to write own more possible ways to fix this.

He also mentions -fPIC would probably solve the initial problem (but that's a totally different can of worms on our i386).
Comment 11 commit-hook freebsd_committer freebsd_triage 2018-11-04 06:25:17 UTC
A commit references this bug:

Author: bz
Date: Sun Nov  4 06:25:07 UTC 2018
New revision: 340109
URL: https://svnweb.freebsd.org/changeset/base/340109

Log:
  MFC r340009:

    carpstats are the last virtualised variable in the file and end up at the
    end of the vnet_set.  The generated code uses an absolute relocation at
    one byte beyond the end of the carpstats array.  This means the relocation
    for the vnet does not happen for carpstats initialisation and as a result
    the kernel panics on module load.

    This problem has only been observed with carp and only on i386.
    We considered various possible solutions including using linker scripts
    to add padding to all kernel modules for pcpu and vnet sections.

    While the symbols (by chance) stay in the order of appearance in the file
    adding an unused non-file-local variable at the end of the file will extend
    the size of set_vnet and hence make the absolute relocation for carpstats
    work (think of this as a single-module set_vnet padding).

    This is a (tmporary) hack.  It is the least intrusive one as we need a
    timely solution for the upcoming release.  We will revisit the problem in
    HEAD.  For a lot more information and the possible alternate solutions
    please see the PR and the references therein.

  PR:		230857
  Approved by:	re (kib)

Changes:
_U  stable/12/
  stable/12/sys/netinet/ip_carp.c
Comment 12 commit-hook freebsd_committer freebsd_triage 2019-06-08 17:44:55 UTC
A commit references this bug:

Author: bz
Date: Sat Jun  8 17:44:43 UTC 2019
New revision: 348808
URL: https://svnweb.freebsd.org/changeset/base/348808

Log:
  Fix dpcpu and vnet panics with complex types at the end of the section.

  Apply a linker script when linking i386 kernel modules to apply padding
  to a set_pcpu or set_vnet section.  The padding value is kind-of random
  and is used to catch modules not compiled with the linker-script, so
  possibly still having problems leading to kernel panics.

  This is needed as the code generated on certain architectures for
  non-simple-types, e.g., an array can generate an absolute relocation
  on the edge (just outside) the section and thus will not be properly
  relocated. Adding the padding to the end of the section will ensure
  that even absolute relocations of complex types will be inside the
  section, if they are the last object in there and hence relocation will
  work properly and avoid panics such as observed with carp.ko or ipsec.ko.

  There is a rather lengthy discussion of various options to apply in
  the mentioned PRs and their depends/blocks, and the review.
  There seems no best solution working across multiple toolchains and
  multiple version of them, so I took the liberty of taking one,
  as currently our users (and our CI system) are hitting this on
  just i386 and we need some solution.  I wish we would have a proper
  fix rather than another "hack".

  Also backout r340009 which manually, temporarily fixed CARP before 12.0-R
  "by chance" after a lead-up of various other link-elf.c and related fixes.

  PR:			230857,238012
  With suggestions from:	arichardson (originally last year)
  Tested by:		lwhsu
  Event:			Waterloo Hackathon 2019
  Reported by:		lwhsu, olivier
  MFC after:		6 weeks
  Differential Revision:	https://reviews.freebsd.org/D17512

Changes:
  head/UPDATING
  head/sys/conf/kmod.mk
  head/sys/conf/ldscript.set_padding
  head/sys/kern/link_elf.c
  head/sys/netinet/ip_carp.c
  head/sys/sys/param.h
Comment 13 Bjoern A. Zeeb freebsd_committer freebsd_triage 2019-06-08 17:45:51 UTC
*** Bug 238012 has been marked as a duplicate of this bug. ***
Comment 14 Ed Maste freebsd_committer freebsd_triage 2020-09-25 14:41:21 UTC
Is this resolved now?
Comment 15 Bjoern A. Zeeb freebsd_committer freebsd_triage 2020-09-25 18:27:46 UTC
(In reply to Ed Maste from comment #14)

In HEAD it is.  I think it was never MFCed due to my personal events last year after returning from Canada.  Sigh.  Should I try to get this into 12.2-RC1?
Comment 16 Ed Maste freebsd_committer freebsd_triage 2022-04-11 18:12:40 UTC
r348808 predates stable/13 branch, so is in 13.0 and 13.1
r340009 (carp-specific hack) was MFC'd to stable/12 and is at least in 12.1.
Comment 17 Mark Linimon freebsd_committer freebsd_triage 2023-12-25 14:49:36 UTC
^Triage: committed back in 2022.  The mfc-stable* flags' values are now OBE.