Bug 207898

Summary: kernel linker behaves differently on amd64 vs. i386
Product: Base System
Reporter: Don Lewis <truckman>
Component: kern
Assignee: freebsd-bugs (Nobody) <bugs>
Status: New
Severity: Affects Some People
CC: desk, franco, jilles
Priority: ---
Version: CURRENT   
Hardware: Any   
OS: Any   
Attachments:
Description: example kernel module source that illustrates the differing kernel linker behavior on amd64 vs i386
Flags: none

Description Don Lewis freebsd_committer 2016-03-11 09:27:29 UTC
Created attachment 168000
example kernel module source that illustrates the differing kernel linker behavior on amd64 vs i386

If one source file in a kernel module defines a symbol as static and another declares it as extern, the module fails to load on i386, with the kernel logging a message about the symbol being undefined.  I believe this is the correct behavior.  On amd64, the module loads and the code in the second source file is able to access the static variable in the first source file.

In the attached example, the main module source file is able to access static character arrays in the other source file when loaded on an amd64 machine.

The behavior is the same on FreeBSD 10.1 through recent 11.0-CURRENT.  FreeBSD 9.x has not been tested.
Comment 1 Don Lewis freebsd_committer 2016-03-12 07:56:36 UTC
Most of the kernel linker code is MI, but there is some MD code in /usr/src/sys/{amd64/amd64,i386/i386}/elf_machdep.c.  I didn't see anything suspicious there.

The MI code is difficult to figure out, but that is where I suspect the problem is.  I suspect that whether or not the problem is triggered depends on the order of the relocation entries in the .ko file.  On amd64, I see this when I run nm on the .ko file:

[snip]
                 U module_register_init
0000000000000000 b msg1
                 U msg1
0000000000000050 b msg2
                 U msg2
                 U strcpy
                 U uprintf

On i386, I see this:

[snip]
         U module_register_init
         U msg1
000014c4 b msg1
         U msg2
00001514 b msg2
         U strcpy
         U uprintf

Note that the "b" entries for msg1 and msg2 precede the "U" entries on amd64, but the reverse is true on i386.

Unfortunately this is difficult to test because swapping the order of SRCS does not change the order as reported by nm.
Comment 2 Jilles Tjoelker freebsd_committer 2016-03-13 22:50:56 UTC
There is another MD aspect of the kernel linker: whether kernel modules are object files (file says "ELF xx-bit yyy relocatable") or DSOs (file says "ELF xx-bit yyy shared object"). Of the architectures you are looking at, i386 uses DSOs and amd64 uses object files.
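
The distinction that file(1) reports comes from the e_type field of the ELF header, so it can also be checked by hand. The following is a minimal userland sketch, not code from the kernel linker: ko_kind() and the SK_ constants are hypothetical names, and the numeric values (ET_REL = 1, ET_DYN = 3, e_type at byte offset 16) are taken from the ELF specification.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Values from the ELF specification; the SK_ names are local to this sketch. */
enum { SK_EI_DATA = 5, SK_ELFDATA2MSB = 2, SK_ET_REL = 1, SK_ET_DYN = 3 };

/*
 * Classify an ELF image from its first bytes (hypothetical helper).
 * e_type is a 16-bit field at offset 16 in both ELF32 and ELF64 headers.
 */
const char *
ko_kind(const unsigned char *hdr, size_t len)
{
	unsigned type;

	assert(hdr != NULL);
	if (len < 18 || memcmp(hdr, "\177ELF", 4) != 0)
		return ("not ELF");
	/* Honor the byte order recorded in e_ident[EI_DATA]. */
	if (hdr[SK_EI_DATA] == SK_ELFDATA2MSB)
		type = (unsigned)hdr[16] << 8 | hdr[17];
	else
		type = (unsigned)hdr[17] << 8 | hdr[16];
	if (type == SK_ET_REL)
		return ("relocatable object");
	if (type == SK_ET_DYN)
		return ("shared object");
	return ("other");
}
```

Passing the first 18 bytes of a .ko file to ko_kind() should give "relocatable object" for an amd64 module and "shared object" for an i386 one, matching what file reports.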

Using object files may reduce overhead slightly but bypasses functionality that may be useful. For example, DSOs have a symbol table for dynamic linking separate from the one for debugging, while object files only have a single symbol table. Although that table marks static (local) symbols as such, the kernel linker ignores the marking, and some code may have come to depend on it being ignored.

Note that, although i386 kernel modules are DSOs, they are not PIC and do not use a GOT and PLT. Therefore, there is no overhead from the DSO format while running the code.
Comment 3 Don Lewis freebsd_committer 2016-03-14 20:01:32 UTC
The dummynet AQM patch accidentally abused this when it added two new files that referenced an existing static variable in one of the pre-existing source files.  This was not caught because the authors only tested on amd64.

Why would the linker ignore the flag for local symbols?  That seems like it could be the source of difficult to debug problems.
Comment 4 Justin Cady 2021-05-10 16:20:52 UTC
> Why would the linker ignore the flag for local symbols?  That seems like it could be the source of difficult to debug problems.

I have the same question, and I can confirm that such problems are very difficult to debug. :(

In my case, Module-A, which depends on Modules B and C, is intended to pick up a globally exported symbol from Module-B, but picks up a _local_ symbol of the same name from Module-C (which happens to be scanned by the linker first). I believe that is because of this bug.

I verified this was causing my issue by editing link_elf_lookup_symbol() in sys/kern/link_elf_obj.c to populate the sym pointer and return success only if the symbol has global binding (STB_GLOBAL). With that change, the local symbol from Module-C is ignored, and the global symbol from Module-B is correctly selected.
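
To make the effect of that change concrete, here is a small userland model of the lookup. It is only a sketch under stated assumptions and is not the code in link_elf_obj.c: the sk_ types and functions, the symbol name "helper", and the values 200 and 300 are all made up for illustration. The binding values and the position of the binding in the high four bits of the info byte follow the ELF specification. Weak symbols are not modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Binding values and info-byte layout per the ELF specification. */
enum { SK_STB_LOCAL = 0, SK_STB_GLOBAL = 1 };
#define SK_ST_BIND(info)	((unsigned char)(info) >> 4)

/* Cut-down, hypothetical stand-ins for a symbol entry and a loaded module. */
struct sk_sym {
	const char	*name;
	unsigned char	 info;	/* binding in the high four bits */
	int		 value;	/* stands in for the symbol address */
};

struct sk_module {
	const char		*name;
	const struct sk_sym	*syms;
	size_t			 nsyms;
};

/*
 * Return the first symbol called "name" in one module.  With global_only
 * set, entries whose binding is not STB_GLOBAL are skipped.
 */
const struct sk_sym *
sk_lookup(const struct sk_module *m, const char *name, int global_only)
{
	for (size_t i = 0; i < m->nsyms; i++) {
		const struct sk_sym *s = &m->syms[i];

		if (strcmp(s->name, name) != 0)
			continue;
		if (global_only && SK_ST_BIND(s->info) != SK_STB_GLOBAL)
			continue;
		return (s);
	}
	return (NULL);
}

/* Scan the dependencies in order and take the first hit. */
const struct sk_sym *
sk_resolve(const struct sk_module *deps, size_t ndeps, const char *name,
    int global_only)
{
	assert(deps != NULL && name != NULL);
	for (size_t i = 0; i < ndeps; i++) {
		const struct sk_sym *s = sk_lookup(&deps[i], name, global_only);

		if (s != NULL)
			return (s);
	}
	return (NULL);
}

/* Module-C is scanned first and holds a local "helper"; Module-B exports one. */
static const struct sk_sym c_syms[] = { { "helper", SK_STB_LOCAL << 4, 300 } };
static const struct sk_sym b_syms[] = { { "helper", SK_STB_GLOBAL << 4, 200 } };
const struct sk_module sk_deps[] = {
	{ "Module-C", c_syms, 1 },
	{ "Module-B", b_syms, 1 },
};
```

In this model, sk_resolve(sk_deps, 2, "helper", 0) returns Module-C's local entry because it is found first, while passing 1 for global_only skips it and returns Module-B's global entry.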

1. Am I correct that because of this bug, symbol names on amd64 are effectively required to be unique across all kernel module dependencies?

2. Is there any risk to actually fixing this? I tried to understand all of the potential callers of link_elf_lookup_symbol(), but doing so is not straightforward through all the indirect calls (function pointers, macros). Stated differently: is there any expectation that link_elf_lookup_symbol() should return a local symbol?
Comment 5 Justin Cady 2021-06-09 18:00:28 UTC
If anyone knows the answer to my above questions I would still appreciate it.

But I wanted to share a bit of history I found: an old freebsd-hackers thread titled "Kernel linker and undefined references in KLD":

https://lists.freebsd.org/pipermail/freebsd-hackers/2010-July/032418.html

That thread contains a discussion of this issue along with a proposed patch that apparently was never taken forward. One portion of it is quite similar to the small patch I used to validate my theory (only returning STB_GLOBAL symbols from link_elf_lookup_symbol).