Bug 231402 - textproc/kf5-syntax-highlighting: does not build on systems with VLAN interfaces
Status: Closed FIXED
Alias: None
Product: Ports & Packages
Classification: Unclassified
Component: Individual Port(s)
Version: Latest
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: kde
Duplicates: 231999 232318 233798
Depends on:
Reported: 2018-09-16 15:43 UTC by Ting-Wei Lan
Modified: 2018-12-25 15:07 UTC
bugzilla: maintainer-feedback? (kde)


Description Ting-Wei Lan 2018-09-16 15:43:34 UTC
kf5-syntax-highlighting fails to build with an undefined-symbol error on a FreeBSD 11.2 system with at least one VLAN network interface. I know it is odd for the system's network configuration to affect the build, but that is really what I found after 3 days of debugging. Here are the error messages:

[94/132] cd /tmp/wrkdirs/usr/ports/textproc/kf5-syntax-highlighting/work/.build/data && /tmp/wrkdirs/usr/ports/textproc/kf5-syntax-highlighting/work/.build/bin/katehighlightingindexer /tmp/wrkdirs/usr/ports/textproc/kf5-syntax-highlighting/work/.build/data/index.katesyntax /tmp/wrkdirs/usr/ports/textproc/kf5-syntax-highlighting/work/syntax-highlighting-5.49.0/data/schema/language.xsd /tmp/wrkdirs/usr/ports/textproc/kf5-syntax-highlighting/work/.build/data/syntax-data.qrc
FAILED: data/index.katesyntax 
cd /tmp/wrkdirs/usr/ports/textproc/kf5-syntax-highlighting/work/.build/data && /tmp/wrkdirs/usr/ports/textproc/kf5-syntax-highlighting/work/.build/bin/katehighlightingindexer /tmp/wrkdirs/usr/ports/textproc/kf5-syntax-highlighting/work/.build/data/index.katesyntax /tmp/wrkdirs/usr/ports/textproc/kf5-syntax-highlighting/work/syntax-highlighting-5.49.0/data/schema/language.xsd /tmp/wrkdirs/usr/ports/textproc/kf5-syntax-highlighting/work/.build/data/syntax-data.qrc
/usr/local/lib/qt5/plugins/bearer/libqgenericbearer.so: Undefined symbol "_ZN17QNetworkInterfaceC1ERKS_@Qt_5"
ninja: build stopped: subcommand failed.

I guess this is a memory corruption issue in the Qt5 network module, which may give the kernel a bad pointer and cause the kernel to overwrite data belonging to the runtime linker. The symbol '_ZN17QNetworkInterfaceC1ERKS_' does exist in /usr/local/lib/qt5/libQt5Network.so.5, and /usr/local/lib/qt5/plugins/bearer/libqgenericbearer.so correctly lists libQt5Network.so.5 as a NEEDED dependency, but the runtime linker rejects the symbol in libQt5Network.so.5 when comparing version tags.

Steps to reproduce the problem:

1. Install FreeBSD 11.2 amd64 and download the ports tree. Whether it is a physical machine or a virtual machine doesn't matter.
2. Create a VLAN network interface. It can be done with the command 'ifconfig vlan3 create vlan 3 vlandev re0' where 're0' is your network interface.
3. Make sure the runtime linker /libexec/ld-elf.so.1 is compiled with the -O2 option. This is the default, so you don't have to do anything in this step unless you don't use the binaries distributed by the FreeBSD project.
4. Install textproc/qt5-xmlpatterns port with portmaster.
5. Build textproc/kf5-syntax-highlighting.

It was tested on FreeBSD 11.2-RELEASE-p3 amd64 with ports revision 479821. I could reproduce it on 3 systems (physical machine, virtual machine, jail on virtual machine) and each of them runs on different hardware.

I mentioned qt5-xmlpatterns above because it is an optional dependency of kf5-syntax-highlighting. kf5-syntax-highlighting can be built without problems when qt5-xmlpatterns is not installed, but it also means that it doesn't link to qt5-network. kf5-syntax-highlighting automatically picks up qt5-xmlpatterns during the configure phase and it is qt5-xmlpatterns that causes kf5-syntax-highlighting to load qt5-network during the build.

The following are the results of my debugging. I haven't found the root cause of the problem, but I think these notes may be useful for further debugging.

I started by checking symbol tables of both libqgenericbearer.so and libQt5Network.so.5.

$ pkg which /usr/local/lib/qt5/plugins/bearer/libqgenericbearer.so
/usr/local/lib/qt5/plugins/bearer/libqgenericbearer.so was installed by package qt5-network-5.11.1
$ readelf -aW /usr/local/lib/qt5/plugins/bearer/libqgenericbearer.so
Symbol table (.dynsym) contains 140 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
    69: 0000000000000000    21 FUNC    GLOBAL DEFAULT  UND _ZN17QNetworkInterfaceC1ERKS_@Qt_5 (2)

$ pkg which /usr/local/lib/qt5/libQt5Network.so.5
/usr/local/lib/qt5/libQt5Network.so.5 was installed by package qt5-network-5.11.1
$ readelf -aW /usr/local/lib/qt5/libQt5Network.so.5
Symbol table (.dynsym) contains 2161 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
  1245: 00000000000c7790    21 FUNC    GLOBAL DEFAULT   12 _ZN17QNetworkInterfaceC1ERKS_@@Qt_5 (3)

The plugin links to libQt5Network.so.5 properly:

$ ldd /tmp/wrkdirs/usr/ports/textproc/kf5-syntax-highlighting/work/.build/bin/katehighlightingindexer 
        libQt5XmlPatterns.so.5 => /usr/local/lib/qt5/libQt5XmlPatterns.so.5 (0x800a00000)
        libQt5Network.so.5 => /usr/local/lib/qt5/libQt5Network.so.5 (0x801033000)
        libQt5Core.so.5 => /usr/local/lib/qt5/libQt5Core.so.5 (0x801400000)
        libc++.so.1 => /usr/lib/libc++.so.1 (0x801aec000)

$ ldd /usr/local/lib/qt5/plugins/bearer/libqgenericbearer.so
        libQt5Network.so.5 => /usr/local/lib/qt5/libQt5Network.so.5 (0x80120c000)
        libQt5Core.so.5 => /usr/local/lib/qt5/libQt5Core.so.5 (0x801600000)
        libc++.so.1 => /usr/lib/libc++.so.1 (0x801cec000)

But the program which throws the undefined symbol error, katehighlightingindexer, doesn't link to libqgenericbearer.so. It suggests that libqgenericbearer.so is loaded by calling dlopen.

I set a breakpoint on dlopen in GDB, and yes, it calls it with:
dlopen("/usr/local/lib/qt5/plugins/bearer/libqgenericbearer.so", RTLD_NODELETE | RTLD_LAZY);

The return value of dlopen is correct: the plugin is properly loaded, and the hash of the version entry is 363045.

(gdb) b dlopen
Function "dlopen" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (dlopen) pending.

(gdb) r 1 2 3
Starting program: /tmp/wrkdirs/usr/ports/textproc/kf5-syntax-highlighting/work/.build/bin/katehighlightingindexer 1 2 3
[New LWP 101325 of process 74133]

Thread 1 hit Breakpoint 1, dlopen (name=0x805415498 "/usr/local/lib/qt5/plugins/bearer/libqgenericbearer.so", mode=4097) at /usr/src/libexec/rtld-elf/rtld.c:3193
warning: Source file is more recent than executable.
3193            return (rtld_dlopen(name, -1, mode));

(gdb) finish
Run till exit from #0  dlopen (name=0x805415498 "/usr/local/lib/qt5/plugins/bearer/libqgenericbearer.so", mode=4097) at /usr/src/libexec/rtld-elf/rtld.c:3193
0x000000080165a731 in ?? () from /usr/local/lib/qt5/libQt5Core.so.5
Value returned is $2 = (void *) 0x80067e000

(gdb) p ((Obj_Entry *)(0x80067e000))->vertab[2]
$3 = {hash = 363045, flags = 0, name = 0x807202678 "Qt_5", file = 0x8072025de "libQt5Network.so.5"}
(gdb) p ((Obj_Entry *)(0x80067e000))->path
$8 = 0x800634f40 "/usr/local/lib/qt5/plugins/bearer/libqgenericbearer.so"

The number '2' seems to come from the '(2)' suffix in the output of readelf. I assume it means the version tag used by the symbol has index 2.

(gdb) b _rtld_bind if $_streq(obj->path, "/usr/local/lib/qt5/plugins/bearer/libqgenericbearer.so") && obj->vertab[2].hash != 363045
Breakpoint 3 at 0x80060f907: file /usr/src/libexec/rtld-elf/rtld.c, line 810.

(gdb) c
[Switching to LWP 101325 of process 74133]

Thread 2 hit Breakpoint 3, _rtld_bind (obj=0x80067e000, reloff=1272) at /usr/src/libexec/rtld-elf/rtld.c:810
810         rlock_acquire(rtld_bind_lock, &lockstate);

(gdb) p obj->vertab[2]
$17 = {hash = 32, flags = 0, name = 0x807202678 "Qt_5", file = 0x8072025de "libQt5Network.so.5"}

The value of the hash field of the version entry has changed from 363045 to 32. The value '32' isn't random. I always get the same value here. If you follow the execution of the correct _rtld_bind call, you will find it fails to match the version tag at file /usr/src/libexec/rtld-elf/rtld.c, function matched_symbol, line 4329:

4329                 if (obj->vertab[verndx].hash != req->ventry->hash ||
4330                     strcmp(obj->vertab[verndx].name, req->ventry->name)) {      
4331                         /*
4332                          * Version does not match. Look if this is a
4333                          * global symbol and if it is not hidden. If
4334                          * global symbol (verndx < 2) is available,
4335                          * use it. Do not return symbol if we are
4336                          * called by dlvsym, because dlvsym looks for
4337                          * a specific version and default one is not
4338                          * what dlvsym wants.
4339                          */
4340                         if ((req->flags & SYMLOOK_DLSYM) ||
4341                             (verndx >= VER_NDX_GIVEN) ||
4342                             (obj->versyms[symnum] & VER_NDX_HIDDEN))
4343                                 return (false);
4344                 }

verndx is 2, and req->ventry->hash is 363045. If obj->vertab[2].hash hasn't been modified, the runtime linker will pick this symbol and the execution can continue.

I tried to set a hardware watchpoint on obj->vertab[2].hash in GDB, but it was never hit. I also tried a software watchpoint on the same address, and the result wasn't always the same. Most of the time it ran forever and I interrupted it after a few minutes, but sometimes it stopped at instructions that should not modify memory, such as 'mov r15,QWORD PTR fs:0x10' and 'mov r15,rdi'. Therefore, I thought the hash value was modified by the kernel, but the 'catch syscall' command in GDB didn't seem to work for me: GDB kept printing 'Thread 2 received signal SIGSYS, Bad system call.' and made the program behave abnormally. I decided to use DTrace to track the hash value changes instead:

# dtrace -n 'syscall:::entry, syscall:::return /pid == 99608/ { printf("%s %u ==> %x %x %x %x", probefunc, *(unsigned int *)copyin(0x801242230, 4), arg0, arg1, arg2, arg3); }'

dtrace: description 'syscall:::entry, syscall:::return ' matched 2168 probes
CPU     ID                    FUNCTION:NAME
  1  80243                      ioctl:entry ioctl 363045 ==> 8 c0306938 7fffdfffd770 0
  1  80244                     ioctl:return ioctl 32 ==> 0 0 0 0

0x801242230 was the address of the hash variable obtained from GDB. It seems it was an 'ioctl(8, SIOCGIFMEDIA, 0x7fffdfffd730)' call that changed the value. 8 was a socket file descriptor created by calling 'socket(PF_INET, SOCK_DGRAM | SOCK_CLOEXEC, 0)'. 0x7fffdfffd730 looked like a pointer on the stack, as 'procstat -v' said this region grew down. I stopped debugging here and temporarily removed the VLAN interface with 'ifconfig vlan3 destroy' to let portmaster upgrade kf5-syntax-highlighting and hundreds of other ports for me.

The conclusion is that I probably have to read the code of qt5-network in order to figure out what really happens. I found three ways to work around the problem on affected systems:

1. Remove all VLAN interfaces, which may not be possible if your networking environment requires them.
2. Use Clang 6 shipped with FreeBSD base to recompile /libexec/ld-elf.so.1 with -O1, -O0, or -DDEBUG.
3. Use GCC 8 from ports to recompile /libexec/ld-elf.so.1 with -O0. Using -O1 or -DDEBUG doesn't help when using GCC.

In fact, I didn't replace /libexec/ld-elf.so.1 on the system because it is risky. I did the test by either running the compiled ld-elf.so.1 under /usr/src/libexec/rtld-elf directly as an executable or modifying the interpreter path stored in the katehighlightingindexer executable with the 'patchelf --set-interpreter' command.
Comment 1 Adriaan de Groot freebsd_committer 2018-09-16 20:57:06 UTC
The vlans seem like a red herring to me (unless networking is loading genericbearer based on the presence of vlans .. which seems really peculiar). However, kf5-syntax-highlighting *should not* depend on xmlpatterns and should not change its build based on its presence. So there's some configuration that needs wrangling there at the very least.
Comment 2 Ting-Wei Lan 2018-09-17 01:15:57 UTC
kf5-syntax-highlighting prints these messages at the end of configure phase:

-- The following OPTIONAL packages have been found:

 * Qt5Widgets
   Example application.
 * Qt5XmlPatterns
   Compile-time validation of syntax definition files.

libqgenericbearer.so is always loaded and _ZN17QNetworkInterfaceC1ERKS_ is always called when katehighlightingindexer is linked to qt5-xmlpatterns.

When there is no VLAN interface, the hash value isn't modified and the symbol can be found successfully. When I set a breakpoint on _ZN17QNetworkInterfaceC1ERKS_, it hits multiple times before it exits.

When there is a VLAN interface, the hash value is modified and the symbol cannot be found. I set breakpoints on both _ZN17QNetworkInterfaceC1ERKS_@plt in libqgenericbearer.so and _ZN17QNetworkInterfaceC1ERKS_ in libQt5Network.so.5. Only the PLT one is hit because _rtld_bind calls rtld_die, which causes the program to exit early.
Comment 3 Yan Batyuto 2018-09-24 22:34:39 UTC
Yes, I have the same issue.
Vlans are configured and kf5-syntax-highlighting does not build.
Comment 4 Ricardo 2018-09-26 17:47:06 UTC
(In reply to Yan Batyuto from comment #3)
I've seen something similar (but with lumina and telegram-desktop). Using VLANs I get:

# telegram-desktop

Got keys from plugin meta data ("generic")
QFactoryLoader::QFactoryLoader() checking directory path "/usr/local/bin/bearer" ...
loaded library "/usr/local/lib/qt5/plugins/bearer/libqgenericbearer.so"
/usr/local/lib/qt5/plugins/bearer/libqgenericbearer.so: Undefined symbol "_ZN17QNetworkInterfaceC1ERKS_@Qt_5"

# start-lumina-desktop


[Lumina] Checking User Files
 - Old Version: "1.4.1"
 - Current Version: "1.4.1"
 - Made Changes: false
Finished with user files check
QFactoryLoader::QFactoryLoader() checking directory path "/usr/local/lib/qt5/plugins/accessiblebridge" ...
QFactoryLoader::QFactoryLoader() checking directory path "/usr/local/bin/accessiblebridge" ...
qt.qpa.xcb: QXcbConnection: XCB error: 148 (Unknown), sequence: 206, resource id: 0, major code: 140 (Unknown), minor code: 20
Got Desktop Process Finished: 1
Finished Closing Down Lumina
QLibraryPrivate::unload succeeded on "/usr/local/lib/qt5/plugins/platforminputcontexts/libcomposeplatforminputcontextplugin.so" 
QLibraryPrivate::unload succeeded on "/usr/local/lib/qt5/plugins/xcbglintegrations/libqxcb-glx-integration.so" 
QLibraryPrivate::unload succeeded on "/usr/local/lib/qt5/plugins/platforms/libqxcb.so" 
QLibraryPrivate::unload succeeded on "Xcursor" 

# lumina-desktop (just to see the qt5_debug_plugins)
Got keys from plugin meta data ("generic")
QFactoryLoader::QFactoryLoader() checking directory path "/usr/local/bin/bearer" ...
loaded library "/usr/local/lib/qt5/plugins/bearer/libqgenericbearer.so"
/usr/local/lib/qt5/plugins/bearer/libqgenericbearer.so: Undefined symbol "_ZN17QNetworkInterfaceC1ERKS_@Qt_5"
Exit 1

If I destroy the vlans everything works.
Comment 5 Ricardo 2018-09-26 17:49:55 UTC
(In reply to Ricardo from comment #4)

Forgot to mention, it's a stable/11 (FreeBSD 11.2-STABLE #2 r338902)
Comment 6 Adriaan de Groot freebsd_committer 2018-10-02 14:18:02 UTC
We can "workaround" with kf5-syntax-highlighting by blocking Qt5Network, which will turn off some build-time schema-validation. That's build-time validation that happens elsewhere anyway, so it doesn't really buy us anything but extra dependencies and build time. This will get it to build again.

*But* the bigger issue is that underneath, in Qt5Networking, there's a problem with VLANs on FreeBSD. That is what actually needs to be sorted out.
Comment 7 Yuri Victorovich freebsd_committer 2018-10-06 21:45:51 UTC
*** Bug 231999 has been marked as a duplicate of this bug. ***
Comment 8 Adriaan de Groot freebsd_committer 2018-10-07 19:43:33 UTC
Any Qt5 application with networking (e.g. quassel, kmail, falkon ...) can be crashed by creating a vlan while the application is running; once there's a vlan the application no longer starts. It all comes down to the missing symbols / mismatched symbols as Ting-Wei Lan has described.

I've started doing *some* debugging, but it's a giant pain in the butt and I don't understand the symbol versioning very well. readelf(1) tells me there are @@Qt_5 symbols and @Qt_5 symbols (one @ or two) and that *seems* to be related.
Comment 9 Marek Zarychta 2018-10-16 14:52:24 UTC
*** Bug 232318 has been marked as a duplicate of this bug. ***
Comment 10 Ting-Wei Lan 2018-10-17 14:18:03 UTC
(In reply to Adriaan de Groot from comment #8)
Note that there are no missing or mismatched symbols here. Both libraries are built correctly and symbols should be successfully resolved. It is the memory corruption issue that overwrites the data of the runtime linker, causing it to reject the symbol early before comparing strings.

If I understand correctly, both @ and @@ are used to denote the version tag. In addition, @@ means it is the default version and the build time linker should choose it if object files don't specify a version. Therefore, undefined @Qt_5 symbol should be resolved to the @@Qt_5 at runtime if there is no memory issue and it is the case when there is no VLAN interface on the system.
Comment 11 Adriaan de Groot freebsd_committer 2018-10-18 12:01:51 UTC
Ting-Wei Lan, could you file a bug against ld, then? This isn't going to get fixed by us staring at the code of Qt5Network.
Comment 12 commit-hook freebsd_committer 2018-10-18 12:20:08 UTC
A commit references this bug:

Author: adridg
Date: Thu Oct 18 12:19:58 UTC 2018
New revision: 482342
URL: https://svnweb.freebsd.org/changeset/ports/482342

  Workaround textprof/kf5-syntax-hightlighting build failure.

  (library) Qt5Network crashes in the presence of VLANs. This terminates
  the build when the host build process runs applications that touch
  the network -- which happens during schema validation, which is done
  if the host has XmlPatters installed. Workaround by ignoring XmlPatterns.

  Underlying problem (Qt5Network and VLANs) has not been addressed.

  PR:		231402
  Reported by:	Ting-Wei Lan

Comment 13 Ting-Wei Lan 2018-10-18 12:58:11 UTC
(In reply to commit-hook from comment #12)
It doesn't look like an issue of ld or rtld to me. Both ld and rtld do the right thing, and it is an ioctl call on a socket file descriptor that modifies the internal data structure of rtld. I still believe the problem is in Qt5Network itself, but I haven't spent time debugging the issue further.
Comment 14 Adriaan de Groot freebsd_committer 2018-12-21 13:17:29 UTC
A **workaround** is to add QT_EXCLUDE_GENERIC_BEARER=1 to your environment.
Comment 15 Adriaan de Groot freebsd_committer 2018-12-21 22:05:53 UTC
*** Bug 233798 has been marked as a duplicate of this bug. ***
Comment 16 Adriaan de Groot freebsd_committer 2018-12-24 14:25:06 UTC
I've been debug-chasing this for a few days in an 11.2 VM. The goal is to allow genericbearer to load -- that is, the environment-variable workaround should not be necessary. As Ting-Wei Lan pointed out originally, everything looks like memory corruption **somewhere**.

 - removing the call to get the interfaceFromIndex(0) fixes the problem
 - I found a spot in QNetworkInterface where adding qWarning() << "foo" fixes the problem

When built with WITH_DEBUG=yes I get crashes (SEGV), rather than unresolved symbols: another hint that it's memory corruption. In any case: *because* this is corrupting memory from a Qt-internal method that is listing network interfaces, I would like to fix the root cause rather than working around things.
Comment 17 Adriaan de Groot freebsd_committer 2018-12-24 16:33:51 UTC
*** Bug 232318 has been marked as a duplicate of this bug. ***
Comment 18 commit-hook freebsd_committer 2018-12-24 16:46:21 UTC
A commit references this bug:

Author: adridg
Date: Mon Dec 24 16:46:18 UTC 2018
New revision: 488276
URL: https://svnweb.freebsd.org/changeset/ports/488276

  Fix net/qt5-network in the face of VLANs.

  Adding a VLAN to a FreeBSD system caused memory corruption -- usually
  enough to make rtld fall over with symbol resolution errors, although
  in DEBUG builds it would just crash. Revamp network interface discovery
  to not be full of memory gotcha's.

  An explanation is included in the patches. While here, "make makesum"
  has moved some files around.

  PR:		231402, 233798, 232318
  Reported by:	Ting-Wei Lan, Nils Beyer, Marek Zarychta