Bug 235604 - ports-mgmt/pkg: bus error / segmentation fault (core dumped)
Status: New
Alias: None
Product: Ports & Packages
Classification: Unclassified
Component: Individual Port(s)
Version: Latest
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: freebsd-pkg mailing list
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-02-08 16:33 UTC by Oliver Fromme
Modified: 2019-04-19 14:53 UTC
CC List: 2 users

See Also:
Flags: bugzilla: maintainer-feedback? (pkg)


Attachments
dmesg of the machine (5.62 KB, text/plain)
2019-02-08 16:50 UTC, Oliver Fromme
no flags

Description Oliver Fromme freebsd_committer 2019-02-08 16:33:39 UTC
On one of my machines, ports-mgmt/pkg randomly dumps core.
Simple commands like "pkg info" or "pkg which" seem to work without problems, but when installing ports it often (though not always) crashes with either SIGBUS or SIGSEGV. The crash seems to happen after the package itself has been installed, but before its dependencies are recorded in the database (packages without dependencies do not appear to trigger it). As a result, the dependency tree in the pkg database on that machine is now broken. There doesn't seem to be an easy way to fix that, short of removing all ports and starting from scratch, which I will do once the core-dump problem is fixed.

I am pretty certain that it is not a hardware problem. The pkg binary is the *only* program that exhibits the problem. I can run buildworld, buildkernel or other CPU/IO hogs for weeks without any issue. There's one special thing about the machine, however: it is a 64-bit VM hosted on KVM. I'm not sure whether that matters, though. I do have complete access to the VM, but not to the host system.

The problem already existed when the machine ran FreeBSD 10. I hoped an upgrade would fix it, so I first went to stable/11, then to stable/12 (also updating pkg along the way, of course), but nothing changed. I'm now at stable/12 r342861 (2019-01-08) with the latest pkg-1.10.5_5.

I can reliably reproduce a SIGBUS with "pkg audit":

$ /usr/local/sbin/pkg audit
0 problem(s) in the installed packages found.
Child process pid=45723 terminated abnormally: Bus error

$ lldb -f /usr/local/sbin/pkg -c pkg.core -bobt
(lldb) target create "/usr/local/sbin/pkg" --core "pkg.core"
Core file '/home/olli/pkg.core' (x86_64) was loaded.
(lldb) bt
* thread #1, name = 'pkg', stop reason = signal SIGBUS
  * frame #0: 0x0000000800334fe0 libpkg.so.4`___lldb_unnamed_symbol608$$libpkg.so.4 + 112
    frame #1: 0x0000000800340200 libpkg.so.4`___lldb_unnamed_symbol660$$libpkg.so.4 + 320
    frame #2: 0x00000008002f06b4 libpkg.so.4`pkg_shutdown + 36
    frame #3: 0x0000000800b524e5 libc.so.7`__cxa_finalize(dso=0x0000000000000000) at atexit.c:239
    frame #4: 0x0000000800ae20e1 libc.so.7`exit(status=0) at exit.c:74
    frame #5: 0x0000000000214122 pkg`___lldb_unnamed_symbol1$$pkg + 290

So it appears that the bus error occurs in the exit path (pkg_shutdown, run from exit() via __cxa_finalize), after pkg has finished its actual work. I am currently not in a position to debug it any further myself. I have uploaded the core file, the binary and the libraries here:
http://inof.de/FreeBSD/pr/20190208/

Let me know if you need more information. I could probably also give login access to that machine to a FreeBSD developer if needed.
Comment 1 Oliver Fromme freebsd_committer 2019-02-08 16:50:27 UTC
Created attachment 201846
dmesg of the machine
Comment 2 Oliver Fromme freebsd_committer 2019-03-25 15:28:29 UTC
Here's another backtrace, this time while installing libxml2. I compiled pkg with -O0 -g, hoping that would make debugging a little easier. It doesn't change how often the bus error occurs, though.

====> Compressing man pages (compress-man)
===>  Building package for libxml2-2.9.8
Child process pid=80214 terminated abnormally: Bus error
*** Error code 1

Stop.
make[1]: stopped in /usr/ports/textproc/libxml2
*** Error code 1

Stop.
make: stopped in /usr/ports/textproc/libxml2
                
# lldb -f /usr/local/sbin/pkg-static -c pkg-static.core -bobt
(lldb) target create "/usr/local/sbin/pkg-static" --core "pkg-static.core"
Core file '/usr/ports/textproc/libxml2/pkg-static.core' (x86_64) was loaded.
(lldb) bt
* thread #1, name = 'pkg-static', stop reason = signal SIGBUS
  * frame #0: 0x000000000045595a pkg-static`ucl_hash_destroy(hashlin=0x000000080091d240, func=(pkg-static`ucl_object_dtor_unref at ucl_util.c:204)) at ucl_hash.c:229
    frame #1: 0x0000000000468737 pkg-static`ucl_object_free_internal(obj=0x0000000800908040, allow_rec=true, dtor=(pkg-static`ucl_object_dtor_unref at ucl_util.c:204)) at ucl_util.c:243
    frame #2: 0x00000000004696dd pkg-static`ucl_object_unref(obj=0x0000000800908040) at ucl_util.c:3283
    frame #3: 0x00000000003dc74e pkg-static`pkg_shutdown at pkg_config.c:1334
    frame #4: 0x00000000008c571f pkg-static`__cxa_finalize(dso=0x0000000000000000) at atexit.c:239
    frame #5: 0x000000000085d4be pkg-static`exit(status=0) at exit.c:74
    frame #6: 0x0000000000396122 pkg-static`_start(ap=<unavailable>, cleanup=<unavailable>) at crt1.c:76
Comment 3 Oliver Fromme freebsd_committer 2019-03-25 19:23:26 UTC
OK, I found a few minutes and rebuilt the whole thing with jemalloc debugging and the malloc option "junk:true".

The crash (SIGBUS) happens in work/pkg-1.10.5/external/libucl/src/ucl_hash.c
Line 229, in function ucl_hash_destroy():

225            for (k = kh_begin (h); k != kh_end (h); ++k) {
226                    if (kh_exist (h, k)) {
227                            cur = (kh_value (h, k)).obj;
228                            while (cur != NULL) {
229                ---->               tmp = cur->next;
230                                    func (__DECONST (ucl_object_t *, cur));
231                                    cur = tmp;
232                            }
233                    }
234            }

(lldb) print cur
(const ucl_object_t *) $0 = 0x5a5a5a5a5a5a5a5a

So it is dereferencing memory that has already been freed: with junk filling enabled, jemalloc fills deallocated memory with 0x5a (fresh, uninitialized allocations are filled with 0xa5 instead).

Next I tried the malloc options "junk:false,zero:true".
Result:  No crash.

However, some time later I tried installing libxml2, and pkg(8) crashed again at exactly the same location, even with the malloc option "zero:true". Now we have:

(lldb) print cur
(const ucl_object_t *) $0 = 0x6c6d7862696c3a74

That looks suspiciously like ASCII characters. Read most significant byte first, the value spells "lmxbil:t"; since amd64 is little-endian, the bytes as stored in memory are "t:libxml". Evidently the pointer has been overwritten with part of a string.
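To double-check this kind of decoding, here is a small C sketch that prints a 64-bit value byte by byte, least significant byte first, which is the order in which it sits in memory on little-endian amd64. The helper names decode_le_bytes() and show_pointer() are hypothetical and not part of pkg or libucl; the sketch assumes 8-byte pointers and substitutes '.' for non-printable bytes.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: decode a 64-bit value in little-endian memory
 * order (least significant byte first), as it is stored on amd64.
 * Non-printable bytes are replaced with '.' so the result is safe to
 * print. `out` must have room for 8 characters plus the terminator.
 */
static void
decode_le_bytes(uint64_t value, char out[9])
{
	assert(out != NULL);
	for (int i = 0; i < 8; i++) {
		/* Shift the i-th byte down and truncate to 8 bits. */
		unsigned char c = (unsigned char)(value >> (8 * i));
		out[i] = (c >= 0x20 && c < 0x7f) ? (char)c : '.';
	}
	out[8] = '\0';
}

/* Hypothetical helper: print a suspicious pointer next to its bytes. */
static void
show_pointer(uint64_t value)
{
	char buf[9];

	decode_le_bytes(value, buf);
	printf("0x%016llx -> \"%s\"\n", (unsigned long long)value, buf);
}
```

With these helpers, show_pointer(0x6c6d7862696c3a74) prints `0x6c6d7862696c3a74 -> "t:libxml"`, and the junk value 0x5a5a5a5a5a5a5a5a seen earlier decodes to "ZZZZZZZZ", since 0x5a is ASCII 'Z'. Extracting bytes with shifts, rather than reinterpreting the value as a char array, gives the same output regardless of the byte order of the machine running the sketch.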

At this point I could really use some help from someone who is a little more familiar with the source code. This is taking way too much time for me.

This problem is preventing me from putting this machine into production. Being able to properly install and update packages is crucial. If I can't get this fixed, I'll have to try to install DragonFly or NetBSD.
Comment 4 Oliver Fromme freebsd_committer 2019-04-03 11:14:29 UTC
In a desperate attempt to move forward, I disabled all calls to munmap() and free() in the source code of pkg. The resulting binary does NOT crash anymore. I can finally install and update ports again reliably.
Comment 5 Baptiste Daroussin freebsd_committer 2019-04-19 13:54:50 UTC
Have you tried a recent pkg-devel? I have updated libucl there; that might fix this issue.
Comment 6 Oliver Fromme freebsd_committer 2019-04-19 14:53:16 UTC
(In reply to Baptiste Daroussin from comment #5)
Thanks for letting me know!  I will give it a try next week after the holidays.