Bug 251257 - sysutils/fluent-bit: Fix for SIGSEGV with SEGV_MAPERR code for the version 1.6.4
Summary: sysutils/fluent-bit: Fix for SIGSEGV with SEGV_MAPERR code for the version 1.6.4
Status: Closed Overcome By Events
Alias: None
Product: Ports & Packages
Classification: Unclassified
Component: Individual Port(s)
Version: Latest
Hardware: Any Any
Importance: --- Affects Only Me
Assignee: Palle Girgensohn
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-11-19 15:53 UTC by Artyom Davidov
Modified: 2020-11-24 23:34 UTC

See Also:
bugzilla: maintainer-feedback? (girgen)


Attachments
Patch for the ports tree to build fixed version from upstream master branch (1.68 KB, patch)
2020-11-19 15:53 UTC, Artyom Davidov

Description Artyom Davidov 2020-11-19 15:53:25 UTC
Created attachment 219815
Patch for the ports tree to build fixed version from upstream master branch

Fluent-bit 1.6.4 has some memory allocation problems that were fixed in the upstream master branch but, as of this writing, are not included in any tagged release.
The attached patch builds fluent-bit at commit b7c7081 from the GitHub master branch, which seems to include a fix for this problem.

Since I usually avoid using untagged upstream code in non-development versions of ports, I think that this patch should not be committed to the FreeBSD ports tree.
It is provided only to help others who are facing this issue quickly build a patched version using their own copy of the FreeBSD ports tree.


The truss output for this issue looks similar to this:

write(10,"o\0\0\0\0\0\0\0\^[[1m[\^[[0m2020"...,4096) = 4096 (0x1000)
select(14,{ 7 9 11 13 },{ },0x0,0x0)             = 1 (0x1)
read(9,"o\0\0\0\0\0\0\0\^[[1m[\^[[0m2020"...,4096) = 4096 (0x1000)
SIGNAL 11 (SIGSEGV) code=SEGV_MAPERR trapno=12 addr=0x800a1b028
sigprocmask(SIG_SETMASK,{ SIGSEGV },0x0)         = 0 (0x0)
write(2,"\^[[1m[\^[[0m2020/11/17 02:19:56"...,111) = 111 (0x6f)
write(2,"[2020/11/17 02:19:56] ",22)             = 22 (0x16)
write(2,"[engine] caught signal (",24)           = 24 (0x18)
write(2,"SIGSEGV)\n",9)                          = 9 (0x9)
write(2,"ERROR: no stack trace because un"...,62) = 62 (0x3e)
sigprocmask(SIG_SETMASK,{ SIGHUP|SIGINT|SIGQUIT|SIGILL|SIGTRAP|SIGEMT|SIGFPE|SIGKILL|SIGBUS|SIGSEGV|SIGSYS|SIGPIPE|SIGALRM|SIGTERM|SIGURG|SIGSTOP|SIGTSTP|SIGCONT|SIGCHLD|SIGTTIN|SIGTTOU|SIGIO|SIGXCPU|SIGXFSZ|SIGVTALRM|SIGPROF|SIGWINCH|SIGINFO|SIGUSR1|SIGUSR2 },0x0) = 0 (0x0)
thr_self(0x801ccc1e0)                            = 0 (0x0)
thr_kill(100821,SIGABRT)                         = 0 (0x0)
SIGNAL 6 (SIGABRT) code=SI_LWP pid=47105 uid=0
select(14,{ 7 9 11 13 },{ },0x0,0x0)             ERR#4 'Interrupted system call'
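
For anyone checking whether their own crash matches this one, the SIGNAL lines are the part to compare: SEGV_MAPERR means the faulting address was not mapped, and the SIGABRT that follows is raised by the process itself. The Python sketch below pulls those lines out of a saved truss log. It is only an illustration; the function name parse_signals and the sample text are assumptions made for this example, and the pattern was written against the line format shown above rather than against any truss specification.

```python
import re

# Matches truss signal lines such as:
#   SIGNAL 11 (SIGSEGV) code=SEGV_MAPERR trapno=12 addr=0x800a1b028
SIGNAL_RE = re.compile(r"^SIGNAL (\d+) \((\w+)\)\s*(.*)$")


def parse_signals(truss_text):
    """Return one dict per SIGNAL line found in truss output (hypothetical helper)."""
    signals = []
    for line in truss_text.splitlines():
        match = SIGNAL_RE.match(line.strip())
        if match is None:
            continue  # ordinary system call line
        number, name, rest = match.groups()
        # The remainder is a run of key=value fields (code=, addr=, pid=, ...).
        fields = dict(item.split("=", 1) for item in rest.split() if "=" in item)
        signals.append({"number": int(number), "name": name, **fields})
    return signals


# Sample text shortened from the log in this report.
SAMPLE = """\
select(14,{ 7 9 11 13 },{ },0x0,0x0)             = 1 (0x1)
SIGNAL 11 (SIGSEGV) code=SEGV_MAPERR trapno=12 addr=0x800a1b028
thr_kill(100821,SIGABRT)                         = 0 (0x0)
SIGNAL 6 (SIGABRT) code=SI_LWP pid=47105 uid=0
"""

for sig in parse_signals(SAMPLE):
    print(sig["name"], sig.get("code"), sig.get("addr"))
```

A log that shows SIGSEGV with code=SEGV_MAPERR followed by a SIGABRT delivered via thr_kill has the same shape as the one in this report, though that alone does not prove the same root cause.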

The lldb backtrace looks similar to this:
lldb -c ./fluent-bit.core -- fluent-bit
(lldb) target create "fluent-bit" --core "./fluent-bit.core"
Core file '/home/ard/work/flb164/fluent-bit.core' (x86_64) was loaded.
(lldb) thread backtrace all
* thread #1, name = 'fluent-bit', stop reason = signal SIGABRT
  * frame #0: 0x000000080131f0fa
    frame #1: 0x000000080131f0c4
  thread #2, name = 'fluent-bit', stop reason = signal SIGABRT
    frame #0: 0x000000080133ebca
    frame #1: 0x000000080103c9dc
  thread #3, name = 'fluent-bit', stop reason = signal SIGABRT
    frame #0: 0x00000008013ad8da
    frame #1: 0x000000080103cd92
    frame #2: 0x00000000006d3198 fluent-bit`___lldb_unnamed_symbol2349$$fluent-bit + 200
    frame #3: 0x000000000043692a fluent-bit`___lldb_unnamed_symbol9$$fluent-bit + 58
    frame #4: 0x000000000044bf60 fluent-bit`___lldb_unnamed_symbol20$$fluent-bit + 192
  thread #4, name = 'fluent-bit', stop reason = signal SIGABRT
    frame #0: 0x000000080133ebca
    frame #1: 0x000000080103c9dc
    frame #2: 0x00000000006d28f8 fluent-bit`cio_stream_delete + 8
  thread #5, name = 'fluent-bit', stop reason = signal SIGABRT
    frame #0: 0x000000080133ebca
    frame #1: 0x000000080103c9dc
    frame #2: 0x00000000006d28f8 fluent-bit`cio_stream_delete + 8
  thread #6, name = 'fluent-bit', stop reason = signal SIGABRT
    frame #0: 0x000000080133ebca
    frame #1: 0x000000080103c9dc
    frame #2: 0x00000000006d28f8 fluent-bit`cio_stream_delete + 8
  thread #7, name = 'fluent-bit', stop reason = signal SIGABRT
    frame #0: 0x000000080133ebca
    frame #1: 0x000000080103c9dc
    frame #2: 0x00000000006d28f8 fluent-bit`cio_stream_delete + 8
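
Most frames in the backtrace above carry no symbol name, so it can help to list only the frames that do. The Python sketch below does that for a saved copy of the lldb output. It is an illustration under assumptions: the helper name symbolized_frames is made up for this example, and the regular expressions were written against the frame format shown above only.

```python
import re
from collections import Counter

THREAD_RE = re.compile(r"thread #(\d+)")
# A symbolized frame looks like:
#   frame #2: 0x00000000006d28f8 fluent-bit`cio_stream_delete + 8
FRAME_RE = re.compile(r"frame #\d+: 0x[0-9a-f]+ (\S+?)`(.+?)(?: \+ \d+)?$")


def symbolized_frames(backtrace_text):
    """Map thread number -> list of (module, symbol) for frames that have a symbol."""
    frames = {}
    current = None
    for line in backtrace_text.splitlines():
        thread = THREAD_RE.search(line)
        if thread:
            current = int(thread.group(1))
            frames[current] = []
            continue
        frame = FRAME_RE.search(line.strip())
        if frame and current is not None:
            frames[current].append(frame.groups())
    return frames


# Sample text shortened from the backtrace in this report.
SAMPLE = """\
  thread #3, name = 'fluent-bit', stop reason = signal SIGABRT
    frame #0: 0x00000008013ad8da
    frame #2: 0x00000000006d3198 fluent-bit`___lldb_unnamed_symbol2349$$fluent-bit + 200
  thread #4, name = 'fluent-bit', stop reason = signal SIGABRT
    frame #0: 0x000000080133ebca
    frame #2: 0x00000000006d28f8 fluent-bit`cio_stream_delete + 8
"""

frames = symbolized_frames(SAMPLE)
# Count real symbol names, skipping lldb's placeholders for stripped code.
named = Counter(
    symbol
    for thread_frames in frames.values()
    for _, symbol in thread_frames
    if not symbol.startswith("___lldb_unnamed_symbol")
)
print(named)
```

Placeholders of the form ___lldb_unnamed_symbolN show up where lldb has no name for a function in the stripped binary, so a tally like this says little about where the crash originates.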
Comment 1 Artyom Davidov 2020-11-19 16:03:45 UTC
I will close this PR once a newer tagged release (1.6.5 or 1.7.0) that includes these upstream fixes is available.

For now I would NOT recommend committing this patch to the ports tree.
We should wait for the new release from the fluent-bit authors.

The fluent-bit version provided by this patch should only be used by someone who faces the same SIGSEGV SEGV_MAPERR issue.
Comment 2 Palle Girgensohn freebsd_committer 2020-11-19 16:11:12 UTC
Hi,

Isn't there a single isolated commit from master that we could add as a patch? Or is it unclear which commit fixed it?

If we cannot isolate the fix as a patch, could we nag the upstream crew to actually release it?
Comment 3 Artyom Davidov 2020-11-19 16:33:41 UTC
(In reply to Palle Girgensohn from comment #2)
Hi Palle,

Several memory allocation problems were fixed in the upstream code after the 1.6.4 release, and it is hard to tell which fix addresses this particular one.

I'm still not sure whether the upstream code actually has a fix for this issue, since it occurs rarely in our environment.
For now I can only say that the fluent-bit version built with this patch hasn't crashed after 12 hours of uptime under load.

If it turns out that this version also crashes with SIGSEGV and the same SEGV_MAPERR code, I'll build a debug version of fluent-bit and we'll try to figure out where it actually crashes.
Comment 4 Artyom Davidov 2020-11-19 16:57:57 UTC
To make things a little clearer: at present we don't have a core file from a build that includes debug information. The truss log and the backtrace that I provided earlier are from the stripped "Release" build of the fluent-bit port, version 1.6.4_1, and cannot be used for debugging.
I should also note that these are from FreeBSD 11.4 amd64. It is possible that this issue does not affect FreeBSD 12+ users, since a different event notification mechanism is used there: on FreeBSD 11.4 fluent-bit falls back to select(), whereas on FreeBSD 12+ it uses kqueue().
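
To illustrate the difference between the two notification mechanisms mentioned above, here is a minimal Python sketch that waits for a descriptor to become readable, using kqueue where the platform offers it and select() otherwise. This is not fluent-bit code and makes no claim about how fluent-bit picks its backend; the function name wait_readable and the pipe used as test input are assumptions made for this example.

```python
import os
import select


def wait_readable(fd, timeout=1.0):
    """Wait until fd is readable; return the mechanism used, or None on timeout."""
    if hasattr(select, "kqueue"):
        # BSD and macOS: register a read filter with a kqueue and wait on it.
        kq = select.kqueue()
        try:
            event = select.kevent(
                fd, filter=select.KQ_FILTER_READ, flags=select.KQ_EV_ADD
            )
            ready = kq.control([event], 1, timeout)
        finally:
            kq.close()
        return "kqueue" if ready else None
    # Portable fallback: select() over an explicit set of descriptors.
    readable, _, _ = select.select([fd], [], [], timeout)
    return "select" if readable else None


read_end, write_end = os.pipe()
os.write(write_end, b"log line\n")
print(wait_readable(read_end))
os.close(read_end)
os.close(write_end)
```

The select() branch corresponds to the select(14,{ 7 9 11 13 },...) calls visible in the truss log, where the set of descriptors is passed on every call; a kqueue-based loop takes a different code path, which is why the two platforms may not behave the same.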
Comment 5 Artyom Davidov 2020-11-20 16:42:16 UTC
Building from the master branch didn't help much.
We still get SIGSEGVs, although they are less frequent.
The problem seems to be in the mpack.c code or in the data that is passed to it.
The good news is that all the SIGSEGVs were identical and occurred at the same place in the fluent-bit code.
I've opened an issue in the upstream GitHub repository and attached the lldb output to it.
https://github.com/fluent/fluent-bit/issues/2794