Bug 250825

Summary: sysutils/fluent-bit: 1.6.2 SIGSEGV on start
Product: Ports & Packages
Reporter: pete
Component: Individual Port(s)
Assignee: Palle Girgensohn <girgen>
Status: Closed FIXED
Severity: Affects Only Me
CC: ard_1, girgen, yuripv
Priority: ---
Flags: bugzilla: maintainer-feedback? (girgen)
Version: Latest   
Hardware: Any   
OS: Any   
Attachments:
Description | Flags
Proposed patch for src/flb_sheduler.c | none
Proposed patch for inclusion in the ports tree | none
reproducer | none
possible fix | none
Proposed patch for inclusion in the ports tree | none
Patch for the src/flb_scheduler.c for the upstream | none
Patch for the lib/flb_libco/settings.h for the upstream | none
Patch for the lib/monkey/deps/flb_libco/settings.h for the upstream | none
Proposed patch for inclusion in the ports tree | ard_1: maintainer-approval? (girgen)
Patch for the src/flb_scheduler.c for the upstream | none

Description pete 2020-11-03 04:16:00 UTC
Running fluent-bit on both 12.2-RELEASE and 13-CURRENT I am getting a SIGSEGV after starting fluent-bit:

$  fluent-bit -v -i dummy -o stdout
Fluent Bit v1.6.2
* Copyright (C) 2019-2020 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2020/11/03 04:01:21] [ info] Configuration:
[2020/11/03 04:01:21] [ info]  flush time     | 5.000000 seconds
[2020/11/03 04:01:21] [ info]  grace          | 5 seconds
[2020/11/03 04:01:21] [ info]  daemon         | 0
[2020/11/03 04:01:21] [ info] ___________
[2020/11/03 04:01:21] [ info]  inputs:
[2020/11/03 04:01:21] [ info]      dummy
[2020/11/03 04:01:21] [ info] ___________
[2020/11/03 04:01:21] [ info]  filters:
[2020/11/03 04:01:21] [ info] ___________
[2020/11/03 04:01:21] [ info]  outputs:
[2020/11/03 04:01:21] [ info]      syslog.0
[2020/11/03 04:01:21] [ info] ___________
[2020/11/03 04:01:21] [ info]  collectors:
[2020/11/03 04:01:21] [ info] [engine] started (pid=1131)
[2020/11/03 04:01:21] [debug] [engine] coroutine stack size: 24576 bytes (24.0K)
[2020/11/03 04:01:21] [debug] [storage] [cio stream] new stream registered: dummy.0
[2020/11/03 04:01:21] [ info] [storage] version=1.0.6, initializing...
[2020/11/03 04:01:21] [ info] [storage] in-memory
[2020/11/03 04:01:21] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2020/11/03 04:01:21] [ info] [output:syslog:syslog.0] setup done for 127.0.0.1:514
[2020/11/03 04:01:21] [debug] [router] default match rule dummy.0:syslog.0
[2020/11/03 04:01:21] [ info] [sp] stream processor started
[2020/11/03 04:01:22] [error] [src/flb_scheduler.c:52 errno=0] No error: 0
[2020/11/03 04:01:23] [error] [src/flb_scheduler.c:52 errno=0] No error: 0
[2020/11/03 04:01:24] [error] [src/flb_scheduler.c:52 errno=0] No error: 0
[2020/11/03 04:01:25] [error] [src/flb_scheduler.c:52 errno=0] No error: 0
[2020/11/03 04:01:26] [error] [src/flb_scheduler.c:52 errno=0] No error: 0
[2020/11/03 04:01:26] [debug] [task] created task=0x800b910a0 id=0 OK
[2020/11/03 04:01:26] [engine] caught signal (SIGSEGV)
ERROR: no stack trace because unwind library not available (0)Abort trap (core dumped)


I seem to get the same error regardless of which input and output method I use.  

Running through truss seems to show that it can't find some NLS files - but I'm not sure if that's a red herring (my locale is en_US:UTF-8):

 1117: fstatat(AT_FDCWD,"/usr/share/nls/C/libc.cat",0x7fffdfffcba0,0x0) ERR#2 'No such file or directory'
 1117: fstatat(AT_FDCWD,"/usr/share/nls/libc/C",0x7fffdfffcba0,0x0) ERR#2 'No such file or directory'
 1117: fstatat(AT_FDCWD,"/usr/local/share/nls/C/libc.cat",0x7fffdfffcba0,0x0) ERR#2 'No such file or directory'
 1117: fstatat(AT_FDCWD,"/usr/local/share/nls/libc/C",0x7fffdfffcba0,0x0) ERR#2 'No such file or directory'

I've looked at src/flb_scheduler.c and line 52 is pretty uninteresting:
 45     /* We need to consume the byte */
 46     ret = flb_pipe_r(fd, &val, sizeof(val));
 47 #ifdef __APPLE__
 48     if (ret < 0) {
 49 #else
 50     if (ret <= 0) {
 51 #endif
 52         flb_errno();
 53         return -1;
 54     }


Let me know if more information is needed. I'm building a local pkg via poudriere now to see if I can reproduce this or get additional debug info.
Comment 1 Palle Girgensohn freebsd_committer freebsd_triage 2020-11-03 12:13:04 UTC
Is it the same with the new 1.6.3?
Comment 2 Palle Girgensohn freebsd_committer freebsd_triage 2020-11-03 12:14:19 UTC
ah sorry, mixed up versions... 1.6.2 is the latest.

and this is on amd64?
Comment 3 Palle Girgensohn freebsd_committer freebsd_triage 2020-11-03 12:18:10 UTC
(In reply to Palle Girgensohn from comment #2)

1.6.3 is indeed committed, my tree was not up to date... perhaps that latest version helps?
Comment 4 pete 2020-11-03 16:31:12 UTC
(In reply to Palle Girgensohn from comment #3)
Hey there, I am seeing the same behavior on v1.6.3 built last night from the ports tree.

Also, all of my test systems are amd64.
Comment 5 pete 2020-11-05 04:18:22 UTC
I have filed a GitHub issue with the upstream dev team to see if they can shed any light on this.
Comment 6 Palle Girgensohn freebsd_committer freebsd_triage 2020-11-05 07:38:48 UTC
(In reply to pete from comment #5)
Excellent!
Comment 7 pete 2020-11-05 17:10:20 UTC
Realized I forgot to link to the upstream GitHub issue:
https://github.com/fluent/fluent-bit/issues/2747
Comment 8 Artyom Davidov 2020-11-12 21:56:13 UTC
(In reply to pete from comment #0)
Hello Pete!

The code at src/flb_scheduler.c line 52 is actually more interesting than it looks.
It gives us a clue on how to fix the issue that we are facing on FreeBSD.
I think this is the same issue that was fixed for macOS in commit https://github.com/fluent/fluent-bit/commit/e8f1d813daf28ef44e60d9e549f80350b2f9e34e#diff-dea67b525613b1123b600f13482893441818000a0cd67ec795ab1a694b09c3dc

The original macOS bug report is very similar to the issue that you've described; please take a look at:
https://github.com/fluent/fluent-bit/issues/2460

So, as a test, we can replace __APPLE__ with __FreeBSD__ at line 47 and check whether the problem persists.
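
To illustrate the proposed test, here is a minimal sketch (not the actual fluent-bit source) that isolates the platform-dependent check. The names consume_failed and consume_byte are hypothetical, uint64_t is an assumed stand-in for the value type, and plain read() stands in for flb_pipe_r(); only the __APPLE__ and __FreeBSD__ macros and the "<" versus "<=" comparison come from the code quoted in comment #0. The sketch keeps the macOS case and adds FreeBSD next to it rather than replacing it.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/types.h>   /* ssize_t */
#include <unistd.h>      /* read() */

/*
 * Hypothetical helper, not part of fluent-bit: decides whether the
 * return value of the read should be treated as an error.
 * On macOS (and, per this proposal, FreeBSD) a return of 0 is tolerated.
 */
int consume_failed(ssize_t ret)
{
#if defined(__APPLE__) || defined(__FreeBSD__)
    return ret < 0;
#else
    return ret <= 0;
#endif
}

/* Consume one wake-up value from fd; returns 0 on success, -1 on error. */
int consume_byte(int fd)
{
    uint64_t val;   /* assumed value type, for illustration only */
    ssize_t ret;

    assert(fd >= 0);
    ret = read(fd, &val, sizeof(val));
    return consume_failed(ret) ? -1 : 0;
}
```

With a check like this, a read that returns 0 would no longer be reported as an error on FreeBSD, while a negative return still would be.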
Comment 9 Artyom Davidov 2020-11-13 00:36:49 UTC
Created attachment 219617 [details]
Proposed patch for src/flb_sheduler.c

This patch can be applied directly to the src/flb_scheduler.c file, or it can be placed in the ports/sysutils/fluent-bit/files directory so that it is applied during the port build.
!Needs testing!
I haven't tested the effect of this patch myself due to lack of time, sorry.
Comment 10 Artyom Davidov 2020-11-13 00:51:09 UTC
Created attachment 219618 [details]
Proposed patch for inclusion in the ports tree

This is the patch for this issue, intended for inclusion in the ports tree.
!Needs testing!
Comment 11 Artyom Davidov 2020-11-13 01:13:19 UTC
I've quickly tested this patch against v1.6.4 (see bug https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=251085) and it seems to apply cleanly.
Also, I was unable to crash fluent-bit using the same command-line options as Pete.

# fluent-bit -v -i dummy -o stdout
Fluent Bit v1.6.4
* Copyright (C) 2019-2020 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2020/11/13 04:04:58] [ info] Configuration:
[2020/11/13 04:04:58] [ info]  flush time     | 5.000000 seconds
[2020/11/13 04:04:58] [ info]  grace          | 5 seconds
[2020/11/13 04:04:58] [ info]  daemon         | 0
[2020/11/13 04:04:58] [ info] ___________
[2020/11/13 04:04:58] [ info]  inputs:
[2020/11/13 04:04:58] [ info]      dummy
[2020/11/13 04:04:58] [ info] ___________
[2020/11/13 04:04:58] [ info]  filters:
[2020/11/13 04:04:58] [ info] ___________
[2020/11/13 04:04:58] [ info]  outputs:
[2020/11/13 04:04:58] [ info]      stdout.0
[2020/11/13 04:04:58] [ info] ___________
[2020/11/13 04:04:58] [ info]  collectors:
[2020/11/13 04:04:58] [ info] [engine] started (pid=79114)
[2020/11/13 04:04:58] [debug] [engine] coroutine stack size: 24576 bytes (24.0K)
[2020/11/13 04:04:58] [debug] [storage] [cio stream] new stream registered: dummy.0
[2020/11/13 04:04:58] [ info] [storage] version=1.0.6, initializing...
[2020/11/13 04:04:58] [ info] [storage] in-memory
[2020/11/13 04:04:58] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2020/11/13 04:04:58] [debug] [router] default match rule dummy.0:stdout.0
[2020/11/13 04:04:58] [ info] [sp] stream processor started
[2020/11/13 04:05:03] [debug] [task] created task=0x801c8b0a0 id=0 OK
[0] dummy.0: [1605229499.189905399, {"message"=>"dummy"}]
[1] dummy.0: [1605229500.262302939, {"message"=>"dummy"}]
[2] dummy.0: [1605229501.289807063, {"message"=>"dummy"}]
[3] dummy.0: [1605229502.344936137, {"message"=>"dummy"}]
[2020/11/13 04:05:03] [debug] [task] destroy task=0x801c8b0a0 (task_id=0)
[0] dummy.0: [1605229503.356045579, {"message"=>"dummy"}]
[2020/11/13 04:05:08] [debug] [task] created task=0x801c8b0a0 id=0 OK
[1] dummy.0: [1605229504.389743018, {"message"=>"dummy"}]
[2] dummy.0: [1605229505.462317117, {"message"=>"dummy"}]
[3] dummy.0: [1605229506.489793583, {"message"=>"dummy"}]
[4] dummy.0: [1605229507.562278913, {"message"=>"dummy"}]
[2020/11/13 04:05:08] [debug] [task] destroy task=0x801c8b0a0 (task_id=0)
Comment 12 pete 2020-11-13 05:01:38 UTC
(In reply to Artyom Davidov from comment #11)
Interesting, which version of FreeBSD are you running this on, Artyom?  I am running into the same SIGSEGV on my 13-CURRENT amd64 system with the patch to the ports tree applied.

I'd be interested to see if this pkg also causes faults on your end:
https://www.nomadlogic.org/fluent-bit-1.6.3.tgz
Comment 13 Artyom Davidov 2020-11-13 06:40:53 UTC
(In reply to pete from comment #12)
Hello Pete,
I forgot to mention that I tested it on FreeBSD 11.4 amd64.
I just tried to reproduce this problem on an unpatched build of fluent-bit 1.6.3 and didn't succeed. So it looks like fluent-bit doesn't suffer from this particular SIGSEGV problem on 11.4 amd64.
Since I don't have any newer FreeBSD version at hand right now, I won't be able to quickly reproduce this SIGSEGV and test the patch. =(

So I would like to ask you to check whether the patch has been successfully applied to the ports tree. It should create a file called "patch-src_flb__scheduler.c" in the "ports/sysutils/fluent-bit/files/" directory.
Also, after running "make" in the "ports/sysutils/fluent-bit/" directory, the corresponding src/flb_scheduler.c file should have a corrected #ifdef near line 47.

If the patch was applied successfully and it doesn't help, and you are still experiencing the same SIGSEGV in your environment, then we should investigate this problem further.
Comment 14 Artyom Davidov 2020-11-13 10:40:48 UTC
(In reply to Artyom Davidov from comment #13)
I've managed to bring up a small FreeBSD 12.2-RELEASE amd64 test machine,
and I was able to reproduce this issue.
The patch provided earlier fixes one of the two problems: it fixes errors like
"[2020/11/03 04:01:22] [error] [src/flb_scheduler.c:52 errno=0] No error: 0"
but it doesn't help with the SIGSEGV that occurs later.
So there are indeed two problems: one that is similar to the macOS problem and is resolved by the provided patch, and another one, the SIGSEGV, that requires further investigation.
Comment 15 Yuri Pankov 2020-11-13 19:27:45 UTC
To add a bit more info, rebuilding fluent-bit with debug symbols provides the following backtrace, where we crash in libco's co_swap_function(), which is a bit over my head at first look:

#0  thr_kill () at thr_kill.S:4
#1  0x00000008007f7ac4 in __raise (s=s@entry=6) at /usr/src/lib/libc/gen/raise.c:52
#2  0x00000008008ab5d9 in abort () at /usr/src/lib/libc/stdlib/abort.c:67
#3  0x00000000003485bc in flb_signal_handler (signal=11)
    at /usr/ports/sysutils/fluent-bit/work/fluent-bit-1.6.3/src/fluent-bit.c:418
#4  0x0000000800730b90 in handle_signal (actp=actp@entry=0x7fffdfffcec0, sig=sig@entry=11,
    info=info@entry=0x7fffdfffd2b0, ucp=ucp@entry=0x7fffdfffcf40) at /usr/src/lib/libthr/thread/thr_sig.c:303
#5  0x000000080073015f in thr_sighandler (sig=11, info=0x7fffdfffd2b0, _ucp=0x7fffdfffcf40)
    at /usr/src/lib/libthr/thread/thr_sig.c:246
#6  <signal handler called>
#7  0x00000000002df580 in co_swap_function ()
#8  0x000000000050968e in co_switch (handle=0x8016c7640)
    at /usr/ports/sysutils/fluent-bit/work/fluent-bit-1.6.3/lib/flb_libco/amd64.c:158
#9  0x000000000036579d in output_params_set (th=0x8016091c0, data=0x801686980, bytes=104, tag=0x801641020 "dummy.0",
    tag_len=7, i_ins=0x800e58000, out_plugin=0x800e2dfc0, out_context=0x8016051e0, config=0x800e19180)
    at /usr/ports/sysutils/fluent-bit/work/fluent-bit-1.6.3/include/fluent-bit/flb_output.h:429
#10 flb_output_thread (task=0x8016650a0, i_ins=0x800e58000, o_ins=0x800e5b000, config=0x800e19180, buf=0x801686980,
    size=104, tag=0x801641020 "dummy.0", tag_len=7)
    at /usr/ports/sysutils/fluent-bit/work/fluent-bit-1.6.3/include/fluent-bit/flb_output.h:522
#11 tasks_start (in=0x800e58000, config=0x800e19180)
    at /usr/ports/sysutils/fluent-bit/work/fluent-bit-1.6.3/src/flb_engine_dispatch.c:190
#12 0x00000000003650a1 in flb_engine_dispatch (id=0, in=0x800e58000, config=0x800e19180)
    at /usr/ports/sysutils/fluent-bit/work/fluent-bit-1.6.3/src/flb_engine_dispatch.c:293
#13 0x0000000000362950 in flb_engine_flush (config=0x800e19180, in_force=0x0)
    at /usr/ports/sysutils/fluent-bit/work/fluent-bit-1.6.3/src/flb_engine.c:85
#14 0x0000000000362fbe in flb_engine_handle_event (fd=20, mask=1, config=0x800e19180)
    at /usr/ports/sysutils/fluent-bit/work/fluent-bit-1.6.3/src/flb_engine.c:292
#15 flb_engine_start (config=0x800e19180) at /usr/ports/sysutils/fluent-bit/work/fluent-bit-1.6.3/src/flb_engine.c:559
#16 0x000000000034b75c in flb_lib_worker (data=0x800e19180)
    at /usr/ports/sysutils/fluent-bit/work/fluent-bit-1.6.3/src/flb_lib.c:488
#17 0x000000080072774b in thread_start (curthread=0x800e12500) at /usr/src/lib/libthr/thread/thr_create.c:292
Comment 16 pete 2020-11-13 19:37:43 UTC
(In reply to Yuri Pankov from comment #15)
I started a thread in freebsd-questions@ where I think we ended up at the same conclusion:
https://lists.freebsd.org/pipermail/freebsd-questions/2020-November/292045.html

We dug a bit deeper but I wasn't able to make too much progress.  One thing I've noticed is in 12.0 there was an update to our pthread implementation:
https://svnweb.freebsd.org/base?view=revision&revision=337992
Comment 17 Palle Girgensohn freebsd_committer freebsd_triage 2020-11-13 19:58:55 UTC
I applied the patch and I still get the same SIGSEGV

Fluent Bit v1.6.4
* Copyright (C) 2019-2020 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2020/11/13 19:55:06] [ info] Configuration:
[2020/11/13 19:55:06] [ info]  flush time     | 5.000000 seconds
[2020/11/13 19:55:06] [ info]  grace          | 5 seconds
[2020/11/13 19:55:06] [ info]  daemon         | 0
[2020/11/13 19:55:06] [ info] ___________
[2020/11/13 19:55:06] [ info]  inputs:
[2020/11/13 19:55:06] [ info]      dummy
[2020/11/13 19:55:06] [ info] ___________
[2020/11/13 19:55:06] [ info]  filters:
[2020/11/13 19:55:06] [ info] ___________
[2020/11/13 19:55:06] [ info]  outputs:
[2020/11/13 19:55:06] [ info]      stdout.0
[2020/11/13 19:55:06] [ info] ___________
[2020/11/13 19:55:06] [ info]  collectors:
[2020/11/13 19:55:06] [ info] [engine] started (pid=8907)
[2020/11/13 19:55:06] [debug] [engine] coroutine stack size: 24576 bytes (24.0K)
[2020/11/13 19:55:06] [debug] [storage] [cio stream] new stream registered: dummy.0
[2020/11/13 19:55:06] [ info] [storage] version=1.0.6, initializing...
[2020/11/13 19:55:06] [ info] [storage] in-memory
[2020/11/13 19:55:06] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2020/11/13 19:55:06] [debug] [router] default match rule dummy.0:stdout.0
[2020/11/13 19:55:06] [ info] [sp] stream processor started
[2020/11/13 19:55:07] [error] [src/flb_scheduler.c:52 errno=0] No error: 0
[2020/11/13 19:55:08] [error] [src/flb_scheduler.c:52 errno=0] No error: 0
[2020/11/13 19:55:09] [error] [src/flb_scheduler.c:52 errno=0] No error: 0
[2020/11/13 19:55:10] [error] [src/flb_scheduler.c:52 errno=0] No error: 0
[2020/11/13 19:55:11] [debug] [task] created task=0x800fc80a0 id=0 OK
[2020/11/13 19:55:11] [engine] caught signal (SIGSEGV)
ERROR: no stack trace because unwind library not available (0)Abort (core dumped)



---



root@122-amd64-default:/wrkdirs/usr/ports/sysutils/fluent-bit/work/stage/usr/local/bin # lldb --core  *core -- fluent-bit
(lldb) target create "fluent-bit" --core "fluent-bit.core"
Core file '/wrkdirs/usr/ports/sysutils/fluent-bit/work/stage/usr/local/bin/fluent-bit.core' (x86_64) was loaded.
(lldb) bt
* thread #1, name = 'fluent-bit', stop reason = signal SIGABRT
  * frame #0: 0x00000008008a3c2a libc.so.7`__sys_thr_kill + 10
    frame #1: 0x00000008008a2084 libc.so.7`__raise + 52
    frame #2: 0x0000000800818279 libc.so.7`abort + 73
    frame #3: 0x00000000003266b4 fluent-bit`___lldb_unnamed_symbol6$$fluent-bit + 324
    frame #4: 0x00000008006d4b70 libthr.so.3`___lldb_unnamed_symbol101$$libthr.so.3 + 208
    frame #5: 0x00000008006d413f libthr.so.3`___lldb_unnamed_symbol82$$libthr.so.3 + 319
    frame #6: 0x00007ffffffff193
    frame #7: 0x0000000000336f61 fluent-bit`flb_engine_start + 977
    frame #8: 0x0000000000327e92 fluent-bit`___lldb_unnamed_symbol10$$fluent-bit + 34
    frame #9: 0x00000008006cefac libthr.so.3`___lldb_unnamed_symbol1$$libthr.so.3 + 348
Comment 18 Yuri Pankov 2020-11-13 21:22:55 UTC
Another observation (of questionable worth): replacing amd64.c with ucontext.c (i.e. using the "slow" pthread implementation) in lib/flb_libco/ makes it work, so we at least have the (quoting ucontext.c) "last resort" workaround for the moment.
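
For readers unfamiliar with what a ucontext-based switch involves, here is a minimal, independent sketch using the POSIX getcontext()/makecontext()/swapcontext() calls. It is not libco's ucontext.c; run_coroutine, co_entry, and STACK_SIZE are made-up names for illustration. The idea is the same, though: the C library saves and restores the execution context, instead of the hand-written machine code that amd64.c jumps into.

```c
#include <assert.h>
#include <stdlib.h>
#include <ucontext.h>

#define STACK_SIZE (64 * 1024)   /* arbitrary size chosen for this sketch */

static ucontext_t main_ctx;      /* saved context of the caller */
static ucontext_t co_ctx;        /* saved context of the coroutine */
static int steps;                /* incremented each time the coroutine runs */

/* Coroutine body: do one unit of work, then yield back to the caller. */
static void co_entry(void)
{
    for (;;) {
        steps++;
        swapcontext(&co_ctx, &main_ctx);
    }
}

/* Resume the coroutine n times; returns the number of steps it executed,
 * or -1 if the context could not be initialized. */
int run_coroutine(int n)
{
    void *stack = malloc(STACK_SIZE);
    int i;

    assert(stack != NULL);
    steps = 0;
    if (getcontext(&co_ctx) != 0) {
        free(stack);
        return -1;
    }
    co_ctx.uc_stack.ss_sp = stack;
    co_ctx.uc_stack.ss_size = STACK_SIZE;
    co_ctx.uc_link = &main_ctx;
    makecontext(&co_ctx, co_entry, 0);

    for (i = 0; i < n; i++) {
        swapcontext(&main_ctx, &co_ctx);   /* switch into the coroutine */
    }
    free(stack);   /* safe here: the coroutine is suspended and never resumed */
    return steps;
}
```

Each swapcontext() call goes through the C library and saves a full context, which is presumably why this approach is considered the slow fallback, but it does not depend on where the linker places a blob of machine code.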
Comment 19 Artyom Davidov 2020-11-13 21:54:37 UTC
(In reply to Palle Girgensohn from comment #17)
Hello Palle,
On FreeBSD 12.2-RELEASE amd64 with the patch applied, I get the following output from fluent-bit:

# fluent-bit -v -i dummy -o stdout
Fluent Bit v1.6.4
* Copyright (C) 2019-2020 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2020/11/14 03:04:44] [ info] Configuration:
[2020/11/14 03:04:44] [ info]  flush time     | 5.000000 seconds
[2020/11/14 03:04:44] [ info]  grace          | 5 seconds
[2020/11/14 03:04:44] [ info]  daemon         | 0
[2020/11/14 03:04:44] [ info] ___________
[2020/11/14 03:04:44] [ info]  inputs:
[2020/11/14 03:04:44] [ info]      dummy
[2020/11/14 03:04:44] [ info] ___________
[2020/11/14 03:04:44] [ info]  filters:
[2020/11/14 03:04:44] [ info] ___________
[2020/11/14 03:04:44] [ info]  outputs:
[2020/11/14 03:04:44] [ info]      stdout.0
[2020/11/14 03:04:44] [ info] ___________
[2020/11/14 03:04:44] [ info]  collectors:
[2020/11/14 03:04:44] [ info] [engine] started (pid=24965)
[2020/11/14 03:04:44] [debug] [engine] coroutine stack size: 24576 bytes (24.0K)
[2020/11/14 03:04:44] [debug] [storage] [cio stream] new stream registered: dummy.0
[2020/11/14 03:04:44] [ info] [storage] version=1.0.6, initializing...
[2020/11/14 03:04:44] [ info] [storage] in-memory
[2020/11/14 03:04:44] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2020/11/14 03:04:44] [debug] [router] default match rule dummy.0:stdout.0
[2020/11/14 03:04:44] [ info] [sp] stream processor started
[2020/11/14 03:04:49] [2020/11/14 03:04:49] [debug] [task] created task=0x800b9a0a0 id=0 OK
[engine] caught signal (SIGSEGV)
ERROR: no stack trace because unwind library not available (0)Abort (core dumped)

So this patch fixes only one of the two issues: it gets rid of the errors that look like the following one:
[2020/11/13 19:55:07] [error] [src/flb_scheduler.c:52 errno=0] No error: 0

It does not help with the SIGSEGV issue.
Comment 20 Yuri Pankov 2020-11-13 23:00:37 UTC
Created attachment 219657 [details]
reproducer

I have reduced the problematic code to the attached reproducer: timpl1 contains the same assembly as clang/gcc produce for timpl2().  Calling timpl2 via tfunc works; calling timpl1 via tfunc causes a segfault.  Running the same code on illumos works for both cases.  Sadly I don't have any 11.x system, so if anyone could confirm that both work there, it would be great.
Comment 21 Artyom Davidov 2020-11-13 23:11:13 UTC
(In reply to pete from comment #16)
Pete, I've followed the thread that you started in freebsd-questions@ and I would like to clear up some things about fluent-bit on FreeBSD 11.4 amd64.
It is true that I was unable to reproduce this particular SIGSEGV on that FreeBSD version, and in fact we're running a number of fluent-bit hosts under production workload. But I cannot say that fluent-bit is rock-solid on FreeBSD 11.4 amd64.
Though it doesn't SIGSEGV on start-up, it can SIGSEGV sporadically under load. It can SIGSEGV several times in a row, and after that work for hours without any issues.
I don't know if the increased SIGSEGV rate was due to bugs in version 1.6.3 that were fixed in 1.6.4, but as far as I remember, those sporadic crashes have existed since Palle added the port to the FreeBSD ports tree, although their rate was significantly lower than what we've faced with 1.6.3.
Comment 22 Yuri Pankov 2020-11-13 23:20:23 UTC
Created attachment 219658 [details]
possible fix

The attached patch fixes the segfault for me. It turns out clang was putting the object in the wrong section (.text# instead of .text). I'm still not entirely sure why there is a (possible) difference with 11.x.
Comment 23 Artyom Davidov 2020-11-13 23:25:44 UTC
(In reply to Yuri Pankov from comment #20)
Yuri, I've tested the reproducer on FreeBSD 11.4-RELEASE-p4 amd64 and I've got the following output:

# ./reproducer
calling timpl2 via tfunc
done
calling timpl1 via tfunc
done

The file was compiled with simple
cc -o reproducer t.c

The compiler version string is
FreeBSD clang version 10.0.0 (git@github.com:llvm/llvm-project.git llvmorg-10.0.0-0-gd32170dbd5b)
Target: x86_64-unknown-freebsd11.4
Thread model: posix
InstalledDir: /usr/bin
Comment 24 Yuri Pankov 2020-11-13 23:28:33 UTC
(In reply to Artyom Davidov from comment #23)
Thank you.  So somehow using ".text#" as section name to put binary code in works on 11.x and does not work on 12.x+.
Comment 25 Artyom Davidov 2020-11-13 23:57:48 UTC
(In reply to Yuri Pankov from comment #22)
I've just tested this fix on FreeBSD 12.2-RELEASE amd64 and got the following output:

 # fluent-bit -v -i dummy -o stdout
Fluent Bit v1.6.4
* Copyright (C) 2019-2020 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2020/11/14 05:49:13] [ info] Configuration:
[2020/11/14 05:49:13] [ info]  flush time     | 5.000000 seconds
[2020/11/14 05:49:13] [ info]  grace          | 5 seconds
[2020/11/14 05:49:13] [ info]  daemon         | 0
[2020/11/14 05:49:13] [ info] ___________
[2020/11/14 05:49:13] [ info]  inputs:
[2020/11/14 05:49:13] [ info]      dummy
[2020/11/14 05:49:13] [ info] ___________
[2020/11/14 05:49:13] [ info]  filters:
[2020/11/14 05:49:13] [ info] ___________
[2020/11/14 05:49:13] [ info]  outputs:
[2020/11/14 05:49:13] [ info]      stdout.0
[2020/11/14 05:49:13] [ info] ___________
[2020/11/14 05:49:13] [ info]  collectors:
[2020/11/14 05:49:13] [ info] [engine] started (pid=13229)
[2020/11/14 05:49:13] [debug] [engine] coroutine stack size: 24576 bytes (24.0K)
[2020/11/14 05:49:13] [debug] [storage] [cio stream] new stream registered: dummy.0
[2020/11/14 05:49:13] [ info] [storage] version=1.0.6, initializing...
[2020/11/14 05:49:13] [ info] [storage] in-memory
[2020/11/14 05:49:13] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2020/11/14 05:49:13] [debug] [router] default match rule dummy.0:stdout.0
[2020/11/14 05:49:13] [ info] [sp] stream processor started
[2020/11/14 05:49:18] [debug] [task] created task=0x800b9a0a0 id=0 OK
[0] dummy.0: [1605322154.225916035, {"message"=>"dummy"}]
[1] dummy.0: [1605322155.224833287, {"message"=>"dummy"}]
[2] dummy.0: [1605322156.225633565, {"message"=>"dummy"}]
[3] dummy.0: [1605322157.225622246, {"message"=>"dummy"}]
[4] dummy.0: [1605322158.225389385, {"message"=>"dummy"}]
[2020/11/14 05:49:18] [debug] [task] destroy task=0x800b9a0a0 (task_id=0)
^C[2020/11/14 05:49:20] [engine] caught signal (SIGINT)
[0] dummy.0: [1605322159.225520247, {"message"=>"dummy"}]
[2020/11/14 05:49:20] [debug] [task] created task=0x800b9a0a0 id=0 OK
[2020/11/14 05:49:20] [ warn] [engine] service will stop in 5 seconds
[2020/11/14 05:49:20] [debug] [task] destroy task=0x800b9a0a0 (task_id=0)
[2020/11/14 05:49:20] [debug] [input chunk] dummy.0 is paused, cannot append records
[2020/11/14 05:49:21] [debug] [input chunk] dummy.0 is paused, cannot append records
[2020/11/14 05:49:22] [debug] [input chunk] dummy.0 is paused, cannot append records
[2020/11/14 05:49:23] [debug] [input chunk] dummy.0 is paused, cannot append records
[2020/11/14 05:49:24] [debug] [input chunk] dummy.0 is paused, cannot append records
[2020/11/14 05:49:25] [ info] [engine] service stopped
[2020/11/14 05:49:25] [  Error] kevent: Bad file descriptor, errno=9 at /usr/ports/sysutils/fluent-bit/work/fluent-bit-1.6.4/lib/monkey/mk_core/mk_event_kqueue.c:144

The SIGSEGV is gone, but signal handling in fluent-bit is still somewhat strange.
Comment 26 Artyom Davidov 2020-11-14 00:05:34 UTC
(In reply to Yuri Pankov from comment #22)
I wonder if this patch should be made more FreeBSD-specific by adding an #elif section to it, so that it could be merged upstream by the fluent-bit authors?
Or is this change unlikely to do any harm to other OSes or architectures?
Comment 27 Yuri Pankov 2020-11-14 00:20:05 UTC
(In reply to Artyom Davidov from comment #26)
That I'm not sure about. It should be OK for ports for now, but for upstream there could be a separate FreeBSD check, yes.

What I currently see is that using '.text#' on -CURRENT produces 2 sections (via `elfdump -a`):

entry: 11
        sh_name: .text#
        sh_type: SHT_PROGBITS
        sh_flags: SHF_ALLOC

entry: 14
        sh_name: .text
        sh_type: SHT_PROGBITS
        sh_flags: SHF_ALLOC|SHF_EXECINSTR

Note the SHF_EXECINSTR in the second one.  So either we have stricter checks in 12+, or the toolchain in 11.x mends ".text#" to just ".text". At least I'm seeing the latter with gcc9 from ports, so likely with GNU ld as well. Does 11.x still use ld.bfd?
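
To make the flag difference concrete, here is a small sketch that classifies a section by its sh_flags value. The struct and function names are hypothetical; the flag values are the standard ones from the ELF specification (the same values <elf.h> defines as SHF_WRITE, SHF_ALLOC, and SHF_EXECINSTR), redefined locally so the snippet does not depend on a particular system header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Section flag bits as defined by the ELF specification. */
#define SK_SHF_WRITE     0x1u   /* section is writable at run time */
#define SK_SHF_ALLOC     0x2u   /* section occupies memory at run time */
#define SK_SHF_EXECINSTR 0x4u   /* section contains executable instructions */

/* Hypothetical, simplified view of a section header entry. */
struct section_info {
    const char *name;
    uint64_t flags;
};

/* Returns 1 if the section is both loaded and marked executable, else 0. */
int can_hold_code(const struct section_info *sec)
{
    assert(sec != NULL);
    return (sec->flags & SK_SHF_ALLOC) && (sec->flags & SK_SHF_EXECINSTR);
}
```

Applied to the two entries above, ".text#" (SHF_ALLOC only) is not marked as code, while ".text" (SHF_ALLOC|SHF_EXECINSTR) is. Whether jumping into the former actually faults depends on how the linker maps sections into segments, which is not shown here.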
Comment 28 Artyom Davidov 2020-11-14 01:08:03 UTC
(In reply to Yuri Pankov from comment #27)
>does 11.x use ld.bfd still?

I'm not quite sure, but I guess so:
# ls -l /usr/bin/ld*
-r-xr-xr-x  2 root  wheel   1678680 Oct 21 20:44 /usr/bin/ld
-r-xr-xr-x  2 root  wheel   1678680 Oct 21 20:44 /usr/bin/ld.bfd
-r-xr-xr-x  1 root  wheel  43921472 Oct 21 20:44 /usr/bin/ld.lld
-r-xr-xr-x  1 root  wheel     11632 Oct 21 20:44 /usr/bin/ldd
-r-xr-xr-x  1 root  wheel     18996 Oct 21 20:44 /usr/bin/ldd32
# locate ld.bfd
/usr/bin/ld.bfd
Comment 29 Yuri Pankov 2020-11-14 01:17:57 UTC
(In reply to Artyom Davidov from comment #28)
That explains why it works on 11.x. Another option to fix the port would be to use ld from devel/binutils, though I'm not sure how to pass that to CMake.

And it looks like .text# has a special meaning for gcc/GNU ld after all; using just .text makes gcc9 complain as follows:

/tmp//ccppyMZz.s: Assembler messages:
/tmp//ccppyMZz.s:3: Warning: ignoring changed section attributes for .text

So we can't simply apply the change upstream, and would really need a special FreeBSD/clang/lld case if we go with patching libco.  BTW, it looks like there is another instance of libco in the fluent-bit source, so if it's used, it would need to be patched as well.
Comment 30 Artyom Davidov 2020-11-15 01:42:07 UTC
(In reply to Yuri Pankov from comment #29)
Digging deeper into this issue and the reasons why the "#" symbol was added to the section name in the libco code, I've found this gcc-help@ mailing list discussion that explains why it is done:
https://gcc.gnu.org/legacy-ml/gcc-help/2010-09/msg00088.html
So this really is a "gcc-only" hack to trick it into not modifying the .text section attributes.
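
As a rough illustration of what a compiler-specific case could look like, the sketch below selects the section name by compiler. This is not the attached patch and not libco's actual settings.h; CO_SECTION_NAME and co_text_section_name are hypothetical names, and the code only builds the name string rather than placing anything in a section. Note that clang also defines __GNUC__, so __clang__ has to be checked first.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical macro: builds the section name a libco-style section()
 * attribute would request. With gcc the trailing "#" is kept, since (as
 * I understand the linked discussion) the assembler treats it as the
 * start of a comment; with clang the plain name is used.
 */
#if defined(__clang__)
  #define CO_SECTION_NAME(name) "." #name
#else
  #define CO_SECTION_NAME(name) "." #name "#"
#endif

/* Returns the section name the macro above requests for "text". */
const char *co_text_section_name(void)
{
    const char *name = CO_SECTION_NAME(text);

    assert(strncmp(name, ".text", 5) == 0);
    return name;
}
```

In a real header the string would be passed to __attribute__((section(...))) on the array that holds the machine code; whether to key the check on the compiler, the linker, or the OS is exactly the open question for an upstream patch.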
Comment 31 Artyom Davidov 2020-11-15 02:17:23 UTC
(In reply to Yuri Pankov from comment #27)
I've looked through the corresponding elfdump for the fluent-bit binary built on FreeBSD 11.4 amd64 and I can confirm that it also has two ".text" sections in it.
One is the standard .text section and the other one is .text#

entry: 13
        sh_name: .text
        sh_type: SHT_PROGBITS
        sh_flags: SHF_ALLOC|SHF_EXECINSTR
        sh_addr: 0x432fb0
        sh_offset: 208816
        sh_size: 2768424
        sh_link: 0
        sh_info: 0
        sh_addralign: 16
        sh_entsize: 0
entry: 16
        sh_name: .text#
        sh_type: SHT_PROGBITS
        sh_flags: SHF_ALLOC
        sh_addr: 0x7851e0
        sh_offset: 3690976
        sh_size: 4096
        sh_link: 0
        sh_info: 0
        sh_addralign: 16
        sh_entsize: 0

Taking into account the information that I posted in comment 30, it is now clear that a binary produced by a compiler other than gcc would be incorrect, or in other words, would have a structure different from what the libco developers expected.
I guess this could also be the reason for the instability of the fluent-bit executable on the FreeBSD 11.x series.
It now looks to me like we should patch this libco issue on all supported FreeBSD versions that use clang, in order to produce correct fluent-bit executables.
Comment 32 Artyom Davidov 2020-11-15 03:57:55 UTC
So, to summarize this SIGSEGV issue: the problem arises from the GCC-only hack that the libco developers use to put data in the .text section of the resulting binary. When combined with the clang/llvm compiler, it leads to an incorrect binary with an additional ".text#" section, which in turn leads to undesired behavior.
This issue is not FreeBSD-specific and could affect any OS that uses a compiler other than GCC.

I'll prepare patches for libco and the FreeBSD ports tree, which could also be merged upstream, in a couple of days, after I've tested them.
Comment 33 Artyom Davidov 2020-11-16 23:49:01 UTC
Created attachment 219749 [details]
Proposed patch for inclusion in the ports tree

This is a patch for the FreeBSD ports tree that fixes two issues.
The first one is the flb_pipe_r() function returning 0 on FreeBSD 12+, which brings up errors similar to "[error] [src/flb_scheduler.c:52 errno=0] No error: 0". This patch fixes it the same way it was fixed for macOS in the upstream code.
The second issue is a SIGSEGV caused by the incorrect section structure of the binary generated by clang, due to some gcc-specific code in libco.
Since there are two copies of the libco (flb_libco) code in the fluent-bit sources, one in the lib/flb_libco folder and another one in lib/monkey/deps/flb_libco, both are patched the same way.

These patches were tested on FreeBSD 11.4 and 12.2 amd64.
The resulting binaries were checked with "elfdump -a" to contain only one ".text" section, and their startup logs were checked by running "fluent-bit -v -i dummy -o stdout".
Comment 34 Artyom Davidov 2020-11-16 23:53:40 UTC
(In reply to Artyom Davidov from comment #33)
I forgot to mention that this patch also bumps PORTREVISION and pets portlint (I've replaced the spaces with a tab in USE_GITHUB= yes in the port's Makefile).
Comment 35 Artyom Davidov 2020-11-17 00:07:17 UTC
I guess I should mention one more thing. 
Even with these patches, fluent-bit on FreeBSD 11.4 suffers from several other issues: it doesn't properly detect kqueue support and can randomly SIGSEGV at runtime with a SEGV_MAPERR code.
But I guess these should be addressed in separate bug reports.
Comment 36 Artyom Davidov 2020-11-17 00:30:21 UTC
Created attachment 219750 [details]
Patch for the src/flb_scheduler.c for the upstream

This patch is intended for inclusion in the upstream code to fix the "[error] [src/flb_scheduler.c:52 errno=0] No error: 0" error. It uses the same method that was proposed in the GitHub pull request for macOS (https://github.com/fluent/fluent-bit/pull/2463).
Comment 37 Artyom Davidov 2020-11-17 00:35:35 UTC
Created attachment 219751 [details]
Patch for the lib/flb_libco/settings.h for the upstream

This patch fixes the SIGSEGV caused by improper section naming in the clang-generated code. This is the first of the two upstream patches for this issue.
Comment 38 Artyom Davidov 2020-11-17 00:37:10 UTC
Created attachment 219752 [details]
Patch for the lib/monkey/deps/flb_libco/settings.h for the upstream

This patch fixes the SIGSEGV caused by improper section naming in the clang-generated code. This is the second of the two upstream patches for this issue.
Comment 39 Artyom Davidov 2020-11-17 00:43:28 UTC
It would be great if someone could create pull requests for these two issues in the upstream GitHub repository: one with the patch for src/flb_scheduler.c and another for lib/flb_libco/settings.h and lib/monkey/deps/flb_libco/settings.h.
Comment 40 pete 2020-11-17 01:03:40 UTC
I can confirm that the attached patches work on 13-CURRENT when applied to our ports tree:

$ fluent-bit -i random -o stdout
Fluent Bit v1.6.4
* Copyright (C) 2019-2020 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2020/11/16 17:01:19] [ info] [engine] started (pid=47800)
[2020/11/16 17:01:19] [ info] [storage] version=1.0.6, initializing...
[2020/11/16 17:01:19] [ info] [storage] in-memory
[2020/11/16 17:01:19] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2020/11/16 17:01:19] [ info] [sp] stream processor started
[0] random.0: [1605574880.927249761, {"rand_value"=>4909325973904087984}]
[1] random.0: [1605574881.927221529, {"rand_value"=>277570866715169112}]
[2] random.0: [1605574882.927206058, {"rand_value"=>12953736756712043751}]
[3] random.0: [1605574883.927209078, {"rand_value"=>7851878885956954261}]
[4] random.0: [1605574884.927217567, {"rand_value"=>17541597209171492377}]
[0] random.0: [1605574885.927225076, {"rand_value"=>17374193282709205663}]
[1] random.0: [1605574886.927215686, {"rand_value"=>12832133527498613170}]
[2] random.0: [1605574887.927221645, {"rand_value"=>4084494503634151418}]
[3] random.0: [1605574888.927204875, {"rand_value"=>17233444886810613865}]
[0] random.0: [1605574889.927239295, {"rand_value"=>17820593800639646583}]
[1] random.0: [1605574890.927238415, {"rand_value"=>13661266209513106025}]
[2] random.0: [1605574891.927223915, {"rand_value"=>15513617153971889357}]
[3] random.0: [1605574892.927221135, {"rand_value"=>11677897982918305681}]
[4] random.0: [1605574893.927232566, {"rand_value"=>16943591317307619173}]
^C[2020/11/16 17:01:37] [engine] caught signal (SIGINT)
[0] random.0: [1605574894.927309096, {"rand_value"=>2382823764264495886}]
[1] random.0: [1605574895.927217538, {"rand_value"=>13899165999474213982}]
[2] random.0: [1605574896.927232658, {"rand_value"=>2918173789150298808}]
[2020/11/16 17:01:37] [ warn] [engine] service will stop in 5 seconds
[2020/11/16 17:01:42] [ info] [engine] service stopped
[2020/11/16 17:01:42] [  Error] kevent: Bad file descriptor, errno=9 at /wrkdirs/usr/ports/sysutils/fluent-bit/work/fluent-bit-1.6.4/lib/monkey/mk_core/mk_event_kqueue.c:144


I am happy to submit PRs to upstream as well, but it may make more sense for one of the folks who wrote the fixes to submit the patches so they get the credit.  Just let me know :)
Comment 41 Artyom Davidov 2020-11-17 01:11:22 UTC
Created attachment 219753 [details]
Proposed patch for inclusion in the ports tree

Updated patch to fix cosmetic styling issue.
I forgot to add the space between >= and 12, sorry.
No other changes from the previous version.
Comment 42 Artyom Davidov 2020-11-17 01:12:59 UTC
Created attachment 219754 [details]
Patch for the src/flb_scheduler.c for the upstream

Updated patch to fix cosmetic styling issue.
I forgot to add the space between >= and 12, sorry.
No other changes from the previous version.
Comment 43 Palle Girgensohn freebsd_committer freebsd_triage 2020-11-17 09:33:36 UTC
Excellent, great job! Works like a charm here.
Comment 44 commit-hook freebsd_committer freebsd_triage 2020-11-17 09:50:24 UTC
A commit references this bug:

Author: girgen
Date: Tue Nov 17 09:50:17 UTC 2020
New revision: 555545
URL: https://svnweb.freebsd.org/changeset/ports/555545

Log:
  Fix SIGSEGV fault on FreeBSD 12+

  PR:	250825
  Submitted by:	Artyom Davidov

Changes:
  head/sysutils/fluent-bit/Makefile
  head/sysutils/fluent-bit/files/patch-lib_flb__libco_settings.h
  head/sysutils/fluent-bit/files/patch-lib_monkey_deps_flb__libco_settings.h
  head/sysutils/fluent-bit/files/patch-src_flb__scheduler.c
Comment 45 Palle Girgensohn freebsd_committer freebsd_triage 2020-11-17 09:50:53 UTC
Committed. Thanks!
Comment 46 Artyom Davidov 2020-11-17 14:51:31 UTC
(In reply to pete from comment #40)
> I am happy to submit PRs to upstream as well, but it may make more sense for one of the folks who wrote the fixes to submit the patches so they get the credit.  Just let me know :)

I guess all kudos should go to Yuri Pankov for his deep knowledge of C and great debugging skills. Without him we would not have been able to track down this issue so quickly.
If he is busy with other things or just doesn't want to mess with GitHub pull requests, I wouldn't mind if you submit these patches upstream. =)

Anyway, I wouldn't mind submitting these patches upstream myself, but it'll take some time, so it's OK with me if someone else submits them. If the fluent-bit developers have questions about these patches, I would gladly help explain what they do and how they work, though most of the information is already present in this bug report. =)
Comment 47 Palle Girgensohn freebsd_committer freebsd_triage 2020-11-17 21:08:20 UTC
OK, I pushed a PR:
https://github.com/fluent/fluent-bit/pull/2781
Comment 48 Yuri Pankov freebsd_committer freebsd_triage 2020-11-17 21:21:21 UTC
(In reply to Palle Girgensohn from comment #47)
please note that there's an issue reported, might be good to link to it?

https://github.com/fluent/fluent-bit/issues/2747
Comment 49 Palle Girgensohn freebsd_committer freebsd_triage 2020-11-17 21:35:12 UTC
(In reply to Yuri Pankov from comment #48)

Excellent, I crosslinked them.
Comment 50 Artyom Davidov 2020-11-19 16:15:52 UTC
(In reply to Artyom Davidov from comment #35)
The SIGSEGV (SEGV_MAPERR) issue that I mentioned earlier seems to be fixed by the "post 1.6.4" commits in the upstream master branch.
I've created a separate PR for this issue, with a patch for the FreeBSD ports tree that builds an intermediate version of fluent-bit from the upstream GitHub master branch.
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=251257

Maybe it will help someone until the new fluent-bit version is released.
Comment 51 commit-hook freebsd_committer freebsd_triage 2020-11-23 08:33:55 UTC
A commit references this bug:

Author: girgen
Date: Mon Nov 23 08:33:32 UTC 2020
New revision: 556094
URL: https://svnweb.freebsd.org/changeset/ports/556094

Log:
  Update to 1.6.5

  The bug fixes for FreeBSD mentioned in the release notes where already in the
  previous commit r555545 [1].

  Release notes:	https://fluentbit.io/announcements/v1.6.5/
  PR:		250825 [1]

Changes:
  head/sysutils/fluent-bit/Makefile
  head/sysutils/fluent-bit/distinfo
  head/sysutils/fluent-bit/files/patch-lib_flb__libco_settings.h
  head/sysutils/fluent-bit/files/patch-lib_monkey_deps_flb__libco_settings.h
  head/sysutils/fluent-bit/files/patch-src_flb__scheduler.c
Comment 52 Artyom Davidov 2020-11-23 15:18:58 UTC
(In reply to commit-hook from comment #51)
The upstream fix for this issue was incomplete.
They forgot to merge the libco fixes into the monkey HTTP server.
https://github.com/fluent/fluent-bit/blob/v1.6.5/lib/monkey/deps/flb_libco/settings.h
The settings.h file is still unpatched upstream, so we should keep patch-lib_monkey_deps_flb__libco_settings.h until they fix this.
Comment 53 commit-hook freebsd_committer freebsd_triage 2020-11-23 17:00:44 UTC
A commit references this bug:

Author: girgen
Date: Mon Nov 23 16:59:53 UTC 2020
New revision: 556115
URL: https://svnweb.freebsd.org/changeset/ports/556115

Log:
  Revive a patch that was mistakenly removed in the last update

  PR:	250825

Changes:
  head/sysutils/fluent-bit/Makefile
  head/sysutils/fluent-bit/files/patch-lib_monkey_deps_flb__libco_settings.h
Comment 54 Palle Girgensohn freebsd_committer freebsd_triage 2020-11-23 17:01:23 UTC
(In reply to Artyom Davidov from comment #52)

:face_palm:

Sorry, I did check which patches applied and which didn't, but I apparently still made some mistake. Sorry!