Node A (11.1-RELEASE-p1): ======== [root@gw0:/var/log]# service hastd start Starting hastd. [root@gw0:/var/log]# hastctl status Name Status Role Components hasta - init /dev/gpt/hasta tcp4://192.168.0.247 hastb - init /dev/gpt/hastb tcp4://192.168.0.247 [root@gw0:/var/log]# hastctl role secondary hasta [root@gw0:/var/log]# hastctl role secondary hastb [root@gw0:/var/log]# hastctl status Name Status Role Components hasta - secondary /dev/gpt/hasta tcp4://192.168.0.247 hastb - secondary /dev/gpt/hastb tcp4://192.168.0.247 Node B (11.1-RELEASE-p6): ======== [root@gw1:/var/log]# service hastd start Starting hastd. [root@gw1:/var/log]# hastctl status Name Status Role Components hasta - init /dev/gpt/hasta tcp4://192.168.0.248 hastb - init /dev/gpt/hastb tcp4://192.168.0.248 [root@gw1:/var/log]# hastctl role promary hasta usage: hastctl create [-d] [-c config] [-e extentsize] [-k keepdirty] [-m mediasize] name ... hastctl role [-d] [-c config] <init | primary | secondary> all | name ... hastctl list [-d] [-c config] [all | name ...] hastctl status [-d] [-c config] [all | name ...] hastctl dump [-d] [-c config] [all | name ...] [root@gw1:/var/log]# hastctl role primary hasta [root@gw1:/var/log]# hastctl role primary hastb [root@gw1:/var/log]# hastctl status (hangs) Node B dmesg: pid 26813 (hastd), uid 0: exited on signal 6 (core dumped) pid 26814 (hastd), uid 0: exited on signal 6 (core dumped) pid 26815 (hastd), uid 0: exited on signal 6 (core dumped) pid 26816 (hastd), uid 0: exited on signal 6 (core dumped) pid 26817 (hastd), uid 0: exited on signal 6 (core dumped) pid 26822 (hastd), uid 0: exited on signal 6 (core dumped) pid 26825 (hastd), uid 0: exited on signal 6 (core dumped) pid 26828 (hastd), uid 0: exited on signal 6 (core dumped) pid 26829 (hastd), uid 0: exited on signal 6 (core dumped) pid 26830 (hastd), uid 0: exited on signal 6 (core dumped) pid 26831 (hastd), uid 0: exited on signal 6 (core dumped) pid 26833 (hastd), uid 0: exited on signal 6 (core dumped) pid 26836 (hastd), uid 0: exited on signal 6 (core dumped) pid 26837 (hastd), uid 0: exited on signal 6 (core dumped) Node B messages: Apr 12 15:02:49 gw1 kernel: pid 26891 (hastd), uid 0: exited on signal 6 (core dumped) Apr 12 15:02:50 gw1 hastd[26679]: [hastb] (primary) Worker process killed (pid=26891, signal=6). Apr 12 15:02:50 gw1 hastd[26893]: [hasta] (primary) Descriptor 7 is open (pipe or FIFO), but should be closed. Apr 12 15:02:50 gw1 hastd[26893]: [hasta] (primary) Aborted at function descriptors_assert, file /usr/src/sbin/hastd/hastd.c, line 303. Apr 12 15:02:50 gw1 kernel: pid 26893 (hastd), uid 0: exited on signal 6 (core dumped) Apr 12 15:02:51 gw1 hastd[26679]: [hasta] (primary) Worker process killed (pid=26893, signal=6). Apr 12 15:02:51 gw1 hastd[26896]: [hastb] (primary) Descriptor 7 is open (pipe or FIFO), but should be closed. Apr 12 15:02:51 gw1 hastd[26896]: [hastb] (primary) Aborted at function descriptors_assert, file /usr/src/sbin/hastd/hastd.c, line 303. Apr 12 15:02:52 gw1 kernel: pid 26896 (hastd), uid 0: exited on signal 6 (core dumped) Apr 12 15:02:52 gw1 hastd[26679]: [hastb] (primary) Worker process killed (pid=26896, signal=6). Apr 12 15:02:52 gw1 hastd[26900]: [hasta] (primary) Descriptor 7 is open (pipe or FIFO), but should be closed. Apr 12 15:02:52 gw1 hastd[26900]: [hasta] (primary) Aborted at function descriptors_assert, file /usr/src/sbin/hastd/hastd.c, line 303. Apr 12 15:02:53 gw1 kernel: pid 26900 (hastd), uid 0: exited on signal 6 (core dumped) Apr 12 15:02:54 gw1 hastd[26679]: [hasta] (primary) Worker process killed (pid=26900, signal=6). Apr 12 15:02:54 gw1 hastd[26904]: [hastb] (primary) Descriptor 7 is open (pipe or FIFO), but should be closed. Apr 12 15:02:54 gw1 hastd[26904]: [hastb] (primary) Aborted at function descriptors_assert, file /usr/src/sbin/hastd/hastd.c, line 303. Apr 12 15:02:54 gw1 kernel: pid 26904 (hastd), uid 0: exited on signal 6 (core dumped) Now when I'm trying to switch A to primary: [root@gw0:/var/log]# hastctl role primary hastb [root@gw0:/var/log]# hastctl role primary hasta [root@gw0:/var/log]# hastctl status (hangs) Node A dmesg: pid 72301 (hastd), uid 0: exited on signal 6 (core dumped) pid 72328 (hastd), uid 0: exited on signal 6 (core dumped) pid 72355 (hastd), uid 0: exited on signal 6 (core dumped) pid 72389 (hastd), uid 0: exited on signal 6 (core dumped) pid 72412 (hastd), uid 0: exited on signal 6 (core dumped) pid 72436 (hastd), uid 0: exited on signal 6 (core dumped) pid 72467 (hastd), uid 0: exited on signal 6 (core dumped) pid 72496 (hastd), uid 0: exited on signal 6 (core dumped) pid 72514 (hastd), uid 0: exited on signal 6 (core dumped) pid 72530 (hastd), uid 0: exited on signal 6 (core dumped) pid 72554 (hastd), uid 0: exited on signal 6 (core dumped) pid 72584 (hastd), uid 0: exited on signal 6 (core dumped) pid 72620 (hastd), uid 0: exited on signal 6 (core dumped) pid 72656 (hastd), uid 0: exited on signal 6 (core dumped) pid 72708 (hastd), uid 0: exited on signal 6 (core dumped) pid 72759 (hastd), uid 0: exited on signal 6 (core dumped) pid 72799 (hastd), uid 0: exited on signal 6 (core dumped) Apr 12 15:04:40 gw0 kernel: pid 72530 (hastd), uid 0: exited on signal 6 (core dumped) Apr 12 15:04:41 gw0 hastd[63097]: [hasta] (primary) Worker process killed (pid=72530, signal=6). Apr 12 15:04:41 gw0 hastd[72554]: [hastb] (primary) Descriptor 8 is open (pipe or FIFO), but should be closed. Apr 12 15:04:41 gw0 hastd[72554]: [hastb] (primary) Aborted at function descriptors_assert, file /usr/src/sbin/hastd/hastd.c, line 303. Apr 12 15:04:41 gw0 kernel: pid 72554 (hastd), uid 0: exited on signal 6 (core dumped) Apr 12 15:04:42 gw0 hastd[63097]: [hastb] (primary) Worker process killed (pid=72554, signal=6). Apr 12 15:04:42 gw0 hastd[72584]: [hasta] (primary) Descriptor 8 is open (pipe or FIFO), but should be closed. Apr 12 15:04:42 gw0 hastd[72584]: [hasta] (primary) Aborted at function descriptors_assert, file /usr/src/sbin/hastd/hastd.c, line 303. Apr 12 15:04:42 gw0 kernel: pid 72584 (hastd), uid 0: exited on signal 6 (core dumped) Apr 12 15:04:43 gw0 hastd[63097]: [hasta] (primary) Worker process killed (pid=72584, signal=6). Apr 12 15:04:43 gw0 hastd[72620]: [hastb] (primary) Descriptor 8 is open (pipe or FIFO), but should be closed. Apr 12 15:04:43 gw0 hastd[72620]: [hastb] (primary) Aborted at function descriptors_assert, file /usr/src/sbin/hastd/hastd.c, line 303. Apr 12 15:04:43 gw0 kernel: pid 72620 (hastd), uid 0: exited on signal 6 (core dumped) Apr 12 15:04:44 gw0 hastd[63097]: [hastb] (primary) Worker process killed (pid=72620, signal=6). Apr 12 15:04:44 gw0 hastd[72656]: [hasta] (primary) Descriptor 8 is open (pipe or FIFO), but should be closed. Apr 12 15:04:44 gw0 hastd[72656]: [hasta] (primary) Aborted at function descriptors_assert, file /usr/src/sbin/hastd/hastd.c, line 303. Apr 12 15:04:44 gw0 kernel: pid 72656 (hastd), uid 0: exited on signal 6 (core dumped) Apr 12 15:04:45 gw0 hastd[63097]: [hasta] (primary) Worker process killed (pid=72656, signal=6). Apr 12 15:04:45 gw0 hastd[72708]: [hastb] (primary) Descriptor 8 is open (pipe or FIFO), but should be closed. Apr 12 15:04:45 gw0 hastd[72708]: [hastb] (primary) Aborted at function descriptors_assert, file /usr/src/sbin/hastd/has td.c, line 303. Apr 12 15:04:45 gw0 kernel: pid 72708 (hastd), uid 0: exited on signal 6 (core dumped) Apr 12 15:04:46 gw0 hastd[63097]: [hastb] (primary) Worker process killed (pid=72708, signal=6). Apr 12 15:04:46 gw0 hastd[72759]: [hasta] (primary) Descriptor 8 is open (pipe or FIFO), but should be closed. Apr 12 15:04:46 gw0 hastd[72759]: [hasta] (primary) Aborted at function descriptors_assert, file /usr/src/sbin/hastd/has td.c, line 303. Apr 12 15:04:46 gw0 kernel: pid 72759 (hastd), uid 0: exited on signal 6 (core dumped) Apr 12 15:04:47 gw0 hastd[63097]: [hasta] (primary) Worker process killed (pid=72759, signal=6). Apr 12 15:04:47 gw0 hastd[72799]: [hastb] (primary) Descriptor 8 is open (pipe or FIFO), but should be closed. Apr 12 15:04:47 gw0 hastd[72799]: [hastb] (primary) Aborted at function descriptors_assert, file /usr/src/sbin/hastd/has td.c, line 303. Apr 12 15:04:47 gw0 kernel: pid 72799 (hastd), uid 0: exited on signal 6 (core dumped) Node A config: ============== resource hasta { local /dev/gpt/hasta on gw0 { remote tcp4://192.168.0.247 source tcp4://192.168.0.248 } on gw1 { remote tcp4://192.168.0.248 source tcp4://192.168.0.247 } } resource hastb { local /dev/gpt/hastb on gw0 { remote tcp4://192.168.0.247 source tcp4://192.168.0.248 } on gw1 { remote tcp4://192.168.0.248 source tcp4://192.168.0.247 } } Node B config: ============== resource hasta { local /dev/gpt/hasta on gw0 { remote tcp4://192.168.0.247 source tcp4://192.168.0.248 } on gw1 { remote tcp4://192.168.0.248 source tcp4://192.168.0.247 } } resource hastb { local /dev/gpt/hastb on gw0 { remote tcp4://192.168.0.247 source tcp4://192.168.0.248 } on gw1 { remote tcp4://192.168.0.248 source tcp4://192.168.0.247 } }
Now hastd contunues to constantly dump cores even when I'm trying to launch it on a single node as a primary.
Backtrace: # gdb /sbin/hastd /hastd.core GNU gdb 6.1.1 [FreeBSD] Copyright 2004 Free Software Foundation, Inc. GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions. Type "show copying" to see the conditions. There is absolutely no warranty for GDB. Type "show warranty" for details. This GDB was configured as "amd64-marcel-freebsd"... Core was generated by `/sbin/hastd'. Program terminated with signal 6, Aborted. Reading symbols from /lib/libgeom.so.5...Reading symbols from /usr/lib/debug//lib/libgeom.so.5.debug...done. done. Loaded symbols for /lib/libgeom.so.5 Reading symbols from /lib/libutil.so.9...Reading symbols from /usr/lib/debug//lib/libutil.so.9.debug...done. done. Loaded symbols for /lib/libutil.so.9 Reading symbols from /lib/libcrypto.so.8...Reading symbols from /usr/lib/debug//lib/libcrypto.so.8.debug...done. done. Loaded symbols for /lib/libcrypto.so.8 Reading symbols from /lib/libthr.so.3...Reading symbols from /usr/lib/debug//lib/libthr.so.3.debug...done. done. Loaded symbols for /lib/libthr.so.3 Reading symbols from /lib/libc.so.7...Reading symbols from /usr/lib/debug//lib/libc.so.7.debug...done. done. Loaded symbols for /lib/libc.so.7 Reading symbols from /lib/libbsdxml.so.4...Reading symbols from /usr/lib/debug//lib/libbsdxml.so.4.debug...done. done. Loaded symbols for /lib/libbsdxml.so.4 Reading symbols from /lib/libsbuf.so.6...Reading symbols from /usr/lib/debug//lib/libsbuf.so.6.debug...done. done. Loaded symbols for /lib/libsbuf.so.6 Reading symbols from /libexec/ld-elf.so.1...Reading symbols from /usr/lib/debug//libexec/ld-elf.so.1.debug...done. done. Loaded symbols for /libexec/ld-elf.so.1 #0 0x000000080155a84a in thr_kill () from /lib/libc.so.7 (gdb) bt #0 0x000000080155a84a in thr_kill () from /lib/libc.so.7 #1 0x000000080155a814 in __raise (s=6) at /usr/src/lib/libc/gen/raise.c:52 #2 0x000000080155a789 in abort () at /usr/src/lib/libc/stdlib/abort.c:65 #3 0x0000000000414579 in pjdlog_abort (func=0x420e3f "descriptors_assert", file=0x420aeb "/usr/src/sbin/hastd/hastd.c", line=303, failedexpr=0x0, fmt=<value optimized out>) at /usr/src/sbin/hastd/pjdlog.c:613 #4 0x0000000000408267 in descriptors_assert (res=0x80204b400, pjdlogmode=<value optimized out>) at /usr/src/sbin/hastd/hastd.c:303 #5 0x00000000004146eb in hastd_primary (res=0x80204b400) at /usr/src/sbin/hastd/primary.c:1030 #6 0x000000000040a55a in check_signals () at /usr/src/sbin/hastd/hastd.c:359 #7 0x0000000000408852 in main (argc=<value optimized out>, argv=<value optimized out>) at /usr/src/sbin/hastd/hastd.c:1138 #8 0x0000000000403b0f in _start () #9 0x000000080064f000 in ?? () #10 0x0000000000000000 in ?? () (gdb)
Now I got the situation when the secondary node logs the mesages "Descriptor X is open (pipe or FIFO), but should be closed.", and I noticed that in the same time the primary node logs this: Apr 15 19:51:02 gw1 hastd[50310]: [hastb] (primary) Unable to receive handshake header from tcp4://192.168.0.248: Socket is not connected. Apr 15 19:51:10 gw1 hastd[50306]: [hasta] (primary) Unable to receive handshake header from tcp4://192.168.0.248: Socket is not connected. Apr 15 19:51:12 gw1 hastd[50310]: [hastb] (primary) Unable to receive handshake header from tcp4://192.168.0.248: Socket is not connected.
The fd that the hast is complaining about is a pipe, for instance now it's blaming fd 7 to be open, and fd 7 is a pipe: PID COMM FD T V FLAGS REF OFFSET PRO NAME 22817 hastd text v r r------- - - - /sbin/hastd 22817 hastd cwd v d r------- - - - / 22817 hastd root v d r------- - - - / 22817 hastd 0 v c rw------ 3 0 - /dev/null 22817 hastd 1 v c rw------ 3 0 - /dev/null 22817 hastd 2 v c rw------ 3 0 - /dev/null 22817 hastd 3 v r -w---n-l 1 0 - /var/run/hastd.pid 22817 hastd 4 s - rw------ 1 0 UDS /var/run/hastctl 22817 hastd 5 s - rw------ 1 0 TCP 192.168.0.247:8457 0.0.0.0:0 22817 hastd 6 s - rw------ 1 0 UDD /var/run/logpriv 22817 hastd 7 p - rw------ 4 0 - - 22817 hastd 9 s - rw------ 1 0 UDS - 22817 hastd 13 s - rw------ 1 0 UDS - 22817 hastd 16 s - rw------ 1 0 UDS -
Created attachment 192583 [details] HAST nodebug patch This patch removes several debug lines leading to SIGABRT.
This patch removes the debugging leading to SIGABRT, the author is ae@freebsd.org. ae@ is pessimistic about it, saying that it solves nothing, and just makes HAST to contunue by removing the abort() call in pjdlog_abort(). However, this got me a working HAST, first time without a bunch of warning/error/assert messages on both nodes. This probably need more attention.
Created attachment 220060 [details] proposed fix hastd wrongly assumes that parent process (f.e. shell) starts it with no high file descriptors open, use closefrom() to fix
Dear submitter, is this problem still relevant to you? If so, please try attached one-line patch. It solves the problem for me.
A commit references this bug: Author: eugen Date: Sun Nov 29 13:45:54 UTC 2020 New revision: 368148 URL: https://svnweb.freebsd.org/changeset/base/368148 Log: hastd(8) assumes it has no extra file descriptors opened and aborts otherwise, so call closefrom() early. PR: 227461 MFC after: 2 weeks Changes: head/sbin/hastd/hastd.c
^Triage: Assign to committer resolving, prior assignee CC'd
A commit references this bug: Author: eugen Date: Fri Dec 18 12:20:30 UTC 2020 New revision: 368758 URL: https://svnweb.freebsd.org/changeset/base/368758 Log: MFC r368148: fix hastd(8) hastd(8) assumes it has no extra file descriptors opened and aborts otherwise, so call closefrom() early. PR: 227461 Changes: _U stable/12/ stable/12/sbin/hastd/hastd.c
A commit references this bug: Author: eugen Date: Fri Dec 18 12:24:33 UTC 2020 New revision: 368759 URL: https://svnweb.freebsd.org/changeset/base/368759 Log: MFC r368148: fix hastd(8) hastd(8) assumes it has no extra file descriptors opened and aborts otherwise, so call closefrom() early. PR: 227461 Changes: _U stable/11/ stable/11/sbin/hastd/hastd.c
Fixed and merged to stable branches.