Bug 227461

Summary: [hast] regression, unable to start hast, pid XXX (hastd), uid 0: exited on signal 6 (core dumped)
Product: Base System Reporter: emz
Component: kernAssignee: Eugene Grosbein <eugen>
Status: Closed FIXED    
Severity: Affects Some People CC: emaste, eugen, net, pjd
Priority: --- Keywords: regression
Version: 11.1-RELEASEFlags: eugen: mfc-stable12+
eugen: mfc-stable11+
Hardware: Any   
OS: Any   
Attachments:
Description Flags
HAST nodebug patch
none
proposed fix eugen: maintainer-approval? (pjd)

Description emz 2018-04-12 10:09:33 UTC
Node A (11.1-RELEASE-p1):
========

[root@gw0:/var/log]# service hastd start
Starting hastd.
[root@gw0:/var/log]# hastctl status     
Name    Status   Role           Components
hasta   -        init           /dev/gpt/hasta  tcp4://192.168.0.247
hastb   -        init           /dev/gpt/hastb  tcp4://192.168.0.247
[root@gw0:/var/log]# hastctl role secondary hasta
[root@gw0:/var/log]# hastctl role secondary hastb
[root@gw0:/var/log]# hastctl status              
Name    Status   Role           Components
hasta   -        secondary      /dev/gpt/hasta  tcp4://192.168.0.247
hastb   -        secondary      /dev/gpt/hastb  tcp4://192.168.0.247

Node B (11.1-RELEASE-p6):
========
[root@gw1:/var/log]# service hastd start
Starting hastd.
[root@gw1:/var/log]# hastctl status     
Name    Status   Role           Components
hasta   -        init           /dev/gpt/hasta  tcp4://192.168.0.248
hastb   -        init           /dev/gpt/hastb  tcp4://192.168.0.248
[root@gw1:/var/log]# hastctl role promary hasta 
usage: hastctl create [-d] [-c config] [-e extentsize] [-k keepdirty]
                [-m mediasize] name ...
       hastctl role [-d] [-c config] <init | primary | secondary> all | name ...
       hastctl list [-d] [-c config] [all | name ...]
       hastctl status [-d] [-c config] [all | name ...]
       hastctl dump [-d] [-c config] [all | name ...]
[root@gw1:/var/log]# hastctl role primary hasta
[root@gw1:/var/log]# hastctl role primary hastb
[root@gw1:/var/log]# hastctl status
(hangs)

Node B dmesg:
pid 26813 (hastd), uid 0: exited on signal 6 (core dumped)
pid 26814 (hastd), uid 0: exited on signal 6 (core dumped)
pid 26815 (hastd), uid 0: exited on signal 6 (core dumped)
pid 26816 (hastd), uid 0: exited on signal 6 (core dumped)
pid 26817 (hastd), uid 0: exited on signal 6 (core dumped)
pid 26822 (hastd), uid 0: exited on signal 6 (core dumped)
pid 26825 (hastd), uid 0: exited on signal 6 (core dumped)
pid 26828 (hastd), uid 0: exited on signal 6 (core dumped)
pid 26829 (hastd), uid 0: exited on signal 6 (core dumped)
pid 26830 (hastd), uid 0: exited on signal 6 (core dumped)
pid 26831 (hastd), uid 0: exited on signal 6 (core dumped)
pid 26833 (hastd), uid 0: exited on signal 6 (core dumped)
pid 26836 (hastd), uid 0: exited on signal 6 (core dumped)
pid 26837 (hastd), uid 0: exited on signal 6 (core dumped)

Node B messages:
Apr 12 15:02:49 gw1 kernel: pid 26891 (hastd), uid 0: exited on signal 6 (core dumped)
Apr 12 15:02:50 gw1 hastd[26679]: [hastb] (primary) Worker process killed (pid=26891, signal=6).
Apr 12 15:02:50 gw1 hastd[26893]: [hasta] (primary) Descriptor 7 is open (pipe or FIFO), but should be closed.
Apr 12 15:02:50 gw1 hastd[26893]: [hasta] (primary) Aborted at function descriptors_assert, file /usr/src/sbin/hastd/hastd.c, line 303.
Apr 12 15:02:50 gw1 kernel: pid 26893 (hastd), uid 0: exited on signal 6 (core dumped)
Apr 12 15:02:51 gw1 hastd[26679]: [hasta] (primary) Worker process killed (pid=26893, signal=6).
Apr 12 15:02:51 gw1 hastd[26896]: [hastb] (primary) Descriptor 7 is open (pipe or FIFO), but should be closed.
Apr 12 15:02:51 gw1 hastd[26896]: [hastb] (primary) Aborted at function descriptors_assert, file /usr/src/sbin/hastd/hastd.c, line 303.
Apr 12 15:02:52 gw1 kernel: pid 26896 (hastd), uid 0: exited on signal 6 (core dumped)
Apr 12 15:02:52 gw1 hastd[26679]: [hastb] (primary) Worker process killed (pid=26896, signal=6).
Apr 12 15:02:52 gw1 hastd[26900]: [hasta] (primary) Descriptor 7 is open (pipe or FIFO), but should be closed.
Apr 12 15:02:52 gw1 hastd[26900]: [hasta] (primary) Aborted at function descriptors_assert, file /usr/src/sbin/hastd/hastd.c, line 303.
Apr 12 15:02:53 gw1 kernel: pid 26900 (hastd), uid 0: exited on signal 6 (core dumped)
Apr 12 15:02:54 gw1 hastd[26679]: [hasta] (primary) Worker process killed (pid=26900, signal=6).
Apr 12 15:02:54 gw1 hastd[26904]: [hastb] (primary) Descriptor 7 is open (pipe or FIFO), but should be closed.
Apr 12 15:02:54 gw1 hastd[26904]: [hastb] (primary) Aborted at function descriptors_assert, file /usr/src/sbin/hastd/hastd.c, line 303.
Apr 12 15:02:54 gw1 kernel: pid 26904 (hastd), uid 0: exited on signal 6 (core dumped)

Now when I'm trying to switch A to primary:

[root@gw0:/var/log]# hastctl role primary hastb  
[root@gw0:/var/log]# hastctl role primary hasta
[root@gw0:/var/log]# hastctl status
(hangs)

Node A dmesg:

pid 72301 (hastd), uid 0: exited on signal 6 (core dumped)
pid 72328 (hastd), uid 0: exited on signal 6 (core dumped)
pid 72355 (hastd), uid 0: exited on signal 6 (core dumped)
pid 72389 (hastd), uid 0: exited on signal 6 (core dumped)
pid 72412 (hastd), uid 0: exited on signal 6 (core dumped)
pid 72436 (hastd), uid 0: exited on signal 6 (core dumped)
pid 72467 (hastd), uid 0: exited on signal 6 (core dumped)
pid 72496 (hastd), uid 0: exited on signal 6 (core dumped)
pid 72514 (hastd), uid 0: exited on signal 6 (core dumped)
pid 72530 (hastd), uid 0: exited on signal 6 (core dumped)
pid 72554 (hastd), uid 0: exited on signal 6 (core dumped)
pid 72584 (hastd), uid 0: exited on signal 6 (core dumped)
pid 72620 (hastd), uid 0: exited on signal 6 (core dumped)
pid 72656 (hastd), uid 0: exited on signal 6 (core dumped)
pid 72708 (hastd), uid 0: exited on signal 6 (core dumped)
pid 72759 (hastd), uid 0: exited on signal 6 (core dumped)
pid 72799 (hastd), uid 0: exited on signal 6 (core dumped)

Apr 12 15:04:40 gw0 kernel: pid 72530 (hastd), uid 0: exited on signal 6 (core dumped)
Apr 12 15:04:41 gw0 hastd[63097]: [hasta] (primary) Worker process killed (pid=72530, signal=6).
Apr 12 15:04:41 gw0 hastd[72554]: [hastb] (primary) Descriptor 8 is open (pipe or FIFO), but should be closed.
Apr 12 15:04:41 gw0 hastd[72554]: [hastb] (primary) Aborted at function descriptors_assert, file /usr/src/sbin/hastd/hastd.c, line 303.
Apr 12 15:04:41 gw0 kernel: pid 72554 (hastd), uid 0: exited on signal 6 (core dumped)
Apr 12 15:04:42 gw0 hastd[63097]: [hastb] (primary) Worker process killed (pid=72554, signal=6).
Apr 12 15:04:42 gw0 hastd[72584]: [hasta] (primary) Descriptor 8 is open (pipe or FIFO), but should be closed.
Apr 12 15:04:42 gw0 hastd[72584]: [hasta] (primary) Aborted at function descriptors_assert, file /usr/src/sbin/hastd/hastd.c, line 303.
Apr 12 15:04:42 gw0 kernel: pid 72584 (hastd), uid 0: exited on signal 6 (core dumped)
Apr 12 15:04:43 gw0 hastd[63097]: [hasta] (primary) Worker process killed (pid=72584, signal=6).
Apr 12 15:04:43 gw0 hastd[72620]: [hastb] (primary) Descriptor 8 is open (pipe or FIFO), but should be closed.
Apr 12 15:04:43 gw0 hastd[72620]: [hastb] (primary) Aborted at function descriptors_assert, file /usr/src/sbin/hastd/hastd.c, line 303.
Apr 12 15:04:43 gw0 kernel: pid 72620 (hastd), uid 0: exited on signal 6 (core dumped)
Apr 12 15:04:44 gw0 hastd[63097]: [hastb] (primary) Worker process killed (pid=72620, signal=6).
Apr 12 15:04:44 gw0 hastd[72656]: [hasta] (primary) Descriptor 8 is open (pipe or FIFO), but should be closed.
Apr 12 15:04:44 gw0 hastd[72656]: [hasta] (primary) Aborted at function descriptors_assert, file /usr/src/sbin/hastd/hastd.c, line 303.
Apr 12 15:04:44 gw0 kernel: pid 72656 (hastd), uid 0: exited on signal 6 (core dumped)
Apr 12 15:04:45 gw0 hastd[63097]: [hasta] (primary) Worker process killed (pid=72656, signal=6).
Apr 12 15:04:45 gw0 hastd[72708]: [hastb] (primary) Descriptor 8 is open (pipe or FIFO), but should be closed.
Apr 12 15:04:45 gw0 hastd[72708]: [hastb] (primary) Aborted at function descriptors_assert, file /usr/src/sbin/hastd/has
td.c, line 303.
Apr 12 15:04:45 gw0 kernel: pid 72708 (hastd), uid 0: exited on signal 6 (core dumped)
Apr 12 15:04:46 gw0 hastd[63097]: [hastb] (primary) Worker process killed (pid=72708, signal=6).
Apr 12 15:04:46 gw0 hastd[72759]: [hasta] (primary) Descriptor 8 is open (pipe or FIFO), but should be closed.
Apr 12 15:04:46 gw0 hastd[72759]: [hasta] (primary) Aborted at function descriptors_assert, file /usr/src/sbin/hastd/has
td.c, line 303.
Apr 12 15:04:46 gw0 kernel: pid 72759 (hastd), uid 0: exited on signal 6 (core dumped)
Apr 12 15:04:47 gw0 hastd[63097]: [hasta] (primary) Worker process killed (pid=72759, signal=6).
Apr 12 15:04:47 gw0 hastd[72799]: [hastb] (primary) Descriptor 8 is open (pipe or FIFO), but should be closed.
Apr 12 15:04:47 gw0 hastd[72799]: [hastb] (primary) Aborted at function descriptors_assert, file /usr/src/sbin/hastd/has
td.c, line 303.
Apr 12 15:04:47 gw0 kernel: pid 72799 (hastd), uid 0: exited on signal 6 (core dumped)

Node A config:
==============
resource hasta {
    local /dev/gpt/hasta
    on gw0 {
        remote tcp4://192.168.0.247
        source tcp4://192.168.0.248
    }
    on gw1 {
        remote tcp4://192.168.0.248
        source tcp4://192.168.0.247
    }
}

resource hastb {
    local /dev/gpt/hastb
    on gw0 {
        remote tcp4://192.168.0.247
        source tcp4://192.168.0.248
    }
    on gw1 {
        remote tcp4://192.168.0.248
        source tcp4://192.168.0.247
    }
}


Node B config:
==============

resource hasta {
    local /dev/gpt/hasta
    on gw0 {
        remote tcp4://192.168.0.247
        source tcp4://192.168.0.248
    }
    on gw1 {
        remote tcp4://192.168.0.248
        source tcp4://192.168.0.247
    }
}

resource hastb {
    local /dev/gpt/hastb
    on gw0 {
        remote tcp4://192.168.0.247
        source tcp4://192.168.0.248
    }
    on gw1 {
        remote tcp4://192.168.0.248
        source tcp4://192.168.0.247
    }
}
Comment 1 emz 2018-04-12 10:11:08 UTC
Now hastd contunues to constantly dump cores even when I'm trying to launch it on a single node as a primary.
Comment 2 emz 2018-04-12 10:14:43 UTC
Backtrace:

# gdb /sbin/hastd /hastd.core 
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...
Core was generated by `/sbin/hastd'.
Program terminated with signal 6, Aborted.
Reading symbols from /lib/libgeom.so.5...Reading symbols from /usr/lib/debug//lib/libgeom.so.5.debug...done.
done.
Loaded symbols for /lib/libgeom.so.5
Reading symbols from /lib/libutil.so.9...Reading symbols from /usr/lib/debug//lib/libutil.so.9.debug...done.
done.
Loaded symbols for /lib/libutil.so.9
Reading symbols from /lib/libcrypto.so.8...Reading symbols from /usr/lib/debug//lib/libcrypto.so.8.debug...done.
done.
Loaded symbols for /lib/libcrypto.so.8
Reading symbols from /lib/libthr.so.3...Reading symbols from /usr/lib/debug//lib/libthr.so.3.debug...done.
done.
Loaded symbols for /lib/libthr.so.3
Reading symbols from /lib/libc.so.7...Reading symbols from /usr/lib/debug//lib/libc.so.7.debug...done.
done.
Loaded symbols for /lib/libc.so.7
Reading symbols from /lib/libbsdxml.so.4...Reading symbols from /usr/lib/debug//lib/libbsdxml.so.4.debug...done.
done.
Loaded symbols for /lib/libbsdxml.so.4
Reading symbols from /lib/libsbuf.so.6...Reading symbols from /usr/lib/debug//lib/libsbuf.so.6.debug...done.
done.
Loaded symbols for /lib/libsbuf.so.6
Reading symbols from /libexec/ld-elf.so.1...Reading symbols from /usr/lib/debug//libexec/ld-elf.so.1.debug...done.
done.
Loaded symbols for /libexec/ld-elf.so.1
#0  0x000000080155a84a in thr_kill () from /lib/libc.so.7
(gdb) bt
#0  0x000000080155a84a in thr_kill () from /lib/libc.so.7
#1  0x000000080155a814 in __raise (s=6) at /usr/src/lib/libc/gen/raise.c:52
#2  0x000000080155a789 in abort () at /usr/src/lib/libc/stdlib/abort.c:65
#3  0x0000000000414579 in pjdlog_abort (func=0x420e3f "descriptors_assert", 
    file=0x420aeb "/usr/src/sbin/hastd/hastd.c", line=303, failedexpr=0x0, fmt=<value optimized out>)
    at /usr/src/sbin/hastd/pjdlog.c:613
#4  0x0000000000408267 in descriptors_assert (res=0x80204b400, pjdlogmode=<value optimized out>)
    at /usr/src/sbin/hastd/hastd.c:303
#5  0x00000000004146eb in hastd_primary (res=0x80204b400) at /usr/src/sbin/hastd/primary.c:1030
#6  0x000000000040a55a in check_signals () at /usr/src/sbin/hastd/hastd.c:359
#7  0x0000000000408852 in main (argc=<value optimized out>, argv=<value optimized out>)
    at /usr/src/sbin/hastd/hastd.c:1138
#8  0x0000000000403b0f in _start ()
#9  0x000000080064f000 in ?? ()
#10 0x0000000000000000 in ?? ()
(gdb)
Comment 3 emz 2018-04-15 14:51:55 UTC
Now I got the situation when the secondary node logs the mesages "Descriptor X is open (pipe or FIFO), but should be closed.", and I noticed that in the same time the primary node logs this:

Apr 15 19:51:02 gw1 hastd[50310]: [hastb] (primary) Unable to receive handshake header from tcp4://192.168.0.248: Socket is not connected.
Apr 15 19:51:10 gw1 hastd[50306]: [hasta] (primary) Unable to receive handshake header from tcp4://192.168.0.248: Socket is not connected.
Apr 15 19:51:12 gw1 hastd[50310]: [hastb] (primary) Unable to receive handshake header from tcp4://192.168.0.248: Socket is not connected.
Comment 4 emz 2018-04-15 19:53:45 UTC
The fd that the hast is complaining about is a pipe, for instance now it's blaming fd 7 to be open, and fd 7 is a pipe:

PID COMM                FD T V FLAGS    REF  OFFSET PRO NAME
22817 hastd             text v r r-------   -       - -   /sbin/hastd
22817 hastd              cwd v d r-------   -       - -   /
22817 hastd             root v d r-------   -       - -   /
22817 hastd                0 v c rw------   3       0 -   /dev/null
22817 hastd                1 v c rw------   3       0 -   /dev/null
22817 hastd                2 v c rw------   3       0 -   /dev/null
22817 hastd                3 v r -w---n-l   1       0 -   /var/run/hastd.pid
22817 hastd                4 s - rw------   1       0 UDS /var/run/hastctl
22817 hastd                5 s - rw------   1       0 TCP 192.168.0.247:8457 0.0.0.0:0
22817 hastd                6 s - rw------   1       0 UDD /var/run/logpriv
22817 hastd                7 p - rw------   4       0 -   -
22817 hastd                9 s - rw------   1       0 UDS -
22817 hastd               13 s - rw------   1       0 UDS -
22817 hastd               16 s - rw------   1       0 UDS -
Comment 5 emz 2018-04-17 07:16:42 UTC
Created attachment 192583 [details]
HAST nodebug patch

This patch removes several debug lines leading to SIGABRT.
Comment 6 emz 2018-04-17 07:20:28 UTC
This patch removes the debugging leading to SIGABRT, the author is ae@freebsd.org. ae@ is pessimistic about it, saying that it solves nothing, and just makes HAST to contunue by removing the abort() call in pjdlog_abort().

However, this got me a working HAST, first time without a bunch of warning/error/assert messages on both nodes.

This probably need more attention.
Comment 7 Eugene Grosbein freebsd_committer freebsd_triage 2020-11-29 13:14:54 UTC
Created attachment 220060 [details]
proposed fix

hastd wrongly assumes that parent process (f.e. shell) starts it with no high file descriptors open, use closefrom() to fix
Comment 8 Eugene Grosbein freebsd_committer freebsd_triage 2020-11-29 13:16:25 UTC
Dear submitter, is this problem still relevant to you? If so, please try attached one-line patch. It solves the problem for me.
Comment 9 commit-hook freebsd_committer freebsd_triage 2020-11-29 13:46:10 UTC
A commit references this bug:

Author: eugen
Date: Sun Nov 29 13:45:54 UTC 2020
New revision: 368148
URL: https://svnweb.freebsd.org/changeset/base/368148

Log:
  hastd(8) assumes it has no extra file descriptors opened
  and aborts otherwise, so call closefrom() early.

  PR:		227461
  MFC after:	2 weeks

Changes:
  head/sbin/hastd/hastd.c
Comment 10 Kubilay Kocak freebsd_committer freebsd_triage 2020-12-08 01:57:20 UTC
^Triage: Assign to committer resolving, prior assignee CC'd
Comment 11 commit-hook freebsd_committer freebsd_triage 2020-12-18 12:21:24 UTC
A commit references this bug:

Author: eugen
Date: Fri Dec 18 12:20:30 UTC 2020
New revision: 368758
URL: https://svnweb.freebsd.org/changeset/base/368758

Log:
  MFC r368148: fix hastd(8)

  hastd(8) assumes it has no extra file descriptors opened
  and aborts otherwise, so call closefrom() early.

  PR:		227461

Changes:
_U  stable/12/
  stable/12/sbin/hastd/hastd.c
Comment 12 commit-hook freebsd_committer freebsd_triage 2020-12-18 12:25:27 UTC
A commit references this bug:

Author: eugen
Date: Fri Dec 18 12:24:33 UTC 2020
New revision: 368759
URL: https://svnweb.freebsd.org/changeset/base/368759

Log:
  MFC r368148: fix hastd(8)

  hastd(8) assumes it has no extra file descriptors opened
  and aborts otherwise, so call closefrom() early.

  PR:           227461

Changes:
_U  stable/11/
  stable/11/sbin/hastd/hastd.c
Comment 13 Eugene Grosbein freebsd_committer freebsd_triage 2020-12-18 12:36:07 UTC
Fixed and merged to stable branches.