Bug 227461 - [hast] regression, unable to start hast, pid XXX (hastd), uid 0: exited on signal 6 (core dumped)
Summary: [hast] regression, unable to start hast, pid XXX (hastd), uid 0: exited on si...
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: 11.1-RELEASE
Hardware: Any Any
: --- Affects Only Me
Assignee: Pawel Jakub Dawidek
URL:
Keywords: regression
Depends on:
Blocks:
 
Reported: 2018-04-12 10:09 UTC by emz
Modified: 2018-04-17 07:20 UTC (History)
3 users (show)

See Also:


Attachments
HAST nodebug patch (1.59 KB, patch)
2018-04-17 07:16 UTC, emz
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description emz 2018-04-12 10:09:33 UTC
Node A (11.1-RELEASE-p1):
========

[root@gw0:/var/log]# service hastd start
Starting hastd.
[root@gw0:/var/log]# hastctl status     
Name    Status   Role           Components
hasta   -        init           /dev/gpt/hasta  tcp4://192.168.0.247
hastb   -        init           /dev/gpt/hastb  tcp4://192.168.0.247
[root@gw0:/var/log]# hastctl role secondary hasta
[root@gw0:/var/log]# hastctl role secondary hastb
[root@gw0:/var/log]# hastctl status              
Name    Status   Role           Components
hasta   -        secondary      /dev/gpt/hasta  tcp4://192.168.0.247
hastb   -        secondary      /dev/gpt/hastb  tcp4://192.168.0.247

Node B (11.1-RELEASE-p6):
========
[root@gw1:/var/log]# service hastd start
Starting hastd.
[root@gw1:/var/log]# hastctl status     
Name    Status   Role           Components
hasta   -        init           /dev/gpt/hasta  tcp4://192.168.0.248
hastb   -        init           /dev/gpt/hastb  tcp4://192.168.0.248
[root@gw1:/var/log]# hastctl role promary hasta 
usage: hastctl create [-d] [-c config] [-e extentsize] [-k keepdirty]
                [-m mediasize] name ...
       hastctl role [-d] [-c config] <init | primary | secondary> all | name ...
       hastctl list [-d] [-c config] [all | name ...]
       hastctl status [-d] [-c config] [all | name ...]
       hastctl dump [-d] [-c config] [all | name ...]
[root@gw1:/var/log]# hastctl role primary hasta
[root@gw1:/var/log]# hastctl role primary hastb
[root@gw1:/var/log]# hastctl status
(hangs)

Node B dmesg:
pid 26813 (hastd), uid 0: exited on signal 6 (core dumped)
pid 26814 (hastd), uid 0: exited on signal 6 (core dumped)
pid 26815 (hastd), uid 0: exited on signal 6 (core dumped)
pid 26816 (hastd), uid 0: exited on signal 6 (core dumped)
pid 26817 (hastd), uid 0: exited on signal 6 (core dumped)
pid 26822 (hastd), uid 0: exited on signal 6 (core dumped)
pid 26825 (hastd), uid 0: exited on signal 6 (core dumped)
pid 26828 (hastd), uid 0: exited on signal 6 (core dumped)
pid 26829 (hastd), uid 0: exited on signal 6 (core dumped)
pid 26830 (hastd), uid 0: exited on signal 6 (core dumped)
pid 26831 (hastd), uid 0: exited on signal 6 (core dumped)
pid 26833 (hastd), uid 0: exited on signal 6 (core dumped)
pid 26836 (hastd), uid 0: exited on signal 6 (core dumped)
pid 26837 (hastd), uid 0: exited on signal 6 (core dumped)

Node B messages:
Apr 12 15:02:49 gw1 kernel: pid 26891 (hastd), uid 0: exited on signal 6 (core dumped)
Apr 12 15:02:50 gw1 hastd[26679]: [hastb] (primary) Worker process killed (pid=26891, signal=6).
Apr 12 15:02:50 gw1 hastd[26893]: [hasta] (primary) Descriptor 7 is open (pipe or FIFO), but should be closed.
Apr 12 15:02:50 gw1 hastd[26893]: [hasta] (primary) Aborted at function descriptors_assert, file /usr/src/sbin/hastd/hastd.c, line 303.
Apr 12 15:02:50 gw1 kernel: pid 26893 (hastd), uid 0: exited on signal 6 (core dumped)
Apr 12 15:02:51 gw1 hastd[26679]: [hasta] (primary) Worker process killed (pid=26893, signal=6).
Apr 12 15:02:51 gw1 hastd[26896]: [hastb] (primary) Descriptor 7 is open (pipe or FIFO), but should be closed.
Apr 12 15:02:51 gw1 hastd[26896]: [hastb] (primary) Aborted at function descriptors_assert, file /usr/src/sbin/hastd/hastd.c, line 303.
Apr 12 15:02:52 gw1 kernel: pid 26896 (hastd), uid 0: exited on signal 6 (core dumped)
Apr 12 15:02:52 gw1 hastd[26679]: [hastb] (primary) Worker process killed (pid=26896, signal=6).
Apr 12 15:02:52 gw1 hastd[26900]: [hasta] (primary) Descriptor 7 is open (pipe or FIFO), but should be closed.
Apr 12 15:02:52 gw1 hastd[26900]: [hasta] (primary) Aborted at function descriptors_assert, file /usr/src/sbin/hastd/hastd.c, line 303.
Apr 12 15:02:53 gw1 kernel: pid 26900 (hastd), uid 0: exited on signal 6 (core dumped)
Apr 12 15:02:54 gw1 hastd[26679]: [hasta] (primary) Worker process killed (pid=26900, signal=6).
Apr 12 15:02:54 gw1 hastd[26904]: [hastb] (primary) Descriptor 7 is open (pipe or FIFO), but should be closed.
Apr 12 15:02:54 gw1 hastd[26904]: [hastb] (primary) Aborted at function descriptors_assert, file /usr/src/sbin/hastd/hastd.c, line 303.
Apr 12 15:02:54 gw1 kernel: pid 26904 (hastd), uid 0: exited on signal 6 (core dumped)

Now when I'm trying to switch A to primary:

[root@gw0:/var/log]# hastctl role primary hastb  
[root@gw0:/var/log]# hastctl role primary hasta
[root@gw0:/var/log]# hastctl status
(hangs)

Node A dmesg:

pid 72301 (hastd), uid 0: exited on signal 6 (core dumped)
pid 72328 (hastd), uid 0: exited on signal 6 (core dumped)
pid 72355 (hastd), uid 0: exited on signal 6 (core dumped)
pid 72389 (hastd), uid 0: exited on signal 6 (core dumped)
pid 72412 (hastd), uid 0: exited on signal 6 (core dumped)
pid 72436 (hastd), uid 0: exited on signal 6 (core dumped)
pid 72467 (hastd), uid 0: exited on signal 6 (core dumped)
pid 72496 (hastd), uid 0: exited on signal 6 (core dumped)
pid 72514 (hastd), uid 0: exited on signal 6 (core dumped)
pid 72530 (hastd), uid 0: exited on signal 6 (core dumped)
pid 72554 (hastd), uid 0: exited on signal 6 (core dumped)
pid 72584 (hastd), uid 0: exited on signal 6 (core dumped)
pid 72620 (hastd), uid 0: exited on signal 6 (core dumped)
pid 72656 (hastd), uid 0: exited on signal 6 (core dumped)
pid 72708 (hastd), uid 0: exited on signal 6 (core dumped)
pid 72759 (hastd), uid 0: exited on signal 6 (core dumped)
pid 72799 (hastd), uid 0: exited on signal 6 (core dumped)

Apr 12 15:04:40 gw0 kernel: pid 72530 (hastd), uid 0: exited on signal 6 (core dumped)
Apr 12 15:04:41 gw0 hastd[63097]: [hasta] (primary) Worker process killed (pid=72530, signal=6).
Apr 12 15:04:41 gw0 hastd[72554]: [hastb] (primary) Descriptor 8 is open (pipe or FIFO), but should be closed.
Apr 12 15:04:41 gw0 hastd[72554]: [hastb] (primary) Aborted at function descriptors_assert, file /usr/src/sbin/hastd/hastd.c, line 303.
Apr 12 15:04:41 gw0 kernel: pid 72554 (hastd), uid 0: exited on signal 6 (core dumped)
Apr 12 15:04:42 gw0 hastd[63097]: [hastb] (primary) Worker process killed (pid=72554, signal=6).
Apr 12 15:04:42 gw0 hastd[72584]: [hasta] (primary) Descriptor 8 is open (pipe or FIFO), but should be closed.
Apr 12 15:04:42 gw0 hastd[72584]: [hasta] (primary) Aborted at function descriptors_assert, file /usr/src/sbin/hastd/hastd.c, line 303.
Apr 12 15:04:42 gw0 kernel: pid 72584 (hastd), uid 0: exited on signal 6 (core dumped)
Apr 12 15:04:43 gw0 hastd[63097]: [hasta] (primary) Worker process killed (pid=72584, signal=6).
Apr 12 15:04:43 gw0 hastd[72620]: [hastb] (primary) Descriptor 8 is open (pipe or FIFO), but should be closed.
Apr 12 15:04:43 gw0 hastd[72620]: [hastb] (primary) Aborted at function descriptors_assert, file /usr/src/sbin/hastd/hastd.c, line 303.
Apr 12 15:04:43 gw0 kernel: pid 72620 (hastd), uid 0: exited on signal 6 (core dumped)
Apr 12 15:04:44 gw0 hastd[63097]: [hastb] (primary) Worker process killed (pid=72620, signal=6).
Apr 12 15:04:44 gw0 hastd[72656]: [hasta] (primary) Descriptor 8 is open (pipe or FIFO), but should be closed.
Apr 12 15:04:44 gw0 hastd[72656]: [hasta] (primary) Aborted at function descriptors_assert, file /usr/src/sbin/hastd/hastd.c, line 303.
Apr 12 15:04:44 gw0 kernel: pid 72656 (hastd), uid 0: exited on signal 6 (core dumped)
Apr 12 15:04:45 gw0 hastd[63097]: [hasta] (primary) Worker process killed (pid=72656, signal=6).
Apr 12 15:04:45 gw0 hastd[72708]: [hastb] (primary) Descriptor 8 is open (pipe or FIFO), but should be closed.
Apr 12 15:04:45 gw0 hastd[72708]: [hastb] (primary) Aborted at function descriptors_assert, file /usr/src/sbin/hastd/has
td.c, line 303.
Apr 12 15:04:45 gw0 kernel: pid 72708 (hastd), uid 0: exited on signal 6 (core dumped)
Apr 12 15:04:46 gw0 hastd[63097]: [hastb] (primary) Worker process killed (pid=72708, signal=6).
Apr 12 15:04:46 gw0 hastd[72759]: [hasta] (primary) Descriptor 8 is open (pipe or FIFO), but should be closed.
Apr 12 15:04:46 gw0 hastd[72759]: [hasta] (primary) Aborted at function descriptors_assert, file /usr/src/sbin/hastd/has
td.c, line 303.
Apr 12 15:04:46 gw0 kernel: pid 72759 (hastd), uid 0: exited on signal 6 (core dumped)
Apr 12 15:04:47 gw0 hastd[63097]: [hasta] (primary) Worker process killed (pid=72759, signal=6).
Apr 12 15:04:47 gw0 hastd[72799]: [hastb] (primary) Descriptor 8 is open (pipe or FIFO), but should be closed.
Apr 12 15:04:47 gw0 hastd[72799]: [hastb] (primary) Aborted at function descriptors_assert, file /usr/src/sbin/hastd/has
td.c, line 303.
Apr 12 15:04:47 gw0 kernel: pid 72799 (hastd), uid 0: exited on signal 6 (core dumped)

Node A config:
==============
resource hasta {
    local /dev/gpt/hasta
    on gw0 {
        remote tcp4://192.168.0.247
        source tcp4://192.168.0.248
    }
    on gw1 {
        remote tcp4://192.168.0.248
        source tcp4://192.168.0.247
    }
}

resource hastb {
    local /dev/gpt/hastb
    on gw0 {
        remote tcp4://192.168.0.247
        source tcp4://192.168.0.248
    }
    on gw1 {
        remote tcp4://192.168.0.248
        source tcp4://192.168.0.247
    }
}


Node B config:
==============

resource hasta {
    local /dev/gpt/hasta
    on gw0 {
        remote tcp4://192.168.0.247
        source tcp4://192.168.0.248
    }
    on gw1 {
        remote tcp4://192.168.0.248
        source tcp4://192.168.0.247
    }
}

resource hastb {
    local /dev/gpt/hastb
    on gw0 {
        remote tcp4://192.168.0.247
        source tcp4://192.168.0.248
    }
    on gw1 {
        remote tcp4://192.168.0.248
        source tcp4://192.168.0.247
    }
}
Comment 1 emz 2018-04-12 10:11:08 UTC
Now hastd contunues to constantly dump cores even when I'm trying to launch it on a single node as a primary.
Comment 2 emz 2018-04-12 10:14:43 UTC
Backtrace:

# gdb /sbin/hastd /hastd.core 
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...
Core was generated by `/sbin/hastd'.
Program terminated with signal 6, Aborted.
Reading symbols from /lib/libgeom.so.5...Reading symbols from /usr/lib/debug//lib/libgeom.so.5.debug...done.
done.
Loaded symbols for /lib/libgeom.so.5
Reading symbols from /lib/libutil.so.9...Reading symbols from /usr/lib/debug//lib/libutil.so.9.debug...done.
done.
Loaded symbols for /lib/libutil.so.9
Reading symbols from /lib/libcrypto.so.8...Reading symbols from /usr/lib/debug//lib/libcrypto.so.8.debug...done.
done.
Loaded symbols for /lib/libcrypto.so.8
Reading symbols from /lib/libthr.so.3...Reading symbols from /usr/lib/debug//lib/libthr.so.3.debug...done.
done.
Loaded symbols for /lib/libthr.so.3
Reading symbols from /lib/libc.so.7...Reading symbols from /usr/lib/debug//lib/libc.so.7.debug...done.
done.
Loaded symbols for /lib/libc.so.7
Reading symbols from /lib/libbsdxml.so.4...Reading symbols from /usr/lib/debug//lib/libbsdxml.so.4.debug...done.
done.
Loaded symbols for /lib/libbsdxml.so.4
Reading symbols from /lib/libsbuf.so.6...Reading symbols from /usr/lib/debug//lib/libsbuf.so.6.debug...done.
done.
Loaded symbols for /lib/libsbuf.so.6
Reading symbols from /libexec/ld-elf.so.1...Reading symbols from /usr/lib/debug//libexec/ld-elf.so.1.debug...done.
done.
Loaded symbols for /libexec/ld-elf.so.1
#0  0x000000080155a84a in thr_kill () from /lib/libc.so.7
(gdb) bt
#0  0x000000080155a84a in thr_kill () from /lib/libc.so.7
#1  0x000000080155a814 in __raise (s=6) at /usr/src/lib/libc/gen/raise.c:52
#2  0x000000080155a789 in abort () at /usr/src/lib/libc/stdlib/abort.c:65
#3  0x0000000000414579 in pjdlog_abort (func=0x420e3f "descriptors_assert", 
    file=0x420aeb "/usr/src/sbin/hastd/hastd.c", line=303, failedexpr=0x0, fmt=<value optimized out>)
    at /usr/src/sbin/hastd/pjdlog.c:613
#4  0x0000000000408267 in descriptors_assert (res=0x80204b400, pjdlogmode=<value optimized out>)
    at /usr/src/sbin/hastd/hastd.c:303
#5  0x00000000004146eb in hastd_primary (res=0x80204b400) at /usr/src/sbin/hastd/primary.c:1030
#6  0x000000000040a55a in check_signals () at /usr/src/sbin/hastd/hastd.c:359
#7  0x0000000000408852 in main (argc=<value optimized out>, argv=<value optimized out>)
    at /usr/src/sbin/hastd/hastd.c:1138
#8  0x0000000000403b0f in _start ()
#9  0x000000080064f000 in ?? ()
#10 0x0000000000000000 in ?? ()
(gdb)
Comment 3 emz 2018-04-15 14:51:55 UTC
Now I got the situation when the secondary node logs the mesages "Descriptor X is open (pipe or FIFO), but should be closed.", and I noticed that in the same time the primary node logs this:

Apr 15 19:51:02 gw1 hastd[50310]: [hastb] (primary) Unable to receive handshake header from tcp4://192.168.0.248: Socket is not connected.
Apr 15 19:51:10 gw1 hastd[50306]: [hasta] (primary) Unable to receive handshake header from tcp4://192.168.0.248: Socket is not connected.
Apr 15 19:51:12 gw1 hastd[50310]: [hastb] (primary) Unable to receive handshake header from tcp4://192.168.0.248: Socket is not connected.
Comment 4 emz 2018-04-15 19:53:45 UTC
The fd that the hast is complaining about is a pipe, for instance now it's blaming fd 7 to be open, and fd 7 is a pipe:

PID COMM                FD T V FLAGS    REF  OFFSET PRO NAME
22817 hastd             text v r r-------   -       - -   /sbin/hastd
22817 hastd              cwd v d r-------   -       - -   /
22817 hastd             root v d r-------   -       - -   /
22817 hastd                0 v c rw------   3       0 -   /dev/null
22817 hastd                1 v c rw------   3       0 -   /dev/null
22817 hastd                2 v c rw------   3       0 -   /dev/null
22817 hastd                3 v r -w---n-l   1       0 -   /var/run/hastd.pid
22817 hastd                4 s - rw------   1       0 UDS /var/run/hastctl
22817 hastd                5 s - rw------   1       0 TCP 192.168.0.247:8457 0.0.0.0:0
22817 hastd                6 s - rw------   1       0 UDD /var/run/logpriv
22817 hastd                7 p - rw------   4       0 -   -
22817 hastd                9 s - rw------   1       0 UDS -
22817 hastd               13 s - rw------   1       0 UDS -
22817 hastd               16 s - rw------   1       0 UDS -
Comment 5 emz 2018-04-17 07:16:42 UTC
Created attachment 192583 [details]
HAST nodebug patch

This patch removes several debug lines leading to SIGABRT.
Comment 6 emz 2018-04-17 07:20:28 UTC
This patch removes the debugging leading to SIGABRT, the author is ae@freebsd.org. ae@ is pessimistic about it, saying that it solves nothing, and just makes HAST to contunue by removing the abort() call in pjdlog_abort().

However, this got me a working HAST, first time without a bunch of warning/error/assert messages on both nodes.

This probably need more attention.