Bug 186114 - net/mpd5 hangs after a certain number of users connect
Summary: net/mpd5 hangs after a certain number of users connect
Status: Closed FIXED
Alias: None
Product: Ports & Packages
Classification: Unclassified
Component: Individual Port(s) (show other bugs)
Version: Latest
Hardware: Any Any
: Normal Affects Some People
Assignee: Eugene Grosbein
URL: https://lists.freebsd.org/pipermail/f...
Keywords: crash, needs-qa
: 214482 (view as bug list)
Depends on: 220151
Blocks:
  Show dependency treegraph
 
Reported: 2014-01-26 02:20 UTC by Юрий
Modified: 2017-08-02 12:38 UTC (History)
19 users (show)

See Also:
eugen: maintainer-feedback+


Attachments
Kernel Config (13.35 KB, text/plain)
2016-01-08 20:12 UTC, Cassiano Peixoto
no flags Details
dmesg (10.86 KB, text/plain)
2016-01-08 20:13 UTC, Cassiano Peixoto
no flags Details
pciconf (4.32 KB, text/plain)
2016-01-08 20:13 UTC, Cassiano Peixoto
no flags Details
mpd.conf for my site-site vpn (8.64 KB, text/plain)
2016-01-15 11:52 UTC, YS
no flags Details
crash-bt (5.96 KB, text/plain)
2016-10-20 19:19 UTC, Cassiano Peixoto
no flags Details
debugging patch to track rtld bind lock write owner (1.20 KB, patch)
2017-06-16 14:38 UTC, Konstantin Belousov
no flags Details | Diff
patch for lib/syslog by kib (2.53 KB, patch)
2017-06-16 15:23 UTC, Eugene Grosbein
no flags Details | Diff
patch for mpd/console locks by me (2.49 KB, patch)
2017-06-16 15:24 UTC, Eugene Grosbein
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Юрий 2014-01-26 02:20:00 UTC
I have BRAS on FreeBSD. It was 9.2 STABLE. I tried to update it up to 10.0 RELEASE, later tried to STABLE. On both variants I have the same problem.

Some time after start, around 5 minutes, it works normally. But after 100-150 users have connected trough PPPoE (MPD5.7) MPD process stops in state umtxn.

Of course, no one can connect after that. But who have already connected keeping work.

last pid: 17712;  load averages:  1.16,  0.65,  0.27          up 0+00:01:51  05:28:23
50 processes:  1 running, 49 sleeping
CPU:  0.0% user,  0.0% nice,  1.0% system,  0.9% interrupt, 98.1% idle
Mem: 1162M Active, 56M Inact, 400M Wired, 145M Buf, 2274M Free
Swap: 4096M Total, 4096M Free

  PID USERNAME    THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
 2535 root          1  20    0   201M   184M select  3   1:14  10.69% zebra
 2476 _pflogd       1  20    0 14600K  2200K bpf     0   0:12   0.00% pflogd
 2541 root          1  20    0   224M   206M select  2   0:07   0.00% bgpd
 9803 root          1  20    0 78624K 44092K select  2   0:02   0.00% bsnmpd
 3462 root          3  20    0 56736K  9164K umtxn   0   0:01   0.00% mpd5
 7243 mysql        17  32    0  6958M   636M uwait   1   0:01   0.00% mysqld
 6095 bind          7  20    0   129M 76864K kqread  1   0:01   0.00% named
 3872 root          1  20    0 61124K  6808K select  1   0:00   0.00% nmbd
 8644 root          3  20    0 47332K  6216K select  1   0:00   0.00% utm5_rfw


procstat -k 3462
  PID    TID COMM             TDNAME           KSTACK
 3462 100113 mpd5             -                mi_switch sleepq_catch_signals sleepq_wait_sig _sleep umtxq_sleep do_lock_umutex __umtx_op_wait_umutex amd64_syscall Xfast_syscall
 3462 100115 mpd5             -                mi_switch sleepq_catch_signals sleepq_wait_sig _cv_wait_sig seltdwait sys_poll amd64_syscall Xfast_syscall
 3462 100512 mpd5             -                mi_switch sleepq_catch_signals sleepq_wait_sig _sleep umtxq_sleep do_lock_umutex __umtx_op_wait_umutex amd64_syscall Xfast_syscall



/var/log/mpd.log
Jan 26 05:28:13 gw01 mpd: [B_ppp-46] IPCP: Up event
Jan 26 05:28:13 gw01 mpd: [B_ppp-46] IPCP: state change Starting --> Req-Sent
Jan 26 05:28:13 gw01 mpd: [B_ppp-46] IPCP: SendConfigReq #1
Jan 26 05:28:13 gw01 mpd: [B_ppp-46]   IPADDR 10.10.0.1
Jan 26 05:28:13 gw01 mpd: [B_ppp-46] IPV6CP: Up event
Jan 26 05:28:13 gw01 mpd: [B_ppp-46] IPV6CP: state change Starting --> Req-Sent
Jan 26 05:28:13 gw01 mpd: [B_ppp-46] IPV6CP: SendConfigReq #1
Jan 26 05:28:13 gw01 mpd: [vlan6-107] LCP: rec'd Terminate Request #240 (Opened)
Jan 26 05:28:13 gw01 mpd: [vlan6-107] LCP: state change Opened --> Stopping
Jan 26 05:28:13 gw01 mpd: [vlan6-107] Link: Leave bundle "B_ppp-46"

It always stops with the same 3 last strings.



Jan 26 05:52:38 gw01 kernel: sonewconn: pcb 0xfffff80007757c40: Listen queue overflow:
 4 already in queue awaiting acceptance
Jan 26 05:53:09 gw01 last message repeated 60 times
Jan 26 05:53:34 gw01 last message repeated 51 times


Kernel conf:
GENERIC + 
device          ipmi
device          coretemp
device          smbus

device          lagg
device          netmap

options         IPI_PREEMPTION

options         IPFIREWALL
options         IPFIREWALL_VERBOSE
options         IPDIVERT
options         DUMMYNET
options         IPFIREWALL_NAT
options         LIBALIAS

device          pf
device          pflog
device          pfsync

options         ALTQ
options         ALTQ_CBQ        # Class Bases Queuing (CBQ)
options         ALTQ_RED        # Random Early Detection (RED)
options         ALTQ_RIO        # RED In/Out
options         ALTQ_HFSC       # Hierarchical Packet Scheduler (HFSC)
options         ALTQ_PRIQ       # Priority Queuing (PRIQ)
options         ALTQ_NOPCC      # Required for SMP build

options         NETGRAPH
options         NETGRAPH_BPF
options         NETGRAPH_CAR
options         NETGRAPH_ETHER
options         NETGRAPH_IPFW
options         NETGRAPH_IFACE
options         NETGRAPH_KSOCKET
options         NETGRAPH_PPP
options         NETGRAPH_PPTPGRE
options         NETGRAPH_PPPOE
options         NETGRAPH_SOCKET
options         NETGRAPH_TCPMSS
options         NETGRAPH_TEE
options         NETGRAPH_VJC
options         NETGRAPH_MPPC_ENCRYPTION
options         NETGRAPH_NETFLOW



CPU: Intel(R) Xeon(R) CPU           X3470  @ 2.93GHz (2933.36-MHz K8-class CPU)
  Origin = "GenuineIntel"  Id = 0x106e5  Family = 0x6  Model = 0x1e  Stepping = 5
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x98e3fd<SSE3,DTES64,MON,DS_CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,SSE4.1,SSE4.2,POPCNT>
  AMD Features=0x28100800<SYSCALL,NX,RDTSCP,LM>
  AMD Features2=0x1<LAHF>
  TSC: P-state invariant, performance statistics
real memory  = 4294967296 (4096 MB)
avail memory = 4052344832 (3864 MB)
Event timer "LAPIC" quality 400
ACPI APIC Table: <INTEL  S3420GPC>
FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs
FreeBSD/SMP: 1 package(s) x 4 core(s)
 cpu0 (BSP): APIC ID:  0
 cpu1 (AP): APIC ID:  2
 cpu2 (AP): APIC ID:  4
 cpu3 (AP): APIC ID:  6


I tried to get ktrace dump. But I could not open it.
ktrdump: kvm_nlist: No such file or directory


I think, It is something wrong with netgraph system.

How-To-Repeat: Update to FreeBSD 10.0 and try to connect 100-150 users.
Comment 1 Юрий 2014-02-04 00:33:37 UTC
In additional:

-  this  problem  only with pppoe part. With pptp I and other peope on
forum have not any problem.

- here is gdb report from frozen mpd5:

gdb mpd5 3645
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...(no debugging symbols found)...
Attaching to program: /usr/local/sbin/mpd5, process 3645
Reading symbols from /usr/lib/libwrap.so.6...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/libwrap.so.6
Reading symbols from /usr/lib/libpam.so.5...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/libpam.so.5
Reading symbols from /lib/libcrypt.so.5...(no debugging symbols found)...done.
Loaded symbols for /lib/libcrypt.so.5
Reading symbols from /usr/lib/libnetgraph.so.4...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/libnetgraph.so.4
Reading symbols from /lib/libutil.so.9...(no debugging symbols found)...done.
Loaded symbols for /lib/libutil.so.9
Reading symbols from /usr/lib/libradius.so.4...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/libradius.so.4
Reading symbols from /usr/lib/libssl.so.7...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/libssl.so.7
Reading symbols from /lib/libpcap.so.8...(no debugging symbols found)...done.
Loaded symbols for /lib/libpcap.so.8
Reading symbols from /usr/lib/libfetch.so.6...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/libfetch.so.6
Reading symbols from /lib/libcrypto.so.7...(no debugging symbols found)...done.
Loaded symbols for /lib/libcrypto.so.7
Reading symbols from /lib/libthr.so.3...(no debugging symbols found)...done.
[New Thread 80340ec00 (LWP 100390/mpd5)]
[New Thread 803020400 (LWP 100120/mpd5)]
[New Thread 802c06800 (LWP 100119/mpd5)]
Loaded symbols for /lib/libthr.so.3
Reading symbols from /lib/libc.so.7...(no debugging symbols found)...done.
Loaded symbols for /lib/libc.so.7
Reading symbols from /libexec/ld-elf.so.1...(no debugging symbols found)...done.
Loaded symbols for /libexec/ld-elf.so.1
[Switching to Thread 80340ec00 (LWP 100390/mpd5)]
0x0000000801fc089a in __error () from /lib/libthr.so.3
(gdb) bt
#0  0x0000000801fc089a in __error () from /lib/libthr.so.3
#1  0x0000000801fbb79d in pthread_mutex_destroy () from /lib/libthr.so.3
#2  0x00000008022dea9a in vsyslog () from /lib/libc.so.7
#3  0x00000000004487c7 in ?? ()
#4  0x00000000004486e2 in ?? ()
#5  0x000000000045123c in ?? ()
#6  0x0000000000426534 in ?? ()
#7  0x000000000045d1b6 in ?? ()
#8  0x0000000801fb54a4 in pthread_create () from /lib/libthr.so.3
#9  0x00007fffff5fc000 in ?? ()
Error accessing memory address 0x7fffff7fc000: Bad address.
(gdb) where
#0  0x0000000801fc089a in __error () from /lib/libthr.so.3
#1  0x0000000801fbb79d in pthread_mutex_destroy () from /lib/libthr.so.3
#2  0x00000008022dea9a in vsyslog () from /lib/libc.so.7
#3  0x00000000004487c7 in ?? ()
#4  0x00000000004486e2 in ?? ()
#5  0x000000000045123c in ?? ()
#6  0x0000000000426534 in ?? ()
#7  0x000000000045d1b6 in ?? ()
#8  0x0000000801fb54a4 in pthread_create () from /lib/libthr.so.3
#9  0x00007fffff5fc000 in ?? ()
Error accessing memory address 0x7fffff7fc000: Bad address.
(gdb) quit
Comment 2 Юрий 2014-02-05 12:40:50 UTC
One more interesting thing.

Problem appears usually when MPD5 starts with booting of system.
If I comment mpd_enable in rc.conf and start MPD5 manually after 10-20
minutes - it works normally.
Comment 3 Mark Linimon freebsd_committer freebsd_triage 2014-10-17 23:28:10 UTC
Make this a ports PR and assign.
Comment 4 Boris 2014-11-06 10:40:19 UTC
I can trigger similar and probably related bug in the mpd by just disconnecting upstream link for a few minutes. Clients continue to make queries, mpd completely hangs in the "select" state and can only be killed by -9 signal. This regression appeared somewhere in 8.x series. Killing mpd together with zombie ngX interfaces makes it possible to recover normal operation without reboot.

Nothing interesting found in the all logs.
Comment 5 Alexey Pereklad 2014-12-24 15:09:52 UTC
Got the same problem in FreeBSD 10.0. The symptoms are exactly the same as hawk256 wrote above
Comment 6 YS 2015-08-21 05:44:26 UTC
Site to site vpn, hang too:

821 root          1  20    0 55112K  7504K umtxn   1   0:12   0.00% mpd5

FreeBSD base.tim 10.2-STABLE FreeBSD 10.2-STABLE #1 r286839: Mon Aug 17 12:31:21 EEST 2015     slava@base.tim:/usr/obj/usr/src/sys/route  amd64

mpd5 -v
Version 5.8a 23-Jul-2015
Comment 7 Cassiano Peixoto 2015-11-24 14:15:03 UTC
(In reply to YS from comment #6)
Hi,

I've the same issue here using FreeBSD 10.2 with mpd 5.7 pppoe server. With about 200 users it just freezes. I have to kill -9 mpd process.
Comment 8 Kubilay Kocak freebsd_committer freebsd_triage 2016-01-08 13:43:32 UTC
@Cassiano, mpd5 was recently updated to 5.8, can you update to that version and confirm whether the hanging symptoms are still reproducible please. If you could also provide the following that would be great:

* uname -a output
* kernel configuration (as attachment) if *not* GENERIC
* /var/run/dmesg.boot (as attachment)
* pciconf -lv output (as attachment)

We'll do our best to progress the issue.
Comment 9 Cassiano Peixoto 2016-01-08 20:12:26 UTC
Created attachment 165278 [details]
Kernel Config
Comment 10 Cassiano Peixoto 2016-01-08 20:13:19 UTC
Created attachment 165279 [details]
dmesg
Comment 11 Cassiano Peixoto 2016-01-08 20:13:41 UTC
Created attachment 165280 [details]
pciconf
Comment 12 Cassiano Peixoto 2016-01-08 20:16:02 UTC
Hi, Kubilay.

Thanks for you reply. I've been using mpd 5.8 with same issue. Seems it has not fixed.

Here is my uname -a output:

FreeBSD pppoe.us 10.2-STABLE FreeBSD 10.2-STABLE #0 r292035M: Tue Dec 15 10:45:41 BRST 2015     root@pppoe.us:/usr/obj/usr/src/sys/PPPOE  amd64
Comment 13 YS 2016-01-15 11:52:54 UTC
Created attachment 165626 [details]
mpd.conf for my site-site vpn

freebsd 9.3 - without problem
freebsd 10.2 - with problem after some time
Comment 14 Eugene Grosbein 2016-07-12 10:39:09 UTC
This PR should not be classified as Ports PR because problem lies in the kernel NETGRAPH subsystem and not mpd5 itself. Same version of mpd5 works just fine with 8.4-STABLE. This is Base System problem.
Comment 15 Alexey Pereklad 2016-07-13 07:41:15 UTC
Yeah, got mpd5 working with no problem on FreeBSD 9.x (mpd 5.7, 5.8).
Comment 16 Dmitry Luhtionov 2016-09-30 07:29:57 UTC
Disabling using syslog (3) temporary resolve this problem.
I think, problem between pthreads and syslog.
Comment 17 Kubilay Kocak freebsd_committer freebsd_triage 2016-10-20 06:50:53 UTC
(In reply to Eugene Grosbein from comment #14)

The initial report is fine to be classified against the port. Once the root cause(s) is confirmed, implicated or identified, a dependent (Depends On) sub task can be created for the base change.
Comment 18 Cassiano Peixoto 2016-10-20 19:19:25 UTC
Created attachment 175989 [details]
crash-bt
Comment 19 Cassiano Peixoto 2016-10-20 19:20:30 UTC
Comment on attachment 175989 [details]
crash-bt

I'm having random crashes. BT attached.
Comment 20 Eugene Grosbein freebsd_committer 2017-06-11 14:28:51 UTC
(In reply to Cassiano Peixoto from comment #19)

I've performed heavy stress-test of mpd-5.8 under FreeBSD 11.0-STABLE without usage of RADIUS and found no problems. Can you try 11.0-BETA1?
Comment 21 Cassiano Peixoto 2017-06-12 11:42:26 UTC
Hi Eugene,

I've been using 11.0-p1 for many months. I've about 1k users connected, and after some days (or weeks) i run into on the following issue:

PID USERNAME       THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
55180 root             2  52    0   338M 85204K uwrlck  2   2:28   0.00% mpd5

The process stuck on uwrlck state. Do you have any idea?

My last try was set a cpu affinity on process. After this it's running 9 days and counting. But i gonna need more days to make sure it really worked.

Anyway, how have you made the test? how many users connected?

Thanks.
Comment 22 Eugene Grosbein freebsd_committer 2017-06-12 14:10:01 UTC
(In reply to Cassiano Peixoto from comment #21)

I repeatedly created/destroyed hundreds of PPPoE sessions for long time. But my test was incomplete in a sense, as it does not involve RADIUS, so mpd was running in single-thread mode and lockups are believed due to multi-threading issues. I'm going to re-do my tests.

Also, please take a look at this and try to test patches:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=214482#c33
Comment 23 Cassiano Peixoto 2017-06-12 14:30:26 UTC
Hi Eugene,

You can count me in to help with any test. I think just creating/destroying PPPoE sessions will not make the issues happen.

Another think i forgot to tell, i disabled syslog in mpd5 Makefile:

# Set syslog logging facility. Change LOG_DAEMON to whatever you like.
# Comment this line disable syslog (3) support
#SYSLOG_FACILITY=       LOG_DAEMON

But even with syslog disabled i ran into with uwrlck issue. Then my last try was to set cpu affinity as i told before.

I read the link you sent. Looks like an issue with vsyslog(). So my questions is: even with syslog facility disabled on mpd should i expected this kind of behaviour (stuck)? 

Thanks.
Comment 24 Eugene Grosbein freebsd_committer 2017-06-12 14:40:59 UTC
(In reply to Cassiano Peixoto from comment #23)

I've passed some ftp traffic over PPPoE sessions too.

Please enable SYSLOG_FACILITY back while testing.

As for uwrlck problem, it seems this one is similar but distinct problem. It may happen if you repeatedly use mpd console or http server, for example. Please describe how do you use them.
Comment 25 Cassiano Peixoto 2017-06-12 14:50:47 UTC
Ok, i'll apply your patches and back with syslog facility. Do i need to rebuild the world (userland)?

Regarding uwrlck I am using mpd5 web to collect some connections statistics each 60 minutes (it's a cron job). I'm using the following quite simple shell script:

OUTPUT="/usr/local/etc/tmp/bras.txt"

TMP="/tmp/sessions.tmp"
TMP2="/tmp/veloc.tmp"
TMP3="/tmp/result.tmp"
if [ -e $TMP ]; then
        rm $TMP
fi
if [ -e $TMP2 ]; then
        rm $TMP2
fi
if [ -e $TMP3 ]; then
        rm $TMP3
fi

SESSIONS=`/usr/local/bin/curl -k -l --user admin:xxx http://127.0.0.1:5006/bincmd\?show%20sessions 2>&1 > $TMP`
while read session; do
        if [ "`echo $session |grep -v 'RESULT'`" != "" ]; then
                #echo $session
                INTERFACE=`echo $session |awk '{print $1}'`
                IP=`echo $session |awk '{print $2}'`
                USER=`echo $session |awk '{print $8}'`
                MAC=`echo $session |awk '{print $9}'`
                VLAN=`echo $session |awk '{print $3}' | awk -F '-' '{print $1}'`
                DESC=`ifconfig ${VLAN} |grep description |awk '{print $2}'`
                VELOC=`/usr/local/bin/curl -k -l --user admin:xxx http://127.0.0.1:5006/bincmd\?iface%20${INTERFACE}\&show%20customer 2>&1 | egrep 'in#|out#|seconds' > $TMP2`
                TRAF=`/usr/bin/netstat -nbf link -I ${INTERFACE}`
                TRAF=`echo $TRAF| awk '{print $19" "$16" "$22" "$20}' | tr ' ' ';'`
                BLOCK=`ipfw table 1 list |grep ${IP}`
                if [ "$BLOCK" = "" ]; then
                        BLOCK="0"
                else
                        BLOCK="1"
                fi
                BLOCK2=`ipfw table 2 list |grep ${IP}`
                if [ "$BLOCK2" = "" ]; then
                        BLOCK2="0"
                else
                        BLOCK="1"
                fi
                BLOCK2=`ipfw table 2 list |grep ${IP}`
                if [ "$BLOCK2" = "" ]; then
                        BLOCK2="0"
                else
                        BLOCK2="1"
                fi
                while read veloc; do
                        if [ "`echo $veloc | grep in`" != "" ]; then
                                BTI=$(echo $veloc |awk '{print $5}')
                        fi
                        if [ "`echo $veloc | grep out`" != "" ]; then
                                BTO=$(echo $veloc |awk '{print $5}')
                        fi
                        if [ "`echo $veloc | grep 'seconds'`" != "" ]; then
                                TIME=$(echo $veloc|awk '{print $4}')
                        fi
                done < $TMP2
                echo "${USER};${IP};${TIME};${VLAN};${DESC};${BTI};${BTO};${INTERFACE};${TRAF};${BLOCK};${BLOCK2};${MAC}" >> $TMP3
        fi
done < $TMP
cp $TMP3 $OUTPUT
Comment 26 Eugene Grosbein freebsd_committer 2017-06-12 15:34:52 UTC
(In reply to Cassiano Peixoto from comment #25)

You need not rebuild world if you choose to test the patch for mpd itself. You need to rebuild and reinstall libc only if you choose to test the patch for syslog. Please apply only one of patches at the same time, not both.

As for web server, I've looked into the code and I'm afraid it would not be so easy to patch it. For now, I suggest to disable your statistics collection during the test period or switch to another method not involving web.

For example, you could use "set iface description" conversion specifications: http://mpd.sourceforge.net/doc5/mpd28.html#28

Like this: set iface description "%I: Login %U, IP %A, MAC %M, Link %l"
Then use /sbin/ifconfig to get data.

Otherwise, you may try to use mpd's CLI console instead of web console. I've just updated that patch for mpd to include workaround for console lock possible issues, so make sure you use latest patch version.
Comment 27 Cassiano Peixoto 2017-06-12 16:46:21 UTC
Ok let's begin with mpd patch (without libc patch). Can i enable with syslog facility after patch applied?

Do you think would be possible to provide a web patch as well? Because it's easier to get data output from there.

Thanks.
Comment 28 Eugene Grosbein freebsd_committer 2017-06-12 16:56:58 UTC
(In reply to Cassiano Peixoto from comment #27)

You should enable syslog facility. As for web, it is more complex and may require other developers assistance. With time it definitly can be solved but first we need to ensure that root of problems is correctly determined, so testing is needed first.

Also, you'll better compile mpd with debugging symbols enabled:

make WITH_DEBUG=yes clean all deinstall install

And if hang occurs again, use "killall -QUIT mpd" to generate crashdump to analyze.
Comment 29 Cassiano Peixoto 2017-06-12 17:08:58 UTC
Ok got it. Patch applied and working. As you know this kind of problem takes time to happen. But i'll keep you posted about the progress. Let's keep working at least one week.

Thanks.
Comment 30 Eugene Grosbein freebsd_committer 2017-06-12 17:34:42 UTC
(In reply to Cassiano Peixoto from comment #29)

Please verify with "file /usr/local/sbin/mpd5" command that the binary is "not stripped".
Comment 31 Cassiano Peixoto 2017-06-12 17:38:01 UTC
(In reply to Eugene Grosbein from comment #30)
Yes, not stripped :)

(root@B-ras)~# file /usr/local/sbin/mpd5
/usr/local/sbin/mpd5: ELF 64-bit LSB executable, x86-64, version 1 (FreeBSD), dynamically linked, interpreter /libexec/ld-elf.so.1, for FreeBSD 11.0 (1100122), FreeBSD-style, not stripped
Comment 32 Cassiano Peixoto 2017-06-12 21:03:13 UTC
(In reply to Cassiano Peixoto from comment #31)
Eugene,

I just tried on another machine to compile mpd5 with debug enabled and then ran a killall -QUIT to generate a crashdump. But nothing happens, i looked on /var/crash but there is nothing there. Am i missing something?

Thanks.
Comment 33 Eugene Grosbein freebsd_committer 2017-06-13 08:25:08 UTC
(In reply to Cassiano Peixoto from comment #32)

I'd better say "corefile" instead of "crashdump" as crashdump is generated for kernel panic and core file for userland process crashed or killed with a signal.

Use something like sysctl kern.corefile='/var/tmp/%N.core' to specify location for corefiles. By default, its' written to process current directory. Take a look at /usr/local/etc/mpd5/ for mpd.core
Comment 34 Cassiano Peixoto 2017-06-13 11:30:45 UTC
(In reply to Eugene Grosbein from comment #33)
I tried like you said, but didn't work, look:

# sysctl kern.corefile='/var/tmp/%N.core'

# ps ax | grep mpd5
31784  -  Ss      0:00.01 /usr/local/sbin/mpd5 -p /var/run/mpd5.pid -b

# killall -QUIT mpd5

# ls /var/tmp/
nginx/      vi.recover/

Shouldn't i run the process with gdb?
Comment 35 Eugene Grosbein freebsd_committer 2017-06-13 14:58:07 UTC
(In reply to Cassiano Peixoto from comment #34)

Does "killall" really terminate the process or does it continue to run? Perhaps, the signal is blocked for some reason. There are several ways to start mpd5 from the beginning. How do you start it?
Comment 36 Cassiano Peixoto 2017-06-13 17:09:54 UTC
(In reply to Eugene Grosbein from comment #35)

it's not running anymore, look:

# ps ax | grep mpd5
52264  1  S+      0:00.00 grep mpd5
# /usr/local/etc/rc.d/mpd5 forcestart
Starting mpd5.
# ps ax | grep mpd5
52275  -  Ss      0:00.01 /usr/local/sbin/mpd5 -p /var/run/mpd5.pid -b
52286  1  S+      0:00.00 grep mpd5
# killall -QUIT mpd5
# ps ax | grep mpd5
52293  1  S+      0:00.00 grep mpd5
# sysctl kern.corefile
kern.corefile: /var/tmp/%N.core
# ls /var/tmp/
nginx		vi.recover

As you can see above i'm using rc.d script to mpd5 startup. It always start with: -p /var/run/mpd5.pid -b
Comment 37 Eugene Grosbein freebsd_committer 2017-06-13 17:24:02 UTC
(In reply to Cassiano Peixoto from comment #36)

Do you use tcsh for your login shell?

Anyway, try starting mpd using direct command "/usr/local/sbin/mpd5 -p /var/run/mpd5.pid -b" for a test instead of "service" to verify if it will be any difference with signal handling.
Comment 38 Cassiano Peixoto 2017-06-13 17:26:48 UTC
(In reply to Eugene Grosbein from comment #37)
Same behaviour:

# ps ax | grep mpd5
53396  1  R+      0:00.00 grep mpd5

# /usr/local/sbin/mpd5 -p /var/run/mpd5.pid -b

# ps ax | grep mpd5
53402  -  Ss      0:00.01 /usr/local/sbin/mpd5 -p /var/run/mpd5.pid -b
53409  1  S+      0:00.00 grep mpd5

# killall -QUIT mpd5

# ps ax | grep mpd5
53413  1  S+      0:00.00 grep mpd5

# ls /var/tmp/
nginx		vi.recover
Comment 39 Eugene Grosbein freebsd_committer 2017-06-13 18:02:02 UTC
(In reply to Cassiano Peixoto from comment #38)

Please show output of the command: dmesg | grep mpd5
Comment 40 Cassiano Peixoto 2017-06-13 18:03:47 UTC
(In reply to Eugene Grosbein from comment #39)

# dmesg | grep mpd5
pid 53402 (mpd5), uid 0: exited on signal 3
pid 53441 (mpd5), uid 0: exited on signal 3
pid 53758 (mpd5), uid 0: exited on signal 3
Comment 41 Eugene Grosbein freebsd_committer 2017-06-13 19:40:56 UTC
(In reply to Cassiano Peixoto from comment #40)

The signal is delivered but core generation is disabled for some reason. To find it, please show output of commands:

sysctl kern.coredump
proccontrol -q -m trace -p $(cat /var/run/mpd5.pid)
limits -c
Comment 42 Cassiano Peixoto 2017-06-13 20:05:36 UTC
(In reply to Eugene Grosbein from comment #41)
I've just found out. My coredumpsize limit has been configured to 0Kb :(

Now it worked fine. Thanks for your help. I'll keep you posted. Until now it's running fine with patch applied.

Thanks.
Comment 43 Eugene Grosbein freebsd_committer 2017-06-14 14:05:11 UTC
(In reply to Cassiano Peixoto from comment #42)

I've repeated my stress test for mpd5 and without RADIUS it runs just fine but with RADIUS I'm able to reliably reproduce this problem in a minute using unpatched mpd5 and libc.

This the patch for mpd5, the problem goes away. Note however, that I did not use mpd's web server and the patch is not complete.

The patch for libc seems to non-functional and incomplete too. I've raised this issue to freebsd-stable@ mailing list trying to get more attention to the problem from developers.
Comment 44 Cassiano Peixoto 2017-06-14 14:15:00 UTC
(In reply to Eugene Grosbein from comment #43)
Hi Eugene,

Thank you very much for your support. I stopped using mpd5 web, i replaced my script with mpd5 console to collect statistics. Let's see if it keeps stable.

Is there a way to reproduce your tests with radius? As i'm using mpd5+radius would be interesting to try after mpd5 patch applied.

If you don't mind I'd like to suggest two things:

1) Attach the patches on this PR, it will be helpful to someone who is following it.
2) Assign this PR to you, because clearly Ermal Luçi is not working on this anymore.
Comment 45 Cassiano Peixoto 2017-06-16 11:54:06 UTC
(In reply to Eugene Grosbein from comment #43)
Hi Eugene,

I just got a stuck process this morning with uwrlck state. As i said before i'm not using mpd web anymore.

Here is the bt result after it hangs:

#0  0x000000080248567c in _umtx_op_err () from /lib/libthr.so.3
#1  0x0000000802479c81 in __thr_rwlock_wrlock (rwlock=0x802694500, tsp=<value optimized out>) at /usr/src/lib/libthr/thread/thr_umtx.c:325
#2  0x00000008024829f3 in _thr_rtld_wlock_acquire (lock=<value optimized out>) at thr_umtx.h:239
#3  0x0000000800703c72 in wlock_acquire (lock=0x800918960, lockstate=0x7fffdfffda58) at /usr/src/libexec/rtld-elf/rtld_lock.c:217
#4  0x00000008006fff4f in _rtld_allocate_tls (oldtls=0x0, tcbsize=32, tcbalign=16) at /usr/src/libexec/rtld-elf/rtld.c:4802
#5  0x0000000802481b69 in _tcb_ctor (thread=0x807c7ba00, initial=<value optimized out>) at /usr/src/lib/libthr/thread/thr_ctrdtr.c:45
#6  0x0000000802481110 in _thr_alloc (curthread=<value optimized out>) at /usr/src/lib/libthr/thread/thr_list.c:170
#7  0x00000008024772a6 in _pthread_create (thread=0x80342fd78, attr=0x0, start_routine=0x4b17c0 <paction_main>, arg=<value optimized out>) at /usr/src/lib/libthr/thread/thr_create.c:78
#8  0x00000000004b1744 in paction_start (actionp=0x80cf7f3b8, mutex=0x7bee08, handler=0x442f40 <AuthAccount>, finish=0x4430c0 <AuthAccountFinish>, arg=0x807c20010) at contrib/libpdel/util/paction.c:115
#9  0x0000000000442eaa in AuthAccountStart (l=0x80cf7f010, type=3) at auth.c:927
#10 0x0000000000442f31 in AuthAccountTimeout (arg=0x80cf7f010) at auth.c:949
#11 0x00000000004a0344 in TimerExpires (type=3, cookie=0x80cf7f120) at timer.c:99
#12 0x0000000000462ed7 in EventHandler (arg=0x80cf7f120) at event.c:146
#13 0x00000000004b154e in pevent_ctx_execute (arg=0x807db6f08) at contrib/libpdel/util/pevent.c:885
#14 0x00000000004b112f in pevent_ctx_service (ev=0x807db6f08) at contrib/libpdel/util/pevent.c:774
#15 0x00000000004b0baf in pevent_ctx_main (arg=0x80322c008) at contrib/libpdel/util/pevent.c:720
#16 0x0000000802477b55 in thread_start (curthread=<value optimized out>) at /usr/src/lib/libthr/thread/thr_create.c:289
Comment 46 Konstantin Belousov freebsd_committer 2017-06-16 13:27:16 UTC
(In reply to Cassiano Peixoto from comment #45)
Show the backtrace for all threads.
Comment 47 Konstantin Belousov freebsd_committer 2017-06-16 13:31:41 UTC
(In reply to Konstantin Belousov from comment #46)
And exactly what FreeBSD version is it ?
Comment 48 Dmitry Luhtionov 2017-06-16 13:35:17 UTC
In comment #31

(root@B-ras)~# file /usr/local/sbin/mpd5
/usr/local/sbin/mpd5: ELF 64-bit LSB executable, x86-64, version 1 (FreeBSD), dynamically linked, interpreter /libexec/ld-elf.so.1, for FreeBSD 11.0 (1100122), FreeBSD-style, not stripped
Comment 49 Cassiano Peixoto 2017-06-16 13:42:29 UTC
(In reply to Konstantin Belousov from comment #47)
I'm using FreeBSD 11.0-p1. But it happens on any FreeBSD version above 9x.

Here is the backtrace for all threads:

(gdb) thread apply all bt

Thread 6 (Thread 803216a00 (LWP 100375/<unknown>)):
#0  0x000000080248567c in _umtx_op_err () from /lib/libthr.so.3
#1  0x0000000802479c81 in __thr_rwlock_wrlock (rwlock=0x802694500, tsp=<value optimized out>) at /usr/src/lib/libthr/thread/thr_umtx.c:325
#2  0x00000008024829f3 in _thr_rtld_wlock_acquire (lock=<value optimized out>) at thr_umtx.h:239
#3  0x0000000800703c72 in wlock_acquire (lock=0x800918960, lockstate=0x7fffdfffda58) at /usr/src/libexec/rtld-elf/rtld_lock.c:217
#4  0x00000008006fff4f in _rtld_allocate_tls (oldtls=0x0, tcbsize=32, tcbalign=16) at /usr/src/libexec/rtld-elf/rtld.c:4802
#5  0x0000000802481b69 in _tcb_ctor (thread=0x807c7ba00, initial=<value optimized out>) at /usr/src/lib/libthr/thread/thr_ctrdtr.c:45
#6  0x0000000802481110 in _thr_alloc (curthread=<value optimized out>) at /usr/src/lib/libthr/thread/thr_list.c:170
#7  0x00000008024772a6 in _pthread_create (thread=0x80342fd78, attr=0x0, start_routine=0x4b17c0 <paction_main>, arg=<value optimized out>) at /usr/src/lib/libthr/thread/thr_create.c:78
#8  0x00000000004b1744 in paction_start (actionp=0x80cf7f3b8, mutex=0x7bee08, handler=0x442f40 <AuthAccount>, finish=0x4430c0 <AuthAccountFinish>, arg=0x807c20010) at contrib/libpdel/util/paction.c:115
#9  0x0000000000442eaa in AuthAccountStart (l=0x80cf7f010, type=3) at auth.c:927
#10 0x0000000000442f31 in AuthAccountTimeout (arg=0x80cf7f010) at auth.c:949
#11 0x00000000004a0344 in TimerExpires (type=3, cookie=0x80cf7f120) at timer.c:99
#12 0x0000000000462ed7 in EventHandler (arg=0x80cf7f120) at event.c:146
#13 0x00000000004b154e in pevent_ctx_execute (arg=0x807db6f08) at contrib/libpdel/util/pevent.c:885
#14 0x00000000004b112f in pevent_ctx_service (ev=0x807db6f08) at contrib/libpdel/util/pevent.c:774
#15 0x00000000004b0baf in pevent_ctx_main (arg=0x80322c008) at contrib/libpdel/util/pevent.c:720
#16 0x0000000802477b55 in thread_start (curthread=<value optimized out>) at /usr/src/lib/libthr/thread/thr_create.c:289
#17 0x0000000000000000 in ?? ()

Thread 5 (Thread 80342a000 (LWP 100378/<unknown>)):
#0  0x000000080248567a in _umtx_op_err () from /lib/libthr.so.3
#1  0x00000008024796e4 in __thr_umutex_lock (mtx=<value optimized out>, id=<value optimized out>) at /usr/src/lib/libthr/thread/thr_umtx.c:80
#2  0x0000000802481106 in _thr_alloc (curthread=<value optimized out>) at thr_umtx.h:123
#3  0x00000008024772a6 in _pthread_create (thread=0x8076dd438, attr=0x0, start_routine=0x4c4ec0 <http_server_connection_main>, arg=<value optimized out>) at /usr/src/lib/libthr/thread/thr_create.c:78
#4  0x00000000004c400a in http_server_accept (arg=0x803420088) at contrib/libpdel/http/http_server.c:653
#5  0x00000000004b154e in pevent_ctx_execute (arg=0x803420108) at contrib/libpdel/util/pevent.c:885
#6  0x00000000004b112f in pevent_ctx_service (ev=0x803420108) at contrib/libpdel/util/pevent.c:774
#7  0x00000000004b0baf in pevent_ctx_main (arg=0x80340e0a8) at contrib/libpdel/util/pevent.c:720
#8  0x0000000802477b55 in thread_start (curthread=<value optimized out>) at /usr/src/lib/libthr/thread/thr_create.c:289
#9  0x0000000000000000 in ?? ()

Thread 4 (Thread 807ccdf00 (LWP 101673/<unknown>)):
#0  0x000000080277d43a in _nanosleep () from /lib/libc.so.7
#1  0x000000080247a4cc in __thr_nanosleep (time_to_sleep=0x7fffcbd5ceb0, time_remaining=0x0) at /usr/src/lib/libthr/thread/thr_syscalls.c:257
#2  0x00000008027d5036 in __usleep (useconds=<value optimized out>) at /usr/src/lib/libc/gen/usleep.c:52
#3  0x00000000004b1aaf in paction_cleanup (arg=0x80342f288) at contrib/libpdel/util/paction.c:243
#4  0x0000000802485550 in __pthread_cleanup_pop_imp (execute=<value optimized out>) at /usr/src/lib/libthr/thread/thr_clean.c:73
#5  0x00000000004b189f in paction_main (arg=0x80342f288) at contrib/libpdel/util/paction.c:197
#6  0x0000000802477b55 in thread_start (curthread=<value optimized out>) at /usr/src/lib/libthr/thread/thr_create.c:289
#7  0x00007fffcbb5d000 in ?? ()
Cannot access memory at address 0x7fffcbd5d000

Thread 3 (Thread 804fd4900 (LWP 101715/<unknown>)):
#0  0x000000080277d43a in _nanosleep () from /lib/libc.so.7
#1  0x000000080247a4cc in __thr_nanosleep (time_to_sleep=0x7fffcad54eb0, time_remaining=0x0) at /usr/src/lib/libthr/thread/thr_syscalls.c:257
#2  0x00000008027d5036 in __usleep (useconds=<value optimized out>) at /usr/src/lib/libc/gen/usleep.c:52
---Type <return> to continue, or q <return> to quit---
#3  0x00000000004b1aaf in paction_cleanup (arg=0x80342fb48) at contrib/libpdel/util/paction.c:243
#4  0x0000000802485550 in __pthread_cleanup_pop_imp (execute=<value optimized out>) at /usr/src/lib/libthr/thread/thr_clean.c:73
#5  0x00000000004b189f in paction_main (arg=0x80342fb48) at contrib/libpdel/util/paction.c:197
#6  0x0000000802477b55 in thread_start (curthread=<value optimized out>) at /usr/src/lib/libthr/thread/thr_create.c:289
#7  0x00007fffcab55000 in ?? ()
Cannot access memory at address 0x7fffcad55000

Thread 2 (Thread 807d70500 (LWP 101723/<unknown>)):
#0  0x000000080277d43a in _nanosleep () from /lib/libc.so.7
#1  0x000000080247a4cc in __thr_nanosleep (time_to_sleep=0x7fffbc0deeb0, time_remaining=0x0) at /usr/src/lib/libthr/thread/thr_syscalls.c:257
#2  0x00000008027d5036 in __usleep (useconds=<value optimized out>) at /usr/src/lib/libc/gen/usleep.c:52
#3  0x00000000004b1aaf in paction_cleanup (arg=0x80342fcd8) at contrib/libpdel/util/paction.c:243
#4  0x0000000802485550 in __pthread_cleanup_pop_imp (execute=<value optimized out>) at /usr/src/lib/libthr/thread/thr_clean.c:73
#5  0x00000000004b189f in paction_main (arg=0x80342fcd8) at contrib/libpdel/util/paction.c:197
#6  0x0000000802477b55 in thread_start (curthread=<value optimized out>) at /usr/src/lib/libthr/thread/thr_create.c:289
#7  0x00007fffbbedf000 in ?? ()
Cannot access memory at address 0x7fffbc0df000

Thread 1 (Thread 807cd4800 (LWP 101061/<unknown>)):
#0  0x000000080277d43a in _nanosleep () from /lib/libc.so.7
#1  0x000000080247a4cc in __thr_nanosleep (time_to_sleep=0x7fffa9c4ceb0, time_remaining=0x0) at /usr/src/lib/libthr/thread/thr_syscalls.c:257
#2  0x00000008027d5036 in __usleep (useconds=<value optimized out>) at /usr/src/lib/libc/gen/usleep.c:52
#3  0x00000000004b1aaf in paction_cleanup (arg=0x80342fd28) at contrib/libpdel/util/paction.c:243
#4  0x0000000802485550 in __pthread_cleanup_pop_imp (execute=<value optimized out>) at /usr/src/lib/libthr/thread/thr_clean.c:73
#5  0x00000000004b189f in paction_main (arg=0x80342fd28) at contrib/libpdel/util/paction.c:197
#6  0x0000000802477b55 in thread_start (curthread=<value optimized out>) at /usr/src/lib/libthr/thread/thr_create.c:289
#7  0x00007fffa9a4d000 in ?? ()
Cannot access memory at address 0x7fffa9c4d000
#0  0x000000080248567c in _umtx_op_err () from /lib/libthr.so.3
Comment 50 Konstantin Belousov freebsd_committer 2017-06-16 14:08:42 UTC
(In reply to Cassiano Peixoto from comment #49)
From the thread 6, in the frame 1, please do 'p *rwlock'
Comment 51 Cassiano Peixoto 2017-06-16 14:10:52 UTC
(In reply to Konstantin Belousov from comment #50)
Please, see below:

(gdb) frame 1
#1  0x0000000802479c81 in __thr_rwlock_wrlock (rwlock=0x802694500, tsp=<value optimized out>) at /usr/src/lib/libthr/thread/thr_umtx.c:325
325		return (_umtx_op_err(rwlock, UMTX_OP_RW_WRLOCK, 0, (void *)tm_size,
Current language:  auto; currently minimal
(gdb) p *rwlock
$1 = {rw_state = -1610612736, rw_flags = 2, rw_blocked_readers = 0, rw_blocked_writers = 0, rw_spare = 0x802694510}
(gdb)
Comment 52 Konstantin Belousov freebsd_committer 2017-06-16 14:37:18 UTC
(In reply to Cassiano Peixoto from comment #51)
Please apply the following patch, recompile libthr.so _only_, and then reproduce the issue again.  I need the same information: backtraces from all alive threads, and for the thread which is blocked on lock in _rtld_allocate_tls(), the printout of *rwlock.
Comment 53 Konstantin Belousov freebsd_committer 2017-06-16 14:38:02 UTC
Created attachment 183536 [details]
debugging patch to track rtld bind lock write owner
Comment 54 Cassiano Peixoto 2017-06-16 15:10:51 UTC
(In reply to Konstantin Belousov from comment #53)
Should i restart or recompile mpd5 after patch applied?
Comment 55 Dmitry Luhtionov 2017-06-16 15:11:48 UTC
Restart only.
Comment 56 Dmitry Luhtionov 2017-06-16 15:17:31 UTC
Although I may be wrong.

Can you compile mpd5 from sources from sf.net ?

I can add some helpful debug to than.
Comment 57 Eugene Grosbein freebsd_committer 2017-06-16 15:23:29 UTC
Created attachment 183537 [details]
patch for lib/syslog by kib

(In reply to Cassiano Peixoto from comment #51)

I'd advise you stop using any patches sumbitted my me to the PR 214482 as they are no longer needed to find a root of the "umtxn" problem and are not quite correct. Instead, as more complete fix to "umtxn" problem use two following patches: first by Konstantin for syslog (libc) that should be applied so:

cd /usr/src
patch < /path/to/patch
cd lib/libc
make obj depend && make all install

Then restart mpd5 only.

Second patch for mpd5 itself by me. It tries to fix "uwrlck" problem in more correct way, replacing my previous patch. It deals with CLI console only, not touching web part still. Please make sure no http access is done to mpd. It would be fine to comment out "set web" commands in the mpd.conf for a while.

And, of course, apply Konstantin's latest patch for rtld.
Comment 58 Eugene Grosbein freebsd_committer 2017-06-16 15:24:45 UTC
Created attachment 183538 [details]
patch for mpd/console locks by me
Comment 59 Eugene Grosbein freebsd_committer 2017-06-16 16:16:20 UTC
(In reply to Cassiano Peixoto from comment #49)

Note: Thread 5 shows "http_server_accept" in the trace that means that mpd web interface is being used.
Comment 60 Cassiano Peixoto 2017-06-16 16:55:54 UTC
(In reply to Dmitry Luhtionov from comment #56)
Hi Dmitry,

Yes, i can. Please just provide the link and i can download and compile.

Thanks.
Comment 61 Cassiano Peixoto 2017-06-16 17:11:00 UTC
(In reply to Eugene Grosbein from comment #59)
Eugene, i think it's because web option was still enabled in mpd.conf. Anyway i disabled it now to avoid any issues.
Comment 62 Cassiano Peixoto 2017-06-16 17:16:37 UTC
(In reply to Eugene Grosbein from comment #57)
Eugene, i got the following error compiling libc patch:

===> tests (all)
===> tests/tls_dso (all)
===> tests/c063 (all)
(cd /usr/src/lib/libc/tests/c063 &&  DEPENDFILE=.depend.faccessat_test  NO_SUBDIR=1 make -f /usr/src/lib/libc/tests/c063/Makefile _RECURSING_PROGS=t  PROG=faccessat_test )
make[3]: /usr/obj/usr/src/lib/libc/tests/c063/.depend.faccessat_test, 1: ignoring stale .depend.faccessat_test for /usr/obj/usr/src/lib/libnetbsd/libnetbsd.a
cc -O2 -pipe -D_INCOMPLETE_XOPEN_C063 -I/usr/src/lib/libnetbsd -I/usr/src/contrib/netbsd-tests -g -std=gnu99 -fstack-protector-strong -Wsystem-headers -Werror -Wall -Wno-format-y2k -Wno-uninitialized -Wno-pointer-sign -Wno-empty-body -Wno-string-plus-int -Wno-unused-const-variable -Wno-tautological-compare -Wno-unused-value -Wno-parentheses-equality -Wno-unused-function -Wno-enum-conversion -Wno-unused-local-typedef -Wno-switch -Wno-switch-enum -Wno-knr-promoted-parameter -Qunused-arguments  -L/usr/obj/usr/src/lib/libnetbsd -o faccessat_test.full t_faccessat.o  -lprivateatf-c -L/usr/obj/usr/src/lib/libnetbsd -lnetbsd
/usr/bin/ld: cannot find -lnetbsd
cc: error: linker command failed with exit code 1 (use -v to see invocation)
*** Error code 1

Stop.
make[3]: stopped in /usr/src/lib/libc/tests/c063
*** Error code 1

Stop.
make[2]: stopped in /usr/src/lib/libc/tests/c063
*** Error code 1

Stop.
make[1]: stopped in /usr/src/lib/libc/tests
*** Error code 1

Stop.
make: stopped in /usr/src/lib/libc
Exit 1
Comment 63 Konstantin Belousov freebsd_committer 2017-06-16 17:19:15 UTC
(In reply to Cassiano Peixoto from comment #62)
Use 'make -DWITHOUT_TESTS=yes' to compile both libc and libthr.
Comment 64 Konstantin Belousov freebsd_committer 2017-06-16 17:23:13 UTC
(In reply to Konstantin Belousov from comment #63)
Sorry, the correct spelling is 'make WITHOUT_TESTS=yes', no -D is needed.
Comment 65 Cassiano Peixoto 2017-06-16 17:40:12 UTC
(In reply to Konstantin Belousov from comment #64)
Konstantin,

Libc patch compiled fine after using 'make WITHOUT_TESTS=yes'. But libthr patch doesn't, see:

cc  -O2 -pipe   -DPTHREAD_KERNEL -I/usr/src/lib/libthr/../libc/include -I/usr/src/lib/libthr/thread  -I/usr/src/lib/libthr/../../include -I/usr/src/lib/libthr/arch/amd64/include -I/usr/src/lib/libthr/sys -I/usr/src/lib/libthr/../../libexec/rtld-elf -I/usr/src/lib/libthr/../../libexec/rtld-elf/amd64 -I/usr/src/lib/libthr/../libthread_db -Winline -fexceptions -D_PTHREAD_FORCED_UNWIND -D_PTHREADS_INVARIANTS -mno-mmx -mno-sse -mno-avx -MD  -MF.depend.thr_init.o -MTthr_init.o -std=gnu99 -Wsystem-headers -Werror -Wall -Wno-format-y2k -W -Wno-unused-parameter -Wstrict-prototypes -Wmissing-prototypes -Wpointer-arith -Wno-uninitialized -Wno-pointer-sign -Wno-empty-body -Wno-string-plus-int -Wno-unused-const-variable -Wno-tautological-compare -Wno-unused-value -Wno-parentheses-equality -Wno-unused-function -Wno-enum-conversion -Wno-unused-local-typedef  -Qunused-arguments  -c /usr/src/lib/libthr/thread/thr_init.c -o thr_init.o
/usr/src/lib/libthr/thread/thr_init.c:71:35: error: excess elements in scalar initializer [-Werror]
struct urwlock  _thr_atfork_lock = DEFAULT_URWLOCK;
                                   ^~~~~~~~~~~~~~~
/usr/src/lib/libthr/thread/thr_umtx.h:40:39: note: expanded from macro 'DEFAULT_URWLOCK'
#define DEFAULT_URWLOCK {0,0,0,0,0,{0,0,0}}
                                      ^
/usr/src/lib/libthr/thread/thr_init.c:71:35: error: suggest braces around initialization of subobject [-Werror,-Wmissing-braces]
struct urwlock  _thr_atfork_lock = DEFAULT_URWLOCK;
                                   ^~~~~~~~~~~~~~~
/usr/src/lib/libthr/thread/thr_umtx.h:40:34: note: expanded from macro 'DEFAULT_URWLOCK'
#define DEFAULT_URWLOCK {0,0,0,0,0,{0,0,0}}
                                 ^~~~~~~~~
/usr/src/lib/libthr/thread/thr_init.c:127:33: error: excess elements in scalar initializer [-Werror]
struct urwlock  _thr_list_lock = DEFAULT_URWLOCK;
                                 ^~~~~~~~~~~~~~~
/usr/src/lib/libthr/thread/thr_umtx.h:40:39: note: expanded from macro 'DEFAULT_URWLOCK'
#define DEFAULT_URWLOCK {0,0,0,0,0,{0,0,0}}
                                      ^
/usr/src/lib/libthr/thread/thr_init.c:127:33: error: suggest braces around initialization of subobject [-Werror,-Wmissing-braces]
struct urwlock  _thr_list_lock = DEFAULT_URWLOCK;
                                 ^~~~~~~~~~~~~~~
/usr/src/lib/libthr/thread/thr_umtx.h:40:34: note: expanded from macro 'DEFAULT_URWLOCK'
#define DEFAULT_URWLOCK {0,0,0,0,0,{0,0,0}}
                                 ^~~~~~~~~
4 errors generated.
*** Error code 1

Stop.
make: stopped in /usr/src/lib/libthr
Comment 66 Konstantin Belousov freebsd_committer 2017-06-16 17:58:40 UTC
(In reply to Cassiano Peixoto from comment #65)
Perhaps you need to manually copy patched sys/_umtx.h to /usr/include/sys.  Do not forget to revert this when cleaning your machine.
Comment 67 Cassiano Peixoto 2017-06-16 18:08:06 UTC
(In reply to Konstantin Belousov from comment #66)
Konstantin,

Worked after copied _umtx.h. Thanks.

Konstantin and Eugene,

I applied the patches, to sum up what i've applied:

- libc patch
- libthr patch
- mpd console patch

As you know it takes some days to happen again. Last time took 5 days. I wish, but i don't know how to force to reproduce the issue at any time.

I'll keep you posted guys, thank you.
Comment 68 Eugene Grosbein freebsd_committer 2017-06-16 21:50:38 UTC
*** Bug 214482 has been marked as a duplicate of this bug. ***
Comment 69 Kubilay Kocak freebsd_committer freebsd_triage 2017-06-16 23:49:37 UTC
Assign to Eugene who appears (currently) to be coordinating. 

It would be ideal if a separate (dependent) issue is created for the/any base changes required to resolve this issue, so that it can be assigned to the person resolving on that side (kib?) and tracked for MFC's and release engineering if appropriate (ports categorized issues don't have those flags)
Comment 70 Eugene Grosbein freebsd_committer 2017-06-17 09:09:22 UTC
JFYI: these patches made my mpd server side pretty stable, so my stress test can run long enough to hit another problem that produces repeatable kernel panics at client side. I've digged that problem and produced a fix for ng_iface(4) and filled another PR for that distinct problem:

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=220076
Comment 71 Eugene Grosbein freebsd_committer 2017-06-17 12:24:16 UTC
IFYI2: another reason for this task to break and new set of patches here: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=220078
Comment 72 Kubilay Kocak freebsd_committer freebsd_triage 2017-06-18 09:40:42 UTC
(In reply to Eugene Grosbein from comment #71)

Is there an issue for the syslog related changes/patch discussed with kib@ on freebsd-current?
Comment 73 Eugene Grosbein freebsd_committer 2017-06-18 11:46:43 UTC
(In reply to Kubilay Kocak from comment #72)

This is the patch https://bugs.freebsd.org/bugzilla/attachment.cgi?id=183537

My tests show no issues with it. It really helps and I'd like to see it included in upcoming 11.1-RELEASE and merged to stable/10 too for announced 10.4.
Comment 74 Kubilay Kocak freebsd_committer freebsd_triage 2017-06-20 03:03:33 UTC
(In reply to Eugene Grosbein from comment #73)

I order to ensure that (base) change is tracked and merged correctly, I'd recommend (as suggested in comment 69) creating a separate dependent issue so that it can be:

- Assigned to kib (as distinct from this overarching issue)
- Tracked for MFC's correctly and completely
- Used to notify/inform release engineering specifically and only for that change
Comment 75 Eugene Grosbein freebsd_committer 2017-06-20 09:51:57 UTC
(In reply to Kubilay Kocak from comment #74)

Done: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=220151

Konstantin Belousov (author of the patch) is in the re@, so this team should be aware of the problem.
Comment 76 Cassiano Peixoto 2017-06-20 18:54:10 UTC
(In reply to Eugene Grosbein from comment #75)
Eugene, are you going to provide all these patches to FreeBSD 10-STABLE?
Comment 77 Eugene Grosbein freebsd_committer 2017-06-20 19:31:00 UTC
(In reply to Cassiano Peixoto from comment #76)

All patches are pretty simple and should just work for 10.x
Comment 78 Cassiano Peixoto 2017-06-27 12:46:25 UTC
(In reply to Eugene Grosbein from comment #77)
Hi guys,

After 8 days working with no issues i think at last it has been fixed. Well done Eugene :)

I could see libc patch has been committed to 11-STABLE. Is there a plan to commit libthr patch and all others PRs related before 11.1-RELEASE?

Thank you guys.
Comment 79 Eugene Grosbein freebsd_committer 2017-06-27 17:22:59 UTC
(In reply to Cassiano Peixoto from comment #78)

I believe, the libthr patch was for debugging purposes only and is not needed to fix the problem itself.

Did you test it with web server disabled? If so, please update your ports tree and rebuild mpd5 port that now has "console" patch integrated, so remove your local version of the patch before updating ports tree - my tests show that it should be enough to fix mpd's web server too. And run mpd-5.8_1 with web server enabled as you did earlier to verify that.
Comment 80 Cassiano Peixoto 2017-06-27 17:46:28 UTC
(In reply to Eugene Grosbein from comment #79)
Yes, i did. All my tests were with web server disabled as you requested. Only with console enabled. I'll run with web server enabled to try the patch. 

How about other related patchs (like netgraph, ipfw, etc) ? Are you going to commit?

Thanks.
Comment 81 Eugene Grosbein freebsd_committer 2017-06-27 17:53:29 UTC
(In reply to Cassiano Peixoto from comment #80)

I cannot commit kernel patches myself as I have no src commit bit. Any src committed is needed to take a look at least, so I've filled my PRs.
Comment 82 Cassiano Peixoto 2017-06-27 17:59:35 UTC
(In reply to Eugene Grosbein from comment #81)
Humm i see. Maybe Konstantin Belousov could help us :)
Comment 83 Cassiano Peixoto 2017-06-27 21:14:11 UTC
(In reply to Eugene Grosbein from comment #81)
Eugene and Konstantin,

Bad news, it just stopped working. Eugene i hadn't enabled web server yet. So it stucked the mpd5 process with patch applied listed on comment #67.

The top out i can see same message:

1295 root             9  52    0   314M 54820K uwrlck  7 625:46   0.74% mpd5

Here it is the bt:

(gdb) bt
#0  0x000000080228567c in _umtx_op_err () from /lib/libthr.so.3
#1  0x0000000802279c81 in __thr_rwlock_wrlock (rwlock=0x802494500, tsp=<value optimized out>) at /usr/src/lib/libthr/thread/thr_umtx.c:325
#2  0x00000008022829f3 in _thr_rtld_wlock_acquire (lock=<value optimized out>) at thr_umtx.h:239
#3  0x00000008006aec72 in wlock_acquire (lock=0x8008c3960, lockstate=0x7fffdfffda08) at /usr/src/libexec/rtld-elf/rtld_lock.c:217
#4  0x00000008006aaf4f in _rtld_allocate_tls (oldtls=0x0, tcbsize=32, tcbalign=16) at /usr/src/libexec/rtld-elf/rtld.c:4802
#5  0x0000000802281b69 in _tcb_ctor (thread=0x80c740f00, initial=<value optimized out>) at /usr/src/lib/libthr/thread/thr_ctrdtr.c:45
#6  0x0000000802281110 in _thr_alloc (curthread=<value optimized out>) at /usr/src/lib/libthr/thread/thr_list.c:170
#7  0x00000008022772a6 in _pthread_create (thread=0x80324d598, attr=0x0, start_routine=0x4653b0 <fseeko@plt+388260>, arg=<value optimized out>) at /usr/src/lib/libthr/thread/thr_create.c:78
#8  0x000000000046535b in ?? ()
#9  0x00000000004292a8 in ?? ()
#10 0x000000000042d219 in ?? ()
#11 0x0000000000448da3 in ?? ()
#12 0x00000000004493b1 in ?? ()
#13 0x000000000043ac94 in ?? ()
#14 0x000000000043d650 in ?? ()
#15 0x000000000043d366 in ?? ()
#16 0x000000000043b21d in ?? ()
#17 0x0000000000445012 in ?? ()
#18 0x000000000044b598 in ?? ()
#19 0x0000000000439c9f in ?? ()
#20 0x00000000004651d6 in ?? ()
#21 0x0000000000464908 in ?? ()
#22 0x0000000802277b55 in thread_start (curthread=<value optimized out>) at /usr/src/lib/libthr/thread/thr_create.c:289

Here it is all threads:

(gdb) thread apply all bt

Thread 9 (Thread 803016a00 (LWP 100704/<unknown>)):
#0  0x000000080228567c in _umtx_op_err () from /lib/libthr.so.3
#1  0x0000000802279c81 in __thr_rwlock_wrlock (rwlock=0x802494500, tsp=<value optimized out>) at /usr/src/lib/libthr/thread/thr_umtx.c:325
#2  0x00000008022829f3 in _thr_rtld_wlock_acquire (lock=<value optimized out>) at thr_umtx.h:239
#3  0x00000008006aec72 in wlock_acquire (lock=0x8008c3960, lockstate=0x7fffdfffda08) at /usr/src/libexec/rtld-elf/rtld_lock.c:217
#4  0x00000008006aaf4f in _rtld_allocate_tls (oldtls=0x0, tcbsize=32, tcbalign=16) at /usr/src/libexec/rtld-elf/rtld.c:4802
#5  0x0000000802281b69 in _tcb_ctor (thread=0x80c740f00, initial=<value optimized out>) at /usr/src/lib/libthr/thread/thr_ctrdtr.c:45
#6  0x0000000802281110 in _thr_alloc (curthread=<value optimized out>) at /usr/src/lib/libthr/thread/thr_list.c:170
#7  0x00000008022772a6 in _pthread_create (thread=0x80324d598, attr=0x0, start_routine=0x4653b0 <fseeko@plt+388260>, arg=<value optimized out>) at /usr/src/lib/libthr/thread/thr_create.c:78
#8  0x000000000046535b in ?? ()
#9  0x00000000004292a8 in ?? ()
#10 0x000000000042d219 in ?? ()
#11 0x0000000000448da3 in ?? ()
#12 0x00000000004493b1 in ?? ()
#13 0x000000000043ac94 in ?? ()
#14 0x000000000043d650 in ?? ()
#15 0x000000000043d366 in ?? ()
#16 0x000000000043b21d in ?? ()
#17 0x0000000000445012 in ?? ()
#18 0x000000000044b598 in ?? ()
#19 0x0000000000439c9f in ?? ()
#20 0x00000000004651d6 in ?? ()
#21 0x0000000000464908 in ?? ()
#22 0x0000000802277b55 in thread_start (curthread=<value optimized out>) at /usr/src/lib/libthr/thread/thr_create.c:289
#23 0x0000000000000000 in ?? ()

Thread 8 (Thread 80326ca00 (LWP 101005/<unknown>)):
#0  0x000000080257d43a in _nanosleep () from /lib/libc.so.7
#1  0x000000080227a4cc in __thr_nanosleep (time_to_sleep=0x7fffc6731ed0, time_remaining=0x0) at /usr/src/lib/libthr/thread/thr_syscalls.c:257
#2  0x00000008025d5076 in __usleep (useconds=<value optimized out>) at /usr/src/lib/libc/gen/usleep.c:52
#3  0x0000000000465532 in ?? ()
#4  0x0000000802285550 in __pthread_cleanup_pop_imp (execute=<value optimized out>) at /usr/src/lib/libthr/thread/thr_clean.c:73
#5  0x000000000046541e in ?? ()
#6  0x0000000802277b55 in thread_start (curthread=<value optimized out>) at /usr/src/lib/libthr/thread/thr_create.c:289
#7  0x00007fffc6532000 in ?? ()
Cannot access memory at address 0x7fffc6732000

Thread 7 (Thread 80326ed00 (LWP 101007/<unknown>)):
#0  0x000000080257d43a in _nanosleep () from /lib/libc.so.7
#1  0x000000080227a4cc in __thr_nanosleep (time_to_sleep=0x7fffb7cbced0, time_remaining=0x0) at /usr/src/lib/libthr/thread/thr_syscalls.c:257
#2  0x00000008025d5076 in __usleep (useconds=<value optimized out>) at /usr/src/lib/libc/gen/usleep.c:52
#3  0x0000000000465532 in ?? ()
#4  0x0000000802285550 in __pthread_cleanup_pop_imp (execute=<value optimized out>) at /usr/src/lib/libthr/thread/thr_clean.c:73
#5  0x000000000046541e in ?? ()
#6  0x0000000802277b55 in thread_start (curthread=<value optimized out>) at /usr/src/lib/libthr/thread/thr_create.c:289
#7  0x00007fffb7abd000 in ?? ()
Cannot access memory at address 0x7fffb7cbd000

---Type <return> to continue, or q <return> to quit---
Thread 6 (Thread 803252e00 (LWP 100809/<unknown>)):
#0  0x000000080257d43a in _nanosleep () from /lib/libc.so.7
#1  0x000000080227a4cc in __thr_nanosleep (time_to_sleep=0x7fffb86c1ed0, time_remaining=0x0) at /usr/src/lib/libthr/thread/thr_syscalls.c:257
#2  0x00000008025d5076 in __usleep (useconds=<value optimized out>) at /usr/src/lib/libc/gen/usleep.c:52
#3  0x0000000000465532 in ?? ()
#4  0x0000000802285550 in __pthread_cleanup_pop_imp (execute=<value optimized out>) at /usr/src/lib/libthr/thread/thr_clean.c:73
#5  0x000000000046541e in ?? ()
#6  0x0000000802277b55 in thread_start (curthread=<value optimized out>) at /usr/src/lib/libthr/thread/thr_create.c:289
#7  0x00007fffb84c2000 in ?? ()
Cannot access memory at address 0x7fffb86c2000

Thread 5 (Thread 807292a00 (LWP 100290/<unknown>)):
#0  0x000000080257d43a in _nanosleep () from /lib/libc.so.7
#1  0x000000080227a4cc in __thr_nanosleep (time_to_sleep=0x7fff88b44ed0, time_remaining=0x0) at /usr/src/lib/libthr/thread/thr_syscalls.c:257
#2  0x00000008025d5076 in __usleep (useconds=<value optimized out>) at /usr/src/lib/libc/gen/usleep.c:52
#3  0x0000000000465532 in ?? ()
#4  0x0000000802285550 in __pthread_cleanup_pop_imp (execute=<value optimized out>) at /usr/src/lib/libthr/thread/thr_clean.c:73
#5  0x000000000046541e in ?? ()
#6  0x0000000802277b55 in thread_start (curthread=<value optimized out>) at /usr/src/lib/libthr/thread/thr_create.c:289
#7  0x00007fff88945000 in ?? ()
Cannot access memory at address 0x7fff88b45000

Thread 4 (Thread 803253d00 (LWP 101008/<unknown>)):
#0  0x000000080257d43a in _nanosleep () from /lib/libc.so.7
#1  0x000000080227a4cc in __thr_nanosleep (time_to_sleep=0x7fffbdaebed0, time_remaining=0x0) at /usr/src/lib/libthr/thread/thr_syscalls.c:257
#2  0x00000008025d5076 in __usleep (useconds=<value optimized out>) at /usr/src/lib/libc/gen/usleep.c:52
#3  0x0000000000465532 in ?? ()
#4  0x0000000802285550 in __pthread_cleanup_pop_imp (execute=<value optimized out>) at /usr/src/lib/libthr/thread/thr_clean.c:73
#5  0x000000000046541e in ?? ()
#6  0x0000000802277b55 in thread_start (curthread=<value optimized out>) at /usr/src/lib/libthr/thread/thr_create.c:289
#7  0x00007fffbd8ec000 in ?? ()
Cannot access memory at address 0x7fffbdaec000

Thread 3 (Thread 808089000 (LWP 101010/<unknown>)):
#0  0x000000080257d43a in _nanosleep () from /lib/libc.so.7
#1  0x000000080227a4cc in __thr_nanosleep (time_to_sleep=0x7fff8bf5eed0, time_remaining=0x0) at /usr/src/lib/libthr/thread/thr_syscalls.c:257
#2  0x00000008025d5076 in __usleep (useconds=<value optimized out>) at /usr/src/lib/libc/gen/usleep.c:52
#3  0x0000000000465532 in ?? ()
#4  0x0000000802285550 in __pthread_cleanup_pop_imp (execute=<value optimized out>) at /usr/src/lib/libthr/thread/thr_clean.c:73
#5  0x000000000046541e in ?? ()
#6  0x0000000802277b55 in thread_start (curthread=<value optimized out>) at /usr/src/lib/libthr/thread/thr_create.c:289
#7  0x00007fff8bd5f000 in ?? ()
Cannot access memory at address 0x7fff8bf5f000

Thread 2 (Thread 80808a400 (LWP 101011/<unknown>)):
#0  0x000000080257d43a in _nanosleep () from /lib/libc.so.7
#1  0x000000080227a4cc in __thr_nanosleep (time_to_sleep=0x7fff9178aed0, time_remaining=0x0) at /usr/src/lib/libthr/thread/thr_syscalls.c:257
#2  0x00000008025d5076 in __usleep (useconds=<value optimized out>) at /usr/src/lib/libc/gen/usleep.c:52
#3  0x0000000000465532 in ?? ()
#4  0x0000000802285550 in __pthread_cleanup_pop_imp (execute=<value optimized out>) at /usr/src/lib/libthr/thread/thr_clean.c:73
#5  0x000000000046541e in ?? ()
---Type <return> to continue, or q <return> to quit---
#6  0x0000000802277b55 in thread_start (curthread=<value optimized out>) at /usr/src/lib/libthr/thread/thr_create.c:289
#7  0x00007fff9158b000 in ?? ()
Cannot access memory at address 0x7fff9178b000

Thread 1 (Thread 807258d00 (LWP 101012/<unknown>)):
#0  0x000000080257d43a in _nanosleep () from /lib/libc.so.7
#1  0x000000080227a4cc in __thr_nanosleep (time_to_sleep=0x7fff7c2e0ed0, time_remaining=0x0) at /usr/src/lib/libthr/thread/thr_syscalls.c:257
#2  0x00000008025d5076 in __usleep (useconds=<value optimized out>) at /usr/src/lib/libc/gen/usleep.c:52
#3  0x0000000000465532 in ?? ()
#4  0x0000000802285550 in __pthread_cleanup_pop_imp (execute=<value optimized out>) at /usr/src/lib/libthr/thread/thr_clean.c:73
#5  0x000000000046541e in ?? ()
#6  0x0000000802277b55 in thread_start (curthread=<value optimized out>) at /usr/src/lib/libthr/thread/thr_create.c:289
#7  0x00007fff7c0e1000 in ?? ()
Cannot access memory at address 0x7fff7c2e1000
#0  0x000000080228567c in _umtx_op_err () from /lib/libthr.so.3

Printof of *rwlock:

(gdb) frame 1
#1  0x0000000802279c81 in __thr_rwlock_wrlock (rwlock=0x802494500, tsp=<value optimized out>) at /usr/src/lib/libthr/thread/thr_umtx.c:325
325		return (_umtx_op_err(rwlock, UMTX_OP_RW_WRLOCK, 0, (void *)tm_size,
Current language:  auto; currently minimal
(gdb) p *rwlock
$1 = {rw_state = -1610612736, rw_flags = 2, rw_blocked_readers = 0, rw_blocked_writers = 0, rw_wowner = 100704, rw_spare = 0x802494514}

I could see some kernel messages as well, earlier today:

Jun 24 10:14:21 B-ras kernel: node: ID [20498d]: type 'tee', 0 hooks, flags 0x9, 0 refs, mpd1295-vlan340-44-lt:
Jun 24 10:14:21 B-ras kernel: Last active @ /usr/src/sys/netgraph/ng_base.c, line 804
Jun 24 10:14:21 B-ras kernel: problem discovered at file /usr/src/sys/netgraph/ng_base.c, line 3709
Jun 24 10:14:21 B-ras kernel: KDB: stack backtrace:
Jun 24 10:14:21 B-ras kernel: #0 0xffffffff80b1af87 at kdb_backtrace+0x67
Jun 24 10:14:21 B-ras kernel: #1 0xffffffff80c46671 at ng_send_fn1+0xc1
Jun 24 10:14:21 B-ras kernel: #2 0xffffffff80c3e9c4 at ng_destroy_hook+0x334
Jun 24 10:14:21 B-ras kernel: #3 0xffffffff80c4289b at ng_apply_item+0x3eb
Jun 24 10:14:21 B-ras kernel: #4 0xffffffff80c48475 at ngthread+0x315
Jun 24 10:14:21 B-ras kernel: #5 0xffffffff80a88c55 at fork_exit+0x85
Jun 24 10:14:21 B-ras kernel: #6 0xffffffff80ec5f8e at fork_trampoline+0xe
Jun 24 10:14:21 B-ras kernel: Accessing freed node: ID [20498d]: type 'tee', 0 hooks, flags 0x9, 1 refs, mpd1295-vlan340-44-lt:
Jun 24 10:14:21 B-ras kernel: Last active @ /usr/src/sys/netgraph/ng_base.c, line 3709
Jun 24 10:14:21 B-ras kernel: problem discovered at file /usr/src/sys/netgraph/ng_base.c, line 2455
Jun 24 10:14:21 B-ras kernel: KDB: stack backtrace:
Jun 24 10:14:21 B-ras kernel: #0 0xffffffff80b1af87 at kdb_backtrace+0x67
Jun 24 10:14:21 B-ras kernel: #1 0xffffffff80c42711 at ng_apply_item+0x261
Jun 24 10:14:21 B-ras kernel: #2 0xffffffff80c4217e at ng_snd_item+0x1ee
Jun 24 10:14:21 B-ras kernel: #3 0xffffffff80c3e9c4 at ng_destroy_hook+0x334
Jun 24 10:14:21 B-ras kernel: #4 0xffffffff80c4289b at ng_apply_item+0x3eb
Jun 24 10:14:21 B-ras kernel: #5 0xffffffff80c48475 at ngthread+0x315
Jun 24 10:14:21 B-ras kernel: #6 0xffffffff80a88c55 at fork_exit+0x85
Jun 24 10:14:21 B-ras kernel: #7 0xffffffff80ec5f8e at fork_trampoline+0xe
Jun 24 10:14:21 B-ras kernel: Accessing freed node: ID [20498d]: type 'tee', 0 hooks, flags 0x9, 1 refs, mpd1295-vlan340-44-lt:
Jun 24 10:14:21 B-ras kernel: Last active @ /usr/src/sys/netgraph/ng_base.c, line 2455
Jun 24 10:14:22 B-ras kernel: problem discovered at file /usr/src/sys/netgraph/ng_base.c, line 2335
Jun 24 10:14:22 B-ras kernel: KDB: stack backtrace:
Jun 24 10:14:22 B-ras kernel: #0 0xffffffff80b1af87 at kdb_backtrace+0x67
Jun 24 10:14:22 B-ras kernel: #1 0xffffffff80c42308 at ng_snd_item+0x378
Jun 24 10:14:22 B-ras kernel: #2 0xffffffff80c3e9c4 at ng_destroy_hook+0x334
Jun 24 10:14:22 B-ras kernel: #3 0xffffffff80c4289b at ng_apply_item+0x3eb
Jun 24 10:14:22 B-ras kernel: #4 0xffffffff80c48475 at ngthread+0x315
Jun 24 10:14:22 B-ras kernel: #5 0xffffffff80a88c55 at fork_exit+0x85
Jun 24 10:14:22 B-ras kernel: #6 0xffffffff80ec5f8e at fork_trampoline+0xe
Jun 24 10:14:22 B-ras kernel: Accessing freed node: ID [20498d]: type 'tee', 0 hooks, flags 0x9, 0 refs, mpd1295-vlan340-44-lt:
Jun 24 10:14:22 B-ras kernel: Last active @ /usr/src/sys/netgraph/ng_base.c, line 2335
Jun 24 10:14:22 B-ras kernel: problem discovered at file /usr/src/sys/netgraph/ng_base.c, line 804
Jun 24 10:14:22 B-ras kernel: KDB: stack backtrace:
Jun 24 10:14:22 B-ras kernel: #0 0xffffffff80b1af87 at kdb_backtrace+0x67
Jun 24 10:14:22 B-ras kernel: #1 0xffffffff80c3ee46 at ng_unref_node+0x146
Jun 24 10:14:22 B-ras kernel: #2 0xffffffff80c42326 at ng_snd_item+0x396
Jun 24 10:14:22 B-ras kernel: #3 0xffffffff80c3e9c4 at ng_destroy_hook+0x334
Jun 24 10:14:22 B-ras kernel: #4 0xffffffff80c4289b at ng_apply_item+0x3eb
Jun 24 10:14:22 B-ras kernel: #5 0xffffffff80c48475 at ngthread+0x315
Jun 24 10:14:22 B-ras kernel: #6 0xffffffff80a88c55 at fork_exit+0x85
Jun 24 10:14:22 B-ras kernel: #7 0xffffffff80ec5f8e at fork_trampoline+0xe

Let me know if you need something else.

Thanks.
Comment 84 Eugene Grosbein freebsd_committer 2017-06-28 04:31:00 UTC
(In reply to Cassiano Peixoto from comment #83)

It seems you hit another problem in the libc/stdio. Konstantin produced another patch for this problem: https://reviews.freebsd.org/file/data/nthhi3ogesbuhnjclgmu/PHID-FILE-ahlgnygvinulibyezovs/D11246.diff ( https://reviews.freebsd.org/D11246 )

I'm very sorry not notifying you about this patch before. I ran my tests with this patch applied too and have no hangs. Please apply it similarly and additionally to the "syslog" patch and restart mpd with web server enabled.
Comment 85 Cassiano Peixoto 2017-06-28 12:35:47 UTC
(In reply to Eugene Grosbein from comment #84)
Hi Eugene,

Don't worry, no problem.

I tried to apply Konstantin patch but i had some rejects:

# cat lib/libc/stdio/fgets.c.rej
@@ -53,17 +53,17 @@
 fgets(char * __restrict buf, int n, FILE * __restrict fp)
 {
 	size_t len;
-	char *s;
+	char *s, *ret;
 	unsigned char *p, *t;
 
-	FLOCKFILE(fp);
+	FLOCKFILE_CANCELSAFE(fp);
 	ORIENT(fp, -1);
 
 	if (n <= 0) {		/* sanity check */
 		fp->_flags |= __SERR;
 		errno = EINVAL;
-		FUNLOCKFILE(fp);
-		return (NULL);
+		ret = NULL;
+		goto end;
 	}
 
 	s = buf;
@@ -76,8 +76,8 @@
 			if (__srefill(fp)) {
 				/* EOF/error: stop with partial or no line */
 				if (!__sfeof(fp) || s == buf) {
-					FUNLOCKFILE(fp);
-					return (NULL);
+					ret = NULL;
+					goto end;
 				}
 				break;
 			}

# cat lib/libc/stdio/fgetwln.c.rej
@@ -45,13 +45,14 @@
 wchar_t *
 fgetwln_l(FILE * __restrict fp, size_t *lenp, locale_t locale)
 {
+	wchar_t *ret;
 	wint_t wc;
 	size_t len;
 	int savserr;
 
 	FIX_LOCALE(locale);
 
-	FLOCKFILE(fp);
+	FLOCKFILE_CANCELSAFE(fp);
 	ORIENT(fp, 1);
 
 	savserr = fp->_flags & __SERR;

# cat lib/libc/stdio/fgetws.c.rej
@@ -46,14 +46,14 @@
 fgetws_l(wchar_t * __restrict ws, int n, FILE * __restrict fp, locale_t locale)
 {
 	int sret;
-	wchar_t *wsp;
+	wchar_t *wsp, *ret;
 	size_t nconv;
 	const char *src;
 	unsigned char *nl;
 	FIX_LOCALE(locale);
 	struct xlocale_ctype *l = XLOCALE_CTYPE(locale);
 
-	FLOCKFILE(fp);
+	FLOCKFILE_CANCELSAFE(fp);
 	ORIENT(fp, 1);
 
 	if (n <= 0) {
@@ -113,12 +113,14 @@
 		goto error;
 ok:
 	*wsp = L'\0';
-	FUNLOCKFILE(fp);
+	ret = ws;
+end:
+	FUNLOCKFILE_CANCELSAFE();
 	return (ws);
 
 error:
-	FUNLOCKFILE(fp);
-	return (NULL);
+	ret = NULL;
+	goto end;
 }
 
 wchar_t *

Should i update to latest FreeBSD-11 STABLE? I'm using 11.0-p0.

Thanks
Comment 86 Eugene Grosbein freebsd_committer 2017-06-28 13:51:33 UTC
(In reply to Cassiano Peixoto from comment #85)

Yes, that would be easiest thing. All these patches apply to 11-STABLE just fine.
Comment 87 Eugene Grosbein freebsd_committer 2017-06-28 13:52:43 UTC
(In reply to Cassiano Peixoto from comment #85)

And there is no need to apply "syslog" patch to 11-STABLE as it has already been merged.
Comment 88 Konstantin Belousov freebsd_committer 2017-06-28 14:43:45 UTC
(In reply to Eugene Grosbein from comment #84)
I very much doubt that the rtld lock leak (?) can be caused by stdio bugs.

In other words, apply the debugging rtld patch I posted 06-16, and follow the accompanying instructions from there.
Comment 89 Cassiano Peixoto 2017-06-28 14:48:01 UTC
(In reply to Konstantin Belousov from comment #88)
Hi Konstantin,

rtld patch has been applied since them as i said on comment #67. Do you need some additional debug info from core file?
Comment 90 Cassiano Peixoto 2017-06-28 20:22:38 UTC
(In reply to Eugene Grosbein from comment #87)
Eugene,

Sorry my delay but i was updating to 11.1-BETA3 r320450M. 

On updated i've applied the following patches:

- libc/stdio from https://reviews.freebsd.org/D11246
- updated to mpd5.8_1 with web server enabled
- ipfw patch
- libthr patch (rtld)
- in_mcast patch
- ip_input patch
- ng_iface patch
- stf patch

Let me know if you need anything else.

Thanks
Comment 91 Konstantin Belousov freebsd_committer 2017-06-28 23:44:40 UTC
(In reply to Cassiano Peixoto from comment #89)
See comment #52 for instructions.
Comment 92 Cassiano Peixoto 2017-06-29 00:11:17 UTC
(In reply to Konstantin Belousov from comment #91)
Did you read everything on my comment #83? Everything you asked on comment #52 is there. Or am i missed something?
Comment 93 Eugene Grosbein freebsd_committer 2017-06-29 07:54:44 UTC
(In reply to Cassiano Peixoto from comment #90)

Make sure you have latest revision of https://reviews.freebsd.org/D11246 patch as Konstantin updated it several hours ago to include "fseeko" chunk and your last hang and traces show a reference to that "fseeko".
Comment 94 Cassiano Peixoto 2017-06-29 09:28:50 UTC
(In reply to Eugene Grosbein from comment #93)
Ok i reverted the patch and applied the new one, but i got an error when compiling libc:

cc  -O2 -pipe   -I/usr/src/lib/libc/include -I/usr/src/include -I/usr/src/lib/libc/amd64 -DNLS  -D__DBINTERFACE_PRIVATE -I/usr/src/contrib/gdtoa -I/usr/src/contrib/libc-vis -DINET6 -I/usr/obj/usr/src/lib/libc -I/usr/src/lib/libc/resolv -D_ACL_PRIVATE -DPOSIX_MISTAKE -I/usr/src/lib/libmd -I/usr/src/contrib/jemalloc/include -I/usr/src/contrib/tzcode/stdtime -I/usr/src/lib/libc/stdtime -I/usr/src/lib/libc/locale -DBROKEN_DES -DPORTMAP -DDES_BUILTIN -I/usr/src/lib/libc/rpc -DWANT_HYPERV -DYP -DNS_CACHING -DSYMBOL_VERSIONING -MD  -MF.depend.stdio.o -MTstdio.o -std=gnu99 -fstack-protector-strong -Wsystem-headers -Werror -Wall -Wno-format-y2k -Wno-uninitialized -Wno-pointer-sign -Wno-empty-body -Wno-string-plus-int -Wno-unused-const-variable -Wno-tautological-compare -Wno-unused-value -Wno-parentheses-equality -Wno-unused-function -Wno-enum-conversion -Wno-unused-local-typedef -Wno-address-of-packed-member -Wno-switch -Wno-switch-enum -Wno-knr-promoted-parameter  -Qunused-arguments  -I/usr/src/lib/libutil -I/usr/src/lib/msun/amd64 -I/usr/src/lib/msun/x86 -I/usr/src/lib/msun/src -c /usr/src/lib/libc/stdio/stdio.c -o stdio.o
/usr/src/lib/libc/stdio/stdio.c:179:1: error: redefinition of '__stdio_cancel_cleanup'
__stdio_cancel_cleanup(void * arg)
^
/usr/src/lib/libc/stdio/stdio.c:171:1: note: previous definition is here
__stdio_cancel_cleanup(void * arg)
^
1 error generated.
*** Error code 1

Stop.
make: stopped in /usr/src/lib/libc
Comment 95 Eugene Grosbein freebsd_committer 2017-06-29 10:57:02 UTC
(In reply to Cassiano Peixoto from comment #94)

It seems your source tree is broken now. Try to clean is up and apply all patches again:

cd /usr/src && svnlite cleanup --remove-unversioned && svnlite revert -R .
Comment 96 Cassiano Peixoto 2017-06-29 11:08:38 UTC
(In reply to Cassiano Peixoto from comment #94)
Let it go, something messed when applied the patch. I re-synced the src and applied fine. So, patched applied and mpd5 recompiled. Let's watch.

Thanks.
Comment 97 Konstantin Belousov freebsd_committer 2017-06-29 15:13:52 UTC
(In reply to Cassiano Peixoto from comment #92)
Indeed, everything I asked for, is in the comment #83.  Now I cannot make a sense from the reported line 4802 of rtld.c in the backtrace for the thread 100704.  There is no lock call neither in stable/11 nor in HEAD, at this line.  What exact version of the sources do you use, from which branch ?
Comment 98 Eugene Grosbein freebsd_committer 2017-06-29 16:56:13 UTC
(In reply to Konstantin Belousov from comment #97)

That's from releng/11.0:

https://svnweb.freebsd.org/base/releng/11.0/libexec/rtld-elf/rtld.c?annotate=304456#l4802
Comment 99 Cassiano Peixoto 2017-06-29 17:00:21 UTC
(In reply to Konstantin Belousov from comment #97)
When it happend i was using 11.0-release-p0. But yesterday i updated to 11.1-beta3.
Comment 100 Konstantin Belousov freebsd_committer 2017-06-29 18:36:43 UTC
(In reply to Cassiano Peixoto from comment #99)
Uh, ok.  So lets wait for the issue to re-appear on latest stable.  There were some fixes that might be relevant.
Comment 101 Cassiano Peixoto 2017-07-05 17:25:53 UTC
(In reply to Konstantin Belousov from comment #100)
Hi guys,

I had the issue today, but with a different behavior. It didn't show on CPU state as uwrlck anymore. The process was running as usual, but all customers connected stopped working.

I tried to check both mpd5 console and web but didn't answer.

I tried to killall -QUIT mpd5 process with no success, I try to kill -9 the pid process with no success as well. Looks like a zombie process.

At last i tried to reboot the server, and didn't work. I had to reset the server to boot again. :(
Comment 102 Eugene Grosbein freebsd_committer 2017-07-05 17:48:39 UTC
mav@ is not maintainer of net/mpd5 anymore.
Comment 103 Eugene Grosbein freebsd_committer 2017-07-05 18:01:25 UTC
(In reply to Cassiano Peixoto from comment #101)

In fact, a zombie process is not a process anymore. It does not prevent your from restarting new instance of mpd5. Did you look at ps output to check real state of mpd5 or tried to restart it?

Also, a process may become a zombie only if it did not detach properly at start as normal daemons do (and mpd5 does it) AND its parent does not properly reap its children. init(8) does it just fine for normal daemons including mpd5.

Do you run mpd5 in detached mode using -b flag? If so, I highly doubt it became a zombie. More likely, it could stuck in some uninterruptable system call but that's pretty unusual too.

If you run mpd5 in "foreground" mode using another supervising daemon like init(8) itself, that's another story.

And what did you have in the log of mpd5 corresponding that time?
You showed no diagnostics: ps -l output, mpd logs, kernel dmesg logs, something?

And system reboot cleans it all, so it seems you had some external problem just that took more time to disappear and second reboot just gave it that time.
Comment 104 Cassiano Peixoto 2017-07-05 19:09:01 UTC
Hi,

It wasn't zombie, it looks like zombie because the kind of behavior. See process top output below:

 PID USERNAME       THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
 1295 root            30  73    0   327M   147M CPU3  3  627:10   101.55% mpd5

Pay attention to WCPU, it's too high.

I never ran mpd5 in detached mode, but if you need it, i can.

Regarding logs, there is nothing relevant in mpd5 logs, but in kernel logs i have:

Jul  5 11:59:59 B-ras kernel: Accessing freed hook: name l4658, 0 refs, Last touched:
Jul  5 11:59:59 B-ras kernel: Last active @ /usr/src/sys/netgraph/ng_base.c, line 1196
Jul  5 11:59:59 B-ras kernel: problem discovered at file /usr/src/sys/netgraph/ng_base.c, line 1131
Jul  5 11:59:59 B-ras kernel: KDB: stack backtrace:
Jul  5 11:59:59 B-ras kernel: #0 0xffffffff80aa4ad7 at kdb_backtrace+0x67
Jul  5 11:59:59 B-ras kernel: #1 0xffffffff80bb2acc at ng_findhook+0xac
Jul  5 11:59:59 B-ras kernel: #2 0xffffffff80bb96bc at ng_con_nodes+0x2c
Jul  5 11:59:59 B-ras kernel: #3 0xffffffff80bb4e70 at ng_apply_item+0xa90
Jul  5 11:59:59 B-ras kernel: #4 0xffffffff80bb413b at ng_snd_item+0x1db
Jul  5 11:59:59 B-ras kernel: #5 0xffffffff82451469 at ngc_send+0x209
Jul  5 11:59:59 B-ras kernel: #6 0xffffffff80aea8ec at sosend_generic+0x4ec
Jul  5 11:59:59 B-ras kernel: #7 0xffffffff80af11c1 at kern_sendit+0x291
Jul  5 11:59:59 B-ras kernel: #8 0xffffffff80af1523 at sendit+0x1a3
Jul  5 11:59:59 B-ras kernel: #9 0xffffffff80af136d at sys_sendto+0x4d
Jul  5 11:59:59 B-ras kernel: #10 0xffffffff80e36394 at amd64_syscall+0x6c4
Jul  5 12:00:00 B-ras kernel: #11 0xffffffff80e194eb at Xfast_syscall+0xfb

Remember, i was using web status to collect info, i'll disable it, maybe it could be the problem, don't you think?
Comment 105 Eugene Grosbein freebsd_committer 2017-07-05 20:22:23 UTC
(In reply to Cassiano Peixoto from comment #104)

I do not think web console has any connection to this new problem, keep it on.
How do you start mpd5 - do you use its standard startup (rc.d) script and or not and what does "ps -ax | grep mpd" show? I mean its arguments.

If whis problem repeats, use "ps -axH" to see all threads of mpd5 and their states and top -H for the same purpose.

As for "Accessing freed hook": it seems my patch for ng_iface from the PR 220076 is not complete. It will take some time for me to enhance it and test before I update it there.
Comment 106 Cassiano Peixoto 2017-07-05 20:46:47 UTC
(In reply to Eugene Grosbein from comment #105)
Ok, i kept web console on.

I'm starting mpd5 using rc.d script:

# ps -ax | grep mpd
 1278  -  Ss     25:03.78 /usr/local/sbin/mpd5 -p /var/run/mpd5.pid -b

I'll follow your instructions as soon as it repeats.

Please let me know when patch 220076 for PR has been updated.

Thanks.
Comment 107 Eugene Grosbein freebsd_committer 2017-07-06 11:20:35 UTC
(In reply to Cassiano Peixoto from comment #106)

I've just updated the patch for ng_iface(4). Apply it to kernel sources instead of its previous version, rebuild, install and boot new kernel.
Comment 108 Donald Baud 2017-07-07 19:52:58 UTC
(In reply to Cassiano Peixoto from comment #104)

The logs show that the panic happened at "Jul  5 12:00:00" 
Is it possible that an action in your cron script could have triggered it?
Comment 109 Eugene Grosbein freebsd_committer 2017-07-07 20:01:54 UTC
(In reply to Donald Baud from comment #108)

It does not really matter what triggered the panic as its reason is known: ng_iface lacks protection of its private data from modifications while it is being used. This is hopefully fixed with latest update of patch in the PR 220076.
Comment 110 Cassiano Peixoto 2017-07-07 20:06:16 UTC
(In reply to Eugene Grosbein from comment #109)
Hi, 

Should i update with this last patch (PR 220076) and rebuild the kernel?
Comment 111 Eugene Grosbein freebsd_committer 2017-07-07 20:13:19 UTC
(In reply to Cassiano Peixoto from comment #110)

Yes, just like I've asked in the comment 107.
Comment 112 Cassiano Peixoto 2017-07-07 20:46:16 UTC
(In reply to Eugene Grosbein from comment #111)
Ok, just applied and rebooted. I'll keep you posted.
Comment 113 Cassiano Peixoto 2017-07-10 14:19:45 UTC
(In reply to Cassiano Peixoto from comment #112)
Eugene, patch has been applied but i still can see the kernel error messages:

Jul 10 01:22:30 B-ras kernel: KDB: stack backtrace:
Jul 10 01:22:30 B-ras kernel: #0 0xffffffff80aa4ad7 at kdb_backtrace+0x67
Jul 10 01:22:30 B-ras kernel: #1 0xffffffff80bb2acc at ng_findhook+0xac
Jul 10 01:22:30 B-ras kernel: #2 0xffffffff80bb96bc at ng_con_nodes+0x2c
Jul 10 01:22:30 B-ras kernel: #3 0xffffffff80bb4e70 at ng_apply_item+0xa90
Jul 10 01:22:31 B-ras kernel: #4 0xffffffff80bb413b at ng_snd_item+0x1db
Jul 10 01:22:31 B-ras kernel: #5 0xffffffff82451469 at ngc_send+0x209
Jul 10 01:22:31 B-ras kernel: #6 0xffffffff80aea8ec at sosend_generic+0x4ec
Jul 10 01:22:31 B-ras kernel: #7 0xffffffff80af11c1 at kern_sendit+0x291
Jul 10 01:22:31 B-ras kernel: #8 0xffffffff80af1523 at sendit+0x1a3
Jul 10 01:22:31 B-ras kernel: #9 0xffffffff80af136d at sys_sendto+0x4d
Jul 10 01:22:31 B-ras kernel: #10 0xffffffff80e36394 at amd64_syscall+0x6c4
Jul 10 01:22:31 B-ras kernel: #11 0xffffffff80e194eb at Xfast_syscall+0xfb
Jul 10 01:22:31 B-ras kernel: Accessing freed hook: name right, 0 refs, Last touched:
Jul 10 01:22:31 B-ras kernel: Last active @ /usr/src/sys/netgraph/ng_base.c, line 1228
Jul 10 01:22:31 B-ras kernel: problem discovered at file /usr/src/sys/netgraph/ng_base.c, line 1131

Since then the process didn't freeze yet, but i can see that messages anyway.
Comment 114 Cassiano Peixoto 2017-07-10 14:32:41 UTC
(In reply to Cassiano Peixoto from comment #113)
Just to add more info about comment #112. I got some relevant messages in mpd5.log:

Jul 10 01:22:32 B-ras mpd: Link: Packet from unexisting bundle "288"
Jul 10 01:22:32 B-ras mpd: Link: Packet from unexisting bundle "116"
Jul 10 01:22:32 B-ras mpd: Link: Packet from unexisting bundle "334"
Jul 10 01:22:32 B-ras mpd: Link: Packet from unexisting bundle "525"
Jul 10 01:22:32 B-ras mpd: Link: Packet from unexisting bundle "681"
Jul 10 01:22:32 B-ras mpd: Link: Packet from unexisting bundle "271"
Jul 10 01:22:32 B-ras mpd: Link: Packet from unexisting bundle "85"
Jul 10 01:22:32 B-ras mpd: Link: Packet from unexisting bundle "785"
Jul 10 01:22:32 B-ras mpd: Link: Packet from unexisting bundle "289"
Jul 10 01:22:32 B-ras mpd: Link: Packet from unexisting bundle "386"
Jul 10 01:22:32 B-ras mpd: Link: Packet from unexisting bundle "530"
Jul 10 01:22:32 B-ras mpd: Link: Packet from unexisting bundle "381"
Jul 10 01:22:32 B-ras mpd: Link: Packet from unexisting bundle "805"
Jul 10 01:22:32 B-ras mpd: Link: Packet from unexisting bundle "152"
Jul 10 01:22:32 B-ras mpd: Link: Packet from unexisting bundle "186"
Jul 10 01:22:32 B-ras mpd: Link: Packet from unexisting bundle "566"
Jul 10 01:22:32 B-ras mpd: Link: Packet from unexisting bundle "492"
Jul 10 01:22:32 B-ras mpd: Link: Packet from unexisting bundle "685"
Jul 10 01:22:32 B-ras mpd: Link: Packet from unexisting bundle "83"
Jul 10 01:22:32 B-ras mpd: Link: Packet from unexisting bundle "937"
Jul 10 01:22:32 B-ras mpd: Link: Packet from unexisting bundle "396"
Jul 10 01:22:32 B-ras mpd: Link: Packet from unexisting bundle "499"

Logs has been flooded with these messages.
Comment 115 Eugene Grosbein freebsd_committer 2017-07-10 15:45:21 UTC
(In reply to Cassiano Peixoto from comment #113)

"Accessing freed hook: name right" is another case and has no connection with mpd5 process hanging. And that comes from ng_tee(4), not ng_iface(4).

I'm not sure if this message indicate real problem: accessing freed hook should not happen but NETGRAPH has some protection measures against that and they seem to do their job just fine. You should ignore "Accessing freed hook: name right" warning for the time being unless some bad thing happens like kernel panic.

These warning will go away when you return non-debugging kernel back (e.g. a kernel built without NETGRAPH_DEBUG option).
Comment 116 Eugene Grosbein freebsd_committer 2017-07-10 15:59:51 UTC
(In reply to Cassiano Peixoto from comment #114)

I guess, this message is harmless and comes from the fact that mpd5 may destroy ngXXX interface and corresponding bundle before some last packets from an user get delivered by the kernel while it is being disconnected. What protocol do you use there (PPPoE, PPtP etc.)?

Do you have "log +bund +bund2" in your mpd.conf? If not, you may use mpd console to run this command enabling bundle level debug logs. After that wait for new "Link: Packet from unexisting bundle" message and filter the log for mentioned bundle number and post results.
Comment 117 Cassiano Peixoto 2017-07-10 17:05:50 UTC
(In reply to Eugene Grosbein from comment #116)
I'm using for PPPoE only. I'll enable the logs and will let you know. But anyway for now no freezes happened.
Comment 118 Eugene Grosbein freebsd_committer 2017-07-17 11:46:35 UTC
(In reply to Konstantin Belousov from comment #63)

I see you already MFC'd "cancel-safe" patches to stable/11. Do you have any plans to MFC them to stable/10, so upcoming 10.4-RELEASE be fixed too?
Comment 119 Konstantin Belousov freebsd_committer 2017-07-17 12:23:21 UTC
(In reply to Eugene Grosbein from comment #118)
I already mentioned that the r320472 change only matters for rare case of user-IO streams, created e.g. by funopen(3).  As such, I do not think that this change matters for mpd5.

I do not want to merge this stuff to stable/10 at the late stage of the branch lifecycle.
Comment 120 Eugene Grosbein freebsd_committer 2017-07-17 12:36:46 UTC
(In reply to Konstantin Belousov from comment #119)

Well, comment #83 refers to some problem in "fseeko" that seems to be fixed by your "stdio cancel-safe" patch among other things as this patch eliminated hangs for Cassiano Peixoto. That's sad 10.4 will still be unfixed.
Comment 121 Konstantin Belousov freebsd_committer 2017-07-17 13:03:33 UTC
(In reply to Eugene Grosbein from comment #120)
Comment about fseeko was that the function calls user method, when the operating stream is from funopen(3).
Comment 122 Eugene Grosbein freebsd_committer 2017-07-17 13:11:57 UTC
(In reply to Konstantin Belousov from comment #121)

mpd5 uses libpdel that uses funopen() extensively. For example, it uses funopen()  while loading HTTP headers from a stream (mpd5 web console feature).
Comment 123 Cassiano Peixoto 2017-07-21 12:55:27 UTC
(In reply to Eugene Grosbein from comment #116)
Hi Eugene,

The server has been running for 14 days without freezing. I think the issue has been fixed :)

How would you like to proceed? Are you going to commit the others patches?

Congratulations, after 2 years and 6 months, someone at last fixed it.

Thank you! :)
Comment 124 Eugene Grosbein freebsd_committer 2017-07-21 13:17:24 UTC
The problem of mpd5 daemon hanging is now fixed with following changes:

1. libc/syslog "cancel-safe" fix merged to stable/11 and stable/10 (to appear in upcoming 10.4-RELEASE and 11.1-RELEASE).

2. Multiple libc/stdio "cancel-safe" fixes merged to stable/11 and stable/10 (to appear in upcoming 10.4-RELEASE and 11.1-RELEASE).

3. Mpd5 "cancel-safe" console management fix comitted to upstream code and present in the net/mpd5 port version mpd5-5.8_1 and newer.

Other problems concerning general kernel stability issues will be carried with distinct PRs linked to this one.

Big thanks to kib, dchagin, Cassiano Peixoto and others involved in reporting, analyzing, debugging and fixing this.
Comment 125 Cassiano Peixoto 2017-07-27 19:29:43 UTC
(In reply to Eugene Grosbein from comment #124)
Hi Eugene,

Just one question, how about libthread patch? Is it has been merged as well?

Take a look on comment #53:

https://bugs.freebsd.org/bugzilla/attachment.cgi?id=183536&action=edit
Comment 126 Eugene Grosbein freebsd_committer 2017-07-28 02:10:24 UTC
(In reply to Cassiano Peixoto from comment #125)

It is not needed to fix things. It was produced to discover roots of the problem only.
Comment 127 Harald Schmalzbauer 2017-07-28 15:00:15 UTC
(In reply to Eugene Grosbein from comment #124)

Big thanks to all who stood strong and solved that!

And thanks to the conclusion by Eugene, which I partly quote here:

<quote>
The problem of mpd5 daemon hanging is now fixed with following changes:

1. libc/syslog "cancel-safe" fix merged to stable/11 and stable/10 (to appear in upcoming 10.4-RELEASE and 11.1-RELEASE).

2. Multiple libc/stdio "cancel-safe" fixes merged to stable/11 and stable/10 (to appear in upcoming 10.4-RELEASE and 11.1-RELEASE).
</quote>

I guess 1) was r320472
I guess 2) were r320508 and r320509

These three were merged to stable/11 in r320942 at 2017-07-13
and stable/10 in r321074 at 017-07-17
So the fixes have NOT made it into 11.1-RELEASE!
But 10.4 should have them, since I can't see releng/10.4 yet.

Just for the records, in case one want's to know what to expect from 11.1.

-harry
Comment 128 Eugene Grosbein freebsd_committer 2017-08-02 07:11:55 UTC
(In reply to Harald Schmalzbauer from comment #127)

Additionally, there were specific merges to release branch releng/11.1 before release was tagged. So, 11.1-RELEASE does have those fixes.
Comment 129 Harald Schmalzbauer 2017-08-02 07:57:45 UTC
(In reply to Eugene Grosbein from comment #128)

Could you please tell us the revision? Hope I really missed something and you are right, but if I e.g. watch the log of lib/libc/stdio/fclose.c, it wasn't touched since it has been copied from base/stable/11@r320475 to releng/11.

kib@ did a MFS (r320763), but either I got all completely wrong and the referenced commits in my last post are unrelated&wrong, or they aren't in 11.1 (which would explain why my local MFS doesn't fail).

Btw. I locally MFS also your r320776 to 11.1.0 (r310888 MFC, usr.sbin/syslogd/syslogd.c: Retry to open an F_PIPE process when it dies unexpectedly).

Please, let us document the 11.1-RELEASE status unambiguously here, I know how tedious it is to waste time chasing contradictory facts :-) I'm happy if I'm proven wrong, but finally here's the information people can rely on.

-harry
Comment 130 Eugene Grosbein freebsd_committer 2017-08-02 12:36:53 UTC
(In reply to Harald Schmalzbauer from comment #129)

11.1 release branch releng/11.1 was created with r320486 from stable/11@r320475 https://svnweb.freebsd.org/base?view=revision&revision=320486

11.1-RELEASE was tagged with r321354, July 21: https://svnweb.freebsd.org/base?view=revision&revision=321354

Biggest problem was one with syslog() and libc/syslog "cancel-safe" fix was merged to stable/11 with r320311 before creation of releng/11.1: https://svnweb.freebsd.org/base?view=revision&revision=320311 - this made mpd5 pretty stable with an exception of very rare case that were dealt with stdio fixes later.

There were several "cancel-safe" commits to head's libc/stdio that stem from https://reviews.freebsd.org/D11246 - r320472 (main one) plus some fixes to that fix: r320508 and r320509. They all were merged to stable/11 with r320942, July 13: https://svnweb.freebsd.org/base?view=revision&revision=320942 before creation of the release. I ran stable/11 that time and believed that future release would have it. But you are right, it hasn't.

One should manually merge r320942 to the release and rebuild libc, or switch to stable/11, or just avoid very frequent access to mpd5 web console (disable it altogether to make sure).
Comment 131 Eugene Grosbein freebsd_committer 2017-08-02 12:38:34 UTC
(In reply to Harald Schmalzbauer from comment #129)

And syslogd(8) problem has no connection with this mpd5 problem.