I have BRAS on FreeBSD. It was 9.2 STABLE. I tried to update it up to 10.0 RELEASE, later tried to STABLE. On both variants I have the same problem. Some time after start, around 5 minutes, it works normally. But after 100-150 users have connected trough PPPoE (MPD5.7) MPD process stops in state umtxn. Of course, no one can connect after that. But who have already connected keeping work. last pid: 17712; load averages: 1.16, 0.65, 0.27 up 0+00:01:51 05:28:23 50 processes: 1 running, 49 sleeping CPU: 0.0% user, 0.0% nice, 1.0% system, 0.9% interrupt, 98.1% idle Mem: 1162M Active, 56M Inact, 400M Wired, 145M Buf, 2274M Free Swap: 4096M Total, 4096M Free PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND 2535 root 1 20 0 201M 184M select 3 1:14 10.69% zebra 2476 _pflogd 1 20 0 14600K 2200K bpf 0 0:12 0.00% pflogd 2541 root 1 20 0 224M 206M select 2 0:07 0.00% bgpd 9803 root 1 20 0 78624K 44092K select 2 0:02 0.00% bsnmpd 3462 root 3 20 0 56736K 9164K umtxn 0 0:01 0.00% mpd5 7243 mysql 17 32 0 6958M 636M uwait 1 0:01 0.00% mysqld 6095 bind 7 20 0 129M 76864K kqread 1 0:01 0.00% named 3872 root 1 20 0 61124K 6808K select 1 0:00 0.00% nmbd 8644 root 3 20 0 47332K 6216K select 1 0:00 0.00% utm5_rfw procstat -k 3462 PID TID COMM TDNAME KSTACK 3462 100113 mpd5 - mi_switch sleepq_catch_signals sleepq_wait_sig _sleep umtxq_sleep do_lock_umutex __umtx_op_wait_umutex amd64_syscall Xfast_syscall 3462 100115 mpd5 - mi_switch sleepq_catch_signals sleepq_wait_sig _cv_wait_sig seltdwait sys_poll amd64_syscall Xfast_syscall 3462 100512 mpd5 - mi_switch sleepq_catch_signals sleepq_wait_sig _sleep umtxq_sleep do_lock_umutex __umtx_op_wait_umutex amd64_syscall Xfast_syscall /var/log/mpd.log Jan 26 05:28:13 gw01 mpd: [B_ppp-46] IPCP: Up event Jan 26 05:28:13 gw01 mpd: [B_ppp-46] IPCP: state change Starting --> Req-Sent Jan 26 05:28:13 gw01 mpd: [B_ppp-46] IPCP: SendConfigReq #1 Jan 26 05:28:13 gw01 mpd: [B_ppp-46] IPADDR 10.10.0.1 Jan 26 05:28:13 gw01 mpd: [B_ppp-46] IPV6CP: Up event Jan 26 05:28:13 gw01 mpd: [B_ppp-46] IPV6CP: state change Starting --> Req-Sent Jan 26 05:28:13 gw01 mpd: [B_ppp-46] IPV6CP: SendConfigReq #1 Jan 26 05:28:13 gw01 mpd: [vlan6-107] LCP: rec'd Terminate Request #240 (Opened) Jan 26 05:28:13 gw01 mpd: [vlan6-107] LCP: state change Opened --> Stopping Jan 26 05:28:13 gw01 mpd: [vlan6-107] Link: Leave bundle "B_ppp-46" It always stops with the same 3 last strings. Jan 26 05:52:38 gw01 kernel: sonewconn: pcb 0xfffff80007757c40: Listen queue overflow: 4 already in queue awaiting acceptance Jan 26 05:53:09 gw01 last message repeated 60 times Jan 26 05:53:34 gw01 last message repeated 51 times Kernel conf: GENERIC + device ipmi device coretemp device smbus device lagg device netmap options IPI_PREEMPTION options IPFIREWALL options IPFIREWALL_VERBOSE options IPDIVERT options DUMMYNET options IPFIREWALL_NAT options LIBALIAS device pf device pflog device pfsync options ALTQ options ALTQ_CBQ # Class Bases Queuing (CBQ) options ALTQ_RED # Random Early Detection (RED) options ALTQ_RIO # RED In/Out options ALTQ_HFSC # Hierarchical Packet Scheduler (HFSC) options ALTQ_PRIQ # Priority Queuing (PRIQ) options ALTQ_NOPCC # Required for SMP build options NETGRAPH options NETGRAPH_BPF options NETGRAPH_CAR options NETGRAPH_ETHER options NETGRAPH_IPFW options NETGRAPH_IFACE options NETGRAPH_KSOCKET options NETGRAPH_PPP options NETGRAPH_PPTPGRE options NETGRAPH_PPPOE options NETGRAPH_SOCKET options NETGRAPH_TCPMSS options NETGRAPH_TEE options NETGRAPH_VJC options NETGRAPH_MPPC_ENCRYPTION options NETGRAPH_NETFLOW CPU: Intel(R) Xeon(R) CPU X3470 @ 2.93GHz (2933.36-MHz K8-class CPU) Origin = "GenuineIntel" Id = 0x106e5 Family = 0x6 Model = 0x1e Stepping = 5 Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE> Features2=0x98e3fd<SSE3,DTES64,MON,DS_CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,SSE4.1,SSE4.2,POPCNT> AMD Features=0x28100800<SYSCALL,NX,RDTSCP,LM> AMD Features2=0x1<LAHF> TSC: P-state invariant, performance statistics real memory = 4294967296 (4096 MB) avail memory = 4052344832 (3864 MB) Event timer "LAPIC" quality 400 ACPI APIC Table: <INTEL S3420GPC> FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs FreeBSD/SMP: 1 package(s) x 4 core(s) cpu0 (BSP): APIC ID: 0 cpu1 (AP): APIC ID: 2 cpu2 (AP): APIC ID: 4 cpu3 (AP): APIC ID: 6 I tried to get ktrace dump. But I could not open it. ktrdump: kvm_nlist: No such file or directory I think, It is something wrong with netgraph system. How-To-Repeat: Update to FreeBSD 10.0 and try to connect 100-150 users.
In additional: - this problem only with pppoe part. With pptp I and other peope on forum have not any problem. - here is gdb report from frozen mpd5: gdb mpd5 3645 GNU gdb 6.1.1 [FreeBSD] Copyright 2004 Free Software Foundation, Inc. GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions. Type "show copying" to see the conditions. There is absolutely no warranty for GDB. Type "show warranty" for details. This GDB was configured as "amd64-marcel-freebsd"...(no debugging symbols found)... Attaching to program: /usr/local/sbin/mpd5, process 3645 Reading symbols from /usr/lib/libwrap.so.6...(no debugging symbols found)...done. Loaded symbols for /usr/lib/libwrap.so.6 Reading symbols from /usr/lib/libpam.so.5...(no debugging symbols found)...done. Loaded symbols for /usr/lib/libpam.so.5 Reading symbols from /lib/libcrypt.so.5...(no debugging symbols found)...done. Loaded symbols for /lib/libcrypt.so.5 Reading symbols from /usr/lib/libnetgraph.so.4...(no debugging symbols found)...done. Loaded symbols for /usr/lib/libnetgraph.so.4 Reading symbols from /lib/libutil.so.9...(no debugging symbols found)...done. Loaded symbols for /lib/libutil.so.9 Reading symbols from /usr/lib/libradius.so.4...(no debugging symbols found)...done. Loaded symbols for /usr/lib/libradius.so.4 Reading symbols from /usr/lib/libssl.so.7...(no debugging symbols found)...done. Loaded symbols for /usr/lib/libssl.so.7 Reading symbols from /lib/libpcap.so.8...(no debugging symbols found)...done. Loaded symbols for /lib/libpcap.so.8 Reading symbols from /usr/lib/libfetch.so.6...(no debugging symbols found)...done. Loaded symbols for /usr/lib/libfetch.so.6 Reading symbols from /lib/libcrypto.so.7...(no debugging symbols found)...done. Loaded symbols for /lib/libcrypto.so.7 Reading symbols from /lib/libthr.so.3...(no debugging symbols found)...done. [New Thread 80340ec00 (LWP 100390/mpd5)] [New Thread 803020400 (LWP 100120/mpd5)] [New Thread 802c06800 (LWP 100119/mpd5)] Loaded symbols for /lib/libthr.so.3 Reading symbols from /lib/libc.so.7...(no debugging symbols found)...done. Loaded symbols for /lib/libc.so.7 Reading symbols from /libexec/ld-elf.so.1...(no debugging symbols found)...done. Loaded symbols for /libexec/ld-elf.so.1 [Switching to Thread 80340ec00 (LWP 100390/mpd5)] 0x0000000801fc089a in __error () from /lib/libthr.so.3 (gdb) bt #0 0x0000000801fc089a in __error () from /lib/libthr.so.3 #1 0x0000000801fbb79d in pthread_mutex_destroy () from /lib/libthr.so.3 #2 0x00000008022dea9a in vsyslog () from /lib/libc.so.7 #3 0x00000000004487c7 in ?? () #4 0x00000000004486e2 in ?? () #5 0x000000000045123c in ?? () #6 0x0000000000426534 in ?? () #7 0x000000000045d1b6 in ?? () #8 0x0000000801fb54a4 in pthread_create () from /lib/libthr.so.3 #9 0x00007fffff5fc000 in ?? () Error accessing memory address 0x7fffff7fc000: Bad address. (gdb) where #0 0x0000000801fc089a in __error () from /lib/libthr.so.3 #1 0x0000000801fbb79d in pthread_mutex_destroy () from /lib/libthr.so.3 #2 0x00000008022dea9a in vsyslog () from /lib/libc.so.7 #3 0x00000000004487c7 in ?? () #4 0x00000000004486e2 in ?? () #5 0x000000000045123c in ?? () #6 0x0000000000426534 in ?? () #7 0x000000000045d1b6 in ?? () #8 0x0000000801fb54a4 in pthread_create () from /lib/libthr.so.3 #9 0x00007fffff5fc000 in ?? () Error accessing memory address 0x7fffff7fc000: Bad address. (gdb) quit
One more interesting thing. Problem appears usually when MPD5 starts with booting of system. If I comment mpd_enable in rc.conf and start MPD5 manually after 10-20 minutes - it works normally.
Make this a ports PR and assign.
I can trigger similar and probably related bug in the mpd by just disconnecting upstream link for a few minutes. Clients continue to make queries, mpd completely hangs in the "select" state and can only be killed by -9 signal. This regression appeared somewhere in 8.x series. Killing mpd together with zombie ngX interfaces makes it possible to recover normal operation without reboot. Nothing interesting found in the all logs.
Got the same problem in FreeBSD 10.0. The symptoms are exactly the same as hawk256 wrote above
Site to site vpn, hang too: 821 root 1 20 0 55112K 7504K umtxn 1 0:12 0.00% mpd5 FreeBSD base.tim 10.2-STABLE FreeBSD 10.2-STABLE #1 r286839: Mon Aug 17 12:31:21 EEST 2015 slava@base.tim:/usr/obj/usr/src/sys/route amd64 mpd5 -v Version 5.8a 23-Jul-2015
(In reply to YS from comment #6) Hi, I've the same issue here using FreeBSD 10.2 with mpd 5.7 pppoe server. With about 200 users it just freezes. I have to kill -9 mpd process.
@Cassiano, mpd5 was recently updated to 5.8, can you update to that version and confirm whether the hanging symptoms are still reproducible please. If you could also provide the following that would be great: * uname -a output * kernel configuration (as attachment) if *not* GENERIC * /var/run/dmesg.boot (as attachment) * pciconf -lv output (as attachment) We'll do our best to progress the issue.
Created attachment 165278 [details] Kernel Config
Created attachment 165279 [details] dmesg
Created attachment 165280 [details] pciconf
Hi, Kubilay. Thanks for you reply. I've been using mpd 5.8 with same issue. Seems it has not fixed. Here is my uname -a output: FreeBSD pppoe.us 10.2-STABLE FreeBSD 10.2-STABLE #0 r292035M: Tue Dec 15 10:45:41 BRST 2015 root@pppoe.us:/usr/obj/usr/src/sys/PPPOE amd64
Created attachment 165626 [details] mpd.conf for my site-site vpn freebsd 9.3 - without problem freebsd 10.2 - with problem after some time
This PR should not be classified as Ports PR because problem lies in the kernel NETGRAPH subsystem and not mpd5 itself. Same version of mpd5 works just fine with 8.4-STABLE. This is Base System problem.
Yeah, got mpd5 working with no problem on FreeBSD 9.x (mpd 5.7, 5.8).
Disabling using syslog (3) temporary resolve this problem. I think, problem between pthreads and syslog.
(In reply to Eugene Grosbein from comment #14) The initial report is fine to be classified against the port. Once the root cause(s) is confirmed, implicated or identified, a dependent (Depends On) sub task can be created for the base change.
Created attachment 175989 [details] crash-bt
Comment on attachment 175989 [details] crash-bt I'm having random crashes. BT attached.
(In reply to Cassiano Peixoto from comment #19) I've performed heavy stress-test of mpd-5.8 under FreeBSD 11.0-STABLE without usage of RADIUS and found no problems. Can you try 11.0-BETA1?
Hi Eugene, I've been using 11.0-p1 for many months. I've about 1k users connected, and after some days (or weeks) i run into on the following issue: PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND 55180 root 2 52 0 338M 85204K uwrlck 2 2:28 0.00% mpd5 The process stuck on uwrlck state. Do you have any idea? My last try was set a cpu affinity on process. After this it's running 9 days and counting. But i gonna need more days to make sure it really worked. Anyway, how have you made the test? how many users connected? Thanks.
(In reply to Cassiano Peixoto from comment #21) I repeatedly created/destroyed hundreds of PPPoE sessions for long time. But my test was incomplete in a sense, as it does not involve RADIUS, so mpd was running in single-thread mode and lockups are believed due to multi-threading issues. I'm going to re-do my tests. Also, please take a look at this and try to test patches: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=214482#c33
Hi Eugene, You can count me in to help with any test. I think just creating/destroying PPPoE sessions will not make the issues happen. Another think i forgot to tell, i disabled syslog in mpd5 Makefile: # Set syslog logging facility. Change LOG_DAEMON to whatever you like. # Comment this line disable syslog (3) support #SYSLOG_FACILITY= LOG_DAEMON But even with syslog disabled i ran into with uwrlck issue. Then my last try was to set cpu affinity as i told before. I read the link you sent. Looks like an issue with vsyslog(). So my questions is: even with syslog facility disabled on mpd should i expected this kind of behaviour (stuck)? Thanks.
(In reply to Cassiano Peixoto from comment #23) I've passed some ftp traffic over PPPoE sessions too. Please enable SYSLOG_FACILITY back while testing. As for uwrlck problem, it seems this one is similar but distinct problem. It may happen if you repeatedly use mpd console or http server, for example. Please describe how do you use them.
Ok, i'll apply your patches and back with syslog facility. Do i need to rebuild the world (userland)? Regarding uwrlck I am using mpd5 web to collect some connections statistics each 60 minutes (it's a cron job). I'm using the following quite simple shell script: OUTPUT="/usr/local/etc/tmp/bras.txt" TMP="/tmp/sessions.tmp" TMP2="/tmp/veloc.tmp" TMP3="/tmp/result.tmp" if [ -e $TMP ]; then rm $TMP fi if [ -e $TMP2 ]; then rm $TMP2 fi if [ -e $TMP3 ]; then rm $TMP3 fi SESSIONS=`/usr/local/bin/curl -k -l --user admin:xxx http://127.0.0.1:5006/bincmd\?show%20sessions 2>&1 > $TMP` while read session; do if [ "`echo $session |grep -v 'RESULT'`" != "" ]; then #echo $session INTERFACE=`echo $session |awk '{print $1}'` IP=`echo $session |awk '{print $2}'` USER=`echo $session |awk '{print $8}'` MAC=`echo $session |awk '{print $9}'` VLAN=`echo $session |awk '{print $3}' | awk -F '-' '{print $1}'` DESC=`ifconfig ${VLAN} |grep description |awk '{print $2}'` VELOC=`/usr/local/bin/curl -k -l --user admin:xxx http://127.0.0.1:5006/bincmd\?iface%20${INTERFACE}\&show%20customer 2>&1 | egrep 'in#|out#|seconds' > $TMP2` TRAF=`/usr/bin/netstat -nbf link -I ${INTERFACE}` TRAF=`echo $TRAF| awk '{print $19" "$16" "$22" "$20}' | tr ' ' ';'` BLOCK=`ipfw table 1 list |grep ${IP}` if [ "$BLOCK" = "" ]; then BLOCK="0" else BLOCK="1" fi BLOCK2=`ipfw table 2 list |grep ${IP}` if [ "$BLOCK2" = "" ]; then BLOCK2="0" else BLOCK="1" fi BLOCK2=`ipfw table 2 list |grep ${IP}` if [ "$BLOCK2" = "" ]; then BLOCK2="0" else BLOCK2="1" fi while read veloc; do if [ "`echo $veloc | grep in`" != "" ]; then BTI=$(echo $veloc |awk '{print $5}') fi if [ "`echo $veloc | grep out`" != "" ]; then BTO=$(echo $veloc |awk '{print $5}') fi if [ "`echo $veloc | grep 'seconds'`" != "" ]; then TIME=$(echo $veloc|awk '{print $4}') fi done < $TMP2 echo "${USER};${IP};${TIME};${VLAN};${DESC};${BTI};${BTO};${INTERFACE};${TRAF};${BLOCK};${BLOCK2};${MAC}" >> $TMP3 fi done < $TMP cp $TMP3 $OUTPUT
(In reply to Cassiano Peixoto from comment #25) You need not rebuild world if you choose to test the patch for mpd itself. You need to rebuild and reinstall libc only if you choose to test the patch for syslog. Please apply only one of patches at the same time, not both. As for web server, I've looked into the code and I'm afraid it would not be so easy to patch it. For now, I suggest to disable your statistics collection during the test period or switch to another method not involving web. For example, you could use "set iface description" conversion specifications: http://mpd.sourceforge.net/doc5/mpd28.html#28 Like this: set iface description "%I: Login %U, IP %A, MAC %M, Link %l" Then use /sbin/ifconfig to get data. Otherwise, you may try to use mpd's CLI console instead of web console. I've just updated that patch for mpd to include workaround for console lock possible issues, so make sure you use latest patch version.
Ok let's begin with mpd patch (without libc patch). Can i enable with syslog facility after patch applied? Do you think would be possible to provide a web patch as well? Because it's easier to get data output from there. Thanks.
(In reply to Cassiano Peixoto from comment #27) You should enable syslog facility. As for web, it is more complex and may require other developers assistance. With time it definitly can be solved but first we need to ensure that root of problems is correctly determined, so testing is needed first. Also, you'll better compile mpd with debugging symbols enabled: make WITH_DEBUG=yes clean all deinstall install And if hang occurs again, use "killall -QUIT mpd" to generate crashdump to analyze.
Ok got it. Patch applied and working. As you know this kind of problem takes time to happen. But i'll keep you posted about the progress. Let's keep working at least one week. Thanks.
(In reply to Cassiano Peixoto from comment #29) Please verify with "file /usr/local/sbin/mpd5" command that the binary is "not stripped".
(In reply to Eugene Grosbein from comment #30) Yes, not stripped :) (root@B-ras)~# file /usr/local/sbin/mpd5 /usr/local/sbin/mpd5: ELF 64-bit LSB executable, x86-64, version 1 (FreeBSD), dynamically linked, interpreter /libexec/ld-elf.so.1, for FreeBSD 11.0 (1100122), FreeBSD-style, not stripped
(In reply to Cassiano Peixoto from comment #31) Eugene, I just tried on another machine to compile mpd5 with debug enabled and then ran a killall -QUIT to generate a crashdump. But nothing happens, i looked on /var/crash but there is nothing there. Am i missing something? Thanks.
(In reply to Cassiano Peixoto from comment #32) I'd better say "corefile" instead of "crashdump" as crashdump is generated for kernel panic and core file for userland process crashed or killed with a signal. Use something like sysctl kern.corefile='/var/tmp/%N.core' to specify location for corefiles. By default, its' written to process current directory. Take a look at /usr/local/etc/mpd5/ for mpd.core
(In reply to Eugene Grosbein from comment #33) I tried like you said, but didn't work, look: # sysctl kern.corefile='/var/tmp/%N.core' # ps ax | grep mpd5 31784 - Ss 0:00.01 /usr/local/sbin/mpd5 -p /var/run/mpd5.pid -b # killall -QUIT mpd5 # ls /var/tmp/ nginx/ vi.recover/ Shouldn't i run the process with gdb?
(In reply to Cassiano Peixoto from comment #34) Does "killall" really terminate the process or does it continue to run? Perhaps, the signal is blocked for some reason. There are several ways to start mpd5 from the beginning. How do you start it?
(In reply to Eugene Grosbein from comment #35) it's not running anymore, look: # ps ax | grep mpd5 52264 1 S+ 0:00.00 grep mpd5 # /usr/local/etc/rc.d/mpd5 forcestart Starting mpd5. # ps ax | grep mpd5 52275 - Ss 0:00.01 /usr/local/sbin/mpd5 -p /var/run/mpd5.pid -b 52286 1 S+ 0:00.00 grep mpd5 # killall -QUIT mpd5 # ps ax | grep mpd5 52293 1 S+ 0:00.00 grep mpd5 # sysctl kern.corefile kern.corefile: /var/tmp/%N.core # ls /var/tmp/ nginx vi.recover As you can see above i'm using rc.d script to mpd5 startup. It always start with: -p /var/run/mpd5.pid -b
(In reply to Cassiano Peixoto from comment #36) Do you use tcsh for your login shell? Anyway, try starting mpd using direct command "/usr/local/sbin/mpd5 -p /var/run/mpd5.pid -b" for a test instead of "service" to verify if it will be any difference with signal handling.
(In reply to Eugene Grosbein from comment #37) Same behaviour: # ps ax | grep mpd5 53396 1 R+ 0:00.00 grep mpd5 # /usr/local/sbin/mpd5 -p /var/run/mpd5.pid -b # ps ax | grep mpd5 53402 - Ss 0:00.01 /usr/local/sbin/mpd5 -p /var/run/mpd5.pid -b 53409 1 S+ 0:00.00 grep mpd5 # killall -QUIT mpd5 # ps ax | grep mpd5 53413 1 S+ 0:00.00 grep mpd5 # ls /var/tmp/ nginx vi.recover
(In reply to Cassiano Peixoto from comment #38) Please show output of the command: dmesg | grep mpd5
(In reply to Eugene Grosbein from comment #39) # dmesg | grep mpd5 pid 53402 (mpd5), uid 0: exited on signal 3 pid 53441 (mpd5), uid 0: exited on signal 3 pid 53758 (mpd5), uid 0: exited on signal 3
(In reply to Cassiano Peixoto from comment #40) The signal is delivered but core generation is disabled for some reason. To find it, please show output of commands: sysctl kern.coredump proccontrol -q -m trace -p $(cat /var/run/mpd5.pid) limits -c
(In reply to Eugene Grosbein from comment #41) I've just found out. My coredumpsize limit has been configured to 0Kb :( Now it worked fine. Thanks for your help. I'll keep you posted. Until now it's running fine with patch applied. Thanks.
(In reply to Cassiano Peixoto from comment #42) I've repeated my stress test for mpd5 and without RADIUS it runs just fine but with RADIUS I'm able to reliably reproduce this problem in a minute using unpatched mpd5 and libc. This the patch for mpd5, the problem goes away. Note however, that I did not use mpd's web server and the patch is not complete. The patch for libc seems to non-functional and incomplete too. I've raised this issue to freebsd-stable@ mailing list trying to get more attention to the problem from developers.
(In reply to Eugene Grosbein from comment #43) Hi Eugene, Thank you very much for your support. I stopped using mpd5 web, i replaced my script with mpd5 console to collect statistics. Let's see if it keeps stable. Is there a way to reproduce your tests with radius? As i'm using mpd5+radius would be interesting to try after mpd5 patch applied. If you don't mind I'd like to suggest two things: 1) Attach the patches on this PR, it will be helpful to someone who is following it. 2) Assign this PR to you, because clearly Ermal Luçi is not working on this anymore.
(In reply to Eugene Grosbein from comment #43) Hi Eugene, I just got a stuck process this morning with uwrlck state. As i said before i'm not using mpd web anymore. Here is the bt result after it hangs: #0 0x000000080248567c in _umtx_op_err () from /lib/libthr.so.3 #1 0x0000000802479c81 in __thr_rwlock_wrlock (rwlock=0x802694500, tsp=<value optimized out>) at /usr/src/lib/libthr/thread/thr_umtx.c:325 #2 0x00000008024829f3 in _thr_rtld_wlock_acquire (lock=<value optimized out>) at thr_umtx.h:239 #3 0x0000000800703c72 in wlock_acquire (lock=0x800918960, lockstate=0x7fffdfffda58) at /usr/src/libexec/rtld-elf/rtld_lock.c:217 #4 0x00000008006fff4f in _rtld_allocate_tls (oldtls=0x0, tcbsize=32, tcbalign=16) at /usr/src/libexec/rtld-elf/rtld.c:4802 #5 0x0000000802481b69 in _tcb_ctor (thread=0x807c7ba00, initial=<value optimized out>) at /usr/src/lib/libthr/thread/thr_ctrdtr.c:45 #6 0x0000000802481110 in _thr_alloc (curthread=<value optimized out>) at /usr/src/lib/libthr/thread/thr_list.c:170 #7 0x00000008024772a6 in _pthread_create (thread=0x80342fd78, attr=0x0, start_routine=0x4b17c0 <paction_main>, arg=<value optimized out>) at /usr/src/lib/libthr/thread/thr_create.c:78 #8 0x00000000004b1744 in paction_start (actionp=0x80cf7f3b8, mutex=0x7bee08, handler=0x442f40 <AuthAccount>, finish=0x4430c0 <AuthAccountFinish>, arg=0x807c20010) at contrib/libpdel/util/paction.c:115 #9 0x0000000000442eaa in AuthAccountStart (l=0x80cf7f010, type=3) at auth.c:927 #10 0x0000000000442f31 in AuthAccountTimeout (arg=0x80cf7f010) at auth.c:949 #11 0x00000000004a0344 in TimerExpires (type=3, cookie=0x80cf7f120) at timer.c:99 #12 0x0000000000462ed7 in EventHandler (arg=0x80cf7f120) at event.c:146 #13 0x00000000004b154e in pevent_ctx_execute (arg=0x807db6f08) at contrib/libpdel/util/pevent.c:885 #14 0x00000000004b112f in pevent_ctx_service (ev=0x807db6f08) at contrib/libpdel/util/pevent.c:774 #15 0x00000000004b0baf in pevent_ctx_main (arg=0x80322c008) at contrib/libpdel/util/pevent.c:720 #16 0x0000000802477b55 in thread_start (curthread=<value optimized out>) at /usr/src/lib/libthr/thread/thr_create.c:289
(In reply to Cassiano Peixoto from comment #45) Show the backtrace for all threads.
(In reply to Konstantin Belousov from comment #46) And exactly what FreeBSD version is it ?
In comment #31 (root@B-ras)~# file /usr/local/sbin/mpd5 /usr/local/sbin/mpd5: ELF 64-bit LSB executable, x86-64, version 1 (FreeBSD), dynamically linked, interpreter /libexec/ld-elf.so.1, for FreeBSD 11.0 (1100122), FreeBSD-style, not stripped
(In reply to Konstantin Belousov from comment #47) I'm using FreeBSD 11.0-p1. But it happens on any FreeBSD version above 9x. Here is the backtrace for all threads: (gdb) thread apply all bt Thread 6 (Thread 803216a00 (LWP 100375/<unknown>)): #0 0x000000080248567c in _umtx_op_err () from /lib/libthr.so.3 #1 0x0000000802479c81 in __thr_rwlock_wrlock (rwlock=0x802694500, tsp=<value optimized out>) at /usr/src/lib/libthr/thread/thr_umtx.c:325 #2 0x00000008024829f3 in _thr_rtld_wlock_acquire (lock=<value optimized out>) at thr_umtx.h:239 #3 0x0000000800703c72 in wlock_acquire (lock=0x800918960, lockstate=0x7fffdfffda58) at /usr/src/libexec/rtld-elf/rtld_lock.c:217 #4 0x00000008006fff4f in _rtld_allocate_tls (oldtls=0x0, tcbsize=32, tcbalign=16) at /usr/src/libexec/rtld-elf/rtld.c:4802 #5 0x0000000802481b69 in _tcb_ctor (thread=0x807c7ba00, initial=<value optimized out>) at /usr/src/lib/libthr/thread/thr_ctrdtr.c:45 #6 0x0000000802481110 in _thr_alloc (curthread=<value optimized out>) at /usr/src/lib/libthr/thread/thr_list.c:170 #7 0x00000008024772a6 in _pthread_create (thread=0x80342fd78, attr=0x0, start_routine=0x4b17c0 <paction_main>, arg=<value optimized out>) at /usr/src/lib/libthr/thread/thr_create.c:78 #8 0x00000000004b1744 in paction_start (actionp=0x80cf7f3b8, mutex=0x7bee08, handler=0x442f40 <AuthAccount>, finish=0x4430c0 <AuthAccountFinish>, arg=0x807c20010) at contrib/libpdel/util/paction.c:115 #9 0x0000000000442eaa in AuthAccountStart (l=0x80cf7f010, type=3) at auth.c:927 #10 0x0000000000442f31 in AuthAccountTimeout (arg=0x80cf7f010) at auth.c:949 #11 0x00000000004a0344 in TimerExpires (type=3, cookie=0x80cf7f120) at timer.c:99 #12 0x0000000000462ed7 in EventHandler (arg=0x80cf7f120) at event.c:146 #13 0x00000000004b154e in pevent_ctx_execute (arg=0x807db6f08) at contrib/libpdel/util/pevent.c:885 #14 0x00000000004b112f in pevent_ctx_service (ev=0x807db6f08) at contrib/libpdel/util/pevent.c:774 #15 0x00000000004b0baf in pevent_ctx_main (arg=0x80322c008) at contrib/libpdel/util/pevent.c:720 #16 0x0000000802477b55 in thread_start (curthread=<value optimized out>) at /usr/src/lib/libthr/thread/thr_create.c:289 #17 0x0000000000000000 in ?? () Thread 5 (Thread 80342a000 (LWP 100378/<unknown>)): #0 0x000000080248567a in _umtx_op_err () from /lib/libthr.so.3 #1 0x00000008024796e4 in __thr_umutex_lock (mtx=<value optimized out>, id=<value optimized out>) at /usr/src/lib/libthr/thread/thr_umtx.c:80 #2 0x0000000802481106 in _thr_alloc (curthread=<value optimized out>) at thr_umtx.h:123 #3 0x00000008024772a6 in _pthread_create (thread=0x8076dd438, attr=0x0, start_routine=0x4c4ec0 <http_server_connection_main>, arg=<value optimized out>) at /usr/src/lib/libthr/thread/thr_create.c:78 #4 0x00000000004c400a in http_server_accept (arg=0x803420088) at contrib/libpdel/http/http_server.c:653 #5 0x00000000004b154e in pevent_ctx_execute (arg=0x803420108) at contrib/libpdel/util/pevent.c:885 #6 0x00000000004b112f in pevent_ctx_service (ev=0x803420108) at contrib/libpdel/util/pevent.c:774 #7 0x00000000004b0baf in pevent_ctx_main (arg=0x80340e0a8) at contrib/libpdel/util/pevent.c:720 #8 0x0000000802477b55 in thread_start (curthread=<value optimized out>) at /usr/src/lib/libthr/thread/thr_create.c:289 #9 0x0000000000000000 in ?? () Thread 4 (Thread 807ccdf00 (LWP 101673/<unknown>)): #0 0x000000080277d43a in _nanosleep () from /lib/libc.so.7 #1 0x000000080247a4cc in __thr_nanosleep (time_to_sleep=0x7fffcbd5ceb0, time_remaining=0x0) at /usr/src/lib/libthr/thread/thr_syscalls.c:257 #2 0x00000008027d5036 in __usleep (useconds=<value optimized out>) at /usr/src/lib/libc/gen/usleep.c:52 #3 0x00000000004b1aaf in paction_cleanup (arg=0x80342f288) at contrib/libpdel/util/paction.c:243 #4 0x0000000802485550 in __pthread_cleanup_pop_imp (execute=<value optimized out>) at /usr/src/lib/libthr/thread/thr_clean.c:73 #5 0x00000000004b189f in paction_main (arg=0x80342f288) at contrib/libpdel/util/paction.c:197 #6 0x0000000802477b55 in thread_start (curthread=<value optimized out>) at /usr/src/lib/libthr/thread/thr_create.c:289 #7 0x00007fffcbb5d000 in ?? () Cannot access memory at address 0x7fffcbd5d000 Thread 3 (Thread 804fd4900 (LWP 101715/<unknown>)): #0 0x000000080277d43a in _nanosleep () from /lib/libc.so.7 #1 0x000000080247a4cc in __thr_nanosleep (time_to_sleep=0x7fffcad54eb0, time_remaining=0x0) at /usr/src/lib/libthr/thread/thr_syscalls.c:257 #2 0x00000008027d5036 in __usleep (useconds=<value optimized out>) at /usr/src/lib/libc/gen/usleep.c:52 ---Type <return> to continue, or q <return> to quit--- #3 0x00000000004b1aaf in paction_cleanup (arg=0x80342fb48) at contrib/libpdel/util/paction.c:243 #4 0x0000000802485550 in __pthread_cleanup_pop_imp (execute=<value optimized out>) at /usr/src/lib/libthr/thread/thr_clean.c:73 #5 0x00000000004b189f in paction_main (arg=0x80342fb48) at contrib/libpdel/util/paction.c:197 #6 0x0000000802477b55 in thread_start (curthread=<value optimized out>) at /usr/src/lib/libthr/thread/thr_create.c:289 #7 0x00007fffcab55000 in ?? () Cannot access memory at address 0x7fffcad55000 Thread 2 (Thread 807d70500 (LWP 101723/<unknown>)): #0 0x000000080277d43a in _nanosleep () from /lib/libc.so.7 #1 0x000000080247a4cc in __thr_nanosleep (time_to_sleep=0x7fffbc0deeb0, time_remaining=0x0) at /usr/src/lib/libthr/thread/thr_syscalls.c:257 #2 0x00000008027d5036 in __usleep (useconds=<value optimized out>) at /usr/src/lib/libc/gen/usleep.c:52 #3 0x00000000004b1aaf in paction_cleanup (arg=0x80342fcd8) at contrib/libpdel/util/paction.c:243 #4 0x0000000802485550 in __pthread_cleanup_pop_imp (execute=<value optimized out>) at /usr/src/lib/libthr/thread/thr_clean.c:73 #5 0x00000000004b189f in paction_main (arg=0x80342fcd8) at contrib/libpdel/util/paction.c:197 #6 0x0000000802477b55 in thread_start (curthread=<value optimized out>) at /usr/src/lib/libthr/thread/thr_create.c:289 #7 0x00007fffbbedf000 in ?? () Cannot access memory at address 0x7fffbc0df000 Thread 1 (Thread 807cd4800 (LWP 101061/<unknown>)): #0 0x000000080277d43a in _nanosleep () from /lib/libc.so.7 #1 0x000000080247a4cc in __thr_nanosleep (time_to_sleep=0x7fffa9c4ceb0, time_remaining=0x0) at /usr/src/lib/libthr/thread/thr_syscalls.c:257 #2 0x00000008027d5036 in __usleep (useconds=<value optimized out>) at /usr/src/lib/libc/gen/usleep.c:52 #3 0x00000000004b1aaf in paction_cleanup (arg=0x80342fd28) at contrib/libpdel/util/paction.c:243 #4 0x0000000802485550 in __pthread_cleanup_pop_imp (execute=<value optimized out>) at /usr/src/lib/libthr/thread/thr_clean.c:73 #5 0x00000000004b189f in paction_main (arg=0x80342fd28) at contrib/libpdel/util/paction.c:197 #6 0x0000000802477b55 in thread_start (curthread=<value optimized out>) at /usr/src/lib/libthr/thread/thr_create.c:289 #7 0x00007fffa9a4d000 in ?? () Cannot access memory at address 0x7fffa9c4d000 #0 0x000000080248567c in _umtx_op_err () from /lib/libthr.so.3
(In reply to Cassiano Peixoto from comment #49) From the thread 6, in the frame 1, please do 'p *rwlock'
(In reply to Konstantin Belousov from comment #50) Please, see below: (gdb) frame 1 #1 0x0000000802479c81 in __thr_rwlock_wrlock (rwlock=0x802694500, tsp=<value optimized out>) at /usr/src/lib/libthr/thread/thr_umtx.c:325 325 return (_umtx_op_err(rwlock, UMTX_OP_RW_WRLOCK, 0, (void *)tm_size, Current language: auto; currently minimal (gdb) p *rwlock $1 = {rw_state = -1610612736, rw_flags = 2, rw_blocked_readers = 0, rw_blocked_writers = 0, rw_spare = 0x802694510} (gdb)
(In reply to Cassiano Peixoto from comment #51) Please apply the following patch, recompile libthr.so _only_, and then reproduce the issue again. I need the same information: backtraces from all alive threads, and for the thread which is blocked on lock in _rtld_allocate_tls(), the printout of *rwlock.
Created attachment 183536 [details] debugging patch to track rtld bind lock write owner
(In reply to Konstantin Belousov from comment #53) Should i restart or recompile mpd5 after patch applied?
Restart only.
Although I may be wrong. Can you compile mpd5 from sources from sf.net ? I can add some helpful debug to than.
Created attachment 183537 [details] patch for lib/syslog by kib (In reply to Cassiano Peixoto from comment #51) I'd advise you stop using any patches sumbitted my me to the PR 214482 as they are no longer needed to find a root of the "umtxn" problem and are not quite correct. Instead, as more complete fix to "umtxn" problem use two following patches: first by Konstantin for syslog (libc) that should be applied so: cd /usr/src patch < /path/to/patch cd lib/libc make obj depend && make all install Then restart mpd5 only. Second patch for mpd5 itself by me. It tries to fix "uwrlck" problem in more correct way, replacing my previous patch. It deals with CLI console only, not touching web part still. Please make sure no http access is done to mpd. It would be fine to comment out "set web" commands in the mpd.conf for a while. And, of course, apply Konstantin's latest patch for rtld.
Created attachment 183538 [details] patch for mpd/console locks by me
(In reply to Cassiano Peixoto from comment #49) Note: Thread 5 shows "http_server_accept" in the trace that means that mpd web interface is being used.
(In reply to Dmitry Luhtionov from comment #56) Hi Dmitry, Yes, i can. Please just provide the link and i can download and compile. Thanks.
(In reply to Eugene Grosbein from comment #59) Eugene, i think it's because web option was still enabled in mpd.conf. Anyway i disabled it now to avoid any issues.
(In reply to Eugene Grosbein from comment #57) Eugene, i got the following error compiling libc patch: ===> tests (all) ===> tests/tls_dso (all) ===> tests/c063 (all) (cd /usr/src/lib/libc/tests/c063 && DEPENDFILE=.depend.faccessat_test NO_SUBDIR=1 make -f /usr/src/lib/libc/tests/c063/Makefile _RECURSING_PROGS=t PROG=faccessat_test ) make[3]: /usr/obj/usr/src/lib/libc/tests/c063/.depend.faccessat_test, 1: ignoring stale .depend.faccessat_test for /usr/obj/usr/src/lib/libnetbsd/libnetbsd.a cc -O2 -pipe -D_INCOMPLETE_XOPEN_C063 -I/usr/src/lib/libnetbsd -I/usr/src/contrib/netbsd-tests -g -std=gnu99 -fstack-protector-strong -Wsystem-headers -Werror -Wall -Wno-format-y2k -Wno-uninitialized -Wno-pointer-sign -Wno-empty-body -Wno-string-plus-int -Wno-unused-const-variable -Wno-tautological-compare -Wno-unused-value -Wno-parentheses-equality -Wno-unused-function -Wno-enum-conversion -Wno-unused-local-typedef -Wno-switch -Wno-switch-enum -Wno-knr-promoted-parameter -Qunused-arguments -L/usr/obj/usr/src/lib/libnetbsd -o faccessat_test.full t_faccessat.o -lprivateatf-c -L/usr/obj/usr/src/lib/libnetbsd -lnetbsd /usr/bin/ld: cannot find -lnetbsd cc: error: linker command failed with exit code 1 (use -v to see invocation) *** Error code 1 Stop. make[3]: stopped in /usr/src/lib/libc/tests/c063 *** Error code 1 Stop. make[2]: stopped in /usr/src/lib/libc/tests/c063 *** Error code 1 Stop. make[1]: stopped in /usr/src/lib/libc/tests *** Error code 1 Stop. make: stopped in /usr/src/lib/libc Exit 1
(In reply to Cassiano Peixoto from comment #62) Use 'make -DWITHOUT_TESTS=yes' to compile both libc and libthr.
(In reply to Konstantin Belousov from comment #63) Sorry, the correct spelling is 'make WITHOUT_TESTS=yes', no -D is needed.
(In reply to Konstantin Belousov from comment #64) Konstantin, Libc patch compiled fine after using 'make WITHOUT_TESTS=yes'. But libthr patch doesn't, see: cc -O2 -pipe -DPTHREAD_KERNEL -I/usr/src/lib/libthr/../libc/include -I/usr/src/lib/libthr/thread -I/usr/src/lib/libthr/../../include -I/usr/src/lib/libthr/arch/amd64/include -I/usr/src/lib/libthr/sys -I/usr/src/lib/libthr/../../libexec/rtld-elf -I/usr/src/lib/libthr/../../libexec/rtld-elf/amd64 -I/usr/src/lib/libthr/../libthread_db -Winline -fexceptions -D_PTHREAD_FORCED_UNWIND -D_PTHREADS_INVARIANTS -mno-mmx -mno-sse -mno-avx -MD -MF.depend.thr_init.o -MTthr_init.o -std=gnu99 -Wsystem-headers -Werror -Wall -Wno-format-y2k -W -Wno-unused-parameter -Wstrict-prototypes -Wmissing-prototypes -Wpointer-arith -Wno-uninitialized -Wno-pointer-sign -Wno-empty-body -Wno-string-plus-int -Wno-unused-const-variable -Wno-tautological-compare -Wno-unused-value -Wno-parentheses-equality -Wno-unused-function -Wno-enum-conversion -Wno-unused-local-typedef -Qunused-arguments -c /usr/src/lib/libthr/thread/thr_init.c -o thr_init.o /usr/src/lib/libthr/thread/thr_init.c:71:35: error: excess elements in scalar initializer [-Werror] struct urwlock _thr_atfork_lock = DEFAULT_URWLOCK; ^~~~~~~~~~~~~~~ /usr/src/lib/libthr/thread/thr_umtx.h:40:39: note: expanded from macro 'DEFAULT_URWLOCK' #define DEFAULT_URWLOCK {0,0,0,0,0,{0,0,0}} ^ /usr/src/lib/libthr/thread/thr_init.c:71:35: error: suggest braces around initialization of subobject [-Werror,-Wmissing-braces] struct urwlock _thr_atfork_lock = DEFAULT_URWLOCK; ^~~~~~~~~~~~~~~ /usr/src/lib/libthr/thread/thr_umtx.h:40:34: note: expanded from macro 'DEFAULT_URWLOCK' #define DEFAULT_URWLOCK {0,0,0,0,0,{0,0,0}} ^~~~~~~~~ /usr/src/lib/libthr/thread/thr_init.c:127:33: error: excess elements in scalar initializer [-Werror] struct urwlock _thr_list_lock = DEFAULT_URWLOCK; ^~~~~~~~~~~~~~~ /usr/src/lib/libthr/thread/thr_umtx.h:40:39: note: expanded from macro 'DEFAULT_URWLOCK' #define DEFAULT_URWLOCK {0,0,0,0,0,{0,0,0}} ^ /usr/src/lib/libthr/thread/thr_init.c:127:33: error: suggest braces around initialization of subobject [-Werror,-Wmissing-braces] struct urwlock _thr_list_lock = DEFAULT_URWLOCK; ^~~~~~~~~~~~~~~ /usr/src/lib/libthr/thread/thr_umtx.h:40:34: note: expanded from macro 'DEFAULT_URWLOCK' #define DEFAULT_URWLOCK {0,0,0,0,0,{0,0,0}} ^~~~~~~~~ 4 errors generated. *** Error code 1 Stop. make: stopped in /usr/src/lib/libthr
(In reply to Cassiano Peixoto from comment #65) Perhaps you need to manually copy patched sys/_umtx.h to /usr/include/sys. Do not forget to revert this when cleaning your machine.
(In reply to Konstantin Belousov from comment #66) Konstantin, Worked after copied _umtx.h. Thanks. Konstantin and Eugene, I applied the patches, to sum up what i've applied: - libc patch - libthr patch - mpd console patch As you know it takes some days to happen again. Last time took 5 days. I wish, but i don't know how to force to reproduce the issue at any time. I'll keep you posted guys, thank you.
*** Bug 214482 has been marked as a duplicate of this bug. ***
Assign to Eugene who appears (currently) to be coordinating. It would be ideal if a separate (dependent) issue is created for the/any base changes required to resolve this issue, so that it can be assigned to the person resolving on that side (kib?) and tracked for MFC's and release engineering if appropriate (ports categorized issues don't have those flags)
JFYI: these patches made my mpd server side pretty stable, so my stress test can run long enough to hit another problem that produces repeatable kernel panics at client side. I've digged that problem and produced a fix for ng_iface(4) and filled another PR for that distinct problem: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=220076
IFYI2: another reason for this task to break and new set of patches here: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=220078
(In reply to Eugene Grosbein from comment #71) Is there an issue for the syslog related changes/patch discussed with kib@ on freebsd-current?
(In reply to Kubilay Kocak from comment #72) This is the patch https://bugs.freebsd.org/bugzilla/attachment.cgi?id=183537 My tests show no issues with it. It really helps and I'd like to see it included in upcoming 11.1-RELEASE and merged to stable/10 too for announced 10.4.
(In reply to Eugene Grosbein from comment #73) I order to ensure that (base) change is tracked and merged correctly, I'd recommend (as suggested in comment 69) creating a separate dependent issue so that it can be: - Assigned to kib (as distinct from this overarching issue) - Tracked for MFC's correctly and completely - Used to notify/inform release engineering specifically and only for that change
(In reply to Kubilay Kocak from comment #74) Done: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=220151 Konstantin Belousov (author of the patch) is in the re@, so this team should be aware of the problem.
(In reply to Eugene Grosbein from comment #75) Eugene, are you going to provide all these patches to FreeBSD 10-STABLE?
(In reply to Cassiano Peixoto from comment #76) All patches are pretty simple and should just work for 10.x
(In reply to Eugene Grosbein from comment #77) Hi guys, After 8 days working with no issues i think at last it has been fixed. Well done Eugene :) I could see libc patch has been committed to 11-STABLE. Is there a plan to commit libthr patch and all others PRs related before 11.1-RELEASE? Thank you guys.
(In reply to Cassiano Peixoto from comment #78) I believe, the libthr patch was for debugging purposes only and is not needed to fix the problem itself. Did you test it with web server disabled? If so, please update your ports tree and rebuild mpd5 port that now has "console" patch integrated, so remove your local version of the patch before updating ports tree - my tests show that it should be enough to fix mpd's web server too. And run mpd-5.8_1 with web server enabled as you did earlier to verify that.
(In reply to Eugene Grosbein from comment #79) Yes, i did. All my tests were with web server disabled as you requested. Only with console enabled. I'll run with web server enabled to try the patch. How about other related patchs (like netgraph, ipfw, etc) ? Are you going to commit? Thanks.
(In reply to Cassiano Peixoto from comment #80) I cannot commit kernel patches myself as I have no src commit bit. Any src committed is needed to take a look at least, so I've filled my PRs.
(In reply to Eugene Grosbein from comment #81) Humm i see. Maybe Konstantin Belousov could help us :)
(In reply to Eugene Grosbein from comment #81) Eugene and Konstantin, Bad news, it just stopped working. Eugene i hadn't enabled web server yet. So it stucked the mpd5 process with patch applied listed on comment #67. The top out i can see same message: 1295 root 9 52 0 314M 54820K uwrlck 7 625:46 0.74% mpd5 Here it is the bt: (gdb) bt #0 0x000000080228567c in _umtx_op_err () from /lib/libthr.so.3 #1 0x0000000802279c81 in __thr_rwlock_wrlock (rwlock=0x802494500, tsp=<value optimized out>) at /usr/src/lib/libthr/thread/thr_umtx.c:325 #2 0x00000008022829f3 in _thr_rtld_wlock_acquire (lock=<value optimized out>) at thr_umtx.h:239 #3 0x00000008006aec72 in wlock_acquire (lock=0x8008c3960, lockstate=0x7fffdfffda08) at /usr/src/libexec/rtld-elf/rtld_lock.c:217 #4 0x00000008006aaf4f in _rtld_allocate_tls (oldtls=0x0, tcbsize=32, tcbalign=16) at /usr/src/libexec/rtld-elf/rtld.c:4802 #5 0x0000000802281b69 in _tcb_ctor (thread=0x80c740f00, initial=<value optimized out>) at /usr/src/lib/libthr/thread/thr_ctrdtr.c:45 #6 0x0000000802281110 in _thr_alloc (curthread=<value optimized out>) at /usr/src/lib/libthr/thread/thr_list.c:170 #7 0x00000008022772a6 in _pthread_create (thread=0x80324d598, attr=0x0, start_routine=0x4653b0 <fseeko@plt+388260>, arg=<value optimized out>) at /usr/src/lib/libthr/thread/thr_create.c:78 #8 0x000000000046535b in ?? () #9 0x00000000004292a8 in ?? () #10 0x000000000042d219 in ?? () #11 0x0000000000448da3 in ?? () #12 0x00000000004493b1 in ?? () #13 0x000000000043ac94 in ?? () #14 0x000000000043d650 in ?? () #15 0x000000000043d366 in ?? () #16 0x000000000043b21d in ?? () #17 0x0000000000445012 in ?? () #18 0x000000000044b598 in ?? () #19 0x0000000000439c9f in ?? () #20 0x00000000004651d6 in ?? () #21 0x0000000000464908 in ?? () #22 0x0000000802277b55 in thread_start (curthread=<value optimized out>) at /usr/src/lib/libthr/thread/thr_create.c:289 Here it is all threads: (gdb) thread apply all bt Thread 9 (Thread 803016a00 (LWP 100704/<unknown>)): #0 0x000000080228567c in _umtx_op_err () from /lib/libthr.so.3 #1 0x0000000802279c81 in __thr_rwlock_wrlock (rwlock=0x802494500, tsp=<value optimized out>) at /usr/src/lib/libthr/thread/thr_umtx.c:325 #2 0x00000008022829f3 in _thr_rtld_wlock_acquire (lock=<value optimized out>) at thr_umtx.h:239 #3 0x00000008006aec72 in wlock_acquire (lock=0x8008c3960, lockstate=0x7fffdfffda08) at /usr/src/libexec/rtld-elf/rtld_lock.c:217 #4 0x00000008006aaf4f in _rtld_allocate_tls (oldtls=0x0, tcbsize=32, tcbalign=16) at /usr/src/libexec/rtld-elf/rtld.c:4802 #5 0x0000000802281b69 in _tcb_ctor (thread=0x80c740f00, initial=<value optimized out>) at /usr/src/lib/libthr/thread/thr_ctrdtr.c:45 #6 0x0000000802281110 in _thr_alloc (curthread=<value optimized out>) at /usr/src/lib/libthr/thread/thr_list.c:170 #7 0x00000008022772a6 in _pthread_create (thread=0x80324d598, attr=0x0, start_routine=0x4653b0 <fseeko@plt+388260>, arg=<value optimized out>) at /usr/src/lib/libthr/thread/thr_create.c:78 #8 0x000000000046535b in ?? () #9 0x00000000004292a8 in ?? () #10 0x000000000042d219 in ?? () #11 0x0000000000448da3 in ?? () #12 0x00000000004493b1 in ?? () #13 0x000000000043ac94 in ?? () #14 0x000000000043d650 in ?? () #15 0x000000000043d366 in ?? () #16 0x000000000043b21d in ?? () #17 0x0000000000445012 in ?? () #18 0x000000000044b598 in ?? () #19 0x0000000000439c9f in ?? () #20 0x00000000004651d6 in ?? () #21 0x0000000000464908 in ?? () #22 0x0000000802277b55 in thread_start (curthread=<value optimized out>) at /usr/src/lib/libthr/thread/thr_create.c:289 #23 0x0000000000000000 in ?? () Thread 8 (Thread 80326ca00 (LWP 101005/<unknown>)): #0 0x000000080257d43a in _nanosleep () from /lib/libc.so.7 #1 0x000000080227a4cc in __thr_nanosleep (time_to_sleep=0x7fffc6731ed0, time_remaining=0x0) at /usr/src/lib/libthr/thread/thr_syscalls.c:257 #2 0x00000008025d5076 in __usleep (useconds=<value optimized out>) at /usr/src/lib/libc/gen/usleep.c:52 #3 0x0000000000465532 in ?? () #4 0x0000000802285550 in __pthread_cleanup_pop_imp (execute=<value optimized out>) at /usr/src/lib/libthr/thread/thr_clean.c:73 #5 0x000000000046541e in ?? () #6 0x0000000802277b55 in thread_start (curthread=<value optimized out>) at /usr/src/lib/libthr/thread/thr_create.c:289 #7 0x00007fffc6532000 in ?? () Cannot access memory at address 0x7fffc6732000 Thread 7 (Thread 80326ed00 (LWP 101007/<unknown>)): #0 0x000000080257d43a in _nanosleep () from /lib/libc.so.7 #1 0x000000080227a4cc in __thr_nanosleep (time_to_sleep=0x7fffb7cbced0, time_remaining=0x0) at /usr/src/lib/libthr/thread/thr_syscalls.c:257 #2 0x00000008025d5076 in __usleep (useconds=<value optimized out>) at /usr/src/lib/libc/gen/usleep.c:52 #3 0x0000000000465532 in ?? () #4 0x0000000802285550 in __pthread_cleanup_pop_imp (execute=<value optimized out>) at /usr/src/lib/libthr/thread/thr_clean.c:73 #5 0x000000000046541e in ?? () #6 0x0000000802277b55 in thread_start (curthread=<value optimized out>) at /usr/src/lib/libthr/thread/thr_create.c:289 #7 0x00007fffb7abd000 in ?? () Cannot access memory at address 0x7fffb7cbd000 ---Type <return> to continue, or q <return> to quit--- Thread 6 (Thread 803252e00 (LWP 100809/<unknown>)): #0 0x000000080257d43a in _nanosleep () from /lib/libc.so.7 #1 0x000000080227a4cc in __thr_nanosleep (time_to_sleep=0x7fffb86c1ed0, time_remaining=0x0) at /usr/src/lib/libthr/thread/thr_syscalls.c:257 #2 0x00000008025d5076 in __usleep (useconds=<value optimized out>) at /usr/src/lib/libc/gen/usleep.c:52 #3 0x0000000000465532 in ?? () #4 0x0000000802285550 in __pthread_cleanup_pop_imp (execute=<value optimized out>) at /usr/src/lib/libthr/thread/thr_clean.c:73 #5 0x000000000046541e in ?? () #6 0x0000000802277b55 in thread_start (curthread=<value optimized out>) at /usr/src/lib/libthr/thread/thr_create.c:289 #7 0x00007fffb84c2000 in ?? () Cannot access memory at address 0x7fffb86c2000 Thread 5 (Thread 807292a00 (LWP 100290/<unknown>)): #0 0x000000080257d43a in _nanosleep () from /lib/libc.so.7 #1 0x000000080227a4cc in __thr_nanosleep (time_to_sleep=0x7fff88b44ed0, time_remaining=0x0) at /usr/src/lib/libthr/thread/thr_syscalls.c:257 #2 0x00000008025d5076 in __usleep (useconds=<value optimized out>) at /usr/src/lib/libc/gen/usleep.c:52 #3 0x0000000000465532 in ?? () #4 0x0000000802285550 in __pthread_cleanup_pop_imp (execute=<value optimized out>) at /usr/src/lib/libthr/thread/thr_clean.c:73 #5 0x000000000046541e in ?? () #6 0x0000000802277b55 in thread_start (curthread=<value optimized out>) at /usr/src/lib/libthr/thread/thr_create.c:289 #7 0x00007fff88945000 in ?? () Cannot access memory at address 0x7fff88b45000 Thread 4 (Thread 803253d00 (LWP 101008/<unknown>)): #0 0x000000080257d43a in _nanosleep () from /lib/libc.so.7 #1 0x000000080227a4cc in __thr_nanosleep (time_to_sleep=0x7fffbdaebed0, time_remaining=0x0) at /usr/src/lib/libthr/thread/thr_syscalls.c:257 #2 0x00000008025d5076 in __usleep (useconds=<value optimized out>) at /usr/src/lib/libc/gen/usleep.c:52 #3 0x0000000000465532 in ?? () #4 0x0000000802285550 in __pthread_cleanup_pop_imp (execute=<value optimized out>) at /usr/src/lib/libthr/thread/thr_clean.c:73 #5 0x000000000046541e in ?? () #6 0x0000000802277b55 in thread_start (curthread=<value optimized out>) at /usr/src/lib/libthr/thread/thr_create.c:289 #7 0x00007fffbd8ec000 in ?? () Cannot access memory at address 0x7fffbdaec000 Thread 3 (Thread 808089000 (LWP 101010/<unknown>)): #0 0x000000080257d43a in _nanosleep () from /lib/libc.so.7 #1 0x000000080227a4cc in __thr_nanosleep (time_to_sleep=0x7fff8bf5eed0, time_remaining=0x0) at /usr/src/lib/libthr/thread/thr_syscalls.c:257 #2 0x00000008025d5076 in __usleep (useconds=<value optimized out>) at /usr/src/lib/libc/gen/usleep.c:52 #3 0x0000000000465532 in ?? () #4 0x0000000802285550 in __pthread_cleanup_pop_imp (execute=<value optimized out>) at /usr/src/lib/libthr/thread/thr_clean.c:73 #5 0x000000000046541e in ?? () #6 0x0000000802277b55 in thread_start (curthread=<value optimized out>) at /usr/src/lib/libthr/thread/thr_create.c:289 #7 0x00007fff8bd5f000 in ?? () Cannot access memory at address 0x7fff8bf5f000 Thread 2 (Thread 80808a400 (LWP 101011/<unknown>)): #0 0x000000080257d43a in _nanosleep () from /lib/libc.so.7 #1 0x000000080227a4cc in __thr_nanosleep (time_to_sleep=0x7fff9178aed0, time_remaining=0x0) at /usr/src/lib/libthr/thread/thr_syscalls.c:257 #2 0x00000008025d5076 in __usleep (useconds=<value optimized out>) at /usr/src/lib/libc/gen/usleep.c:52 #3 0x0000000000465532 in ?? () #4 0x0000000802285550 in __pthread_cleanup_pop_imp (execute=<value optimized out>) at /usr/src/lib/libthr/thread/thr_clean.c:73 #5 0x000000000046541e in ?? () ---Type <return> to continue, or q <return> to quit--- #6 0x0000000802277b55 in thread_start (curthread=<value optimized out>) at /usr/src/lib/libthr/thread/thr_create.c:289 #7 0x00007fff9158b000 in ?? () Cannot access memory at address 0x7fff9178b000 Thread 1 (Thread 807258d00 (LWP 101012/<unknown>)): #0 0x000000080257d43a in _nanosleep () from /lib/libc.so.7 #1 0x000000080227a4cc in __thr_nanosleep (time_to_sleep=0x7fff7c2e0ed0, time_remaining=0x0) at /usr/src/lib/libthr/thread/thr_syscalls.c:257 #2 0x00000008025d5076 in __usleep (useconds=<value optimized out>) at /usr/src/lib/libc/gen/usleep.c:52 #3 0x0000000000465532 in ?? () #4 0x0000000802285550 in __pthread_cleanup_pop_imp (execute=<value optimized out>) at /usr/src/lib/libthr/thread/thr_clean.c:73 #5 0x000000000046541e in ?? () #6 0x0000000802277b55 in thread_start (curthread=<value optimized out>) at /usr/src/lib/libthr/thread/thr_create.c:289 #7 0x00007fff7c0e1000 in ?? () Cannot access memory at address 0x7fff7c2e1000 #0 0x000000080228567c in _umtx_op_err () from /lib/libthr.so.3 Printof of *rwlock: (gdb) frame 1 #1 0x0000000802279c81 in __thr_rwlock_wrlock (rwlock=0x802494500, tsp=<value optimized out>) at /usr/src/lib/libthr/thread/thr_umtx.c:325 325 return (_umtx_op_err(rwlock, UMTX_OP_RW_WRLOCK, 0, (void *)tm_size, Current language: auto; currently minimal (gdb) p *rwlock $1 = {rw_state = -1610612736, rw_flags = 2, rw_blocked_readers = 0, rw_blocked_writers = 0, rw_wowner = 100704, rw_spare = 0x802494514} I could see some kernel messages as well, earlier today: Jun 24 10:14:21 B-ras kernel: node: ID [20498d]: type 'tee', 0 hooks, flags 0x9, 0 refs, mpd1295-vlan340-44-lt: Jun 24 10:14:21 B-ras kernel: Last active @ /usr/src/sys/netgraph/ng_base.c, line 804 Jun 24 10:14:21 B-ras kernel: problem discovered at file /usr/src/sys/netgraph/ng_base.c, line 3709 Jun 24 10:14:21 B-ras kernel: KDB: stack backtrace: Jun 24 10:14:21 B-ras kernel: #0 0xffffffff80b1af87 at kdb_backtrace+0x67 Jun 24 10:14:21 B-ras kernel: #1 0xffffffff80c46671 at ng_send_fn1+0xc1 Jun 24 10:14:21 B-ras kernel: #2 0xffffffff80c3e9c4 at ng_destroy_hook+0x334 Jun 24 10:14:21 B-ras kernel: #3 0xffffffff80c4289b at ng_apply_item+0x3eb Jun 24 10:14:21 B-ras kernel: #4 0xffffffff80c48475 at ngthread+0x315 Jun 24 10:14:21 B-ras kernel: #5 0xffffffff80a88c55 at fork_exit+0x85 Jun 24 10:14:21 B-ras kernel: #6 0xffffffff80ec5f8e at fork_trampoline+0xe Jun 24 10:14:21 B-ras kernel: Accessing freed node: ID [20498d]: type 'tee', 0 hooks, flags 0x9, 1 refs, mpd1295-vlan340-44-lt: Jun 24 10:14:21 B-ras kernel: Last active @ /usr/src/sys/netgraph/ng_base.c, line 3709 Jun 24 10:14:21 B-ras kernel: problem discovered at file /usr/src/sys/netgraph/ng_base.c, line 2455 Jun 24 10:14:21 B-ras kernel: KDB: stack backtrace: Jun 24 10:14:21 B-ras kernel: #0 0xffffffff80b1af87 at kdb_backtrace+0x67 Jun 24 10:14:21 B-ras kernel: #1 0xffffffff80c42711 at ng_apply_item+0x261 Jun 24 10:14:21 B-ras kernel: #2 0xffffffff80c4217e at ng_snd_item+0x1ee Jun 24 10:14:21 B-ras kernel: #3 0xffffffff80c3e9c4 at ng_destroy_hook+0x334 Jun 24 10:14:21 B-ras kernel: #4 0xffffffff80c4289b at ng_apply_item+0x3eb Jun 24 10:14:21 B-ras kernel: #5 0xffffffff80c48475 at ngthread+0x315 Jun 24 10:14:21 B-ras kernel: #6 0xffffffff80a88c55 at fork_exit+0x85 Jun 24 10:14:21 B-ras kernel: #7 0xffffffff80ec5f8e at fork_trampoline+0xe Jun 24 10:14:21 B-ras kernel: Accessing freed node: ID [20498d]: type 'tee', 0 hooks, flags 0x9, 1 refs, mpd1295-vlan340-44-lt: Jun 24 10:14:21 B-ras kernel: Last active @ /usr/src/sys/netgraph/ng_base.c, line 2455 Jun 24 10:14:22 B-ras kernel: problem discovered at file /usr/src/sys/netgraph/ng_base.c, line 2335 Jun 24 10:14:22 B-ras kernel: KDB: stack backtrace: Jun 24 10:14:22 B-ras kernel: #0 0xffffffff80b1af87 at kdb_backtrace+0x67 Jun 24 10:14:22 B-ras kernel: #1 0xffffffff80c42308 at ng_snd_item+0x378 Jun 24 10:14:22 B-ras kernel: #2 0xffffffff80c3e9c4 at ng_destroy_hook+0x334 Jun 24 10:14:22 B-ras kernel: #3 0xffffffff80c4289b at ng_apply_item+0x3eb Jun 24 10:14:22 B-ras kernel: #4 0xffffffff80c48475 at ngthread+0x315 Jun 24 10:14:22 B-ras kernel: #5 0xffffffff80a88c55 at fork_exit+0x85 Jun 24 10:14:22 B-ras kernel: #6 0xffffffff80ec5f8e at fork_trampoline+0xe Jun 24 10:14:22 B-ras kernel: Accessing freed node: ID [20498d]: type 'tee', 0 hooks, flags 0x9, 0 refs, mpd1295-vlan340-44-lt: Jun 24 10:14:22 B-ras kernel: Last active @ /usr/src/sys/netgraph/ng_base.c, line 2335 Jun 24 10:14:22 B-ras kernel: problem discovered at file /usr/src/sys/netgraph/ng_base.c, line 804 Jun 24 10:14:22 B-ras kernel: KDB: stack backtrace: Jun 24 10:14:22 B-ras kernel: #0 0xffffffff80b1af87 at kdb_backtrace+0x67 Jun 24 10:14:22 B-ras kernel: #1 0xffffffff80c3ee46 at ng_unref_node+0x146 Jun 24 10:14:22 B-ras kernel: #2 0xffffffff80c42326 at ng_snd_item+0x396 Jun 24 10:14:22 B-ras kernel: #3 0xffffffff80c3e9c4 at ng_destroy_hook+0x334 Jun 24 10:14:22 B-ras kernel: #4 0xffffffff80c4289b at ng_apply_item+0x3eb Jun 24 10:14:22 B-ras kernel: #5 0xffffffff80c48475 at ngthread+0x315 Jun 24 10:14:22 B-ras kernel: #6 0xffffffff80a88c55 at fork_exit+0x85 Jun 24 10:14:22 B-ras kernel: #7 0xffffffff80ec5f8e at fork_trampoline+0xe Let me know if you need something else. Thanks.
(In reply to Cassiano Peixoto from comment #83) It seems you hit another problem in the libc/stdio. Konstantin produced another patch for this problem: https://reviews.freebsd.org/file/data/nthhi3ogesbuhnjclgmu/PHID-FILE-ahlgnygvinulibyezovs/D11246.diff ( https://reviews.freebsd.org/D11246 ) I'm very sorry not notifying you about this patch before. I ran my tests with this patch applied too and have no hangs. Please apply it similarly and additionally to the "syslog" patch and restart mpd with web server enabled.
(In reply to Eugene Grosbein from comment #84) Hi Eugene, Don't worry, no problem. I tried to apply Konstantin patch but i had some rejects: # cat lib/libc/stdio/fgets.c.rej @@ -53,17 +53,17 @@ fgets(char * __restrict buf, int n, FILE * __restrict fp) { size_t len; - char *s; + char *s, *ret; unsigned char *p, *t; - FLOCKFILE(fp); + FLOCKFILE_CANCELSAFE(fp); ORIENT(fp, -1); if (n <= 0) { /* sanity check */ fp->_flags |= __SERR; errno = EINVAL; - FUNLOCKFILE(fp); - return (NULL); + ret = NULL; + goto end; } s = buf; @@ -76,8 +76,8 @@ if (__srefill(fp)) { /* EOF/error: stop with partial or no line */ if (!__sfeof(fp) || s == buf) { - FUNLOCKFILE(fp); - return (NULL); + ret = NULL; + goto end; } break; } # cat lib/libc/stdio/fgetwln.c.rej @@ -45,13 +45,14 @@ wchar_t * fgetwln_l(FILE * __restrict fp, size_t *lenp, locale_t locale) { + wchar_t *ret; wint_t wc; size_t len; int savserr; FIX_LOCALE(locale); - FLOCKFILE(fp); + FLOCKFILE_CANCELSAFE(fp); ORIENT(fp, 1); savserr = fp->_flags & __SERR; # cat lib/libc/stdio/fgetws.c.rej @@ -46,14 +46,14 @@ fgetws_l(wchar_t * __restrict ws, int n, FILE * __restrict fp, locale_t locale) { int sret; - wchar_t *wsp; + wchar_t *wsp, *ret; size_t nconv; const char *src; unsigned char *nl; FIX_LOCALE(locale); struct xlocale_ctype *l = XLOCALE_CTYPE(locale); - FLOCKFILE(fp); + FLOCKFILE_CANCELSAFE(fp); ORIENT(fp, 1); if (n <= 0) { @@ -113,12 +113,14 @@ goto error; ok: *wsp = L'\0'; - FUNLOCKFILE(fp); + ret = ws; +end: + FUNLOCKFILE_CANCELSAFE(); return (ws); error: - FUNLOCKFILE(fp); - return (NULL); + ret = NULL; + goto end; } wchar_t * Should i update to latest FreeBSD-11 STABLE? I'm using 11.0-p0. Thanks
(In reply to Cassiano Peixoto from comment #85) Yes, that would be easiest thing. All these patches apply to 11-STABLE just fine.
(In reply to Cassiano Peixoto from comment #85) And there is no need to apply "syslog" patch to 11-STABLE as it has already been merged.
(In reply to Eugene Grosbein from comment #84) I very much doubt that the rtld lock leak (?) can be caused by stdio bugs. In other words, apply the debugging rtld patch I posted 06-16, and follow the accompanying instructions from there.
(In reply to Konstantin Belousov from comment #88) Hi Konstantin, rtld patch has been applied since them as i said on comment #67. Do you need some additional debug info from core file?
(In reply to Eugene Grosbein from comment #87) Eugene, Sorry my delay but i was updating to 11.1-BETA3 r320450M. On updated i've applied the following patches: - libc/stdio from https://reviews.freebsd.org/D11246 - updated to mpd5.8_1 with web server enabled - ipfw patch - libthr patch (rtld) - in_mcast patch - ip_input patch - ng_iface patch - stf patch Let me know if you need anything else. Thanks
(In reply to Cassiano Peixoto from comment #89) See comment #52 for instructions.
(In reply to Konstantin Belousov from comment #91) Did you read everything on my comment #83? Everything you asked on comment #52 is there. Or am i missed something?
(In reply to Cassiano Peixoto from comment #90) Make sure you have latest revision of https://reviews.freebsd.org/D11246 patch as Konstantin updated it several hours ago to include "fseeko" chunk and your last hang and traces show a reference to that "fseeko".
(In reply to Eugene Grosbein from comment #93) Ok i reverted the patch and applied the new one, but i got an error when compiling libc: cc -O2 -pipe -I/usr/src/lib/libc/include -I/usr/src/include -I/usr/src/lib/libc/amd64 -DNLS -D__DBINTERFACE_PRIVATE -I/usr/src/contrib/gdtoa -I/usr/src/contrib/libc-vis -DINET6 -I/usr/obj/usr/src/lib/libc -I/usr/src/lib/libc/resolv -D_ACL_PRIVATE -DPOSIX_MISTAKE -I/usr/src/lib/libmd -I/usr/src/contrib/jemalloc/include -I/usr/src/contrib/tzcode/stdtime -I/usr/src/lib/libc/stdtime -I/usr/src/lib/libc/locale -DBROKEN_DES -DPORTMAP -DDES_BUILTIN -I/usr/src/lib/libc/rpc -DWANT_HYPERV -DYP -DNS_CACHING -DSYMBOL_VERSIONING -MD -MF.depend.stdio.o -MTstdio.o -std=gnu99 -fstack-protector-strong -Wsystem-headers -Werror -Wall -Wno-format-y2k -Wno-uninitialized -Wno-pointer-sign -Wno-empty-body -Wno-string-plus-int -Wno-unused-const-variable -Wno-tautological-compare -Wno-unused-value -Wno-parentheses-equality -Wno-unused-function -Wno-enum-conversion -Wno-unused-local-typedef -Wno-address-of-packed-member -Wno-switch -Wno-switch-enum -Wno-knr-promoted-parameter -Qunused-arguments -I/usr/src/lib/libutil -I/usr/src/lib/msun/amd64 -I/usr/src/lib/msun/x86 -I/usr/src/lib/msun/src -c /usr/src/lib/libc/stdio/stdio.c -o stdio.o /usr/src/lib/libc/stdio/stdio.c:179:1: error: redefinition of '__stdio_cancel_cleanup' __stdio_cancel_cleanup(void * arg) ^ /usr/src/lib/libc/stdio/stdio.c:171:1: note: previous definition is here __stdio_cancel_cleanup(void * arg) ^ 1 error generated. *** Error code 1 Stop. make: stopped in /usr/src/lib/libc
(In reply to Cassiano Peixoto from comment #94) It seems your source tree is broken now. Try to clean is up and apply all patches again: cd /usr/src && svnlite cleanup --remove-unversioned && svnlite revert -R .
(In reply to Cassiano Peixoto from comment #94) Let it go, something messed when applied the patch. I re-synced the src and applied fine. So, patched applied and mpd5 recompiled. Let's watch. Thanks.
(In reply to Cassiano Peixoto from comment #92) Indeed, everything I asked for, is in the comment #83. Now I cannot make a sense from the reported line 4802 of rtld.c in the backtrace for the thread 100704. There is no lock call neither in stable/11 nor in HEAD, at this line. What exact version of the sources do you use, from which branch ?
(In reply to Konstantin Belousov from comment #97) That's from releng/11.0: https://svnweb.freebsd.org/base/releng/11.0/libexec/rtld-elf/rtld.c?annotate=304456#l4802
(In reply to Konstantin Belousov from comment #97) When it happend i was using 11.0-release-p0. But yesterday i updated to 11.1-beta3.
(In reply to Cassiano Peixoto from comment #99) Uh, ok. So lets wait for the issue to re-appear on latest stable. There were some fixes that might be relevant.
(In reply to Konstantin Belousov from comment #100) Hi guys, I had the issue today, but with a different behavior. It didn't show on CPU state as uwrlck anymore. The process was running as usual, but all customers connected stopped working. I tried to check both mpd5 console and web but didn't answer. I tried to killall -QUIT mpd5 process with no success, I try to kill -9 the pid process with no success as well. Looks like a zombie process. At last i tried to reboot the server, and didn't work. I had to reset the server to boot again. :(
mav@ is not maintainer of net/mpd5 anymore.
(In reply to Cassiano Peixoto from comment #101) In fact, a zombie process is not a process anymore. It does not prevent your from restarting new instance of mpd5. Did you look at ps output to check real state of mpd5 or tried to restart it? Also, a process may become a zombie only if it did not detach properly at start as normal daemons do (and mpd5 does it) AND its parent does not properly reap its children. init(8) does it just fine for normal daemons including mpd5. Do you run mpd5 in detached mode using -b flag? If so, I highly doubt it became a zombie. More likely, it could stuck in some uninterruptable system call but that's pretty unusual too. If you run mpd5 in "foreground" mode using another supervising daemon like init(8) itself, that's another story. And what did you have in the log of mpd5 corresponding that time? You showed no diagnostics: ps -l output, mpd logs, kernel dmesg logs, something? And system reboot cleans it all, so it seems you had some external problem just that took more time to disappear and second reboot just gave it that time.
Hi, It wasn't zombie, it looks like zombie because the kind of behavior. See process top output below: PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND 1295 root 30 73 0 327M 147M CPU3 3 627:10 101.55% mpd5 Pay attention to WCPU, it's too high. I never ran mpd5 in detached mode, but if you need it, i can. Regarding logs, there is nothing relevant in mpd5 logs, but in kernel logs i have: Jul 5 11:59:59 B-ras kernel: Accessing freed hook: name l4658, 0 refs, Last touched: Jul 5 11:59:59 B-ras kernel: Last active @ /usr/src/sys/netgraph/ng_base.c, line 1196 Jul 5 11:59:59 B-ras kernel: problem discovered at file /usr/src/sys/netgraph/ng_base.c, line 1131 Jul 5 11:59:59 B-ras kernel: KDB: stack backtrace: Jul 5 11:59:59 B-ras kernel: #0 0xffffffff80aa4ad7 at kdb_backtrace+0x67 Jul 5 11:59:59 B-ras kernel: #1 0xffffffff80bb2acc at ng_findhook+0xac Jul 5 11:59:59 B-ras kernel: #2 0xffffffff80bb96bc at ng_con_nodes+0x2c Jul 5 11:59:59 B-ras kernel: #3 0xffffffff80bb4e70 at ng_apply_item+0xa90 Jul 5 11:59:59 B-ras kernel: #4 0xffffffff80bb413b at ng_snd_item+0x1db Jul 5 11:59:59 B-ras kernel: #5 0xffffffff82451469 at ngc_send+0x209 Jul 5 11:59:59 B-ras kernel: #6 0xffffffff80aea8ec at sosend_generic+0x4ec Jul 5 11:59:59 B-ras kernel: #7 0xffffffff80af11c1 at kern_sendit+0x291 Jul 5 11:59:59 B-ras kernel: #8 0xffffffff80af1523 at sendit+0x1a3 Jul 5 11:59:59 B-ras kernel: #9 0xffffffff80af136d at sys_sendto+0x4d Jul 5 11:59:59 B-ras kernel: #10 0xffffffff80e36394 at amd64_syscall+0x6c4 Jul 5 12:00:00 B-ras kernel: #11 0xffffffff80e194eb at Xfast_syscall+0xfb Remember, i was using web status to collect info, i'll disable it, maybe it could be the problem, don't you think?
(In reply to Cassiano Peixoto from comment #104) I do not think web console has any connection to this new problem, keep it on. How do you start mpd5 - do you use its standard startup (rc.d) script and or not and what does "ps -ax | grep mpd" show? I mean its arguments. If whis problem repeats, use "ps -axH" to see all threads of mpd5 and their states and top -H for the same purpose. As for "Accessing freed hook": it seems my patch for ng_iface from the PR 220076 is not complete. It will take some time for me to enhance it and test before I update it there.
(In reply to Eugene Grosbein from comment #105) Ok, i kept web console on. I'm starting mpd5 using rc.d script: # ps -ax | grep mpd 1278 - Ss 25:03.78 /usr/local/sbin/mpd5 -p /var/run/mpd5.pid -b I'll follow your instructions as soon as it repeats. Please let me know when patch 220076 for PR has been updated. Thanks.
(In reply to Cassiano Peixoto from comment #106) I've just updated the patch for ng_iface(4). Apply it to kernel sources instead of its previous version, rebuild, install and boot new kernel.
(In reply to Cassiano Peixoto from comment #104) The logs show that the panic happened at "Jul 5 12:00:00" Is it possible that an action in your cron script could have triggered it?
(In reply to Donald Baud from comment #108) It does not really matter what triggered the panic as its reason is known: ng_iface lacks protection of its private data from modifications while it is being used. This is hopefully fixed with latest update of patch in the PR 220076.
(In reply to Eugene Grosbein from comment #109) Hi, Should i update with this last patch (PR 220076) and rebuild the kernel?
(In reply to Cassiano Peixoto from comment #110) Yes, just like I've asked in the comment 107.
(In reply to Eugene Grosbein from comment #111) Ok, just applied and rebooted. I'll keep you posted.
(In reply to Cassiano Peixoto from comment #112) Eugene, patch has been applied but i still can see the kernel error messages: Jul 10 01:22:30 B-ras kernel: KDB: stack backtrace: Jul 10 01:22:30 B-ras kernel: #0 0xffffffff80aa4ad7 at kdb_backtrace+0x67 Jul 10 01:22:30 B-ras kernel: #1 0xffffffff80bb2acc at ng_findhook+0xac Jul 10 01:22:30 B-ras kernel: #2 0xffffffff80bb96bc at ng_con_nodes+0x2c Jul 10 01:22:30 B-ras kernel: #3 0xffffffff80bb4e70 at ng_apply_item+0xa90 Jul 10 01:22:31 B-ras kernel: #4 0xffffffff80bb413b at ng_snd_item+0x1db Jul 10 01:22:31 B-ras kernel: #5 0xffffffff82451469 at ngc_send+0x209 Jul 10 01:22:31 B-ras kernel: #6 0xffffffff80aea8ec at sosend_generic+0x4ec Jul 10 01:22:31 B-ras kernel: #7 0xffffffff80af11c1 at kern_sendit+0x291 Jul 10 01:22:31 B-ras kernel: #8 0xffffffff80af1523 at sendit+0x1a3 Jul 10 01:22:31 B-ras kernel: #9 0xffffffff80af136d at sys_sendto+0x4d Jul 10 01:22:31 B-ras kernel: #10 0xffffffff80e36394 at amd64_syscall+0x6c4 Jul 10 01:22:31 B-ras kernel: #11 0xffffffff80e194eb at Xfast_syscall+0xfb Jul 10 01:22:31 B-ras kernel: Accessing freed hook: name right, 0 refs, Last touched: Jul 10 01:22:31 B-ras kernel: Last active @ /usr/src/sys/netgraph/ng_base.c, line 1228 Jul 10 01:22:31 B-ras kernel: problem discovered at file /usr/src/sys/netgraph/ng_base.c, line 1131 Since then the process didn't freeze yet, but i can see that messages anyway.
(In reply to Cassiano Peixoto from comment #113) Just to add more info about comment #112. I got some relevant messages in mpd5.log: Jul 10 01:22:32 B-ras mpd: Link: Packet from unexisting bundle "288" Jul 10 01:22:32 B-ras mpd: Link: Packet from unexisting bundle "116" Jul 10 01:22:32 B-ras mpd: Link: Packet from unexisting bundle "334" Jul 10 01:22:32 B-ras mpd: Link: Packet from unexisting bundle "525" Jul 10 01:22:32 B-ras mpd: Link: Packet from unexisting bundle "681" Jul 10 01:22:32 B-ras mpd: Link: Packet from unexisting bundle "271" Jul 10 01:22:32 B-ras mpd: Link: Packet from unexisting bundle "85" Jul 10 01:22:32 B-ras mpd: Link: Packet from unexisting bundle "785" Jul 10 01:22:32 B-ras mpd: Link: Packet from unexisting bundle "289" Jul 10 01:22:32 B-ras mpd: Link: Packet from unexisting bundle "386" Jul 10 01:22:32 B-ras mpd: Link: Packet from unexisting bundle "530" Jul 10 01:22:32 B-ras mpd: Link: Packet from unexisting bundle "381" Jul 10 01:22:32 B-ras mpd: Link: Packet from unexisting bundle "805" Jul 10 01:22:32 B-ras mpd: Link: Packet from unexisting bundle "152" Jul 10 01:22:32 B-ras mpd: Link: Packet from unexisting bundle "186" Jul 10 01:22:32 B-ras mpd: Link: Packet from unexisting bundle "566" Jul 10 01:22:32 B-ras mpd: Link: Packet from unexisting bundle "492" Jul 10 01:22:32 B-ras mpd: Link: Packet from unexisting bundle "685" Jul 10 01:22:32 B-ras mpd: Link: Packet from unexisting bundle "83" Jul 10 01:22:32 B-ras mpd: Link: Packet from unexisting bundle "937" Jul 10 01:22:32 B-ras mpd: Link: Packet from unexisting bundle "396" Jul 10 01:22:32 B-ras mpd: Link: Packet from unexisting bundle "499" Logs has been flooded with these messages.
(In reply to Cassiano Peixoto from comment #113) "Accessing freed hook: name right" is another case and has no connection with mpd5 process hanging. And that comes from ng_tee(4), not ng_iface(4). I'm not sure if this message indicate real problem: accessing freed hook should not happen but NETGRAPH has some protection measures against that and they seem to do their job just fine. You should ignore "Accessing freed hook: name right" warning for the time being unless some bad thing happens like kernel panic. These warning will go away when you return non-debugging kernel back (e.g. a kernel built without NETGRAPH_DEBUG option).
(In reply to Cassiano Peixoto from comment #114) I guess, this message is harmless and comes from the fact that mpd5 may destroy ngXXX interface and corresponding bundle before some last packets from an user get delivered by the kernel while it is being disconnected. What protocol do you use there (PPPoE, PPtP etc.)? Do you have "log +bund +bund2" in your mpd.conf? If not, you may use mpd console to run this command enabling bundle level debug logs. After that wait for new "Link: Packet from unexisting bundle" message and filter the log for mentioned bundle number and post results.
(In reply to Eugene Grosbein from comment #116) I'm using for PPPoE only. I'll enable the logs and will let you know. But anyway for now no freezes happened.
(In reply to Konstantin Belousov from comment #63) I see you already MFC'd "cancel-safe" patches to stable/11. Do you have any plans to MFC them to stable/10, so upcoming 10.4-RELEASE be fixed too?
(In reply to Eugene Grosbein from comment #118) I already mentioned that the r320472 change only matters for rare case of user-IO streams, created e.g. by funopen(3). As such, I do not think that this change matters for mpd5. I do not want to merge this stuff to stable/10 at the late stage of the branch lifecycle.
(In reply to Konstantin Belousov from comment #119) Well, comment #83 refers to some problem in "fseeko" that seems to be fixed by your "stdio cancel-safe" patch among other things as this patch eliminated hangs for Cassiano Peixoto. That's sad 10.4 will still be unfixed.
(In reply to Eugene Grosbein from comment #120) Comment about fseeko was that the function calls user method, when the operating stream is from funopen(3).
(In reply to Konstantin Belousov from comment #121) mpd5 uses libpdel that uses funopen() extensively. For example, it uses funopen() while loading HTTP headers from a stream (mpd5 web console feature).
(In reply to Eugene Grosbein from comment #116) Hi Eugene, The server has been running for 14 days without freezing. I think the issue has been fixed :) How would you like to proceed? Are you going to commit the others patches? Congratulations, after 2 years and 6 months, someone at last fixed it. Thank you! :)
The problem of mpd5 daemon hanging is now fixed with following changes: 1. libc/syslog "cancel-safe" fix merged to stable/11 and stable/10 (to appear in upcoming 10.4-RELEASE and 11.1-RELEASE). 2. Multiple libc/stdio "cancel-safe" fixes merged to stable/11 and stable/10 (to appear in upcoming 10.4-RELEASE and 11.1-RELEASE). 3. Mpd5 "cancel-safe" console management fix comitted to upstream code and present in the net/mpd5 port version mpd5-5.8_1 and newer. Other problems concerning general kernel stability issues will be carried with distinct PRs linked to this one. Big thanks to kib, dchagin, Cassiano Peixoto and others involved in reporting, analyzing, debugging and fixing this.
(In reply to Eugene Grosbein from comment #124) Hi Eugene, Just one question, how about libthread patch? Is it has been merged as well? Take a look on comment #53: https://bugs.freebsd.org/bugzilla/attachment.cgi?id=183536&action=edit
(In reply to Cassiano Peixoto from comment #125) It is not needed to fix things. It was produced to discover roots of the problem only.
(In reply to Eugene Grosbein from comment #124) Big thanks to all who stood strong and solved that! And thanks to the conclusion by Eugene, which I partly quote here: <quote> The problem of mpd5 daemon hanging is now fixed with following changes: 1. libc/syslog "cancel-safe" fix merged to stable/11 and stable/10 (to appear in upcoming 10.4-RELEASE and 11.1-RELEASE). 2. Multiple libc/stdio "cancel-safe" fixes merged to stable/11 and stable/10 (to appear in upcoming 10.4-RELEASE and 11.1-RELEASE). </quote> I guess 1) was r320472 I guess 2) were r320508 and r320509 These three were merged to stable/11 in r320942 at 2017-07-13 and stable/10 in r321074 at 017-07-17 So the fixes have NOT made it into 11.1-RELEASE! But 10.4 should have them, since I can't see releng/10.4 yet. Just for the records, in case one want's to know what to expect from 11.1. -harry
(In reply to Harald Schmalzbauer from comment #127) Additionally, there were specific merges to release branch releng/11.1 before release was tagged. So, 11.1-RELEASE does have those fixes.
(In reply to Eugene Grosbein from comment #128) Could you please tell us the revision? Hope I really missed something and you are right, but if I e.g. watch the log of lib/libc/stdio/fclose.c, it wasn't touched since it has been copied from base/stable/11@r320475 to releng/11. kib@ did a MFS (r320763), but either I got all completely wrong and the referenced commits in my last post are unrelated&wrong, or they aren't in 11.1 (which would explain why my local MFS doesn't fail). Btw. I locally MFS also your r320776 to 11.1.0 (r310888 MFC, usr.sbin/syslogd/syslogd.c: Retry to open an F_PIPE process when it dies unexpectedly). Please, let us document the 11.1-RELEASE status unambiguously here, I know how tedious it is to waste time chasing contradictory facts :-) I'm happy if I'm proven wrong, but finally here's the information people can rely on. -harry
(In reply to Harald Schmalzbauer from comment #129) 11.1 release branch releng/11.1 was created with r320486 from stable/11@r320475 https://svnweb.freebsd.org/base?view=revision&revision=320486 11.1-RELEASE was tagged with r321354, July 21: https://svnweb.freebsd.org/base?view=revision&revision=321354 Biggest problem was one with syslog() and libc/syslog "cancel-safe" fix was merged to stable/11 with r320311 before creation of releng/11.1: https://svnweb.freebsd.org/base?view=revision&revision=320311 - this made mpd5 pretty stable with an exception of very rare case that were dealt with stdio fixes later. There were several "cancel-safe" commits to head's libc/stdio that stem from https://reviews.freebsd.org/D11246 - r320472 (main one) plus some fixes to that fix: r320508 and r320509. They all were merged to stable/11 with r320942, July 13: https://svnweb.freebsd.org/base?view=revision&revision=320942 before creation of the release. I ran stable/11 that time and believed that future release would have it. But you are right, it hasn't. One should manually merge r320942 to the release and rebuild libc, or switch to stable/11, or just avoid very frequent access to mpd5 web console (disable it altogether to make sure).
(In reply to Harald Schmalzbauer from comment #129) And syslogd(8) problem has no connection with this mpd5 problem.