After 256 of uptime, Postgres processes using tmpfs are blocked unkillably in tmpfs: # uname -srv FreeBSD 12.0-STABLE FreeBSD 12.0-STABLE #3 r345628M: Thu Mar 28 18:04:12 EDT 2019 root@straasha.imrryr.org:/usr/obj/usr/src/amd64.amd64/sys/GENERIC # uptime 12:31AM up 256 days, 7:20, 13 users, load averages: 0.26, 0.27, 0.25 # ps -o pid,etime,time,wchan,args -p $(pgrep -U postgres) PID ELAPSED TIME WCHAN COMMAND 93891 71-11:25:07 0:33.60 tmpfs postgres: autovacuum launcher (postgres) 93892 71-11:25:07 2:46.85 tmpfs postgres: stats collector (postgres) Perhaps a red herring, but lsof shows VBAD (revoked) file descriptors 1 and 2: COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME postgres 93891 postgres cwd VDIR 4173671843,1163386591 28 8 /var/danepg/postgres postgres 93891 postgres rtd VDIR 1895299314,75292256 26 4 / postgres 93891 postgres txt VREG 1895299314,75292256 7240832 231575 /usr/local/bin/postgres postgres 93891 postgres txt VREG 1895299314,75292256 148320 3166517 /libexec/ld-elf.so.1 postgres 93891 postgres txt VREG 1895299314,75292256 131824 3166628 /lib/libthr.so.3 postgres 93891 postgres txt VREG 1895299314,75292256 59976 1743178 /usr/local/lib/libintl.so.8.1.6 postgres 93891 postgres txt VREG 1895299314,75292256 604936 3166936 /usr/lib/libssl.so.111 postgres 93891 postgres txt VREG 1895299314,75292256 3055864 3166565 /lib/libcrypto.so.111 postgres 93891 postgres txt VREG 1895299314,75292256 217024 3166599 /lib/libm.so.5 postgres 93891 postgres txt VREG 1895299314,75292256 3290256 2820939 /usr/local/lib/libicui18n.so.64.2 postgres 93891 postgres txt VREG 1895299314,75292256 2016680 2820955 /usr/local/lib/libicuuc.so.64.2 postgres 93891 postgres txt VREG 1895299314,75292256 1949672 3166554 /lib/libc.so.7 postgres 93891 postgres txt VREG 1895299314,75292256 5080 2820934 /usr/local/lib/libicudata.so.64.2 postgres 93891 postgres txt VREG 1895299314,75292256 837240 3166742 /usr/lib/libc++.so.1 postgres 93891 postgres txt VREG 1895299314,75292256 117080 3166570 /lib/libcxxrt.so.1 postgres 93891 postgres txt VREG 1895299314,75292256 94264 3166580 /lib/libgcc_s.so.1 postgres 93891 postgres 0r VCHR 0,19 0t0 19 /dev/null postgres 93891 postgres 1u VBAD (revoked) postgres 93891 postgres 2u VBAD (revoked) postgres 93891 postgres 3u VREG 4173671843,1163386591 8192 92 /var/danepg/postgres/global/1262 postgres 93891 postgres 4u unix 0xfffff80e1796a368 0t0 ->0xfffff8002ddc4a38 postgres 93891 postgres 7u PIPE 0xfffff80492e152f8 16384 ->0xfffff80492e15460 postgres 93891 postgres 9u IPv6 0xfffff804ccc604a0 0t0 UDP [::1]:54178->[::1]:54178 postgres 93891 postgres 10u PIPE 0xfffff80048b7e000 16384 ->0xfffff80048b7e168 postgres 93891 postgres 11u PIPE 0xfffff80048b7e168 0 ->0xfffff80048b7e000 But more importantly any attempt to stat a tmpfs directory: /var/danepg/tmp/postgres/stats also locks up unkillably: COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME ls 44300 root cwd VDIR 4294967295,2264989444 128 3 /var/danepg/tmp/postgres ls 44300 root rtd VDIR 1895299314,75292256 26 4 / ls 44300 root txt VREG 1895299314,75292256 40272 3167122 /bin/ls ls 44300 root txt VREG 1895299314,75292256 148320 3166517 /libexec/ld-elf.so.1 ls 44300 root txt VREG 1895299314,75292256 87864 3166640 /lib/libutil.so.9 ls 44300 root txt VREG 1895299314,75292256 392608 3166610 /lib/libncursesw.so.8 ls 44300 root txt VREG 1895299314,75292256 1949672 3166554 /lib/libc.so.7 ls 44300 root 0u VCHR 0,177 0t314 177 /dev/pts/19 ls 44300 root 1u VCHR 0,177 0t314 177 /dev/pts/19 ls 44300 root 2u VCHR 0,177 0t314 177 /dev/pts/19 ls 44300 root 3r VDIR 4294967295,2264989444 128 3 /var/danepg/tmp/postgres ls 44300 root 4r VDIR 4294967295,2264989444 128 3 /var/danepg/tmp/postgres ls 44300 root 5r VDIR 4294967295,2264989444 128 3 /var/danepg/tmp/postgres # ps -o pid,wchan,args -p 44300 PID WCHAN COMMAND 44300 tmpfs /bin/ls -F I'll probably have to reboot before this gets resolved... Will try to trigger a kernel panic and savecore, with a bit of luck.
I have a kernel core. Don't know what to look for.
Any advice on what data to extract from the savecore dump would be appreciated. Probably otherwise hard to reproduce this, as the system had been running fine for 256 days before tmpfs locked up. With the system back up I see the locked up file was a directory: # ls -ld /var/danepg/tmp/postgres/stats drwx------ 2 postgres postgres 256 Dec 10 12:35 /var/danepg/tmp/postgres/stats # ls -lR /var/danepg/tmp/postgres/stats total 72 -rw------- 1 postgres postgres 2202 Dec 10 12:35 db_0.stat -rw------- 1 postgres postgres 2033 Dec 10 12:34 db_13453.stat -rw------- 1 postgres postgres 20961 Dec 10 12:35 db_16385.stat -rw------- 1 postgres postgres 792 Dec 10 12:35 global.stat Is it likely that this is related to: https://reviews.freebsd.org/rS353065 ? Can the crash dump confirm or deny? This directory is part of a Postgres "tablespace" for temporary tables...
(In reply to Viktor Dukhovni from comment #2) Sorry for the delayed reply. To start it would be useful to see backtraces from the stuck threads. If you still have the kernel and vmcore available, the output of "thread apply all bt" would be a good start.
I'm sorry that this didn't get any attention when it was reported. If you still encounter this issue, "procstat -kka" output would be useful, or backtraces from a vmcore as noted in comment 3. Note that FreeBSD 12.2 is out now, please try updating if you're still on 12.0.