Bug 242537

Summary: 12.0 Release: tmpfs (used by Postgres 12.0) unkillable hang
Product: Base System
Reporter: Viktor Dukhovni <ietf-dane>
Component: kern
Assignee: freebsd-fs (Nobody) <fs>
Status: Closed Overcome By Events
Severity: Affects Only Me
CC: chris, markj
Priority: ---
Version: 12.0-RELEASE
Hardware: amd64
OS: Any

Description Viktor Dukhovni 2019-12-10 06:01:59 UTC
After 256 days of uptime, Postgres processes using tmpfs are blocked unkillably in tmpfs:

# uname -srv
FreeBSD 12.0-STABLE FreeBSD 12.0-STABLE #3 r345628M: Thu Mar 28 18:04:12 EDT 2019     root@straasha.imrryr.org:/usr/obj/usr/src/amd64.amd64/sys/GENERIC

# uptime
12:31AM  up 256 days,  7:20, 13 users, load averages: 0.26, 0.27, 0.25

# ps -o pid,etime,time,wchan,args -p $(pgrep -U postgres)
  PID      ELAPSED    TIME WCHAN COMMAND
93891  71-11:25:07 0:33.60 tmpfs postgres: autovacuum launcher    (postgres)
93892  71-11:25:07 2:46.85 tmpfs postgres: stats collector    (postgres)

Perhaps a red herring, but lsof shows VBAD (revoked) file descriptors 1 and 2:

COMMAND    PID     USER   FD   TYPE                DEVICE SIZE/OFF    NODE NAME
postgres 93891 postgres  cwd   VDIR 4173671843,1163386591       28       8 /var/danepg/postgres
postgres 93891 postgres  rtd   VDIR   1895299314,75292256       26       4 /
postgres 93891 postgres  txt   VREG   1895299314,75292256  7240832  231575 /usr/local/bin/postgres
postgres 93891 postgres  txt   VREG   1895299314,75292256   148320 3166517 /libexec/ld-elf.so.1
postgres 93891 postgres  txt   VREG   1895299314,75292256   131824 3166628 /lib/libthr.so.3
postgres 93891 postgres  txt   VREG   1895299314,75292256    59976 1743178 /usr/local/lib/libintl.so.8.1.6
postgres 93891 postgres  txt   VREG   1895299314,75292256   604936 3166936 /usr/lib/libssl.so.111
postgres 93891 postgres  txt   VREG   1895299314,75292256  3055864 3166565 /lib/libcrypto.so.111
postgres 93891 postgres  txt   VREG   1895299314,75292256   217024 3166599 /lib/libm.so.5
postgres 93891 postgres  txt   VREG   1895299314,75292256  3290256 2820939 /usr/local/lib/libicui18n.so.64.2
postgres 93891 postgres  txt   VREG   1895299314,75292256  2016680 2820955 /usr/local/lib/libicuuc.so.64.2
postgres 93891 postgres  txt   VREG   1895299314,75292256  1949672 3166554 /lib/libc.so.7
postgres 93891 postgres  txt   VREG   1895299314,75292256     5080 2820934 /usr/local/lib/libicudata.so.64.2
postgres 93891 postgres  txt   VREG   1895299314,75292256   837240 3166742 /usr/lib/libc++.so.1
postgres 93891 postgres  txt   VREG   1895299314,75292256   117080 3166570 /lib/libcxxrt.so.1
postgres 93891 postgres  txt   VREG   1895299314,75292256    94264 3166580 /lib/libgcc_s.so.1
postgres 93891 postgres    0r  VCHR                  0,19      0t0      19 /dev/null
postgres 93891 postgres    1u  VBAD                                        (revoked)
postgres 93891 postgres    2u  VBAD                                        (revoked)
postgres 93891 postgres    3u  VREG 4173671843,1163386591     8192      92 /var/danepg/postgres/global/1262
postgres 93891 postgres    4u  unix    0xfffff80e1796a368      0t0         ->0xfffff8002ddc4a38
postgres 93891 postgres    7u  PIPE    0xfffff80492e152f8    16384         ->0xfffff80492e15460
postgres 93891 postgres    9u  IPv6    0xfffff804ccc604a0      0t0     UDP [::1]:54178->[::1]:54178
postgres 93891 postgres   10u  PIPE    0xfffff80048b7e000    16384         ->0xfffff80048b7e168
postgres 93891 postgres   11u  PIPE    0xfffff80048b7e168        0         ->0xfffff80048b7e000
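
(For reference, the per-process listing above would typically come from sysutils/lsof against the stuck PID; the exact flags used were not recorded, so this invocation is an assumption:)

# lsof -p 93891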

But, more importantly, any attempt to stat the tmpfs directory /var/danepg/tmp/postgres/stats also
locks up unkillably:

COMMAND   PID USER   FD   TYPE                DEVICE SIZE/OFF    NODE NAME
ls      44300 root  cwd   VDIR 4294967295,2264989444      128       3 /var/danepg/tmp/postgres
ls      44300 root  rtd   VDIR   1895299314,75292256       26       4 /
ls      44300 root  txt   VREG   1895299314,75292256    40272 3167122 /bin/ls
ls      44300 root  txt   VREG   1895299314,75292256   148320 3166517 /libexec/ld-elf.so.1
ls      44300 root  txt   VREG   1895299314,75292256    87864 3166640 /lib/libutil.so.9
ls      44300 root  txt   VREG   1895299314,75292256   392608 3166610 /lib/libncursesw.so.8
ls      44300 root  txt   VREG   1895299314,75292256  1949672 3166554 /lib/libc.so.7
ls      44300 root    0u  VCHR                 0,177    0t314     177 /dev/pts/19
ls      44300 root    1u  VCHR                 0,177    0t314     177 /dev/pts/19
ls      44300 root    2u  VCHR                 0,177    0t314     177 /dev/pts/19
ls      44300 root    3r  VDIR 4294967295,2264989444      128       3 /var/danepg/tmp/postgres
ls      44300 root    4r  VDIR 4294967295,2264989444      128       3 /var/danepg/tmp/postgres
ls      44300 root    5r  VDIR 4294967295,2264989444      128       3 /var/danepg/tmp/postgres

# ps -o pid,wchan,args -p 44300
  PID WCHAN COMMAND
44300 tmpfs /bin/ls -F
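
(Not captured at the time, but the in-kernel stack of a thread wedged like this can normally be grabbed with procstat before rebooting, e.g. against the stuck ls above:)

# procstat -kk 44300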

I'll probably have to reboot before this gets resolved...
Will try to trigger a kernel panic and savecore, with a bit of luck.
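
(One way to do that, assuming the GENERIC kernel's KDB option and a configured dump device, is to force a panic from a root shell; savecore(8) then writes the dump under /var/crash on the next boot:)

# dumpon -l                  # confirm a dump device is configured (e.g. dumpdev="AUTO" in rc.conf)
# sysctl debug.kdb.panic=1   # force an immediate panic; savecore picks up the dump at next boot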
Comment 1 Viktor Dukhovni 2019-12-10 07:03:44 UTC
I have a kernel core. Don't know what to look for.
Comment 2 Viktor Dukhovni 2019-12-10 17:49:23 UTC
Any advice on what data to extract from the savecore dump would be appreciated. This is probably otherwise hard to reproduce, as the system had been running fine for 256 days before tmpfs locked up.

With the system back up I see the locked up file was a directory:

# ls -ld /var/danepg/tmp/postgres/stats
drwx------  2 postgres  postgres  256 Dec 10 12:35 /var/danepg/tmp/postgres/stats

# ls -lR /var/danepg/tmp/postgres/stats
total 72
-rw-------  1 postgres  postgres   2202 Dec 10 12:35 db_0.stat
-rw-------  1 postgres  postgres   2033 Dec 10 12:34 db_13453.stat
-rw-------  1 postgres  postgres  20961 Dec 10 12:35 db_16385.stat
-rw-------  1 postgres  postgres    792 Dec 10 12:35 global.stat

Is it likely that this is related to: https://reviews.freebsd.org/rS353065 ?
Can the crash dump confirm or deny?

This directory is part of a Postgres "tablespace" for temporary tables...
Comment 3 Mark Johnston freebsd_committer freebsd_triage 2020-06-09 23:49:44 UTC
(In reply to Viktor Dukhovni from comment #2)
Sorry for the delayed reply.  To start, it would be useful to see backtraces from the stuck threads.  If you still have the kernel and vmcore available, the output of "thread apply all bt" would be a good start.
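
(For completeness, the usual way to run that against a saved dump is kgdb from the devel/gdb port; the kernel and vmcore paths below are the standard locations, and the dump index is an assumption:)

# kgdb /boot/kernel/kernel /var/crash/vmcore.0
(kgdb) thread apply all bt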
Comment 4 Mark Johnston freebsd_committer freebsd_triage 2020-11-24 22:09:39 UTC
I'm sorry that this didn't get any attention when it was reported.  If you still encounter this issue, "procstat -kka" output would be useful, or backtraces from a vmcore as noted in comment 3.  Note that FreeBSD 12.2 is out now; please try updating if you're still on 12.0.
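
(If the hang shows up again, something along these lines should capture the relevant kernel stacks for the PR; the grep pattern is only a guess at which threads matter:)

# procstat -kka > procstat-kka.txt            # kernel stacks of every thread on the system
# grep -E 'postgres|tmpfs' procstat-kka.txt   # quick look at the threads of interest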