Bug 279742

Summary: math/pspp fills disk with errors, one line at a time, on -14 and -CURRENT
Product: Ports & Packages Reporter: dgilbert
Component: Individual Port(s)Assignee: Wen Heping <wen>
Status: New ---    
Severity: Affects Many People CC: 000.fbsd, kib, wen
Priority: --- Keywords: regression
Version: LatestFlags: linimon: maintainer-feedback? (wen)
Hardware: amd64   
OS: Any   
Attachments:
Description Flags
core.txt of crash.
none
patch to add broken for 14 and 15
none
Patch to the previous patch to say "broken for all" none

Description dgilbert 2024-06-15 02:23:53 UTC
Created attachment 251458 [details]
core.txt of crash.

I clicked on 14.0-STABLE because 14.1-RELEASE was not yet a choice.

I upgraded my poudriere box to 14.1, created a new jail for 14.1, and launched into a "-a" build pretty much immediately after returning from BSDCan.  The build machine is a Threadripper 1900X with 128G of RAM and 140TB of disk in RAID-Z2.  It has stably built poudriere almost constantly since I upgraded it to it's current state --- about 3 years or so.

After the first poudriere hang, I instrumented things like temperatures.  None of these spiked, but the hang happened again and again.  After awhile, it was clear that pspp compiling was the trigger.  Note that pspp would have compiled under 14.0 less than a week before (ie: just before BSDCan).

I had to get debugging in to my kernel and learn how to cause it to debug.  That took a couple tries --- all-the-while repeatedly crashing while pspp was building.  Top was up on the window I keep open ... and this was the last top on display.

last pid: 31372;  load averages: 21.72, 32.46, 41.6670                                                  up 0+04:34:59  20:36:48
220 processes: 12 running, 192 sleeping, 2 zombie, 14 waiting
CPU: 21.7% user,  0.0% nice, 40.4% system,  0.0% interrupt, 37.8% idle
Mem: 32M Active, 264K Inact, 124G Wired, 604M Free
ARC: 16G Total, 230M MFU, 334M MRU, 22M Anon, 15G Header, 191M Other
     107M Compressed, 460M Uncompressed, 4.28:1 Ratio
Swap: 256G Total, 98G Used, 158G Free, 38% Inuse, 2868K In, 3612K Out

  PID USERNAME    THR PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
61759 root          2 166  i10    60M  2616K vofflo   3  40:43  75.20% pspp-output
 6367 root          2 166  i10    88M  2604K vofflo  15  36:06  73.63% pspp-output
15409 root          2 166  i10    92M  2608K vofflo   2  33:53  72.64% pspp-output
81893 root          2 166  i10    86M  2600K CPU12   12  34:05  72.04% pspp-output
78622 root          2 166  i10    57M  2588K CPU11   11  28:42  69.19% pspp-output
25531 root          2 166  i10    95M  2616K CPU5     5  27:00  68.84% pspp-output
81789 root          2 166  i10    42M  2584K CPU6     6  23:16  65.11% pspp-output
87988 root          2 166  i10   102M  2596K CPU7     7  20:57  64.28% pspp-output
11364 root          2 166  i10    57M  2612K CPU10   10  19:50  64.14% pspp-output
23538 root          2 166  i10    66M  2604K CPU11   11  21:09  63.94% pspp-output
61379 root          2 166  i10    93M  2624K tmpfs    4  21:10  63.46% pspp-output
85836 root          2 166  i10    74M  2608K CPU14   14  19:19  62.69% pspp-output
58400 root          2 166  i10    76M  2440K RUN      5  13:26  56.27% pspp-output
58294 root          2 166  i10    72M  2444K CPU1     1  14:44  56.15% pspp-output
70050 root          2 166  i10    48M  2440K RUN      1  12:46  56.10% pspp-output
 2561 root          1  20    0   303M  1728K select  12   1:09   0.40% smbd
 2502 postgres      1  20    0   173M  1012K select   4   0:13   0.16% postgres
65067 root          1  20    0    17M  1452K CPU9     9   0:21   0.14% top
 2577 root          1  20    0    17M  1216K select   9   0:40   0.13% tmux
72517 root          6 166  i10  2310M   452K uwait    1   9:40   0.07% ghc-9.6.4
 8903 root         45 166  i10    34G  4716K uwait    0  12:30   0.06% java
 2503 postgres      1  20    0    31M   684K select   7   0:08   0.05% postgres
37351 root          1  20    0    22M   328K select   9   0:03   0.05% sshd
 2190 root          1  20    0    14M   172K select   6   0:00   0.03% syslogd
72294 root         11 166  i10   345M  1664K kqread   4   0:02   0.01% node
 2294 root          1  20    0   280M   228K select  11   0:00   0.01% httpd
 1192 root          1  20    0    18M   340K select   0   0:27   0.01% mountd
 1162 ntpd          1  20    0    23M   520K select  12   0:01   0.01% ntpd
95259 root          1  20    0    12M   328K ttyin    4   0:03   0.01% cu
 1749 uwsgi         1  20    0    57M   412K kqread  12   0:01   0.00% uwsgi-3.8
36420 root          1  20    0    19M   544K select   5   0:01   0.00% minicom
 1307 root          1  20    0   164M   460K kqread   8   0:00   0.00% php-fpm
 1253 root        128  68    0    12M  2316K rpcsvc  11   0:13   0.00% nfsd
91926 root          2 166  i10    74M  2908K pfault  15 123:07   0.00% pspp-output
72530 root         11 166  i10  7498M   836K pfault   5  99:32   0.00% node
46100 root         18 166  i10   261G   932K uwait    4  18:33   0.00% dotnet
73028 root          1 166  i10   165M  4096B WAIT    11   3:56   0.00% <pkg-static>
 2955 root          1 166  i10    15M  4096B wait    13   3:24   0.00% <sh>
93083 root          1 166  i10   195M  4096B WAIT    13   3:14   0.00% <pkg-static>
22537 root          6 166  i10   298M    17M uwait   10   1:22   0.00% ld.lld
 2588 root          1 166  i10    22M   224K select   6   1:02   0.00% sh
24257 root          1 166  i10   145M  4096B WAIT    12   0:42   0.00% <pkg-static>
 1301 www           1  20    0    27M  4096B WAIT     5   0:32   0.00% <nginx>
90301 root         14 166  i10   261G   260K uwait    4   0:24   0.00% dotnet

It's worth noting here that the virtual terminal switch (alt - F<n>) works after this happens, but no other input is recognized (can't hit return in a window and shells going through the machine to others don't continue their output).

When it happened this time, I dropped to KDB and dumped.  core.txt attached.

NOTE: this is repeatable.  I have been through the cycle 6 times so far.
Comment 1 Konstantin Belousov freebsd_committer freebsd_triage 2024-06-15 02:32:05 UTC
Something wired all memory, so your machine is swapping trying to fit several
huge (XX Gb) processes into ~4G.

Try to start looking e.g. at vmstat -z and vmstat -m output to possibly
identify some in-kernel memory hog.
Comment 2 dgilbert 2024-06-15 04:27:11 UTC
(In reply to Konstantin Belousov from comment #1)

URM... so is there a way to do that to the core, or do I need to re-trigger the event?
Comment 3 Konstantin Belousov freebsd_committer freebsd_triage 2024-06-15 04:45:42 UTC
(In reply to dgilbert from comment #2)
Look at the vmstat man page, the -M flag.
The vmstat -z was included into the core.txt, but vmstat -m failed.

In vmstat -z, there is outstanding number of vm objects (184079069), which
more or less aligns with the number of allocated tmpfs nodes.  This, together
with 64/32/16 malloc zones, seems to eat all machine memory:
(gdb) p (264 + 232 + 64 + 32 + 16)*184168869ull
$3 = 111974672352

I wonder if it is 'simply' tmpfs metadata consuming memory.  Try to limit tmpfs
appetite perhaps?
Comment 4 dgilbert 2024-06-16 05:55:27 UTC
Interesting.  So... I got vmstat -z, but you've seen that.  When I run...

[2:13:313]root@vr:/var/crash> vmstat -M vmcore.9 -m
vmstat: memstat_kvm_malloc:
            Type  Use Memory Req Size(s)

... the first line of output is stderr, the 2nd is stdout.  Obviously not giving up the goods.  Ideas?
Comment 5 dgilbert 2024-06-16 18:32:40 UTC
So.  This seems like it might be a different ticket.  At some point in the build, it starts trying to create files of the form:

doc/pspp-figures/chisquare.png-960475: error while writing to output stream
doc/pspp-figures/chisquare.png-960476: error while writing to output stream
doc/pspp-figures/chisquare.png-960477: error while writing to output stream

... and so on.  By the time of the crash, in the 13G log, it has output:

[2:14:314]root@vr:/usr/local/etc> grep ^doc </usr/local/poudriere/data/logs/bulk/freebsd_14_1_x64-HEAD-ALL/2024-06-14_16h03m42s/logs/pspp-2.0.1.log | wc
 182336216 1276353512 14216278858

... or tried to output 182 million files.

Can we move this ticket to ports, or should I open a new one?
Comment 6 dgilbert 2024-06-19 22:27:24 UTC
OK, Given:

1) that this very definitely affected me after an upgrade to 14.1; and

2) poudriere is a big pile of zfs; and

3) pspp seems to have been producing this nonsense for some time (but not affecting 14.0 or 13.3 builds);

I think this an example of the EN-24:10.zfs errata at work.

I'm recompiling a kernel now to test, but this seems highly likely to me.
Comment 7 dgilbert 2024-06-20 17:06:31 UTC
OK.  I _think_ the machine hanging is a matter of the ZFS bug _however_ there's still the problem of the unbounded disk filling service that is compiling pspp:

[1:9:309]dgilbert@vr:/usr/local/poudriere/data/logs/bulk/freebsd_14_1_x64-HEAD-ALL> ll -h */*/pspp*
-rw-r--r--  1 root wheel   13G Jun 13 00:54 2024-06-06_20h57m16s/logs/pspp-2.0.1.log
-rw-r--r--  1 root wheel   13G Jun 13 07:33 2024-06-13_01h09m40s/logs/pspp-2.0.1.log
-rw-r--r--  1 root wheel   13G Jun 13 17:26 2024-06-13_11h12m23s/logs/pspp-2.0.1.log
-rw-r--r--  1 root wheel   13G Jun 14 14:37 2024-06-13_22h10m26s/logs/pspp-2.0.1.log
-rw-r--r--  1 root wheel   13G Jun 14 20:37 2024-06-14_16h03m42s/logs/pspp-2.0.1.log
-rw-r--r--  1 root wheel  1.2G Jun 16 14:18 2024-06-16_14h13m54s/logs/pspp-2.0.1.log
-rw-r--r--  1 root wheel   67M Jun 16 14:35 2024-06-16_14h34m00s/logs/pspp-2.0.1.log
-rw-r--r--  1 root wheel   52M Jun 16 14:37 2024-06-16_14h36m05s/logs/pspp-2.0.1.log
-rw-r--r--  1 root wheel  326G Jun 20 01:11 2024-06-19_01h09m43s/logs/pspp-2.0.1.log
-rw-r--r--  3 root wheel  209G Jun 20 12:59 2024-06-20_01h25m51s/logs/pspp-2.0.1.log
-rw-r--r--  1 root wheel   52M Jun 16 14:37 latest-done/logs/pspp-2.0.1.log
-rw-r--r--  3 root wheel  209G Jun 20 12:59 latest/logs/pspp-2.0.1.log

... these logs only stopp around 300Gig because the computer crashed or I killed the build.  The logs are 100's of lines a second of the form:

doc/pspp-figures/<blah>.png-<number>: error while writing to output stream

as-in:

doc/pspp-figures/ctables34.png-31392991: error while writing to output stream
doc/pspp-figures/descriptives.png-57977716: error while writing to output stream
doc/pspp-figures/temporary.png-39555797: error while writing to output stream
doc/pspp-figures/descriptives.png-57977717: error while writing to output stream
doc/pspp-figures/independent-samples-t.png-26730070: error while writing to output stream
doc/pspp-figures/descriptives.png-57977718: error while writing to output stream
d
Comment 8 dgilbert 2024-06-20 17:54:31 UTC
sampled one of the log files.  20737125 of that form.
Comment 9 dgilbert 2024-07-06 17:01:06 UTC
Created attachment 251900 [details]
patch to add broken for 14 and 15
Comment 10 dgilbert 2024-07-06 17:04:05 UTC
I have since tested 13.3.  It compiles.  But on 14.0, 14.1 and 15, it fills disk.
Comment 11 commit-hook freebsd_committer freebsd_triage 2024-07-06 23:42:29 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/ports/commit/?id=291cba0140ece129cbb14f335676a862fa538452

commit 291cba0140ece129cbb14f335676a862fa538452
Author:     Wen Heping <wen@FreeBSD.org>
AuthorDate: 2024-07-06 23:40:11 +0000
Commit:     Wen Heping <wen@FreeBSD.org>
CommitDate: 2024-07-06 23:41:59 +0000

    math/pspp: Mark BROKEN on FreeBSD 14 and 15

    PR:             279742
    Reported by:    dgilbert@eicat.ca

 math/pspp/Makefile | 3 +++
 1 file changed, 3 insertions(+)
Comment 12 Miroslav Lachman 2024-07-24 09:35:11 UTC
(In reply to dgilbert from comment #10)
I don't know how your settings of 13.3 differs from mine, but I am not able to build pspp on freshly upgraded build machine to 13.3-p4. I was able to build pspp last time on 13.2 with poudriere-devel-3.4.99.20231211.

It took me a lot of time to figure it out the problem with completely frozen remote machine (without remote console) is caused by pspp build. It generates a millions of files on a ZFS filesystem in one directory. In my case, it was around 8 millions before I killed the build. 

All of them were like this one with ever incrementing numbered suffix.
math/pspp/work/pspp-2.0.1/doc/pspp-figures/chisquare.png-999995

I think that system hang is also caused by ZFS with too many files in one directory, not just out of swap. Out of swap space will usually kill some processes and system become responsive again (in my experience), but this time the system dost not recover even after the build is killed.

So please, can pspp be marked as BROKEN on all supported FreeBSD versions? Or are there any way to fix the build? I don't know why pspp creates these files. I don't use pspp personally, I just need to build it and install it to another machine for webdevelopers.
Comment 13 dgilbert 2024-09-23 16:04:30 UTC
(In reply to Miroslav Lachman from comment #12)

I'm also seeing it broken on 13.4, so I think that is very shortly "broken on all supported"
Comment 14 dgilbert 2024-09-23 16:19:54 UTC
Created attachment 253783 [details]
Patch to the previous patch to say "broken for all"

I've added a patch (to the previous patch) that changes it to "all"