Summary: | math/pspp fills disk with errors, one line at a time, on -14 and -CURRENT | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Product: | Ports & Packages | Reporter: | dgilbert | ||||||||
Component: | Individual Port(s) | Assignee: | Wen Heping <wen> | ||||||||
Status: | New --- | ||||||||||
Severity: | Affects Many People | CC: | 000.fbsd, kib, wen | ||||||||
Priority: | --- | Keywords: | regression | ||||||||
Version: | Latest | Flags: | linimon:
maintainer-feedback?
(wen) |
||||||||
Hardware: | amd64 | ||||||||||
OS: | Any | ||||||||||
Attachments: |
|
Something wired all memory, so your machine is swapping trying to fit several huge (XX Gb) processes into ~4G. Try to start looking e.g. at vmstat -z and vmstat -m output to possibly identify some in-kernel memory hog. (In reply to Konstantin Belousov from comment #1) URM... so is there a way to do that to the core, or do I need to re-trigger the event? (In reply to dgilbert from comment #2) Look at the vmstat man page, the -M flag. The vmstat -z was included into the core.txt, but vmstat -m failed. In vmstat -z, there is outstanding number of vm objects (184079069), which more or less aligns with the number of allocated tmpfs nodes. This, together with 64/32/16 malloc zones, seems to eat all machine memory: (gdb) p (264 + 232 + 64 + 32 + 16)*184168869ull $3 = 111974672352 I wonder if it is 'simply' tmpfs metadata consuming memory. Try to limit tmpfs appetite perhaps? Interesting. So... I got vmstat -z, but you've seen that. When I run... [2:13:313]root@vr:/var/crash> vmstat -M vmcore.9 -m vmstat: memstat_kvm_malloc: Type Use Memory Req Size(s) ... the first line of output is stderr, the 2nd is stdout. Obviously not giving up the goods. Ideas? So. This seems like it might be a different ticket. At some point in the build, it starts trying to create files of the form: doc/pspp-figures/chisquare.png-960475: error while writing to output stream doc/pspp-figures/chisquare.png-960476: error while writing to output stream doc/pspp-figures/chisquare.png-960477: error while writing to output stream ... and so on. By the time of the crash, in the 13G log, it has output: [2:14:314]root@vr:/usr/local/etc> grep ^doc </usr/local/poudriere/data/logs/bulk/freebsd_14_1_x64-HEAD-ALL/2024-06-14_16h03m42s/logs/pspp-2.0.1.log | wc 182336216 1276353512 14216278858 ... or tried to output 182 million files. Can we move this ticket to ports, or should I open a new one? OK, Given: 1) that this very definitely affected me after an upgrade to 14.1; and 2) poudriere is a big pile of zfs; and 3) pspp seems to have been producing this nonsense for some time (but not affecting 14.0 or 13.3 builds); I think this an example of the EN-24:10.zfs errata at work. I'm recompiling a kernel now to test, but this seems highly likely to me. OK. I _think_ the machine hanging is a matter of the ZFS bug _however_ there's still the problem of the unbounded disk filling service that is compiling pspp: [1:9:309]dgilbert@vr:/usr/local/poudriere/data/logs/bulk/freebsd_14_1_x64-HEAD-ALL> ll -h */*/pspp* -rw-r--r-- 1 root wheel 13G Jun 13 00:54 2024-06-06_20h57m16s/logs/pspp-2.0.1.log -rw-r--r-- 1 root wheel 13G Jun 13 07:33 2024-06-13_01h09m40s/logs/pspp-2.0.1.log -rw-r--r-- 1 root wheel 13G Jun 13 17:26 2024-06-13_11h12m23s/logs/pspp-2.0.1.log -rw-r--r-- 1 root wheel 13G Jun 14 14:37 2024-06-13_22h10m26s/logs/pspp-2.0.1.log -rw-r--r-- 1 root wheel 13G Jun 14 20:37 2024-06-14_16h03m42s/logs/pspp-2.0.1.log -rw-r--r-- 1 root wheel 1.2G Jun 16 14:18 2024-06-16_14h13m54s/logs/pspp-2.0.1.log -rw-r--r-- 1 root wheel 67M Jun 16 14:35 2024-06-16_14h34m00s/logs/pspp-2.0.1.log -rw-r--r-- 1 root wheel 52M Jun 16 14:37 2024-06-16_14h36m05s/logs/pspp-2.0.1.log -rw-r--r-- 1 root wheel 326G Jun 20 01:11 2024-06-19_01h09m43s/logs/pspp-2.0.1.log -rw-r--r-- 3 root wheel 209G Jun 20 12:59 2024-06-20_01h25m51s/logs/pspp-2.0.1.log -rw-r--r-- 1 root wheel 52M Jun 16 14:37 latest-done/logs/pspp-2.0.1.log -rw-r--r-- 3 root wheel 209G Jun 20 12:59 latest/logs/pspp-2.0.1.log ... these logs only stopp around 300Gig because the computer crashed or I killed the build. The logs are 100's of lines a second of the form: doc/pspp-figures/<blah>.png-<number>: error while writing to output stream as-in: doc/pspp-figures/ctables34.png-31392991: error while writing to output stream doc/pspp-figures/descriptives.png-57977716: error while writing to output stream doc/pspp-figures/temporary.png-39555797: error while writing to output stream doc/pspp-figures/descriptives.png-57977717: error while writing to output stream doc/pspp-figures/independent-samples-t.png-26730070: error while writing to output stream doc/pspp-figures/descriptives.png-57977718: error while writing to output stream d sampled one of the log files. 20737125 of that form. Created attachment 251900 [details]
patch to add broken for 14 and 15
I have since tested 13.3. It compiles. But on 14.0, 14.1 and 15, it fills disk. A commit in branch main references this bug: URL: https://cgit.FreeBSD.org/ports/commit/?id=291cba0140ece129cbb14f335676a862fa538452 commit 291cba0140ece129cbb14f335676a862fa538452 Author: Wen Heping <wen@FreeBSD.org> AuthorDate: 2024-07-06 23:40:11 +0000 Commit: Wen Heping <wen@FreeBSD.org> CommitDate: 2024-07-06 23:41:59 +0000 math/pspp: Mark BROKEN on FreeBSD 14 and 15 PR: 279742 Reported by: dgilbert@eicat.ca math/pspp/Makefile | 3 +++ 1 file changed, 3 insertions(+) (In reply to dgilbert from comment #10) I don't know how your settings of 13.3 differs from mine, but I am not able to build pspp on freshly upgraded build machine to 13.3-p4. I was able to build pspp last time on 13.2 with poudriere-devel-3.4.99.20231211. It took me a lot of time to figure it out the problem with completely frozen remote machine (without remote console) is caused by pspp build. It generates a millions of files on a ZFS filesystem in one directory. In my case, it was around 8 millions before I killed the build. All of them were like this one with ever incrementing numbered suffix. math/pspp/work/pspp-2.0.1/doc/pspp-figures/chisquare.png-999995 I think that system hang is also caused by ZFS with too many files in one directory, not just out of swap. Out of swap space will usually kill some processes and system become responsive again (in my experience), but this time the system dost not recover even after the build is killed. So please, can pspp be marked as BROKEN on all supported FreeBSD versions? Or are there any way to fix the build? I don't know why pspp creates these files. I don't use pspp personally, I just need to build it and install it to another machine for webdevelopers. (In reply to Miroslav Lachman from comment #12) I'm also seeing it broken on 13.4, so I think that is very shortly "broken on all supported" Created attachment 253783 [details]
Patch to the previous patch to say "broken for all"
I've added a patch (to the previous patch) that changes it to "all"
|
Created attachment 251458 [details] core.txt of crash. I clicked on 14.0-STABLE because 14.1-RELEASE was not yet a choice. I upgraded my poudriere box to 14.1, created a new jail for 14.1, and launched into a "-a" build pretty much immediately after returning from BSDCan. The build machine is a Threadripper 1900X with 128G of RAM and 140TB of disk in RAID-Z2. It has stably built poudriere almost constantly since I upgraded it to it's current state --- about 3 years or so. After the first poudriere hang, I instrumented things like temperatures. None of these spiked, but the hang happened again and again. After awhile, it was clear that pspp compiling was the trigger. Note that pspp would have compiled under 14.0 less than a week before (ie: just before BSDCan). I had to get debugging in to my kernel and learn how to cause it to debug. That took a couple tries --- all-the-while repeatedly crashing while pspp was building. Top was up on the window I keep open ... and this was the last top on display. last pid: 31372; load averages: 21.72, 32.46, 41.6670 up 0+04:34:59 20:36:48 220 processes: 12 running, 192 sleeping, 2 zombie, 14 waiting CPU: 21.7% user, 0.0% nice, 40.4% system, 0.0% interrupt, 37.8% idle Mem: 32M Active, 264K Inact, 124G Wired, 604M Free ARC: 16G Total, 230M MFU, 334M MRU, 22M Anon, 15G Header, 191M Other 107M Compressed, 460M Uncompressed, 4.28:1 Ratio Swap: 256G Total, 98G Used, 158G Free, 38% Inuse, 2868K In, 3612K Out PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND 61759 root 2 166 i10 60M 2616K vofflo 3 40:43 75.20% pspp-output 6367 root 2 166 i10 88M 2604K vofflo 15 36:06 73.63% pspp-output 15409 root 2 166 i10 92M 2608K vofflo 2 33:53 72.64% pspp-output 81893 root 2 166 i10 86M 2600K CPU12 12 34:05 72.04% pspp-output 78622 root 2 166 i10 57M 2588K CPU11 11 28:42 69.19% pspp-output 25531 root 2 166 i10 95M 2616K CPU5 5 27:00 68.84% pspp-output 81789 root 2 166 i10 42M 2584K CPU6 6 23:16 65.11% pspp-output 87988 root 2 166 i10 102M 2596K CPU7 7 20:57 64.28% pspp-output 11364 root 2 166 i10 57M 2612K CPU10 10 19:50 64.14% pspp-output 23538 root 2 166 i10 66M 2604K CPU11 11 21:09 63.94% pspp-output 61379 root 2 166 i10 93M 2624K tmpfs 4 21:10 63.46% pspp-output 85836 root 2 166 i10 74M 2608K CPU14 14 19:19 62.69% pspp-output 58400 root 2 166 i10 76M 2440K RUN 5 13:26 56.27% pspp-output 58294 root 2 166 i10 72M 2444K CPU1 1 14:44 56.15% pspp-output 70050 root 2 166 i10 48M 2440K RUN 1 12:46 56.10% pspp-output 2561 root 1 20 0 303M 1728K select 12 1:09 0.40% smbd 2502 postgres 1 20 0 173M 1012K select 4 0:13 0.16% postgres 65067 root 1 20 0 17M 1452K CPU9 9 0:21 0.14% top 2577 root 1 20 0 17M 1216K select 9 0:40 0.13% tmux 72517 root 6 166 i10 2310M 452K uwait 1 9:40 0.07% ghc-9.6.4 8903 root 45 166 i10 34G 4716K uwait 0 12:30 0.06% java 2503 postgres 1 20 0 31M 684K select 7 0:08 0.05% postgres 37351 root 1 20 0 22M 328K select 9 0:03 0.05% sshd 2190 root 1 20 0 14M 172K select 6 0:00 0.03% syslogd 72294 root 11 166 i10 345M 1664K kqread 4 0:02 0.01% node 2294 root 1 20 0 280M 228K select 11 0:00 0.01% httpd 1192 root 1 20 0 18M 340K select 0 0:27 0.01% mountd 1162 ntpd 1 20 0 23M 520K select 12 0:01 0.01% ntpd 95259 root 1 20 0 12M 328K ttyin 4 0:03 0.01% cu 1749 uwsgi 1 20 0 57M 412K kqread 12 0:01 0.00% uwsgi-3.8 36420 root 1 20 0 19M 544K select 5 0:01 0.00% minicom 1307 root 1 20 0 164M 460K kqread 8 0:00 0.00% php-fpm 1253 root 128 68 0 12M 2316K rpcsvc 11 0:13 0.00% nfsd 91926 root 2 166 i10 74M 2908K pfault 15 123:07 0.00% pspp-output 72530 root 11 166 i10 7498M 836K pfault 5 99:32 0.00% node 46100 root 18 166 i10 261G 932K uwait 4 18:33 0.00% dotnet 73028 root 1 166 i10 165M 4096B WAIT 11 3:56 0.00% <pkg-static> 2955 root 1 166 i10 15M 4096B wait 13 3:24 0.00% <sh> 93083 root 1 166 i10 195M 4096B WAIT 13 3:14 0.00% <pkg-static> 22537 root 6 166 i10 298M 17M uwait 10 1:22 0.00% ld.lld 2588 root 1 166 i10 22M 224K select 6 1:02 0.00% sh 24257 root 1 166 i10 145M 4096B WAIT 12 0:42 0.00% <pkg-static> 1301 www 1 20 0 27M 4096B WAIT 5 0:32 0.00% <nginx> 90301 root 14 166 i10 261G 260K uwait 4 0:24 0.00% dotnet It's worth noting here that the virtual terminal switch (alt - F<n>) works after this happens, but no other input is recognized (can't hit return in a window and shells going through the machine to others don't continue their output). When it happened this time, I dropped to KDB and dumped. core.txt attached. NOTE: this is repeatable. I have been through the cycle 6 times so far.