Summary:     14-RELEASE/14-STABLE crash with heavy disk I/O on AMD ASUS X670E motherboard; Intel i225 (igc) NIC non-functioning
Product:     Base System
Component:   kern
Version:     14.0-STABLE
Hardware:    amd64
OS:          Any
Status:      New
Severity:    Affects Only Me
Priority:    ---
Keywords:    crash
Reporter:    Cameron <cam>
Assignee:    freebsd-bugs (Nobody) <bugs>
CC:          vasyl.samoilov
Attachments: 14-STABLE verbose dmesg (attachment 249134)
Description
Cameron
2024-03-13 15:22:36 UTC
Created attachment 249134: 14-STABLE verbose dmesg
Finally tried this with a swap partition on an NVMe drive to sidestep the SATA controller entirely... Still I get no crash dump. I commented the old swap partition out, added the new one, and rebooted (I could have just done swapoff/swapon, but I was being paranoid/thorough). I even ran "dumpon" on the new one to be sure, and swapinfo confirmed the new NVMe swap partition was the only swap device... and yet nothing in /var/crash. savecore -vC /dev/devpart showed there was no dump. I'm guessing the system locks up so thoroughly that it's unable to even write a kernel dump. Usually when this happens, the system reboots itself (including the 2 times I tried this), if that detail helps.

I am experiencing the same behavior: 14-STABLE, root on ZFS. It can easily be replicated with heavy I/O; in my case it's syncthing and transmission-bt. Every time I download anything with transmission that saturates disk I/O, the system locks up and reboots itself. I even connected a monitor, but no messages are printed before the lockup and reboot, so I don't see a kernel dump even being attempted.

FreeBSD 14.0-STABLE #18 50f771371: Sat Apr 13 20:11:34 EEST 2024

Note: in my case the CPU is an Intel(R) N100 (806.40-MHz K8-class CPU), with a ZFS mirror on two WDC drives (ada0: <WDC WD40EFRX-68N32N0 82.00A82> ACS-3 ATA SATA 3.x device). FreeBSD 12 was running on the same hardware with no problems.

(In reply to Vasyl S. from comment #3)

Maybe see if you can reproduce it by installing/running net-p2p/monero-cli. I run it with these flags in rc.conf:

monerod_enable="YES"
monerod_flags="--max-concurrency=12"

Maybe adjust the concurrency according to your CPU's specs? But if you reduce it, it's possible you may not be able to trigger the crash (I should probably experiment with this). I haven't been able to reproduce this with any other software yet... Not sure I want to use transmission on this box... but syncthing might work for me. I assume you mean net/syncthing? Are you syncing a lot of data with it? I want to at least partially replicate your workload to try to get the same results. (BTW, I'd never heard of syncthing before, but it sounds _really_ useful!)

I have this bug reproducing itself on at least two servers consistently. Right now I have consistent reproduction during make buildworld -j1 or -j2, given that I have 4 non-multithreaded cores (Intel N100 and Pentium J5005). I was able to rebuild world with -j4 and downgrade to 13-STABLE, and the problem is gone. 12-STABLE was working on the same hardware without any problems for years. The only common elements seem to be 14-STABLE, ZFS (zmirror), and really chaotic I/O with a significant amount of writes to a busy ZFS pool. transmission writes a lot of (relatively large) blocks during a torrent download, but in a non-sequential manner; syncthing I am using for syncing a lot of git repositories, which are periodically re-scanned, which generates a lot of I/O (mostly reads, but it keeps the HDD devices busy); and building world in a single thread also generates a significant amount of writes.

(In reply to Vasyl S. from comment #6)

I have syncthing set up with 3 git repos in the folder. It's pretty slow. Did you use any settings to tune it? Or is this what I should expect, and is it good enough to try to reproduce the issue with syncthing? (Other than the performance, which is likely related to the small files, syncthing seems pretty nice to have so far.)

Have you tried 14.0-RELEASE?

The issue occurs for me whether it's 14.0-RELEASE or 14.1-STABLE (well, currently 14.1-PRERELEASE).
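To recap, the two reproduction recipes described above, as a rough sketch only (the monerod rc script name is an assumption based on the rc.conf variables quoted above; tune --max-concurrency and -j to your core count):

  # Trigger 1: monerod from net-p2p/monero-cli (rc script name assumed)
  pkg install monero-cli
  sysrc monerod_enable="YES"
  sysrc monerod_flags="--max-concurrency=12"
  service monerod start

  # Trigger 2: low-parallelism buildworld on a busy ZFS mirror
  cd /usr/src && make buildworld -j1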
(In reply to Cameron from comment #7)

So far I haven't been able to reproduce the crash on FreeBSD 14.1-PRERELEASE using Syncthing with a shared volume that holds a number of large git repos. But the transfer speed is slower than I'd expect (usually 200-450 KB/s) with everything on the same LAN and all-SSD storage. I did get a warning about kqueue with a large number of files; maybe that could be the problem? I'm also using ZFS, but without RAID. Maybe it's easier to trigger with RAID1 due to twice the I/O traffic..? Or maybe this is a similar but otherwise different issue from mine?

Tried running monerod for the first time in a while... and my system no longer crashes! This could have been resolved by one or more of the following changes:

1. Upgraded to 14.1-RELEASE. (Though I tried 14-STABLE maybe within a few months of 14.1-RELEASE and still had the problem.)
2. Started using "zpool trim". (But I have another FreeBSD box that ran 14.0-RELEASE where I didn't run trim and had no problems.)
3. I'm on a beta BIOS for this motherboard that's more recent than the current latest official release.

I notice that after monerod has run for a while, I start getting tons of these messages in dmesg:

Jun 5 02:19:11 hostname kernel: sonewconn: pcb 0xfffff802963b9540 (0.0.0.0:18080 (proto 6)): Listen queue overflow: 193 already in queue awaiting acceptance (1 occurrences), euid 781, rgid 781, jail 0
Jun 5 02:25:11 hostname kernel: sonewconn: pcb 0xfffff802963b9540

Increasing kern.ipc.soacceptqueue doesn't seem to help at all. I wonder if I/O is so slow that monerod can't keep up with the connections? The first few times I ran "zpool trim", it only took a few minutes... but over time it has progressively gotten worse, now taking 21+ minutes, suggesting there's still some I/O issue; perhaps the same one I've had in the past when running monerod. It no longer causes my box to completely lock up, though: I can now run monerod constantly, which is a nice improvement!
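For reference, this is roughly how the accept queue and the TRIM runtime can be inspected (the sysctl and commands are standard FreeBSD; "zroot" is a placeholder pool name and 1024 an arbitrary example value):

  # Per-socket listen queues; the overflow counter corresponds to the
  # sonewconn messages above
  netstat -Lan

  # Inspect and raise the system-wide cap on listen(2) backlogs
  sysctl kern.ipc.soacceptqueue
  sysctl kern.ipc.soacceptqueue=1024

  # Start a manual TRIM pass and watch its progress/runtime
  zpool trim zroot
  zpool status -t zroot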
In /var/log/monerod.log, I see a lot of traces:

2024-06-10 15:46:31.253 [P2P6] INFO stacktrace src/common/stack_trace.cpp:134 Exception: boost::wrapexcept<boost::bad_weak_ptr>
2024-06-10 15:46:31.253 [P2P6] INFO stacktrace src/common/stack_trace.cpp:135 Unwound call stack:
2024-06-10 15:46:31.385 [P2P6] INFO stacktrace src/common/stack_trace.cpp:163 1 0x9ab808 __cxa_throw + 0xc8
2024-06-10 15:46:31.510 [P2P6] INFO stacktrace src/common/stack_trace.cpp:159 2 0x50b05f
2024-06-10 15:46:31.633 [P2P6] INFO stacktrace src/common/stack_trace.cpp:159 3 0x7e1f4a
2024-06-10 15:46:31.757 [P2P6] INFO stacktrace src/common/stack_trace.cpp:159 4 0x7dc205
2024-06-10 15:46:31.879 [P2P6] INFO stacktrace src/common/stack_trace.cpp:159 5 0x788439
2024-06-10 15:46:32.001 [P2P6] INFO stacktrace src/common/stack_trace.cpp:159 6 0x78886c
2024-06-10 15:46:32.122 [P2P6] INFO stacktrace src/common/stack_trace.cpp:159 7 0x7c05e2
2024-06-10 15:46:32.244 [P2P6] INFO stacktrace src/common/stack_trace.cpp:159 8 0x7b2e5b
2024-06-10 15:46:32.365 [P2P6] INFO stacktrace src/common/stack_trace.cpp:159 9 0x7bc49d
2024-06-10 15:46:32.486 [P2P6] INFO stacktrace src/common/stack_trace.cpp:159 a 0x4d9b88
2024-06-10 15:46:32.607 [P2P6] INFO stacktrace src/common/stack_trace.cpp:159 b 0x491100
2024-06-10 15:46:32.728 [P2P6] INFO stacktrace src/common/stack_trace.cpp:159 c 0x48eddd
2024-06-10 15:46:32.849 [P2P6] INFO stacktrace src/common/stack_trace.cpp:159 d 0x48c562
2024-06-10 15:46:32.970 [P2P6] INFO stacktrace src/common/stack_trace.cpp:159 e 0x7e39a5
2024-06-10 15:46:33.091 [P2P6] INFO stacktrace src/common/stack_trace.cpp:159 f 0x7fd24f
2024-06-10 15:46:33.212 [P2P6] INFO stacktrace src/common/stack_trace.cpp:159 10 0x7fd118
2024-06-10 15:46:33.333 [P2P6] INFO stacktrace src/common/stack_trace.cpp:159 11 0x4fb1b2
2024-06-10 15:46:33.453 [P2P6] INFO stacktrace src/common/stack_trace.cpp:159 12 0x4f03c4
2024-06-10 15:46:33.575 [P2P6] INFO stacktrace src/common/stack_trace.cpp:159 13 0x4efe94
2024-06-10 15:46:33.695 [P2P6] INFO stacktrace src/common/stack_trace.cpp:159 14 0x4efbcc
2024-06-10 15:46:33.816 [P2P6] INFO stacktrace src/common/stack_trace.cpp:159 15 0x7deaa2
2024-06-10 15:46:33.937 [P2P6] INFO stacktrace src/common/stack_trace.cpp:159 16 0x82bec79bd
2024-06-10 15:46:34.058 [P2P6] INFO stacktrace src/common/stack_trace.cpp:159 17 0x8324bcb05

I see similar traces on my other box, where monerod has never given me problems, but the traces become far more common on the problem box once the sonewconn errors start appearing. The sonewconn errors have never appeared on the other, working box. It seems monerod is mostly or entirely unable to continue syncing the blockchain once it gets to this point, stacktracing constantly, unless I completely reboot the system; completely stopping and starting monerod doesn't help. Looking at sockstat -c the last time I was in this state, I only had a bit over 200 connections.
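If it's useful to anyone else, the connection count above can be obtained with something like this (18080 is the P2P port from the sonewconn messages; subtract one line for sockstat's header):

  # Connected TCP sockets with local or foreign port 18080
  sockstat -c -P tcp -p 18080 | wc -l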