Bug 277671

Summary:

14-RELEASE/14-STABLE crash under heavy disk IO on AMD Asus x670e motherboard; Intel i225 (igc) NIC non-functioning

Product:

Base System

Reporter:

Cameron <cam>

Component:

kern

Assignee:

freebsd-bugs (Nobody) <bugs>

Status:

New ---

Severity:

Affects Only Me

CC:

vasyl.samoilov

Priority:

---

Keywords:

crash

Version:

14.0-STABLE

Hardware:

amd64

OS:

Any

Attachments:

Description	Flags
14-STABLE verbose dmesg	none

Description Cameron 2024-03-13 15:22:36 UTC

Using a 1TB Samsung 850 Pro (ZFS), I'm able to pretty reliably crash my box by running 'monerod' from 'net-p2p/monero-cli'.

Interestingly, I haven't been able to reproduce this (yet) with other IO loads, including bonnie/bonnie++.

This is on an Asus Crosshair x670 Extreme motherboard. This didn't start happening until I upgraded to BIOS 1709 and later. So why not downgrade the BIOS? Because it's impossible to downgrade to anything lower after upgrading to 1709.

But this seems to be more than a simple BIOS bug. I believe it's likely related to the AGESA version.

This bug seems it could be tangentially related:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=272507

With the upgrade to 1709, keyboard/mouse input stopped working (could've been an issue with USB not functioning properly after the update) and the network stopped working (an Intel i225). The IO-related crash also did not occur until this upgrade.

Since upgrading to newer versions of the BIOS, keyboard/mouse input started working again. The onboard i225 still doesn't (it's recognized by 'ifconfig igc0', network settings are accepted and applied, but there is no network connectivity). I've worked around the NIC issue by installing an Intel x540 tx2.

If I downgrade back to 1709, keyboard/mouse input still works, so probably the BIOS downgrade doesn't downgrade the AGESA version (which is why I suspect this is AGESA related).

I suspect the BIOS (AGESA) upgrade has altered and/or introduced some platform-level bugs that are either Asus specific, AMD x670 specific, or some intersection of the two.

When the crash occurs, the screen immediately goes black (so no text to share) and /var/crash isn't populated. If requested, perhaps I could try mounting /var/crash from an NVMe drive (assuming this is a SATA-specific issue) and/or run swap from an NVMe drive to see if I can get a crash dump.

Comment 1 Cameron 2024-03-13 15:47:11 UTC

Created attachment 249134
14-STABLE verbose dmesg

Comment 2 Cameron 2024-04-19 04:32:18 UTC

Finally tried this with a swap partition on an NVMe drive to sidestep the SATA controller entirely... Still I get no crash dump. I commented the old swap partition out, added the new one, rebooted (I could have just done swapoff/swapon, but I was being paranoid/thorough). 

I even ran "dumpdev" on the new one to be sure, and swapinfo confirmed the new NVMe swap partition was the only swap device... and yet nothing in /var/crash.

savecore -vC /dev/devpart showed there was no dump.

I'm guessing the system locks up so thoroughly it's unable to even do a kernel dump. Usually when this happens, the system reboots itself (including the 2 times I tried this) if that detail helps.

Comment 3 Vasyl S. 2024-04-25 08:45:20 UTC

I am experiencing the same behavior. 14-STABLE, root on ZFS.
It can easily be replicated with heavy I/O; in my case it's syncthing and transmission-bt. Every time I download anything with transmission that saturates disk I/O, the system locks up and reboots itself. I even connected a monitor, but no messages are printed before the lockup and reboot, so I don't think a kernel dump is even being attempted.

FreeBSD 14.0-STABLE #18 50f771371: Sat Apr 13 20:11:34 EEST 2024

Comment 4 Vasyl S. 2024-04-25 08:47:21 UTC

Note: in my case the CPU is an Intel(R) N100 (806.40-MHz K8-class CPU), with a ZFS mirror of two disks (ada0: <WDC WD40EFRX-68N32N0 82.00A82> ACS-3 ATA SATA 3.x device). FreeBSD 12 was running on it with no problem.

Comment 5 Cameron 2024-04-25 19:34:24 UTC

(In reply to Vasyl S. from comment #3)

Maybe see if you can reproduce it by installing/running net-p2p/monero-cli

I run it with these flags:
monerod_enable="YES"
monerod_flags="--max-concurrency=12"

Maybe adjust the concurrency according to your CPU's specs? But if you reduce it, it's possible you may not be able to trigger the crash (I should probably experiment with this).

I haven't been able to reproduce this with any other software yet... Not sure I want to use transmission on this box... but syncthing might work for me. I assume you mean net/syncthing ?

Are you syncing a lot of data with it? I guess I want to at least partially replicate your workload to try and get the same results.

(BTW, I'd never heard of syncthing before, but it sounds _really_ useful!)

Comment 6 Vasyl S. 2024-04-26 19:14:22 UTC

I have this bug reproducing itself consistently on at least two servers. Right now I can consistently reproduce it during make buildworld -j1 or -j2, given that I have 4 non-multithreaded cores (Intel N100 and Pentium J5005). I was able to rebuild world with -j4 and downgrade to 13-stable, and the problem is gone. 12-stable was working on the same hardware without any problems for years.

The only common element seems to be 14-stable, zfs (zmirror) and really chaotic i/o with significant amount of writes to busy zfs.

During a torrent download, transmission writes a lot of (relatively large) blocks, but in a non-sequential manner. I am using syncthing to sync a lot of git repositories, which are periodically re-scanned, and that generates a lot of i/o (mostly reads, but it keeps the hdd devices busy). Building world in a single thread also generates a significant amount of writes.
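
For anyone who wants to approximate the torrent-style pattern described above (large blocks written at non-sequential offsets) without installing transmission, a minimal Python sketch follows. The function name, target path, and sizes are hypothetical, and nothing here has been confirmed to trigger the lockup; it only imitates the access pattern.

```python
import os
import random


def scattered_writes(path, file_size, block_size, count, seed=0):
    """Write `count` blocks of `block_size` bytes at random block-aligned offsets.

    Returns the total number of bytes written. All parameters are
    illustrative; pick sizes large enough to keep the pool busy.
    """
    rng = random.Random(seed)
    blocks = file_size // block_size
    payload = os.urandom(block_size)  # random data, so compression cannot collapse it
    written = 0
    with open(path, "wb") as f:
        f.truncate(file_size)  # extend the empty file to its final size up front
        for _ in range(count):
            # Jump to a random block-aligned offset, as a torrent client would
            f.seek(rng.randrange(blocks) * block_size)
            written += f.write(payload)
        f.flush()
        os.fsync(f.fileno())  # push the data out of the page cache to disk
    return written
```

A call such as scattered_writes("/tank/scratch/load.bin", 8 << 30, 1 << 20, 4096) would write 4 GiB in 1 MiB blocks into an 8 GiB file; the path is a placeholder for a dataset on the affected pool.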

Comment 7 Cameron 2024-05-01 05:14:10 UTC

(In reply to Vasyl S. from comment #6)

I have syncthing set up and I have 3 git repos in the folder. It's pretty slow. Did you use any settings to tune it? Or is this what I should expect, and is it good enough to try and reproduce the issue using syncthing?

(other than the performance, which is likely related to small files, syncthing seems pretty nice to have so far).

Have you tried 14.0-RELEASE? The issue occurs for me whether it's 14.0-RELEASE or 14.1-STABLE (well, currently 14.1-PRERELEASE).

Comment 8 Cameron 2024-05-01 19:15:43 UTC

(In reply to Cameron from comment #7)
So far I haven't been able to reproduce the crash on FreeBSD 14.1-PRERELEASE using Syncthing with a shared volume that has a number of large git repos. But the transfer speed is slower than I'd expect (usually 200-450 Kb) with everything on the same LAN and all SSD storage.

I did get a warning about kqueue with a large number of files. Maybe that could be the problem?

I'm also using ZFS but without RAID. Maybe it's easier to trigger with RAID1 due to twice the IO traffic..?

Or maybe this is a similar but otherwise different issue to mine?

Comment 9 Cameron 2024-06-10 16:08:45 UTC

Tried running monerod for the first time in a while... And my system no longer crashes!

This could be resolved by one or more of the following changes:

1. Upgraded to 14.1-RELEASE. I tried 14-STABLE maybe within a few months of 14.1-RELEASE and still had the problem.

2. Started using "zpool trim"... But I have another FreeBSD box running 14.0-RELEASE where I didn't run trim and had no problems.

3. I'm on a beta BIOS for this motherboard that's more recent than the current latest official release.

I notice after monerod has run for a while, I start getting tons of these messages in dmesg:
Jun  5 02:19:11 hostname kernel: sonewconn: pcb 0xfffff802963b9540 (0.0.0.0:18080 (proto 6)): Listen queue overflow: 193 already in queue awaiting acceptance (1 occurrences), euid 781, rgid 781, jail 0
Jun  5 02:25:11 hostname kernel: sonewconn: pcb 0xfffff802963b9540 

Increasing kern.ipc.soacceptqueue doesn't seem to help at all. I wonder if IO is so slow that monerod can't keep up with the connections?
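
To see whether a kern.ipc.soacceptqueue change has any effect over time, the sonewconn messages can be tallied per listening socket. The Python sketch below assumes the message format shown in the log line above; the function name is made up for illustration, and lines that do not match the full pattern (such as truncated ones) are skipped.

```python
import re

# Matches the kernel message format quoted above.
SONEWCONN = re.compile(
    r"sonewconn: pcb \S+ \((\S+) \(proto \d+\)\): "
    r"Listen queue overflow: (\d+) already in queue awaiting acceptance "
    r"\((\d+) occurrences\)"
)


def summarize_overflows(lines):
    """Return {listen_address: (total_occurrences, max_queue_length)}."""
    summary = {}
    for line in lines:
        m = SONEWCONN.search(line)
        if m is None:
            continue  # unrelated or truncated line
        addr, queued, occurrences = m.group(1), int(m.group(2)), int(m.group(3))
        total, longest = summary.get(addr, (0, 0))
        summary[addr] = (total + occurrences, max(longest, queued))
    return summary
```

Feeding it the lines of /var/log/messages before and after a sysctl change would show whether the reported queue length moves with the new limit.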

The first few times I ran "zpool trim", it only took a few minutes... But over time, it has progressively gotten worse, now taking 21+ minutes, which suggests there's still some IO issue. Perhaps it's the same issue I've had in the past when running monerod, but now it no longer causes my box to completely lock up.

I can now run monerod constantly without locking up my box though, which is a nice improvement!

In /var/log/monerod.log, I see a lot of traces:
2024-06-10 15:46:31.253 [P2P6]  INFO    stacktrace      src/common/stack_trace.cpp:134  Exception: boost::wrapexcept<boost::bad_weak_ptr>
2024-06-10 15:46:31.253 [P2P6]  INFO    stacktrace      src/common/stack_trace.cpp:135  Unwound call stack:
2024-06-10 15:46:31.385 [P2P6]  INFO    stacktrace      src/common/stack_trace.cpp:163       1                  0x9ab808 __cxa_throw + 0xc8
2024-06-10 15:46:31.510 [P2P6]  INFO    stacktrace      src/common/stack_trace.cpp:159       2                  0x50b05f
2024-06-10 15:46:31.633 [P2P6]  INFO    stacktrace      src/common/stack_trace.cpp:159       3                  0x7e1f4a
2024-06-10 15:46:31.757 [P2P6]  INFO    stacktrace      src/common/stack_trace.cpp:159       4                  0x7dc205
2024-06-10 15:46:31.879 [P2P6]  INFO    stacktrace      src/common/stack_trace.cpp:159       5                  0x788439
2024-06-10 15:46:32.001 [P2P6]  INFO    stacktrace      src/common/stack_trace.cpp:159       6                  0x78886c
2024-06-10 15:46:32.122 [P2P6]  INFO    stacktrace      src/common/stack_trace.cpp:159       7                  0x7c05e2
2024-06-10 15:46:32.244 [P2P6]  INFO    stacktrace      src/common/stack_trace.cpp:159       8                  0x7b2e5b
2024-06-10 15:46:32.365 [P2P6]  INFO    stacktrace      src/common/stack_trace.cpp:159       9                  0x7bc49d
2024-06-10 15:46:32.486 [P2P6]  INFO    stacktrace      src/common/stack_trace.cpp:159       a                  0x4d9b88
2024-06-10 15:46:32.607 [P2P6]  INFO    stacktrace      src/common/stack_trace.cpp:159       b                  0x491100
2024-06-10 15:46:32.728 [P2P6]  INFO    stacktrace      src/common/stack_trace.cpp:159       c                  0x48eddd
2024-06-10 15:46:32.849 [P2P6]  INFO    stacktrace      src/common/stack_trace.cpp:159       d                  0x48c562
2024-06-10 15:46:32.970 [P2P6]  INFO    stacktrace      src/common/stack_trace.cpp:159       e                  0x7e39a5
2024-06-10 15:46:33.091 [P2P6]  INFO    stacktrace      src/common/stack_trace.cpp:159       f                  0x7fd24f
2024-06-10 15:46:33.212 [P2P6]  INFO    stacktrace      src/common/stack_trace.cpp:159      10                  0x7fd118
2024-06-10 15:46:33.333 [P2P6]  INFO    stacktrace      src/common/stack_trace.cpp:159      11                  0x4fb1b2
2024-06-10 15:46:33.453 [P2P6]  INFO    stacktrace      src/common/stack_trace.cpp:159      12                  0x4f03c4
2024-06-10 15:46:33.575 [P2P6]  INFO    stacktrace      src/common/stack_trace.cpp:159      13                  0x4efe94
2024-06-10 15:46:33.695 [P2P6]  INFO    stacktrace      src/common/stack_trace.cpp:159      14                  0x4efbcc
2024-06-10 15:46:33.816 [P2P6]  INFO    stacktrace      src/common/stack_trace.cpp:159      15                  0x7deaa2
2024-06-10 15:46:33.937 [P2P6]  INFO    stacktrace      src/common/stack_trace.cpp:159      16                  0x82bec79bd
2024-06-10 15:46:34.058 [P2P6]  INFO    stacktrace      src/common/stack_trace.cpp:159      17                  0x8324bcb05


I see similar traces on my other box where monerod has never given me problems, but the traces become far more common on the box that does give me problems once the sonewconn errors start appearing. The sonewconn errors have never appeared on the other working box.

Once it gets to this point, monerod seems mostly or entirely unable to continue syncing the blockchain, with constant stacktraces, unless I completely reboot the system. Completely stopping and starting monerod doesn't help.
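
To put a number on how much more common the traces are on each box, the "Exception:" lines in monerod.log can be counted per minute. The Python sketch below assumes the log format shown in the traces above; the function name is hypothetical.

```python
import re
from collections import Counter

# Timestamp (to the minute), then a "stacktrace" category, then the exception name.
EXCEPTION_LINE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2}\.\d+\s.*\bstacktrace\b.*Exception: (.+)$"
)


def exceptions_per_minute(lines):
    """Count logged exceptions, keyed by (minute, exception type)."""
    counts = Counter()
    for line in lines:
        m = EXCEPTION_LINE.match(line.strip())
        if m:
            counts[(m.group(1), m.group(2))] += 1
    return counts
```

Running it over the log from both machines, and comparing the counts before and after the first sonewconn message, would show whether the two really are correlated.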

Looking at sockstat -c the last time I was in this state, I only had a bit over 200 connections.