270975 – system hangs with heavy io and regular syncing

Status:	Open

Alias:	None

Product:	Base System
Classification:	Unclassified
Component:	kern
Version:	12.4-RELEASE
Hardware:	i386 Any

Importance:	--- Affects Some People
Assignee:	freebsd-bugs (Nobody)

URL:
Keywords:	performance

Depends on:
Blocks:

Reported:	2023-04-21 08:53 UTC by nkoch
Modified:	2023-04-24 06:26 UTC
CC List:	2 users

See Also:

Description nkoch 2023-04-21 08:53:31 UTC

We are running an embedded system with processes, some of them at rtprio, that do io and sync often. It runs a 12.1-p13 kernel with some modifications, including special device drivers, e.g. an sram based disk device.
Because we occasionally found the system completely unresponsive (neither login nor ssh possible), I added a utility that runs at very high rtprio and monitors the other processes. If it sees 100% cpu, it throttles those processes using SIGSTOP/SIGCONT. That helped me see that a thread was in unkillable sleep in the sync syscall while using up most of the cpu, as if it were busy-waiting for something.
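
For readers who want to reproduce the monitoring idea, here is a minimal sketch of the SIGSTOP/SIGCONT throttling step. It is not the utility described above: the names `pick_hogs` and `throttle`, the 99% threshold and the pause length are all assumptions made for illustration, and how the per-process cpu percentages are collected is left out entirely.

```python
import os
import signal
import time


def pick_hogs(cpu_by_pid, threshold=99.0, exempt=()):
    """Return the pids whose cpu usage (percent) is at or above threshold.

    cpu_by_pid is a hypothetical mapping {pid: percent}; collecting it
    (e.g. by parsing ps or top output) is outside this sketch.
    """
    return sorted(
        pid
        for pid, pct in cpu_by_pid.items()
        if pct >= threshold and pid not in exempt
    )


def throttle(pid, pause=0.1):
    """Stop a process for `pause` seconds, then let it continue."""
    try:
        os.kill(pid, signal.SIGSTOP)
        time.sleep(pause)
        os.kill(pid, signal.SIGCONT)
    except ProcessLookupError:
        # The process exited in the meantime; nothing left to throttle.
        pass
```

A monitor built this way would call `pick_hogs` periodically and `throttle` each returned pid, and would itself have to run at a higher priority than the processes it watches.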

After that I did some testing with unmodified kernels (without my device drivers) and simple test scripts that do write+sync+random sleep at normal and realtime priority.
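
The test scripts themselves are not attached to this report. As a rough sketch only, one write+sync+random sleep worker could look like the following; the function names, block size, file name and sleep range are assumptions, not the values used in the tests above.

```python
import os
import random
import tempfile
import time


def write_sync_sleep(path, payload, max_sleep):
    """One iteration: rewrite the file, call sync(2), sleep a random time."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        written = os.write(fd, payload)
    finally:
        os.close(fd)
    os.sync()  # system-wide sync, the syscall the stuck thread was seen in
    time.sleep(random.uniform(0.0, max_sleep))
    return written


def run(iterations, directory=None, block_size=64 * 1024, max_sleep=1.0):
    """Repeat the write+sync+sleep cycle `iterations` times."""
    directory = directory or tempfile.gettempdir()
    path = os.path.join(directory, "synctest-%d.bin" % os.getpid())
    payload = os.urandom(block_size)
    for _ in range(iterations):
        write_sync_sleep(path, payload, max_sleep)
    os.unlink(path)
```

Several such workers would be started in parallel, some of them under rtprio(1), to approximate the mix of normal and realtime priority described above.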

So far I've tested FreeBSD 12.1-RELEASE, FreeBSD 12.1-p13 and FreeBSD 12.4-RELEASE.
I've managed to make all of them unresponsive after one or more hours.

For FreeBSD 12.4, I had one console running a shell with rtprio. After "killall sync &" and "killall sh &" the system hung. I could only switch vtys but could not log in.
In another test I got the hang by calling "procstat kstack 1" in the rtprio shell.

One detail: kern.dirdelay, kern.metadelay and kern.filedelay are all set to 1.

Comment 1 Daniel Ebdrup Jensen freebsd_committer 2023-04-21 14:23:28 UTC

It sounds to me like console interactivity is being completely overloaded as there are simply no spare cycles - as such, I'm not sure this is a bug, as FreeBSD is doing what it's been told to do.

I once managed to do the exact same on a production system, and solved it by using top -q as root to identify the process and kill it - but it took 3 days of waiting on the login prompt to finish, and ssh never managed to work.

The kern.sched MIB in sysctl(8) has some control over interactivity and threshold scores that you can try tweaking to improve this if you can't bear to shut down the system to fix the issue.

As an alternative, you can use cpuset(8) and/or rctl(8) to limit which cores your processes are running on and/or how much cputime they get, which should free up cycles for the system to be a little interactive.

If you want help with something, a bug tracker isn't the best place for it - stop by #FreeBSD on Libera, it's possible that there's someone who can help you root-cause the issue.
For reference, there's documentation on how to write a good bug report once you've figured out the exact issue: https://docs.freebsd.org/en/articles/problem-reports/

Comment 2 nkoch 2023-04-24 06:21:42 UTC

I would like to repeat: I've found that some thread remained in unkillable sleep in syscall sync(). To me, that does not look like a performance problem. And all of the io processes do some random sleep after calling sync.

I did the same test with the same number of io processes without rtprio. So far, after 2 days, it has not shown any unexpected behaviour.

Comment 3 nkoch 2023-04-24 06:26:00 UTC

And maybe I forgot to mention: we first saw the problem in our embedded environment without heavy io, just a lot of syncing at rtprio. It needed a lot more time, weeks at least, to hang the system. With more io it just happens earlier.