| Field | Value |
|---|---|
| Summary | kernel using 100% CPU in arc_prune in 13.3 |
| Product | Base System |
| Component | kern |
| Version | 13.3-RELEASE |
| Hardware | amd64 |
| OS | Any |
| Status | Closed FIXED |
| Severity | Affects Many People |
| Priority | --- |
| Reporter | Maxim Usatov <maxim.usatov> |
| Assignee | Olivier Certner <olce> |
| CC | chris, emaste, fbsdbugs4, frank, grahamperrin, mfburdett, nihilesthic, olce, paolo.tealdi, pmc, steelem, vvd, zarychtam |
| Flags | grahamperrin: needs_errata? |
| See Also | https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=275594 |
Description
Maxim Usatov
2024-03-15 10:28:39 UTC
See also bug 275594.

I confirm the bug on two of my FreeBSD 13.3 servers (VMware) as well. It was resolved by upgrading them to the 14.0-p5 release. They are busy production servers (a Nagios server and a web server).

The bug is also present on FreeBSD 13-STABLE (stable/13-8b84d2da9: Fri Mar 8 15:06:13 AEDT 2024). The problem discussed in this thread might be relevant: https://forums.freebsd.org/threads/rsync-bad-file-descriptor.92733/ Does anyone know if an errata notice is anticipated?

(In reply to Marek Zarychta from comment #1) I reported on Feb 6 that the problem in bug 275594 is also present in 13.3-BETA1, and on Feb 23 that the patches by Seigo Tanimura do solve my issue.

I installed the patch provided by Seigo Tanimura in https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=275594 and the bug that affected another of my servers running 13.3-RELEASE seems to have disappeared. After copying data in and out over the network and doing a 'pkg upgrade -f' (which reinstalled more than 350 packages), the server is working fine, with no arc_prune kernel thread showing in top -HPS.

Hello! Any plans to backport the fix to 13.3? Thank you!

FreeBSD 13.3-STABLE #4 stable/13-b5e7969b2: Fri Mar 29 20:50:35 AEDT 2024 - The system now locks up more or less every second day during the backup process, which uses zfs send to copy a ZFS snapshot to an external UFS hard disk attached via USB. It is only every second day (more or less) that arc_prune consumes 100% of one CPU core, and the two things always occur together (i.e. arc_prune running amok and the system freezing).
```
Mar 31 06:00:00 shadow kernel: umass0: <Seagate GoFlex Desk, class 0/0, rev 2.10/1.00, addr 4> on usbus3
Mar 31 06:00:00 shadow kernel: da0 at umass-sim0 bus 0 scbus3 target 0 lun 0
Mar 31 06:00:00 shadow kernel: da0: <Seagate GoFlex Desk 0D19> Fixed Direct Access SPC-3 SCSI device
Mar 31 06:00:00 shadow kernel: da0: Serial Number NA0MBZV8
Mar 31 06:00:00 shadow kernel: da0: 40.000MB/s transfers
Mar 31 06:00:00 shadow kernel: da0: 2861588MB (732566645 4096 byte sectors)
Mar 31 06:00:00 shadow kernel: da0: quirks=0x2<NO_6_BYTE>
Mar 31 06:00:30 shadow root[42531]: Start local backup to da0p1
Mar 31 06:34:38 shadow kernel: pid 62306 (seamonkey), jid 0, uid 1001, was killed: a thread waited too long to allocate a page
Mar 31 06:37:17 shadow kernel: pid 26160 (smbd), jid 0, uid 0, was killed: a thread waited too long to allocate a page
Mar 31 06:38:30 shadow kernel: pid 26033 (named), jid 0, uid 53, was killed: a thread waited too long to allocate a page
Mar 31 06:38:54 shadow kernel: pid 26288 (Xorg), jid 0, uid 0, was killed: a thread waited too long to allocate a page
Mar 31 07:14:12 shadow kernel: pid 26214 (milter-greylist), jid 0, uid 26, was killed: a thread waited too long to allocate a page
Mar 31 07:14:12 shadow kernel: pid 26268 (tcsh), jid 0, uid 1001, was killed: a thread waited too long to allocate a page
Mar 31 07:14:12 shadow kernel: pid 26167 (milter-relay), jid 0, uid 26, was killed: a thread waited too long to allocate a page
Mar 31 07:14:12 shadow kernel: pid 26110 (milter-regex), jid 0, uid 26, was killed: a thread waited too long to allocate a page
Mar 31 07:14:12 shadow kernel: pid 39982 (sendmail), jid 0, uid 0, was killed: a thread waited too long to allocate a page
Mar 31 07:14:12 shadow kernel: pid 40057 (httpd), jid 0, uid 80, was killed: a thread waited too long to allocate a page
Mar 31 07:14:12 shadow kernel: pid 40000 (httpd), jid 0, uid 80, was killed: a thread waited too long to allocate a page
Mar 31 07:14:12 shadow kernel: pid 40054 (httpd), jid 0, uid 80, was killed: a thread waited too long to allocate a page
Mar 31 07:14:12 shadow kernel: pid 42492 (sendmail), jid 0, uid 25, was killed: a thread waited too long to allocate a page
Mar 31 07:14:12 shadow kernel: pid 40056 (httpd), jid 0, uid 80, was killed: a thread waited too long to allocate a page
Mar 31 07:14:12 shadow kernel: pid 40001 (httpd), jid 0, uid 80, was killed: a thread waited too long to allocate a page
Mar 31 07:14:12 shadow kernel: pid 39999 (httpd), jid 0, uid 80, was killed: a thread waited too long to allocate a page
Mar 31 07:14:12 shadow kernel: pid 39998 (httpd), jid 0, uid 80, was killed: a thread waited too long to allocate a page
Mar 31 07:14:12 shadow kernel: pid 43308 (zfs), jid 0, uid 0, was killed: a thread waited too long to allocate a page
Mar 31 07:14:12 shadow kernel: pid 39996 (sendmail), jid 0, uid 25, was killed: a thread waited too long to allocate a page
Mar 31 07:14:12 shadow kernel: pid 26092 (ntpd), jid 0, uid 0, was killed: a thread waited too long to allocate a page
```

After that, the only way to regain access to the server is to turn the power off :(

(In reply to Trev from comment #8) It looks like the fix was committed to stable/13 only several hours ago: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=275594#c115

I confirm everything works fine now after applying the latest patch.

(In reply to Maxim Usatov from comment #10) You can close it as Fixed.

This was fixed as part of working on bug 275594, where most of the reports for the incarnation of the problem on 13 actually went. Sorry for not having referenced this PR as well in the commit message and the EN. The original report (not for 13.3, but for main) is bug 274698, which received a fix that was then backported to stable/14. The fixes done in bug 275594 for stable/13 and then releng/13.3 (bug 278375) are essentially backports of it. For a full chronology of commits for this fix, see bug 274698, comment 10.
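For readers stuck on an unpatched 13.3 kernel, a generic ZFS tuning measure (not a fix proposed in this thread, and no substitute for the backported patch) is to cap the ARC size so memory pressure is less likely to drive arc_prune and the page-allocation kills seen in the log above. The value below is purely an illustrative assumption; size it for your own RAM.

```conf
# /boot/loader.conf -- illustrative value only, tune for your system.
# Caps the ZFS ARC at 4 GiB (value is in bytes). This can reduce the
# memory pressure that triggers arc_prune, but it does NOT fix the
# underlying bug addressed by the stable/13 / releng/13.3 backports.
vfs.zfs.arc_max="4294967296"
```

The cap takes effect at boot; on a running system the same limit can be applied with sysctl until the patched kernel can be installed.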