| Field | Value |
|---|---|
| Summary | kernel using 100% CPU in arc_prune in 13.3 |
| Product | Base System |
| Component | kern |
| Version | 13.3-RELEASE |
| Hardware | amd64 |
| OS | Any |
| Status | Closed FIXED |
| Severity | Affects Many People |
| Priority | --- |
| Reporter | Maxim Usatov <maxim.usatov> |
| Assignee | Olivier Certner <olce> |
| CC | chris, emaste, fbsdbugs4, frank, grahamperrin, mfburdett, nihilesthic, olce, paolo.tealdi, pmc, steelem, vvd, zarychtam |
| Flags | grahamperrin: needs_errata? |
| See Also | https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=275594 |
Description
Maxim Usatov
2024-03-15 10:28:39 UTC
See also bug 275594.

I confirm the bug on two of my FreeBSD 13.3 servers (VMware) as well. It was resolved by upgrading them to the 14.0-p5 release. They are busy production servers (a Nagios server and a web server).

The bug is also present on FreeBSD 13-STABLE (stable/13-8b84d2da9: Fri Mar 8 15:06:13 AEDT 2024). The problem discussed in this thread might be relevant: https://forums.freebsd.org/threads/rsync-bad-file-descriptor.92733/ Does anyone know if an errata notice is anticipated?

(In reply to Marek Zarychta from comment #1) I reported on Feb 6 that the problem in bug 275594 is also present in 13.3-BETA1, and on Feb 23 that the patches by Seigo Tanimura do solve my issue.

I installed the patch provided by Seigo Tanimura in https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=275594 and the bug that affected another of my servers running 13.3-RELEASE seems to have disappeared. After copying data in and out over the network and doing a 'pkg upgrade -f' (which reinstalled more than 350 packages), the server is working fine, with no arc_prune kernel thread showing in top -HPS.

Hello! Any plans to backport the fix to 13.3? Thank you!

FreeBSD 13.3-STABLE #4 stable/13-b5e7969b2: Fri Mar 29 20:50:35 AEDT 2024 - The system now locks up more or less every second day during the backup process, which uses zfs send to copy a ZFS snapshot to an external UFS hard disk attached via USB. It is only every second day (more or less) that arc_prune consumes 100% of one CPU core, and the two things always occur together (i.e. arc_prune running amok and the system freezing).
```
Mar 31 06:00:00 shadow kernel: umass0: <Seagate GoFlex Desk, class 0/0, rev 2.10/1.00, addr 4> on usbus3
Mar 31 06:00:00 shadow kernel: da0 at umass-sim0 bus 0 scbus3 target 0 lun 0
Mar 31 06:00:00 shadow kernel: da0: <Seagate GoFlex Desk 0D19> Fixed Direct Access SPC-3 SCSI device
Mar 31 06:00:00 shadow kernel: da0: Serial Number NA0MBZV8
Mar 31 06:00:00 shadow kernel: da0: 40.000MB/s transfers
Mar 31 06:00:00 shadow kernel: da0: 2861588MB (732566645 4096 byte sectors)
Mar 31 06:00:00 shadow kernel: da0: quirks=0x2<NO_6_BYTE>
Mar 31 06:00:30 shadow root[42531]: Start local backup to da0p1
Mar 31 06:34:38 shadow kernel: pid 62306 (seamonkey), jid 0, uid 1001, was killed: a thread waited too long to allocate a page
Mar 31 06:37:17 shadow kernel: pid 26160 (smbd), jid 0, uid 0, was killed: a thread waited too long to allocate a page
Mar 31 06:38:30 shadow kernel: pid 26033 (named), jid 0, uid 53, was killed: a thread waited too long to allocate a page
Mar 31 06:38:54 shadow kernel: pid 26288 (Xorg), jid 0, uid 0, was killed: a thread waited too long to allocate a page
Mar 31 07:14:12 shadow kernel: pid 26214 (milter-greylist), jid 0, uid 26, was killed: a thread waited too long to allocate a page
Mar 31 07:14:12 shadow kernel: pid 26268 (tcsh), jid 0, uid 1001, was killed: a thread waited too long to allocate a page
Mar 31 07:14:12 shadow kernel: pid 26167 (milter-relay), jid 0, uid 26, was killed: a thread waited too long to allocate a page
Mar 31 07:14:12 shadow kernel: pid 26110 (milter-regex), jid 0, uid 26, was killed: a thread waited too long to allocate a page
Mar 31 07:14:12 shadow kernel: pid 39982 (sendmail), jid 0, uid 0, was killed: a thread waited too long to allocate a page
Mar 31 07:14:12 shadow kernel: pid 40057 (httpd), jid 0, uid 80, was killed: a thread waited too long to allocate a page
Mar 31 07:14:12 shadow kernel: pid 40000 (httpd), jid 0, uid 80, was killed: a thread waited too long to allocate a page
Mar 31 07:14:12 shadow kernel: pid 40054 (httpd), jid 0, uid 80, was killed: a thread waited too long to allocate a page
Mar 31 07:14:12 shadow kernel: pid 42492 (sendmail), jid 0, uid 25, was killed: a thread waited too long to allocate a page
Mar 31 07:14:12 shadow kernel: pid 40056 (httpd), jid 0, uid 80, was killed: a thread waited too long to allocate a page
Mar 31 07:14:12 shadow kernel: pid 40001 (httpd), jid 0, uid 80, was killed: a thread waited too long to allocate a page
Mar 31 07:14:12 shadow kernel: pid 39999 (httpd), jid 0, uid 80, was killed: a thread waited too long to allocate a page
Mar 31 07:14:12 shadow kernel: pid 39998 (httpd), jid 0, uid 80, was killed: a thread waited too long to allocate a page
Mar 31 07:14:12 shadow kernel: pid 43308 (zfs), jid 0, uid 0, was killed: a thread waited too long to allocate a page
Mar 31 07:14:12 shadow kernel: pid 39996 (sendmail), jid 0, uid 25, was killed: a thread waited too long to allocate a page
Mar 31 07:14:12 shadow kernel: pid 26092 (ntpd), jid 0, uid 0, was killed: a thread waited too long to allocate a page
```

After that, the only way to regain access to the server is to turn the power off :(

(In reply to Trev from comment #8) It looks like the fix was committed to stable/13 only several hours ago: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=275594#c115

I confirm everything works fine now after applying the latest patch.

(In reply to Maxim Usatov from comment #10) You can close it as Fixed.

This was fixed as part of working on bug 275594, where most of the reports for the incarnation of the problem on 13 actually went. Sorry for not having referenced this PR as well in the commit message and the EN. The original report (not for 13.3, but for main) is bug 274698, which received a fix that was then backported to stable/14. The fixes done in bug 275594 for stable/13 and then releng/13.3 (bug 278375) are essentially backports of it. For a full chronology of commits for this fix, see bug 274698, comment 10.
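For readers stuck on an unpatched 13.3 kernel, a generic ZFS tuning measure (not a fix proposed in this thread, and no substitute for the backported patch) is to cap the ARC size so memory pressure is less likely to drive arc_prune and the page-allocation kills seen in the log above. The value below is purely an illustrative assumption; size it for your own RAM.

```conf
# /boot/loader.conf -- illustrative value only, tune for your system.
# Caps the ZFS ARC at 4 GiB (value is in bytes). This can reduce the
# memory pressure that triggers arc_prune, but it does NOT fix the
# underlying bug addressed by the stable/13 / releng/13.3 backports.
vfs.zfs.arc_max="4294967296"
```

The cap takes effect at boot; on a running system the same limit can be applied with sysctl until the patched kernel can be installed.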