Bug 224080 - ZFS: "arc_dnlc_evicts_thr" + "arc_reclaim_thread" - high CPU load during poudriere build...
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: CURRENT
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: freebsd-fs mailing list
URL:
Keywords: regression
Depends on:
Blocks:
 
Reported: 2017-12-04 13:29 UTC by Nils Beyer
Modified: 2018-01-02 04:37 UTC
CC List: 7 users

See Also:


Attachments
Proposed fix (3.74 KB, patch)
2017-12-13 00:31 UTC, Jeff Roberson

Description Nils Beyer 2017-12-04 13:29:47 UTC
On 12-CURRENT r326527

Running a poudriere build results in high CPU load (up to 100%) for the ZFS kernel threads "arc_dnlc_evicts_thr" and "arc_reclaim_thread":
------------------------------------------------------------------------------------------------
last pid:  8708;  load averages:  5.53,  1.49,  0.55  up 0+00:01:05    14:09:07                                                                                                     
787 processes: 18 running, 728 sleeping, 2 zombie, 39 waiting                                                                                                                       

Mem: 379M Active, 72M Inact, 765M Wired, 14G Free
ARC: 250M Total, 22M MFU, 178M MRU, 314K Anon, 6575K Header, 43M Other
     100M Compressed, 292M Uncompressed, 2.91:1 Ratio
Swap: 32G Total, 32G Free

  PID USERNAME      PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
   17 root           -8    -     0K   192K CPU6    6   0:18  70.65% zfskern{arc_dnlc_evicts_thr}
 1328 root           52    0 14300K  5280K piperd  0   0:09  22.36% sh
   11 root          155 ki31     0K   128K RUN     5   0:21  16.46% idle{idle: cpu5}
   11 root          155 ki31     0K   128K RUN     7   0:28  16.16% idle{idle: cpu7}
   11 root          155 ki31     0K   128K RUN     0   0:29  15.48% idle{idle: cpu0}
   11 root          155 ki31     0K   128K RUN     1   0:28  15.38% idle{idle: cpu1}
   17 root           -8    -     0K   192K arc_re  1   0:09  14.79% zfskern{arc_reclaim_thread}
   11 root          155 ki31     0K   128K RUN     6   0:28  14.45% idle{idle: cpu6}
   11 root          155 ki31     0K   128K RUN     4   0:27  14.36% idle{idle: cpu4}
   11 root          155 ki31     0K   128K RUN     2   0:28  14.06% idle{idle: cpu2}
   11 root          155 ki31     0K   128K RUN     3   0:28  13.77% idle{idle: cpu3}
 1513 root           52    0 13024K  3968K nanslp  4   0:00   0.49% sh
   15 root          -16    -     0K    48K -       3   0:00   0.10% cam{doneq1}
    0 root          -16    -     0K  9024K swapin  4   0:28   0.00% kernel{swapper}
    0 root            8    -     0K  9024K -       3   0:00   0.00% kernel{thread taskq}
   12 root          -96    -     0K   624K WAIT    5   0:00   0.00% intr{irq281: hdac1}
   12 root          -60    -     0K   624K WAIT    0   0:00   0.00% intr{swi4: clock (0)}
   15 root          -16    -     0K    48K -       6   0:00   0.00% cam{scanner}
------------------------------------------------------------------------------------------------

I have 16GB of RAM and ARC is not limited - so there should be no memory pressure.

That behaviour does not exist on 11-STABLE (r326088): there, the two threads mentioned reach only about 1-2% at most (similar CPU).

Any ideas how to fix that?
Comment 1 Nils Beyer 2017-12-04 13:59:19 UTC
Addendum: even a simple "svnlite up" in "/usr/src" or "/usr/ports" drives "arc_reclaim_thread" up to 70% load...
Comment 2 Nils Beyer 2017-12-04 14:09:35 UTC
On my colleague's notebook running 12-CURRENT r325310 the behaviour isn't as pronounced; during "svnlite up" his "arc_reclaim_thread" thread rises only to 6-8% load - still higher than on 11-STABLE, though...
Comment 3 Nils Beyer 2017-12-04 20:19:48 UTC
Ok, I've tested several revisions and classified the results as good or bad:
--------------------------------------------------------------------
326398 bad
326390 bad
326362 bad
326347 bad
326346 good
326343 good
326333 good
326286 good
--------------------------------------------------------------------

So, starting with commit r326347 - "Eliminate kmem_arena and kmem_object in preparation for further NUMA commits." - my system here freaks out...
Comment 4 Mark Linimon freebsd_committer freebsd_triage 2017-12-04 21:37:54 UTC
Cc: committer of r326346.
Comment 5 Nikolai Lifanov freebsd_committer 2017-12-05 14:45:19 UTC
I can also reproduce this issue. In addition, my ARC size now hovers around arc_min. I expect >90G uncompressed and I get <3G. This leads to terrible Poudriere performance.
Comment 6 Nils Beyer 2017-12-05 14:51:35 UTC
(In reply to Nikolai Lifanov from comment #5)

I can confirm the low ARC usage as well (now at 1GB with an unlimited ARC and 16GB of RAM). It seems parts of the ARC are being thrown away every millisecond or something...
Comment 7 Cy Schubert freebsd_committer 2017-12-07 02:59:59 UTC
I've noticed this before r326346. Probably a week before r326346. I haven't been able to point to any commit yet.
Comment 8 Cy Schubert freebsd_committer 2017-12-07 03:19:12 UTC
I will try to revert r326347 and see if it makes a difference.
(In reply to Cy Schubert from comment #7)

I think I got my dates wrong. It may be r326347. I will revert and try.
Comment 9 Cy Schubert freebsd_committer 2017-12-07 05:14:32 UTC
I can confirm that reverting r326347 resolves the issue. My zpools no longer thrash. Performance, due to greatly reduced I/O, has vastly improved.

So far so good:

 last pid: 64954;  load averages:  1.87,  1.86,  2.05    up 0+00:58:45  21:12:59
71 processes:  3 running, 68 sleeping
CPU:  0.0% user, 64.0% nice, 33.3% system,  0.2% interrupt,  2.6% idle
Mem: 1285M Active, 1367M Inact, 4489M Wired, 650M Buf, 800M Free
ARC: 1997M Total, 356M MFU, 1313M MRU, 8244K Anon, 23M Header, 298M Other
     1302M Compressed, 3372M Uncompressed, 2.59:1 Ratio
Swap: 20G Total, 20G Free
   
  PID USERNAME       THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAN
64687 root             1  52   10 10604K  1996K select  1   0:00   1.63% make
60776 root             1  52   10 14460K  6092K select  1   0:00   0.41% make
33699 root             1  20    0 14312K  5308K select  0   0:00   0.22% screen
60761 root             1  46   10 10984K  2520K select  1   0:00   0.14% make
61698 root             1  20    0 13920K  4124K CPU1    1   0:00   0.11% top
33888 root             1  30   10 10196K  1500K select  0   0:00   0.11% make
64686 root             1  52   10 11692K  3180K wait    0   0:00   0.11% sh
33914 root             1  30   10 10996K  2520K select  0   0:00   0.10% make
 2712 root             1  20    0 14352K  5624K select  1   0:00   0.02% devd
 3166 uucp             1  20    0 12288K  3856K nanslp  1   0:00   0.01% upsmon
 3127 root             1  20    0 10860K  2456K select  0   0:00   0.01% powerd
 3124 root             2  20    0 19264K 19396K select  1   0:00   0.01% ntpd
 2795 root             1  20    0 11400K  2616K select  0   0:00   0.01% syslog
 3358 root             1  20    0 15748K  6640K select  1   0:00   0.00% sendma


And a different machine:

last pid: 81498;  load averages:  1.86,  1.88,  2.04    up 0+00:59:45  21:13:59
62 processes:  2 running, 60 sleeping
CPU:  0.4% user, 13.9% nice, 19.4% system,  0.6% interrupt, 65.7% idle
Mem: 1291M Active, 1381M Inact, 4490M Wired, 650M Buf, 778M Free
ARC: 2031M Total, 349M MFU, 1331M MRU, 8254K Anon, 24M Header, 319M Other
     1313M Compressed, 3410M Uncompressed, 2.60:1 Ratio
Swap: 20G Total, 20G Free
   
  PID USERNAME       THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAN
81192 root             1  83   10 19976K 11788K CPU1    1   0:00   5.95% make
33699 root             1  20    0 14312K  5308K select  1   0:01   0.37% screen
33888 root             1  30   10 10196K  1500K select  0   0:00   0.29% make
81169 root             1  52   10 14320K  5936K select  0   0:00   0.22% make
33914 root             1  30   10 11016K  2524K select  0   0:00   0.16% make
61698 root             1  20    0 13920K  4124K CPU0    0   0:00   0.10% top
 2712 root             1  20    0 14352K  5624K select  0   0:00   0.02% devd
 3127 root             1  20    0 10860K  2456K select  1   0:00   0.01% powerd
 3166 uucp             1  20    0 12288K  3856K nanslp  1   0:00   0.01% upsmon
 3124 root             2  20    0 19264K 19396K select  0   0:00   0.01% ntpd
 2795 root             1  20    0 11400K  2616K select  0   0:00   0.01% syslog
 3358 root             1  20    0 15748K  6640K select  0   0:00   0.00% sendma
25899 root             1  20    0 12876K  3684K select  0   0:00   0.00% klogin
33698 root             1  20    0 12392K  3648K pause   1   0:00   0.00% screen
Comment 10 Cy Schubert freebsd_committer 2017-12-07 05:25:29 UTC
During a buildworld -DNO_CLEAN (second machine):

                       Total     MFU     MRU    Anon     Hdr   L2Hdr   Other
     ZFS ARC           2001M    464M   1300M   8695K     30M      0K    198M

                                rate    hits  misses   total hits total misses
     arcstats                  : 71%     660     258       274342       697629
     arcstats.demand_data      : 67%     218     107        96070        81229
     arcstats.demand_metadata  : 25%      44     129        43412       569950
     arcstats.prefetch_data    :  0%       0      22        10019        27947
     arcstats.prefetch_metadata:100%     398       0       124841        18503
     zfetchstats               :  4%      99    2013        28345      3844648
     arcstats.l2               :  0%       0       0            0            0
     vdev_cache_stats          : 58%       7       5         9786        19927



I'm getting pretty healthy hit rates, whereas with r326347 the hit rates were in the low single digits.
Comment 11 Cy Schubert freebsd_committer 2017-12-07 05:50:53 UTC
last pid: 23981;  load averages:  1.85,  1.80,  1.61    up 0+00:48:45  21:50:20
133 processes: 4 running, 129 sleeping
CPU:  4.8% user, 45.2% nice, 27.7% system,  0.2% interrupt, 22.1% idle
Mem: 1509M Active, 1341M Inact, 37M Laundry, 3593M Wired, 178M Buf, 1281M Free
ARC: 1324M Total, 20M MFU, 653M MRU, 134M Anon, 19M Header, 498M Other
     154M Compressed, 582M Uncompressed, 3.78:1 Ratio
Swap: 12G Total, 12G Free

  PID USERNAME       THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAN
 2658 cy               6  23    0 92956K 45940K select  3   1:50  10.70% mate-t
19465 root             1  25    0 56596K 47756K zio->i  2   0:09  10.37% rsync
 2503 root             1  24    0 67200K 43964K select  2   1:47   9.91% Xorg
20179 root             1  24    0 56192K 47576K db->db  2   0:07   9.54% rsync
23948 root             1  52   10 10920K  2036K select  1   0:00   3.42% make
19772 root             1  20    0 11040K  2656K select  1   0:00   0.81% moused
23213 root             1   4   10 12992K  4312K select  0   0:00   0.21% make
12248 root             1   4   10 18088K  9680K select  3   0:01   0.17% make
23492 root             1  52   10 10560K  1804K select  1   0:00   0.14% make
23860 cy               1  20    0 13940K  4152K CPU2    2   0:00   0.11% top
 2591 cy               1  20    0 16804K  6684K select  2   0:02   0.09% xscree
23947 root             1  52   10 11688K  3176K wait    0   0:00   0.09% sh
 2560 cy               1  20    0 21684K  9336K select  2   0:02   0.09% mwm
 3205 root             1  30   10 11084K  2392K select  3   0:00   0.08% make
72684 root             1  30   10 10224K  1388K select  0   0:01   0.08% make

MFU should be much higher.
Comment 12 Cy Schubert freebsd_committer 2017-12-07 06:03:24 UTC
r326346 also needs to be backed out. Will try that too.
Comment 13 Cy Schubert freebsd_committer 2017-12-07 06:12:15 UTC
Looking at the workloads of my four systems, I'm not entirely convinced reverting r326346 will make as much difference as removing r326347 did. Though the ARC is more in line with what it was before r326347, MFU fluctuates widely, from 55M to 500M - better than the single-digit MB before. I will see if reverting r326346 makes a difference too.

last pid: 50534;  load averages:  1.32,  0.94,  0.96    up 0+01:05:30  22:07:05
123 processes: 2 running, 121 sleeping
CPU:  1.6% user,  4.2% nice,  5.0% system,  0.6% interrupt, 88.7% idle
Mem: 1018M Active, 1310M Inact, 3670M Wired, 178M Buf, 1763M Free
ARC: 1327M Total, 177M MFU, 651M MRU, 15M Anon, 24M Header, 460M Other
     283M Compressed, 943M Uncompressed, 3.33:1 Ratio
Swap: 12G Total, 12G Free

  PID USERNAME       THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAN
 2503 root             1  24    0 58704K 35612K select  1   2:16   7.25% Xorg
 2658 cy               5  22    0 93644K 46620K CPU0    0   2:16   6.99% mate-t
50487 root             1   4   10 10476K  1836K select  2   0:00   1.53% make
49701 root             1  52   10 17436K  9024K select  1   0:00   0.16% make
 2591 cy               1  20    0 16804K  6684K select  2   0:02   0.08% xscree
50489 root             1  52   10 11668K  3172K wait    3   0:00   0.08% sh
 2560 cy               1  20    0 21684K  9336K select  0   0:03   0.07% mwm
23860 cy               1  20    0 14168K  4388K CPU3    3   0:01   0.07% top
50486 root             1  52   10 11688K  3176K wait    2   0:00   0.07% sh
25264 root             1  30   10 10224K  1388K select  1   0:00   0.07% make
48259 root             1  30   10 10956K  2336K select  2   0:00   0.06% make
46730 root             1  30   10 11036K  2344K select  1   0:00   0.06% make
 3041 cy               1  20    0 12708K  3704K select  0   0:01   0.05% krlogi
25290 root             1  30   10 11128K  2416K select  2   0:00   0.05% make
49044 root             1  20    0 11040K  2656K select  0   0:00   0.04% moused
Comment 14 Cy Schubert freebsd_committer 2017-12-07 07:06:17 UTC
So far so good (machine 1):

last pid: 47502;  load averages:  1.60,  1.92,  1.69    up 0+00:34:24  22:57:38
51 processes:  2 running, 49 sleeping
CPU:  0.6% user,  0.0% nice,  9.1% system,  0.2% interrupt, 90.2% idle
Mem: 295M Active, 1858M Inact, 4233M Wired, 645M Buf, 1555M Free
ARC: 1844M Total, 516M MFU, 980M MRU, 5136K Anon, 17M Header, 325M Other
     1121M Compressed, 2912M Uncompressed, 2.60:1 Ratio
Swap: 20G Total, 20G Free
   
  PID USERNAME       THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAN
47500 root             1  30    0 12268K  3628K CPU1    1   0:04  11.72% cpdup
18042 root             1  20    0 14132K  4340K CPU0    0   0:01   0.08% top
 3127 root             2  20    0 19264K 19396K select  1   0:00   0.07% ntpd
 3570 root             1  20    0 14312K  5184K select  0   0:06   0.01% screen
 3130 root             1  20    0 10860K  2456K select  0   0:00   0.01% powerd
 3123 root             9  52    0 29220K  5564K uwait   1   0:00   0.01% nscd
99226 root             1  20    0 12876K  3684K select  1   0:00   0.00% klogin
 2911 root             1  20    0 12064K  3192K select  0   0:03   0.00% ypserv
 2795 root             1  20    0 11348K  2564K select  0   0:00   0.00% syslog
 2920 root             1  20    0 13448K 13536K select  1   0:00   0.00% amd
 2712 root             1  20    0 14352K  5624K select  1   0:00   0.00% devd
 3932 named            5  52    0 56212K 38848K sigwai  1   0:00   0.00% named
 3169 uucp             1  20    0 12288K  3860K connec  1   0:00   0.00% upsmon
 3359 root             1  20    0 15748K  6640K select  1   0:00   0.00% sendma
Comment 15 Cy Schubert freebsd_committer 2017-12-07 07:14:38 UTC
Machine 2, running buildworld -DNO_CLEAN and two rsyncs -- much better:

last pid:  4958;  load averages:  3.53,  1.93,  1.21    up 0+00:38:03  23:13:38
124 processes: 6 running, 118 sleeping
CPU:  1.9% user, 83.1% nice, 12.5% system,  0.0% interrupt,  2.6% idle
Mem: 1199M Active, 840M Inact, 44M Laundry, 5251M Wired, 81M Buf, 427M Free
ARC: 3059M Total, 639M MFU, 1797M MRU, 10M Anon, 52M Header, 561M Other
     1778M Compressed, 5101M Uncompressed, 2.87:1 Ratio
Swap: 12G Total, 12G Free

  PID USERNAME       THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAN
 4847 root             1  90   10    98M 68872K RUN     3   0:03  91.98% cc
 4949 root             1  85   10 92972K 61428K RUN     0   0:01  62.55% cc
 3818 root             1  26    0   242M   234M select  2   0:02  17.95% rsync
 4945 root             1  52   10 10568K  1952K select  2   0:00   8.38% make
 4321 root             1  25    0   269M   261M select  1   0:01   7.72% rsync
 2639 cy               5  21    0 93388K 45296K CPU3    3   0:51   3.28% mate-t
 2581 root             1  22    0 69376K 46348K select  0   0:55   2.47% Xorg
 2855 cy              55  22    0  7866M   468M select  3   0:45   0.62% firefo
 4948 root             1  52   10 70348K 38172K wait    0   0:00   0.40% cc
 4825 root             1   4   10 11052K  2460K select  0   0:00   0.27% make
 3824 root             1  20    0 11004K  2504K sbwait  2   0:00   0.15% rsh
 3459 root             1  20    0 11040K  2656K select  0   0:00   0.14% moused
 4941 root             1  52   10 11732K  3196K wait    2   0:00   0.12% sh
 4944 root             1  52   10 11696K  3184K wait    3   0:00   0.12% sh
 3367 root             1  52   10 28328K 20252K select  1   0:01   0.11% make
Comment 16 Cy Schubert freebsd_committer 2017-12-07 08:39:15 UTC
I've done some testing with r326347 reverted and with r326346 + r326347 reverted. My tests included I/O-intensive operations such as buildworld/buildkernel -DNO_CLEAN and parallel rsyncs. Reverting r326346 + r326347 showed no significant improvement over reverting only r326347 (below). r326347 definitely caused the issue. Note the MFU figure. Performance is significantly better due to improved ARC caching.

last pid:  2156;  load averages:  2.56,  2.32,  1.66    up 0+00:25:52  00:32:57
89 processes:  4 running, 85 sleeping
CPU:  1.3% user, 75.8% nice, 20.5% system,  0.8% interrupt,  1.7% idle
Mem: 390M Active, 602M Inact, 5631M Wired, 63M Buf, 1317M Free
ARC: 3585M Total, 1106M MFU, 2029M MRU, 14M Anon, 60M Header, 376M Other
     2734M Compressed, 6084M Uncompressed, 2.23:1 Ratio
Swap: 20G Total, 20G Free

  PID USERNAME       THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAN
63633 root             1  52    0 36720K 27164K RUN     1   2:28   7.38% pkg
 2136 root             1   4   10 10548K  1932K select  0   0:00   6.32% make
 1909 root             1   4   10 11808K  3144K select  1   0:00   0.36% make
 2096 root             1  24    0 13804K  3952K CPU1    1   0:00   0.11% top
 2135 root             1  52   10 11700K  3124K wait    1   0:00   0.11% sh
 3598 root             1  20    0 14312K  5216K select  0   0:02   0.10% screen
98994 root             1   4   10 28516K 20320K select  1   0:01   0.10% make
 1828 root             1  52   10 10404K  1608K select  1   0:00   0.07% make
61839 root             1  30   10 10196K  1324K select  1   0:00   0.05% make
 1906 root             1   4   10 10360K  1468K select  1   0:00   0.05% make
98907 root             1  30   10 10980K  2340K select  1   0:00   0.04% make
99013 root             1  30   10 23784K 15508K select  1   0:00   0.04% make
 2712 root             1  20    0 14352K  5624K select  1   0:00   0.02% devd
 3137 root             1  20    0 10860K  2456K select  0   0:00   0.01% powerd
Comment 17 Nikolai Lifanov freebsd_committer 2017-12-07 15:30:57 UTC
I can definitely confirm that running stock r326658 with r326347 reverted fixes the ARC under-use for me.
Comment 18 Mark Johnston freebsd_committer 2017-12-07 18:54:22 UTC
I think this diff will address the problem: https://reviews.freebsd.org/D13412
Comment 19 Nikolai Lifanov freebsd_committer 2017-12-07 19:38:31 UTC
(In reply to Mark Johnston from comment #18)

I confirmed that the fix in D13412 resolves the problem for me.
Comment 20 commit-hook freebsd_committer 2017-12-07 19:38:37 UTC
A commit references this bug:

Author: markj
Date: Thu Dec  7 19:38:09 UTC 2017
New revision: 326664
URL: https://svnweb.freebsd.org/changeset/base/326664

Log:
  Fix the UMA reclaim worker after r326347.

  atomic_set_*() sets a bit in the target memory location, so
  atomic_set_int(&uma_reclaim_needed, 0) does not do what it looks like
  it does.

  PR:		224080
  Reviewed by:	jeff, kib
  Differential Revision:	https://reviews.freebsd.org/D13412

Changes:
  head/sys/vm/uma_core.c
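
To illustrate the log message: the atomic_set_* primitives perform a bitwise OR of the given value into the target, so OR-ing in 0 changes nothing. A minimal sketch of the distinction (illustrative only, not the committed diff; atomic_store_rel_int is just one way to clear such a flag):

static volatile u_int uma_reclaim_needed;

        atomic_set_int(&uma_reclaim_needed, 1);       /* flag |= 1: sets the flag */
        atomic_set_int(&uma_reclaim_needed, 0);       /* flag |= 0: a no-op */
        atomic_store_rel_int(&uma_reclaim_needed, 0); /* actually clears the flag */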
Comment 21 Nils Beyer 2017-12-07 21:22:21 UTC
(In reply to commit-hook from comment #20)

Thanks for the commit. Unfortunately, the issue is still not solved for me. How to repeat:

1) freshly reboot FreeBSD 12-CURRENT r326668M
2) login as "root"
3) start "top -s 1 -P -H -S"
4) switch to another VTY
5) login as "root"
6) start "svnlite up /usr/src"
7) switch back to VTY where "top" is running

"arc_reclaim_thread" raises up to 70% CPU load...
Comment 22 Nils Beyer 2017-12-07 21:23:18 UTC
(In reply to Nils Beyer from comment #21)

and total ARC usage stays under 1GB even though "vfs.zfs.arc_max" is set to 16GB...
Comment 23 Jeff Roberson freebsd_committer 2017-12-07 21:27:47 UTC
Can you tell me what the following two sysctls hold:

vm.kmem_map_free
vm.kmem_map_size

and then run this for a few seconds:

dtrace -n 'fbt::uma_reclaim_wakeup:entry { @[stack()] = count (); }'
Comment 24 Nils Beyer 2017-12-07 21:40:20 UTC
(In reply to Jeff Roberson from comment #23)

--------------------------------------------------------------------------------
vm.kmem_map_free: 15830609920
vm.kmem_map_size: 801755136

#dtrace -n 'fbt::uma_reclaim_wakeup:entry { @[stack()] = count (); }'
dtrace: description 'fbt::uma_reclaim_wakeup:entry ' matched 1 probe
^C (after svnlite up finishes)
--------------------------------------------------------------------------------

I have to add that I'm not using any debug options in my kernel config...
Comment 25 Mark Johnston freebsd_committer 2017-12-07 21:55:06 UTC
I see another problem in arc_available_memory(): we're using vmem_size(kernel_arena, VMEM_FREE) to determine whether to reap memory. It used to be vmem_size(kmem_arena), and kmem_arena was initialized with vm_kmem_size bytes. But kernel_arena is grown on demand, so it's now bogus to use the arena's size to determine whether to reclaim memory.
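
A rough sketch of the pattern being described (not the verbatim arc.c code; the threshold name is an illustrative assumption):

        /* Pre-r326347: kmem_arena had a fixed size (vm_kmem_size), so its
         * free space was a meaningful low-memory signal. */
        free = vmem_size(kmem_arena, VMEM_FREE);

        /* Post-r326347: kernel_arena grows on demand, so its VMEM_FREE can
         * look tiny even when plenty of KVA remains, and a test like this
         * fires far too eagerly: */
        free = vmem_size(kernel_arena, VMEM_FREE);
        if (free < lowmem_thresh)
                /* reap/reclaim triggered spuriously */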
Comment 26 Jeff Roberson freebsd_committer 2017-12-07 22:55:58 UTC
We can use the new UMA limit facility. I can produce a diff in a couple of hours. However, this is essentially the same thing that kmem_map_free already does:

static int
sysctl_kmem_map_free(SYSCTL_HANDLER_ARGS)
{
        u_long size, limit;

        /* The sysctl is unsigned, implement as a saturation value. */
        size = uma_size();
        limit = uma_limit();
        if (size > limit)
                size = 0;
        else
                size = limit - size;
        return (sysctl_handle_long(oidp, &size, 0, req));
}

We should add a UMA accessor to compute this.
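
A minimal sketch of such an accessor, assuming it mirrors the saturation logic above (the fix that eventually landed adds a function named uma_avail(); the exact body here is an assumption):

long
uma_avail(void)
{
        long size, limit;

        size = uma_size();
        limit = uma_limit();
        /* Saturate at zero rather than going negative past the limit. */
        return (size > limit ? 0 : limit - size);
}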
Comment 27 Cy Schubert freebsd_committer 2017-12-08 14:53:09 UTC
Confirmed. The latest patch partially fixes the problem.

Mem: 186M Active, 60M Inact, 2123M Wired, 103M Buf, 2572M Free
ARC: 1012M Total, 368M MFU, 487M MRU, 11M Anon, 19M Header, 128M Other


On my sandbox machine (5 GB RAM) (let's call it machine 3 now).

bob<1># sysctl vm.kmem_map_free
vm.kmem_map_free: 1089998848
bob<1># sysctl vm.kmem_map_size
vm.kmem_map_size: 520613888
bob<1># sysctl hw.physmem 
hw.physmem: 5346369536
bob<1># 

bob# dtrace -n 'fbt::uma_reclaim_wakeup:entry { @[stack()] = count (); }'
dtrace: description 'fbt::uma_reclaim_wakeup:entry ' matched 1 probe

There is no output.

I'll get a chance to test more fully on another machine (machine 2 from before) tonight. I've set up a test partition to boot from to allow for quick testing.
Comment 28 Cy Schubert freebsd_committer 2017-12-09 01:44:00 UTC
Continuing to test on my sandbox (other machines are still busy on other builds):

ARC: 429M Total, 9067K MFU, 262M MRU, 2618K Anon, 4246K Header, 151M Other
     ^^^^        ^^^^^

buildworld/buildkernel -DNO_CLEAN and tar of /usr/src in progress.

ARC: 445M Total, 3880K MFU, 281M MRU, 2288K Anon, 3056K Header, 156M Other
     ^^^^        ^^^^^

I unset the arbitrary kmem_size_max and kmem_size limits. They are now:

bob<2># sysctl vm.kmem_size    
vm.kmem_size: 5181820928
bob<2># sysctl vm.kmem_size_max
vm.kmem_size_max: 1319413950874
bob<2>#
Comment 29 Cy Schubert freebsd_committer 2017-12-09 18:27:52 UTC
(In reply to Jeff Roberson from comment #26)
I've applied a patch similar to the one discussed here. There is some improvement, but more testing, and comparison against a kernel with r326347 removed, is needed.

On my two 8 GB machines (with zfs arc tuning removed from loader.conf) I get:

2 core + 2 threads:
CPU:  0.9% user, 95.9% nice,  3.0% system,  0.0% interrupt,  0.2% idle
Mem: 1500M Active, 318M Inact, 33M Laundry, 1762M Wired, 154M Buf, 4148M Free
ARC: 840M Total, 141M MFU, 545M MRU, 16M Anon, 12M Header, 127M Other
     288M Compressed, 893M Uncompressed, 3.10:1 Ratio

slippy# sysctl vm.kmem_size
vm.kmem_size: 8138756096
slippy# sysctl vm.kmem_size_max
vm.kmem_size_max: 1319413950874
slippy# sysctl hw.physmem
hw.physmem: 8365903872
slippy# 
slippy$ sysctl vfs.zfs.arc_max  
vfs.zfs.arc_max: 7065014272
slippy$ 

2 core:
CPU:  0.0% user, 94.9% nice,  4.1% system,  0.6% interrupt,  0.4% idle
Mem: 752M Active, 269M Inact, 1705M Wired, 111M Buf, 5215M Free
ARC: 862M Total, 218M MFU, 456M MRU, 15M Anon, 15M Header, 156M Other
     343M Compressed, 1079M Uncompressed, 3.15:1 Ratio


cwsys# sysctl vm.kmem_size
vm.kmem_size: 8325849088
cwsys# sysctl vm.kmem_size_max
vm.kmem_size_max: 1319413950874
cwsys# sysctl hw.physmem
hw.physmem: 8566394880
cwsys# 
cwsys$ sysctl vfs.zfs.arc_max  
vfs.zfs.arc_max: 7252107264
cwsys$ 

It appears the ARC is being limited to ~840-860 MB on the 8 GB machines.

The 5 GB machine (2 core):
CPU:  0.0% user, 95.1% nice,  4.7% system,  0.2% interrupt,  0.0% idle
Mem: 590M Active, 147M Inact, 1126M Wired, 101M Buf, 3077M Free
ARC: 488M Total, 50M MFU, 334M MRU, 6630K Anon, 5713K Header, 91M Other
     107M Compressed, 431M Uncompressed, 4.02:1 Ratio

bob# sysctl vm.kmem_size
vm.kmem_size: 5180735488
bob# sysctl vm.kmem_size_max
vm.kmem_size_max: 1319413950874
bob# sysctl hw.physmem
hw.physmem: 5345284096
bob# 
bob# sysctl vfs.zfs.arc_max  
vfs.zfs.arc_max: 4106993664
bob# 


We might want to consider reverting r326347 until we can fix this.
Comment 30 Cy Schubert freebsd_committer 2017-12-09 21:58:53 UTC
With r326347 removed and the kmem_size, kmem_size_max and arc_max tunables removed from loader.conf:

Laptop with 8 GB, running two tar processes each over /usr/src and /usr/obj, plus buildworld/buildkernel -DNO_CLEAN and firefox:

CPU:  2.7% user, 95.0% nice,  2.1% system,  0.1% interrupt,  0.0% idle
Mem: 617M Active, 67M Inact, 204M Laundry, 6505M Wired, 144M Buf, 369M Free
ARC: 4879M Total, 1830M MFU, 2630M MRU, 21M Anon, 54M Header, 343M Other
     3983M Compressed, 10G Uncompressed, 2.69:1 Ratio
Swap: 12G Total, 382M Used, 12G Free, 3% Inuse, 128K In

slippy$ sysctl vm.kmem_size    
vm.kmem_size: 8138752000
slippy$ sysctl vm.kmem_size_max
vm.kmem_size_max: 1319413950874
slippy$ sysctl hw.physmem
hw.physmem: 8365899776
slippy$ sysctl vfs.zfs.arc_max 
vfs.zfs.arc_max: 7065010176
slippy$ 


Server machine in the basement with 8 GB, also running two tar processes each over /usr/src and /usr/obj plus buildworld/buildkernel -DNO_CLEAN, no firefox:

CPU:  0.0% user, 97.0% nice,  2.6% system,  0.4% interrupt,  0.0% idle
Mem: 257M Active, 246M Inact, 36K Laundry, 6690M Wired, 110M Buf, 746M Free
ARC: 5153M Total, 1262M MFU, 3472M MRU, 20M Anon, 53M Header, 346M Other
     4364M Compressed, 11G Uncompressed, 2.51:1 Ratio
Swap: 20G Total, 95M Used, 20G Free

cwsys$ sysctl vm.kmem_size    
vm.kmem_size: 8325844992
cwsys$ sysctl vm.kmem_size_max
vm.kmem_size_max: 1319413950874
cwsys$ sysctl hw.physmem
hw.physmem: 8566390784
cwsys$ sysctl vfs.zfs.arc_max  
vfs.zfs.arc_max: 7252103168
cwsys$ 

This is also with the additional patch discussed in comment #26 removed.
Comment 31 Jeff Roberson freebsd_committer 2017-12-13 00:31:22 UTC
Created attachment 188780
Proposed fix

Please test this fix for arc_lowmem et al.
Comment 32 Cy Schubert freebsd_committer 2017-12-13 02:50:45 UTC
Currently running on four amd64 machines. Initial tests indicate this PR is fixed. Thanks Jeff.

I can test on an i386 machine this weekend.
Comment 33 Cy Schubert freebsd_committer 2017-12-13 03:06:25 UTC
On i386 this probably needs a cast. 

/opt/src/svn-current/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c:6115:21: error: ordered comparison between pointer and integer ('uint64_t' (aka 'unsigned long long') and 'long (*)(void)') [-Werror]
        available_memory = MIN(available_memory, uma_avail);
                           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/opt/src/svn-current/sys/sys/param.h:305:23: note: expanded from macro 'MIN'
#define MIN(a,b) (((a)<(b))?(a):(b))
                   ~~~^~~~
Comment 34 Cy Schubert freebsd_committer 2017-12-13 03:40:04 UTC
arc.c line 6115 should be:

available_memory = MIN(available_memory, uma_avail());
                                                  ^^

That'll fix i386 build.
Comment 35 Nils Beyer 2017-12-13 08:33:13 UTC
(In reply to Jeff Roberson from comment #31)

thank you.

Confirmed - with this patch, "arc_reclaim_thread" stays quiet and ARC usage rises accordingly. Disk usage has become normal and CPU power goes into user processes.

"arclowmem" tests anyone?
Comment 36 Jeff Roberson freebsd_committer 2017-12-13 21:43:38 UTC
I can confirm that uma_avail) should've been uma_avail()) on all architectures. pho has reported that this patch gives him performance identical to the code pre-dating the change.
Comment 37 Mark Millard 2017-12-14 04:49:41 UTC
There are also the rpi2- and UFS-related mailing list submittals about
-r326347 causing boot hangs, starting with:

https://lists.freebsd.org/pipermail/freebsd-hackers/2017-December/051939.html

and ending with:

https://lists.freebsd.org/pipermail/freebsd-hackers/2017-December/051947.html

These are tied to boot hangs, with the kernel
thread (100001) involving:

. . .
uma_zcreate() at uma_zcreate+0x10c
         pc = 0xc052b024  lr = 0xc050cc90 (ffs_mount+0x80)
         sp = 0xd42a5858  fp = 0xd42a5980
         r4 = 0xd7844000  r5 = 0x00000000
         r6 = 0x00000003  r7 = 0xc09a99c0
         r8 = 0x00000000  r9 = 0xd429f000
        r10 = 0xd828c400
ffs_mount() at ffs_mount+0x80
         pc = 0xc050cc90  lr = 0xc032e1d4 (vfs_donmount+0xeec)
         sp = 0xd42a5988  fp = 0xd42a5b38
         r4 = 0xffffffff  r5 = 0xd74ec120
         r6 = 0xd429f000  r7 = 0x00000000
         r8 = 0x00000000  r9 = 0xc425acd0
        r10 = 0xd828c400
. . .

and the uma thread involving:

. . .
uma_reclaim_locked() at uma_reclaim_locked+0x200
         pc = 0xc052d9f8  lr = 0xc052de70 (uma_reclaim_worker+0x4c)
         sp = 0xd87dadf0  fp = 0xd87dae20
         r4 = 0xc09a9c68  r5 = 0xc4242d80
         r6 = 0xc06df86e  r7 = 0x00000000
         r8 = 0x00000100  r9 = 0xc09b3d08
        r10 = 0xc09a9c7c
uma_reclaim_worker() at uma_reclaim_worker+0x4c
         pc = 0xc052de70  lr = 0xc0230f10 (fork_exit+0xa0)
         sp = 0xd87dae28  fp = 0xd87dae40
         r4 = 0xd7ad8740  r5 = 0xd70ce000
         r6 = 0xc052de24  r7 = 0x00000000
         r8 = 0xd87dae48  r9 = 0xd7ada3a0
        r10 = 0xc0824e8c
. . .

(extracted from the last of those messages).

Bisecting was used to isolate -r326347
as the revision where the problem started.
Comment 38 Cy Schubert freebsd_committer 2017-12-14 04:59:37 UTC
(In reply to Mark Millard from comment #37)
It probably makes sense to track this as a separate incident. It's easier to find this through keywords. What do you think?
Comment 39 Mark Millard 2017-12-14 05:14:01 UTC
(In reply to Cy Schubert from comment #38)

I can extract from the prior list messages and make a submittal
if desired.

There is also a complaint for pine64 from Emmanuel Vadot:

https://lists.freebsd.org/pipermail/svn-src-head/2017-December/106954.html

reporting that as of -r326347:

 The board just hang when plugin usb device

That would seem to match my rpi2 reports: the rpi2
I was using has a USB SSD on a powered hub for the
root file system.

(Recently the mmcsd0 mechanism that holds in the
sdcard broke, though the eject mechanism still
works, so I hold the sdcard in during early boot.
But I've used a USB SSD for such small boards for
a long time.)
Comment 40 Mark Millard 2017-12-14 05:46:14 UTC
(In reply to Cy Schubert from comment #38)

I've submitted bugzilla 224330 covering Emmanuel Vadot's
and my list messages tied to USB on pine64
(Emmanuel) and rpi2 (me).

Mostly I referenced the list messages instead of
copying over the information.

I do not have access to the Keywords field there.
Nor to See Also here in bugzilla 224080.
Comment 41 Mark Millard 2017-12-14 06:08:39 UTC
Another list report combining -r326347 and ZFS
is part of the material in this pair:

https://lists.freebsd.org/pipermail/freebsd-current/2017-November/067613.html
https://lists.freebsd.org/pipermail/freebsd-current/2017-November/067618.html

These folks might be able to test whether their
problems have been fixed once things are ready
for such testing.
Comment 42 Jeff Roberson freebsd_committer 2017-12-14 21:32:55 UTC
The first report says r326347 was the last known good kernel. There may be other unrelated bugs.
Comment 43 Mark Millard 2017-12-14 22:08:47 UTC
(In reply to Jeff Roberson from comment #42)

You may well be correct about unrelated bugs. But. . .

Looks like I parse the 067613.html report differently
than you do:

A) First it says:

    Most recent CURRENT, as with r326363, crashes on booting the kernel

This is different from what it later reports about -r326347, as I
parse the statements.

B) Then it says:

    The last known half-ways good kernel is 12.0-CURRENT #157 r326347

Note the "half-ways" part. It then continues with what I
interpreted as a description of "half-ways" in a booted
-r326347 context (so, not for -r326363).

    When performing buildworld/buildkernel/relaese/package
    or poudriere on a ZFS filesystem, I can now bring down
    the system almost in a reproducible way.

I could be wrong, but I do not see that as a report of
"the last known good kernel".
Comment 44 Jeff Roberson freebsd_committer 2017-12-14 22:17:12 UTC
I have sent the patch and a short description to the person who made the report.  I agree I may have misparsed.
Comment 45 commit-hook freebsd_committer 2018-01-02 04:37:04 UTC
A commit references this bug:

Author: jeff
Date: Tue Jan  2 04:35:57 UTC 2018
New revision: 327485
URL: https://svnweb.freebsd.org/changeset/base/327485

Log:
  Fix arc after r326347 broke various memory limit queries.  Use UMA features
  rather than kmem arena size to determine available memory.

  Initialize the UMA limit to LONG_MAX to avoid spurious wakeups on boot before
  the real limit is set.

  PR:		224330 (partial), 224080
  Reviewed by:	markj, avg
  Sponsored by:	Netflix / Dell EMC Isilon
  Differential Revision:	https://reviews.freebsd.org/D13494

Changes:
  head/sys/cddl/compat/opensolaris/sys/kmem.h
  head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c
  head/sys/vm/uma.h
  head/sys/vm/uma_core.c
  head/sys/vm/uma_int.h
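
The boot-ordering point in the log is worth spelling out: if the limit defaults to 0, any "size over limit" test is true before initialization and the reclaim worker wakes spuriously. A hedged sketch of the idea (the variable name is an assumption, not necessarily the committed code):

/* Effectively unlimited until the real limit is set during boot. */
static unsigned long uma_kmem_limit = LONG_MAX;

        /* With the LONG_MAX default this can no longer fire before init. */
        if (uma_size() > uma_kmem_limit)
                uma_reclaim_wakeup();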