Bug 187594

Summary: [zfs] [patch] ZFS ARC behavior problem and fix
Product: Base System
Reporter: Karl Denninger <karl>
Component: kern
Assignee: Steven Hartland <smh>
Status: Closed
Resolution: FIXED
Severity: Affects Some People
CC: 214748mv, FreeBSD, Mark.Martinec+freebsd, Mark.Martinec, adrian, allanjude, avg, bdrewery, chris, crest, cy, dch, dpetrov67, emaste, feld, fk, fullermd, gibbs, grahamperrin, ian, jhay, jhb, junchoon, karl, ktk, mikej, mmoll, o.hushchenkov, ota, pi, rainer, rynunes, seanc, seanc, sef, sigsys, smh, swills, tablosazi.farahan, thierry, thomasrcurry, vanheugten, vhaisman, vsasjason, vsjcfm, will, xistence
Priority: Normal    
Version: 10.0-STABLE   
Hardware: Any   
OS: Any   
Attachments (all flagged "none"):
  file.diff
  smime.p7s (x10)
  refactor arc reclaim logic
  refactor arc reclaim logic
  arc reclaim refactor (against head)
  From comment #10
  ARC refactor patch less percentage reservation
  ARC refactor patch less percentage reservation for after r269846
  arc reclaim refactor (against head)
  arc reclaim refactor (against head)
  arc reclaim refactor (against head)
  arc reclaim refactor (against stable/10)
  arc reclaim refactor (against stable/10)
  arc reclam refactor (against head)
  arc reclaim refactor (against stable/10)
  arc reclam refactor (against head)
  arc reclaim refactor (against stable/10)
  arc reclaim refactor (against stable/10)
  arc reclam refactor (against head)
  arc reclaim refactor (against stable/10)
  ARC evict dtrace script
  ARC reap dtrace script
  ARC reclaim dtrace script
  VM lowmem probe
  ARC reclaim refactor (against stable/10)
  ARC resize dtrace script
  ARC reclaim refactor (against head)
  Patch to correct ZFS "freeze" ram contention problem (apply after Steve's latest 10-Stable patch above)
  See comment -- fix potential divide-by-zero panic
  Replace previous patch -- picked up wrong copy (extra parenthesis)
  Argh - didn't include all of Steve's other changes -- yet another try :-)
  Candidate Fix (supersedes previous)
  Replaces previous; results of further work with Steve
  Current working patch against 10-Stable-BETA2
  dtrace script for the above
  ARC reclaim refactor + uma clear down (against stable/10)
  ARC reclaim refactor + uma clear down (against stable/10)
  UMA summary script
  ARC reclaim refactor + uma clear down (against stable/10)
  ARC Refactor / UMA Cleardown / DMU_TX dynamic against 10.1-STABLE
  ARC Refactor / UMA Cleardown / DMU_TX dynamic against 10.2-BETA1
  ARC Refactor / UMA Cleardown / DMU_TX dynamic against 10.2-r285717 and later
  ARC Refactor / UMA Cleardown / DMU_TX dynamic against head after r263620
  ARC Refactor / UMA Cleardown / DMU_TX dynamic against head after r286570
  ARC Refactor / UMA Cleardown / DMU_TX dynamic against head after r286776
  ARC Refactor / UMA Cleardown / DMU_TX dynamic against stable/10 after r288599
  ARC Refactor / UMA Cleardown / DMU_TX dynamic with time-based rate limiting against stable/10 after r288599
  Patch for 10.2-STABLE r289078 including pagedaemon wakeup code
  Update of patch related to other bug report (see comment for details)
  First cut against 11.0-STABLE
  Dtrace script to track code execution in above patch against 11.0-STABLE
  Second cut against 11.0-STABLE
  Dtrace script to go with Second Cut on 11-STABLE
  Cleanup of second patch - no functional changes
  Patch against r324056 (11.1-STABLE) w/Phabricator D7538 improvements

Description Karl Denninger 2014-03-14 20:50:00 UTC
ZFS can be convinced to engage in pathological behavior due to a bad
low-memory test in arc.c

The offending file is at
/usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c; it allegedly
checks for 25% free memory and, if less is available, asks for the cache to shrink.

(snippet from arc.c around line 2494 of arc.c in 10-STABLE; path
/usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs)

#else /* !sun */
if (kmem_used() > (kmem_size() * 3) / 4)
return (1);
#endif /* sun */

Unfortunately these two functions do not return what the authors thought
they did.  It's clear what they're trying to do from the Solaris-specific
code up above this test.

The result is that the cache shrinks only when vm_paging_needed() tests
true, but by that time the system is already in serious memory trouble, and
triggering only at that point actually drives the system further into paging,
because the pager will not recall pages from swap until they are next
executed.  This leads the ARC to try to fill all the available RAM even
though pages have been pushed off onto swap.  Not good.

Fix: The following context diff corrects the problem.  If NEWRECLAIM is 
	defined (by default turned on once the patch is applied) we declare 
	and export a new tunable:

	vfs.zfs.arc_freepage_percent_target
	
	The default on zfs load is set to 25 percent, as was the intent
	of the original software.  You may tune this in real time with
	sysctl to suit your workload and machine's installed RAM; unlike
	setting "arc_max", which can only be done at boot, the target
	RAM consumption percentage is adaptive.

	Instead of the above code we then read the wired, active, inactive,
	cache and free page counts and sum all five to get a total, comparing
	the free count against it.  If free memory, as a percentage of that
	total, is below the target we declare memory constrained; otherwise
	we declare that it is not.  (A short sketch of this test appears at
	the end of this description.)

	We retain the paging check, but none of the other tests, when this
	option is enabled.

	A debugging flag called "NEWRECLAIM_DEBUG" is present in the code;
	if it is changed from "undef" to "define" at compile time it will cause
	printing of status changes (constrained vs. not) along with any
	picked-up changes in the target, in real time.  This should not
	be used in production.
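
	In rough outline, the substituted test works like the sketch below.
	This is illustrative only: the posted diff reads these counters via
	kernel_sysctlbyname(), while the sketch reads the stable/10 vmmeter
	("cnt") directly for brevity, and the function name is a placeholder,
	not an identifier from the patch.

	#include <sys/vmmeter.h>

	static int
	arc_reclaim_needed_sketch(int percent_target)
	{
		u_int vmtotal, vmfree;

		/* wired + active + inactive + cache + free = total pages */
		vmtotal = cnt.v_wire_count + cnt.v_active_count +
		    cnt.v_inactive_count + cnt.v_cache_count + cnt.v_free_count;
		vmfree = cnt.v_free_count;

		if (vmfree < (vmtotal / 100) * percent_target)
			return (1);	/* constrained: ask the ARC to shrink */
		return (0);		/* enough free RAM */
	}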


How-To-Repeat: 	Set up a cache-heavy workload on large (~terabyte sized or bigger)
	ZFS filesystems and note that free RAM drops to the point that
	starvation occurs, while "wired" memory pins at the maximum ARC
	cache size, even though you have other demands for RAM that should
	cause the ARC memory congestion control algorithm to evict some of
	the cache as demand rises.  It does not.
Comment 1 Mark Linimon freebsd_committer freebsd_triage 2014-03-15 21:21:18 UTC
Responsible Changed
From-To: freebsd-bugs->freebsd-fs

Over to maintainer(s).
Comment 2 Adam McDougall 2014-03-16 03:42:00 UTC
This is generally working well for me so far; I've been running it for over a day
on my desktop at home with only 4G RAM and I have not needlessly
swapped.  I generally have 1GB or more of free RAM now, although I also
decreased vfs.zfs.arc_freepage_percent_target to 15 because my ARC total
was pretty low.  At the moment I have 406M ARC and 1070M free while
Thunderbird and over a dozen Chromium tabs are open.  Thanks for working on
a patch!
Comment 3 Andriy Gapon freebsd_committer freebsd_triage 2014-03-18 15:15:05 UTC
Karl Denninger <karl@fs.denninger.net> wrote:
> ZFS can be convinced to engage in pathological behavior due to a bad
> low-memory test in arc.c
> 
> The offending file is at
> /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c; it allegedly
> checks for 25% free memory, and if it is less asks for the cache to shrink.
> 
> (snippet from arc.c around line 2494 of arc.c in 10-STABLE; path
> /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs)
> 
> #else /* !sun */
> if (kmem_used() > (kmem_size() * 3) / 4)
> return (1);
> #endif /* sun */
> 
> Unfortunately these two functions do not return what the authors thought
> they did. It's clear what they're trying to do from the Solaris-specific
> code up above this test.

No, these functions do return what the authors think they do.
The check is for KVA usage (kernel virtual address space), not for physical memory.

> The result is that the cache only shrinks when vm_paging_needed() tests
> true, but by that time the system is in serious memory trouble and by

No, it is not.
The description and numbers here are a little bit outdated but they should give
an idea of how paging works in general:
https://wiki.freebsd.org/AvgPageoutAlgorithm

> triggering only there it actually drives the system further into paging,

How does ARC eviction drive the system further into paging?

> because the pager will not recall pages from the swap until they are next
> executed. This leads the ARC to try to fill in all the available RAM even
> though pages have been pushed off onto swap. Not good.

Unused physical memory is a waste.  It is true that ARC tries to use as much
memory as it is allowed.  The same applies to the page cache (Active, Inactive).
Memory management is a dynamic system and there are a few competing agents.

It is hard to correctly tune that system using a large hammer such as your
patch.  I believe that with your patch ARC will get shrunk to its minimum size
in due time.  Active + Inactive will grow to use the memory that you are denying
to ARC, driving Free below a threshold, which will reduce ARC.  Repeated enough
times, this will drive ARC to its minimum.

Also, there are a few technical problems with the patch:
- you don't need to use the sysctl interface in the kernel; the values you need
are available directly, just take a look at e.g. the implementation of
vm_paging_needed()
- similarly, querying the vfs.zfs.arc_freepage_percent_target value via
kernel_sysctlbyname is just bogus; you can use percent_target directly
- you don't need to sum various page counters to get a total count; there is
v_page_count (a minimal sketch of these direct reads follows this list)
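
For illustration only, a minimal sketch of those direct reads (and the kind of
comparison the patch wants to make) on stable/10, assuming the global
struct vmmeter "cnt"; the function name is a placeholder, not code from any
posted patch:

#include <sys/vmmeter.h>	/* declares the global vmmeter "cnt" on stable/10 */

static int
arc_low_ram_sketch(int pct)	/* pct: extra free percentage to reserve */
{
	/* all three counters come straight from the vmmeter, no sysctl call */
	u_int total = cnt.v_page_count;
	u_int freecnt = cnt.v_free_count;

	/* low on RAM if free drops below the pageout target plus the reserve */
	return (freecnt < cnt.v_free_target + (total / 100) * pct);
}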

Lastly, can you try to test reverting your patch and instead setting
vm.lowmem_period=0 ?

-- 
Andriy Gapon
Comment 4 karl 2014-03-19 14:18:40 UTC
On 3/18/2014 12:19 PM, Karl Denninger wrote:
>
> On 3/18/2014 10:20 AM, Andriy Gapon wrote:
>> The following reply was made to PR kern/187594; it has been noted by 
>> GNATS.
>>
>> From: Andriy Gapon <avg@FreeBSD.org>
>> To: bug-followup@FreeBSD.org, karl@fs.denninger.net
>> Cc:
>> Subject: Re: kern/187594: [zfs] [patch] ZFS ARC behavior problem and fix
>> Date: Tue, 18 Mar 2014 17:15:05 +0200
>>
>>   Karl Denninger <karl@fs.denninger.net> wrote:
>>   > ZFS can be convinced to engage in pathological behavior due to a bad
>>   > low-memory test in arc.c
>>   >
>>   > The offending file is at
>>   > /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c; it 
>> allegedly
>>   > checks for 25% free memory, and if it is less asks for the cache 
>> to shrink.
>>   >
>>   > (snippet from arc.c around line 2494 of arc.c in 10-STABLE; path
>>   > /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs)
>>   >
>>   > #else /* !sun */
>>   > if (kmem_used() > (kmem_size() * 3) / 4)
>>   > return (1);
>>   > #endif /* sun */
>>   >
>>   > Unfortunately these two functions do not return what the authors 
>> thought
>>   > they did. It's clear what they're trying to do from the 
>> Solaris-specific
>>   > code up above this test.
>>     No, these functions do return what the authors think they do.
>>   The check is for KVA usage (kernel virtual address space), not for 
>> physical memory.
> I understand, but that's nonsensical in the context of the Solaris 
> code.  "lotsfree" is *not* a declaration of free kvm space, it's a 
> declaration of when the system has "lots" of free *physical* memory.
>
> Further it makes no sense at all to allow the ARC cache to force 
> things into virtual (e.g. swap-space backed) memory.  But that's the 
> behavior that has been observed, and it fits with the code as 
> originally written.
>
>>     > The result is that the cache only shrinks when 
>> vm_paging_needed() tests
>>   > true, but by that time the system is in serious memory trouble 
>> and by
>>     No, it is not.
>>   The description and numbers here are a little bit outdated but they 
>> should give
>>   an idea of how paging works in general:
>>   https://wiki.freebsd.org/AvgPageoutAlgorithm
>>     > triggering only there it actually drives the system further 
>> into paging,
>>     How does ARC eviction drives the system further into paging?
> 1. System gets low on physical memory but the ARC cache is looking at 
> available kvm (of which there is plenty.)  The ARC cache continues to 
> expand.
>
> 2. vm_paging_needed() returns true and the system begins to page off 
> to the swap.  At the same time the ARC cache is pared down because 
> arc_reclaim_needed has returned "1".
>
> 3. As the ARC cache shrinks and paging occurs vm_paging_needed() 
> returns false.  Paging out ceases but inactive pages remain on the 
> swap.  They are not recalled until and unless they are scheduled to 
> execute.  Arc_reclaim_needed again returns "0".
>
> 4. The hold-down timer expires in the ARC cache code 
> ("arc_grow_retry", declared as 60 seconds) and the ARC cache begins to 
> expand again.
>
> Go back to #2 until the system's performance starts to deteriorate 
> badly enough due to the paging that you notice it, which occurs when 
> something that is actually consuming CPU time has to be called in from 
> swap.
>
> This is consistent with what I and others have observed on both 9.2 
> and 10.0; the ARC will expand until it hits the maximum configured 
> even at the expense of forcing pages onto the swap.  In this specific 
> machine's case left to defaults it will grab nearly all physical 
> memory (over 20GB of 24) and wire it down.
>
> Limiting arc_max to 16GB sorta fixes it.  I say "sorta" because it 
> turns out that 16GB is still too much for the workload; it prevents 
> the pathological behavior where system "stalls" happen but only in the 
> extreme.  It turns out that with the patch in, my ARC cache stabilizes at 
> about 13.5GB during the busiest part of the day, growing to about 16 
> off-hours.
>
> One of the problems with just limiting it in /boot/loader.conf is that 
> you have to guess and the system doesn't reasonably adapt to changing 
> memory loads.  The code is clearly intended to do that but it doesn't 
> end up working that way in practice.
>>     > because the pager will not recall pages from the swap until 
>> they are next
>>   > executed. This leads the ARC to try to fill in all the available 
>> RAM even
>>   > though pages have been pushed off onto swap. Not good.
>>     Unused physical memory is a waste.  It is true that ARC tries to 
>> use as much of
>>   memory as it is allowed.  The same applies to the page cache 
>> (Active, Inactive).
>>   Memory management is a dynamic system and there are a few competing 
>> agents.
> That's true.  However, what the stock code does is force working set 
> out of memory and into the swap.  The ideal situation is one in which 
> there is no free memory because cache has sized itself to consume 
> everything *not* necessary for the working set of the processes that 
> are running.  Unfortunately we cannot determine this presciently 
> because a new process may come along and we do not necessarily know 
> for how long a process that is blocked on an event will remain blocked 
> (e.g. something waiting on network I/O, etc.)
>
> However, it is my contention that you do not want to evict a process 
> that is scheduled to run (or is going to be) in favor of disk cache 
> because you're defeating yourself by doing so.  The point of the disk 
> cache is to avoid going to the physical disk for I/O, but if you page 
> something you have ditched a physical I/O for data in favor of having 
> to go to physical disk *twice* -- first to write the paged-out data to 
> swap, and then to retrieve it when it is to be executed.  This also 
> appears to be consistent with what is present for Solaris machines.
>
> From the Sun code:
>
> #ifdef sun
>         /*
>          * take 'desfree' extra pages, so we reclaim sooner, rather 
> than later
>          */
>         extra = desfree;
>
>         /*
>          * check that we're out of range of the pageout scanner. It 
> starts to
>          * schedule paging if freemem is less than lotsfree and needfree.
>          * lotsfree is the high-water mark for pageout, and needfree 
> is the
>          * number of needed free pages.  We add extra pages here to 
> make sure
>          * the scanner doesn't start up while we're freeing memory.
>          */
>         if (freemem < lotsfree + needfree + extra)
>                 return (1);
>
>         /*
>          * check to make sure that swapfs has enough space so that anon
>          * reservations can still succeed. anon_resvmem() checks that the
>          * availrmem is greater than swapfs_minfree, and the number of 
> reserved
>          * swap pages.  We also add a bit of extra here just to prevent
>          * circumstances from getting really dire.
>          */
>         if (availrmem < swapfs_minfree + swapfs_reserve + extra)
>                 return (1);
>
> "freemem" is not virtual memory, it's actual memory.  "Lotsfree" is 
> the point where the system considers free RAM to be "ample"; 
> "needfree" is the "desperation" point and "extra" is the margin 
> (presumably for image activation.)
>
> The base code on FreeBSD doesn't look at physical memory at all; it 
> looks at kvm space instead.
>
>>   It is hard to correctly tune that system using a large hummer such 
>> as your
>>   patch.  I believe that with your patch ARC will get shrunk to its 
>> minimum size
>>   in due time.  Active + Inactive will grow to use the memory that 
>> you are denying
>>   to ARC driving Free below a threshold, which will reduce ARC. 
>> Repeated enough
>>   times this will drive ARC to its minimum.
> I disagree both in design theory and based on the empirical evidence 
> of actual operation.
>
> First, I don't (ever) want to give memory to the ARC cache that 
> otherwise would go to "active", because any time I do that I'm going 
> to force two page events, which is double the amount of I/O I would 
> take on a cache *miss*, and even with the ARC at minimum I get a 
> reasonable hit percentage.  If I therefore prefer ARC over "active" 
> pages I am going to take *at least* a 200% penalty on physical I/O and 
> if I get an 80% hit ratio with the ARC at a minimum the penalty is 
> closer to 800%!
>
> For inactive pages it's a bit more complicated as those may not be 
> reactivated.  However, I am trusting FreeBSD's VM subsystem to demote 
> those that are unlikely to be reactivated to the cache bucket and then 
> to "free", where they are able to be re-used. This is consistent with 
> what I actually see on a running system -- the "inact" bucket is 
> typically fairly large (often on a busy machine close to that of 
> "active") but pages demoted to "cache" don't stay there long - they 
> either get re-promoted back up or they are freed and go on the free list.
>
> The only time I see "inact" get out of control is when there's a 
> kernel memory leak somewhere (such as what I ran into the other day 
> with the in-kernel NAT subsystem on 10-STABLE.)  But that's a bug and 
> if it happens you're going to get bit anyway.
>
> For example right now on one of my very busy systems with 24GB of 
> installed RAM and many terabytes of storage across three ZFS pools I'm 
> seeing 17GB wired of which 13.5 is ARC cache.  That's the adaptive 
> figure it currently is running at, with a maximum of 22.3 and a 
> minimum of 2.79 (8:1 ratio.)  The remainder is wired down for other 
> reasons (there's a fairly large Postgres server running on that box, 
> among other things, and it has a big shared buffer declaration -- 
> that's most of the difference.)  Cache hit efficiency is currently 97.8%.
>
> Active is 2.26G right now, and inactive is 2.09G.  Both are stable. 
> Overnight inactive will drop to about 1.1GB while active will not 
> change all that much since most of it postgres and the middleware that 
> talks to it along with apache, which leaves most of its processes 
> present even when they go idle.  Peak load times are about right now 
> (mid-day), and again when the system is running backups nightly.
>
> Cache is 7448, in other words, insignificant.  Free memory is 2.6G.
>
> The tunable is set to 10%, which is almost exactly what free memory 
> is.  I find that when the system gets under 1G free, transient image 
> activation can drive it into paging and performance starts to suffer 
> for my particular workload.
>
>>     Also, there are a few technical problems with the patch:
>>   - you don't need to use sysctl interface in kernel, the values you 
>> need are
>>   available directly, just take a look at e.g. implementation of 
>> vm_paging_needed()
> That's easily fixed.  I will look at it.
>>   - similarly, querying vfs.zfs.arc_freepage_percent_target value via
>>   kernel_sysctlbyname is just bogus; you can use percent_target directly
> I did not know if during setup of the OID the value was copied (and 
> thus you had to reference it later on) or the entry simply took the 
> pointer and stashed that.  Easily corrected.
>>   - you don't need to sum various page counters to get a total count, 
>> there is
>>   v_page_count
> Fair enough as well.
>>   Lastly, can you try to test reverting your patch and instead setting
>>   vm.lowmem_period=0 ?
> Yes.  By default it's 10; I have not tampered with that default.
>
> Let me do a bit of work and I'll post back with a revised patch. 
> Perhaps a tunable for percentage free + a free reserve that is a 
> "floor"?  The problem with that is where to put the defaults.  One 
> option would be to grab total size at init time and compute something 
> similar to what "lotsfree" is for Solaris, allowing that to be tuned 
> with the percentage if desired.  I selected 25% because that's what 
> the original test was expressing and it should be reasonable for 
> modest RAM configurations.  It's clearly too high for moderately large 
> (or huge) memory machines unless they have a lot of RAM-hungry 
> processes running on them.
>
> The percentage test, however, is an easy knob to twist that is 
> unlikely to severely harm you if you dial it too far in either 
> direction; anyone setting it to zero obviously knows what they're 
> getting into, and if you crank it too high all you end up doing is 
> limiting the ARC to the minimum value.
>


Responsive to the criticisms, and in an attempt to better track what the 
VM system does, I offer this update to the patch.  The following changes 
have been made:

1. There are now two tunables:
vfs.zfs.arc_freepages -- the number of free pages below which we declare 
low memory and ask for ARC paring.
vfs.zfs.arc_freepage_percent -- the additional free RAM to reserve in 
percent of total, if any (added to freepages)

2. vfs.zfs.arc_freepages, if zero (as is the default at boot), defaults 
to "vm.stats.vm.v_free_target" less 20%.  This allows the system to get 
into the page-stealing paradigm before the ARC cache is invaded.  While 
I do not run into a situation of unbridled inact page growth here, the 
criticism that the original patch could allow this appears to be 
well-founded.  Setting the low memory alert here should prevent this, as 
the system will now allow the ARC to grow to the point that 
page-stealing takes place.

3. The previous option to reserve either a hard amount of RAM or a 
percentage of RAM remains.

4. The defaults should auto-tune for any particular RAM configuration to 
reasonable values that prevent stalls, yet if you have circumstances 
that argue for reserving more memory you may do so.

Updated patch follows:

*** arc.c.original	Thu Mar 13 09:18:48 2014
--- arc.c	Wed Mar 19 07:44:01 2014
***************
*** 18,23 ****
--- 18,99 ----
    *
    * CDDL HEADER END
    */
+
+ /* Karl Denninger (karl@denninger.net), 3/18/2014, FreeBSD-specific
+  *
+  * If "NEWRECLAIM" is defined, change the "low memory" warning that causes
+  * the ARC cache to be pared down.  The reason for the change is that the
+  * apparent attempted algorithm is to start evicting ARC cache when free
+  * pages fall below 25% of installed RAM.  This maps reasonably well to how
+  * Solaris is documented to behave; when "lotsfree" is invaded ZFS is told
+  * to pare down.
+  *
+  * The problem is that on FreeBSD machines the system doesn't appear to be
+  * getting what the authors of the original code thought they were looking at
+  * with its test -- or at least not what Solaris did -- and as a result that
+  * test never triggers.  That leaves the only reclaim trigger as the "paging
+  * needed" status flag, and by the time * that trips the system is already
+  * in low-memory trouble.  This can lead to severe pathological behavior
+  * under the following scenario:
+  * - The system starts to page and ARC is evicted.
+  * - The system stops paging as ARC's eviction drops wired RAM a bit.
+  * - ARC starts increasing its allocation again, and wired memory grows.
+  * - A new image is activated, and the system once again attempts to page.
+  * - ARC starts to be evicted again.
+  * - Back to #2
+  *
+  * Note that ZFS's ARC default (unless you override it in /boot/loader.conf)
+  * is to allow the ARC cache to grab nearly all of free RAM, provided nobody
+  * else needs it.  That would be ok if we evicted cache when required.
+  *
+  * Unfortunately the system can get into a state where it never
+  * manages to page anything of materiality back in, as if there is active
+  * I/O the ARC will start grabbing space once again as soon as the memory
+  * contention state drops.  For this reason the "paging is occurring" flag
+  * should be the **last resort** condition for ARC eviction; you want to
+  * (as Solaris does) start when there is material free RAM left BUT the
+  * vm system thinks it needs to be active to steal pages back in the attempt
+  * to never get into the condition where you're potentially paging off
+  * executables in favor of leaving disk cache allocated.
+  *
+  * To fix this we change how we look at low memory, declaring two new
+  * runtime tunables.
+  *
+  * The new sysctls are:
+  * vfs.zfs.arc_freepages (free pages required to call RAM "sufficient")
+  * vfs.zfs.arc_freepage_percent (additional reservation percentage, default 0)
+  *
+  * vfs.zfs.arc_freepages is initialized from vm.stats.vm.v_free_target,
+  * less 20% if we find that it is zero.  Note that vm.stats.vm.v_free_target
+  * is not initialized at boot -- the system has to be running first, so we
+  * cannot initialize this in arc_init.  So we check during runtime; this
+  * also allows the user to return to defaults by setting it to zero.
+  *
+  * This should insure that we allow the VM system to steal pages first,
+  * but pare the cache before we suspend processes attempting to get more
+  * memory, thereby avoiding "stalls."  You can set this higher if you wish,
+  * or force a specific percentage reservation as well, but doing so may
+  * cause the cache to pare back while the VM system remains willing to
+  * allow "inactive" pages to accumulate.  The challenge is that image
+  * activation can force things into the page space on a repeated basis
+  * if you allow this level to be too small (the above pathological
+  * behavior); the defaults should avoid that behavior but the sysctls
+  * are exposed should your workload require adjustment.
+  *
+  * If we're using this check for low memory we are replacing the previous
+  * ones, including the oddball "random" reclaim that appears to fire far
+  * more often than it should.  We still trigger if the system pages.
+  *
+  * If you turn on NEWRECLAIM_DEBUG then the kernel will print on the console
+  * status messages when the reclaim status trips on and off, along with the
+  * page count aggregate that triggered it (and the free space) for each
+  * event.
+  */
+
+ #define	NEWRECLAIM
+ #undef	NEWRECLAIM_DEBUG
+
+
   /*
    * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
    * Copyright (c) 2013 by Delphix. All rights reserved.
***************
*** 139,144 ****
--- 215,226 ----
   
   #include <vm/vm_pageout.h>
   
+ #ifdef	NEWRECLAIM
+ #ifdef	__FreeBSD__
+ #include <sys/sysctl.h>
+ #endif
+ #endif	/* NEWRECLAIM */
+
   #ifdef illumos
   #ifndef _KERNEL
   /* set with ZFS_DEBUG=watch, to enable watchpoints on frozen buffers */
***************
*** 203,218 ****
--- 285,320 ----
   int zfs_arc_shrink_shift = 0;
   int zfs_arc_p_min_shift = 0;
   int zfs_disable_dup_eviction = 0;
+ #ifdef	NEWRECLAIM
+ #ifdef  __FreeBSD__
+ static	int freepages = 0;	/* This much memory is considered critical */
+ static	int percent_target = 0;	/* Additionally reserve "X" percent free RAM */
+ #endif	/* __FreeBSD__ */
+ #endif	/* NEWRECLAIM */
   
   TUNABLE_QUAD("vfs.zfs.arc_max", &zfs_arc_max);
   TUNABLE_QUAD("vfs.zfs.arc_min", &zfs_arc_min);
   TUNABLE_QUAD("vfs.zfs.arc_meta_limit", &zfs_arc_meta_limit);
+ #ifdef	NEWRECLAIM
+ #ifdef  __FreeBSD__
+ TUNABLE_INT("vfs.zfs.arc_freepages", &freepages);
+ TUNABLE_INT("vfs.zfs.arc_freepage_percent", &percent_target);
+ #endif	/* __FreeBSD__ */
+ #endif	/* NEWRECLAIM */
+
   SYSCTL_DECL(_vfs_zfs);
   SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, arc_max, CTLFLAG_RDTUN, &zfs_arc_max, 0,
       "Maximum ARC size");
   SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, arc_min, CTLFLAG_RDTUN, &zfs_arc_min, 0,
       "Minimum ARC size");
   
+ #ifdef	NEWRECLAIM
+ #ifdef  __FreeBSD__
+ SYSCTL_INT(_vfs_zfs, OID_AUTO, arc_freepages, CTLFLAG_RWTUN, &freepages, 0, "ARC Free RAM Pages Required");
+ SYSCTL_INT(_vfs_zfs, OID_AUTO, arc_freepage_percent, CTLFLAG_RWTUN, &percent_target, 0, "ARC Free RAM Target percentage");
+ #endif	/* __FreeBSD__ */
+ #endif	/* NEWRECLAIM */
+
   /*
    * Note that buffers can be in one of 6 states:
    *	ARC_anon	- anonymous (discussed below)
***************
*** 2438,2443 ****
--- 2540,2557 ----
   {
   
   #ifdef _KERNEL
+ #ifdef	NEWRECLAIM
+ #ifdef  __FreeBSD__
+ 	u_int	vmfree = 0;
+ 	u_int	vmtotal = 0;
+ 	size_t	vmsize;
+ #ifdef	NEWRECLAIM_DEBUG
+ 	static	int	xval = -1;
+ 	static	int	oldpercent = 0;
+ 	static	int	oldfreepages = 0;
+ #endif	/* NEWRECLAIM_DEBUG */
+ #endif	/* __FreeBSD__ */
+ #endif	/* NEWRECLAIM */
   
   	if (needfree)
   		return (1);
***************
*** 2476,2481 ****
--- 2590,2596 ----
   		return (1);
   
   #if defined(__i386)
+
   	/*
   	 * If we're on an i386 platform, it's possible that we'll exhaust the
   	 * kernel heap space before we ever run out of available physical
***************
*** 2492,2502 ****
   		return (1);
   #endif
   #else	/* !sun */
   	if (kmem_used() > (kmem_size() * 3) / 4)
   		return (1);
   #endif	/* sun */
   
- #else
   	if (spa_get_random(100) == 0)
   		return (1);
   #endif
--- 2607,2680 ----
   		return (1);
   #endif
   #else	/* !sun */
+
+ #ifdef	NEWRECLAIM
+ #ifdef  __FreeBSD__
+ /*
+  * Implement the new tunable free RAM algorithm.  We check the free pages
+  * against the minimum specified target and the percentage that should be
+  * free.  If we're low we ask for ARC cache shrinkage.  If this is defined
+  * on a FreeBSD system the older checks are not performed.
+  *
+  * Check first to see if we need to init freepages, then test.
+  */
+ 	if (!freepages) {		/* If zero then (re)init */
+ 		vmsize = sizeof(vmtotal);
+ 		kernel_sysctlbyname(curthread, "vm.stats.vm.v_free_target", &vmtotal, &vmsize, NULL, 0, NULL, 0);
+ 		freepages = vmtotal - (vmtotal / 5);
+ #ifdef	NEWRECLAIM_DEBUG
+ 		printf("ZFS ARC: Default vfs.zfs.arc_freepages to [%u] [%u less 20%%]\n", freepages, vmtotal);
+ #endif	/* NEWRECLAIM_DEBUG */
+ 	}
+
+ 	vmsize = sizeof(vmtotal);
+         kernel_sysctlbyname(curthread, "vm.stats.vm.v_page_count", &vmtotal, &vmsize, NULL, 0, NULL, 0);
+ 	vmsize = sizeof(vmfree);
+         kernel_sysctlbyname(curthread, "vm.stats.vm.v_free_count", &vmfree, &vmsize, NULL, 0, NULL, 0);
+ #ifdef	NEWRECLAIM_DEBUG
+ 	if (percent_target != oldpercent) {
+ 		printf("ZFS ARC: Reservation percent change to [%d], [%d] pages, [%d] free\n", percent_target, vmtotal, vmfree);
+ 		oldpercent = percent_target;
+ 	}
+ 	if (freepages != oldfreepages) {
+ 		printf("ZFS ARC: Low RAM page change to [%d], [%d] pages, [%d] free\n", freepages, vmtotal, vmfree);
+ 		oldfreepages = freepages;
+ 	}
+ #endif	/* NEWRECLAIM_DEBUG */
+ 	if (!vmtotal) {
+ 		vmtotal = 1;	/* Protect against divide by zero */
+ 				/* (should be impossible, but...) */
+ 	}
+ /*
+  * Now figure out how much free RAM we require to call the ARC cache status
+  * "ok".  Add the percentage specified of the total to the base requirement.
+  */
+
+ 	if (vmfree < freepages + ((vmtotal / 100) * percent_target)) {
+ #ifdef	NEWRECLAIM_DEBUG
+ 		if (xval != 1) {
+ 			printf("ZFS ARC: RECLAIM total %u, free %u, free pct (%u), reserved (%u), target pct (%u)\n", vmtotal, vmfree, ((vmfree * 100) / vmtotal), freepages, percent_target);
+ 			xval = 1;
+ 		}
+ #endif	/* NEWRECLAIM_DEBUG */
+ 		return(1);
+ 	} else {
+ #ifdef	NEWRECLAIM_DEBUG
+ 		if (xval != 0) {
+ 			printf("ZFS ARC: NORMAL total %u, free %u, free pct (%u), reserved (%u), target pct (%u)\n", vmtotal, vmfree, ((vmfree * 100) / vmtotal), freepages, percent_target);
+ 			xval = 0;
+ 		}
+ #endif	/* NEWRECLAIM_DEBUG */
+ 		return(0);
+ 	}
+
+ #endif	/* __FreeBSD__ */
+ #endif	/* NEWRECLAIM */
+
   	if (kmem_used() > (kmem_size() * 3) / 4)
   		return (1);
   #endif	/* sun */
   
   	if (spa_get_random(100) == 0)
   		return (1);
   #endif


-- 
-- Karl
karl@denninger.net

Comment 5 karl 2014-03-19 18:03:30 UTC
The 20% invasion of the first-level paging regime looks too aggressive 
under very heavy load.  I have changed my system here to 10% and obtain 
a better response profile.

At 20% the system will still occasionally page recently-used executable 
code to disk before cache is released, which is undesirable.  10% looks 
better but may STILL be too aggressive (in other words, 5% might be 
"just right").

Being able to tune this in real time is a BIG help!

Adjusted patch follows (only a couple of lines have changed)

*** arc.c.original	Thu Mar 13 09:18:48 2014
--- arc.c	Wed Mar 19 13:01:48 2014
***************
*** 18,23 ****
--- 18,99 ----
    *
    * CDDL HEADER END
    */
+
+ /* Karl Denninger (karl@denninger.net), 3/18/2014, FreeBSD-specific
+  *
+  * If "NEWRECLAIM" is defined, change the "low memory" warning that causes
+  * the ARC cache to be pared down.  The reason for the change is that the
+  * apparent attempted algorithm is to start evicting ARC cache when free
+  * pages fall below 25% of installed RAM.  This maps reasonably well to how
+  * Solaris is documented to behave; when "lotsfree" is invaded ZFS is told
+  * to pare down.
+  *
+  * The problem is that on FreeBSD machines the system doesn't appear to be
+  * getting what the authors of the original code thought they were looking at
+  * with its test -- or at least not what Solaris did -- and as a result that
+  * test never triggers.  That leaves the only reclaim trigger as the "paging
+  * needed" status flag, and by the time * that trips the system is already
+  * in low-memory trouble.  This can lead to severe pathological behavior
+  * under the following scenario:
+  * - The system starts to page and ARC is evicted.
+  * - The system stops paging as ARC's eviction drops wired RAM a bit.
+  * - ARC starts increasing its allocation again, and wired memory grows.
+  * - A new image is activated, and the system once again attempts to page.
+  * - ARC starts to be evicted again.
+  * - Back to #2
+  *
+  * Note that ZFS's ARC default (unless you override it in /boot/loader.conf)
+  * is to allow the ARC cache to grab nearly all of free RAM, provided nobody
+  * else needs it.  That would be ok if we evicted cache when required.
+  *
+  * Unfortunately the system can get into a state where it never
+  * manages to page anything of materiality back in, as if there is active
+  * I/O the ARC will start grabbing space once again as soon as the memory
+  * contention state drops.  For this reason the "paging is occurring" flag
+  * should be the **last resort** condition for ARC eviction; you want to
+  * (as Solaris does) start when there is material free RAM left BUT the
+  * vm system thinks it needs to be active to steal pages back in the attempt
+  * to never get into the condition where you're potentially paging off
+  * executables in favor of leaving disk cache allocated.
+  *
+  * To fix this we change how we look at low memory, declaring two new
+  * runtime tunables.
+  *
+  * The new sysctls are:
+  * vfs.zfs.arc_freepages (free pages required to call RAM "sufficient")
+  * vfs.zfs.arc_freepage_percent (additional reservation percentage, default 0)
+  *
+  * vfs.zfs.arc_freepages is initialized from vm.stats.vm.v_free_target,
+  * less 10% if we find that it is zero.  Note that vm.stats.vm.v_free_target
+  * is not initialized at boot -- the system has to be running first, so we
+  * cannot initialize this in arc_init.  So we check during runtime; this
+  * also allows the user to return to defaults by setting it to zero.
+  *
+  * This should insure that we allow the VM system to steal pages first,
+  * but pare the cache before we suspend processes attempting to get more
+  * memory, thereby avoiding "stalls."  You can set this higher if you wish,
+  * or force a specific percentage reservation as well, but doing so may
+  * cause the cache to pare back while the VM system remains willing to
+  * allow "inactive" pages to accumulate.  The challenge is that image
+  * activation can force things into the page space on a repeated basis
+  * if you allow this level to be too small (the above pathological
+  * behavior); the defaults should avoid that behavior but the sysctls
+  * are exposed should your workload require adjustment.
+  *
+  * If we're using this check for low memory we are replacing the previous
+  * ones, including the oddball "random" reclaim that appears to fire far
+  * more often than it should.  We still trigger if the system pages.
+  *
+  * If you turn on NEWRECLAIM_DEBUG then the kernel will print on the console
+  * status messages when the reclaim status trips on and off, along with the
+  * page count aggregate that triggered it (and the free space) for each
+  * event.
+  */
+
+ #define	NEWRECLAIM
+ #undef	NEWRECLAIM_DEBUG
+
+
   /*
    * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
    * Copyright (c) 2013 by Delphix. All rights reserved.
***************
*** 139,144 ****
--- 215,226 ----
   
   #include <vm/vm_pageout.h>
   
+ #ifdef	NEWRECLAIM
+ #ifdef	__FreeBSD__
+ #include <sys/sysctl.h>
+ #endif
+ #endif	/* NEWRECLAIM */
+
   #ifdef illumos
   #ifndef _KERNEL
   /* set with ZFS_DEBUG=watch, to enable watchpoints on frozen buffers */
***************
*** 203,218 ****
--- 285,320 ----
   int zfs_arc_shrink_shift = 0;
   int zfs_arc_p_min_shift = 0;
   int zfs_disable_dup_eviction = 0;
+ #ifdef	NEWRECLAIM
+ #ifdef  __FreeBSD__
+ static	int freepages = 0;	/* This much memory is considered critical */
+ static	int percent_target = 0;	/* Additionally reserve "X" percent free RAM */
+ #endif	/* __FreeBSD__ */
+ #endif	/* NEWRECLAIM */
   
   TUNABLE_QUAD("vfs.zfs.arc_max", &zfs_arc_max);
   TUNABLE_QUAD("vfs.zfs.arc_min", &zfs_arc_min);
   TUNABLE_QUAD("vfs.zfs.arc_meta_limit", &zfs_arc_meta_limit);
+ #ifdef	NEWRECLAIM
+ #ifdef  __FreeBSD__
+ TUNABLE_INT("vfs.zfs.arc_freepages", &freepages);
+ TUNABLE_INT("vfs.zfs.arc_freepage_percent", &percent_target);
+ #endif	/* __FreeBSD__ */
+ #endif	/* NEWRECLAIM */
+
   SYSCTL_DECL(_vfs_zfs);
   SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, arc_max, CTLFLAG_RDTUN, &zfs_arc_max, 0,
       "Maximum ARC size");
   SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, arc_min, CTLFLAG_RDTUN, &zfs_arc_min, 0,
       "Minimum ARC size");
   
+ #ifdef	NEWRECLAIM
+ #ifdef  __FreeBSD__
+ SYSCTL_INT(_vfs_zfs, OID_AUTO, arc_freepages, CTLFLAG_RWTUN, &freepages, 0, "ARC Free RAM Pages Required");
+ SYSCTL_INT(_vfs_zfs, OID_AUTO, arc_freepage_percent, CTLFLAG_RWTUN, &percent_target, 0, "ARC Free RAM Target percentage");
+ #endif	/* __FreeBSD__ */
+ #endif	/* NEWRECLAIM */
+
   /*
    * Note that buffers can be in one of 6 states:
    *	ARC_anon	- anonymous (discussed below)
***************
*** 2438,2443 ****
--- 2540,2557 ----
   {
   
   #ifdef _KERNEL
+ #ifdef	NEWRECLAIM
+ #ifdef  __FreeBSD__
+ 	u_int	vmfree = 0;
+ 	u_int	vmtotal = 0;
+ 	size_t	vmsize;
+ #ifdef	NEWRECLAIM_DEBUG
+ 	static	int	xval = -1;
+ 	static	int	oldpercent = 0;
+ 	static	int	oldfreepages = 0;
+ #endif	/* NEWRECLAIM_DEBUG */
+ #endif	/* __FreeBSD__ */
+ #endif	/* NEWRECLAIM */
   
   	if (needfree)
   		return (1);
***************
*** 2476,2481 ****
--- 2590,2596 ----
   		return (1);
   
   #if defined(__i386)
+
   	/*
   	 * If we're on an i386 platform, it's possible that we'll exhaust the
   	 * kernel heap space before we ever run out of available physical
***************
*** 2492,2502 ****
   		return (1);
   #endif
   #else	/* !sun */
   	if (kmem_used() > (kmem_size() * 3) / 4)
   		return (1);
   #endif	/* sun */
   
- #else
   	if (spa_get_random(100) == 0)
   		return (1);
   #endif
--- 2607,2680 ----
   		return (1);
   #endif
   #else	/* !sun */
+
+ #ifdef	NEWRECLAIM
+ #ifdef  __FreeBSD__
+ /*
+  * Implement the new tunable free RAM algorithm.  We check the free pages
+  * against the minimum specified target and the percentage that should be
+  * free.  If we're low we ask for ARC cache shrinkage.  If this is defined
+  * on a FreeBSD system the older checks are not performed.
+  *
+  * Check first to see if we need to init freepages, then test.
+  */
+ 	if (!freepages) {		/* If zero then (re)init */
+ 		vmsize = sizeof(vmtotal);
+ 		kernel_sysctlbyname(curthread, "vm.stats.vm.v_free_target", &vmtotal, &vmsize, NULL, 0, NULL, 0);
+ 		freepages = vmtotal - (vmtotal / 10);
+ #ifdef	NEWRECLAIM_DEBUG
+ 		printf("ZFS ARC: Default vfs.zfs.arc_freepages to [%u] [%u less 10%%]\n", freepages, vmtotal);
+ #endif	/* NEWRECLAIM_DEBUG */
+ 	}
+
+ 	vmsize = sizeof(vmtotal);
+         kernel_sysctlbyname(curthread, "vm.stats.vm.v_page_count", &vmtotal, &vmsize, NULL, 0, NULL, 0);
+ 	vmsize = sizeof(vmfree);
+         kernel_sysctlbyname(curthread, "vm.stats.vm.v_free_count", &vmfree, &vmsize, NULL, 0, NULL, 0);
+ #ifdef	NEWRECLAIM_DEBUG
+ 	if (percent_target != oldpercent) {
+ 		printf("ZFS ARC: Reservation percent change to [%d], [%d] pages, [%d] free\n", percent_target, vmtotal, vmfree);
+ 		oldpercent = percent_target;
+ 	}
+ 	if (freepages != oldfreepages) {
+ 		printf("ZFS ARC: Low RAM page change to [%d], [%d] pages, [%d] free\n", freepages, vmtotal, vmfree);
+ 		oldfreepages = freepages;
+ 	}
+ #endif	/* NEWRECLAIM_DEBUG */
+ 	if (!vmtotal) {
+ 		vmtotal = 1;	/* Protect against divide by zero */
+ 				/* (should be impossible, but...) */
+ 	}
+ /*
+  * Now figure out how much free RAM we require to call the ARC cache status
+  * "ok".  Add the percentage specified of the total to the base requirement.
+  */
+
+ 	if (vmfree < freepages + ((vmtotal / 100) * percent_target)) {
+ #ifdef	NEWRECLAIM_DEBUG
+ 		if (xval != 1) {
+ 			printf("ZFS ARC: RECLAIM total %u, free %u, free pct (%u), reserved (%u), target pct (%u)\n", vmtotal, vmfree, ((vmfree * 100) / vmtotal), freepages, percent_target);
+ 			xval = 1;
+ 		}
+ #endif	/* NEWRECLAIM_DEBUG */
+ 		return(1);
+ 	} else {
+ #ifdef	NEWRECLAIM_DEBUG
+ 		if (xval != 0) {
+ 			printf("ZFS ARC: NORMAL total %u, free %u, free pct (%u), reserved (%u), target pct (%u)\n", vmtotal, vmfree, ((vmfree * 100) / vmtotal), freepages, percent_target);
+ 			xval = 0;
+ 		}
+ #endif	/* NEWRECLAIM_DEBUG */
+ 		return(0);
+ 	}
+
+ #endif	/* __FreeBSD__ */
+ #endif	/* NEWRECLAIM */
+
   	if (kmem_used() > (kmem_size() * 3) / 4)
   		return (1);
   #endif	/* sun */
   
   	if (spa_get_random(100) == 0)
   		return (1);
   #endif


-- 
-- Karl
karl@denninger.net

Comment 6 karl 2014-03-20 12:26:39 UTC
I am increasingly convinced, with more runtime now on both 
synthetic and real loads in production, that the proper default value for 
vfs.zfs.arc_freepages is vm.stats.vm.v_free_target less "just a bit."  
Five percent appears to be ok for most workloads with RAM configurations 
ranging from 4GB to the 24GB area (configurations that I can easily test 
under both synthetic and real environments).

Larger invasions of the free target increasingly risk provoking the 
behavior that prompted me to get involved in working on this part of the 
code in the first place, including short-term (~5-10 second) "stalls" 
during which the system appears to be locked up, but is not.

It appears that the key to avoiding that behavior is to not allow the 
ARC to continue to take RAM when a material invasion of that target 
space has occurred.
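
Expressed in code, in the style of the patches in this report, a five-percent
margin on that default amounts to roughly the line below (the /20 divisor is
simply shorthand for "five percent", not a value from any posted patch):

	freepages = cnt.v_free_target - (cnt.v_free_target / 20);	/* v_free_target less ~5% */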

-- 
-- Karl
karl@denninger.net

Comment 7 Andriy Gapon freebsd_committer freebsd_triage 2014-03-20 14:56:18 UTC
I think that you are gradually approaching a correct solution to the problem,
but from quite a different angle compared to how I would approach it.

In fact, I think that it was this commit
http://svnweb.freebsd.org/changeset/base/254304 that broke a balance between the
page cache and ZFS ARC.

On the technical side, I see that you are still using kernel_sysctlbyname in your
patches.  As I've said before, this is not needed and in a certain sense incorrect.

-- 
Andriy Gapon
Comment 8 karl 2014-03-20 17:00:54 UTC
Responsive to avg's comment, and with another overnight and daytime load 
of testing on multiple machines with varying memory configs from 4-24GB 
of RAM, here is another version of the patch.

The differences are:

1. No longer use kernel_sysctlbyname; include the VM header file and get 
the values directly (less overhead).  Remove the variables no longer needed.

2. Set the default free RAM level for ARC shrinkage to v_free_target 
less 3%.  I was able to provoke a stall once with it set to a 5% 
reservation, was able to provoke one with the parameter set to 10% with a 
lot of work, and was able to do so "on demand" with it set to 20%.  With 
a 5% invasion, initiating a scrub with very heavy I/O and image load 
(hundreds of web and database processes) provoked a ~10 second system 
stall.  With it set to 3% I have not been able to reproduce the stall, 
yet the inactive page count remains stable even under extremely heavy 
load, indicating that page-stealing remains effective when required.  
Note that for my workload, even with this level set above v_free_target, 
which would imply no page stealing by the VM system before ARC expansion 
is halted, I do not get unbridled inactive page growth.

As before, vfs.zfs.arc_freepages and vfs.zfs.arc_freepage_percent remain 
accessible knobs if you wish to twist them for some reason to 
compensate for an unusual load profile or machine configuration.

*** arc.c.original	Thu Mar 13 09:18:48 2014
--- arc.c	Thu Mar 20 11:51:48 2014
***************
*** 18,23 ****
--- 18,94 ----
    *
    * CDDL HEADER END
    */
+
+ /* Karl Denninger (karl@denninger.net), 3/20/2014, FreeBSD-specific
+  *
+  * If "NEWRECLAIM" is defined, change the "low memory" warning that causes
+  * the ARC cache to be pared down.  The reason for the change is that the
+  * apparent attempted algorithm is to start evicting ARC cache when free
+  * pages fall below 25% of installed RAM.  This maps reasonably well to how
+  * Solaris is documented to behave; when "lotsfree" is invaded ZFS is told
+  * to pare down.
+  *
+  * The problem is that on FreeBSD machines the system doesn't appear to be
+  * getting what the authors of the original code thought they were looking at
+  * with its test -- or at least not what Solaris did -- and as a result that
+  * test never triggers.  That leaves the only reclaim trigger as the "paging
+  * needed" status flag, and by the time * that trips the system is already
+  * in low-memory trouble.  This can lead to severe pathological behavior
+  * under the following scenario:
+  * - The system starts to page and ARC is evicted.
+  * - The system stops paging as ARC's eviction drops wired RAM a bit.
+  * - ARC starts increasing its allocation again, and wired memory grows.
+  * - A new image is activated, and the system once again attempts to page.
+  * - ARC starts to be evicted again.
+  * - Back to #2
+  *
+  * Note that ZFS's ARC default (unless you override it in /boot/loader.conf)
+  * is to allow the ARC cache to grab nearly all of free RAM, provided nobody
+  * else needs it.  That would be ok if we evicted cache when required.
+  *
+  * Unfortunately the system can get into a state where it never
+  * manages to page anything of materiality back in, as if there is active
+  * I/O the ARC will start grabbing space once again as soon as the memory
+  * contention state drops.  For this reason the "paging is occurring" flag
+  * should be the **last resort** condition for ARC eviction; you want to
+  * (as Solaris does) start when there is material free RAM left BUT the
+  * vm system thinks it needs to be active to steal pages back in the attempt
+  * to never get into the condition where you're potentially paging off
+  * executables in favor of leaving disk cache allocated.
+  *
+  * To fix this we change how we look at low memory, declaring two new
+  * runtime tunables.
+  *
+  * The new sysctls are:
+  * vfs.zfs.arc_freepages (free pages required to call RAM "sufficient")
+  * vfs.zfs.arc_freepage_percent (additional reservation percentage, default 0)
+  *
+  * vfs.zfs.arc_freepages is initialized from vm.v_free_target, less 3%.
+  * This should insure that we allow the VM system to steal pages first,
+  * but pare the cache before we suspend processes attempting to get more
+  * memory, thereby avoiding "stalls."  You can set this higher if you wish,
+  * or force a specific percentage reservation as well, but doing so may
+  * cause the cache to pare back while the VM system remains willing to
+  * allow "inactive" pages to accumulate.  The challenge is that image
+  * activation can force things into the page space on a repeated basis
+  * if you allow this level to be too small (the above pathological
+  * behavior); the defaults should avoid that behavior but the sysctls
+  * are exposed should your workload require adjustment.
+  *
+  * If we're using this check for low memory we are replacing the previous
+  * ones, including the oddball "random" reclaim that appears to fire far
+  * more often than it should.  We still trigger if the system pages.
+  *
+  * If you turn on NEWRECLAIM_DEBUG then the kernel will print on the console
+  * status messages when the reclaim status trips on and off, along with the
+  * page count aggregate that triggered it (and the free space) for each
+  * event.
+  */
+
+ #define	NEWRECLAIM
+ #undef	NEWRECLAIM_DEBUG
+
+
   /*
    * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
    * Copyright (c) 2013 by Delphix. All rights reserved.
***************
*** 139,144 ****
--- 210,222 ----
   
   #include <vm/vm_pageout.h>
   
+ #ifdef	NEWRECLAIM
+ #ifdef	__FreeBSD__
+ #include <sys/sysctl.h>
+ #include <sys/vmmeter.h>
+ #endif
+ #endif	/* NEWRECLAIM */
+
   #ifdef illumos
   #ifndef _KERNEL
   /* set with ZFS_DEBUG=watch, to enable watchpoints on frozen buffers */
***************
*** 203,218 ****
--- 281,316 ----
   int zfs_arc_shrink_shift = 0;
   int zfs_arc_p_min_shift = 0;
   int zfs_disable_dup_eviction = 0;
+ #ifdef	NEWRECLAIM
+ #ifdef  __FreeBSD__
+ static	int freepages = 0;	/* This much memory is considered critical */
+ static	int percent_target = 0;	/* Additionally reserve "X" percent free RAM */
+ #endif	/* __FreeBSD__ */
+ #endif	/* NEWRECLAIM */
   
   TUNABLE_QUAD("vfs.zfs.arc_max", &zfs_arc_max);
   TUNABLE_QUAD("vfs.zfs.arc_min", &zfs_arc_min);
   TUNABLE_QUAD("vfs.zfs.arc_meta_limit", &zfs_arc_meta_limit);
+ #ifdef	NEWRECLAIM
+ #ifdef  __FreeBSD__
+ TUNABLE_INT("vfs.zfs.arc_freepages", &freepages);
+ TUNABLE_INT("vfs.zfs.arc_freepage_percent", &percent_target);
+ #endif	/* __FreeBSD__ */
+ #endif	/* NEWRECLAIM */
+
   SYSCTL_DECL(_vfs_zfs);
   SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, arc_max, CTLFLAG_RDTUN, &zfs_arc_max, 0,
       "Maximum ARC size");
   SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, arc_min, CTLFLAG_RDTUN, &zfs_arc_min, 0,
       "Minimum ARC size");
   
+ #ifdef	NEWRECLAIM
+ #ifdef  __FreeBSD__
+ SYSCTL_INT(_vfs_zfs, OID_AUTO, arc_freepages, CTLFLAG_RWTUN, &freepages, 0, "ARC Free RAM Pages Required");
+ SYSCTL_INT(_vfs_zfs, OID_AUTO, arc_freepage_percent, CTLFLAG_RWTUN, &percent_target, 0, "ARC Free RAM Target percentage");
+ #endif	/* __FreeBSD__ */
+ #endif	/* NEWRECLAIM */
+
   /*
    * Note that buffers can be in one of 6 states:
    *	ARC_anon	- anonymous (discussed below)
***************
*** 2438,2443 ****
--- 2536,2546 ----
   {
   
   #ifdef _KERNEL
+ #ifdef	NEWRECLAIM_DEBUG
+ 	static	int	xval = -1;
+ 	static	int	oldpercent = 0;
+ 	static	int	oldfreepages = 0;
+ #endif	/* NEWRECLAIM_DEBUG */
   
   	if (needfree)
   		return (1);
***************
*** 2476,2481 ****
--- 2579,2585 ----
   		return (1);
   
   #if defined(__i386)
+
   	/*
   	 * If we're on an i386 platform, it's possible that we'll exhaust the
   	 * kernel heap space before we ever run out of available physical
***************
*** 2492,2502 ****
   		return (1);
   #endif
   #else	/* !sun */
   	if (kmem_used() > (kmem_size() * 3) / 4)
   		return (1);
   #endif	/* sun */
   
- #else
   	if (spa_get_random(100) == 0)
   		return (1);
   #endif
--- 2596,2658 ----
   		return (1);
   #endif
   #else	/* !sun */
+
+ #ifdef	NEWRECLAIM
+ #ifdef  __FreeBSD__
+ /*
+  * Implement the new tunable free RAM algorithm.  We check the free pages
+  * against the minimum specified target and the percentage that should be
+  * free.  If we're low we ask for ARC cache shrinkage.  If this is defined
+  * on a FreeBSD system the older checks are not performed.
+  *
+  * Check first to see if we need to init freepages, then test.
+  */
+ 	if (!freepages) {		/* If zero then (re)init */
+ 		freepages = cnt.v_free_target - (cnt.v_free_target / 33);
+ #ifdef	NEWRECLAIM_DEBUG
+ 		printf("ZFS ARC: Default vfs.zfs.arc_freepages to [%u] [%u less 3%%]\n", freepages, cnt.v_free_target);
+ #endif	/* NEWRECLAIM_DEBUG */
+ 	}
+ #ifdef	NEWRECLAIM_DEBUG
+ 	if (percent_target != oldpercent) {
+ 		printf("ZFS ARC: Reservation percent change to [%d], [%d] pages, [%d] free\n", percent_target, cnt.v_page_count, cnt.v_free_count);
+ 		oldpercent = percent_target;
+ 	}
+ 	if (freepages != oldfreepages) {
+ 		printf("ZFS ARC: Low RAM page change to [%d], [%d] pages, [%d] free\n", freepages, cnt.v_page_count, cnt.v_free_count);
+ 		oldfreepages = freepages;
+ 	}
+ #endif	/* NEWRECLAIM_DEBUG */
+ /*
+  * Now figure out how much free RAM we require to call the ARC cache status
+  * "ok".  Add the percentage specified of the total to the base requirement.
+  */
+
+ 	if (cnt.v_free_count < freepages + ((cnt.v_page_count / 100) * percent_target)) {
+ #ifdef	NEWRECLAIM_DEBUG
+ 		if (xval != 1) {
+ 			printf("ZFS ARC: RECLAIM total %u, free %u, free pct (%u), reserved (%u), target pct (%u)\n", cnt.v_page_count, cnt.v_free_count, ((cnt.v_free_count * 100) / cnt.v_page_count), freepages, percent_target);
+ 			xval = 1;
+ 		}
+ #endif	/* NEWRECLAIM_DEBUG */
+ 		return(1);
+ 	} else {
+ #ifdef	NEWRECLAIM_DEBUG
+ 		if (xval != 0) {
+ 			printf("ZFS ARC: NORMAL total %u, free %u, free pct (%u), reserved (%u), target pct (%u)\n", cnt.v_page_count, cnt.v_free_count, ((cnt.v_free_count * 100) / cnt.v_page_count), freepages, percent_target);
+ 			xval = 0;
+ 		}
+ #endif	/* NEWRECLAIM_DEBUG */
+ 		return(0);
+ 	}
+
+ #endif	/* __FreeBSD__ */
+ #endif	/* NEWRECLAIM */
+
   	if (kmem_used() > (kmem_size() * 3) / 4)
   		return (1);
   #endif	/* sun */
   
   	if (spa_get_random(100) == 0)
   		return (1);
   #endif


-- 
-- Karl
karl@denninger.net

Comment 9 karl 2014-03-24 11:41:16 UTC
Update:

1. Patch is still good against latest arc.c change (associated with new 
flags on the pool).
2. Change the default low memory warning level for the ARC to 
cnt.v_free_target, with no margin.  This appears to provide the best 
performance and does not cause problems with inact pages or other 
misbehavior on my test systems.
3. Expose the return flag (arc_shrink_needed) so if you care to watch it 
for some reason, you can.

*** arc.c.original	Sun Mar 23 14:56:01 2014
--- arc.c	Sun Mar 23 15:12:15 2014
***************
*** 18,23 ****
--- 18,95 ----
    *
    * CDDL HEADER END
    */
+
+ /* Karl Denninger (karl@denninger.net), 3/20/2014, FreeBSD-specific
+  *
+  * If "NEWRECLAIM" is defined, change the "low memory" warning that causes
+  * the ARC cache to be pared down.  The reason for the change is that the
+  * apparent attempted algorithm is to start evicting ARC cache when free
+  * pages fall below 25% of installed RAM.  This maps reasonably well to how
+  * Solaris is documented to behave; when "lotsfree" is invaded ZFS is told
+  * to pare down.
+  *
+  * The problem is that on FreeBSD machines the system doesn't appear to be
+  * getting what the authors of the original code thought they were looking at
+  * with its test -- or at least not what Solaris did -- and as a result that
+  * test never triggers.  That leaves the only reclaim trigger as the "paging
+  * needed" status flag, and by the time * that trips the system is already
+  * in low-memory trouble.  This can lead to severe pathological behavior
+  * under the following scenario:
+  * - The system starts to page and ARC is evicted.
+  * - The system stops paging as ARC's eviction drops wired RAM a bit.
+  * - ARC starts increasing its allocation again, and wired memory grows.
+  * - A new image is activated, and the system once again attempts to page.
+  * - ARC starts to be evicted again.
+  * - Back to #2
+  *
+  * Note that ZFS's ARC default (unless you override it in /boot/loader.conf)
+  * is to allow the ARC cache to grab nearly all of free RAM, provided nobody
+  * else needs it.  That would be ok if we evicted cache when required.
+  *
+  * Unfortunately the system can get into a state where it never
+  * manages to page anything of materiality back in, as if there is active
+  * I/O the ARC will start grabbing space once again as soon as the memory
+  * contention state drops.  For this reason the "paging is occurring" flag
+  * should be the **last resort** condition for ARC eviction; you want to
+  * (as Solaris does) start when there is material free RAM left BUT the
+  * vm system thinks it needs to be active to steal pages back in the attempt
+  * to never get into the condition where you're potentially paging off
+  * executables in favor of leaving disk cache allocated.
+  *
+  * To fix this we change how we look at low memory, declaring two new
+  * runtime tunables and one status.
+  *
+  * The new sysctls are:
+  * vfs.zfs.arc_freepages (free pages required to call RAM "sufficient")
+  * vfs.zfs.arc_freepage_percent (additional reservation percentage, default 0)
+  * vfs.zfs.arc_shrink_needed (shows "1" if we're asking for shrinking the ARC)
+  *
+  * vfs.zfs.arc_freepages is initialized from vm.v_free_target.
+  * This should insure that we allow the VM system to steal pages,
+  * but pare the cache before we suspend processes attempting to get more
+  * memory, thereby avoiding "stalls."  You can set this higher if you wish,
+  * or force a specific percentage reservation as well, but doing so may
+  * cause the cache to pare back while the VM system remains willing to
+  * allow "inactive" pages to accumulate.  The challenge is that image
+  * activation can force things into the page space on a repeated basis
+  * if you allow this level to be too small (the above pathological
+  * behavior); the defaults should avoid that behavior but the sysctls
+  * are exposed should your workload require adjustment.
+  *
+  * If we're using this check for low memory we are replacing the previous
+  * ones, including the oddball "random" reclaim that appears to fire far
+  * more often than it should.  We still trigger if the system pages.
+  *
+  * If you turn on NEWRECLAIM_DEBUG then the kernel will print on the console
+  * status messages when the reclaim status trips on and off, along with the
+  * page count aggregate that triggered it (and the free space) for each
+  * event.
+  */
+
+ #define	NEWRECLAIM
+ #undef	NEWRECLAIM_DEBUG
+
+
   /*
    * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
    * Copyright (c) 2013 by Delphix. All rights reserved.
***************
*** 139,144 ****
--- 211,223 ----
   
   #include <vm/vm_pageout.h>
   
+ #ifdef	NEWRECLAIM
+ #ifdef	__FreeBSD__
+ #include <sys/sysctl.h>
+ #include <sys/vmmeter.h>
+ #endif
+ #endif	/* NEWRECLAIM */
+
   #ifdef illumos
   #ifndef _KERNEL
   /* set with ZFS_DEBUG=watch, to enable watchpoints on frozen buffers */
***************
*** 203,218 ****
--- 282,320 ----
   int zfs_arc_shrink_shift = 0;
   int zfs_arc_p_min_shift = 0;
   int zfs_disable_dup_eviction = 0;
+ #ifdef	NEWRECLAIM
+ #ifdef  __FreeBSD__
+ static	int freepages = 0;	/* This much memory is considered critical */
+ static	int percent_target = 0;	/* Additionally reserve "X" percent free RAM */
+ static	int shrink_needed = 0;	/* Shrinkage of ARC cache needed?	*/
+ #endif	/* __FreeBSD__ */
+ #endif	/* NEWRECLAIM */
   
   TUNABLE_QUAD("vfs.zfs.arc_max", &zfs_arc_max);
   TUNABLE_QUAD("vfs.zfs.arc_min", &zfs_arc_min);
   TUNABLE_QUAD("vfs.zfs.arc_meta_limit", &zfs_arc_meta_limit);
+ #ifdef	NEWRECLAIM
+ #ifdef  __FreeBSD__
+ TUNABLE_INT("vfs.zfs.arc_freepages", &freepages);
+ TUNABLE_INT("vfs.zfs.arc_freepage_percent", &percent_target);
+ TUNABLE_INT("vfs.zfs.arc_shrink_needed", &shrink_needed);
+ #endif	/* __FreeBSD__ */
+ #endif	/* NEWRECLAIM */
+
   SYSCTL_DECL(_vfs_zfs);
   SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, arc_max, CTLFLAG_RDTUN, &zfs_arc_max, 0,
       "Maximum ARC size");
   SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, arc_min, CTLFLAG_RDTUN, &zfs_arc_min, 0,
       "Minimum ARC size");
   
+ #ifdef	NEWRECLAIM
+ #ifdef  __FreeBSD__
+ SYSCTL_INT(_vfs_zfs, OID_AUTO, arc_freepages, CTLFLAG_RWTUN, &freepages, 0, "ARC Free RAM Pages Required");
+ SYSCTL_INT(_vfs_zfs, OID_AUTO, arc_freepage_percent, CTLFLAG_RWTUN, &percent_target, 0, "ARC Free RAM Target percentage");
+ SYSCTL_INT(_vfs_zfs, OID_AUTO, arc_shrink_needed, CTLFLAG_RD, &shrink_needed, 0, "ARC Memory Constrained (0 = no, 1 = yes)");
+ #endif	/* __FreeBSD__ */
+ #endif	/* NEWRECLAIM */
+
   /*
    * Note that buffers can be in one of 6 states:
    *	ARC_anon	- anonymous (discussed below)
***************
*** 2438,2443 ****
--- 2540,2550 ----
   {
   
   #ifdef _KERNEL
+ #ifdef	NEWRECLAIM_DEBUG
+ 	static	int	xval = -1;
+ 	static	int	oldpercent = 0;
+ 	static	int	oldfreepages = 0;
+ #endif	/* NEWRECLAIM_DEBUG */
   
   	if (needfree)
   		return (1);
***************
*** 2476,2481 ****
--- 2583,2589 ----
   		return (1);
   
   #if defined(__i386)
+
   	/*
   	 * If we're on an i386 platform, it's possible that we'll exhaust the
   	 * kernel heap space before we ever run out of available physical
***************
*** 2492,2502 ****
   		return (1);
   #endif
   #else	/* !sun */
   	if (kmem_used() > (kmem_size() * 3) / 4)
   		return (1);
   #endif	/* sun */
   
- #else
   	if (spa_get_random(100) == 0)
   		return (1);
   #endif
--- 2600,2664 ----
   		return (1);
   #endif
   #else	/* !sun */
+
+ #ifdef	NEWRECLAIM
+ #ifdef  __FreeBSD__
+ /*
+  * Implement the new tunable free RAM algorithm.  We check the free pages
+  * against the minimum specified target and the percentage that should be
+  * free.  If we're low we ask for ARC cache shrinkage.  If this is defined
+  * on a FreeBSD system the older checks are not performed.
+  *
+  * Check first to see if we need to init freepages, then test.
+  */
+ 	if (!freepages) {		/* If zero then (re)init */
+ 		freepages = cnt.v_free_target;
+ #ifdef	NEWRECLAIM_DEBUG
+ 		printf("ZFS ARC: Default vfs.zfs.arc_freepages to [%u]\n", freepages);
+ #endif	/* NEWRECLAIM_DEBUG */
+ 	}
+ #ifdef	NEWRECLAIM_DEBUG
+ 	if (percent_target != oldpercent) {
+ 		printf("ZFS ARC: Reservation percent change to [%d], [%d] pages, [%d] free\n", percent_target, cnt.v_page_count, cnt.v_free_count);
+ 		oldpercent = percent_target;
+ 	}
+ 	if (freepages != oldfreepages) {
+ 		printf("ZFS ARC: Low RAM page change to [%d], [%d] pages, [%d] free\n", freepages, cnt.v_page_count, cnt.v_free_count);
+ 		oldfreepages = freepages;
+ 	}
+ #endif	/* NEWRECLAIM_DEBUG */
+ /*
+  * Now figure out how much free RAM we require to call the ARC cache status
+  * "ok".  Add the percentage specified of the total to the base requirement.
+  */
+
+ 	if (cnt.v_free_count < (freepages + ((cnt.v_page_count / 100) * percent_target))) {
+ #ifdef	NEWRECLAIM_DEBUG
+ 		if (xval != 1) {
+ 			printf("ZFS ARC: RECLAIM total %u, free %u, free pct (%u), reserved (%u), target pct (%u)\n", cnt.v_page_count, cnt.v_free_count, ((cnt.v_free_count * 100) / cnt.v_page_count), freepages, percent_target);
+ 			xval = 1;
+ 		}
+ #endif	/* NEWRECLAIM_DEBUG */
+ 		shrink_needed = 1;
+ 		return(1);
+ 	} else {
+ #ifdef	NEWRECLAIM_DEBUG
+ 		if (xval != 0) {
+ 			printf("ZFS ARC: NORMAL total %u, free %u, free pct (%u), reserved (%u), target pct (%u)\n", cnt.v_page_count, cnt.v_free_count, ((cnt.v_free_count * 100) / cnt.v_page_count), freepages, percent_target);
+ 			xval = 0;
+ 		}
+ #endif	/* NEWRECLAIM_DEBUG */
+ 		shrink_needed = 0;
+ 		return(0);
+ 	}
+
+ #endif	/* __FreeBSD__ */
+ #endif	/* NEWRECLAIM */
+
   	if (kmem_used() > (kmem_size() * 3) / 4)
   		return (1);
   #endif	/* sun */
   
   	if (spa_get_random(100) == 0)
   		return (1);
   #endif

-- 
-- Karl
karl@denninger.net

Comment 10 karl 2014-03-26 12:20:25 UTC
Updated to handle the change in <sys/vmmeter.h> that was recently 
committed to HEAD and slightly tweak the default reservation to be equal 
to the VM system's "wakeup" level.

This appears, after lots of use in multiple environments, to be the 
ideal default setting.  The knobs remain if you wish to twist then, and 
I have also exposed the return flag for shrinking being needed should 
you want to monitor it for some reason.

This change to arc.c has made a tremendous (and positive) difference in 
system behavior and others that are running it have made similar comments.

For those having problems with the PR system mangling these patches you 
can get the below patch via direct fetch at 
http://www.denninger.net/FreeBSD-Patches/arc-patch

*** arc.c.original	Sun Mar 23 14:56:01 2014
--- arc.c	Tue Mar 25 09:24:14 2014
***************
*** 18,23 ****
--- 18,95 ----
    *
    * CDDL HEADER END
    */
+
+ /* Karl Denninger (karl@denninger.net), 3/25/2014, FreeBSD-specific
+  *
+  * If "NEWRECLAIM" is defined, change the "low memory" warning that causes
+  * the ARC cache to be pared down.  The reason for the change is that the
+  * apparent attempted algorithm is to start evicting ARC cache when free
+  * pages fall below 25% of installed RAM.  This maps reasonably well to how
+  * Solaris is documented to behave; when "lotsfree" is invaded ZFS is told
+  * to pare down.
+  *
+  * The problem is that on FreeBSD machines the system doesn't appear to be
+  * getting what the authors of the original code thought they were looking at
+  * with its test -- or at least not what Solaris did -- and as a result that
+  * test never triggers.  That leaves the only reclaim trigger as the "paging
+  * needed" status flag, and by the time * that trips the system is already
+  * in low-memory trouble.  This can lead to severe pathological behavior
+  * under the following scenario:
+  * - The system starts to page and ARC is evicted.
+  * - The system stops paging as ARC's eviction drops wired RAM a bit.
+  * - ARC starts increasing its allocation again, and wired memory grows.
+  * - A new image is activated, and the system once again attempts to page.
+  * - ARC starts to be evicted again.
+  * - Back to #2
+  *
+  * Note that ZFS's ARC default (unless you override it in /boot/loader.conf)
+  * is to allow the ARC cache to grab nearly all of free RAM, provided nobody
+  * else needs it.  That would be ok if we evicted cache when required.
+  *
+  * Unfortunately the system can get into a state where it never
+  * manages to page anything of materiality back in, as if there is active
+  * I/O the ARC will start grabbing space once again as soon as the memory
+  * contention state drops.  For this reason the "paging is occurring" flag
+  * should be the **last resort** condition for ARC eviction; you want to
+  * (as Solaris does) start when there is material free RAM left BUT the
+  * vm system thinks it needs to be active to steal pages back in the attempt
+  * to never get into the condition where you're potentially paging off
+  * executables in favor of leaving disk cache allocated.
+  *
+  * To fix this we change how we look at low memory, declaring two new
+  * runtime tunables and one status.
+  *
+  * The new sysctls are:
+  * vfs.zfs.arc_freepages (free pages required to call RAM "sufficient")
+  * vfs.zfs.arc_freepage_percent (additional reservation percentage, default 0)
+  * vfs.zfs.arc_shrink_needed (shows "1" if we're asking for shrinking the ARC)
+  *
+  * vfs.zfs.arc_freepages is initialized from vm.v_free_target.
+  * This should insure that we allow the VM system to steal pages,
+  * but pare the cache before we suspend processes attempting to get more
+  * memory, thereby avoiding "stalls."  You can set this higher if you wish,
+  * or force a specific percentage reservation as well, but doing so may
+  * cause the cache to pare back while the VM system remains willing to
+  * allow "inactive" pages to accumulate.  The challenge is that image
+  * activation can force things into the page space on a repeated basis
+  * if you allow this level to be too small (the above pathological
+  * behavior); the defaults should avoid that behavior but the sysctls
+  * are exposed should your workload require adjustment.
+  *
+  * If we're using this check for low memory we are replacing the previous
+  * ones, including the oddball "random" reclaim that appears to fire far
+  * more often than it should.  We still trigger if the system pages.
+  *
+  * If you turn on NEWRECLAIM_DEBUG then the kernel will print on the console
+  * status messages when the reclaim status trips on and off, along with the
+  * page count aggregate that triggered it (and the free space) for each
+  * event.
+  */
+
+ #define	NEWRECLAIM
+ #undef	NEWRECLAIM_DEBUG
+
+
   /*
    * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
    * Copyright (c) 2013 by Delphix. All rights reserved.
***************
*** 139,144 ****
--- 211,230 ----
   
   #include <vm/vm_pageout.h>
   
+ #ifdef	NEWRECLAIM
+ #ifdef	__FreeBSD__
+ #include <sys/sysctl.h>
+ #include <sys/vmmeter.h>
+ /*
+  * Struct cnt was renamed to vm_cnt in -head (11-current) at __FreeBSD_version 1100016; check for it
+  */
+ #if __FreeBSD_version < 1100016
+ #define	vm_cnt	cnt
+ #endif	/* __FreeBSD_version */
+
+ #endif	/* __FreeBSD__ */
+ #endif	/* NEWRECLAIM */
+
   #ifdef illumos
   #ifndef _KERNEL
   /* set with ZFS_DEBUG=watch, to enable watchpoints on frozen buffers */
***************
*** 203,218 ****
--- 289,327 ----
   int zfs_arc_shrink_shift = 0;
   int zfs_arc_p_min_shift = 0;
   int zfs_disable_dup_eviction = 0;
+ #ifdef	NEWRECLAIM
+ #ifdef  __FreeBSD__
+ static	int freepages = 0;	/* This much memory is considered critical */
+ static	int percent_target = 0;	/* Additionally reserve "X" percent free RAM */
+ static	int shrink_needed = 0;	/* Shrinkage of ARC cache needed?	*/
+ #endif	/* __FreeBSD__ */
+ #endif	/* NEWRECLAIM */
   
   TUNABLE_QUAD("vfs.zfs.arc_max", &zfs_arc_max);
   TUNABLE_QUAD("vfs.zfs.arc_min", &zfs_arc_min);
   TUNABLE_QUAD("vfs.zfs.arc_meta_limit", &zfs_arc_meta_limit);
+ #ifdef	NEWRECLAIM
+ #ifdef  __FreeBSD__
+ TUNABLE_INT("vfs.zfs.arc_freepages", &freepages);
+ TUNABLE_INT("vfs.zfs.arc_freepage_percent", &percent_target);
+ TUNABLE_INT("vfs.zfs.arc_shrink_needed", &shrink_needed);
+ #endif	/* __FreeBSD__ */
+ #endif	/* NEWRECLAIM */
+
   SYSCTL_DECL(_vfs_zfs);
   SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, arc_max, CTLFLAG_RDTUN, &zfs_arc_max, 0,
       "Maximum ARC size");
   SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, arc_min, CTLFLAG_RDTUN, &zfs_arc_min, 0,
       "Minimum ARC size");
   
+ #ifdef	NEWRECLAIM
+ #ifdef  __FreeBSD__
+ SYSCTL_INT(_vfs_zfs, OID_AUTO, arc_freepages, CTLFLAG_RWTUN, &freepages, 0, "ARC Free RAM Pages Required");
+ SYSCTL_INT(_vfs_zfs, OID_AUTO, arc_freepage_percent, CTLFLAG_RWTUN, &percent_target, 0, "ARC Free RAM Target percentage");
+ SYSCTL_INT(_vfs_zfs, OID_AUTO, arc_shrink_needed, CTLFLAG_RD, &shrink_needed, 0, "ARC Memory Constrained (0 = no, 1 = yes)");
+ #endif	/* __FreeBSD__ */
+ #endif	/* NEWRECLAIM */
+
   /*
    * Note that buffers can be in one of 6 states:
    *	ARC_anon	- anonymous (discussed below)
***************
*** 2438,2443 ****
--- 2547,2557 ----
   {
   
   #ifdef _KERNEL
+ #ifdef	NEWRECLAIM_DEBUG
+ 	static	int	xval = -1;
+ 	static	int	oldpercent = 0;
+ 	static	int	oldfreepages = 0;
+ #endif	/* NEWRECLAIM_DEBUG */
   
   	if (needfree)
   		return (1);
***************
*** 2476,2481 ****
--- 2590,2596 ----
   		return (1);
   
   #if defined(__i386)
+
   	/*
   	 * If we're on an i386 platform, it's possible that we'll exhaust the
   	 * kernel heap space before we ever run out of available physical
***************
*** 2492,2502 ****
   		return (1);
   #endif
   #else	/* !sun */
   	if (kmem_used() > (kmem_size() * 3) / 4)
   		return (1);
   #endif	/* sun */
   
- #else
   	if (spa_get_random(100) == 0)
   		return (1);
   #endif
--- 2607,2671 ----
   		return (1);
   #endif
   #else	/* !sun */
+
+ #ifdef	NEWRECLAIM
+ #ifdef  __FreeBSD__
+ /*
+  * Implement the new tunable free RAM algorithm.  We check the free pages
+  * against the minimum specified target and the percentage that should be
+  * free.  If we're low we ask for ARC cache shrinkage.  If this is defined
+  * on a FreeBSD system the older checks are not performed.
+  *
+  * Check first to see if we need to init freepages, then test.
+  */
+ 	if (!freepages) {		/* If zero then (re)init */
+ 		freepages = vm_cnt.v_free_target;
+ #ifdef	NEWRECLAIM_DEBUG
+ 		printf("ZFS ARC: Default vfs.zfs.arc_freepages to [%u]\n", freepages);
+ #endif	/* NEWRECLAIM_DEBUG */
+ 	}
+ #ifdef	NEWRECLAIM_DEBUG
+ 	if (percent_target != oldpercent) {
+ 		printf("ZFS ARC: Reservation percent change to [%d], [%d] pages, [%d] free\n", percent_target, vm_cnt.v_page_count, vm_cnt.v_free_count);
+ 		oldpercent = percent_target;
+ 	}
+ 	if (freepages != oldfreepages) {
+ 		printf("ZFS ARC: Low RAM page change to [%d], [%d] pages, [%d] free\n", freepages, vm_cnt.v_page_count, vm_cnt.v_free_count);
+ 		oldfreepages = freepages;
+ 	}
+ #endif	/* NEWRECLAIM_DEBUG */
+ /*
+  * Now figure out how much free RAM we require to call the ARC cache status
+  * "ok".  Add the percentage specified of the total to the base requirement.
+  */
+
+ 	if (vm_cnt.v_free_count < (freepages + ((vm_cnt.v_page_count / 100) * percent_target))) {
+ #ifdef	NEWRECLAIM_DEBUG
+ 		if (xval != 1) {
+ 			printf("ZFS ARC: RECLAIM total %u, free %u, free pct (%u), reserved (%u), target pct (%u)\n", vm_cnt.v_page_count, vm_cnt.v_free_count, ((vm_cnt.v_free_count * 100) / vm_cnt.v_page_count), freepages, percent_target);
+ 			xval = 1;
+ 		}
+ #endif	/* NEWRECLAIM_DEBUG */
+ 		shrink_needed = 1;
+ 		return(1);
+ 	} else {
+ #ifdef	NEWRECLAIM_DEBUG
+ 		if (xval != 0) {
+ 			printf("ZFS ARC: NORMAL total %u, free %u, free pct (%u), reserved (%u), target pct (%u)\n", vm_cnt.v_page_count, vm_cnt.v_free_count, ((vm_cnt.v_free_count * 100) / vm_cnt.v_page_count), freepages, percent_target);
+ 			xval = 0;
+ 		}
+ #endif	/* NEWRECLAIM_DEBUG */
+ 		shrink_needed = 0;
+ 		return(0);
+ 	}
+
+ #endif	/* __FreeBSD__ */
+ #endif	/* NEWRECLAIM */
+
   	if (kmem_used() > (kmem_size() * 3) / 4)
   		return (1);
   #endif	/* sun */
   
   	if (spa_get_random(100) == 0)
   		return (1);
   #endif

-- 
-- Karl
karl@denninger.net

Comment 11 Devin Teske freebsd_committer freebsd_triage 2014-03-27 21:55:58 UTC
Hi,

I can't seem to find the code where you mention in your
previous post:

`...and slightly tweak the default reservation to be equal 
to the VM system's "wakeup" level.'

Comparing Mar 26th's patch to Mar 24th's patch yields no
such change. Did you post the latest patch?
-- 
Devin

Comment 12 karl 2014-03-28 01:32:17 UTC
The last change was the "cnt" structure rename; the previous rev was the
one where freepages was set to cnt.v_free_target (the margin was removed
in the rev sent on the morning of 24 March) -- there was no logic change
in the 26 March followup vs. the previous one from 24 March.

The latest that I and others are running is what is on the PR (and at the
link, which is identical -- a couple of people had said the patches
included inline in PR followups were mangled when copied down to be
applied.)

Apologies for the confusion.

-- 
-- Karl
karl@denninger.net

Comment 13 karl 2014-04-03 17:57:50 UTC
After more than a week of operation without any changes on a very busy 
production server this is what the status looks like at this particular 
moment in time (caught it being pretty quiet at the moment... slow day):

[karl@NewFS ~]$ uptime
11:56AM  up 10 days, 20:37, 1 user, load averages: 0.80, 0.59, 0.58

[karl@NewFS ~]$ uname -v
FreeBSD 10.0-STABLE #22 r263665:263671M: Sun Mar 23 15:00:48 CDT 2014
  karl@NewFS.denninger.net:/usr/obj/usr/src/sys/KSD-SMP


     1 users    Load  0.50  0.57  0.58                  Apr  3 11:52

Mem:KB    REAL            VIRTUAL                       VN PAGER   SWAP PAGER
         Tot   Share      Tot    Share    Free           in   out     in   out
Act 4503936   32680  9319616    54908  701712  count
All  17598k   42312 10162228   293268          pages
Proc:                                                            Interrupts
   r   p   d   s   w   Csw  Trp  Sys  Int  Sof  Flt        ioflt  2635 total
   2         245   3  9936 4302  12k  990  442 2878   1161 cow      11 uart0 4
                                                      1460 zfod     53 uhci0 16
  0.6%Sys   0.1%Intr  1.5%User  0.0%Nice 97.8%Idle         ozfod       pcm0 17
|    |    |    |    |    |    |    |    |    |           %ozfod       ehci0 uhci
>                                                         daefr       uhci1 21

                                            dtbuf     1779 prcfr   532 uhci3 ehci
Namei     Name-cache   Dir-cache    485888 desvn     3862 totfr    44 twa0 30
    Calls    hits   %    hits   %    145761 numvn          react   989 cpu0:timer
    18611   18549 100                121467 frevn          pdwak    69 mps0 256
                                                       909 pdpgs    24 em0:rx 0
Disks   da0   da1   da2   da3   da4   da5   da6           intrn    32 em0:tx 0
KB/t  10.30 10.39  0.00  0.00 22.61 24.69 24.39  19017980 wire        em0:link
tps      21    21     0     0    10    16    16   2197580 act     118 em1:rx 0
MB/s   0.22  0.22  0.00  0.00  0.22  0.39  0.39   2544544 inact   107 em1:tx 0
%busy    19    19     0     0     0     1     1      3276 cache       em1:link
                                                    698064 free        ahci0:ch0
                                                           buf      32 cpu1:timer
                                                                    24 cpu10:time
                                                                    50 cpu6:timer
                                                                    26 cpu12:time
                                                                    37 cpu7:timer
                                                                    45 cpu14:time
                                                                    41 cpu4:timer
                                                                    35 cpu15:time
                                                                    25 cpu5:timer
                                                                    45 cpu9:timer
                                                                    45 cpu2:timer
                                                                   102 cpu11:time
                                                                    63 cpu3:timer
                                                                    41 cpu13:time
                                                                    45 cpu8:timer

[karl@NewFS ~]$ zpool list
NAME     SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
media   2.72T  2.12T   616G    77%  1.00x  ONLINE  -
zroot    234G  18.8G   215G     8%  1.36x  ONLINE  -
zstore  3.63T  2.50T  1.13T    68%  1.00x  ONLINE  -

[karl@NewFS ~]$ zfs-stats -A

------------------------------------------------------------------------
ZFS Subsystem Report                            Thu Apr  3 11:53:42 2014
------------------------------------------------------------------------

ARC Summary: (HEALTHY)
         Memory Throttle Count:                  0

ARC Misc:
         Deleted:                                27.84m
         Recycle Misses:                         1.12m
         Mutex Misses:                           2.65k
         Evict Skips:                            39.26m

ARC Size:                               59.13%  13.20   GiB
         Target Size: (Adaptive)         59.14%  13.20   GiB
         Min Size (Hard Limit):          12.50%  2.79    GiB
         Max Size (High Water):          8:1     22.33   GiB

ARC Size Breakdown:
         Recently Used Cache Size:       81.41%  10.75   GiB
         Frequently Used Cache Size:     18.59%  2.46    GiB

ARC Hash Breakdown:
         Elements Max:                           2.69m
         Elements Current:               63.22%  1.70m
         Collisions:                             95.13m
         Chain Max:                              24
         Chains:                                 413.62k

------------------------------------------------------------------------

[karl@NewFS ~]$ zfs-stats -E

------------------------------------------------------------------------
ZFS Subsystem Report                            Thu Apr  3 11:53:59 2014
------------------------------------------------------------------------

ARC Efficiency:                                 1.28b
         Cache Hit Ratio:                98.37%  1.26b
         Cache Miss Ratio:               1.63%   20.80m
         Actual Hit Ratio:               60.07%  766.91m

         Data Demand Efficiency:         99.15%  435.02m
         Data Prefetch Efficiency:       20.45%  17.49m

         CACHE HITS BY CACHE LIST:
           Anonymously Used:             38.72%  486.24m
           Most Recently Used:           3.74%   46.94m
           Most Frequently Used:         57.33%  719.97m
           Most Recently Used Ghost:     0.06%   792.68k
           Most Frequently Used Ghost:   0.16%   1.97m

         CACHE HITS BY DATA TYPE:
           Demand Data:                  34.34%  431.32m
           Prefetch Data:                0.28%   3.58m
           Demand Metadata:              23.72%  297.92m
           Prefetch Metadata:            41.65%  523.09m

         CACHE MISSES BY DATA TYPE:
           Demand Data:                  17.75%  3.69m
           Prefetch Data:                66.88%  13.91m
           Demand Metadata:              5.78%   1.20m
           Prefetch Metadata:            9.60%   2.00m

------------------------------------------------------------------------

Grinnin' big, in short.

I have no reason to make further changes to the code or defaults.

-- 
-- Karl
karl@denninger.net

Comment 14 karl 2014-04-14 16:40:49 UTC
Follow-up:

21 days of uninterrupted uptime at this point: inact pages are stable, as
is the free list; wired and free are appropriate and vary with load as
expected; ZERO swapping; and performance has remained excellent -- all on
a very heavily used, fairly beefy (~24GB RAM, dual Xeon CPUs) production
system under 10-STABLE.

-- 
-- Karl
karl@denninger.net

Comment 15 Devin Teske freebsd_committer freebsd_triage 2014-04-14 18:58:09 UTC
Been running this on stable/8 for a week now on 3 separate machines.
All appears stable, and under heavy load we can certainly see the new
reclaim firing early and appropriately when needed (no longer do we
have programs getting swapped out).

Interestingly, in our testing we've found that we can force the old
reclaim (the code state prior to applying Karl's patch) to fire by sapping
the few remaining pages of unallocated memory. I do this by exploiting
a little-known bug in the Bourne shell to leak memory (command below).

	sh -c 'f(){ while :;do local b;done;};f'

Watching "top" in the un-patched state, we can see Wired memory grow
from ARC usage but not drop. I then run the above command and "top"
shows an "sh" process with a fast-growing "SIZE", quickly eating up about
100MB per second. When "top" shows the Free memory drop to mere KB
(single pages), we see the original (again, unpatched) reclaim algorithm
fire
and the Wired memory finally starts to drop.

After applying this patch, we no longer have to play the game of "eat
all my remaining memory to force the original reclaim event to free up
pages"; rather, the ARC waxes and wanes with normal application usage.

However, I must say that on stable/8 the problem of applications going to
sleep is not nearly as bad as I have experienced it in 9 or 10.

We are happy to report that the patch seems to be a win for stable/8 as
well because, in our case, we do like to have a bit of free memory and the
old reclaim was not providing that. It's nice not to have to resort to
tricks to get the ARC to pare down.
-- 
Cheers,
Devin

Comment 16 karl 2014-05-15 16:05:34 UTC
I have now been running the latest delta as posted 26 March -- it is
coming up on two months now, it has been stable here, and I've seen
several positive reports and no negative ones on its impact for others.
Performance continues to be "as expected."

Is there an expectation on this being merged forward and/or MFC'd?

-- 
-- Karl
karl@denninger.net

Comment 17 Adrian Chadd freebsd_committer freebsd_triage 2014-06-20 16:56:49 UTC
I'll swing Alan Cox into it and see what he thinks.

Thanks!


-a
Comment 18 fullermd 2014-06-20 20:01:20 UTC
I've also been running the patch (from http://...) for 1-2 weeks now on a couple of stable/10 boxes and one -CURRENT, ranging from 4 to 16 gig of RAM.  No problems noted; swapping is definitely less aggressive, so I've not yet had to wait for anything to swap back in.  It doesn't seem to starve the ARC either.
Comment 19 Adrian Chadd freebsd_committer freebsd_triage 2014-06-20 20:55:21 UTC
From alc:

I gave it a cursory look.  The patch appears to use
"vm_cnt.v_free_target" incorrectly.  If you look at sys/vmmeter.h,
specifically,

/*
 * Return TRUE if we have not reached our free page target during
 * free page recovery operations.
 */

static __inline
int
vm_page_count_target(void)
{
    return (vm_cnt.v_free_target > (vm_cnt.v_free_count +
          vm_cnt.v_cache_count));
}

/*
 * Return the number of pages we need to free-up or cache
 * A positive number indicates that we do not have enough free pages.
 */

static __inline
int
vm_paging_target(void)
{
    return (vm_cnt.v_free_target - (vm_cnt.v_free_count +
          vm_cnt.v_cache_count));
}

you see that "vm_cnt.v_free_target" should be compared to
"(vm_cnt.v_free_count + vm_cnt.v_cache_count)" not "vm_cnt.v_free_count".
Comment 20 karl 2014-06-20 21:18:34 UTC
No, because memory in "cache" is subject to being either reallocated or freed.  When I was developing this patch that was my first impression as well and how I originally coded it, and it turned out to be wrong.

The issue here is that you have two parts of the system contending for RAM -- the VM system generally, and the ARC cache.  If the ARC cache frees space before the VM system activates and starts pruning then you wind up with the ARC pinned at the minimum after some period of time, because it releases "early." 

The original ZFS code releases ARC only when the VM system goes into "desperation" mode.  That's too late and results in pathological behavior including long freezes where nothing appears to happen at all.  What appears to actually be happening is that the ARC is essentially dumped while paging is occurring, and the system reacts very badly to that.

The test as it sits now activates the ARC pare-down at the point the VM system wakes up.  The two go into and out of contention at roughly the same time resulting in a balanced result -- the ARC stabilizes at a value allowing some cached pages to remain, but cached pages do not grow without boundary nor does the system get into a page starvation situation and get into the "freeze" condition trying to free huge chunks of ARC at once.
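
With the defaults (percent_target left at zero) the test in the patch reduces to this sketch -- the ARC is asked to shrink as soon as the free page count drops below vm.v_free_target:

    if (vm_cnt.v_free_count < vm_cnt.v_free_target)
        return (1);    /* pare the ARC back */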

If you need to bias the ARC pare-down more aggressively you can do so through the tunables, but the existing defaults are what, after much experimentation across multiple workloads and RAM sizes, were found to result in both a stable ARC and a stable cache page population over long periods of time (weeks of uptime across varying loads).

As currently implemented this has now been running untouched for several months on an extremely busy web, database (Postgresql) and internal Samba server without incident.
Comment 21 Adam McDougall 2014-07-10 02:30:49 UTC
I have been using this patch for a while on two desktops (4G and 16G) and my swapping activity has stopped with no visible negative impact.  Positive impact for me is no longer having to wait for applications to swap back in when I need to use them.  I tried r265945 and vm.lowmem_period=0 and neither were as effective as the patch in this bug report.  I plan to roll this patch into my next build for general deployment to my servers and start experimenting with removing my vfs.zfs.arc_max setting which I am currently using on some.  I hope it can gain enough consensus to at least commit to -current.  Thanks.
Comment 22 Michael Moll freebsd_committer freebsd_triage 2014-07-26 01:49:30 UTC
Any news on this? Especially regarding 10.1?
Comment 23 karl 2014-08-20 15:28:57 UTC
Status -- still grinning, still doing what it's supposed to on my fairly heavily loaded production machine.  This machine has a large Postgres database running on it and has 19GB of RAM wired at the present time.

Are there any further comments from the reviewers on integration into the tree for general distribution?

[karl@NewFS ~]$ uname -v
FreeBSD 10.0-STABLE #0 r265164M: Wed Apr 30 16:55:42 CDT 2014     karl@NewFS.denninger.net:/usr/obj/usr/src/sys/KSD-SMP 

[karl@NewFS ~]$ uptime
10:24AM  up 111 days, 16:56, 1 user, load averages: 0.43, 0.55, 0.51

[karl@NewFS ~]$ zfs-stats -AE

------------------------------------------------------------------------
ZFS Subsystem Report                            Wed Aug 20 10:24:49 2014
------------------------------------------------------------------------

ARC Summary: (HEALTHY)
        Memory Throttle Count:                  0

ARC Misc:
        Deleted:                                146.88m
        Recycle Misses:                         6.59m
        Mutex Misses:                           14.19k
        Evict Skips:                            176.63m

ARC Size:                               65.19%  14.56   GiB
        Target Size: (Adaptive)         65.19%  14.56   GiB
        Min Size (Hard Limit):          12.50%  2.79    GiB
        Max Size (High Water):          8:1     22.33   GiB

ARC Size Breakdown:
        Recently Used Cache Size:       78.62%  11.44   GiB
        Frequently Used Cache Size:     21.38%  3.11    GiB

ARC Hash Breakdown:
        Elements Max:                           1.61m
        Elements Current:               61.38%  986.56k
        Collisions:                             653.96m
        Chain Max:                              27
        Chains:                                 243.95k


------------------------------------------------------------------------

ARC Efficiency:                                 6.56b
        Cache Hit Ratio:                98.75%  6.48b
        Cache Miss Ratio:               1.25%   81.95m
        Actual Hit Ratio:               82.14%  5.39b

        Data Demand Efficiency:         99.56%  3.83b
        Data Prefetch Efficiency:       23.05%  53.24m

        CACHE HITS BY CACHE LIST:
          Anonymously Used:             16.46%  1.07b
          Most Recently Used:           4.48%   289.87m
          Most Frequently Used:         78.70%  5.10b
          Most Recently Used Ghost:     0.07%   4.50m
          Most Frequently Used Ghost:   0.29%   18.94m

        CACHE HITS BY DATA TYPE:
          Demand Data:                  58.89%  3.81b
          Prefetch Data:                0.19%   12.27m
          Demand Metadata:              17.85%  1.16b
          Prefetch Metadata:            23.07%  1.49b

        CACHE MISSES BY DATA TYPE:
          Demand Data:                  20.38%  16.70m
          Prefetch Data:                49.99%  40.97m
          Demand Metadata:              16.74%  13.72m
          Prefetch Metadata:            12.89%  10.56m

------------------------------------------------------------------------
Comment 24 Mark.Martinec 2014-08-20 15:31:00 UTC
Please don't miss the 10.1 slush! The fix is still not in 10-STABLE
and needs to be manually applied for each new 10 installation.
Comment 25 fullermd 2014-08-23 11:37:55 UTC
Having run it for a few months on a number of boxes now, my general impression is that it seems like it goes a little _too_ far (with default options anyway; I haven't tried any tuning) toward making the ARC give up its lunch money to anybody who looks threateningly at it.  It feels like it should be a bit more aggressive, and historically was and did fine.

However, it's still _much_ nicer than the unpatched case, where the rest of the system starves and hides out in the swap space.  So from here, while perhaps imperfect and in need of some tuning work, it's still a significant improvement on the prior state, so landing it sounds just fine to me.
Comment 26 Steven Hartland freebsd_committer freebsd_triage 2014-08-23 14:16:38 UTC
I've actually been looking at this patch today in relation to my investigation of:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=191510

I would appreciate it if people could test the attached patch, which was created against stable/10.

It should achieve the same as Karl's patch, as well as:
* More closely matching the original Solaris logic
* Providing better control of the reclaim trigger (absolute, not percentage based, which becomes a problem on larger-memory machines)
* Using direct kernel values instead of interfacing via sysctls
* Likely fixing the issue identified in #191510

The basic design is that it triggers ARC reclaim when the free page count drops below vfs.zfs.arc_free_target, which by default is 3x the VM's free page target as exposed by vm.v_free_target (matching Solaris).
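
In rough sketch form (the attached patch is authoritative -- this only illustrates the shape of the check, and the zfs_arc_free_target variable name is assumed from the sysctl name above):

    /* Default chosen at init time; exposed as vfs.zfs.arc_free_target. */
    zfs_arc_free_target = 3 * vm_cnt.v_free_target;

    /* Later, in arc_reclaim_needed(): shrink the ARC once the free page
     * count falls below that target. */
    if (vm_cnt.v_free_count < zfs_arc_free_target)
        return (1);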
Comment 27 Steven Hartland freebsd_committer freebsd_triage 2014-08-23 14:18:50 UTC
Created attachment 146178 [details]
refactor arc reclaim logic
Comment 28 Steven Hartland freebsd_committer freebsd_triage 2014-08-24 09:05:40 UTC
Created attachment 146203 [details]
refactor arc reclaim logic
Comment 29 karl 2014-08-24 18:12:53 UTC
Steve;

I don't think you're going to like what your patch does under load.

If I'm reading it correctly you are basically paring the ARC down when vm.v_free_target * 3 is invaded -- that is, (zfs_arc_free_target = kmem_free_target_size() * 3;) -- so at three times the free target size you will pare the ARC back.

When I allowed there to be a margin between ARC paring and the vm system's target free page space what ultimately happened under load was that the ARC cache was pinned to the minimum value because the VM system never woke up to reclaim any of the reserved but not-in-active-use space.

I believe that the reason for this is the basic design of the unified VM system in FreeBSD; pages are allowed to "float" on the premise that they may be re-used, which is fine in the general case and in fact improves performance.  But when you have an ARC cache that can only be allocated out of actual free memory, that "floating" space winds up pinning the ARC to the floor, as ARC is evicted in favor of pages that are not in active use by other parts of the VM system.

I went through several iterations of the patch, including both ratio-based and fixed reservations over v_free_target, with all of those attempts winding up with corner cases that led to bad outcomes; after much work the last update I posted defaults to this:  (freepages = vm_cnt.v_free_target;)

That results in ARC pare-back and VM page-cleaning wakeup happening at the same time.  You can tune the defaults away from that, but I have found that doing so in the general case leads to undesired outcomes, either with the ARC being pinned to zero or the "hangs" (really long pauses) returning.  I left the tunables in for those cases where a workload other than what I had encountered made them wise, although from the comments over the last three months I suspect they could be removed since those knobs appear to be far more-likely to hurt than help.

It would appear that what I settled on is the wrong decision by the original code comments, but that code assumes VM behavior that is different than what FreeBSD actually does.  There may be a valid debate to be had on altering FreeBSD's VM behavior via page coloration or somesuch so that ARC has "preference" over things like inactive or (especially) cached pages -- but without that sort of change I believe what you've proposed is going to run into a problem with system behavior under load similar to what I saw when I attempted to pare back ARC before VM page-cleaning wakes up.

From the production system here and my various test systems, this is what produces a stable operating environment:

vm.v_free_target: 130321
vm.stats.vm.v_free_target: 130321
[karl@NewFS ~]$ sysctl -a|grep arc_freepage
vfs.zfs.arc_freepages: 130321
vfs.zfs.arc_freepage_percent: 0

In other words, ARC paring and the VM system wake up at the same time; no additional reservation beyond free_target is requested.

If I tune vfs.zfs.arc_freepages up from that value manually by any material amount (e.g. to the 3x value that I think you're setting it to), I will end up pinning the ARC cache to the floor on my servers within a relatively short period of time.
Comment 30 Steven Hartland freebsd_committer freebsd_triage 2014-08-24 19:02:35 UTC
That's not what we've experienced here.  On a box running an nginx reverse proxy which is under high memory load from ZFS ARC, we see:

------------------------------------------------------------------------
ZFS Subsystem Report                            Sun Aug 24 18:49:13 2014
------------------------------------------------------------------------

ARC Summary: (HEALTHY)
        Memory Throttle Count:                  0

ARC Misc:
        Deleted:                                73.95m
        Recycle Misses:                         4.08m
        Mutex Misses:                           30.34k
        Evict Skips:                            4.98m

ARC Size:                               93.63%  174.17  GiB
        Target Size: (Adaptive)         93.63%  174.17  GiB
        Min Size (Hard Limit):          12.50%  23.25   GiB
        Max Size (High Water):          8:1     186.01  GiB

ARC Size Breakdown:
        Recently Used Cache Size:       76.28%  132.86  GiB
        Frequently Used Cache Size:     23.72%  41.31   GiB

ARC Hash Breakdown:
        Elements Max:                           5.44m
        Elements Current:               100.00% 5.44m
        Collisions:                             9.36m
        Chain Max:                              8
        Chains:                                 533.83k

------------------------------------------------------------------------

If by default your patch results in freepages = vm_cnt.v_free_target, you'll always be hitting the "if (vm_paging_needed())" check.

Also, the percentage method results in massive amounts of unused memory.  On the box above, 40GB was never used, which is clearly a problem.

I'd really be interested if you could test and confirm whether you do indeed experience what you believe might happen with your workload.
Comment 31 karl 2014-08-24 20:38:21 UTC
Can I replicate your patch's behavior by tuning vfs.zfs.arc_freepages to vm.v_free_target * 3, using the patch I have in? (this link here: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=187594#c10)

I think so if I'm reading your patch correctly.

If so then I can do that test "hot" since my patch allows "online" tuning of that parameter and I will know in a few hours during the business day (and maybe quite quickly this evening, although our load is much lower at night and on weekends.)

The patch at the above link is not a percentage-based reservation; this is what I have now on the machine in question that has been running my patch in production:

[karl@NewFS ~]$ systat -vm

    1 users    Load  0.70  0.65  0.58                  Aug 24 15:25

Mem:KB    REAL            VIRTUAL                       VN PAGER   SWAP PAGER
        Tot   Share      Tot    Share    Free           in   out     in   out
Act 4252540   33548  8746752    69996  650840  count
All  17358k   51460  9688212   255976          pages
Proc:                                                            Interrupts
  r   p   d   s   w   Csw  Trp  Sys  Int  Sof  Flt        ioflt  6733 total
            237   3   22k 6650  14k 3866  441 2700   1197 cow      11 uart0 4
                                                     1286 zfod    154 uhci0 16
 0.9%Sys   0.3%Intr  2.0%User  0.0%Nice 96.7%Idle         ozfod       pcm0 17
|    |    |    |    |    |    |    |    |    |           %ozfod       ehci0 uhci
+>                                                        daefr       uhci1 21
                                           dtbuf     1889 prcfr   527 uhci3 ehci
Namei     Name-cache   Dir-cache    485888 desvn     6640 totfr   131 twa0 30
   Calls    hits   %    hits   %    143890 numvn          react  1003 cpu0:timer
   11904   11816  99                121463 frevn          pdwak   123 mps0 256
                                                      631 pdpgs  1387 em0:rx 0
Disks   da0   da1   da2   da3   da4   da5   da6           intrn  1353 em0:tx 0
KB/t  23.86 23.78  0.00  0.00 91.68 89.02 87.34  19287356 wire        em0:link
tps      51    50     0     0    25    37    38   1538432 act      84 em1:rx 0
MB/s   1.18  1.17  0.00  0.00  2.22  3.21  3.24   2984088 inact    97 em1:tx 0
%busy    37    35     0     0     2     2     3     23764 cache       em1:link
                                                   627204 free    226 cpu1:timer
                                                       96 buf      61 cpu8:timer
                                                                  212 cpu7:timer
                                                                   71 cpu11:time
                                                                  137 cpu3:timer
                                                                   48 cpu12:time
                                                                   94 cpu6:timer
                                                                   60 cpu15:time
                                                                  185 cpu5:timer
                                                                   74 cpu10:time
                                                                  220 cpu4:timer
                                                                  105 cpu13:time
                                                                  170 cpu2:timer
                                                                  105 cpu9:timer
                                                                   95 cpu14:time


That's not much in the way of free RAM up there...

This is what zfs-stats -AE shows:

[karl@NewFS ~]$ zfs-stats -AE

------------------------------------------------------------------------
ZFS Subsystem Report                            Sun Aug 24 15:31:24 2014
------------------------------------------------------------------------

ARC Summary: (HEALTHY)
        Memory Throttle Count:                  0

ARC Misc:
        Deleted:                                153.75m
        Recycle Misses:                         6.79m
        Mutex Misses:                           14.80k
        Evict Skips:                            182.26m

ARC Size:                               59.18%  13.21   GiB
        Target Size: (Adaptive)         59.18%  13.21   GiB
        Min Size (Hard Limit):          12.50%  2.79    GiB
        Max Size (High Water):          8:1     22.33   GiB

ARC Size Breakdown:
        Recently Used Cache Size:       87.08%  11.51   GiB
        Frequently Used Cache Size:     12.92%  1.71    GiB

ARC Hash Breakdown:
        Elements Max:                           1.61m
        Elements Current:               47.36%  761.34k
        Collisions:                             680.92m
        Chain Max:                              27
        Chains:                                 174.04k

------------------------------------------------------------------------

ARC Efficiency:                                 6.77b
        Cache Hit Ratio:                98.75%  6.69b
        Cache Miss Ratio:               1.25%   84.90m
        Actual Hit Ratio:               82.47%  5.59b

        Data Demand Efficiency:         99.56%  4.00b
        Data Prefetch Efficiency:       23.02%  55.39m

        CACHE HITS BY CACHE LIST:
          Anonymously Used:             16.12%  1.08b
          Most Recently Used:           4.50%   300.76m
          Most Frequently Used:         79.02%  5.28b
          Most Recently Used Ghost:     0.07%   4.66m
          Most Frequently Used Ghost:   0.29%   19.71m

        CACHE HITS BY DATA TYPE:
          Demand Data:                  59.49%  3.98b
          Prefetch Data:                0.19%   12.75m
          Demand Metadata:              17.78%  1.19b
          Prefetch Metadata:            22.53%  1.51b

        CACHE MISSES BY DATA TYPE:
          Demand Data:                  20.67%  17.55m
          Prefetch Data:                50.23%  42.64m
          Demand Metadata:              16.54%  14.04m
          Prefetch Metadata:            12.56%  10.66m

------------------------------------------------------------------------

As you can see there's plenty of contention for RAM in that the dynamic size of the ARC cache is about HALF of the maximum permitted.  However, RAM is not sitting around unused; the system is appropriately managing that memory between other VM uses and that of ARC cache.  It's doing great in terms of efficiency, given that it's got a near-99% hit ratio.

Point being that there would be effectively no gain to forcing more ARC usage, especially if it comes at the expense of (any) paging -- the system is doing zero of that and has nothing (effectively) paged out:

[karl@NewFS ~]$ pstat -s
Device          1K-blocks     Used    Avail Capacity
/dev/gpt/swap1.eli  67108864      692 67108172     0%

That system is a dual Xeon E5620 machine with 24GB of ECC RAM in it.  It has a Postgres database, an Apache web server and a bunch of middleware running on it and has been up for about three months now untouched.  It's pretty unloaded at the moment (as you can see) but gets a ton of traffic during the business day.
Comment 32 Steven Hartland freebsd_committer freebsd_triage 2014-08-25 10:32:32 UTC
I'll be honest, I hadn't seen the inline patch. Can you attach the latest version so I can compare it to the original?
Comment 33 Steven Hartland freebsd_committer freebsd_triage 2014-08-25 10:35:12 UTC
Created attachment 146249 [details]
arc reclaim refactor (against head)
Comment 34 karl 2014-08-25 12:52:47 UTC
Created attachment 146251 [details]
From comment #10
Comment 35 karl 2014-08-25 12:53:25 UTC
(In reply to Steven Hartland from comment #32)
> I'll be honest I hadn't seen the inline patch. Can you attach the latest
> version so I can compare that to the original.

Done; it's in comment #10.
Comment 36 Steven Hartland freebsd_committer freebsd_triage 2014-08-25 13:06:05 UTC
What version is that against, Karl? It fails every hunk against current head.
Comment 37 Steven Hartland freebsd_committer freebsd_triage 2014-08-25 13:28:15 UTC
Figured it out: the patch is in DOS, not Unix, format ;-)
Comment 38 karl 2014-08-25 14:09:40 UTC
(In reply to Steven Hartland from comment #37)
> Figured it out patch is in dos not unix format ;-)

Bugzilla did that; all I did was copy the patch in comment #10 and paste it into the "Add an attachment" box!  I suppose I should file a PR on the PR system ;-)
Comment 39 Steven Hartland freebsd_committer freebsd_triage 2014-08-25 18:02:59 UTC
The difference between the two approaches in the default config is:
if (vm_cnt.v_free_count < vm_cnt.v_free_target) {
vs:
if (vm_cnt.v_free_count < 3 * vm_cnt.v_free_target) {

So mine is slightly more conservative, with the ARC scaling back when the free page count drops below 3x the free page target.

Currently this is:
if (kmem_used() > (kmem_size() * 3) / 4)

Which means:
1. ARC is limited to 3/4 of system memory.
2. ARC reclaim can start much later, since the check behaves as though free memory were kmem_size() - kmem_used(), but v_free_count is nowhere near that.
Comment 40 karl 2014-08-25 19:20:35 UTC
That's what I thought from reading your patch.

I can tell you what I expect to happen with my configuration if I do that here, though (I will confirm later today, as it's easy to do since with my patch I can tune it in real time) -- my ARC is going to get pinned to the floor.

Here we go:

[root@NewFS /disk/karl]# sysctl -a|grep free_target
vm.v_free_target: 130321
vm.stats.vm.v_free_target: 130321

[root@NewFS /disk/karl]# sysctl -w vfs.zfs.arc_freepages=390963
vfs.zfs.arc_freepages: 130321 -> 390963

I'll let it run for a while and report back, but this much I know from my earlier testing and that of others who I shared the early versions with and got feedback from -- my original attempt in fixing this was exactly as you proposed, but with a smaller margin -- that is, to set the ARC release to free_target + a margin; original attempts were anywhere from 10% to 3% of RAM.  

The problem is that under the right sort of workloads ANY value for ARC paredown above free_target resulted in the ARC being pared back too early as the VM system over time consumed additional page space and the ARC was freed in favor of that other use.

From my understanding of how the unified VM system works both from documentation and examining the code this is to be expected; if you allow something else in memory to be evicted on free RAM pressure before the VM system wakes up you will eventually find yourself with none of that (or wherever you set your floor to) since the VM system never wakes up to reclaim its space.

As such this is only stable with very large RAM spaces and/or fixed VM loads; if your machine's working set is stable you will probably get away with what your patch does but for mixed workloads (and especially unpredictable ones where wildly-varying requirements for working set exist, or in a system with UFS filesystems as well where traditional buffer caching comes into play) you're going to find that ARC over time gets pinned to the lower limit.  That's "stable" in that it doesn't crash but not desirable as it essentially prevents the ARC from using more than minimum RAM any time there is VM contention.

The other way is worse and is the default "as shipped" without the patch -- the ARC cache does not pare down until the VM system gets critically low on space and starts paging (rather than simply reclaiming space from the cache and inact buckets.)  Once that happens performance goes in the toilet; that's where the "stall" behavior comes from, likely because as the ARC is pared down dramatically and violently you wind up with some pretty ugly interactions between paging activity (which dramatically slows system response in the first place) and a big spike in ARC cache misses that hits you at the same time.  While the 3/4-of-system-memory limit looks ok at first blush, the problem is that for mixed large workloads, especially those that wire pages (e.g. where a DBMS is running), pathological behavior happens early and often.  The common refrain heard has always been "add more memory" and just as often "don't run mixed systems with UFS and ZFS filesystems in the same box", but as the patch I put forward shows that's simply neither necessary nor (unless you truly go nuts with the RAM) sufficient.  There's also a pretty clean argument to be made that once you get into the mid (and certainly upper) 90% range on cache efficiency, having more doesn't do anything for you.

Essentially, as long as we have a unified VM system that does not provide a "color" for an ARC page (to specify, for example, that it is preferred to keep ARC rather than cache or inact pages), and you have an external, non-VM-system thread managing ARC, then to avoid pathology you have to wake up all of the contention-management systems for that VM space at the same time.

Rewriting the VM system to include ARC management directly and incorporate some sort of page coloration, such as is done for cache and inact, would be another approach, and it was one I considered; but it marries the VM system to ZFS in ways that might make maintenance difficult down the road (especially considering that ZFS is an "import" to FreeBSD.)  As such I judged the best option to be ensuring that both the ARC pare-down and the VM cleanup wake up at the same time.

To achieve this ARC has to scale back when the VM system wakes up (that is, invasion of v_free_target occurs), not earlier (and DEFINITELY not later!)

(There is an advantage to the way I wrote the patch in that if a user wants your patch's behavior instead -- that is, 3 * v_free_target -- they can have it with a run-time tunable that can be set dynamically... but from my work and testing I don't think that most people with small to moderately-large RAM configurations are going to like the results of doing so.)
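
For example, a minimal sketch of getting that 3 * v_free_target behavior at runtime with the tunable from my patch (values are in pages):

sysctl -w vfs.zfs.arc_freepages=$(( $(sysctl -n vm.stats.vm.v_free_target) * 3 ))

That is exactly how the 390963 figure above was derived.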
Comment 41 Steven Hartland freebsd_committer freebsd_triage 2014-08-25 21:18:24 UTC
I switched our test machine here to use zfs_arc_free_target = vm_cnt.v_free_target using the sysctl and saw no degradation in performance.

ARC reclaim kicked in a little later as expected, but based on the existing testing that's happened I'm happy to use that as the default if you believe it changes the behaviour for the better for a wider set of workloads.

I don't believe there's a real need for the second sysctl (arc_freepage_percent) as that can be achieved with just the initial one really, and hence is just confusing.

The only other material difference is that your sysctl is in pages but the one I'm proposing is in bytes. The reason I chose that was that it avoids admins having to do the page conversion calculation if they do indeed want to tune the value, as well as it matching the other ZFS tunables, which are specified in bytes too.

Does anyone prefer pages instead of bytes and if so why?
Comment 42 karl 2014-08-25 21:34:36 UTC
Getting rid of the percentage reservation (arc_freepage_percent) is fine; that is effectively an anachronism from earlier rounds of the code.

The reason I prefer pages over bytes is that if you do tune it off the default, the easy way to get back to the default is to grab vm.v_free_target out of sysctl.  In addition, since the code tests against free pages rather than bytes, why not use the same unit?  There is also the other route back to the default: as the patch is written, if you set the level to zero it will pick up the default on its own.  That means of forcing a reset to the default should probably be retained.
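
To illustrate (a sketch using the tunable names from this thread), going back to the default can be done either way:

# re-read the VM system's target and set it back explicitly...
sysctl -w vfs.zfs.arc_freepages=$(sysctl -n vm.stats.vm.v_free_target)
# ...or, as the patch is written, zero it and let the code pick the default back up
sysctl -w vfs.zfs.arc_freepages=0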

Do you want me to re-work the patch to remove the second tunable and attach it or would you like to?
Comment 43 Steven Hartland freebsd_committer freebsd_triage 2014-08-26 01:42:36 UTC
No need to do that; I wouldn't use the 0 = default code anyway -- see how I made that work in my version.
Comment 44 karl 2014-08-26 02:39:57 UTC
If I'm reading your code correctly you don't allow the eviction threshold to be tuned during runtime.

There's no particular reason I can think of to do that; I know why you did it that way given the init on pagedaemon, but that's why I pick the value up if it is zeroed; that allows the ARC code to pick up the proper value after the rest of VM init is complete, and it also allows the sysadmin either to tell the system to go back to the default or to adjust the ARC eviction threshold while the system is running.

Why lose that functionality?
Comment 45 karl 2014-08-26 02:48:41 UTC
Created attachment 146287 [details]
ARC refactor patch less percentage reservation

Checked against 10.0-Stable for build integrity; the only change is the removal of the percentage additional reservation (deprecated and not used by default); otherwise identical in operation to patch in comment #10
Comment 46 Tomoaki AOKI 2014-08-26 11:45:41 UTC
Created attachment 146300 [details]
ARC refactor patch less percentage reservation for after r269846

"ARC refactor patch less percentage reservation" failes at hunk 3 after stable/10 r269846.
Please see below svnweb diff.

https://svnweb.freebsd.org/base/stable/10/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c?r1=269732&r2=269846

Attached is a corrected (minimalistic) patch.  It applied and built OK at stable/10 r270646 (amd64) and is working now on my ThinkPad T420 with 8GB of RAM.
Comment 47 Steven Hartland freebsd_committer freebsd_triage 2014-08-27 12:14:28 UTC
(In reply to karl from comment #44)
> If I'm reading your code correctly you don't allow the eviction threshold to
> be tuned during runtime.
...
> 
> Why lose that functionality?

It's there: sysctl vfs.zfs.arc_free_target
Comment 48 Steven Hartland freebsd_committer freebsd_triage 2014-08-27 12:20:46 UTC
Created attachment 146373 [details]
arc reclaim refactor (against head)

Switched arc_free_target to use kmem_free_target_size() instead of kmem_free_target_size() * 3 as the default.

I intend to commit this to head later today, with a 1 week MFC period so it can be merged to stable/10 before the freeze for 10.1; so if there are any issues please raise ASAP!
Comment 49 karl 2014-08-27 13:01:15 UTC
I'll spin up my test system later today with your iteration on it in place of what's been running and toss a synthetic load at it.
Comment 50 John Baldwin freebsd_committer freebsd_triage 2014-08-27 20:20:12 UTC
(In reply to Steven Hartland from comment #41)
> I switched our test machine here to use zfs_arc_free_target =
> vm_cnt.v_free_target using the sysctl and saw no degradation in performance.
> 
> ARC reclaim kicked in a little later as expected, but based on the existing
> testing that's happened I'm happy to use that as the default if you believe
> it changes the behaviour for the better for a wider set of workloads.
> 
> I don't believe there's a real need for the second sysctl
> (arc_freepage_percent) as that can be achieved with just the initial one
> really, and hence is just confusing.
> 
> The only other material difference is that your sysctl is in pages but the
> one I'm proposing is in bytes. The reason I chose that was that it avoids
> admins having to do the page conversion calculation if they do indeed want
> to tune the value, as well as it matching the other ZFS tunables, which are
> specified in bytes too.
> 
> Does anyone prefer pages instead of bytes and if so why?

All the other VM tunables and stats are in pages (v_*_count, v_*_target, etc.).  Please keep it in pages.
Comment 51 Steven Hartland freebsd_committer freebsd_triage 2014-08-27 23:15:07 UTC
Created attachment 146423 [details]
arc reclaim refactor (against head)

Switched to using pages instead of bytes for vfs.zfs.arc_free_target as requested by jhb
Comment 52 Steven Hartland freebsd_committer freebsd_triage 2014-08-27 23:18:01 UTC
Created attachment 146424 [details]
arc reclaim refactor (against head)

Flag all my old patches as obsolete
Comment 53 Steven Hartland freebsd_committer freebsd_triage 2014-08-28 10:35:28 UTC
(In reply to karl from comment #49)
> I'll spin up my test system later today with your iteration on it in place
> of what's been running and toss a synthetic load at it.

How did you get on, Karl?
Comment 54 karl 2014-08-28 12:54:51 UTC
I have only one concern with behavior and that is that when the system came under RAM pressure ARC eviction occurred properly but when that pressure eased the release of space back to ARC was very slow. I believe that my original patch was similarly slow in recovery as RAM pressure eased but want to test that.

Expect an updated comment later today....
Comment 55 karl 2014-08-28 15:04:04 UTC
I've got one potential issue against the obsoleted set (the refactor against head will not apply against 10-Stable); one of my test scenarios just produced a material (~30 second) "hang" that I can't reproduce with my patch version on the same box.

The scenario is a buildworld/buildkernel/release memstick build; at the end the system allocates a ramdisk to hold the stick image; that allocation sticks a large and sudden RAM load on the VM system and forces ARC eviction.  The test system has 12Gb of RAM in it; the expected behavior is that ARC cache gets dumped from ~10Gb down to about 5-6Gb almost instantly when that command runs during the build.

I'm looking into it; I must be blind because I don't see why the second hunk is failing against arc.c on your patch when attempted against 10-Stable (checked out today) -- but patch says it does.
Comment 56 Tomoaki AOKI 2014-08-28 15:19:36 UTC
Created attachment 146456 [details]
arc reclaim refactor (against stable/10)

Attached is modified "arc reclaim refactor (against head)" patch for stable/10 for my test. Built and booted OK ATM. Please try if you need.

Modified points:
  * The vm_cnt struct in head is cnt in stable/10; renamed accordingly.
  * Some tunables were deleted in head; re-added them.
Comment 57 Tomoaki AOKI 2014-08-28 15:25:34 UTC
(In reply to Tomoaki AOKI from comment #56)
> Created attachment 146456 [details]
> arc reclaim refactor (against stable/10)
> 
> Attached is modified "arc reclaim refactor (against head)" patch for
> stable/10 for my test. Built and booted OK ATM. Please try if you need.
> 
> Modified points:
>   * The vm_cnt struct in head is cnt in stable/10; renamed accordingly.
>   * Some tunables were deleted in head; re-added them.

Forgot to mention.

https://svnweb.freebsd.org/base/head/sys/sys/vmmeter.h?r1=254304&r2=263620

https://svnweb.freebsd.org/base/head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c?r1=267985&r2=267992
Comment 58 karl 2014-08-28 15:28:33 UTC
Got it.... thanks; I had hand-fixed it on this end and am in the middle of rebuilding to re-run the test now.  It will take a while; the memstick test requires an hour or more to run from scratch to completion since I do it from a clean build state for consistency reasons.
Comment 59 Steven Hartland freebsd_committer freebsd_triage 2014-08-28 15:56:08 UTC
(In reply to Tomoaki AOKI from comment #56)
> Created attachment 146456 [details]
> arc reclaim refactor (against stable/10)
> 
> Attached is modified "arc reclaim refactor (against head)" patch for
> stable/10 for my test. Built and booted OK ATM. Please try if you need.

Yep that's correct :)
Comment 60 karl 2014-08-28 16:56:41 UTC
Looks pretty good with the most-recent against 10-STABLE and arc_free_target at the default value.  No misbehavior noted thus far on my test system.

It appears the refactor in terms of intent and operation is identical to the Comment 10 patch I had previously posted, and against my synthetic load tests along with "common actions" (e.g. world builds, etc) it behaves identically as well in my testing.
Comment 61 Steven Hartland freebsd_committer freebsd_triage 2014-08-28 17:10:07 UTC
(In reply to karl from comment #60)
> Looks pretty good with the most-recent against 10-STABLE and arc_free_target
> at the default value.  No misbehavior noted thus far on my test system.
> 
> It appears the refactor in terms of intent and operation is identical to the
> Comment 10 patch I had previously posted, and against my synthetic load
> tests along with "common actions" (e.g. world builds, etc) it behaves
> identically as well in my testing.

Thanks Karl, your initial work on this and your time spent testing are much appreciated!
Comment 62 Steven Hartland freebsd_committer freebsd_triage 2014-08-28 20:07:05 UTC
Addressed by: http://svnweb.freebsd.org/changeset/base/270759
Comment 63 Steven Hartland freebsd_committer freebsd_triage 2014-08-29 19:37:28 UTC
Karl, there's some concern amongst the devs that we're not counting cache pages as free, so could you apply the following to your test machine and retest to see if it has any adverse effects:-

Index: cddl/compat/opensolaris/kern/opensolaris_kmem.c
===================================================================
--- cddl/compat/opensolaris/kern/opensolaris_kmem.c	(revision 270824)
+++ cddl/compat/opensolaris/kern/opensolaris_kmem.c	(working copy)
@@ -152,7 +152,8 @@
 kmem_free_count(void)
 {
 
-	return (vm_cnt.v_free_count);
+	return (vm_cnt.v_free_count + vm_cnt.v_cache_count);
 }
 
 u_int
Comment 64 Steven Hartland freebsd_committer freebsd_triage 2014-08-29 20:08:10 UTC
I'm aware this has already been tried in the past, but if you could retest now, so we can eliminate any stable/10 issues which have been fixed in the meantime, that would be most appreciated.
Comment 65 karl 2014-08-29 20:23:04 UTC
I can give it a shot on my test system but I will note that for my production machine right now that would be a no-op as there are no cache pages in use.  Getting that on the production system with a high level of load over the holiday is rather unlikely due to it being Labor Day, so all I will have are my synthetics.  I do have some time over the long weekend to run that test and should be able to do so if you think it's useful, but I am wary of that approach being correct in the general case due to my previous experience.

My commentary on that discussion point and reasoning originally is here (http://lists.freebsd.org/pipermail/freebsd-fs/2014-March/019084.html); along that thread you'll see a vmstat output with the current paradigm showing that cache doesn't grow without boundary, as many suggested it would if I didn't include those pages as "free" and thus available for ARC to attempt to invade (effectively trying to force the VM system to evict them from the cache list by having it wake up first.)  Indeed, up at comment #31 is the patch on that production system that has been up for about four months and you can see only a few pages are in the cached state.  That is fairly typical.

What is the expected reason for concern about including them in the free count for ARC (when they're not included in what wakes up the VM system as a whole)?  If the expected area of concern is that cache pages will grow without boundary (or close to it), the evidence from both my production machine and others strongly suggests that's incorrect.

My work suggests strongly that the most-likely to be correct behavior for the majority of workloads is achieved when both the ARC paring routine and VM page-cleaning system wake up at the same time.  That occurs when vm_cnt.v_free_count is invaded.  Attempting to bias the outcome to force either the VM system or the ARC cache to do something first appears to increase the circumstances under which bad behavior occurs.
Comment 66 Steven Hartland freebsd_committer freebsd_triage 2014-08-29 20:29:56 UTC
Thanks Karl, I'm not a VM domain expert so have forwarded your feedback to the relevant concerned parties in the hope we can come to an educated conclusion.
Comment 67 John Baldwin freebsd_committer freebsd_triage 2014-08-29 21:14:12 UTC
(In reply to karl from comment #65)
> My work suggests strongly that the most-likely to be correct behavior for
> the majority of workloads is achieved when both the ARC paring routine and
> VM page-cleaning system wake up at the same time.  That occurs when
> vm_cnt.v_free_count is invaded.  Attempting to bias the outcome to force
> either the VM system or the ARC cache to do something first appears to
> increase the circumstances under which bad behavior occurs.

I think this last point is precisely why folks like alc@ and peter@ want to include the cache count.  All the code in pagedaemon uses 'free + cache' when making decisions.  To wit:

/*
 * Return TRUE if we are under our severe low-free-pages threshold
 *
 * This routine is typically used at the user<->system interface to determine
 * whether we need to block in order to avoid a low memory deadlock.
 */

static __inline 
int
vm_page_count_severe(void)
{
    return (vm_cnt.v_free_severe > (vm_cnt.v_free_count +
          vm_cnt.v_cache_count));
}

/*
 * Return TRUE if we are under our minimum low-free-pages threshold.
 *
 * This routine is typically used within the system to determine whether
 * we can execute potentially very expensive code in terms of memory.  It
 * is also used by the pageout daemon to calculate when to sleep, when
 * to wake waiters up, and when (after making a pass) to become more
 * desparate.
 */

static __inline 
int
vm_page_count_min(void)
{
    return (vm_cnt.v_free_min > (vm_cnt.v_free_count + vm_cnt.v_cache_count));
}

/*
 * Return TRUE if we have not reached our free page target during
 * free page recovery operations.
 */

static __inline 
int
vm_page_count_target(void)
{
    return (vm_cnt.v_free_target > (vm_cnt.v_free_count +
          vm_cnt.v_cache_count));
}

/*
 * Return the number of pages we need to free-up or cache
 * A positive number indicates that we do not have enough free pages.
 */

static __inline 
int
vm_paging_target(void)
{
    return (vm_cnt.v_free_target - (vm_cnt.v_free_count +
          vm_cnt.v_cache_count));
}

/*
 * Returns TRUE if the pagedaemon needs to be woken up.
 */

static __inline 
int
vm_paging_needed(void)
{
    return (vm_cnt.v_free_count + vm_cnt.v_cache_count <
        vm_pageout_wakeup_thresh);
}


These are the conditionals that pagedaemon uses to decide how many pages to recycle and that the VM system uses to decide when to awaken pagedaemon to recycle pages.  As you can clearly see, all of them use 'free + cache' to determine the count of free pages.  Thus, if they are going to wake up at the same time as you wish, they should be using the same triggers.
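
For what it's worth, the quantity those conditionals compare can be watched from userland as well (a quick sketch):

# pagedaemon's notion of "free" is free + cache; compare it against the target:
echo $(( $(sysctl -n vm.stats.vm.v_free_count) + $(sysctl -n vm.stats.vm.v_cache_count) ))
sysctl -n vm.stats.vm.v_free_target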
Comment 68 karl 2014-08-29 21:17:20 UTC
I'll toss it on the test machine this weekend; they may well, given the code there, be correct....
Comment 69 karl 2014-08-29 23:05:38 UTC
I remember what it was... with that as the test, cached pages never got pared back and the ARC was evicted in favor of them.  That's arguably wrong, since a page that is in the ARC can be recovered very rapidly if it is reused.

I'll see if I can reproduce the bad behavior I previously saw.. I have the test machine running a soak with that proposed change now.
Comment 70 karl 2014-08-30 14:10:01 UTC
I was unable to provoke the former bad behavior on my test machine, which is running a current pull from 10-STABLE, with the eviction threshold set to free_count + cache_count.

This morning I took the opportunity of the long weekend to backport that to the production system and rebooted it -- will update if I see bad behavior or in a couple of days if not.
Comment 71 commit-hook freebsd_committer freebsd_triage 2014-08-30 21:44:37 UTC
A commit references this bug:

Author: smh
Date: Sat Aug 30 21:44:33 UTC 2014
New revision: 270861
URL: http://svnweb.freebsd.org/changeset/base/270861

Log:
  Ensure that ZFS ARC free memory checks include cached pages

  Also restore kmem_used() check for i386 as it has KVA limits that the raw
  page counts above don't consider

  PR:		187594
  Reviewed by:	peter
  X-MFC-With: r270759
  Review:	D700
  Sponsored by:	Multiplay

Changes:
  head/sys/cddl/compat/opensolaris/kern/opensolaris_kmem.c
  head/sys/cddl/compat/opensolaris/sys/kmem.h
  head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c
Comment 72 karl 2014-09-01 15:43:06 UTC
Update 9/1: The production system with the ARC cleanup trigger re-compiled to be free + cache pages continues to run well.

Perhaps some of the others who have tested the original implementation could put this change in as well and make sure it doesn't do bad things to their machines?
Comment 73 Steven Hartland freebsd_committer freebsd_triage 2014-09-04 18:27:41 UTC
After further discussion we'd like to ask users to test setting vfs.zfs.arc_free_target = (vm.stats.vm.v_free_min / 10) * 11
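
A sketch of setting that at runtime (the value is in pages, using the integer arithmetic exactly as written above):

sysctl vfs.zfs.arc_free_target=$(( ( $(sysctl -n vm.stats.vm.v_free_min) / 10 ) * 11 ))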
Comment 74 Steven Hartland freebsd_committer freebsd_triage 2014-09-04 18:42:10 UTC
Created attachment 146816 [details]
arc reclaim refactor (against stable/10)

Updated patch against stable/10 with all the proposed updates yet to hit the tree at time of writing.
Comment 75 Steven Hartland freebsd_committer freebsd_triage 2014-09-04 18:44:18 UTC
Created attachment 146817 [details]
arc reclam refactor (against head)

Updated patch against head with all the proposed updates yet to hit the tree at time of writing.
Comment 76 karl 2014-09-04 18:54:19 UTC
Steve;

That will result in the VM system going DEEPLY into contention before ARC is freed up; on my production system here it will cut the free page count at which ARC is freed by two-thirds!

IMHO this will result in forced paging behavior and stalls.

I got one stall two nights ago during heavy SMB use (Samba; PCs running backups on an automatic basis) that caused the backup to error out -- and that was with simply including the cache pages in the free list for the comparison (effectively lowering the free memory threshold before the ARC backs off by the cached pages amount.)  As it happened overnight I don't know exactly how many pages were in the cache at the time, but it was enough to result in the bad behavior happening again.  It did not happen last night, but IMHO even that (relatively minor, in my workload) previous tweak is right on the edge of causing trouble.

What is the intended purpose behind driving the VM system that far into RAM contention cleanup before the ARC is pared back?
Comment 77 Steven Hartland freebsd_committer freebsd_triage 2014-09-04 19:27:10 UTC
(In reply to karl from comment #76)
> Steve;
> 
> That will result in the VM system going DEEPLY into contention before ARC is
> freed up; on my production system here it will cut the free page count at
> which ARC is freed by two-thirds!

I know that's what you've experienced in the past but we would really like to confirm if this is still true given the combination of these + pagedaemon fixes already in stable/10 and above.

I've actually now uploaded the latest revision for both head and stable/10 and so would really appreciate it if you could confirm or deny your suspicions with a test run.

> IMHO this will result in forced paging behavior and stalls.

We're really looking to confirm whether that is still the case, because there's some belief that it may no longer be.

> I got one stall two nights ago during heavy SMB use (Samba; PCs running
> backups on an automatic basis) that caused the backup to error out -- and
> that was with simply including the cache pages in the free list for the
> comparison (effectively lowering the free memory threshold before the ARC
> backs off by the cached pages amount.)  As it happened overnight I don't
> know exactly how many pages were in the cache at the time, but it was enough
> to result in the bad behavior happening again.  It did not happen last
> night, but IMHO even that (relatively minor, in my workload) previous tweak
> is right on the edge of causing trouble.
> 
> What is the intended purpose behind driving the VM system that far into RAM
> contention cleanup before the ARC is pared back?

There's a fine line and some complex interactions in play and we're trying to ensure we're not pegging back ARC excessively.

I created the following spreadsheet which depicts the values for various scenarios which might be of interest. As you can see it shows that even the most aggressive models proposed leave significant amounts of free RAM on larger machines.

https://docs.google.com/spreadsheets/d/1mSL2aH72Vca-phBm7DpACZ-UDeXV15jx4goU1N8Xv-Q/edit?usp=sharing

So I know it's a pain, but as one of the people who has an environment which seems to trigger this behaviour, it's really important to confirm or deny suspicions with hard results, as I'm sure you'll appreciate.
Comment 78 karl 2014-09-04 20:13:16 UTC
I understand but I have a problem with doing that on my production system -- where I can most-easily reproduce the problem.

FreeBSD 10.0-STABLE #1 r265164M: Fri Aug 29 23:16:06 CDT 2014     karl@NewFS.denninger.net:/usr/obj/usr/src/sys/KSD-SMP 

Your patch above will not apply against this rev and I am not willing to move the kernel and userland space up without knowing whether I'm going to get blown to bits as that is a production machine -- and there are reports open right now of serious potential issues with things like devd and lockups (and yes, that machine has USB resources on it.)

I can run my synthetics against the test system (which is updated to 10.1-PRERELEASE) but that is not dispositive since I can't produce the same load profile I have on the production machine with it.

When were the pagedaemon fixes committed and where can I find a description of them?  Specifically, how do they relate (time-wise) to r265164 which is what I'm running now?

Why do the developers believe that in a mixed and varying workload including database, SMB and web service that running free RAM down into the 200MB range, strongly implying that a burst of image activation will cause immediate page-out of active working set, is not going to lead to severe performance problems?

Note that this allocation would put the free RAM reserve with 24GB of RAM at approximately where it is now for a machine with ONE THIRD of that memory!  That configuration almost-certainly can't support that workload at all simply due to the requirement for wired pages (postgres buffer + working memory + apache working set, etc) but a 16GB or 24GB configuration can.

Having free RAM run down into the 200MB range on my production system (which is what your spreadsheet implies would be the expected behavior) is, from past experience, going to result in extraordinarily severe performance problems.

IMO ARC must be subservient to other demands on physical memory.  It's a disk *cache*, and I fail to understand how allowing it, under *any* circumstance, to cause any part of the working set to be paged out is a correct decision.  Since pagedaemon does not manage ARC, it cannot evict it on its own prior to initiating paging; it can only "see" how much RAM is available and in what bucket it lies.  Since ARC wires pages, those are at the top of the priority queue to be left alone, and yet in real terms that's wrong -- but it's unavoidable unless ARC management becomes part of pagedaemon with its own coloration (and that is a massive job with the potential for big trouble, as it would effectively divorce huge parts of the imported code base from FreeBSD.)

I'm not grokking the complaint here -- the difference in free memory on a 1024GB machine is ~14GB left free, which is a 1.4% difference in overhead.  Someone running a 1Tb RAM machine has a damn good reason to stuff that much RAM in the box and yet we're still talking about an overhead of 1.4% -- and that assumes there's nothing very large that will get activated and demand that RAM.
Comment 79 Steven Hartland freebsd_committer freebsd_triage 2014-09-04 21:04:34 UTC
(In reply to karl from comment #78)
> I understand but I have a problem with doing that on my production system --
> where I can most-easily reproduce the problem.
> 
> FreeBSD 10.0-STABLE #1 r265164M: Fri Aug 29 23:16:06 CDT 2014    
> karl@NewFS.denninger.net:/usr/obj/usr/src/sys/KSD-SMP 
> 
> Your patch above will not apply against this rev and I am not willing to
> move the kernel and userland space up without knowing whether I'm going to
> get blown to bits as that is a production machine -- and there are reports
> open right now of serious potential issues with things like devd and lockups
> (and yes, that machine has USB resources on it.)

What issues do you see? It might be a simple fix -- I'm guessing the first few hunks
of arc.c, due to the recent sysctl changes.

> I can run my synthetics against the test system (which is updated to
> 10.1-PRERELEASE) but that is not dispositive since I can't produce the same
> load profile I have on the production machine with it.
> 
> When were the pagedaemon fixes committed and where can I find a description
> of them?  Specifically, how do they relate (time-wise) to r265164 which is
> what I'm running now?

The ones I'm aware of, which may not be all are:-
http://svnweb.freebsd.org/base?view=revision&revision=265944
http://svnweb.freebsd.org/base?view=revision&revision=265945

These are after your core revision, so unfortunately I think that invalidates the
testing done so far :(

> Why do the developers believe that in a mixed and varying workload including
> database, SMB and web service that running free RAM down into the 200MB
> range, strongly implying that a burst of image activation will cause
> immediate page-out of active working set, is not going to lead to severe
> performance problems?

It's my understanding that having both ARC and pagedaemon trigger at the same
point means a combined effort to clear resources, resulting in a good
balance.
Comment 80 karl 2014-09-04 21:33:22 UTC
I do not believe either of those pagedaemon commits will have a material impact on the system in this regard.

At its core the issue is that ARC is a cache, and yet it is implemented as *wired* pages, which are sacrosanct.  There are two separate threads of processing that take care of this -- one in the ZFS code to handle the ARC pare-down, and the other in pagedaemon, which handles the various buckets of physical memory -- and pagedaemon *can't* hit wired pages (if it could it would be disastrous, as you'd be doing I/O to the swap twice instead of taking a cache miss, which is one I/O!), so this is really only about how prioritization lives between those two threads of processing.

The VM system is quiet when free RAM is above "free_target"; when that is invaded VM wakes up and starts looking for things it can either demote or free up.  That is the point where contention begins, and arguably where disk cache should stop growing.  If contention continues then ARC must lose this contention and the reason is simple -- a disk cache invalidation requires either zero (if the cached item is not requested again) or *one* I/O (if it is) but a page-out requires *at least* one and perhaps *two* -- one to put it on the swap file that cannot be avoided and a second one down the road to recover it if and when necessary.

If the argument at its core is that for some configurations the VM system shouldn't allow free RAM to get so large then v_free_target is too large.

A burst of memory contention will cause aggressive pare-down of the ARC, but this is IMHO correct behavior as you never want whatever is in ARC to result in I/O on the swap -- that is always a net lose.

The patch error is on Hunks #2 and 3, yes -- but from your comment it appears that hand-fixing that doesn't get the people who are objecting what they want.

My issue with updating the production system is that the environment there is stable now, and given some reports of problems with -STABLE at present I'm loath to potentially put an unstable environment on that box.  I am using a ZFS root, so I can certainly clone it and, if necessary, revert, but the disruption from a crash on a production machine, no matter how quickly I can roll it back, remains substantial.

I can put this proposed change on my test box but, again, from looking at the comments on those commits to pagedaemon I don't believe they're going to make any difference at all in terms of contention management.  The issue that led me to look at this in the first place is that holding ARC until the pagedaemon has started to consider swapping is pretty easy to argue is "always wrong"; getting close enough to the line that an image activation or RAM request from an application shoves you over it results in paging activity to the swap that should instead be an eviction of space held by the ARC.  As for the argument that this will result in some oscillation of the ARC space, that's correct, but it's exactly what *should* happen; when memory pressure is low it's perfectly fine to use all the free space for disk caching, but when it's not, no matter how transiently that may be the case, you *never, ever* want to page in an attempt to hold ARC in RAM, as you're now trading one prospective physical I/O for two guaranteed ones!

It appears that the latest change puts that back in play, and that's exactly what the original patch was intended to, and did, stop.

I'll load the test machine this evening later on or tomorrow morning; I have commitments part of this weekend but given how far down into the hole that proposal shoves the eviction decision I will be surprised if I can't get it to misbehave quite easily -- it may be as simple as "cd /usr/src/release;make memstick" on a modest (e.g. 8-12Gb RAM) machine as that procedure allocates a nice fat memdisk during the process.
Comment 81 karl 2014-09-05 03:18:35 UTC
The "against stable/10" patch applies against a current checkout (sys directory claims to be 271151, checked out within the last hour) but does not build.
Comment 82 karl 2014-09-05 03:25:37 UTC
(In reply to karl from comment #81)
> The "against stable/10" patch applies against a current checkout (sys
> directory claims to be 271151, checked out within the last hour) but does
> not build.

Never mind -- looks like something didn't apply (odd); beginning testing now.
Comment 83 karl 2014-09-05 03:40:20 UTC
Uh, this is *definitely* wrong (against STABLE-10)

$ sysctl -a|grep arc_free_target
vfs.zfs.arc_free_target: 0

It appears you're grabbing that value before pageout_wakeup_thresh has initialized.

I'll set it to vm.pageout_wakeup_thresh, warm up the cache and see what I get.
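
(i.e. something along these lines, as a runtime stopgap:)

sysctl -w vfs.zfs.arc_free_target=$(sysctl -n vm.pageout_wakeup_thresh)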
Comment 84 karl 2014-09-05 04:05:18 UTC
The latest patch against STABLE blows up "make buildworld" here:

--- cddl/lib__L ---
/usr/src/cddl/lib/libzpool/../../../sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c:247:8: warning: implicit declaration of function 'sysctl_handle_int' is invalid in C99 [-Wimplicit-function-declaration]
        err = sysctl_handle_int(oidp, &val, 0, req);
              ^
/usr/src/cddl/lib/libzpool/../../../sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c:251:12: error: use of undeclared identifier 'minfree'
        if (val < minfree)
                  ^
/usr/src/cddl/lib/libzpool/../../../sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c:253:12: error: use of undeclared identifier 'cnt'
        if (val > cnt.v_page_count)
                  ^
--- lib__L ---
--- depend_subdir_libllvmx86codegen ---
--- X86GenInstrInfo.inc.h ---
tblgen -gen-instr-info  -I /usr/src/lib/clang/libllvmx86codegen/../../../contrib/llvm/include -I /usr/src/lib/clang/libllvmx86codegen/../../../contrib/llvm/lib/Target/X86  -d X86GenInstrInfo.inc.d -o X86GenInstrInfo.inc.h  /usr/src/lib/clang/libllvmx86codegen/../../../contrib/llvm/lib/Target/X86/X86.td
--- cddl/lib__L ---
1 warning and 2 errors generated.
*** [arc.So] Error code 1

root@dbms2:/usr/src # svn update .
At revision 271152.

Current as of this time (9/4 23:03)
Comment 85 Steven Hartland freebsd_committer freebsd_triage 2014-09-05 08:26:16 UTC
Investigating, I'm assuming the patch applied cleanly in the end?
Comment 86 Steven Hartland freebsd_committer freebsd_triage 2014-09-05 09:15:24 UTC
Created attachment 146851 [details]
arc reclaim refactor (against stable/10)

Updated patch against stable/10 to fix world build
Comment 87 Steven Hartland freebsd_committer freebsd_triage 2014-09-05 09:15:57 UTC
Created attachment 146852 [details]
arc reclam refactor (against head)

Updated patch against head to fix world build
Comment 88 Steven Hartland freebsd_committer freebsd_triage 2014-09-05 09:30:48 UTC
Created attachment 146854 [details]
arc reclaim refactor (against stable/10)

Add missing changes to pagedaemon which meant arc_free_target was initialised to 0 by stable/10 patch
Comment 89 Steven Hartland freebsd_committer freebsd_triage 2014-09-05 11:45:29 UTC
Created attachment 146859 [details]
arc reclaim refactor (against stable/10)

Added the missing machine/vmparam.h include, the lack of which was causing all platforms to take the small-memory code paths
Comment 90 Steven Hartland freebsd_committer freebsd_triage 2014-09-05 11:46:32 UTC
Created attachment 146861 [details]
arc reclam refactor (against head)
Comment 91 Steven Hartland freebsd_committer freebsd_triage 2014-09-05 11:53:58 UTC
Sorry Karl, there was a fairly major issue with the version you have in test. I've updated the patches to fix that.
Comment 92 karl 2014-09-05 12:47:46 UTC
BTW on the production system, which is running with the free + cache level set, I have now gotten a *second* stall during overnight backups from the PCs attached to that server (this generates very heavy SMB traffic over gigabit links.)  This is one of the paradigms that has historically generated problems -- and it's back with that single change.  The other changes are NOT in.  So for that system allowing invasion of v_free_target is empirically wrong.

I will see what I can get done today with the test machine and the new patch.
Comment 93 karl 2014-09-05 12:58:57 UTC
(In reply to karl from comment #92)
> BTW on the production system, which is running with the free + cache level
> set, I have now gotten a *second* stall during overnight backups from the
> PCs attached to that server (this generates very heavy SMB traffic over
> gigabit links.)  This is one of the paradigms that has historically
> generated problems -- and it's back with that single change.  The other
> changes are NOT in.  So for that system allowing invasion of v_free_target
> is empirically wrong.
> 
> I will see what I can get done today with the test machine and the new patch.

Incidentally I just reproduced the I/O freeze on the production machine.  That patch to allow invading v_free_target is being backed out this morning due to necessity -- whether I can get this newer version spun up and can take the risk of bad behavior there is an open question *but the question as to whether allowing cache pages to be included has been answered on my production machine with the rev it is running, and the answer is NO!*
Comment 94 Steven Hartland freebsd_committer freebsd_triage 2014-09-05 13:17:25 UTC
(In reply to karl from comment #93)
> (In reply to karl from comment #92)
> > BTW on the production system, which is running with the free + cache level
> > set, I have now gotten a *second* stall during overnight backups from the
> > PCs attached to that server (this generates very heavy SMB traffic over
> > gigabit links.)  This is one of the paradigms that has historically
> > generated problems -- and it's back with that single change.  The other
> > changes are NOT in.  So for that system allowing invasion of v_free_target
> > is empirically wrong.
> > 
> > I will see what I can get done today with the test machine and the new patch.
> 
> Incidentally I just reproduced the I/O freeze on the production machine. 
> That patch to allow invading v_free_target is being backed out this morning
> due to necessity -- whether I can get this newer version spun up and can
> take the risk of bad behavior there is an open question *but the question as
> to whether allowing cache pages to be included has been answered on my
> production machine with the rev it is running, and the answer is NO!*

Interesting. What are you seeing for:
sysctl vm.stats.vm
Comment 95 karl 2014-09-05 13:20:28 UTC
[karl@NewFS ~]$ sysctl vm.stats.vm
vm.stats.vm.v_vm_faults: 123526970
vm.stats.vm.v_io_faults: 4387
vm.stats.vm.v_cow_faults: 46190876
vm.stats.vm.v_cow_optim: 45120
vm.stats.vm.v_zfod: 61801534
vm.stats.vm.v_ozfod: 7933
vm.stats.vm.v_swapin: 0
vm.stats.vm.v_swapout: 9
vm.stats.vm.v_swappgsin: 0
vm.stats.vm.v_swappgsout: 77
vm.stats.vm.v_vnodein: 4718
vm.stats.vm.v_vnodeout: 19311
vm.stats.vm.v_vnodepgsin: 29858
vm.stats.vm.v_vnodepgsout: 27672
vm.stats.vm.v_intrans: 1843
vm.stats.vm.v_reactivated: 4503
vm.stats.vm.v_pdwakeups: 3
vm.stats.vm.v_pdpages: 89932869
vm.stats.vm.v_tcached: 17313
vm.stats.vm.v_dfree: 0
vm.stats.vm.v_pfree: 71901800
vm.stats.vm.v_tfree: 219742493
vm.stats.vm.v_page_size: 4096
vm.stats.vm.v_page_count: 6115784
vm.stats.vm.v_free_reserved: 8013
vm.stats.vm.v_free_target: 130321
vm.stats.vm.v_free_min: 38590
vm.stats.vm.v_free_count: 1178178
vm.stats.vm.v_wire_count: 3905673
vm.stats.vm.v_active_count: 676200
vm.stats.vm.v_inactive_target: 195481
vm.stats.vm.v_inactive_count: 345569
vm.stats.vm.v_cache_count: 9829
vm.stats.vm.v_cache_min: 0
vm.stats.vm.v_cache_max: 0
vm.stats.vm.v_pageout_free_min: 34
vm.stats.vm.v_interrupt_free_min: 2
vm.stats.vm.v_forks: 209120
vm.stats.vm.v_vforks: 15356
vm.stats.vm.v_rforks: 0
vm.stats.vm.v_kthreads: 92
vm.stats.vm.v_forkpages: 9021798
vm.stats.vm.v_vforkpages: 19787317
vm.stats.vm.v_rforkpages: 0
vm.stats.vm.v_kthreadpages: 0

Note: This is on the machine that has the kernel on rev r265164
Comment 96 Steven Hartland freebsd_committer freebsd_triage 2014-09-05 14:05:44 UTC
(In reply to karl from comment #95)
> Note: This is on the machine that has the kernel on rev r265164

And that's the machine that saw the stall, yes?

Can you also confirm the value for vfs.zfs.arc_free_target?
Comment 97 karl 2014-09-05 14:23:47 UTC
arc_free_target is set to vm.v_free_target.

The ONLY change made to my original code was the one line you requested that added cache pages to the free page count; that effectively causes the code to invade vm.v_free_target by the amount of the cached page count before ARC is pared back.

With that in I have a ~50% fail rate running SMB backups to that machine over a Gig-E network due to I/O stalls.  The system otherwise appears to be ok in production use but that's a murderous problem in that the I/O stall causes *ALL* I/O on that pool to stall until it clears and thus everything that has open files on that pool and attempts I/O goes into a "D" blocked state.

With that one line change OUT I have a ZERO fail rate over SIX MONTHS of doing the exact same thing.

I am in the process of recompiling a current -STABLE kernel and world on the test machine, but whether I can reproduce the stalls on that system with this sort of reliability is unknown.  I will try, but the lack of success does not mean it won't happen in production, just that I can't successfully reproduce it on that test environment.
Comment 98 John Baldwin freebsd_committer freebsd_triage 2014-09-05 15:10:44 UTC
(In reply to karl from comment #80)
> The VM system is quiet when free RAM is above "free_target"; when that is
> invaded VM wakes up and starts looking for things it can either demote or
> free up.  That is the point where contention begins, and arguably where disk
> cache should stop growing.  If contention continues then ARC must lose this
> contention and the reason is simple -- a disk cache invalidation requires
> either zero (if the cached item is not requested again) or *one* I/O (if it
> is) but a page-out requires *at least* one and perhaps *two* -- one to put
> it on the swap file that cannot be avoided and a second one down the road to
> recover it if and when necessary.

This is incorrect.  The pagedaemon does not solely deal with anonymous-backed pages that are written to swap.  For non-ZFS, the disk cache lives in the VM cache (there is no separate disk cache), and so cached pages for read(2) and write(2) to non-ZFS go into the "inactive" page pool.  It is very common on machines using UFS that pagedaemon will process many pages either discarding clean file-backed pages (no I/Os just as you claimed for ARC) by moving them from inactive to cache, or cleaning them before moving them to cache (one I/O as you noted above).  In my experience, the overwhelming number of operations performed by pagedaemon against inactive pages are for file-backed pages and behave as you have described for ARC.

When ZFS is used, any ZFS files that are mmap'd are double-copied in order to make read/write coherent with mmap (ZFS read/write operations look for the VM copy of the page before going to ARC).  So, with mmap'd files (which is probably common in workloads like databases), ZFS files will have duplicate copies of pages in the "active / inactive / cache" pools that are separate from the ARC copies.  I believe that if pagedaemon decides to toss one of these copies, it will get written back into the ARC if it is dirty (so copied into the ARC page, not necessarily written to disk).

All that to say that the pagedaemon changes could be related to issues you are seeing, and that you should test those.  Certainly with UFS most of the pages pagedaemon deals with are file-backed pages (so "disk cache" in your words), not pages that get written to swap.  Even in a ZFS-only system that uses mmap() a non-trivial portion of the pages pagedaemon scans may in fact be file-backed ("disk cache") rather than swap-backed.
Comment 99 karl 2014-09-05 17:06:59 UTC
It's all fine and well to argue how it's supposed to work from reading the code but IMHO that's trumped by how it *does* work.

The fact is that on the rev I posted up above I can reproduce I/O hangs by running heavy SMB traffic to that box when cache pages are included in the "free" count.  With that change (which Steve asked for) reverted, the box rebooted and the ARC cache warmed up, but no other changes of any sort the same SMB job completes without errors and the "pauses" are once again gone as they have been for the last six months dating back to the early days of this PR.  

NTFS clients are utterly intolerant of long I/O delays and will declare such an I/O failed; the program in question doing the I/O in this case off the Windows machine is Acronis, a rather popular backup tool, and when it gets hit by a "pause" the result is that the backup job errors out.

As such perhaps those looking at this issue might contemplate using Samba with a Windows machine mounting the ZFS volume over a high-bandwidth (e.g. Gigabit or better) network connection as a test case.  It sure works out well for me in that regard. A directory copy via drag-n-drop of some very large files and directories during a "cd /usr/src/release;make clean;make memstick" on a moderately-sized RAM machine may be sufficient to cause trouble.  I can do so pretty much at-will on my production machine as noted above since that system is pretty busy, provided I allow vm.v_free_target to be invaded by the ARC.

As soon as I don't allow that to happen the stalls disappear.

The base behavior during these "pauses" during writes (if you have multiple pools and the one root is on is not affected -- otherwise you're screwed, as you can't activate any image until it clears) is that one or more of the I/O channels associated with a pool has fairly heavy I/O activity on it but the others (which should also be busy, given the pool's organization) do not.  Since a write cannot return until all the pool devices that must be written to have their I/O completed, the observed behavior (I/O "hangs" without any errors being reported) matches what you see with iostat.  The triggering event appears to be that paging (even in very small volume!) is initiated due to an attempt to hold ARC cache when the system is under material VM pressure, defined as allowing the ARC to remain sacrosanct beyond the invasion of vm.v_free_target.  It does not take much paging during these heavy write I/O conditions to cause it, either; I have seen it with only a few hundred blocks on the swap, and the swap is NOT going to ZFS but rather to a dedicated partition on the boot drive -- so the issue is quite clearly not contention within ZFS between block I/O and swap activity.

In short this looks like a debate between engineering and science. It's certainly a good idea to go find out *why* what you think should happen doesn't and fix whatever that is (science), but in the meantime until you do the person crossing the bridge should be able to get to the other side without going swimming (engineering.)

People have been going swimming in this regard since at least FreeBSD-8 with ZFS filesystems and the usual refrain when it happens to someone is that their hardware is broken or they need more RAM.  Neither of those assertions has proved up and a lot of unnecessary RAM has likely been bought and disk drives condemned without cause chasing one's tail in this regard, never mind the indictments leveled against various and sundry high-performance disk adapters that appear to have been unfounded as well.  The difference in system performance between 99.2% cache hit rates and 99.7% with a doubling of system memory is close to nil but the dollar impact is not especially when a one-line code change could have avoided the expense.

Further the fix is not complicated and while there may be a side effect of leaving somewhat more RAM free than would otherwise potentially be the case it's a small side effect even at monstrous RAM configurations (~1.3% of the installed memory.) The alleged "bad side effects" that were claimed to be expected with the original change (specifically, inact pages growing without boundary slamming ARC to minimums) *do not happen* under any workload I or the others testing the original patch have experienced.

In any event, the sysctl tunable allows a more aggressive posture to be set should you NOT experience the stalls when vm.v_free_target is invaded.

Indeed one line in /etc/rc.local will do that, should someone so choose.  The question at the bar is where the *default* should be.
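
Something like this, for instance (a sketch, using the more-aggressive value from comment #73 as the example):

# /etc/rc.local: opt in to a lower ARC eviction threshold at boot
sysctl vfs.zfs.arc_free_target=$(( ( $(sysctl -n vm.stats.vm.v_free_min) / 10 ) * 11 ))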

If there is a claimed bogon somewhere else in the VM or I/O code that was and/or is the root cause of this behavior (which I certainly buy given the observed behavior) then being more aggressive with ARC is fine at the time that has been validated to be cleared.  I'd argue that should mean the community is able to find and peruse an explanation of what the bogon was, when it got into the code (and hopefully why it was thought correct at the time) and the interaction that led to the bad result is explained.

Until that verification is complete, however, I argue the correct path of action is to NOT allow the ARC to participate in invading vm.v_free_target. To withhold that actual and functional fix to the problem for the last six months is IMO awfully difficult to defend from an *engineering* perspective.

The empirical evidence supports disallowing ARC to result in invasion of vm.v_free at this time and running configured this way has kept the bridge-crossers using it, both myself and others, dry for the last six months.

Let's stop dunking people when we know how not to in the present tense.
Comment 100 Andriy Gapon freebsd_committer freebsd_triage 2014-09-05 17:16:55 UTC
Karl, please, just test those pagedaemon fixes.
Comment 101 karl 2014-09-05 18:42:28 UTC
I've already said I will test it, and am running that on my test systems now.  If I can get it to misbehave I will note it here -- but I will also note that the original "include cache" request *did not* lead to bad results on the test machine but *did* on my production system.  It took an overnight for me to find the actions (heavy SMB-based I/O) to cause it, but it occurred.

Given this I'm not loading that on my production machine when it's convenient for you since my expectation is that the test will fail.  I will load and test it when it's convenient for me and I can quickly revert should I need to on that user-facing machine.

However, that leads to the obvious question given the deadline situation that exists for 10.1, and which I believe should be addressed head-on.

Simply put why *not* commit the changes with the eviction threshold set to v_free_target?  What is the downside of doing so if it turns out that the system can be set more aggressively?  A one-line change can certainly be committed later if it turns out to be ok to do so after *rigorous* field testing by far more people than simply myself.  Further, that testing can be done *live* without having to recompile anything since the threshold is exposed as a tunable.

The claimed expected result from back in March of using v_free_target as the ARC eviction threshold (growth of inactive pages without boundary and a slammed ARC to the minimum) hasn't happened on anyone's machine that has run it.  It has happened, however, on some machines and some workloads if the eviction threshold is set *above* v_free_target.  I have six months of continuous uptime on the particular production machine with the eviction threshold set at v_free_target and NO bad behavior has occurred.  There are also others who are running the same code, set the same way, and none of them have reported unbridled inactive growth or a slammed ARC cache.  Finally, it doesn't make sense that it would happen given both pagedaemon and the ARC cache cleanup waking up at the same time (although it *might* lead to slower cleanup by the pagedaemon) and it *has* stopped the stalls.  The spreadsheet presented shows that the expected "cost" of that decision is 1.3% of RAM left unallocated on a *terabyte* RAM configuration.

The original code in the tree is flat-out broken as the kmem test is wrong.  The default "do nothing" decision on not committing the change with the eviction threshold set to v_free_target is to not have ARC evict on other-than-small RAM machines until the VM system goes into desperation paging mode.  That's clearly the wrong thing to do and yet that's the "do nothing" result.

What my original proposed heuristic does is allow ARC to consume any RAM available so long as v_free_target is not invaded.  When it is invaded the ARC is pared back until it's not.  This particular signpost is the same place that pagedaemon wakes up.  If v_free_target leaves too much RAM unallocated for certain configurations (e.g. as the spreadsheet claims) the correct resolution to that is to change how that value is computed.

What your proposed change does is attempt to use the ARC cache as a means of pressuring the VM system to become more aggressive in demoting pages from their respective queues.  I get it but what I suspect is going to happen is that when that threshold is approached if an image is activated or an existing process requests a chunk of memory you will be temporarily driven below v_free_min, which is effectively a desperation threshold.  I can say with complete certainty that my production system runs like *crap* when its free RAM is down to 150MB, and that which is where v_free_min multiplies out to (38590 on that box.)

Beyond my belief that it's a design mistake to use the ARC in this manner, the empirical evidence to date says it's a bad idea too; not doing it, which was asserted to be expected to lead to bad outcomes, has instead proved up over time as having no material downside, and it *is* effective in eliminating the stalls.
Comment 102 karl 2014-09-05 20:50:08 UTC
I can reproduce I/O stalls on the *production* machine within 5 minutes using the default parameters as proposed by simply running an Acronis backup once the cache has warmed up.

This is RELIABLY repeatable; I went two-for-two and that was enough for me; the machine has been reverted back to my previous load as it is utterly unusable with this code on it.

There is something else nasty going on with the way the ARC code is working with the current patch that goes beyond this as free memory is dramatically higher when the freezes happen (~6GB or so out of 24Gb, which is some 10x what it is with my version of the patch on the previous code rev!)

Note that even with arc_free_target manually tuned during runtime to v_free_target it STILL locks.  However, you are defining freemem as free_count + cache_count -- so that's not all that surprising since invasion is still taking place.  What bothers me even more though is that when I set it manually materially higher it still locked up and the wired page count did not move -- I expected it to.

Bluntly this proposed patch *dramatically* screws the pooch; the pattern of I/O and behavior is EXACTLY the same as described in comment #99 and it was utterly trivial to get the lockups to occur.
Comment 103 Steven Hartland freebsd_committer freebsd_triage 2014-09-06 01:02:32 UTC
(In reply to karl from comment #102)
> I can reproduce I/O stalls on the *production* machine within 5 minutes
> using the default parameters as proposed by simply running an Acronis backup
> once the cache has warmed up.
> 
> This is RELIABLY repeatable; I went two-for-two and that was enough for me;
> the machine has been reverted back to my previous load as it is utterly
> unusable with this code on it.
> 
> There is something else nasty going on with the way the ARC code is working
> with the current patch that goes beyond this as free memory is dramatically
> higher when the freezes happen (~6GB or so out of 24Gb, which is some 10x
> what it is with my version of the patch on the previous code rev!)
> 
> Note that even with arc_free_target manually tuned during runtime to
> v_free_target it STILL locks.  However, you are defining freemem as
> free_count + cache_count -- so that's not all that surprising since invasion
> is still taking place.  What bothers me even more though is that when I set
> it manually materially higher it still locked up and the wired page count
> did not move -- I expected it to.

Just to be clear, you're saying you saw the stalls when we're testing
free_count + cache_count < v_free_target, but at the same time no paging
happened at all?

If so this is a very interesting fact as IMO it points the finger not at
the VM / pagedaemon but instead at the ARC reclaim process itself.
Comment 104 karl 2014-09-06 05:33:02 UTC
Yes, and maybe.

I'd agree except that the code changes that were put in didn't just impact arc_reclaim_needed; they also impacted arc_memory_throttle, as "freemem" is now used there too -- and if I'm reading the intent of the code correctly it's wrong, although it should be wrong in a way that doesn't screw the system -- maybe. But in any event, shouldn't that test be the same as in arc_reclaim_needed?

In addition there is a VM system change involved as well.

So which of the three is responsible for the bad behavior?

And just to make sure nobody mistakes the issue here -- it is bad enough that the machine becomes *completely unusable* as soon as any sort of heavy SMB-based write I/O (e.g. Samba) hits it, locking up the pool being written to for 30+ seconds at a time repeatedly and hanging any process that happens to attempt to I/O to it.
Comment 105 Steven Hartland freebsd_committer freebsd_triage 2014-09-06 16:35:59 UTC
(In reply to karl from comment #104)
> Yes, and maybe.
> 
> I'd agree except that the code changes that were put in didn't just impact
> arc_reclaim_needed, it also impacted arc_memory_throttle as "freemem" is
> used there now too -- and if I'm reading the intent of the code correctly
> it's wrong, although it should be wrong in a way that doesn't screw the
> system -- maybe. But in any event shouldn't that test be the same as in
> arc_reclaim_needed?

The changes to arc_memory_throttle to use freemem instead of
vm_cnt.v_free_count + vm_cnt.v_cache_count simply eliminate differences
and have no functional effect, since freemem = vm_cnt.v_free_count +
vm_cnt.v_cache_count.

> In addition there is a VM system change involved as well.

If you're referring to the split of the pagedaemon init, that's just
to allow for correct startup ordering, so it will have no effect on this
behaviour.

> So which of the three is responsible for the bad behaviour?

Assuming I understood you correctly there's actually only one change in
play and that's the point at which ARC triggers reclaim.

> And just to make sure nobody mistakes the issue here -- it is bad
> enough that the machine becomes *completely unusable* as soon as
> any sort of heavy SMB-based write I/O (e.g. Samba) hits it, locking
> up the pool being written to for 30+ seconds at a time repeatedly
> and hanging any process that happens to attempt to I/O to it.

In order to narrow down what is causing this I've attached an
updated stable/10 patch and some dtrace scripts which will:
1. Allow you to exclude v_cache_count from the arc_reclaim_needed test using:
   sysctl vfs.zfs.arc_reclaim_cache_free=0
2. Use dtrace to investigate ARC evicts.
Comment 106 Steven Hartland freebsd_committer freebsd_triage 2014-09-06 16:38:13 UTC
Created attachment 146946 [details]
arc reclaim refactor (against stable/10)

Add some test DTRACE probes and sysctl
Comment 107 Steven Hartland freebsd_committer freebsd_triage 2014-09-06 16:38:59 UTC
Created attachment 146947 [details]
ARC evict dtrace script
Comment 108 Steven Hartland freebsd_committer freebsd_triage 2014-09-06 16:39:27 UTC
Created attachment 146948 [details]
ARC reap dtrace script
Comment 109 Steven Hartland freebsd_committer freebsd_triage 2014-09-06 16:39:53 UTC
Created attachment 146949 [details]
ARC reclaim dtrace script
Comment 110 Steven Hartland freebsd_committer freebsd_triage 2014-09-06 16:53:05 UTC
With the above changes / scripts I'd like to try and identify the nature of the behaviour that's causing the stalls.

Details on the scripts:
* arc-evict.d - will show you what and how much is being evicted from ARC.
* arc-reap.d - will show you how long arc_reap_now calls are taking and when arc_shrink is called.
* arc-reclaim.d - will show you what is causing arc_reclaim_needed to return true.

It's important to test these with a fully up-to-date stable/10; I'm currently testing with r271116.

If you get any strange issues with dtrace then you may need to do a second build of kernel and world once an initial version of world has been installed.
Comment 111 karl 2014-09-06 17:13:31 UTC
(In reply to Steven Hartland from comment #105)
> (In reply to karl from comment #104)
> > Yes, and maybe.
> > 
> > I'd agree except that the code changes that were put in didn't just impact
> > arc_reclaim_needed, it also impacted arc_memory_throttle as "freemem" is
> > used there now too -- and if I'm reading the intent of the code correctly
> > it's wrong, although it should be wrong in a way that doesn't screw the
> > system -- maybe. But in any event shouldn't that test be the same as in
> > arc_reclaim_needed?
> 
> The changes to arc_memory_throttle to use freemem instead of
> vm_cnt.v_free_count + vm_cnt.v_cache_count are difference eliminations
> and have no functional effect as freemem = vm_cnt.v_free_count +
> vm_cnt.v_cache_count.

Ah, but the test is not the same; that is, the throttle can be invoked even though eviction is not taking place, and yet the comments in the code strongly imply that throttling should not happen UNLESS eviction is taking place.

Should not the test be the same?

> > In addition there is a VM system change involved as well.
> 
> If your referring to the split of pagedaemon init then that's just
> to allow for correct startup ordering, so will have no effect on this
> behaviour.

No, I'm referring to the alleged cleanups that led to this request for slamming the ARC's eviction threshold to where paging occurs instead of where the system declares that it is under (some) memory pressure.

> 
> > So which of the three is responsible for the bad behaviour?
> 
> Assuming I understood you correctly there's actually only one change in
> play and that's the point at which ARC triggers reclaim.

No.  There are three (known) changes relative to the code running on the production machine which does not exhibit the problem.

> In order to narrow down what is causing this I've attached an
> updated stable/10 patch and some dtrace scripts which will:
> 1. Allow you to exclude v_cache_count from the arc_reclaim_needed test using:
>    sysctl vfs.zfs.arc_reclaim_cache_free=0
> 2. Use dtrace to investigate ARC evicts.

My current path is to attempt to provoke the bad behavior on my test system.  I'm *not* pleased with the outcome of what I was asked to test yesterday given the assertions that it was "more correct" than my original, working and fully-functional patch.

Please understand -- I have working code now, and will continue to have working code.  These attempts to "improve" it which were requested in fact led to that system taking a significant and outrageously-invasive (from the user perspective) hit *in production.*

That's not going to happen again and I am certainly not going to provoke said bad behavior on said production, internet-facing machine so I can diddle with dtrace *on a live system serving users* while said users all go swimming instead of crossing the bridge.

I will continue efforts to reproduce the behavior on my test machine as time permits.
Comment 112 Steven Hartland freebsd_committer freebsd_triage 2014-09-06 19:47:52 UTC
(In reply to karl from comment #111)
> (In reply to Steven Hartland from comment #105)
> > (In reply to karl from comment #104)
> > > Yes, and maybe.
> > > 
> > > I'd agree except that the code changes that were put in didn't just impact
> > > arc_reclaim_needed, it also impacted arc_memory_throttle as "freemem" is
> > > used there now too -- and if I'm reading the intent of the code correctly
> > > it's wrong, although it should be wrong in a way that doesn't screw the
> > > system -- maybe. But in any event shouldn't that test be the same as in
> > > arc_reclaim_needed?
> > 
> > The changes to arc_memory_throttle to use freemem instead of
> > vm_cnt.v_free_count + vm_cnt.v_cache_count are difference eliminations
> > and have no functional effect as freemem = vm_cnt.v_free_count +
> > vm_cnt.v_cache_count.
> 
> Ah, but the test is not the same; that is, the throttle can be invoked even
> though eviction is not taking place, and yet the comments in the code
> strongly imply that throttling should not happen UNLESS eviction is taking
> place.

The value is identical to what it was, so I'm afraid I don't follow what
you're trying to say.

> Should not the test be the same?
> 
> > > In addition there is a VM system change involved as well.
> > 
> > If your referring to the split of pagedaemon init then that's just
> > to allow for correct startup ordering, so will have no effect on this
> > behaviour.
> 
> No, I'm referring to the alleged cleanups that led to this request for
> slamming the ARC's eviction threshold to where paging occurs instead of
> where the system declares that it is under (some) memory pressure.

That has nothing to do with the VM; it's an ARC change, which is why I
provided the ability to switch methods in the latest version, so you can
test each option without reboots.


> > > So which of the three is responsible for the bad behaviour?
> > 
> > Assuming I understood you correctly there's actually only one change in
> > play and that's the point at which ARC triggers reclaim.
> 
> No.  There are three (known) changes relative to the code running on the
> production machine which does not exhibit the problem.

I'm afraid I don't know which changes you're referring to, then.

> > In order to narrow down what is causing this I've attached an
> > updated stable/10 patch and some dtrace scripts which will:
> > 1. Allow you to exclude v_cache_count from the arc_reclaim_needed test using:
> >    sysctl vfs.zfs.arc_reclaim_cache_free=0
> > 2. Use dtrace to investigate ARC evicts.
> 
> My current path is to attempt to provoke the bad behavior on my test system.
> I'm *not* pleased with the outcome of what I was asked to test yesterday
> given the assertions that it was "more correct" than my original, working
> and fully-functional patch.
> 
> Please understand -- I have working code now, and will continue to have
> working code.  These attempts to "improve" it which were requested in fact
> led to that system taking a significant and outrageously-invasive (from the
> user perspective) hit *in production.*
> 
> That's not going to happen again and I am certainly not going to provoke
> said bad behavior on said production, internet-facing machine so I can
> diddle with dtrace *on a live system serving users* while said users all go
> swimming instead of crossing the bridge.

Appreciated, but in order to get to the bottom of the issue I wanted to
provide you with some tools that should help with that.

From what you've said the issue presents as a 30 second stall, so
if you can't create a test scenario the updates will now allow you
to switch between the bad and good values instantly, so you can
capture information from a single stall with no reboots required.
Comment 113 karl 2014-09-07 03:24:37 UTC
I went back and looked at the original unaltered (and older) arc.c file; you're correct, the throttle code is unchanged.

I'm going to continue trying to reproduce the problem on the test machine, but I'm not loading it again on the production system until I understand fully why the behavior changed when it should not have (specifically, when I cranked up the arc_free_target well beyond the VM's free target value the stalls did NOT immediately stop -- they should have.)  That will mean being able to reproduce the problem under test.

The issue presents itself as an I/O stall of sufficient duration to error out an SMB file operation (via Samba) on a Windows machine running Acronis to the ZFS volume, and some of the stalls are WELL beyond 30 seconds -- we're talking about a couple of minutes in many cases, and any I/O to a ZFS disk blocks during that time.  There may be multiple sequential stalls occurring; I've not been able to determine that with any sort of reliability.  No errors are logged to the console or elsewhere during that time or when the stall ends.  I'm not entirely certain whether it's local to the pool that the original blocked I/O was directed at -- I don't think it is, but it's very hard to be certain, as one wrong move and your shell session blocks while you're trying to look around during the event.  If systat is open it typically shows one drive pinned at 100% busy, but not all of the drives in a given pool; whether that's the true state or a function of systat itself being blocked is unknown.
Comment 114 karl 2014-09-07 03:39:48 UTC
One other note -- there's something else going on too with your patch, and it's not good.

    1 users    Load  0.00  0.00  0.00                  Sep  6 22:26

Mem:KB    REAL            VIRTUAL                       VN PAGER   SWAP PAGER
        Tot   Share      Tot    Share    Free           in   out     in   out
Act   56044   22988   862556    36132 5929704  count
All 2592072   25680   937452    87808          pages
Proc:                                                            Interrupts
  r   p   d   s   w   Csw  Trp  Sys  Int  Sof  Flt        ioflt  6858 total
             33      4572  17k  72k 1778       17k        cow         atkbd0 1
                                                    17018 zfod        uhci0 16
25.0%Sys   0.0%Intr  0.0%User  0.0%Nice 75.0%Idle         ozfod       ehci0 uhci
|    |    |    |    |    |    |    |    |    |           %ozfod       uhci1 21
=============                                             daefr       twa0 32
                                           dtbuf          prcfr       cpu0:timer
Namei     Name-cache   Dir-cache    281984 desvn          totfr       mps0 256
   Calls    hits   %    hits   %    120767 numvn          react   762 em0:rx 0
       8       8 100                 68849 frevn          pdwak  1016 em0:tx 0
                                                          pdpgs       em0:link
Disks   da0   da1   da2   da3   da4   da5   da6           intrn       cpu1:timer
KB/t   0.00  0.00  0.00  0.00  0.00  0.00  0.00   4847604 wire        cpu9:timer
tps       0     0     0     0     0     0     0      6276 act         cpu7:timer
MB/s   0.00  0.00  0.00  0.00  0.00  0.00  0.00   1416036 inact       cpu10:time
%busy     0     0     0     0     0     0     0           cache       cpu5:timer
                                                  5929788 free        cpu12:time
                                                          buf         cpu4:timer
                                                                      cpu15:time
                                                                 4826 cpu2:timer
                                                                      cpu13:time
                                                                      cpu6:timer
                                                                      cpu14:time
                                                                      cpu3:timer
                                                                  254 cpu11:time
                                                                      cpu8:timer


That's the test machine.  It's idle right now.  Note the huge amount of free memory.

$ zfs-stats -A

------------------------------------------------------------------------
ZFS Subsystem Report                            Sat Sep  6 22:27:29 2014
------------------------------------------------------------------------

ARC Summary: (HEALTHY)
        Memory Throttle Count:                  0

ARC Misc:
        Deleted:                                651.78k
        Recycle Misses:                         162.27k
        Mutex Misses:                           40
        Evict Skips:                            692.02k

ARC Size:                               29.78%  3.17    GiB
        Target Size: (Adaptive)         29.81%  3.17    GiB
        Min Size (Hard Limit):          12.50%  1.33    GiB
        Max Size (High Water):          8:1     10.63   GiB

ARC Size Breakdown:
        Recently Used Cache Size:       51.28%  1.63    GiB
        Frequently Used Cache Size:     48.72%  1.54    GiB

ARC Hash Breakdown:
        Elements Max:                           452.56k
        Elements Current:               84.98%  384.56k
        Collisions:                             339.78k
        Chain Max:                              5
        Chains:                                 33.10k

------------------------------------------------------------------------

That ain't right.  ARC was around 8Gb when I left this afternoon immediately after running a big file copy and a buildworld at the same time in an (unsuccessful) attempt to get it to lock up.  This sort of slam-down is exactly, in fact, what your patch was allegedly supposed to avoid.  Needless to say I don't think it will be possible to get it to lock up as long as that's the state it's going to remain in.

Parameters are at defaults (intentionally) since I was trying to provoke the problem this morning.

$ sysctl -a|grep free_target
vm.v_free_target: 65013
vm.stats.vm.v_free_target: 65013
vfs.zfs.arc_free_target: 21186

This is a 12Gb RAM machine, dual-CPU Xeon box (nominally 8-way SMP plus Hyperthreading.)

I will bring the tree up to the current level and re-apply the patch, but at some point I need a stable reference.  Whatever it's doing now clearly isn't that, and while it behaves this way I would not expect to be able to lock it up no matter what I do -- but that's very undesirable behavior out of the ARC in the opposite direction!
Comment 115 karl 2014-09-07 15:30:59 UTC
And with the test system under heavy I/O load -- note the free memory.

    5 users    Load  1.18  1.55  1.19                  Sep  7 10:28

Mem:KB    REAL            VIRTUAL                       VN PAGER   SWAP PAGER
        Tot   Share      Tot    Share    Free           in   out     in   out
Act  103208   17908  1699800    26648 2216268  count      
All 9107864   22860  1784888    88044          pages      
Proc:                                                            Interrupts
  r   p   d   s   w   Csw  Trp  Sys  Int  Sof  Flt        ioflt 20074 total
  1          52       75k 1476  47k  15k  240    2        cow         atkbd0 1
                                                        2 zfod     41 uhci0 16
 4.4%Sys   2.0%Intr  0.2%User  0.0%Nice 93.4%Idle         ozfod       ehci0 uhci
|    |    |    |    |    |    |    |    |    |           %ozfod       uhci1 21
==+                                                       daefr    33 twa0 32
                                           dtbuf        2 prcfr   308 cpu0:timer
Namei     Name-cache   Dir-cache    281984 desvn        6 totfr  2136 mps0 256
   Calls    hits   %    hits   %     16137 numvn          react  6471 em0:rx 0
       3       3 100                  9313 frevn          pdwak  6462 em0:tx 0
                                                       54 pdpgs       em0:link
Disks   da0   da1   da2   da3   da4   da5   da6           intrn   147 cpu1:timer
KB/t  93.14 92.89 93.20 92.86 10.46 10.37  0.00   9710040 wire   1057 cpu12:time
tps     539   543   541   543    16    17     0    131684 act     220 cpu3:timer
MB/s  49.05 49.24 49.24 49.24  0.17  0.17  0.00    141712 inact   104 cpu10:time
%busy    80    57    56    56    14    15     0     33880 cache   157 cpu7:timer
                                                  2182388 free   1044 cpu13:time
                                                          buf     318 cpu6:timer
                                                                   99 cpu9:timer
                                                                  131 cpu4:timer
                                                                  110 cpu15:time
                                                                  235 cpu5:timer
                                                                  127 cpu14:time
                                                                  164 cpu2:timer
                                                                  606 cpu11:time
                                                                  104 cpu8:timer

root@dbms2:/home/karl # zfs-stats -A

------------------------------------------------------------------------
ZFS Subsystem Report                            Sun Sep  7 10:29:37 2014
------------------------------------------------------------------------

ARC Summary: (HEALTHY)
        Memory Throttle Count:                  0

ARC Misc:
        Deleted:                                824.50k
        Recycle Misses:                         115.32k
        Mutex Misses:                           32
        Evict Skips:                            19.64m

ARC Size:                               73.94%  7.86    GiB
        Target Size: (Adaptive)         73.95%  7.86    GiB
        Min Size (Hard Limit):          12.50%  1.33    GiB
        Max Size (High Water):          8:1     10.63   GiB

ARC Size Breakdown:
        Recently Used Cache Size:       93.75%  7.37    GiB
        Frequently Used Cache Size:     6.25%   503.29  MiB

ARC Hash Breakdown:
        Elements Max:                           137.09k
        Elements Current:               98.80%  135.44k
        Collisions:                             54.28k
        Chain Max:                              3
        Chains:                                 4.27k

------------------------------------------------------------------------

Throttled down from max size materially -- but.... why, given the alleged intent of the current patch above and no memory pressure?  Something's not right here.

In addition the dtrace scripts do not compile against the current patch and source unchanged.
Comment 116 Steven Hartland freebsd_committer freebsd_triage 2014-09-07 15:40:37 UTC
(In reply to karl from comment #115)
> 
> In addition the dtrace scripts do not compile against the current patch and
> source unchanged.

Any one in particular?  I can see that arc-reclaim.d has two additional trace
points not in the stable/10 code; simply comment out the two lines:
/*
vm:::vm-lowmem_cache,
vm:::vm-lowmem_scan,
*/
Comment 117 Steven Hartland freebsd_committer freebsd_triage 2014-09-07 15:50:26 UTC
Created attachment 147014 [details]
VM lowmem probe
Comment 118 Steven Hartland freebsd_committer freebsd_triage 2014-09-07 16:14:18 UTC
(In reply to karl from comment #115)

> ARC Size:                               73.94%  7.86    GiB
>         Target Size: (Adaptive)         73.95%  7.86    GiB
>         Min Size (Hard Limit):          12.50%  1.33    GiB
>         Max Size (High Water):          8:1     10.63   GiB
...
> 
> ------------------------------------------------------------------------
> 
> Throttled down from max size materially -- but.... why, given the alleged
> intent of the current patch above and no memory pressure?  Something's not
> right here.

It's not 100% obvious, but throttling is an over-time thing, so it will happen
if there *has* been pressure.

After that point it will slowly expand back up to max, assuming no pressure.
Comment 119 karl 2014-09-07 16:44:43 UTC
But there never was pressure.  That's the point -- the system was booted with the latest svn update run, built and installed, then the backup test was run.  During the backup run the ARC target was pared down even though ~20% of installed RAM remained free.
Comment 120 Steven Hartland freebsd_committer freebsd_triage 2014-09-07 18:23:01 UTC
(In reply to karl from comment #119)
> But there never was pressure.  That's the point -- the system was booted
> with the latest svn update run, built and installed, then the backup test
> was run.  During the backup run the ARC target was pared down even though
> ~20% of installed RAM remained free.

While you might not think there was, the only way for the ARC to initiate a
shrink is if it sees VM pressure. There are a number of triggers for
this, but one way or another that's what happened.
Comment 121 karl 2014-09-07 18:30:47 UTC
Understood -- my point is that the "pressure" point that was experienced isn't where the changes that were put in allegedly placed it, and that at no time does it appear that the VM system actually thought it was under pressure (that is, free space never went below vm.free_target.)
Comment 122 Steven Hartland freebsd_committer freebsd_triage 2014-09-08 15:25:11 UTC
Created attachment 147068 [details]
ARC reclaim refactor (against stable/10)

Updated default zfs_arc_free_target to (vm_pageout_wakeup_thresh / 2) * 3;

Added more dtrace probes

Fixed a missing include that caused the i386 path to still be taken on the stable version.
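
For illustration only, here is a minimal sketch of how a boot-time default along those lines might be wired up.  This is not the attachment itself; the extern declarations and the SYSINIT placement are assumptions made for the sketch.

#include <sys/param.h>
#include <sys/kernel.h>

extern int vm_pageout_wakeup_thresh;	/* VM paging threshold, in pages */
extern u_int zfs_arc_free_target;	/* tunable assumed to exist in the patch */

static void
arc_free_target_init(void *unused __unused)
{
	/* Only seed the default if the admin has not tuned it. */
	if (zfs_arc_free_target == 0)
		zfs_arc_free_target = (vm_pageout_wakeup_thresh / 2) * 3;
}
SYSINIT(arc_free_target_init, SI_SUB_KTHREAD_PAGE, SI_ORDER_ANY,
    arc_free_target_init, NULL);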
Comment 123 Steven Hartland freebsd_committer freebsd_triage 2014-09-08 15:27:29 UTC
Created attachment 147069 [details]
ARC resize dtrace script
Comment 124 Steven Hartland freebsd_committer freebsd_triage 2014-09-08 15:29:46 UTC
Created attachment 147070 [details]
ARC reclaim refactor (against head)

Updated default zfs_arc_free_target to (vm_pageout_wakeup_thresh / 2) * 3;

Added more dtrace probes
Comment 125 karl 2014-09-08 15:41:06 UTC
That setting for arc_free_target still produces degenerate I/O and RAM behavior on my test system.  Setting it to vm_free_target during runtime greatly lessens (but does NOT completely eliminate) said behavior.

The degenerate behavior exhibits as wildly disparate I/O rates (in TPS, MB/S and %busy) to the members of a raidz2 pool (which should never happen under write-heavy loads) that are coordinated with wildly-oscillating free RAM; as much as 1.5Gb on a 12G system.  What appears to be happening is that actual I/O is getting blocked by something (memory allocation?); if you get an image activation and/or RAM demand during one of those cycles I would expect a full-on freeze because if you cannot complete all the I/Os required on a raidz2 pool to complete the write safely you will block until you can.
Comment 126 karl 2014-09-08 16:03:14 UTC
Shutting off vfs.zfs.arc_reclaim_cache_free further reduces the oscillations - possibly to a large enough degree that the system would be stable under heavy production loads with random (e.g. user-created) image activation and RAM allocation on top of the synthetics.
Comment 127 karl 2014-09-08 17:33:58 UTC
OK, from the testing that Steve has requested (along with avg) here is the test environment and my results.

a. Configuration has 12Gb RAM and dual Xeon CPUs (8-way nominal SMP)

b. There are two pools on separate disk adapters, both running as JBOD with no on-board cache or similar indirections.  The boot pool is a 2-way mirror and the data pool on which the pressure is applied is a 4-device raidz2.  The filesystem on the second pool has lz4 compression enabled.  Swap file is on a dedicated disk partition (not on zfs)

c. A small program is used to dirty and lock up 4gb of wired RAM to simulate the buffering memory used by a moderately-sized Postgres database.

d. A second small program is run that opens a file on the raidz2 pool's filesystem with O_SYNC and then repeatedly lseek()s to a random location within a 20GB file locus and writes 512 bytes of random data there.  This simulates a write-heavy database I/O load and heavily stresses the synchronous commit mechanism.  (An illustrative sketch of such a program appears after this list.)

e. Finally, I use Acronis to create heavy sequential write I/O pressure over a Gig-E ethernet link by running a 200GB backup to the data pool.
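
For reference, a minimal sketch of the kind of synchronous random-write load generator described in item d.  It is illustrative only: the file name, the use of pwrite() in place of the lseek()/write() pair, and the unbounded loop are assumptions, not the actual test program.

#include <sys/types.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define SPAN	(20ULL * 1024 * 1024 * 1024)	/* 20GB locus */

int
main(void)
{
	char buf[512];
	uint64_t r;
	off_t off;
	int fd;

	fd = open("stress.dat", O_RDWR | O_CREAT | O_SYNC, 0644);
	if (fd < 0) {
		perror("open");
		return (1);
	}
	for (;;) {
		/* Pick a random offset within the 20GB locus. */
		r = ((uint64_t)arc4random() << 32) | arc4random();
		off = (off_t)(r % SPAN);
		arc4random_buf(buf, sizeof(buf));
		/* O_SYNC forces each 512-byte write to commit before returning. */
		if (pwrite(fd, buf, sizeof(buf), off) != (ssize_t)sizeof(buf)) {
			perror("pwrite");
			break;
		}
	}
	close(fd);
	return (0);
}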

With defaults once the cache warms up I observe pathological behavior in the form of severe oscillations in both free RAM and I/O rates.  Specifically, free RAM oscillates between less than 200Mb and over 1.5Gb free, and at the same time I see oscillations of more than 10:1 between I/O spindles in both tps and MB/Sec, with one spindle typically pinned on busy% while the others oscillate between being approximately as busy (normal) and dropping to single-digit utilizations and double-digit tps (grossly abnormal.) Granularity of the systat display does not allow me to determine exactly which unlocking event comes first but it appears that when I/O rates are balanced free RAM is rapidly depleted and then rates diverge dramatically.  Some paging activity is noted, but not a gross amount (10-20Mb of total page space consumed.)  No errors are logged during any of the tests.

These oscillations can be reduced dramatically by increasing arc_free_target to vm_free_target.  They are reduced further by excluding cache pages from the free computation, however, that reduction is relatively modest.  The mitigating factor here is that under this test there are few cache pages in the first place; for a system with more of them in normal operation the impact would likely be much larger.

Errata:
After extended testing and shutting down all involved processes I have roughly a gigabyte of unaccounted-for wired memory that I cannot trace and that was not present at boot time; zfs-stats -A shows only ~40MB consumed (which concurs with vmstat -m.)  I do not have a means of identifying the source of this wired allocation, but it is somewhat concerning as it may indicate a memory leak somewhere in the code.  All of the RAM allocated by my wired-RAM pressure program was released when it exited.

Results from the requested testing:

1. Allowing vm_free_target to be invaded *to any degree* by the ARC causes pathological behavior within both the I/O and memory systems.  Specifically, it results in I/O and free RAM oscillations that, if they coincide with natural RAM pressure (e.g. image activation due to a user request, etc) can and do lead to stalls, and it causes paging activity that arguably should not, given the actual RAM and system pressure, take place.

2. Evicting ARC at vm_free_target and NOT including cache memory in the free computation, dramatically reduces both the I/O throughput oscillations on the pool(s) to which I/O is being done (from more than 10:1 deviations to ~2:1 or so maximum) *and* reduces free RAM oscillations as well (by a factor of about four.)  It does *not*, however, eliminate either condition entirely and the remaining oscillations are statistically significant.  Given the steady-state workload being imposed on the system this is unexpected behavior.

3. Evicting ARC *above* vm_free_target does not cause problems in my test system but *was* reported in earlier testing of my original patch to do so for some other people's workloads, and as such that default was removed (the original "additional percentage" reservation.)  Doing so, however, *did* provide further statistically-significant improvements in I/O and RAM stability.

4. These observations match what I previously observed in earlier versions of 10-STABLE and led to the patch in comment #10.

5. The pagedaemon commits made to 10-STABLE between when I developed the original patch and today, which avg believed would address the issue, *do not* detectably change the pathological behavior under test conditions.


Comments and recommendations:

The original Illumos code stays *well* clear of attempting to pressure the VM system for other configurations (e.g. Solaris, etc.)  Attempting to use the ARC as a means of pressuring the pagedaemon is easily shown to result in pathological behavior of both the memory and I/O subsystems and, for some workloads, will result in unacceptable performance degradation that manifests either in poor I/O performance at best or, under certain conditions that are difficult to reproduce in a synthetic test (but not hard for me to reproduce in production) outright system stalls.

Steve appears to have identified and fixed additional issues with code includes that resulted in an inappropriate computation for non-i386 systems in some cases.  These should, of course, be included in any commit/MFC.

Against the current patch, the configuration that produces the best behavior in test without notable side effects comes with arc_free_target = vm_free_target and arc_reclaim_cache_free = 0, neither of which is the current default.

This is effectively the same heuristic for ARC behavior as found in my comment #10.

I therefore recommend arc_free_target = vm_free_target and arc_reclaim_cache_free = 0 as defaults: while this does not completely resolve the oscillation problem, it minimizes it without resulting in undesirable side effects such as unbridled growth of the inact page pool or slamming the ARC to minimums over time.

I further recommend attempting to identify a means of protecting the RAM allocation necessary to complete ZFS I/O so that it does not block under even extreme RAM pressure but I recognize that may not be possible given underlying VM system design.  Illumos appears to have recognized this risk to performance in the original code by how they intentionally remained well away from low-memory conditions.
Comment 128 karl 2014-09-13 05:25:58 UTC
Created attachment 147265 [details]
Patch to correct ZFS "freeze" ram contention problem (apply after Steve's latest 10-Stable patch above)

After much additional back and forth with Steve I believe I have identified the reason for #2 in my reported results and have developed a patch to address it.

What we are doing right now is essentially violence served upon the ZFS code in an attempt to force it to pressure the VM system to reclaim memory that can then be used for ARC.  

While setting arc_free_target = vm_free_target helps this situation, and in fact in my testing doubling both arc_free_target and vm_free_target helped *even more*, it did not result in a "complete" fix.

I now understand why.

In short the dirty memory that ZFS internally allocates for writes can be enough to drive the system into aggressive paging, even when it is trying to aggressively reclaim ARC.  The problem occurs when large I/O requirements "stack up" before the ARC can evict which forces free memory below the paging threshold.  This in turn causes the ZFS I/O system to block on paging (that is, it is blocked waiting for the VM system to give it memory and the VM system must page to do so), which is a pathological condition that should never exist. I can re-create this behavior "at will".

At present "as delivered" (with or without Steve's patch) the code will allow up to 10% of system RAM to be used for dirty I/Os or 4GB, which ever is less.  The problem is that on a ~12GB system that is 1.2GB of RAM, and that exceeds the difference between free_target and the paging threshold (by a *lot*!)  A similar problem exists for other configurations; at 16GB the difference between free_min and free_target (by default) is ~250MB but 10% of the installed memory is 1.6GB, or roughly SIX TIMES the margin available.  If you allow the ARC to invade beyond free_target then the problem gets worse for obvious reasons. This is why setting it to free_target helps, and in fact why when I *doubled* free_target it helped even more -- but did not completely eliminate the pathology, and I was still able to force paging to occur by simply loading up on big write I/Os.

I propose that I now have a fix instead of a workaround.

Note that for VERY large RAM configurations (~250GB+) the maximum dirty buffer allocation (4GB) is small enough that you will not invade the paging threshold.  In addition if you are not write I/O bound in the first place you won't run into the problem since the system never builds the write queue depth and thus you don't get bit by this at all.

But if you throw large write I/Os at the system at a rate that exceeds what your disks can actually perform at there is a decent probability of forcing the VM system into a desperation swap condition which destroys performance and produces the "hangs."  *Any* workload that produces large write I/O requests that exceed the disk performance available is potentially subject to this behavior to some degree, and the more large write I/O stress you put on the system the more-likely you are to run into it.  Workloads that mix sync I/Os on the same spindles as large async write I/Os are especially bad because the sync I/Os slow total throughput available dramatically.  In addition during a resilver, especially on a raidz(x) volume where I/O performance is materially impaired (it is often cut by anywhere between 50-60% during that operation) you're begging for trouble for the same reason.

Steve pointed me at a blog describing some of the changes in the I/O subsystem in openzfs along with how the code handles dirty pages internally and, with some code-perusing the missing piece fell into place.

The solution I have come up with is to add dynamic sizing of the dirty buffer pool to the code that checks for the need to evict ARC.  By dynamically limiting the outstanding dirty I/O RAM demand to the margin between current free memory and the system's paging threshold, even ruinously nasty I/O backlogs no longer cause the system to page, and you no longer get ZFS I/O threads in a blocking state.
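
As a sketch of that idea only (not the attached patch itself; the helper name and exact comparison are illustrative, while zfs_dirty_data_max, zfs_dirty_data_max_max, vm_cnt and vm_pageout_wakeup_thresh are the existing kernel names used elsewhere in this report):

#include <sys/param.h>
#include <sys/vmmeter.h>

extern int vm_pageout_wakeup_thresh;		/* paging threshold, in pages */
extern uint64_t zfs_dirty_data_max;		/* current dirty-data ceiling, bytes */
extern uint64_t zfs_dirty_data_max_max;		/* absolute cap, bytes */

static void
zfs_dirty_data_max_clamp(void)
{
	uint64_t margin = 0;

	/* How far free memory currently sits above the paging threshold. */
	if (vm_cnt.v_free_count > (u_int)vm_pageout_wakeup_thresh)
		margin = (uint64_t)(vm_cnt.v_free_count -
		    (u_int)vm_pageout_wakeup_thresh) * PAGE_SIZE;

	/* Never allow more in-flight dirty data than that margin. */
	zfs_dirty_data_max = MIN(margin, zfs_dirty_data_max_max);
}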

I can now run on my test system with an intentionally faulted and resilvering raidz2 pool, dramatically cutting the I/O capacity of the system, while hammering the system with I/O demand at well beyond 300% of
what it can deliver.  It remains stable over more than an hour of that sustained treatment; without this patch the SMB-based copies fail within a couple of minutes and the system becomes unresponsive for upwards of a minute at a time.

The patch is attached above; it applies against Steve's refactor for Stable-10 pulled as of Friday morning.

Please consider testing it if you have an interest in this issue.

Thanks.
Comment 129 karl 2014-09-13 15:16:33 UTC
The patch I added last night to this thread has now been validated against the older 10.0-STABLE code build I am running on my production system (with my original changes to arc.c.)

This rules out the pagedaemon changes that avg proposed as being the effective element, since they are not present in the revision of 10.0-STABLE in question (r265164M).

I am now attempting to roll forward the production machine to 10.1-PRE since it appears, even under severe stress, that the resolution is effective on both r265164 and r271435.
Comment 130 Steven Hartland freebsd_committer freebsd_triage 2014-09-13 16:10:02 UTC
Dynamically adjusting zfs_dirty_data_max could cause a panic in dmu_tx_delay due to a divide by zero.
Comment 131 karl 2014-09-13 16:33:04 UTC
Created attachment 147274 [details]
See comment -- fix potential divide-by-zero panic

That's easily fixed (the divide-by-zero risk) -- a new patch that does so is attached.

This should apply against a clean stable/10 (this patch incorporates Steve's patch from 2014-09-08 and does NOT require it to be applied first.)
Comment 132 karl 2014-09-13 16:38:12 UTC
Created attachment 147275 [details]
Replace previous patch -- picked up wrong copy (extra parenthesis)

Comment above applies; picked up the wrong file when attaching.
Comment 133 karl 2014-09-13 16:41:50 UTC
Created attachment 147276 [details]
Argh - didn't include all of Steve's other changes -- yet another try :-)

Include all of Steve's previous other changes to Stable-10.  Sorry about that.
Comment 134 karl 2014-09-13 16:44:56 UTC
Hold on a second -- one more time (wrong copy AGAIN) :-)
Comment 135 karl 2014-09-13 17:55:05 UTC
Steve has pointed out a potential (very small, but possible) window where a divide-by-zero could occur.  Working on that -- careful with that axe if you apply the previous patch. :-)
Comment 136 karl 2014-09-13 20:34:10 UTC
Created attachment 147286 [details]
Candidate Fix (supersedes previous)

Move dirty_max resize into dmu_tx.c in the function dmu_tx_assign.  This places it in front of the thread mutex which should prevent the possibility (small, but present) of a divide-by-zero identified by Steve.

Verified to apply against r271435, builds cleanly and survived two "hell tests", including extreme I/O stress during a raidz2 resilver, that have killed the test machine repeatedly in the past. It is currently under a fairly extreme soak test and has exhibited no problems at all.

Note that this patch includes Steve's refactor against 10-STABLE.

Comments from others are appreciated.
Comment 137 Tomoaki AOKI 2014-09-15 12:46:32 UTC
(In reply to karl from comment #136)

Working with no problem for me. Thanks. stable/10 r271481 amd64.

But one strange thing.
At the time Steven's previous patch was uploaded, I had accidentally forgotten to apply it ('patch -R' for the previously applied one, 'patch -C' for an integrity check, but then missed the actual 'patch' run), yet I observed no thrashing behavior with that kernel.  (I noticed the omission by the absence of the vfs.zfs.arc_free_target sysctl.)
As I applied the patch and rebuilt the kernel right after noticing that, my only stress test without the patch was rebuilding the kernel.

At least when I first applied Karl's patch (Mar. 26, 2014), severe thrashing behavior was observed with the vanilla GENERIC kernel during memory-intensive jobs, such as linking the kernel or linking www/chromium, and Karl's patch helped.

I haven't determined which change (maybe in sys/vm) helped, but I haven't observed any negative influence with Karl's latest patch.

By the way, I think this bug 187594 should be a Release Blocker (Show Stopper) issue for 10.1, but it is not currently listed in https://www.freebsd.org/releases/10.1R/todo.html.  Shouldn't it be listed there?
Comment 138 karl 2014-09-16 18:41:54 UTC
No problems noted with the latest on either test or production systems.  The production system was subjected to ridiculous levels of abuse yesterday with the latest patch in due to a burst of activity and it showed no distress.

As an aside Steve MFC'd a fix for a fairly-serious but unrelated problem that IS evident in recent 10.x that I was chasing down on the Illumos list; the symptom is that a resilver can either restart repeatedly or never complete.  The reason was a bogon in the code related to ZFS attempting to use TRIM on drives that don't support it -- this caused the resilver to be queued for a restart due to the "error" returned from the TRIM attempt.

Steve reports it's fixed as of r271683.  There was a patch committed some time ago to address this but MFCing it apparently slipped through the cracks.
Comment 139 John Hay freebsd_committer freebsd_triage 2014-09-16 19:40:00 UTC
I found this after I saw the email from Karl. I upgraded to 10-stable svn 271611 and applied the patch mention in comment 136. And what a difference. :-) My machine reacts when I type or click on another window. Previously I often had to wait for 30+ seconds.

For the record, the machine has 4G RAM and boot from zfs. It is my desktop but builds snapshots at night.
Comment 140 karl 2014-09-19 01:45:02 UTC
Created attachment 147459 [details]
Replaces previous; results of further work with Steve

From findings and further effort; credit to Steve as well as plenty of work has gone into this across the board.

1. The UMA allocator is, for "pedestrian" system configurations, *badly* broken, both in grabbing ridiculously large chunks of RAM and in not releasing them when it should.  This patch turns it off; MFCing it back on appears to have been a serious mistake, as shutting it off stops a lot of the pathology that appeared to be coming from the ZFS buffer management code.  Those with *very* large RAM configurations -- where unannounced 2GB "grabs" of RAM, and significant chunks of memory sitting around in the wired state for long periods even though allegedly released, do not severely impact performance -- may have a different view.

2. With the crazy RAM grab problems gone resizing of the write buffers is no longer necessary.  Remove that code.

3. Set arc_free_target to the paging threshold plus a portion of the difference between the VM system's free target and the paging threshold.  There is apparently a belief in some quarters that it should be at the paging threshold.  On both a 12GB and a 4GB configuration, setting it there in my environment produces materially worse performance as well as swapping that appears to be unnecessary; the 12GB configuration also produces fairly severe free RAM oscillations with it set there that go away with where I have configured it.  Feel free to try it both ways.

Should apply against 10-STABLE r271838 and similar, supersedes previous.  Differs materially from Steve's latest refactor only in turning off the uma allocator and where the default free_target is set.

Note that if you have actual hard disks (and not SSDs) you want to be either on 271683 or later, or, if not, turn off TRIM in your /boot/loader.conf with vfs.zfs.trim_enabled=0.  See comment #138.
Comment 141 Justin T. Gibbs freebsd_committer freebsd_triage 2014-09-20 06:39:02 UTC
Can you provide more information about the measurements you performed to conclude that UMA is interacting poorly with ZFS?  Disabling UMA significantly increases CPU utilization in ZFS.  On some workloads (SSDs and 8K record size) using UMA doubles ZFS's performance, so it would be better to fix UMA (or ZFS's use of UMA) than to just turn it off.
Comment 142 karl 2014-09-20 15:06:20 UTC
I'm working on that (fixing the way ZFS interacts with the UMA allocation system.)

I have taken Steve's refactor and continued work, as I can easily produce workloads here that cause either system stalls or (in a couple of cases, reproducible only on occasion) kernel panics due to outright RAM exhaustion with the parameters set where Steve and, apparently, the rest of the core team want them.

ZFS' code was originally designed to remain well beyond the paging threshold in RAM consumption.  If you look at the original ARC code and how it determines sizing strategy this is apparent both in the comments and the code itself.  Unfortunately this is a problem on FreeBSD because the VM system is designed to essentially sleep and do nothing until available system memory drops below vm.v.free_target.

So if you use ZFS on FreeBSD, and no change were made to that strategy, over time inactive and cache pages would crowd out the ARC cache until it dropped to the minimum.  This is unacceptable for obvious reasons, so on FreeBSD the code is (ab)used to pressure the VM system by intentionally invading the vm.v.free_target threshold, causing pagedaemon to wake up and clean up the kitchen floor of all the old McDonalds' hamburger wrappers.  There is in fact apparently a belief in the core team that the ARC's eviction code should remain asleep until RAM reaches the vm.pageout_wakeup_thresh barrier, effectively using the ARC to force the system to dance on the edge of the system's hard paging threshold.  This is, in my opinion, gross violence upon the original intent of the ZFS code, because causing the system to swap in preference to keeping a (slightly) larger disk cache is IMHO crazy, but this is where my recent work has focused, at their request.  I've acceded to it for testing purposes (although I'll never run *my* systems configured there, and the patches I've got here don't have it set there) because if I can get the system to be stable with it set at that value it will run (much) better with what I believe is a more-sane choice for that parameter (vfs.zfs.arc_free_target), and one that I (and everyone else) can tune easily.

UMA is much faster, but it gets its speed in part by "cheating."  That is, it doesn't actually return memory to the system right away when it's told to free it; the overhead is deferred and taken "at convenient (or necessary) times" instead of for each allocation and release.  It also has the characteristic of grabbing quite-large chunks of RAM at impressive speed for use within its internal bucketing system.  The lazy release, while good for performance, can get you in a lot of trouble if you're dancing near the limits of RAM exhaustion and, before that release happens, an allocation is requested that cannot be filled from the existing free buckets in UMA, so a further actual RAM allocation is required.

As written, ZFS currently calls the UMA reap routines in the ARC maintenance thread.  In a heavy I/O load environment with memory near exhaustion (on purpose, due to the above) this leads to pathology.  If the system is forced to swap to fill a memory request for ZFS it is possible for the zio_ thread in question to block on that request until the pagedaemon can clear enough room for the allocation to be granted.

I have instrumented the ZFS allocation code sufficiently with dtrace to show that under heavy read/write I/O loads, especially on a degraded pool that is resilvering (which materially cuts the maximum I/O performance), and even though the ARC code *should* force eviction of enough cache to prevent materially invading the paging threshold, with UMA on the fact that it does not reap until the ARC maintenance thread gets to it means I can produce *mid-triple-digit* free RAM page counts (that is, about two *megabytes*!) immediately following allocation requests -- and wicked, nasty stall conditions, some lasting for several minutes, where anything requiring image activation (or recall from swap) is non-responsive because all (or virtually all) of the zio_ threads are blocked waiting for the pagedaemon.  This is on a 12GB machine with a normal paging threshold of 21,000 pages.  The interesting thing about the pathology is that it presents as a "resonance" sort of behavior; that is, it creates an oscillating condition in free memory that gets worse against a constant load profile until the system starts refusing service (the "stalls"), and if I continue to press my I/O demand into that condition I have managed to produce a couple of panics from outright RAM exhaustion.

The original patch I wrote dramatically reduced the risk because it pulled the ZFS ARC cache out of the paging threshold and up to the vm_free_target point.  It did not completely eliminate it, due to the UMA interaction problem, which took a *lot* of effort to instrument sufficiently to prove.  Pulling ARC out of the paging threshold *and* limiting dirty buffer memory to the difference between current free RAM and the paging threshold on a dynamic basis completely eliminates the problem (because it prevents that invasion of the threshold from happening in real time, and thus the reap calls from the ARC maintenance thread are sufficient), but that apparently is not considered an "acceptable" fix.  Turning off UMA works as well by removing the oscillation dynamic entirely, but it has a performance hit associated with it.

My current effort is to build a heuristic inside the ARC RAM allocator that will detect the incipient onset of this condition.  I am currently able to get the code to "dampen" the oscillation sufficiently to prevent outright RAM exhaustion, but not quickly enough to remain out of danger of a stall.  That's not good enough, because once the degenerative behavior starts it gets progressively worse for a constant workload.  There's no guarantee I'll get where I want to go with this effort, but I'm giving it my best shot as I'm aware that UMA is considered a "good thing" in terms of performance.

I'll post further code if/when I get something that survives my torture test with UMA on and doesn't throw dtrace probe data that makes me uncomfortable.  

At present I believe I can get there.
Comment 143 karl 2014-09-22 04:39:47 UTC
I believe I have a smoking gun on the UMA memory allocator misbehavior.

zio_buf_65536:           65536,         0,       656,     40554,  25254813,         0,         0,      41MB,   2.475GB,   2.515GB

That's a particularly pathological case that is easily reproduced.  All that is required, on a degraded pool (to slow its maximum I/O rate down), is to read all files on it to /dev/null (e.g. "find . -type f -exec cat {} >/dev/null \;") and at the same time perform a copy of files within the pool (e.g. "pax -r -w -v somedirectory/* anotherdirectory/")

Since the copy is local it is all done with a fixed-size block (i.e. there is no TCP congestion control impacting the write block size and splaying the block requests all over the hash table.)  Specifically, all the I/Os fall into the 64k block size in the UMA allocator.

Here's the problem -- at the moment I took that snapshot there were 656 of those buffers currently in use, but over 40,000, or 2.475 *gigabytes*, of RAM "free" in that segment of the pool.  In other words *rather than reassign one of the existing 40,000+ "free" UMA buffers the system went and grabbed another one, and it did it over 40,000 times!*

This pathology of handing back new memory instead of re-using the "free" buffers in that pool that the system already has continues until the ZFS low memory warnings trip, then the reap routine is called and magically that 3GB+ of RAM reappears all at once.  In the meantime the ARC cache has been evicted on bogus memory pressure.

This is broken, badly -- those free buffers should be reassigned without taking a new allocation.

The impact of the pathology is extraordinary under heavy load as it produces extremely large oscillations in the system RAM state which both ZFS and the VM system have no effective means to deal with.

I have code that I've stuck in the ZFS allocator that prevents it from blowing up at this point by detecting the low memory condition when it crosses the vm_free_target barrier and calling the reap routine to clear out the free allocations that are not being reused as they should be.  While that does result in a system that doesn't blow up or stall it doesn't solve the actual problem but rather works around it, as did the original patch I posted (and the one that dynamically varied write buffers to compensate for the extreme hysteresis in the memory state.)

The correct fix will require finding out why UMA is not returning a free buffer that it has in the cache instead of allocating a new one, but until the UMA code is repaired the approach in comment 140, which includes disabling UMA, appears to be the only sane option.
Comment 144 Andriy Gapon freebsd_committer freebsd_triage 2014-09-22 13:51:26 UTC
(In reply to karl from comment #143)

Just an assorted list of things to consider, perhaps it will lead to some ideas to try.

- were the new allocations all 64KB sized?  if not then they could not make use of zio_buf_65536 cached objects.

- cached items from UMA zones are really freed only when the paging threshold is reached and the pagedaemon makes a pageout pass.

- the pagedaemon first posts vm_lowmem and then calls uma_reclaim().
- the event handlers are invoked synchronously (and directly).
- additionally, arc_lowmem() handler will block until the arc reclaim thread makes a pass.

- the above means that there is no normal paging out nor uma reclaiming until the arc reclaim thread evicts some of the cache.
- which means that any memory that the arc reclaim thread needs while doing its job or memory that is needed by other threads (including zfs ones) digs deeper into the page reserve

- zfs memory throttling may occur because of that (see arc_memory_throttle())

- additionally, there is the vm.lowmem_period knob; vm_lowmem and uma_reclaim() cannot be called more often than that period allows

Personally, what I'd try first is make arc_lowmem() non-blocking. IMO, it should just wake up the arc reclaim thread and return.
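
A minimal sketch of that suggestion (not committed code; the lock, condition variable and flag names follow the arc.c of this era, but their use here is an assumption):

static void
arc_lowmem(void *arg __unused, int howto __unused)
{

	mutex_enter(&arc_reclaim_thr_lock);
	needfree = 1;
	cv_signal(&arc_reclaim_thr_cv);
	mutex_exit(&arc_reclaim_thr_lock);

	/* Deliberately no msleep()/cv_wait() here: return immediately so
	 * the caller (e.g. the pagedaemon) is never blocked behind a full
	 * ARC reclaim pass. */
}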
Comment 145 Andriy Gapon freebsd_committer freebsd_triage 2014-09-22 13:52:36 UTC
Changing the status because what's been committed does not seem to be a final solution at all.
Comment 146 karl 2014-09-22 17:01:00 UTC
OK, I've traced it further using AVG's touchpoints and now fully understand exactly what is going on.  

Nobody is going to like this, but my previous mitigation strategy is entirely effective at preventing the stalls.  I will explain.

ZFS allocates dirty buffers for write data subject to two parameters: dirty_max and dirty_max_max.  The former is 10% of RAM by default, the latter is 4GB.  The lesser of the two is the nominal limit.

Read buffers are less of a problem because a read comes FROM the disk and there is no "backlog"; that is, the system memory is faster than the I/O and thus there is no artificial pressure.  Writes are the opposite; there is always pressure any time the write rate exceeds the physical I/O rate of the system.

Dirty_max limits the total "in flight" data that can be in process.

UMA makes this situation untenable in the general sense, because for most workloads write block size is highly variable.  But UMA silos data into buffer blocks of various sizes, and can only re-use a block of a given size.  That is, if I want a 64k block of RAM but there are two 32K ones and no 64K ones, I must allocate a new 64K block.
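
The size-class behavior is roughly this (a sketch modelled on zio_buf_alloc(); error checking elided):

    /*
     * Each size class has its own kmem cache (UMA zone); a request is
     * rounded to its class and can only be satisfied from that class's
     * free items, no matter how much sits idle in the other classes.
     */
    void *
    zio_buf_alloc_sketch(size_t size)
    {
            size_t c = (size - 1) >> SPA_MINBLOCKSHIFT;     /* class index */

            return (kmem_cache_alloc(zio_buf_cache[c], KM_PUSHPAGE));
    }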

Now contemplate what happens in an environment where the system does not keep up with the native I/O rate for any material amount of time.  For each moment during which the system is behind it will allocate up to 10% of the total system memory to dirty buffers, and take them from UMA for each particular size that is being written.  Those buffers, after the write completes, are returned to the UMA pool but are not freed back to the system.  Only a call to kmem_reap() (or its sub-calls that only act on some of the kegs) actually returns space.

So let us take the existing 12GB system that I am testing on with an SMB client that is doing a lot of write I/O.  Due to TCP window-size control the network stack tries to optimize the bandwidth-delay product so as to produce a smooth flow of information.  This means that the block size presented to the system for writes is highly variable.

That in turn splays UMA allocation requests all over the map, as the I/O delays vary and thus the result of that computation (and with it the block size) is also variable.

The consequence of this is not a maximum of 1.2GB of "unused" buffers in the UMA system at any given point in time (that is, dirty_max when fully drained); it is a potential *unlimited* multiplication of that 1.2GB!  In actual practice it is easy to accumulate 3, 4, or even 5GB of "free" UMA data in the system this way, evicting the ARC cache the entire time, because the block sizes requested in a given burst are not available in that size and thus must be allocated, even though a gigabyte or more is available in another size.

There is no fix for this given the architecture of how the UMA memory system works and the way ZFS uses it.

What I can do is mitigate the damage through a couple of strategies, both of which I know work from testing but both of which were discarded when presented.  I argue they should be reconsidered as a group and active if UMA is turned on, specifically:

1. Set the ARC eviction threshold to approximately vm_free_target (NOT above it, as that provokes bad behavior by preferring inactive and cache pages over ARC.)

2. Dynamically tune the dirty_write_max to the difference between current actual free memory and the paging threshold (sketched after this list).  When RAM is plentiful this allows the full 10% of RAM to be used for dirty buffers.  When it is not, it halts the rapid expansion of free-and-non-reallocatable UMA storage (because the block size requested is not an exact match) and materially reduces the pressure coming from the above.

3. When an ARC allocation is made and we detect we are under vm_free_target, call kmem_reap().  This should not happen *if* the ARC maintenance thread gets there first (since it also evicts and makes calls to the reaping routines), but because that thread is asynchronous and the allocation is on-demand, it can still happen occasionally.
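
A sketch of what strategy 2 amounts to (names are illustrative; the 256MB floor is the figure from the later patch description in comment 149):

    /*
     * Strategy 2, sketched: when memory is tight, clamp the dirty data
     * limit to the headroom between current free RAM and the paging
     * threshold, with a floor so writes still make progress.
     */
    uint64_t
    dirty_max_dynamic(uint64_t dirty_max, uint64_t free_bytes,
        uint64_t paging_threshold_bytes)
    {
            uint64_t floor = 256ULL << 20;          /* 256MB minimum */

            if (free_bytes <= paging_threshold_bytes)
                    return (floor);
            return (MAX(floor, MIN(dirty_max,
                free_bytes - paging_threshold_bytes)));
    }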

All three can be conditioned on the use of UMA; those who do not wish to use UMA need none of those protections.  If UMA is on, however, without them you're going to see extreme oscillations in free memory for any workload that has highly-variable write block sizes going to the ZFS volumes that exceed the system's I/O bandwidth.

What we have, basically, is a design limitation that was not taken into consideration, and steps need to be taken to mitigate the harm it does.

I will work up, test and validate the previous effort plus #3 and put it through my torture test, posting the results.
Comment 147 Justin T. Gibbs freebsd_committer freebsd_triage 2014-09-22 20:07:18 UTC
The buffer sizes allocated via zio_buf_alloc() are generally related to file system record size, not the access size of the client.  When compression is on, or the file system is so fragmented as to trigger gang block allocation, additional, temporary buffers are required which will be smaller than the record size (typically 128k for data blocks, 16k for metadata).  Similar temporary buffer allocation will occur for the parity blocks for the RAIDZ transform and for I/O aggregation in vdev_queue.c.  Are you writing a lot of sub-128k files, or using compression?  In your RAIDZ, what is the typical I/O size going to individual disks?  I'm trying to better understand the spike in usage for the 64k bucket on your system and whether it is due to write inflation (these additional temporary buffers) or is really due to data held in the ARC.

Since the temporary allocations are released to UMA as soon as the I/O completes, the VM system should be able to reclaim the majority of them on demand.  However, if we really do want to avoid the cost of the VM seeing a page shortage, we'll need to estimate the memory overhead for retiring dirty data (which could be 2x the size of the actual data and metadata) and reclaim within ZFS to vm_free_target + that amount before starting the I/O pipeline.  Alternatively, we could notice the encroachment on the limit in zio_buf_alloc(), and trigger reclamation there.  For these temporary buffers, waiting for the ARC to do the reclamation may be too late.

All the above assumes that temporary zio buffer allocation is part of the problem - a hypothesis I'll try to validate on my systems here.
Comment 148 karl 2014-09-22 20:25:53 UTC
Compression is off on this filesystem on the test machine (it typically is on and set to lz4 for filesystems on the production machines, however); the pool in question is a raidz2.

The blocksize is 128k (default) on that filesystem as well.

With this load (internal pax copy + a cat of all the files to /dev/null) the system is stable although the oscillation in free memory is pretty extreme in the face of static load (the size of dirty_data_max.)  The instability comes when you start getting a mix of block sizes in there and effectively amplify that free-and-not-reallocated memory from the dirty buffers times the number of disparate buffer sizes that get used.

With reads only (just the "cat") the problem doesn't manifest; the system gets to a steady state on RAM and regular arc replacement works just fine and there is no large group of UMA buffers that are free and unused compared against the allocated ones in the various buckets.
Comment 149 karl 2014-09-23 17:49:01 UTC
Created attachment 147607 [details]
Current working patch against 10-Stable-BETA2

This is my current working proposal; it has the following characteristics on top of Steve's original refactor:

1. Moves the ZFS UMA reap routines to their own function so they can be used from other than just in the ARC maintenance thread.

2. If UMA is enabled calls a UMA reclaim every 10 passes through the ARC maintenance thread.  Under normal (non-pressured) RAM conditions this occurs once every 10 seconds and keeps the UMA area clean of burst I/O buffer demand that is not being reused.

3. If UMA is enabled, we are under RAM pressure at the time of allocation, AND we evicted ARC to make room and yet there is still RAM pressure despite the eviction, presume UMA bloat is the cause and call the UMA reaping routines.  (This has a dtrace associated with it; code for that follows if you wish to track it.)  This test should never pass with a negative paging margin; that it does is hard evidence that the system allegedly returned memory that has a low probability of being able to be reused (because it's in the wrong UMA bucket.)

4. Sets the arc free target to halfway between the paging threshold and vm_free_target (see the sketch after this list); this is a tunable if you wish to set it elsewhere (specifically, some argue it should be at the paging threshold.  I strongly disagree, but feel free to experiment with that [vfs.zfs.arc_free_target] and determine which provides superior results under your workload; it is dynamically tunable while the system is in operation.)  This should NOT be set above vm_free_target except in extraordinary (and very unlikely) circumstances, as doing so prefers inactive and cache pages over ARC.

5. If UMA is turned on implements a dynamic dirty write buffer resize when under memory pressure.  Normally dirty write buffers are limited to a fixed 10% of system memory with a maximum of 4GB.  The dynamic resizing code adds a further test, restricting dirty write buffers to the difference between the paging threshold and current system free RAM, with a minimum of 256MB, should that be less than the previous tests would allow.  This results, in my testing, in a material improvement in I/O convergence against available I/O bandwidth when the system is under stress.
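
As a sketch of item 4's default (the exact VM symbol names differ slightly between branches; vm_free_target() is a hypothetical accessor for the VM free target):

    /*
     * Item 4, sketched: default vfs.zfs.arc_free_target to the midpoint
     * between the paging threshold and the VM free target.  It stays a
     * sysctl so it can be moved while the system is running.
     */
    static void
    arc_free_target_init_sketch(void)
    {
            u_int wakeup = vm_pageout_wakeup_thresh;  /* paging threshold */
            u_int target = vm_free_target();          /* VM free target   */

            zfs_arc_free_target = wakeup + (target - wakeup) / 2;
    }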

The above heuristic appears to provide an excellent balance between the interacting dynamics involved with UMA turned on, limiting the burst memory pressure from write I/Os to a reasonable value that, in turn, allows the UMA cleanup code to keep up when the system is under stress.  It also prevents huge amounts of free RAM that accumulated during an I/O stress event from sitting indefinitely in the UMA free list in preference to ARC cache space, and in my testing is 100% effective at preventing "stalls" and other related bad behavior, especially for interactive sessions, while having no perceptible impact on I/O performance.

Comments from others who try the code are appreciated.
Comment 150 karl 2014-09-23 17:51:27 UTC
Created attachment 147609 [details]
dtrace script for the above

This dtrace script contains two calls; the first fires if there is a UMA reap called from inside the allocator (indicating memory pressure that resulted in eviction and needs cleaning at the kernel level) and the second, commented out, prints for each pass through dmu_tx on a write buffer and shows the dynamic sizing.  The latter is commented out as it produces a great deal of output under heavy I/O load and thus is only useful for spot checks if you suspect misbehavior.
Comment 151 Andriy Gapon freebsd_committer freebsd_triage 2014-09-26 16:07:01 UTC
I suggest that people who are affected by the clearly inefficient behavior of uma(9) disable usage of uma for zio allocation.
Meanwhile we should work on improving uma behavior as a separate issue.

In other words, I do not see a need for kludges in arc code only to compensate for uma's problem.

I think that we need to periodically trim cached buckets of uma zones.  But we should not throw away all of them as this would negatively impact performance and almost defeat uma benefits.  So there should be some stats collection to estimate working set size for each zone, so that we do not reduce the cache too much unless really necessary.  I am tempted to suggest that pagedaemon could be doing those jobs.
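
One possible shape for that, purely as a sketch (none of these names exist in uma(9); they only illustrate the working-set idea):

    struct zone_stats {
            uint64_t used_items;    /* currently allocated from the zone */
            uint64_t cached_items;  /* sitting free in the zone's caches */
            uint64_t ws_estimate;   /* decayed working-set estimate      */
    };

    /*
     * Hypothetical periodic trim, e.g. driven from the pagedaemon: keep a
     * decaying estimate of each zone's working set and free only the
     * cached items above it, rather than draining everything.
     */
    static void
    uma_trim_zone_sketch(struct zone_stats *zs)
    {
            zs->ws_estimate = (3 * zs->ws_estimate + zs->used_items) / 4;
            if (zs->cached_items > zs->ws_estimate)
                    zone_release_items(zs, zs->cached_items - zs->ws_estimate);
    }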
Comment 152 karl 2014-09-26 16:47:46 UTC
(In reply to Andriy Gapon from comment #151)
> I suggest that people who are affected by the clearly inefficient behavior
> of uma(9) disable usage of uma for zio allocation.
> Meanwhile we should work on improving uma behavior as a separate issue.
> 
> In other words, I do not see a need for kludges in arc code only to
> compensate for uma's problem.

Except that it's not really UMA's problem, per se.  UMA is working as designed.

It is ZFS' abuse of UMA's design.

Specifically, UMA is designed for quick allocation where there may be constructed items that are expensive to put together (e.g. things that require a mutex or similar to set up) and which can be re-used.  Those may also have an associated data block of arbitrary but known (and therefore fixed) size.  In that regard it works extremely well.

ZFS works fine in the general ARC (read) cache sense within this paradigm.

It violates the intent on a gross basis when deferred write (that is, dirty data) buffering is involved, because a dirty data buffer could be of any size.  In addition it is entirely possible (and indeed at least somewhat likely) to have large bursts of dirty (asynchronous) write data outstanding in a given size in one instant, and then have that data outstanding in a different size in another.

ZFS imposes a dirty data maximum on a per-pool basis but it is sized predicated on system RAM.  This is inherently wrong-headed because the maximum productive use of such a cache is determined not by system memory but rather by I/O channel performance, which varies from system to system and also in many cases from pool to pool or even vdev to vdev (e.g. the system that contains a high-performance SSD pool, and a second of much lower-performance spinning rust.)  The I/O scheduler (by default) will not apply artificial delay until 60% of that cache is in use (which is ridiculously large for spinning rust in many cases, especially with older, slower drives in a raidz configuration) and will not hard-limit until it fills.  UMA multiplies the maximum potential dirty data RAM allocation by the number of different size blocks queued for writing, which makes the original bad decision much worse and leaves the code unable to determine whether memory really is low or whether a requested allocation could be satisfied from an existing but idle RAM allocation.
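
For reference, the 60% figure above is the write throttle's delay trigger; roughly (a sketch of the check, using the stock tunable names):

    /*
     * Artificial write delay only begins once outstanding dirty data
     * exceeds zfs_delay_min_dirty_percent (60% by default) of
     * zfs_dirty_data_max; writes are only hard-limited when it fills.
     */
    static boolean_t
    dirty_delay_needed(uint64_t dirty_bytes)
    {
            return (dirty_bytes >
                zfs_dirty_data_max * zfs_delay_min_dirty_percent / 100);
    }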

> I think that we need to periodically trim cached buckets of uma zones.  

Yes, but I believe that misses the essence of the issue -- ZFS is abusing the original intent of how UMA is architected.  This is an annoyance (it limits ARC write cache size) during times of little RAM pressure and highly-invariant (size-wise) write buffer use.  It becomes a big problem and leads to pathology when the system is under memory stress and write buffer size is highly-variable and beyond the ability of the system's I/O channels to drain. Network I/O and degraded pools both add to the probability of trouble because the first tends to splay out write buffer size and the second materially harms end-to-end I/O channel performance.

> But
> we should not throw away all of them as this would negatively impact
> performance and almost defeat uma benefits.  So there should be some stats
> collection to estimate working set size for each zone, so that we do not
> reduce the cache too much unless really necessary.  I am tempted to suggest
> that pagedaemon could be doing those jobs.

I argue that's just a kludge to cover a design problem as well, and embedding it into the system (rather than restricting it to the ZFS code) is the wrong direction to take.

It appears to me that the correct option overall is to not abuse UMA in the first place.  Specifically, dirty write buffer allocation is one that needs examination on a more-systemic basis, arguably moving it to a facility that is not part of UMA but rather allocates wired RAM and manages it internally to ZFS itself.  At the same time the dirty data buffer sizing should be computed predicated on measured I/O subsystem performance on a per-pool basis rather than taking a blanket allocation based on RAM size.

By removing dirty write buffers from UMA the problem with UMA is resolved.  By having ZFS do its own allocation and management for dirty write buffers it can have one pool of space that it can size dynamically predicated on available system memory or optimum throughput (whichever is less), with a maximum as is now the case but across the entire system rather than per-pool along with a per-pool limit that is computed predicated on I/O subsystem performance for each pool.

But that is a pretty significant change to the code and one that certainly ought to be coordinated with Illumos, since the impact on UMA is not limited to FreeBSD; indeed, it applies across-the-board.

The rest of how ZFS uses UMA is both ok and of positive impact on performance. It is specifically the abuse of UMA to hold dirty write buffers that leads to the problem.

I agree that for many people the right choice is to run Steve's refactor with UMA turned off.  But there are many people who argue that the UMA performance improvement is worthwhile (I happen to be one of them.)

So for those who want to use UMA pending examination and refactoring of the way ZFS does write buffering the patch under consideration stabilizes the system when UMA is on and yet provides most (but not all) of the potential CPU saving that UMA brings to the table.  I have another turn of this code under test that allows tuning of the dirty data maximum and honoring it (when the admin wishes to) along with a sysctl to disable the dynamic sizing entirely if you find it limits write I/O performance inappropriately.  I have not been able to get the patched code to misbehave in that fashion nor has anyone reported it, but it certainly appears possible, particularly in the case of very-high-performance I/O subsystems, and as such the ability to turn that aspect of it off is something I have taken some time to add and am now testing.
Comment 153 noons 2014-09-26 20:22:44 UTC
After speaking to Karl (thanks again for your time by the way), I ended up upgrading freebsd10-release to 10-stable, disabled UMA and applied Steve's refactor patch. I am in the process of trying to get the replicated partner to exhibit the same zpool deadlocks as the primary so unfortunately I won't know if this patch will have truly worked for a few days. The reason I am so curious is out of our 6 pairs this one server deadlocked almost every night, while two others only have locked up once. They are all currently in the process of migrating a large amount of data to them over multiple rsync processes while send/receiving to spares over 10Gb.

One other interesting observation I have made is post stable/patch my iozone 'WRITE' results have doubled in speed. I verified UMA was in fact disabled so that shouldn't be skewing any of my results. I tested and retested thinking it was just a one off result, but my results were all fairly consistent. 

Thank you to everyone contributing to the resolution of this.
Comment 154 noons 2014-09-27 19:44:24 UTC
This patch has seemingly made things worse.  Had one of my other systems zpool deadlock, so I updated to 10-stable, disabled UMA, and applied Steve's patch.  The system was only up for 1:23 before the zpool deadlocked.  Oddly enough this didn't happen when I would have expected it to... still plenty of free memory and the system hadn't even started paging.  Can't kill the rsyncs; zpool status, zfs, etc. just lock the shell up.  Only option is a hard reboot...


last pid:  1864;  load averages:  0.07,  0.04,  0.05    up 0+01:22:41  15:39:37
87 processes:  1 running, 86 sleeping
CPU:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
Mem: 194M Active, 170M Inact, 34G Wired, 91G Free
ARC: 27G Total, 5750M MFU, 3091M MRU, 3991M Anon, 161M Header, 14G Other
Swap: 32G Total, 32G Free

  PID USERNAME    THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
 1445 root          1  20    0 25068K 11796K dp->dp  7   0:21   0.00% rsync
 1444 root          1  20    0 16876K  5716K dp->dp  6   0:08   0.00% rsync
 1446 root          1  20    0 25068K 14004K dp->dp  5   0:05   0.00% rsync
Comment 155 Steven Hartland freebsd_committer freebsd_triage 2014-09-27 20:05:27 UTC
Given the free memory and process states, I'd say that's highly likely to be unrelated to the patch as you'll simply not be triggering any of the altered code paths.
Comment 156 Steven Hartland freebsd_committer freebsd_triage 2014-09-27 20:06:39 UTC
To debug that hang, run procstat -kk and obtain a kernel dump if you can.
Comment 157 Steven Hartland freebsd_committer freebsd_triage 2014-09-27 20:10:27 UTC
Created attachment 147733 [details]
ARC reclaim refactor + uma clear down (against stable/10)

This version is very much WIP; it builds on "ARC reclaim refactor (against stable/10)", adding automatic UMA free space clear down.

The following new sysctls are available:
* vfs.zfs.arc_cache_reclaim_period
* vfs.zfs.arc_cache_reclaim_partial
* vfs.zfs.arc_cache_free_max
* vfs.zfs.arc_cache_free_period
* vfs.zfs.arc_cache_target

The ARC reclaim thread now monitors and triggers a free-size-based cleanup of its UMA zones.

This cleanup has multiple strategies:

1. ARC_CACHE_RECLAIM_SIZE
Every arc_cache_free_period we ensure that all zones have no more than arc_cache_free_max of free space.  This prevents zones which spike from sitting with large amounts of RAM allocated but unused (free).
2. ARC_CACHE_RECLAIM_NOW
If free memory is lower than arc_cache_target a full cleanup of all zones is triggered.
If zfs_arc_cache_partial is non-zero then this strategy will return early (not process all zones) if free memory is more than arc_cache_target.
3. ARC_CACHE_RECLAIM_FORCE
The same as ARC_CACHE_RECLAIM_NOW but will ignore when the last reclaim ran.

ARC cache reclaim for ARC_CACHE_RECLAIM_NOW and ARC_CACHE_RECLAIM_SIZE strategies will only run at most once every arc_cache_reclaim_period.

Before allocating new buffers, arc_get_data_buf checks whether allocating this amount of memory would cross the cache reclaim threshold and, if so, it runs a cache reclaim with ARC_CACHE_RECLAIM_NOW.
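
Sketched, that pre-allocation check looks something like this (free_memory_bytes(), arc_cache_target_bytes() and arc_cache_reclaim() are stand-ins for the WIP patch's internals):

    /*
     * Before arc_get_data_buf() takes 'size' bytes: if the allocation
     * would push free memory across the reclaim threshold, clear down
     * the UMA zones first.
     */
    static void
    arc_cache_precheck_sketch(uint64_t size)
    {
            if (free_memory_bytes() < arc_cache_target_bytes() + size)
                    arc_cache_reclaim(ARC_CACHE_RECLAIM_NOW);
    }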
Comment 158 Steven Hartland freebsd_committer freebsd_triage 2014-09-27 23:50:14 UTC
Created attachment 147738 [details]
ARC reclaim refactor + uma clear down (against stable/10)

Fixes for ARC_CACHE_RECLAIM_FORCE strategy.
Comment 159 noons 2014-09-28 00:18:17 UTC
This was definitely a red-herring as the behavior just didn't match up to what I was seeing before (with the exception of the deadlock). I will keep an eye on it. 

(In reply to noons from comment #154)
> This patch has seemingly made things worse. Had one of my other systems
> zpool deadlock so I updated to 10-stable, disabled UMA, applied Steve's
> patch. The system was only up for an 1:23 before the zpool deadlocked. Oddly
> enough this didn't happen when I would have expected it to.. Still plenty of
> free memory and the system hadn't even started paging. Cant kill the rsync's
> zpool status, zfs, etc just lock the shell up. Only option is a hard
> reboot...
> 
> 
> last pid:  1864;  load averages:  0.07,  0.04,  0.05    up 0+01:22:41 
> 15:39:37
> 87 processes:  1 running, 86 sleeping
> CPU:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
> Mem: 194M Active, 170M Inact, 34G Wired, 91G Free
> ARC: 27G Total, 5750M MFU, 3091M MRU, 3991M Anon, 161M Header, 14G Other
> Swap: 32G Total, 32G Free
> 
>   PID USERNAME    THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
>  1445 root          1  20    0 25068K 11796K dp->dp  7   0:21   0.00% rsync
>  1444 root          1  20    0 16876K  5716K dp->dp  6   0:08   0.00% rsync
>  1446 root          1  20    0 25068K 14004K dp->dp  5   0:05   0.00% rsync
Comment 160 noons 2014-09-28 15:18:02 UTC
This can't be right.  Running rsyncs, the ARC was at 90G and free memory was around 8-10G; I stepped away for 20 min to get coffee and came back to this:

Mem: 58M Active, 396M Inact, 123G Wired, 1012K Cache, 1310M Free
ARC: 15G Total, 850M MFU, 6930M MRU, 19M Anon, 3619M Header, 4823M Other
Swap: 32G Total, 32G Free

Where did all my memory go along with my ARC?
Comment 161 Steven Hartland freebsd_committer freebsd_triage 2014-09-28 15:26:05 UTC
(In reply to noons from comment #160)
> This can't be right. Running rsync's the Arc was at 90G free memory was
> around 8-10G stepped away for 20 min to get coffee and came back to this..
> 
> Mem: 58M Active, 396M Inact, 123G Wired, 1012K Cache, 1310M Free
> ARC: 15G Total, 850M MFU, 6930M MRU, 19M Anon, 3619M Header, 4823M Other
> Swap: 32G Total, 32G Free
> 
> Where did all my memory go along with my ARC?

No, that doesn't look right; I suspect you're seeing it all in UMA free.

What does vmstat -z show?
Which patch are you running?
Comment 162 noons 2014-09-28 15:31:41 UTC
(In reply to Steven Hartland from comment #161)
> (In reply to noons from comment #160)
> > This can't be right. Running rsync's the Arc was at 90G free memory was
> > around 8-10G stepped away for 20 min to get coffee and came back to this..
> > 
> > Mem: 58M Active, 396M Inact, 123G Wired, 1012K Cache, 1310M Free
> > ARC: 15G Total, 850M MFU, 6930M MRU, 19M Anon, 3619M Header, 4823M Other
> > Swap: 32G Total, 32G Free
> > 
> > Where did all my memory go along with my ARC?
> 
> No that doesn't look right, I suspect your seeing it all in uma free.
> 
> What does vmstat -z show?
> Which patch are you running?

10-stable, Arc Reclaim Refactor Fix and UMA disabled.

http://pastebin.com/PB1ZyVP6


Oddly enough something 'burst'; my free memory just went back to 'normal' and my ARC is beginning to grow again...

Mem: 71M Active, 400M Inact, 43G Wired, 976K Cache, 81G Free
ARC: 26G Total, 1890M MFU, 17G MRU, 54M Anon, 3703M Header, 3787M Other
Swap: 32G Total, 32G Free
Comment 163 Steven Hartland freebsd_committer freebsd_triage 2014-09-28 15:42:04 UTC
(In reply to noons from comment #162)
> 10-stable, Arc Reclaim Refactor Fix and UMA disabled.
> 
> http://pastebin.com/PB1ZyVP6
> 
> 
> Oddly enough something 'burst' my free memory just went back to 'normal' and
> my ARC is begging to grow again..
> 
> Mem: 71M Active, 400M Inact, 43G Wired, 976K Cache, 81G Free
> ARC: 26G Total, 1890M MFU, 17G MRU, 54M Anon, 3703M Header, 3787M Other
> Swap: 32G Total, 32G Free

If it's already cleared up it's hard to tell, but even with those stats you have significant amounts of memory tied up in UMA free: 6.64GB of it due to ZFS and 11.29GB from other sources.

oused: 16.962GB, ofree: 11.291GB, ototal: 28.254GB
zused: 6.224GB, zfree: 6.649GB, ztotal: 12.872GB
used: 23.186GB, free: 17.940GB, total: 41.126GB
Comment 164 Steven Hartland freebsd_committer freebsd_triage 2014-09-28 15:44:48 UTC
Created attachment 147754 [details]
UMA summary script

This script does some summary of UMA usage, where zused = ZFS usage, oused = other usage.
Comment 165 noons 2014-09-28 16:19:34 UTC
(In reply to Steven Hartland from comment #163)
> (In reply to noons from comment #162)
> > 10-stable, Arc Reclaim Refactor Fix and UMA disabled.
> > 
> > http://pastebin.com/PB1ZyVP6
> > 
> > 
> > Oddly enough something 'burst' my free memory just went back to 'normal' and
> > my ARC is begging to grow again..
> > 
> > Mem: 71M Active, 400M Inact, 43G Wired, 976K Cache, 81G Free
> > ARC: 26G Total, 1890M MFU, 17G MRU, 54M Anon, 3703M Header, 3787M Other
> > Swap: 32G Total, 32G Free
> 
> If its already cleared up hard to tell but even with those stats you have
> significant amounts of memory tied up in uma free. 6.64GB of which was due
> to ZFS and 11.29GB from other sources.
> 
> oused: 16.962GB, ofree: 11.291GB, ototal: 28.254GB
> zused: 6.224GB, zfree: 6.649GB, ztotal: 12.872GB
> used: 23.186GB, free: 17.940GB, total: 41.126GB


Ok, this was fairly easy to reproduce: I ended up starting new rsync jobs to an empty folder and the behavior happened almost immediately.  I watched the ARC drop to the min along with my free memory...

http://pastebin.com/yw6T30GK


I also attempted to run your UMA script, but doesn't appear to be returning anything.
Comment 166 Steven Hartland freebsd_committer freebsd_triage 2014-09-28 17:01:27 UTC
(In reply to noons from comment #165)
> 
> I also attempted to run your UMA script, but doesn't appear to be returning
> anything.

It takes the input from vmstat -z so you run:
vmstat -z | vmstat.awk
Comment 167 noons 2014-09-28 17:34:50 UTC
(In reply to Steven Hartland from comment #166)
> (In reply to noons from comment #165)
> > 
> > I also attempted to run your UMA script, but doesn't appear to be returning
> > anything.
> 
> It takes the input from vmstat -z so you run:
> vmstat -z | vmstat.awk

Ahh thanks! See my results below... I've been testing this behavior and it seems that the ARC begins evicting as soon as it grows above 100GB, then it rapidly evicts until it hits the minimum.  Another interesting observation is that as soon as I initiate a large amount of reads (rsync to another box for example) the free memory bursts instantly to around 90G; if I don't initiate any reads then the ARC remains at 15GB and free memory hovers around 1G.  I am kind of wondering if I should just set my arc_max to 100G as my other 10-release unpatched boxes never go higher than 105G anyway.  It still doesn't explain the behavior...

http://pastebin.com/vjSCVCJe
Comment 168 Steven Hartland freebsd_committer freebsd_triage 2014-09-28 18:15:55 UTC
(In reply to noons from comment #167)
> Ahh thanks! See my results below... I've been testing this behavior and it
> seems that the ARC begins evicting as soon as it grows above 100GB, then it
> rapidly evicts until it hits the minimum. Another interesting observation is
> as soon as I initiate a large amount of reads (rsync to another box for
> example) the free memory bursts instantly to around 90G, if I don't initiate
> any reads then the ARC remains at 15GB and free memory hovers around 1G. I
> am kind of wondering if I should just set my arc_max to 100G as my other
> 10-release unpatched boxes never go higher then 105G anyway. It still
> doesn't explain the behavior...
> 
> http://pastebin.com/vjSCVCJe

Wow, you have 84GB in the free 65536 zone.  I had a quick look but can't see what that zone is used by; I'm guessing it could be the malloc provider or something else generic.

The general pattern gives me the impression that UMA doesn't deal very well with spiky loads.  How long does it take to recover from this?

If you turn on ZFS UMA support, what do you see?

After that, how does the "ARC reclaim refactor + uma clear down (against stable/10)" patch, instead of the baseline patch, affect that behaviour?
Comment 169 Steven Hartland freebsd_committer freebsd_triage 2014-09-28 18:24:50 UTC
noons, do you have any custom ZFS sysctls active at all?
Comment 170 noons 2014-09-28 21:17:20 UTC
(In reply to Steven Hartland from comment #169)
> noons do you have any custom ZFS sysclt's active at all?

Only what is listed below was added to my loader.conf, with no values manually set either.  The memory usage/ARC stays like this until I call some sort of large read (right now I am migrating files to the server so for the most part it is doing just writes).  As soon as I start an rsync copying files from this box rather than to it, free memory jumps almost instantly to normal levels and my ARC begins to rise again.

vfs.root.mountfrom="zfs:system"
vfs.zfs.zio.use_uma="0"


Interestingly enough, with 10-release, once memory began to get into the danger zone of ~1.5G (for whatever reason) the system would start to page, yet the ARC would remain fairly stable (maybe drop 10G then rise back up again).  All that is going on is an rsync to the local box and 10-minute zfs sends, so I am not sure where this memory is going.

I will try enabling UMA and will post my results (takes a few hours for the memory to 'leak')
Comment 171 karl 2014-09-29 13:14:28 UTC
(In reply to noons from comment #167)
> (In reply to Steven Hartland from comment #166)
> > (In reply to noons from comment #165)
> > > 
> > > I also attempted to run your UMA script, but doesn't appear to be returning
> > > anything.
> > 
> > It takes the input from vmstat -z so you run:
> > vmstat -z | vmstat.awk
> 
> Ahh thanks! See my results below... I've been testing this behavior and it
> seems that the ARC begins evicting as soon as it grows above 100GB, then it
> rapidly evicts until it hits the minimum. Another interesting observation is
> as soon as I initiate a large amount of reads (rsync to another box for
> example) the free memory bursts instantly to around 90G, if I don't initiate
> any reads then the ARC remains at 15GB and free memory hovers around 1G. I
> am kind of wondering if I should just set my arc_max to 100G as my other
> 10-release unpatched boxes never go higher then 105G anyway. It still
> doesn't explain the behavior...
> 
> http://pastebin.com/vjSCVCJe

As Steve noted you have an utterly enormous amount of RAM out in UMA that is not being used.  This looks like yet another manifestation of UMA out-of-control.

UMA *cannot* handle "spiky" loads -- by definition it does not (ever) return released memory on its own; something has to come along and tell it to clean up after itself.  If you have a situation where you request 10GB of RAM in a given UMA size and then release it, that 10GB remains allocated but available for another request later.

The problem comes in when that spiky load doesn't come back any time soon, which leaves that RAM allocated indefinitely, or worse it comes back *in a different size request*, which results in multiplying the amount of RAM consumed.  The latter is really, really bad.

That 83GB number looks awful, but there's plenty of other bad news in that printout; you have over 10GB out and unused across three other buckets, and only one of them looks reasonably coherent against current requirements (the 16k block.)

You need to find out where that's coming from..... it does not appear to be ZFS related.
Comment 172 noons 2014-09-29 13:51:23 UTC
(In reply to karl from comment #171)
> (In reply to noons from comment #167)
> > (In reply to Steven Hartland from comment #166)
> > > (In reply to noons from comment #165)
> > > > 
> > > > I also attempted to run your UMA script, but doesn't appear to be returning
> > > > anything.
> > > 
> > > It takes the input from vmstat -z so you run:
> > > vmstat -z | vmstat.awk
> > 
> > Ahh thanks! See my results below... I've been testing this behavior and it
> > seems that the ARC begins evicting as soon as it grows above 100GB, then it
> > rapidly evicts until it hits the minimum. Another interesting observation is
> > as soon as I initiate a large amount of reads (rsync to another box for
> > example) the free memory bursts instantly to around 90G, if I don't initiate
> > any reads then the ARC remains at 15GB and free memory hovers around 1G. I
> > am kind of wondering if I should just set my arc_max to 100G as my other
> > 10-release unpatched boxes never go higher then 105G anyway. It still
> > doesn't explain the behavior...
> > 
> > http://pastebin.com/vjSCVCJe
> 
> As Steve noted you have an utterly enormous amount of RAM out in UMA that is
> not being used.  This looks like yet another manifestation of UMA
> out-of-control.
> 
> UMA *cannot* handle "spiky" loads -- by definition it does not (ever) return
> released memory on its own; something has to come along and tell it clean up
> after itself.  If you have a situation where you request 10Gig of RAM in a
> given UMA size and then release it, that 10 Gig remains allocated but
> available for another request later.
> 
> The problem comes in when that spiky load doesn't come back any time soon,
> which leaves that RAM allocated indefinitely, or worse it comes back *in a
> different size request*, which results in multiplying the amount of RAM
> consumed.  The latter is really, really bad.
> 
> That 83Gb number looks awful but there's plenty of other bad news in that
> printout; you have over 10Gb out and unused between three other buckets, and
> only one of them looks reasonably-coherent against current requirements (the
> 16k block.)
> 
> You need to find out where that's coming from..... it does not appear to be
> ZFS related.

I am not sure what else it could be then.  The only things running are 3 rsync jobs and send/recv (using zrep for snapshot management).  Maybe the number of files I am rsync'ing is causing the pain, maybe send/recv?
Comment 173 karl 2014-09-29 14:10:00 UTC
(In reply to noons from comment #172)

> I am not sure what else it could be then. Only thing running are 3 rsync
> jobs and send/recv (using zrep for snapshot management). Maybe the amount of
> files I am rsync'ing is causing the pain, maybe send/resv?

Send/recv should not be doing this; this ought to be coming from something that the kernel is up to, not user-space applications.

These moves are going over the network (not on the same machine) yes?  How saturated are your I/O channels during this -- are they being slammed (e.g. 'gstat' showing your spindles at 100% load and material backlogs in the I/O queues?)

That's the paradigm that provokes trouble on my machine with UMA on and ZFS, and the one my latest patch tries to get around (Steve is going after the same problem with a different approach.)  I've not been able to provoke what you're seeing, and since you've eliminated ZFS' use of UMA it's coming from somewhere else.  I'm quite suspicious of the networking code if you're doing these copies over the network....
Comment 174 noons 2014-09-29 14:56:55 UTC
(In reply to karl from comment #173)
> (In reply to noons from comment #172)
> 
> > I am not sure what else it could be then. Only thing running are 3 rsync
> > jobs and send/recv (using zrep for snapshot management). Maybe the amount of
> > files I am rsync'ing is causing the pain, maybe send/resv?
> 
> Send/recv should not be doing this; this ought to be coming from something
> that the kernel is up to, not user-space applications.
> 
> These moves are going over the network (not on the same machine) yes?  How
> saturated are your I/O channels during this -- are they being slammed (e.g.
> 'gstat' showing your spindles at 100% load and material backlogs in the I/O
> queues?)
> 
> That's the paradigm that provokes trouble on my machine with UMA on and ZFS,
> and that my latest tries to get around (Steve is going after the same
> problem with a different approach.)  I've not been able to provoke what
> you're seeing and since you've eliminated ZFS' use of UMA it's coming from
> somewhere else.  I'm quite suspicious of the networking code if you're doing
> these copies over the network....

Woke up to two servers completely deadlocked, one patched, one on release.  I thought for sure the behavior I was seeing was related to this bug, as it seemed to only happen once memory got saturated and the system was doing heavy paging, but in one instance on a patched server the system deadlocked before the ARC was even populated (1 hour of uptime).  This patch definitely solved the paging issues, but causes the ARC to dump down to minimum, most likely due to whatever my real issue actually is.  Good news is the patch is probably working as expected.

I am keeping an eye on gstat and the average is around 20%, spiking to around 50-60%, probably when the rsync jobs pick up considerable speed, but I will continue to monitor this.  I have witnessed a few times where for around 1-2 min on 10-release some of the disks hit 100% while the others are at 0.  I am trying to confirm whether this is happening post stable-patch.

The rsync itself is being done via a 1G direct connection with MTU set to 9000, while send/recv is going over a 10G direct connection.  I have 3 rsync jobs running to different subfolders due to the constant pausing of rsync, an issue we have because many of the folders contain millions of small files (8k-32k).

procstats during deadlock:
http://pastebin.com/avixyHDA

Anyway, if you guys agree this is unrelated to this bug I will have to raise a separate issue, as I don't want to pollute this bug thread any more than I already have.
Comment 175 Adrian Chadd freebsd_committer freebsd_triage 2014-09-29 16:52:22 UTC
.. can we look at addressing the UMA hoarding as a general thing, separate from ZFS?

I think avg is on the right track - there are two separate problems here.


-a
Comment 176 Steven Hartland freebsd_committer freebsd_triage 2014-09-29 17:36:10 UTC
(In reply to Adrian Chadd from comment #175)
> .. can we look at addressing the UMA hoarding as a general thing, separate
> from ZFS?
> 
> I think avg is on the right track here - there are two separate problems
> here.

Yes this is something I've been thinking about and I noticed that UMA already does periodic accounting for hash sizing.

I don't believe there's going to be a one-size-fits-all solution, so I've been toying with the idea of configuring a free threshold on each zone, much the same way uma_zone_set_max(..) does for used items.

That said, given how ZFS works and how active it is with its allocations, I don't believe that would be sufficient.  It needs to be able to actively manage its memory use, as in my "+ uma clear down" patch.

Now ideally this functionality should be provided by the uma subsystem, but there's still going to need to be interaction there.
Comment 177 Steven Hartland freebsd_committer freebsd_triage 2014-09-29 17:37:28 UTC
(In reply to noons from comment #174)
> procstats during deadlock:
> http://pastebin.com/avixyHDA
> 
> Anyway if you guys agree this is unrelated to this bug I will have to raise
> a separate issue as I don't want to pollute this bug thread anymore then I
> already have.

Do you have any zvols in use? If so I may have a patch sitting around here which could fix this deadlock.
Comment 178 karl 2014-09-29 19:29:04 UTC
(In reply to Steven Hartland from comment #176)
> (In reply to Adrian Chadd from comment #175)
> > .. can we look at addressing the UMA hoarding as a general thing, separate
> > from ZFS?
> > 
> > I think avg is on the right track here - there are two separate problems
> > here.
> 
> Yes this is something I've been thinking about and I noticed that UMA
> already does periodic accounting for hash sizing.
> 
> I don't believe there's going to be a one size fits all solution so I've
> toying with the idea of configuring a free threshold on each zone much the
> same way uma_zone_set_max(..) does for used.
> 
> That said given how ZFS works and how active it is with its allocations I
> don't believe that would be sufficient. It needs to be able to actively
> manage its memory use as in my + uma clear down patch.
> 
> Now ideally this functionality should be provided by the uma subsystem, but
> there's still going to need to be interaction there.

If the problem with ZFS and UMA is limited to dirty write allocation (and I believe it is), then the fair question to ask is whether the gains from using UMA for that purpose, with the attendant hackery to keep the space spikes under control, outweigh the cost of not using UMA for dirty write buffers.  (That hackery may well work; I believe my approach does in the general case, and your approach may as well, but both carry a material overhead cost.)

That is, does the latency + CPU consumption tip toward using it or not using it?  If toward not then there's a decent argument for refactoring the code to not use UMA for dirty write buffers at all.
Comment 179 Steven Hartland freebsd_committer freebsd_triage 2014-09-29 19:48:54 UTC
(In reply to karl from comment #178)
> If the problem with ZFS and UMA is limited to dirty write allocation (and I
> believe it is) then the fair question to ask is whether the gains from using
> UMA for that purpose, with the attendant hackery to keep the space spikes
> under control (a strategy that may well work; I believe my approach does in
> the general case but at a material overhead cost; your approach may also
> work but at a material overhead cost) outweighs the cost of not using UMA
> for dirty write buffers.

Actually I don't believe that's the case, there are more issues that stem from UMA not clearing up as readily as it could.

> That is, does the latency + CPU consumption tip toward using it or not using
> it?  If toward not then there's a decent argument for refactoring the code
> to not use UMA for dirty write buffers at all.

You can achieve that by disabling UMA support.

If you're suggesting dirty buffers should be separate from the ARC, then that's a totally different matter and could massively balloon the memory requirements, as the whole design is built around the two sharing memory space.
Comment 180 karl 2014-09-29 19:52:20 UTC
(In reply to Steven Hartland from comment #179)
> (In reply to karl from comment #178)
> > If the problem with ZFS and UMA is limited to dirty write allocation (and I
> > believe it is) then the fair question to ask is whether the gains from using
> > UMA for that purpose, with the attendant hackery to keep the space spikes
> > under control (a strategy that may well work; I believe my approach does in
> > the general case but at a material overhead cost; your approach may also
> > work but at a material overhead cost) outweighs the cost of not using UMA
> > for dirty write buffers.
> 
> Actually I don't believe that's the case, there are more issues that stem
> from UMA not clearing up as readily as it could.

In the case of other facilities, yes.  But in the case of ZFS it appears that the issues can only be triggered during writes.  I've yet to be able to reproduce the UMA bloat problems with any sort of read stress test no matter how nasty I get with it.

> > That is, does the latency + CPU consumption tip toward using it or not using
> > it?  If toward not then there's a decent argument for refactoring the code
> > to not use UMA for dirty write buffers at all.
> 
> You can achieve that but disabling uma support.
> 
> If your suggesting dirty buffers should be separate from ARC then that's a
> totally different matter and could massively balloon the memory requirements
> as the whole design is around the two sharing memory space.

If that is the case then it would appear to argue for either a very serious refactor of the UMA code or that ZFS is abusing the intent and design of the UMA code and thus it should not use it at all.
Comment 181 Steven Hartland freebsd_committer freebsd_triage 2014-09-29 20:01:53 UTC
(In reply to karl from comment #180)
> In the case of other facilities, yes.  But in the case of ZFS it appears
> that the issues can only be triggered during writes.  I've yet to be able to
> reproduce the UMA bloat problems with any sort of read stress test no matter
> how nasty I get with it.

You may be seeing that in your env, but it's totally possible for others to be seeing it on reads too.  All you need is different apps using different read sizes and you'll get it on reads as well.

> > > That is, does the latency + CPU consumption tip toward using it or not using
> > > it?  If toward not then there's a decent argument for refactoring the code
> > > to not use UMA for dirty write buffers at all.
> > 
> > You can achieve that but disabling uma support.
> > 
> > If your suggesting dirty buffers should be separate from ARC then that's a
> > totally different matter and could massively balloon the memory requirements
> > as the whole design is around the two sharing memory space.
> 
> If that is the case then it would appear to argue for either a very serious
> refactor of the UMA code or that ZFS is abusing the intent and design of the
> UMA code and thus it should not use it at all.

I'm not sure how serious a refactor is required, but I would agree with Adrian that UMA does appear to need some work to ensure it can deal with spiky workloads that allocate large amounts in zones for short periods, then don't use it again for some time.

That said, in the general case it shouldn't be too much of an issue if the low memory trigger did a UMA clear down as one of its first steps.

The problem with ZFS is that it looks at how much is used / free, where "free" doesn't include UMA free when potentially it should.
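
Sketched (illustrative names only; zfs_uma_free_bytes() and vm_free_bytes() are hypothetical helpers), the accounting point is: when judging pressure, count pages sitting free in ZFS's own UMA zones as available rather than as consumed:

    /*
     * Sketch only: fold ZFS-owned UMA free space into the figure used
     * for pressure decisions, rather than treating it as used memory.
     */
    static uint64_t
    arc_available_memory_sketch(void)
    {
            return (vm_free_bytes() + zfs_uma_free_bytes());
    }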
Comment 182 karl 2014-09-29 20:31:56 UTC
(In reply to Steven Hartland from comment #181)
> (In reply to karl from comment #180)
> > In the case of other facilities, yes.  But in the case of ZFS it appears
> > that the issues can only be triggered during writes.  I've yet to be able to
> > reproduce the UMA bloat problems with any sort of read stress test no matter
> > how nasty I get with it.
> 
> You may be seeing that in your env but its totally possible for others to be
> seeing it on reads too. All you need is different apps using different read
> sizes and you'll get it on read too.

I've tried to generate that condition by opening up large files (many-gigabyte-sized) with random lseeks and read sizes and I can't make the condition manifest itself that way.

> > > > That is, does the latency + CPU consumption tip toward using it or not using
> > > > it?  If toward not then there's a decent argument for refactoring the code
> > > > to not use UMA for dirty write buffers at all.
> > > 
> > > You can achieve that but disabling uma support.
> > > 
> > > If your suggesting dirty buffers should be separate from ARC then that's a
> > > totally different matter and could massively balloon the memory requirements
> > > as the whole design is around the two sharing memory space.
> > 
> > If that is the case then it would appear to argue for either a very serious
> > refactor of the UMA code or that ZFS is abusing the intent and design of the
> > UMA code and thus it should not use it at all.
> 
> I'm not sure how serious a refactor is required, but I would agree with
> Adrian that UMA does appear to need some work to ensure it can deal with
> spiky workloads that allocate large amounts in zones for short periods, then
> don't use it again for some time.
> 
> That said in the general case it shouldn't be too much of an issue if low
> memory trigger did UMA clear down as one of its first steps.

Maybe.  My testing hasn't led to joy via that route, but I haven't completely given up.  Putting that check in the arc maintenance thread (which when not under heavy pressure runs once per second) proved to be insufficiently quick to catch the condition before it got out of hand.  Doing so in the allocator (my latest published patch) works but it is arguably wasteful to clear all of the buckets.

But is it, in practical terms?  If there are (relatively) few available slots then the actual work of freeing them should be small.  That in turn implies that the overhead of doing so when unnecessary isn't very large, nor is the price paid a significant percentage of the whole in terms of the work being done.

> 
> The problem with ZFS is that it looks at how much is used / free when free
> doesn't include UMA free when potentially it should.

I tried to include that and it led to even worse behavior, but that was because it wasn't fine-grained enough.
Comment 183 noons 2014-09-30 02:02:37 UTC
(In reply to karl from comment #173)
> (In reply to noons from comment #172)
> 
> > I am not sure what else it could be then. Only thing running are 3 rsync
> > jobs and send/recv (using zrep for snapshot management). Maybe the amount of
> > files I am rsync'ing is causing the pain, maybe send/resv?
> 
> Send/recv should not be doing this; this ought to be coming from something
> that the kernel is up to, not user-space applications.
> 
> These moves are going over the network (not on the same machine) yes?  How
> saturated are your I/O channels during this -- are they being slammed (e.g.
> 'gstat' showing your spindles at 100% load and material backlogs in the I/O
> queues?)
> 
> That's the paradigm that provokes trouble on my machine with UMA on and ZFS,
> and that my latest tries to get around (Steve is going after the same
> problem with a different approach.)  I've not been able to provoke what
> you're seeing and since you've eliminated ZFS' use of UMA it's coming from
> somewhere else.  I'm quite suspicious of the networking code if you're doing
> these copies over the network....

Grasping at straws, I thought about what you said, Karl, and happened to look at netstat.  Mbufs denied and jumbo clusters denied were both awfully high and growing, so I disabled jumbo frames and rebooted, and so far I am not seeing the behavior I was yesterday.  My ARC is at 100G with memory around 4-6G; as memory drops the ARC drops slightly, down to 96G, and rises back up again.

Did I stumble on a bug with the networking stack? I am going to continue monitoring, but it has been healthy like this for hours where yesterday with jumbo frames I repeatedly hit this issue as soon as the arc filled up. There is no other explanation as no other changes were made and even after a reboot yesterday I was hitting the same issue.
Comment 184 karl 2014-09-30 02:15:35 UTC
(In reply to noons from comment #183)
> (In reply to karl from comment #173)
> > (In reply to noons from comment #172)
> > 
> > > I am not sure what else it could be then. Only thing running are 3 rsync
> > > jobs and send/recv (using zrep for snapshot management). Maybe the amount of
> > > files I am rsync'ing is causing the pain, maybe send/resv?
> > 
> > Send/recv should not be doing this; this ought to be coming from something
> > that the kernel is up to, not user-space applications.
> > 
> > These moves are going over the network (not on the same machine) yes?  How
> > saturated are your I/O channels during this -- are they being slammed (e.g.
> > 'gstat' showing your spindles at 100% load and material backlogs in the I/O
> > queues?)
> > 
> > That's the paradigm that provokes trouble on my machine with UMA on and ZFS,
> > and that my latest tries to get around (Steve is going after the same
> > problem with a different approach.)  I've not been able to provoke what
> > you're seeing and since you've eliminated ZFS' use of UMA it's coming from
> > somewhere else.  I'm quite suspicious of the networking code if you're doing
> > these copies over the network....
> 
> Grasping at straws I thought about what you said Karl and happened to look
> at netstat. Mbufs denied and jumbo clusters denied were all awfully high and
> growing, I disabled jumbo frames rebooted and so far I am not seeing the
> behavior I was yesterday. My ARC is at 100G memory around 4-6G as memory
> drops arc drops slightly down to 96G and raises back up again. 
> 
> Did I stumble on a bug with the networking stack? I am going to continue
> monitoring, but it has been healthy like this for hours where yesterday with
> jumbo frames I repeatedly hit this issue as soon as the arc filled up. There
> is no other explanation as no other changes were made and even after a
> reboot yesterday I was hitting the same issue.

Entirely possible.... see if the problem stays gone; if you're getting mbuf and jumbo clusters denied, yeah, there may well be a problem there, especially with the UMA you had out.  It's not proof but it looks like pretty good correlation -- especially if the problem is now gone.
Comment 185 Steven Hartland freebsd_committer freebsd_triage 2014-09-30 02:20:11 UTC
(In reply to noons from comment #183)
> Grasping at straws I thought about what you said Karl and happened to look
> at netstat. Mbufs denied and jumbo clusters denied were all awfully high and
> growing, I disabled jumbo frames rebooted and so far I am not seeing the
> behavior I was yesterday. My ARC is at 100G memory around 4-6G as memory
> drops arc drops slightly down to 96G and raises back up again. 

mbufs are allocated using a UMA zone, so monitor the differences between your vmstat -z in both cases to spot the difference.

> Did I stumble on a bug with the networking stack? I am going to continue
> monitoring, but it has been healthy like this for hours where yesterday with
> jumbo frames I repeatedly hit this issue as soon as the arc filled up. There
> is no other explanation as no other changes were made and even after a
> reboot yesterday I was hitting the same issue.

This could be a side effect of slowing down the incoming data rate due to smaller packet sizes, so also compare your peak throughputs on your interfaces using systat 1 -if.
Comment 186 Steven Hartland freebsd_committer freebsd_triage 2014-09-30 02:36:57 UTC
Created attachment 147815 [details]
ARC reclaim refactor + uma clear down (against stable/10)

Fix periodic firing too often, add comments, rename var.
Comment 187 Andriy Gapon freebsd_committer freebsd_triage 2014-09-30 06:35:59 UTC
Comment on attachment 147815 [details]
ARC reclaim refactor + uma clear down (against stable/10)

This looks horrible.
Comment 188 Steven Hartland freebsd_committer freebsd_triage 2014-09-30 08:30:15 UTC
(In reply to Andriy Gapon from comment #187)
> Comment on attachment 147815 [details]
> ARC reclaim refactor + uma clear down (against stable/10)
> 
> This looks horrible.

That's why its WIP for testing ideas.
Comment 189 commit-hook freebsd_committer freebsd_triage 2014-10-03 20:35:28 UTC
A commit references this bug:

Author: smh
Date: Fri Oct  3 20:34:56 UTC 2014
New revision: 272483
URL: https://svnweb.freebsd.org/changeset/base/272483

Log:
  Refactor ZFS ARC reclaim checks and limits

  Remove previously added kmem methods in favour of defines which
  allow diff minimisation between upstream code base.

  Rebalance ARC free target to be vm_pageout_wakeup_thresh by default
  which eliminates issue where ARC gets minimised instead of balancing
  with VM pageout. The restores the target point prior to r270759.

  Bring in missing upstream only changes which move unused code to
  further eliminate code differences.

  Add additional DTRACE probe to aid monitoring of ARC behaviour.

  Enable upstream i386 code paths on platforms which don't define
  UMA_MD_SMALL_ALLOC.

  Fix mixture of byte and page values in arc_memory_throttle i386 code
  path value assignment of available_memory.

  PR:		187594
  Review:		D702
  Reviewed by:	avg
  MFC after:	1 week
  X-MFC-With:	r270759 & r270861
  Sponsored by:	Multiplay

Changes:
  head/sys/cddl/compat/opensolaris/kern/opensolaris_kmem.c
  head/sys/cddl/compat/opensolaris/sys/kmem.h
  head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c
  head/sys/vm/vm_pageout.c
Comment 190 Will Andrews freebsd_committer freebsd_triage 2014-10-14 13:26:40 UTC
Have you tried turning off ZIL commits on your filesystems (zfs set sync=disabled) to see whether that plays a role in the write performance cliffs you've seen?

I know this property change would not be a fix in and of itself, but I do have a local change which may help on top of the ARC interaction changes mentioned.  And setting this property on your filesystems is the quickest and easiest way to determine whether that is the case.

Can you also comment on the degree to which your workload involves sync writes?
Comment 191 karl 2014-10-14 13:31:28 UTC
(In reply to Will Andrews from comment #190)
> Have you tried turning off ZIL commits on your filesystems (zfs set
> sync=disabled) to see whether that plays a role in the write performance
> cliffs you've seen?
> 
> I know this property change would not be a fix in and of itself, but I do
> have a local change which may help on top of the ARC interaction changes
> mentioned.  And setting this property on your filesystems is the quickest
> and easiest way to determine whether that is the case.
> 
> Can you also comment on the degree to which your workload involves sync writes?

I cannot check that at this moment as my test machine is under a stress test for a different project right now, but should be able to over the weekend.

For my workload (one that easily provokes the problems), sync write I/O is a very large part of the load on the machine and pools involved, as the most-impacted production systems have a Postgres database running on them with very heavy write transaction activity.
Comment 192 mikej 2014-10-31 15:38:47 UTC
Are there updates to the patch?  It does not apply cleanly to 10.1-PRERELEASE #0 r273834.

r273834 still suffers from using swap when under memory pressure.

If not, what is the latest version of 10-STABLE that should be used?

Thanks.
Comment 193 karl 2014-11-01 13:57:34 UTC
(In reply to mikej from comment #192)
> Are there updates to the patch? It does not apply cleanly to
> 10.1-PRERELEASE #0 r273834.
> 
> r273834 still suffers from using swap when under memory pressure.
> 
> If not, what is the latest version of 10-STABLE that should be used?
> 
> Thanks.

My test machine is otherwise occupied with a project (as am I) at the moment; I can attempt to refactor my last patch against 10.1-PRERELEASE some time early to mid next-week.

My current production kernel with my last patch is built from r272215.
Comment 194 mikej 2014-11-24 20:01:54 UTC
Have any of these patches made it into a later version of 10-stable?  Do you still need to patch against r272215, or is there a revised patch available?

Thanks!
Comment 195 Steven Hartland freebsd_committer freebsd_triage 2014-11-25 02:02:26 UTC
(In reply to mikej from comment #194)
> Have any of these patches made it into a later version of 10-stable? Do you
> still need to patch against r272215, or is there a revised patch available?

Yes, the core refactor is in stable/10 as of r272875; however, there is still more work to do.
Comment 196 rainer 2014-11-25 08:32:00 UTC
What's the status of this patch regarding 10.1?

Is this in 10.1?
Or is there a patch for 10.1?
Comment 197 Steven Hartland freebsd_committer freebsd_triage 2014-11-25 13:25:09 UTC
(In reply to rainer from comment #196)
> What's the status of this patch regarding 10.1?
> 
> Is this in 10.1?
> Or is there a patch for 10.1?

It was unfortunately too late to make it into 10.1, but you can use the patch from stable/10: https://svnweb.freebsd.org/base?view=revision&revision=272875
Comment 198 karl 2015-02-10 19:57:48 UTC
I have refactored this patch for 10.1-STABLE as it exists today and have initiated testing.

Some of the proposed changes are in 10.1 but not all (particularly not dynamic dirty write sizing); the threshold for ARC eviction is also set too low by default IMHO (which brings the pager into play when it should not be involved).

When I have an updated patch that applies cleanly against the -STABLE codebase I'll post it.
Comment 199 karl 2015-02-10 21:44:49 UTC
Created attachment 152852 [details]
ARC Refactor / UMA Cleardown / DMU_TX dynamic against 10.1-STABLE

This has received LIMITED testing here but appears to be fully functional and provides identical behavior to the previous revision.  Heavy I/O and compile loads evict cache as memory pressure rises; you should not see swap activity.

Applies cleanly against r278524.

Dynamic dmu_tx write buffer sizing can be disabled by setting vfs.zfs.dynamic_write_buffer=0, and the free target can be adjusted with vfs.zfs.arc_free_target; both can be changed dynamically while the system is operating.  Write buffer dynamic sizing defaults to on, and arc_free_target defaults to the pager wakeup threshold plus half the difference between the free target and the wakeup threshold.

Attachment 147609 [details] above works with this patch should you desire to monitor its behavior using dtrace.

Comments welcome.
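For reference, the arc_free_target default described above corresponds to a computation along these lines (a sketch in C against stable/10, where the VM counters are reached through the global "cnt"; the surrounding tunable plumbing is omitted):

/*
 * Default: the pager wakeup threshold plus half the gap between the VM
 * free target and that threshold.  All values are page counts.
 */
zfs_arc_free_target = vm_pageout_wakeup_thresh +
    ((cnt.v_free_target - vm_pageout_wakeup_thresh) / 2);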
Comment 200 karl 2015-02-23 22:35:19 UTC
(In reply to karl from comment #199)

The latest version has now been running on a production system (FreeBSD 10.1-STABLE #8 r278579M) for close to two weeks without incident.
Comment 201 Sean Chittenden freebsd_committer freebsd_triage 2015-03-05 20:00:53 UTC
Since you're nearly at a month of running this, how has it performed?  What are the next steps?  We've begun testing and are ramping up load, but haven't seen anything yet.
Comment 202 karl 2015-03-05 20:23:11 UTC
(In reply to Sean Chittenden from comment #201)

It is running extremely well; I have noted no anomalous behavior at all.

I do not believe core's view of this has changed, which means they may not be willing to MFC it, but since I cannot run the base code without serious performance problems in my environment I will be maintaining it, for (at minimum) the builds I run, until it is either MFC'd or becomes unnecessary due to further evolution in the base code.

There is one remaining known issue that can impact interactive response that is in the base code but not really addressed in this patch set: dynamic TXG queue management based not on memory size but on volume set performance.  The issue here is that it is quite possible for a very large amount of dirty data to be queued for write, and clearance of that data to disk may require tens or even *hundreds* of seconds in some cases.  Ironically this is most likely to show up at times of relatively low RAM pressure from other processes but "quick-n-fast" dirtying of data (e.g. "make -j16 buildkernel"), especially with this patch in, as under high RAM pressure the dynamic dmu_tx resizing limits write backlog as part of its memory "breathing" management heuristic.

The problem with fixing that in a patch set is that I have not yet identified how to determine a volume set's write I/O performance (e.g. effective throughput available) on a dynamic basis or even if ZFS' code maintains that data internally at all.  This is something I would like to fix but it's not high priority as it comes up rarely in my environment.
Comment 203 Tomoaki AOKI 2015-03-07 04:22:30 UTC
(In reply to karl from comment #199)

Observed no incidents running with r278535 through r279579, stable/10 amd64.
Notebook with 8GB of RAM.
No swap use observed at all during that period. Thanks!

Note:
 Performed several updates of base and rebuilt www/chromium and editors/libreoffice
 (not at the same moment, though) while running buildkernel to create memory pressure.
 In some cases this was done while also running a buildkernel of head in a VirtualBox VM
 (1GB of memory allowed).
Comment 204 Steven Hartland freebsd_committer freebsd_triage 2015-04-03 15:44:07 UTC
The following is an interesting commit from mav which may well have an impact on this:
https://svnweb.freebsd.org/base?view=revision&revision=281026
Comment 205 karl 2015-04-03 15:50:23 UTC
(In reply to Steven Hartland from comment #204)

Mayyybeee.... although it looks like that is likely an additional potential issue, albeit one that with my code changes in is unlikely to manifest (due to the cleanup of UMA and dynamic dmu_tx resize that takes place under memory pressure with sufficient margin before things get out of control.)
Comment 206 Tom 2015-05-17 16:58:35 UTC
karl, thank you so much for your efforts on this issue. I stumbled across this PR a few days ago, and after having pored through your detailed research and explanations it has shed light on issues I have been having for quite some time.

I'm currently running stable/10 r282909 and I believe the symptoms I'm seeing are related to the problem you describe. Would you agree that at this revision, without your patches, there will be problems with memory reclamation?

I have applied your patch and begun testing the IO patterns that led to my numerous kernel panics. So far so good, though I haven't been testing long.

I really hope the devs take a look at this patch and incorporate it at some point. Maybe they have in head but it hasn't been merged to stable? I don't know. What I do know is that if you use FreeBSD as an iSCSI SAN for VM storage and also use it as an SMB backup target, you're going to have a very, very bad time without this patch.
Comment 207 karl 2015-05-17 18:22:38 UTC
[[I'm currently running stable/10 r282909 and I believe the symptoms I'm seeing are related to the problem you describe. Would you agree that at this revision, without your patches, there will be problems with memory reclamation?]]
I am on 10.1-Stable r278579 with the patch in; I have not updated to later as I've had no real reason to as of this point.

The VM and what it wants to do, the use of UMA, and ZFS' ARC design assumptions all interact in rather interesting ways that "disagree" with each other.  This patch isn't an optimal answer, as I've noted, but it does the job, and it appears to provide durable relief for those people who otherwise manage to trigger the bad behavior with their particular operating environment.

There is an obscure comment in the first quarter status report that MIGHT relate to this patch -- it's not entirely clear.

http://www.freebsd.org/news/status/report-2015-01-2015-03.html
Comment 208 rainer 2015-05-17 18:41:55 UTC
I also wondered about that sentence in the report.
That discussion must have taken place directly on the core mailing list,
because I didn't see it in this PR, nor on -stable or -fs.

I still have to upgrade my 10.0 system that showed this problem; I haven't moved it to 10.1 yet.
Comment 209 Tom 2015-05-17 18:57:16 UTC
It was that comment that sparked my interest and after some googling led me to this PR. I have a feeling it does indeed refer to this discussion.
Comment 210 Sean Chittenden freebsd_committer freebsd_triage 2015-05-17 23:52:46 UTC
I'm not authoritative, but I believe that comment was related to this issue.
Comment 211 Sean Chittenden freebsd_committer freebsd_triage 2015-05-28 16:40:34 UTC
Karl, have you run any tests with -STABLE recently after some of these fixes?  If so, how are they running?
Comment 212 karl 2015-05-28 17:06:07 UTC
(In reply to Sean Chittenden from comment #211)

I have been running this code on -STABLE since the last update on a continual basis in production and have recorded no anomalous behavior.  One of my busier production systems with this code on it currently has 106 days of uptime and has been completely stable.
Comment 213 Sean Chittenden freebsd_committer freebsd_triage 2015-05-28 17:18:53 UTC
Alright, good to know.  Have you run -STABLE without this patch to see if the fixes from mav@ helped or hindered?
Comment 214 karl 2015-05-28 17:23:41 UTC
(In reply to Sean Chittenden from comment #213)

No; I have looked at that patch but from reading it I have no reason to believe that it would impact the specific issues this patch attempts to address; I'm not running into the cases where that specific issue becomes a problem in my particular use profiles.
Comment 215 Tom 2015-06-09 10:37:47 UTC
To anyone else running with this patch, have you noticed a dramatic drop in ARC hit ratios? Before, as measured with zfs-stats -E, my system seemed to hover around 95% for Cache Hit Ratio and 90% for Actual Hit Ratio. These now seem quite a bit lower: 89% and 70% respectively. 

Running dtrace for a bit on arc_evict I notice that MFU entries get evicted just as often as MRU. This seems a bit strange, though I'm no expert. I would think MFU entries are valuable and would be held as long as possible, trimming the tail of the MRU instead.
Comment 216 karl 2015-06-09 12:24:51 UTC
Cache efficiency is a function of both size and workload; I have an extremely write-heavy workload on one of my production machines here (database service with a very high percentage of updates relative to total I/Os) and run around 75% total.

On that system the MRU cache list is much larger (85%) than the MFU size (15%) but the MFU list gets most of the hits (77%).

ARC Summary: (HEALTHY)
        Memory Throttle Count:                  0

ARC Misc:
        Deleted:                                7.79m
        Recycle Misses:                         2.78m
        Mutex Misses:                           1.35k
        Evict Skips:                            38.83m

ARC Size:                               67.63%  15.10   GiB
        Target Size: (Adaptive)         67.77%  15.13   GiB
        Min Size (Hard Limit):          12.50%  2.79    GiB
        Max Size (High Water):          8:1     22.33   GiB

ARC Size Breakdown:
        Recently Used Cache Size:       84.63%  12.81   GiB
        Frequently Used Cache Size:     15.37%  2.33    GiB

ARC Hash Breakdown:
        Elements Max:                           1.41m
        Elements Current:               52.10%  736.74k
        Collisions:                             2.94m
        Chain Max:                              7
        Chains:                                 82.23k


------------------------------------------------------------------------

ARC Efficiency:                                 111.93m
        Cache Hit Ratio:                80.42%  90.02m
        Cache Miss Ratio:               19.58%  21.91m
        Actual Hit Ratio:               74.24%  83.10m

        Data Demand Efficiency:         83.86%  76.83m
        Data Prefetch Efficiency:       8.41%   2.78m

        CACHE HITS BY CACHE LIST:
          Anonymously Used:             6.48%   5.83m
          Most Recently Used:           14.93%  13.44m
          Most Frequently Used:         77.38%  69.66m
          Most Recently Used Ghost:     0.15%   132.44k
          Most Frequently Used Ghost:   1.06%   949.70k

        CACHE HITS BY DATA TYPE:
          Demand Data:                  71.58%  64.43m
          Prefetch Data:                0.26%   233.84k
          Demand Metadata:              20.49%  18.45m
          Prefetch Metadata:            7.67%   6.90m

        CACHE MISSES BY DATA TYPE:
          Demand Data:                  56.58%  12.40m
          Prefetch Data:                11.62%  2.55m
          Demand Metadata:              16.47%  3.61m
          Prefetch Metadata:            15.33%  3.36m
Comment 217 karl 2015-07-02 19:41:33 UTC
The latest refactor (2015-02-10) appears to apply cleanly against 10.2-PRE with only minor fuzz (a 10-line offset in one hunk); I am testing it now for functionality.
Comment 218 karl 2015-07-15 17:23:24 UTC
Created attachment 158809 [details]
ARC Refactor / UMA Cleardown / DMU_TX dynamic against 10.2-BETA1

No difference in expected behavior; makes a conforming change to the storage class for the i386 dynamic linker and fixes up offsets.  Otherwise unchanged relative to the previous iteration.
Comment 219 karl 2015-07-15 17:24:37 UTC
Patch that applies cleanly against 10.2-BETA1 added; this has seen limited but successful operation on my test machine, but not heavy production use as of yet as I have not yet rolled forward my production machines.
Comment 220 Tomoaki AOKI 2015-07-25 05:14:09 UTC
Created attachment 159207 [details]
ARC Refactor / UMA Cleardown / DMU_TX dynamic against 10.2-r285717 and later

Karl's latest patch no longer applies as of r285717, stable/10.
By modifying one line (line 67), the patch applies cleanly, builds and runs OK.
Attached is the modified patch.


By the way, although the modified patch applies with some fuzz on head,
it doesn't build. Below are the error messages (-j1).
I haven't been able to determine which commit removed the declaration of cnt.


cc  -O2 -pipe -march=nehalem  -DFREEBSD_NAMECACHE -DBUILDING_ZFS -fno-strict-aliasing -Werror -D_KERNEL -DKLD_MODULE -nostdinc  -I/usr/src/sys/cddl/compat/opensolaris -I/usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs -I/usr/src/sys/cddl/contrib/opensolaris/uts/common/zmod -I/usr/src/sys/cddl/contrib/opensolaris/uts/common -I/usr/src/sys -I/usr/src/sys/cddl/contrib/opensolaris/common/zfs -I/usr/src/sys/cddl/contrib/opensolaris/common -DHAVE_KERNEL_OPTION_HEADERS -include /usr/obj/usr/src/sys/TEST12/opt_global.h -I. -I/usr/src/sys -fno-common -g -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer -I/usr/obj/usr/src/sys/TEST12  -mcmodel=kernel -mno-red-zone -mno-mmx -mno-sse -msoft-float  -fno-asynchronous-unwind-tables -ffreestanding -fwrapv -fstack-protector -gdwarf-2 -Wall -Wredundant-decls -Wnested-externs -Wstrict-prototypes  -Wmissing-prototypes -Wpointer-arith -Winline -Wcast-qual  -Wundef -Wno-pointer-sign -D__printf__=__freebsd_kprintf__  -Wmissing-include-dirs -fdiagnostics-show-option  -Wno-unknown-pragmas  -Wno-error-tautological-compare -Wno-error-empty-body  -Wno-error-parentheses-equality -Wno-error-unused-function  -Wno-error-pointer-sign -Wno-missing-prototypes -Wno-undef -Wno-strict-prototypes -Wno-cast-qual -Wno-parentheses -Wno-redundant-decls -Wno-missing-braces -Wno-uninitialized -Wno-unused -Wno-inline -Wno-switch -Wno-pointer-arith  -mno-aes -mno-avx  -std=iso9899:1999 -include /usr/src/sys/cddl/compat/opensolaris/sys/debug_compat.h -c /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c -o arc.o
/usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c:226:53: error: use of undeclared identifier 'cnt'; did you mean 'int'?
        zfs_arc_free_target = vm_pageout_wakeup_thresh + ((cnt.v_free_target - vm_pageout_wakeup_thresh) / 2);
                                                           ^~~
                                                           int
/usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c:226:53: error: expected expression
/usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c:2995:29: error: use of undeclared identifier 'cnt'
                        if (zio_use_uma && (ptob(cnt.v_free_count) + size < ptob(cnt.v_free_target))) {
                                                 ^
/usr/src/sys/cddl/compat/opensolaris/sys/param.h:38:30: note: expanded from macro 'ptob'
#define ptob(x)         ((uint64_t)(x) << PAGE_SHIFT)
                                    ^
/usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c:2995:61: error: use of undeclared identifier 'cnt'
                        if (zio_use_uma && (ptob(cnt.v_free_count) + size < ptob(cnt.v_free_target))) {
                                                                                 ^
/usr/src/sys/cddl/compat/opensolaris/sys/param.h:38:30: note: expanded from macro 'ptob'
#define ptob(x)         ((uint64_t)(x) << PAGE_SHIFT)
                                    ^
/usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c:2996:48: error: use of undeclared identifier 'cnt'; did you mean 'int'?
                                DTRACE_PROBE3(arc__alloc_lowmem_reap, int, cnt.v_free_count, int, size, int, cnt.v_free_target);
                                                                           ^~~
                                                                           int
/usr/src/sys/sys/sdt.h:345:32: note: expanded from macro 'DTRACE_PROBE3'
        DTRACE_PROBE_IMPL_START(name, arg0, arg1, arg2, 0, 0)           \
                                      ^
/usr/src/sys/sys/sdt.h:326:27: note: expanded from macro 'DTRACE_PROBE_IMPL_START'
        SDT_PROBE(sdt, , , name, arg0, arg1, arg2, arg3, arg4);
                                 ^
/usr/src/sys/sys/sdt.h:166:19: note: expanded from macro 'SDT_PROBE'
                    (uintptr_t) arg0, (uintptr_t) arg1, (uintptr_t) arg2,       \
                                ^
/usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c:2996:48: error: expected expression
/usr/src/sys/sys/sdt.h:345:32: note: expanded from macro 'DTRACE_PROBE3'
        DTRACE_PROBE_IMPL_START(name, arg0, arg1, arg2, 0, 0)           \
                                      ^
/usr/src/sys/sys/sdt.h:326:27: note: expanded from macro 'DTRACE_PROBE_IMPL_START'
        SDT_PROBE(sdt, , , name, arg0, arg1, arg2, arg3, arg4);
                                 ^
/usr/src/sys/sys/sdt.h:166:19: note: expanded from macro 'SDT_PROBE'
                    (uintptr_t) arg0, (uintptr_t) arg1, (uintptr_t) arg2,       \
                                ^
/usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c:2996:82: error: use of undeclared identifier 'cnt'; did you mean 'int'?
                                DTRACE_PROBE3(arc__alloc_lowmem_reap, int, cnt.v_free_count, int, size, int, cnt.v_free_target);
                                                                                                             ^~~
                                                                                                             int
/usr/src/sys/sys/sdt.h:345:44: note: expanded from macro 'DTRACE_PROBE3'
        DTRACE_PROBE_IMPL_START(name, arg0, arg1, arg2, 0, 0)           \
                                                  ^
/usr/src/sys/sys/sdt.h:326:39: note: expanded from macro 'DTRACE_PROBE_IMPL_START'
        SDT_PROBE(sdt, , , name, arg0, arg1, arg2, arg3, arg4);
                                             ^
/usr/src/sys/sys/sdt.h:166:55: note: expanded from macro 'SDT_PROBE'
                    (uintptr_t) arg0, (uintptr_t) arg1, (uintptr_t) arg2,       \
                                                                    ^
/usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c:2996:82: error: expected expression
/usr/src/sys/sys/sdt.h:345:44: note: expanded from macro 'DTRACE_PROBE3'
        DTRACE_PROBE_IMPL_START(name, arg0, arg1, arg2, 0, 0)           \
                                                  ^
/usr/src/sys/sys/sdt.h:326:39: note: expanded from macro 'DTRACE_PROBE_IMPL_START'
        SDT_PROBE(sdt, , , name, arg0, arg1, arg2, arg3, arg4);
                                             ^
/usr/src/sys/sys/sdt.h:166:55: note: expanded from macro 'SDT_PROBE'
                    (uintptr_t) arg0, (uintptr_t) arg1, (uintptr_t) arg2,       \
                                                                    ^
8 errors generated.
*** [arc.o] Error code 1

make[4]: stopped in /usr/src/sys/modules/zfs
1 error

make[4]: stopped in /usr/src/sys/modules/zfs
*** [all_subdir_zfs] Error code 2

make[3]: stopped in /usr/src/sys/modules
1 error

make[3]: stopped in /usr/src/sys/modules
*** [modules-all] Error code 2

make[2]: stopped in /usr/obj/usr/src/sys/TEST12
1 error

make[2]: stopped in /usr/obj/usr/src/sys/TEST12
*** [buildkernel] Error code 2

make[1]: stopped in /usr/src
1 error

make[1]: stopped in /usr/src
*** [buildkernel] Error code 2

make: stopped in /usr/src
1 error

make: stopped in /usr/src
Comment 221 karl 2015-07-26 12:06:10 UTC
I'm looking at this now against 10.2-RC.... will update it sometime today.

I have not attempted to apply it against -HEAD.
Comment 222 Tomoaki AOKI 2015-07-29 15:21:07 UTC
I've found the cause of the build failure on -head.

The global cnt was renamed to vm_cnt by r263620 [1][2][3] on Sat Mar 22 10:26:09 2014 UTC
(sys/sys/vmmeter.h and affected files).

After replacing all cnt.* with vm_cnt.* in my modified patch:

  1) It applies with some fuzz.

  2) It builds and boots fine in a UFS-only VM (VirtualBox).

  3) It has not yet been tested in a ZFS environment.

Confirmed at r286002.


Because of 3) above, I've not uploaded the modified patch yet (will try next weekend, if possible),
but it should be fine because the change in r263620 is really just a rename of the global variable cnt.


[1]https://svnweb.freebsd.org/base?view=revision&revision=263620

[2]https://svnweb.freebsd.org/base/head/sys/sys/vmmeter.h?r1=263620&r2=263619&pathrev=263620

[3]https://svnweb.freebsd.org/base/head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c?r1=263620&r2=263619&pathrev=263620
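To illustrate, the patch's free-target assignment (the same line quoted in the build errors above) then takes the following form on head; this is a sketch of the renamed form only, and the stable/10 patch keeps the old "cnt" name:

/* head after r263620: the VM counters live in the renamed global vm_cnt. */
zfs_arc_free_target = vm_pageout_wakeup_thresh +
    ((vm_cnt.v_free_target - vm_pageout_wakeup_thresh) / 2);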
Comment 223 Adrian Gschwend 2015-08-06 14:09:11 UTC
Any chance of an update of the patch for the latest 10.1-STABLE?

Checkout I used:

/usr/src # svn info
Path: .
Working Copy Root Path: /usr/src
URL: http://svn0.us-east.freebsd.org/base/stable/10
Relative URL: ^/stable/10
Repository Root: http://svn0.us-east.freebsd.org/base
Repository UUID: ccf9f872-aa2e-dd11-9fc8-001c23d0bc1f
Revision: 286362
Node Kind: directory
Schedule: normal
Last Changed Author: kib
Last Changed Rev: 286362
Last Changed Date: 2015-08-06 10:51:15 +0200 (Thu, 06 Aug 2015)


# PATCH

/usr/src/sys # patch -i ../zfs-patch-10_1_stable.txt
Hmm...  Looks like a unified diff to me...
The text leading up to this was:
--------------------------
|Index: cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c
|===================================================================
|--- cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c   (revision 278524)
|+++ cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c   (working copy)
--------------------------
Patching file cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c using Plan A...
Hunk #1 succeeded at 190.
Hunk #2 succeeded at 221.
Hunk #3 succeeded at 242.
Hunk #4 failed at 2646.
Hunk #5 succeeded at 2709 (offset 10 lines).
Hunk #6 succeeded at 2728 (offset 10 lines).
Hunk #7 succeeded at 2938 (offset 10 lines).
1 out of 7 hunks failed--saving rejects to cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c.rej


the error:

 /usr/src/sys # cat cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c.rej
@@ -2635,6 +2646,28 @@
 extern kmem_cache_t    *zio_data_buf_cache[];
 extern kmem_cache_t    *range_seg_cache;

+static void __used
+reap_arc_caches()
+{
+       size_t          i;
+       kmem_cache_t            *prev_cache = NULL;
+       kmem_cache_t            *prev_data_cache = NULL;
+
+       for (i = 0; i < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT; i++) {
+               if (zio_buf_cache[i] != prev_cache) {
+                       prev_cache = zio_buf_cache[i];
+                       kmem_cache_reap_now(zio_buf_cache[i]);
+               }
+               if (zio_data_buf_cache[i] != prev_data_cache) {
+                       prev_data_cache = zio_data_buf_cache[i];
+                       kmem_cache_reap_now(zio_data_buf_cache[i]);
+               }
+       }
+       kmem_cache_reap_now(buf_cache);
+       kmem_cache_reap_now(hdr_cache);
+       kmem_cache_reap_now(range_seg_cache);
+}
+
 static void __noinline
 arc_kmem_reap_now(arc_reclaim_strategy_t strat)
 {

the rest was fine according to patch.

regards

Adrian
Comment 224 Adrian Gschwend 2015-08-06 14:32:29 UTC
Karl told me to try the latest patch, and indeed the one from Tomoaki, "ARC Refactor / UMA Cleardown / DMU_TX dynamic against 10.2-r285717 and later", seems to apply!

Will see if the weird problems I started having on a new system will disappear with it.
Comment 225 Tomoaki AOKI 2015-08-09 11:56:42 UTC
Created attachment 159688 [details]
ARC Refactor / UMA Cleardown / DMU_TX dynamic against head after r263620

Patch for head after r263620 (following the renaming of the global cnt).
Tested with r286002 in a Root-on-ZFS VirtualBox VM (allowed 2 cores and 2GB of RAM).
Tried buildworld/buildkernel with -j4; the amount of swap used looks reasonable (300MB+).

 *In another VM allowed 2 cores and 1GB of RAM (UFS2 installation, same revision,
  same build options), the 2GB swap partition was exhausted and a 512MB swap file was required.

 *The swap partition of the Root-on-ZFS VM is 2GB, the same as the UFS2 one,
  so the patched Root-on-ZFS appears to free cache (ARC) more aggressively than UFS2.

 *The UFS2 installation shows severe thrashing behavior.

I tried to create a Root-on-ZFS memstick to test on real hardware,
but without success; maybe a memstick problem (a reused stick).
So I decided to upload the patch for head so that someone can test it.
The only difference from the former one is the rename of cnt to vm_cnt.
stable/10 users must use the former one.

It's a simple port, but I hope this catches more FreeBSD developers' eyes.
Comment 226 Tomoaki AOKI 2015-08-09 12:02:28 UTC
(In reply to Adrian Gschwend from comment #224)

FYI, stable/10 r286399 with the patch, amd64, is running fine for me.
Comment 227 Tomoaki AOKI 2015-08-09 12:14:03 UTC
(In reply to Tomoaki AOKI from comment #225)

I guess 2GB of RAM is too small for a ZFS environment, but giving the VM more RAM
is not feasible on an 8GB notebook. :-(
So I had to compare the amount of swap used against the UFS2 one.

 *The highest memory pressure happens when building clang, in the toolchain phase
  and the real build phase.
Comment 228 Tomoaki AOKI 2015-08-14 15:36:37 UTC
Created attachment 159859 [details]
ARC Refactor / UMA Cleardown / DMU_TX dynamic against head after r286570

Almost immediately after I uploaded the former patch for -head, massive updates were made to -head.

Unfortunately, the former patch (patch.head.r263620) is not applicable after r286570 [1].

I created a revised patch (now applicable in /usr/src), but the current state is...

 *Confirmed applicable to head r286718, where the massive updates seem to have ended.
  [With some fuzz, not fixed]

 *Builds, installs and boots fine in the environment below.
    *VirtualBox VM allowed 2GB of RAM and 2 cores of CPU.
    *ZFS-root by bsdinstall, single drive, 2GB of swap partition
     outside the ZFS pool.
    */tmp in the ZFS pool (default), not a swap-backed /tmp.

 *After installworld and reboot, tested with a couple of buildkernel runs;
  it runs as usual, but with about 1MB of swap used at ctfmerge after linking.

 *Not tested with L2ARC, which will need testing according to the r286570
  commit message.

Yes, 2GB of memory is too small to run a ZFS environment, but it's
the maximum I can give the VM for now. :-(

This is all I can do (at least for now). I should hand over
to someone familiar with the VM, ZFS, and the -head code.

In general, FreeBSD src changes are first tested as patch(es) against
head, merged to head, and then MFC'ed.

On the other hand, Karl's work proceeds on stable/10 (or the 10.x-RELEASE
branch), as it's needed for his production systems, and he shares it with us.

So if we want it to land in src, we need someone to port his
patch(es) to head.

I can make an initial port to head like this, but a sufficient level
of review and testing on a wide variety of environments
would be strongly needed. That is clearly beyond my reach.


<Technical details and concerns (a bit)>

In the r286570 changes, hdr_cache is split into hdr_full_cache and
hdr_l2only_cache. See lines 1099-1100, 1112-1113, 1239-1243, 3026-3027
in [1] for details.

In the r286625 changes, arc_kmem_reap_now() no longer takes any argument;
see line 3279 in [2] for details. Although I cannot confirm it fully,
this could indirectly prevent this patch from actually working, couldn't it?

Also in the r286625 changes, the condition on lines 3339 to 3340 is changed
to use arc_available_memory() instead of arc_reclaim_needed(),
and there are more changes in the same function. Would that matter? See [2].

And there are possibly many others I'm missing.


[1] https://svnweb.freebsd.org/base/head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c?r1=284513&r2=286570

[2] https://svnweb.freebsd.org/base/head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c?r1=286623&r2=286625
Comment 229 Tomoaki AOKI 2015-08-15 16:51:55 UTC
Created attachment 159905 [details]
ARC Refactor / UMA Cleardown / DMU_TX dynamic against head after r286776

Massive updates again. :-(
This would be why Karl doesn't want to develop on head.

The previous patch is no longer applicable to the latest head.
The new one should be applicable to r286763 or later.
Confirmed applicable to r286795; by tracking the
commit log, the breaking commit was r286763.
See lines 3529 to 3540 and 3793 to 3800 in [1].

  *Beware! Line 3795 matches the 7th line of the line-3799 block on the left side.
   The "} else {" on line 3795 of the right side should match the 8th line
   of the line-3799 block on the left side.
   The line numbers shown leftmost correspond to the right side.

The latest commit to arc.c is r286776 now, so the latest patch is named
need_test-patch.head.r286776.


The porting level is the same as the previous one (initial level).

This time I regenerated the patch using svnlite diff, so there is no more fuzz
if you apply it to r286776 through r286795.
On the other hand, the function names in the @@ lines are lost.

I have also obsoleted my old patches for head. If you need the old ones
for some reason, try "Show obsolete". The patches for stable/10 aren't obsoleted.


[1] https://svnweb.freebsd.org/base/head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c?r1=286762&r2=286763
Comment 230 Fabian Keil 2015-09-09 13:47:54 UTC
I've been using Karl's patch for a couple of months now on
systems based on 11-CURRENT and clearly it's a big improvement.

Unfortunately I still noticed severe latency issues under memory
pressure on systems that frequently operate close to the ARC
minimum.

This patch seems to significantly improve the user experience:
https://www.fabiankeil.de/sourcecode/electrobsd/zfs-arc-tuning.diff
(Apply Karl's patch first)

For details see:
https://www.fabiankeil.de/gehacktes/electrobsd/zfs-arc-tuning/
Comment 231 karl 2015-09-09 14:07:25 UTC
I like that in general; that's yet another piece of the puzzle.....

Will look into incorporating it.
Comment 232 Tomoaki AOKI 2015-10-04 11:51:18 UTC
Created attachment 161691 [details]
ARC Refactor / UMA Cleardown / DMU_TX dynamic against stable/10 after r288599

After the massive MFCs from r288516 to r288599 by Alexander Motin (mav@),
Karl's most recent patch with my fix (for r285717) no longer applies.

Unfortunately, my ported patch for head applies but doesn't build,
because the global cnt-to-vm_cnt rename (noted in Comment 222)
isn't MFC'ed yet (and maybe never will be, because of the KPI/KBI change).

But fortunately, replacing all 'vm_cnt' in the patch with 'cnt' helps.

One more piece of news: Fabian Keil's incremental patch (again with all
'vm_cnt' in the patch replaced by 'cnt') now applies to stable/10.

The updated patch, applied together with Fabian Keil's incremental patch, is
running fine for me (stable/10 r288599, amd64), but has not yet been heavily loaded.
(Compiling www/chromium under memory pressure from a VirtualBox VM is ongoing.)
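Rather than maintaining two near-identical copies of the patch, the counter name could in principle be selected once with a small compatibility define. The sketch below is purely illustrative; ZFS_VM_CNT and the build knob are invented names, not anything in the tree:

/*
 * Hypothetical compatibility shim: pick the counter name once per branch
 * instead of editing every use of it in the patch.
 */
#ifdef ZFS_CNT_IS_VM_CNT        /* would be defined for head after r263620 */
#define ZFS_VM_CNT      vm_cnt
#else                           /* stable/10, where the global is still cnt */
#define ZFS_VM_CNT      cnt
#endif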
Comment 233 Tomoaki AOKI 2015-10-04 12:00:55 UTC
Created attachment 161692 [details]
ARC Refactor / UMA Cleardown / DMU_TX dynamic with time-based rate limiting against stable/10 after r288599

A convenience patch for testers who want to try Fabian's patch.

It applies Karl's patch (modified for stable/10 after r288599)
and Fabian's patch (modified for stable/10) at once.

I won't upload the modified Fabian's patch alone, as its prerequisite is
that Karl's patch is already applied.
Comment 234 Tomoaki AOKI 2015-10-04 12:50:26 UTC
(In reply to Tomoaki AOKI from comment #232)

Building www/chromium under memory pressure finished earlier than I estimated,
with a relatively small amount of swap used.

As it finished while I wasn't watching the top output, I couldn't see when
the swap use occurred or how much memory was free at that moment.

But the facts below may indicate that the memory pressure I applied
was too large for physical memory (8GB).

  *The amount of swap used was relatively small (about 187MB).
  *The Mate desktop, firefox, sylpheed, multiple Mate terminals, and some other
   programs were running.
  *The VirtualBox VM used for memory pressure was allowed 2GB of memory.
  *The ARC size (below 900MB) was smaller than the size of www/chromium/work
   (1.1GB), which may mean no further shrinking was possible while linking.
  *While running du to calculate the disk usage of www/chromium/work, the ARC grew to 1.1GB.
Comment 235 karl 2015-10-09 15:08:15 UTC
Testing now on my test machine with the most-recent 10.2-STABLE ZFS changes in..... if it looks ok I will post here and roll forward onto one of my production machines this weekend during my regular maintenance window.
Comment 236 Fabian Keil 2015-10-09 16:11:26 UTC
FYI, I'm currently testing additional patches to limit the inactive memory
more aggressively.

This should fix two remaining issues I run into:

1) As Andriy predicted, some workloads can result in the ARC getting
   starved by inactive pages:

    last pid: 28429;  load averages:  0.48,  0.46,  0.41    up 0+03:39:07  17:24:59
    91 processes:  2 running, 88 sleeping, 1 waiting
    CPU:  1.4% user,  0.0% nice, 12.7% system,  0.2% interrupt, 85.7% idle
    Mem: 396M Active, 489M Inact, 986M Wired, 292K Cache, 5202K Buf, 43M Free
    ARC: 351M Total, 90M MFU, 44M MRU, 6839K Anon, 7810K Header, 203M Other, 350M Target
    Swap: 2048M Total, 99M Used, 1949M Free, 4% Inuse
    
      PID USERNAME    THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
       11 root          2 155 ki31     0K    32K RUN     0 377:37 170.34% idle
    26625 fk           17  36    0   175M 24504K uwait   1   0:09   8.40% git
        0 root        468 -16    0     0K  7488K swapin  1   3:29   6.26% kernel
       22 root          1  20    -     0K    16K geli:w  1   4:16   5.06% g_eli[1] ada0s1d

    2015 Sep 21 17:24:58: Scan goals in the previous minute:
      Update active LRU/deactivate pages                               60
    2015 Sep 21 17:24:58: Seconds since last 'Move inactive to cache or free' pass: 1477
    2015 Sep 21 17:24:58: Seconds since last 'Launder dirty pages' pass: 9273

2) On systems with only 1 GB of RAM and no ZFS tuning, memory pressure sometimes
   causes the free pages to shrink to v_interrupt_free_min (2!) because
   the inactive pages aren't freed quickly enough. As a result, ZFS locks up
   and the system panics or merely becomes unresponsive (except for ICMP etc.).

   I initially suspected that this was a result of my rate-limiting
   patch, but disabling the rate limit in case of vm_page_count_min()
   didn't make a difference.

   This could be a completely unrelated issue and I have not yet tried
   to reproduce it with vanilla FreeBSD.

   As I have not been able to reproduce the problem on systems with 2 GB
   or more, I continue to use the patches in production.

Anyway, first results look promising and I intend to polish and publish
the patches in a couple of days.
Comment 237 karl 2015-10-10 16:05:51 UTC
I'm very interested in looking at that; I played with some code that looked at the inact list when it was larger than the target and woke the vm system up to run through its demotion/cleaning routine.  It did the desired thing more-or-less (and resulted in more ARC memory) but there's a potential price for that, in that the reason for inact pages is to avoid I/O in the event that a page that was previously in use is again referenced.

It was not clear to me from my benchmarking that the code addition was actually beneficial in this instance, particularly with moderate RAM configurations..... are you triggering only when the ARC target size is less than the max high-water size?
Comment 238 karl 2015-10-10 17:40:07 UTC
Here is what I am playing with at present....

If we come through the consider_reap routine, wake the pager.  But, since there's an event handler for "low memory" in the arc code that will get triggered if the pager wakes up, set a (local) flag so as to not run through there if we woke the pager up, while leaving it active if the VM system woke up on its own.  (Without that test the ARC gets slammed immediately to the min value and remains there, since the event handler fires immediately when the pager is awakened.)

This *should* leave inact pages alone (which is desirable) until and unless we have a reason to reap, but when we do it should also force the pager to wake up and go through the process for demoting inact pages that have aged sufficiently, even though the pager is otherwise happy with the state of the system.

My initial tests appear to say this works as expected, and it's low-impact.  I'm running a soak right now to see if performance bears up under some fairly extreme circumstances.
Comment 239 karl 2015-10-10 18:19:29 UTC
Meh.

This does keep inact under control but it also winds up sending some modest number of pages to swap, and the impact on interactive response, particularly if you have a process that has been idle but is still attached to a terminal, is unacceptable when you try to come back to it.
Comment 240 karl 2015-10-10 20:13:18 UTC
Another attempt in the same general vein is being made.... this one looks more promising.  I've got it on a production box now, as it's the weekend and I can get away with it given that it passed on my test machine, so we'll see.

It's only a couple of lines of code changed compared against the latest patch above, and if it's effective on those workloads where inact pages pile up then that would address the remaining complaint that has been leveled against this patch.
Comment 241 Fabian Keil 2015-10-11 13:14:33 UTC
I let the vm pageout daemon limit the inactive pages so the ARC
can continue to be oblivious of them. In my opinion the ARC has
too much knowledge about the vm internals already.

Here's a sneak preview:
https://www.fabiankeil.de/sourcecode/electrobsd/vm-limit-inactive-memory-more-aggressively.diff
(The code has not been properly tested yet!)

A screen shot of the inactive page limit in action:
https://www.fabiankeil.de/bilder/electrobsd/kernel-compilation-with-inactive-page-limit-enabled.png
Comment 242 karl 2015-10-12 12:37:01 UTC
There was an intentional decision taken quite a long time ago when it comes to the VM and inactive pages; in the absence of memory pressure there's no reason to demote (and ultimately evict) them because there is no "cost" in keeping them around, and there might be a benefit if they become active again.

I'm not sure changing that is necessarily a good idea although the debate probably ought to be entertained.

Note that the original Illumos/Sun code says the following (which only is operative on Solaris):

        /*
         * check that we're out of range of the pageout scanner.  It starts to
         * schedule paging if freemem is less than lotsfree and needfree.
         * lotsfree is the high-water mark for pageout, and needfree is the
         * number of needed free pages.  We add extra pages here to make sure
         * the scanner doesn't start up while we're freeing memory.
         */

This is important; the original intent is to remain clear of the pager with ARC's demands.  That was what drove the patch in the first place.  While the workloads I have accessible to me don't cause inact pages to become a problem, there are reports that for some other people they do.

That's easily addressed without changing how VM behaves because the VM system *sleeps* whenever it is not under memory pressure.  That is, it never demotes pages under that scenario because it doesn't think there's a reason for it.

So what I've been testing here for a little bit is a small change to that paradigm.  At a small margin above arc_free_target *wake the pager occasionally* which will cause it to age pages even though it knows that it's not in critical memory trouble.

There are two conditions on this test -- first, the init code has to have been run first, because if you wake the pager before the VM's swapper has been initialized the kernel will panic immediately and second, there's no reason to do it until the ARC cache is warmed up.  Finally, the ARC code has an event handler set that gets kicked whenever the pagedaemon normally wakes up (which is a statement that we're in memory trouble) and since we're triggering it otherwise under light RAM pressure I've added a bypass so we don't pare arc back in that instance.

There are two tunables added to control this behavior:

vfs.zfs.arc_wakeup_delay: 500
vfs.zfs.arc_wakeup_pager: 95970

arc_wakeup_delay is the minimum delay in milliseconds between wakes; because waking the pager is expensive in relative terms you don't want to "storm" it.  Further, there's no benefit to doing so since the entire (and only) purpose is to allow it to schedule page demotion, particularly from inact.

arc_wakeup_pager is the number of pages of free RAM below which we wake the pager up on the above timed schedule.  The default is 10/9ths of arc_free_target although you can change it at any time.

This has held inact pages to roughly the same as or less than the active count under both synthetic and operational loads; without it inact pages could rise as high as 2-2.5x active under some conditions here, although I never got the "runaway" scenario under the workloads that I have.

The price of this change is that it also comes with the *cost* of demoting some pages to swap; I believe there's a cogent argument to be made that this is *wrong* and may hurt interactive performance, perhaps badly, especially where the swap resides on spinning rust.  For this reason I'm not sure I like the patch's results, but for those who want to play with it, here it is.

(You can effectively disable the wakeup code by setting vfs.zfs.arc_wakeup_pager to vm.v_free_min should you wish to do so on a running system.)
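For readers following the discussion, the mechanism described above boils down to something like the sketch below.  This is not the patch code: the helper, flag, and variable names are assumptions, as is the use of the kernel "ticks"/"hz" counters for the delay; only pagedaemon_wakeup(), cnt.v_free_count, and the two tunables come from the description above.

/*
 * Sketch only (not the patch itself): wake the pagedaemon when free RAM
 * drops below the arc_wakeup_pager threshold, but no more than once per
 * arc_wakeup_delay milliseconds, and flag the wakeup so the ARC lowmem
 * event handler can skip paring the cache back.
 */
static int   zfs_arc_wakeup_delay = 500;  /* ms; vfs.zfs.arc_wakeup_delay */
static u_int zfs_arc_wakeup_pager;        /* pages; vfs.zfs.arc_wakeup_pager */
static int   arc_last_pager_wakeup;       /* ticks at last wakeup */
static int   arc_self_initiated_wakeup;   /* invented name for the bypass flag */

static void
arc_maybe_wake_pager(void)
{
        if (cnt.v_free_count < zfs_arc_wakeup_pager &&
            (u_int)(ticks - arc_last_pager_wakeup) >=
            (u_int)(zfs_arc_wakeup_delay * hz / 1000)) {
                arc_last_pager_wakeup = ticks;
                arc_self_initiated_wakeup = 1;
                pagedaemon_wakeup();
        }
}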
Comment 243 karl 2015-10-12 12:40:33 UTC
Created attachment 161943 [details]
Patch for 10.2-STABLE r289078 including pagedaemon wakeup code

Explained in comment #242; excerpt:

Here is a small change to the former patch paradigm.  At a small margin above arc_free_target *wake the pager occasionally* which will cause it to age pages even though it knows that it's not in critical memory trouble.

There are two conditions on this test -- first, the init code has to have been run first, because if you wake the pager before the VM's swapper has been initialized the kernel will panic immediately and second, there's no reason to do it until the ARC cache is warmed up.  Finally, the ARC code has an event handler set that gets kicked whenever the pagedaemon normally wakes up (which is a statement that we're in memory trouble) and since we're triggering it otherwise under light RAM pressure I've added a bypass so we don't pare arc back in that instance.

There are two tunables added to control this behavior:

vfs.zfs.arc_wakeup_delay: 500
vfs.zfs.arc_wakeup_pager: 95970

arc_wakeup_delay is the minimum delay in milliseconds between wakes; because waking the pager is expensive in relative terms you don't want to "storm" it.  Further, there's no benefit to doing so since the entire (and only) purpose is to allow it to schedule page demotion, particularly from inact.

arc_wakeup_pager is the number of pages of free RAM below which we wake the pager up on the above timed schedule.  The default is 10/9ths of arc_free_target although you can change it at any time.

This has held inact pages to roughly the same as or less than the active count under both synthetic and operational loads; without it inact pages could rise as high as 2-2.5x active under some conditions here, although I never got the "runaway" scenario under the workloads that I have.

The price of this change is that it also comes with the *cost* of demoting some pages to swap; I believe there's a cogent argument to be made that this is *wrong* and may hurt interactive performance, perhaps badly, especially where the swap resides on spinning rust.  For this reason I'm not sure I like the patch's results, but for those who want to play with it, here it is.

(You can effectively disable the wakeup code by setting vfs.zfs.arc_wakeup_pager to vm.v_free_min should you wish to do so on a running system.)
Comment 244 Fabian Keil 2015-11-01 16:35:24 UTC
In comment #236 I mentioned ZFS lockups on patched systems with 1 GB of RAM.

Just for the record: I recently tried to reproduce them with unpatched
FreeBSD 10.2-RELEASE and with a 11.0-CURRENT snapshot based on r289846
(to rule out a regression in HEAD) but the systems remained stable for
days with the workload (reproducing ElectroBSD images in a loop) that
causes patched systems to panic within hours (sometimes minutes).

So far I'm pleased with the patch I posted in comment #241 but intend
to add auto-tuning in the near future.

I have not yet tried Karl's latest patch but like to point out that
some of the underlying assumptions about how the vm pager behaves seem
incorrect to me.

For example I'd expect calling pagedaemon_wakeup() without memory
pressure (from the pagers point of view) to be pretty close to a nop
as vm_pageout_worker() does its own checks before doing any heavy lifting:
http://fxr.watson.org/fxr/source/vm/vm_pageout.c#L1634

Also note that vm_pageout_worker() is already called at least once per
second anyway:

[fk@polizei-erziehung ~]$ sudo /usr/src/share/dtrace/monitor-page-scanner 
2015 Nov  1 17:20:45: Monitoring the page scanner. Minimum pass value to show 'boring' scans without memory pressure or inactive page surplus: 2 (Launder dirty pages). Press CTRL-C to abort.
2015 Nov  1 17:21:45: Scan goals in the previous minute:
  Update active LRU/deactivate pages                               60
2015 Nov  1 17:22:45: Scan goals in the previous minute:
  Update active LRU/deactivate pages                               60

I'm not claiming that increasing the frequency when there's no memory
pressure causes any harm (besides code complexity), but I'm not convinced
that it has the intended effect and needs to be triggered from ZFS
(as opposed to changing the pager defaults).
Comment 245 mikej 2015-11-20 23:28:49 UTC
Can someone please comment on which diff (and its location) is currently being tested on both 10-STABLE and HEAD, with SVN revision numbers?

An acknowledgement of what work, if any, has already been included in STABLE or HEAD would be enlightening - I don't grep/awk the SVN repository well :-(

Karl, Steve and the group, thank you for your work and insight.

Kind regards,

--mikej
Comment 246 karl 2015-11-20 23:44:58 UTC
My last set of patches will apply against -STABLE r289078; I have not attempted it against -HEAD.
Comment 247 karl 2015-12-10 15:48:51 UTC
Created attachment 164051 [details]
Update of patch related to other bug report (see comment for details)

Identical to 161943 but adds defense against inversion of the dirty/dirty_max_internal relationship resulting in a potential sign extension or divide-by-zero.

In response to https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=205163: the stack backtrace and examination of variables show an inversion of that relationship, but the claimed panic cause (integer divide by zero) is not evident in the kernel state, which appears to imply a crashed stack.

The added code should in theory never execute, and an inversion of those values should be impossible: at line ~1325 in dmu_tx.c, once dirty_data_max_internal is (potentially) resized, dmu_tx_try_assign executes and, if the dirty data level is at or near the (now possibly resized) dirty_data_max amount, dmu_tx_wait is called to suspend allocation until that buffer drains.  That is, changes in this value are at least theoretically protected against the dirty value meeting or exceeding the resized amount, and thus a division by zero or an inversion should not be able to happen.

Nonetheless it cannot do harm to defend against this (other than the couple of additional instructions and the time to execute them), and if it resolves the panic that implies there is a missing mutex somewhere in the base code that needs to be run down, since it will likely cause trouble elsewhere as well.
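Concretely, the defensive check amounts to something like the sketch below; the variable names follow the description above rather than the patch itself, and the clamp value is illustrative:

/*
 * Sketch of the added defense: if the dirty total has somehow met or
 * exceeded the (possibly just-resized) internal maximum, clamp it so the
 * later throttle/delay arithmetic cannot sign-extend or divide by zero.
 * In theory this branch never executes.
 */
if (dirty >= dirty_data_max_internal)
        dirty = dirty_data_max_internal - 1;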
Comment 248 karl 2016-08-29 17:24:13 UTC
Created attachment 174197 [details]
First cut against 11.0-STABLE

I have implemented a quick cut of the former patch against 11.0-STABLE @ r305004.

This implements only one piece of the 10.x patch heuristic -- the component that senses low memory before the paging threshold is reached (which is where the base code initiates both ARC eviction *and* paging) and (1) wakes the pager and (2) reaps caches, both with a timer so as to prevent calling these routines (which can be expensive) too rapidly.  This cut does *not* bring over the arc_free_target modification from the base code that is in the 10.x patch.

11.x has materially-different internals in a few subtle but potentially important areas when it comes to ARC memory management and its interaction with the VM.  However, the stock code has been observed on my production systems to retain some (but not all) of the pathology in 10.x.

As such I am being conservative with pulling the patch across to 11.x rather than simply slam-bang porting over the older patch, as minimalism, if it's effective, is a win.

This has received minor testing here thus far and while it appears to be quite effective in mitigating some of the remaining bogons in the ARC code I cannot vouch for it at this time in a production environment.

Due to limited testing at this point it should be deployed with some care; I am very interested in commentary from others who have issues with memory management (particularly high page file consumption when under memory pressure and/or inappropriate ARC eviction) under memory pressure scenarios as to whether this helps or (hopefully not!) has undesirable side effects.
Comment 249 karl 2016-08-29 17:36:05 UTC
Created attachment 174198 [details]
Dtrace script to track code execution in above patch against 11.0-STABLE

This script, fed to dtrace, will print when the above patch wakes the pager and when it executes UMA reaping, should you desire to see how often the code in question actually executes.
Comment 250 karl 2016-08-30 13:46:52 UTC
The 11.0 code is not effective in preventing RSS page eviction to the swap, but *is* effective at preventing ARC pare-back in an inappropriate fashion in my testing thus far.

I'm going to tweak vfs.zfs.arc_free_target and see if that's sufficient to stop the aggressive RSS eviction before continuing to bring over the other components of the original patch.  I suspect a relatively-minor change in that regard may be effective given what I'm seeing with dtrace.
Comment 251 karl 2016-08-30 20:23:02 UTC
Created attachment 174231 [details]
Second cut against 11.0-STABLE

Second cut; makes the following changes in addition to the above:

1. Sets arc_free_target's default to slightly above the paging threshold instead of equal to it.  (This remains tunable if you don't like the defaults.)

2. When the system invades vm.v_free_target (note: still above arc_free_target by a decent amount, especially on larger systems) we start *slowly* cleaning up UMA; one zone is chased down per 500ms "window" on a rotary basis (sketched at the end of this comment).  (For future: I will be looking into the "vmstat" code to see how to determine the "in use" vs. "free" count, targeting only those UMA buckets where a "material" fraction [e.g. more than 10%?] of the allocation is free and unused in this circumstance.)

3. If we detect an invasion of arc_free_target then the previous algorithm (aggressively clearing down UMA) is employed.

The code is a bit rough stylistically but should be OK execution-wise.

As with the previous rev this has been tested modestly here without ill effect, but is not guaranteed ok -- use with care, especially on a production machine.  The Dtrace script has been updated to show whether the non-aggressive or aggressive reclaim fires, and the former appears to be very effective in limiting the need for the latter to be executed.

Evaluation continues.
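As an illustration of point 2 above, the "one zone per window" soft clean-up can be sketched as below, reusing the cache arrays the earlier reap_arc_caches() hunk already walks; the cursor and function names are invented, and the real patch may differ:

/*
 * Sketch only: reap a single pair of ZFS UMA-backed caches per call,
 * advancing a cursor so successive 500 ms windows cover the whole set.
 */
static size_t arc_reap_cursor;

static void
arc_reap_one_cache(void)
{
        const size_t ncaches = SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT;

        if (zio_buf_cache[arc_reap_cursor] != NULL)
                kmem_cache_reap_now(zio_buf_cache[arc_reap_cursor]);
        if (zio_data_buf_cache[arc_reap_cursor] != NULL)
                kmem_cache_reap_now(zio_data_buf_cache[arc_reap_cursor]);
        arc_reap_cursor = (arc_reap_cursor + 1) % ncaches;
}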
Comment 252 karl 2016-08-30 20:24:09 UTC
Created attachment 174232 [details]
Dtrace script to go with Second Cut on 11-STABLE

Pairs with the Second Cut 174231
Comment 253 karl 2016-08-31 15:00:37 UTC
Created attachment 174254 [details]
Cleanup of second patch - no functional changes

Remove extraneous call pass-through; no functional change.
Comment 254 Tomoaki AOKI 2016-09-04 08:07:07 UTC
(In reply to karl from comment #253)

Glad to know you're working on this for stable/11. :-)
This is still a important work on stable/11 (and would be head,too).

There was only my quick conversion for slightly old patchset for those
environments. (Worked mostly OK but observed some strange swaps.)


One thing to note: your new default for vfs.zfs.arc_free_target (even in
the second cut) looks too small, while the old default (for stable/10) works
mostly well for me.

  *I tested the old default by manually setting vfs.zfs.arc_free_target
   in sysctl.conf.

Sorry, no dtrace logs were kept for the second cut itself.

Currently I observe swapping with the old default when a few MB of tmpfs
(swap-backed) is in use. This should be known-to-be-OK behaviour.

My environment is unchanged, except that the base system was upgraded from
stable/10 to stable/11 and all ports were rebuilt.
Comment 255 karl 2016-09-04 13:41:31 UTC
I am playing with the arc_free_target number here.  With it set where it is now I do get a fair amount of page-space allocation even though RAM is not exhausted, but it doesn't have a notable performance impact.  One of the tests I intend to run is to intentionally remove the page space from the system while under said load and see if it misbehaves; if not, then perhaps leaving the default where it is in the second cut is OK, otherwise I will increase it.

So far the second cut appears to be behaving well on my production machines; I have made a couple of additional cleanup passes on the code itself, but am still looking for a means of determining UMA occupancy for a slab.  Not having it isn't critical, but it would make the clean-up under light memory pressure more efficient by avoiding reap calls on zones that have few resources out-but-unused.  The exposed public functions documented in the man pages only return the current allocation, which is insufficient.
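To make the limitation concrete, a kernel-side sketch (illustration only) using the documented uma(9) call uma_zone_get_cur(): it yields the approximate in-use item count for a zone, but says nothing about how many items sit free-but-cached in the zone -- the number a smarter reap decision would want. The uma_zone_get_free() named in the comment is hypothetical.

/*
 * Illustrates why the public uma(9) API is not enough for a targeted reap:
 * we can learn how many items a zone currently has allocated, but not how
 * many of them are sitting idle in the zone's caches.
 */
#include <sys/param.h>
#include <vm/uma.h>

int
zone_worth_reaping(uma_zone_t zone)
{
	int64_t allocated;

	allocated = uma_zone_get_cur(zone);	/* approximate in-use items */

	/*
	 * Missing: something like a (hypothetical) uma_zone_get_free(zone)
	 * returning the cached-but-unused item count, so one could reap
	 * only zones where free/allocated exceeds, say, 10%.
	 */
	return (allocated > 0);	/* placeholder decision */
}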
Comment 256 Andriy Gapon freebsd_committer freebsd_triage 2017-03-12 09:56:39 UTC
Can this bug be closed already?
I'd suggest opening new, more focused bug reports for any remaining issues.
Comment 257 karl 2017-03-12 16:54:16 UTC
The current patch is still running on my 11-STABLE system; the problems remain if I attempt to run a "neat" kernel without it.

While the scope has narrowed to a material degree due to improvements in the base code, in my opinion it is by no means "fixed" in the code as shipped in the 11-STABLE branch.
Comment 258 Anton Saietskii 2017-05-08 21:23:44 UTC
Which patch should I try for releng/10.3?
Comment 259 karl 2017-05-09 14:29:19 UTC
(In reply to Anton Sayetsky from comment #258)

See what rev the code is at (uname -a), and if the rev is after r288599 try this one: https://bugs.freebsd.org/bugzilla/attachment.cgi?id=164051

If it applies cleanly it should work -- but do verify that it applies cleanly before committing to it!
Comment 260 Anton Saietskii 2017-05-12 13:50:05 UTC
(In reply to karl from comment #259)

Thanks, Karl -- the patch applied cleanly and buildkernel was fine too. The machine is already running the patched kernel; you can see the zfs-stats output after 16 hours below:

root@cs0:~# zfs-stats -a

------------------------------------------------------------------------
ZFS Subsystem Report                            Fri May 12 16:47:25 2017
------------------------------------------------------------------------

System Information:

        Kernel Version:                         1003000 (osreldate)
        Hardware Platform:                      amd64
        Processor Architecture:                 amd64

        ZFS Storage pool Version:               5000
        ZFS Filesystem Version:                 5

FreeBSD 10.3-RELEASE-p19 #0 r318183M: Thu May 11 23:59:27 MSK 2017 root
16:47  up 16:34, 4 users, load averages: 0,62 0,56 0,53

------------------------------------------------------------------------

System Memory:

        0.01%   19.10   MiB Active,     0.27%   684.19  MiB Inact
        94.08%  234.63  GiB Wired,      0.00%   3.88    MiB Cache
        5.64%   14.07   GiB Free,       0.00%   0 Gap

        Real Installed:                         256.00  GiB
        Real Available:                 99.98%  255.95  GiB
        Real Managed:                   97.43%  249.39  GiB

        Logical Total:                          256.00  GiB
        Logical Used:                   94.24%  241.26  GiB
        Logical Free:                   5.76%   14.74   GiB

Kernel Memory:                                  1.33    GiB
        Data:                           99.04%  1.32    GiB
        Text:                           0.96%   13.08   MiB

Kernel Memory Map:                              249.39  GiB
        Size:                           92.55%  230.80  GiB
        Free:                           7.45%   18.58   GiB

------------------------------------------------------------------------

ARC Summary: (HEALTHY)
        Memory Throttle Count:                  0

ARC Misc:
        Deleted:                                1.44m
        Recycle Misses:                         0
        Mutex Misses:                           114
        Evict Skips:                            322

ARC Size:                               92.76%  230.41  GiB
        Target Size: (Adaptive)         92.76%  230.40  GiB
        Min Size (Hard Limit):          77.30%  192.00  GiB
        Max Size (High Water):          1:1     248.39  GiB

ARC Size Breakdown:
        Recently Used Cache Size:       45.72%  105.35  GiB
        Frequently Used Cache Size:     54.28%  125.06  GiB

ARC Hash Breakdown:
        Elements Max:                           2.37m
        Elements Current:               100.00% 2.37m
        Collisions:                             157.32k
        Chain Max:                              3
        Chains:                                 78.14k

------------------------------------------------------------------------

ARC Efficiency:                                 41.19m
        Cache Hit Ratio:                35.56%  14.65m
        Cache Miss Ratio:               64.44%  26.54m
        Actual Hit Ratio:               35.42%  14.59m

        Data Demand Efficiency:         80.54%  14.41m

        CACHE HITS BY CACHE LIST:
          Most Recently Used:           65.55%  9.60m
          Most Frequently Used:         34.05%  4.99m
          Most Recently Used Ghost:     0.32%   46.65k
          Most Frequently Used Ghost:   0.45%   66.17k

        CACHE HITS BY DATA TYPE:
          Demand Data:                  79.20%  11.60m
          Prefetch Data:                0.00%   0
          Demand Metadata:              19.17%  2.81m
          Prefetch Metadata:            1.63%   239.14k

        CACHE MISSES BY DATA TYPE:
          Demand Data:                  10.56%  2.80m
          Prefetch Data:                0.00%   0
          Demand Metadata:              89.21%  23.68m
          Prefetch Metadata:            0.23%   62.38k

------------------------------------------------------------------------

L2 ARC Summary: (HEALTHY)
        Passed Headroom:                        119.58k
        Tried Lock Failures:                    11.52k
        IO In Progress:                         0
        Low Memory Aborts:                      0
        Free on Write:                          226
        Writes While Full:                      8.66k
        R/W Clashes:                            0
        Bad Checksums:                          0
        IO Errors:                              0
        SPA Mismatch:                           142.53m

L2 ARC Size: (Adaptive)                         1.09    TiB
        Header Size:                    0.01%   111.71  MiB

L2 ARC Breakdown:                               26.54m
        Hit Ratio:                      0.70%   186.42k
        Miss Ratio:                     99.30%  26.35m
        Feeds:                                  67.06k

L2 ARC Buffer:
        Bytes Scanned:                          58.44   TiB
        Buffer Iterations:                      67.06k
        List Iterations:                        265.86k
        NULL List Iterations:                   2

L2 ARC Writes:
        Writes Sent:                    100.00% 31.97k

------------------------------------------------------------------------


------------------------------------------------------------------------

VDEV cache is disabled

------------------------------------------------------------------------

ZFS Tunables (sysctl):
        kern.maxusers                           16717
        vm.kmem_size                            267777855488
        vm.kmem_size_scale                      1
        vm.kmem_size_min                        0
        vm.kmem_size_max                        1319413950874
        vfs.zfs.trim.max_interval               1
        vfs.zfs.trim.timeout                    30
        vfs.zfs.trim.txg_delay                  32
        vfs.zfs.trim.enabled                    0
        vfs.zfs.vol.unmap_enabled               1
        vfs.zfs.vol.mode                        1
        vfs.zfs.version.zpl                     5
        vfs.zfs.version.spa                     5000
        vfs.zfs.version.acl                     1
        vfs.zfs.version.ioctl                   5
        vfs.zfs.debug                           0
        vfs.zfs.super_owner                     0
        vfs.zfs.sync_pass_rewrite               2
        vfs.zfs.sync_pass_dont_compress         5
        vfs.zfs.sync_pass_deferred_free         2
        vfs.zfs.zio.exclude_metadata            0
        vfs.zfs.zio.use_uma                     1
        vfs.zfs.cache_flush_disable             0
        vfs.zfs.zil_replay_disable              0
        vfs.zfs.min_auto_ashift                 12
        vfs.zfs.max_auto_ashift                 13
        vfs.zfs.vdev.trim_max_pending           10000
        vfs.zfs.vdev.bio_delete_disable         0
        vfs.zfs.vdev.bio_flush_disable          0
        vfs.zfs.vdev.write_gap_limit            4096
        vfs.zfs.vdev.read_gap_limit             32768
        vfs.zfs.vdev.aggregation_limit          131072
        vfs.zfs.vdev.trim_max_active            64
        vfs.zfs.vdev.trim_min_active            1
        vfs.zfs.vdev.scrub_max_active           2
        vfs.zfs.vdev.scrub_min_active           1
        vfs.zfs.vdev.async_write_max_active     10
        vfs.zfs.vdev.async_write_min_active     1
        vfs.zfs.vdev.async_read_max_active      3
        vfs.zfs.vdev.async_read_min_active      1
        vfs.zfs.vdev.sync_write_max_active      10
        vfs.zfs.vdev.sync_write_min_active      10
        vfs.zfs.vdev.sync_read_max_active       10
        vfs.zfs.vdev.sync_read_min_active       10
        vfs.zfs.vdev.max_active                 1000
        vfs.zfs.vdev.async_write_active_max_dirty_percent  60
        vfs.zfs.vdev.async_write_active_min_dirty_percent  30
        vfs.zfs.vdev.mirror.non_rotating_seek_inc  1
        vfs.zfs.vdev.mirror.non_rotating_inc    0
        vfs.zfs.vdev.mirror.rotating_seek_offset  1048576
        vfs.zfs.vdev.mirror.rotating_seek_inc   5
        vfs.zfs.vdev.mirror.rotating_inc        0
        vfs.zfs.vdev.trim_on_init               1
        vfs.zfs.vdev.cache.bshift               16
        vfs.zfs.vdev.cache.size                 0
        vfs.zfs.vdev.cache.max                  16384
        vfs.zfs.vdev.metaslabs_per_vdev         200
        vfs.zfs.txg.timeout                     10
        vfs.zfs.space_map_blksz                 4096
        vfs.zfs.spa_slop_shift                  5
        vfs.zfs.spa_asize_inflation             24
        vfs.zfs.deadman_enabled                 1
        vfs.zfs.deadman_checktime_ms            5000
        vfs.zfs.deadman_synctime_ms             1000000
        vfs.zfs.recover                         0
        vfs.zfs.spa_load_verify_data            1
        vfs.zfs.spa_load_verify_metadata        1
        vfs.zfs.spa_load_verify_maxinflight     10000
        vfs.zfs.check_hostid                    1
        vfs.zfs.mg_fragmentation_threshold      85
        vfs.zfs.mg_noalloc_threshold            0
        vfs.zfs.condense_pct                    200
        vfs.zfs.metaslab.bias_enabled           1
        vfs.zfs.metaslab.lba_weighting_enabled  1
        vfs.zfs.metaslab.fragmentation_factor_enabled  1
        vfs.zfs.metaslab.preload_enabled        1
        vfs.zfs.metaslab.preload_limit          3
        vfs.zfs.metaslab.unload_delay           8
        vfs.zfs.metaslab.load_pct               50
        vfs.zfs.metaslab.min_alloc_size         33554432
        vfs.zfs.metaslab.df_free_pct            4
        vfs.zfs.metaslab.df_alloc_threshold     131072
        vfs.zfs.metaslab.debug_unload           0
        vfs.zfs.metaslab.debug_load             0
        vfs.zfs.metaslab.fragmentation_threshold  70
        vfs.zfs.metaslab.gang_bang              16777217
        vfs.zfs.free_bpobj_enabled              1
        vfs.zfs.free_max_blocks                 -1
        vfs.zfs.no_scrub_prefetch               0
        vfs.zfs.no_scrub_io                     0
        vfs.zfs.resilver_min_time_ms            3000
        vfs.zfs.free_min_time_ms                1000
        vfs.zfs.scan_min_time_ms                1000
        vfs.zfs.scan_idle                       50
        vfs.zfs.scrub_delay                     4
        vfs.zfs.resilver_delay                  2
        vfs.zfs.top_maxinflight                 288
        vfs.zfs.zfetch.array_rd_sz              1048576
        vfs.zfs.zfetch.max_distance             8388608
        vfs.zfs.zfetch.min_sec_reap             2
        vfs.zfs.zfetch.max_streams              24
        vfs.zfs.prefetch_disable                1
        vfs.zfs.delay_scale                     500000
        vfs.zfs.delay_min_dirty_percent         60
        vfs.zfs.dirty_data_sync                 67108864
        vfs.zfs.dirty_data_max_percent          10
        vfs.zfs.dirty_data_max_max              4294967296
        vfs.zfs.dirty_data_max                  4294967296
        vfs.zfs.max_recordsize                  1048576
        vfs.zfs.mdcomp_disable                  0
        vfs.zfs.nopwrite_enabled                1
        vfs.zfs.dedup.prefetch                  1
        vfs.zfs.arc_reap_delay_min              200
        vfs.zfs.arc_cache_reapings_skipped      3271
        vfs.zfs.l2c_only_size                   0
        vfs.zfs.mfu_ghost_data_lsize            110572121600
        vfs.zfs.mfu_ghost_metadata_lsize        0
        vfs.zfs.mfu_ghost_size                  110572121600
        vfs.zfs.mfu_data_lsize                  135099494912
        vfs.zfs.mfu_metadata_lsize              520443392
        vfs.zfs.mfu_size                        135810017280
        vfs.zfs.mru_ghost_data_lsize            136816527872
        vfs.zfs.mru_ghost_metadata_lsize        0
        vfs.zfs.mru_ghost_size                  136816527872
        vfs.zfs.mru_data_lsize                  109763704320
        vfs.zfs.mru_metadata_lsize              464291840
        vfs.zfs.mru_size                        110581102080
        vfs.zfs.anon_data_lsize                 0
        vfs.zfs.anon_metadata_lsize             0
        vfs.zfs.anon_size                       5408768
        vfs.zfs.l2arc_norw                      1
        vfs.zfs.l2arc_feed_again                1
        vfs.zfs.l2arc_noprefetch                0
        vfs.zfs.l2arc_feed_min_ms               200
        vfs.zfs.l2arc_feed_secs                 1
        vfs.zfs.l2arc_headroom                  4
        vfs.zfs.l2arc_write_boost               174063616
        vfs.zfs.l2arc_write_max                 87031808
        vfs.zfs.arc_meta_limit                  66676028416
        vfs.zfs.arc_free_target                 922966
        vfs.zfs.arc_wakeup_delay                500
        vfs.zfs.arc_wakeup_pager                1025517
        vfs.zfs.dynamic_write_buffer            1
        vfs.zfs.arc_shrink_shift                7
        vfs.zfs.arc_average_blocksize           8192
        vfs.zfs.arc_min                         206158430208
        vfs.zfs.arc_max                         266704113664

------------------------------------------------------------------------

root@cs0:~#

I'll continue to watch how it works.
Comment 261 Anton Saietskii 2017-05-12 14:00:13 UTC
BTW, I think it's better to change the "Affects only me" field to "Affects some people" in the Importance field.
Comment 262 karl 2017-05-12 14:09:29 UTC
(In reply to Anton Sayetsky from comment #261)

Where that field is set now isn't where I set it when I reported the problem originally, and I have not changed it.
Comment 263 Mark Felder freebsd_committer freebsd_triage 2017-05-12 15:37:55 UTC
Updated "importance" field.
Comment 264 Anton Saietskii 2017-05-12 22:26:02 UTC
(In reply to karl from comment #259)

Unfortunately, the patch didn't fix my issue. It's somewhat different from the one described in the current PR, so I'll open a new one and describe my problem there in detail.
As a new round of testing I have disabled UMA in zio.
Comment 265 Tomoaki AOKI 2017-07-29 01:00:17 UTC
I have not yet looked into it enough, but this review looks related.

 D7538 Correct adaptation ZFS ARC memory pressure to FreeBSD [1]

What do you think of it, Karl?

[1] https://reviews.freebsd.org/D7538
Comment 266 karl 2017-07-30 15:47:00 UTC
(In reply to Tomoaki AOKI from comment #265)

At first blush this looks quite interesting, in that it includes a means of determining allocated-but-unused UMA memory, which can (at minimum) improve the decision on whether you *really* are low on RAM.

I'm not actively working on this at the present time, but I will stick up a reminder to look at it in more depth against 11.1 once I go down that road.

Thanks!
Comment 267 karl 2017-09-26 19:52:32 UTC
11.1 continues to pare back the ARC when it should not, forcing processes to swap inappropriately as well.

I am testing an updated patch; the 11.0 patch does not apply cleanly, even with fuzz, due to some internal changes in arc.c.  I should have a new rev for 11.1-STABLE soon; it compiles at this point, but I wish to run it for a week or so before publishing it so that it has at least a "one guy, one box, hit hard" soak test.
Comment 268 karl 2017-09-30 17:33:02 UTC
Created attachment 186818 [details]
Patch against r324056 (11.1-STABLE) w/Phabricator D7538 improvements

This incorporates the D7538 UMA cleanup improvements.  By themselves, however, they leave a problem with extremely large inactive page counts pressuring the ARC, and with RSS being inappropriately evicted to the page file in order to leave the ARC in place.

The changes to D7538 implement the same heuristic as previous iterations, specifically:

1. Always favor RSS in RAM over ARC, since to evict RSS you must take one (guaranteed) I/O (and perhaps two if it is recalled), whereas holding ARC in RAM is done on the *speculation* that you will save a future I/O.  Thus RSS should always "win" over ARC if there is contention.  To implement this, the ARC begins paring back somewhat before free RAM reaches the wakeup_pager threshold.

2. Wake up the pagedaemon when we are materially below free_target, so that scrubbing of inactive pages occurs under mild memory pressure.  This prevents the ARC from being pared back when the consumer of large amounts of RAM is inactive pages that could otherwise be demoted.  (A rough sketch of both rules follows.)
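Illustration only (not the patch itself): a standalone sketch of these two rules.  The field names mirror the patch's tunables visible earlier in this thread (vfs.zfs.arc_wakeup_pager, in pages) plus vm.v_free_target; wake_pagedaemon() stands in for the kernel's pagedaemon wakeup, and the "materially below" slack of the real code is reduced here to a simple compare.

/*
 * Sketch of the two-part heuristic: wake the pagedaemon under mild
 * pressure, and start shrinking the ARC before the pager would have to
 * evict process RSS.
 */
#include <stdbool.h>
#include <stdint.h>

struct arc_policy {
	uint64_t free_target;		/* vm.v_free_target, in pages */
	uint64_t arc_wakeup_pager;	/* begin ARC pare-back below this (pages) */
	void	(*wake_pagedaemon)(void); /* stand-in for the pagedaemon wakeup */
};

/* Returns true when the ARC should start giving memory back. */
bool
arc_policy_check(const struct arc_policy *p, uint64_t free_pages)
{
	/*
	 * Rule 2: under mild pressure, wake the pagedaemon so inactive
	 * pages get scrubbed instead of the ARC being the only payer.
	 */
	if (free_pages < p->free_target)
		p->wake_pagedaemon();

	/*
	 * Rule 1: RSS wins over ARC, so begin paring the ARC back a bit
	 * before free RAM falls to the point where the pager would start
	 * evicting process pages; arc_wakeup_pager sits above the pager's
	 * own threshold for exactly this reason.
	 */
	return (free_pages < p->arc_wakeup_pager);
}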
Comment 269 Tomoaki AOKI 2017-10-01 00:01:00 UTC
(In reply to karl from comment #268)

Thanks for your work!
Confirmed on head r324080 amd64:

  *applies with some offsets in the VM area (cleanly to arc.c),
  *builds and installs,
  *and boots fine.

Not yet stress-tested, though.
Comment 270 Tomoaki AOKI 2017-10-02 12:02:39 UTC
(In reply to karl from comment #268)

I noticed non-activated #ifdefs for REAP_ARC.

Should I activate it (e.g. via -DREAP_ARC on the make command line) for some special cases?
Or is a #define REAP_ARC needed within the patch, as in the WAKE_PAGER case?
Comment 271 karl 2017-10-02 13:13:50 UTC
(In reply to Tomoaki AOKI from comment #270)

No; that portion of the original code was there to handle what the "D" patch does now, and the latter does so in a more elegant fashion.  I will eventually remove that section of code entirely.
Comment 272 Tomoaki AOKI 2017-10-02 13:57:46 UTC
(In reply to karl from comment #271)

Understood. Thanks!

Currently I am testing under my regular load (on stable/11).
This means that unless I am hit by massive updates or a huge single-port update,
the load will be relatively low. :-(

Anyway, I'll report back if I find something strange
(e.g., swapping in cases known to NOT swap with D7538 or your previous patch).
Comment 273 Tomoaki AOKI 2017-11-19 06:30:30 UTC
Just a heads-up for now.
r325851 on head (MFC after: 3 weeks) broke this, including the plain D7538 patch. [1]

 *It removes the "needfree" variable.

Current test status:
I am observing strange swapping: it swaps while rebuilding www/firefox, which didn't cause swapping with pure D7538 or the previous patch here.
What is strange is that it sometimes doesn't swap under look-alike workloads.
I'm looking for the threshold of sysctl values that stops the swapping, but have not reliably succeeded so far. :-(

[1] https://lists.freebsd.org/pipermail/svn-src-head/2017-November/106220.html
Comment 274 vali gholami 2017-12-17 07:12:00 UTC
MARKED AS SPAM
Comment 275 Cy Schubert freebsd_committer freebsd_triage 2017-12-20 01:12:14 UTC
The ZFS Tuning guide (https://wiki.freebsd.org/ZFSTuningGuide -- some of which I had contributed) recommended, when it was written years ago, setting vfs.zfs.arc_max, vm.kmem_size and vm.kmem_size_max on small-memory machines. I had been tuning my larger-memory systems the same way, albeit with higher values, and thereby avoiding the issues. Having removed the artificial limits and looked at the numbers (and I haven't studied this in depth), the ARC can shrink, but not easily; however, the memory does remain wired. Without having had a chance to look at any code, I suspect there may be a couple of issues that need to be addressed. (Juxtaposed with this is the fact that the UFS buffer cache doesn't readily shrink either.)

What I'm suggesting is that we might need to take a step back and look at both issues holistically.
Comment 276 Eitan Adler freebsd_committer freebsd_triage 2018-05-28 19:47:03 UTC
batch change:

For bugs that match the following
-  Status Is In progress 
AND
- Untouched since 2018-01-01.
AND
- Affects Base System OR Documentation

DO:

Reset to open status.


Note:
I did a quick pass but if you are getting this email it might be worthwhile to double check to see if this bug ought to be closed.
Comment 277 ian 2018-09-21 09:49:04 UTC
I have been seeing regular severe performance degradation due to 'WIRED' memory reaching 96G on a machine with 128G of physical RAM; this is with arc_max set to 64G.

The load on this machine is very static: it runs half a dozen bhyve VMs, which run very well in general but see about a 70% drop in performance when wired memory hits 95G or so, with pagedaemon CPU going from under 1% to sitting between 10% and 70%. After trying many things today to get it back under control without a reboot (including reducing arc_max to 20G and removing the L2ARC from the zpool completely), I wrote a quick and dirty bit of code to allocate 64G of RAM and do at least one read in every 4K page. This pretty much instantly released much of the 95G of wired memory and returned performance to normal (and pagedaemon to under 1% CPU).

This might be a quick and dirty workaround for others experiencing this problem (I have parked this process in a 4am cron job for now). It does suggest, though, that the process of trimming back wired memory doesn't happen until way too late?
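For reference, a sketch of that kind of quick-and-dirty pressure program (a reconstruction from the description above, not ian's actual code): it allocates the requested number of GiB and performs one read per 4 KiB page so the pages are actually faulted in.

/*
 * Allocate N GiB (argument, default 1; ian's version used 64) and touch
 * one byte per 4 KiB page with a read, then exit.
 */
#include <stdio.h>
#include <stdlib.h>

int
main(int argc, char **argv)
{
	size_t gib = (argc > 1) ? strtoull(argv[1], NULL, 10) : 1;
	size_t len = gib << 30;
	size_t page = 4096;
	volatile char *p = malloc(len);

	if (p == NULL) {
		perror("malloc");
		return (1);
	}
	for (size_t off = 0; off < len; off += page)
		(void)p[off];	/* one read per page faults it in */
	printf("touched %zu GiB\n", gib);
	free((void *)p);
	return (0);
}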

This is on an 11.2-STABLE build from around the 15th of August. This machine repeatably degrades in performance around the same point, i.e. 3-4 weeks of uptime. I suspect that with the dodgy program that just eats up a bunch of RAM and exits (it takes about 20 seconds to do this) running daily, this degradation will not occur.

Potentially, on machines with a more dynamic memory load this situation may not occur?
Comment 278 Shane 2018-09-22 03:23:28 UTC
(In reply to ian from comment #277)

As well as the ARC being wired, there is also a max_wired that defaults to 30% of RAM and limits other wired allocations; together the two can total more than physical RAM. See bug #229764 for my report on that.

You should also account for what RAM your bhyves use: bhyve RAM plus ARC plus max_wired should combine to less than physical memory.

For the last two or three years I had been using an allocate-memory script to push out the ARC fairly regularly (this is on my everyday desktop); I have not needed to since applying review D7538.
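To make that arithmetic easy to check, a small userland sketch (illustration only): it reads hw.physmem, vfs.zfs.arc_max and vm.max_wired (a page count, converted to bytes here) via sysctl, takes the RAM promised to bhyve guests as a GiB argument, and reports whether the total fits within physical memory.

/*
 * Report whether bhyve RAM + arc_max + max_wired exceeds physical RAM.
 */
#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Read a numeric sysctl whether it is exported as 32 or 64 bits. */
static uint64_t
sysctl_u64(const char *name)
{
	uint64_t v = 0;
	unsigned int v32 = 0;
	size_t len = sizeof(v);

	if (sysctlbyname(name, &v, &len, NULL, 0) == 0 && len == sizeof(v))
		return (v);
	len = sizeof(v32);
	if (sysctlbyname(name, &v32, &len, NULL, 0) == 0)
		return (v32);
	perror(name);
	exit(1);
}

int
main(int argc, char **argv)
{
	uint64_t bhyve = (argc > 1 ? strtoull(argv[1], NULL, 10) : 0) << 30;
	uint64_t physmem = sysctl_u64("hw.physmem");
	uint64_t arc_max = sysctl_u64("vfs.zfs.arc_max");
	uint64_t max_wired = sysctl_u64("vm.max_wired") * (uint64_t)getpagesize();
	uint64_t total = bhyve + arc_max + max_wired;

	printf("physmem   %12ju bytes\n", (uintmax_t)physmem);
	printf("arc_max   %12ju bytes\n", (uintmax_t)arc_max);
	printf("max_wired %12ju bytes\n", (uintmax_t)max_wired);
	printf("bhyve     %12ju bytes\n", (uintmax_t)bhyve);
	printf("total %s physical RAM by %ju bytes\n",
	    total > physmem ? "EXCEEDS" : "fits within",
	    (uintmax_t)(total > physmem ? total - physmem : physmem - total));
	return (0);
}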
Comment 279 Graham Perrin freebsd_committer freebsd_triage 2019-03-02 11:24:58 UTC
See also: 

- bug 229670 - Too many vnodes causes ZFS ARC to exceed limit vfs.zfs.arc_max (high ARC "Other" usage)
Comment 280 karl 2019-03-02 14:58:07 UTC
(In reply to Graham Perrin from comment #279)

Are you seeing this on FreeBSD-12?

I've recently moved the system that had the worst examples of ARC trouble (and related issues), and which originally led to my writing this patch, to 12 -- and have not yet rolled the patch forward.

SO FAR it appears that 12 is MUCH better behaved.  There are fairly extensive internal changes in the ZFS code in 12, and I'm not yet ready to call this OBE, but the improvement is substantial.

Specifically, I could trivially trigger pathological behavior previously by simply starting a "buildworld" while that box was busy doing its other things (which include running a Postgres database) -- to the point that it would swap badly enough that interactive response was essentially destroyed.  Poudriere was flat out of the question.

I'm still seeing some evidence of UMA in particular being left stale to some degree, but it appears to be far better in 12 than it was previously.
Comment 281 Sean Chittenden freebsd_committer freebsd_triage 2019-06-18 17:44:21 UTC
Karl, can you test 12 again with the following?

https://svnweb.freebsd.org/base?view=revision&revision=348772

Given the age of this issue and other fixes in this area, I'm going to close this issue.  If there is a new patch, let's create a new issue for a new patch that can be applied against master and potentially be MFC'ed or pushed upstream to OpenZFS (fka ZoL).
Comment 282 karl 2019-06-18 17:50:53 UTC
(In reply to Sean Chittenden from comment #281)

I have a production box on FreeBSD 12.0-STABLE #2 r349024M at this point and will re-run my series of "torture tests" on it in the next couple of weeks -- I am out of town right now.

So far, however, since I built that rev it has not shown evidence of misbehavior.
Comment 283 Anton Saietskii 2019-07-27 10:13:32 UTC
(In reply to Sean Chittenden from comment #281)

I disagree with this, since the problem is still not fixed in releng/11.2, which is currently a supported branch.
Comment 284 Sean Chittenden freebsd_committer freebsd_triage 2019-07-29 17:45:26 UTC
(In reply to Anton Saietskii from comment #283)

Raise the issue with re@ and see if they can cherry-pick the fixes into a supported branch.  My understanding is that MFC'ing the commits associated with this PR will not happen for 11-STABLE, both because the work is present in the 12-STABLE branch and because of the amount of drift between 1{1,2}-STABLE.