Bug 242427 - zfsshutdown -> zfs_fini -> kmem_cache_destroy(zio_*_buf) is very slow causing 10+ minutes long reboots
Summary: zfsshutdown -> zfs_fini -> kmem_cache_destroy(zio_*_buf) is very slow causing...
Status: In Progress
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: 11.3-RELEASE
Hardware: amd64 Any
: --- Affects Some People
Assignee: Mark Johnston
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-12-04 14:29 UTC by Peter Eriksson
Modified: 2020-01-09 19:18 UTC (History)
5 users (show)

See Also:


Attachments
proposed diff (3.01 KB, patch)
2019-12-10 18:33 UTC, Mark Johnston
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Peter Eriksson 2019-12-04 14:29:16 UTC
I've noticed that on our file servers it can take a _very_ long time to reboot - where most of the time spent is during the "shutdown" phase - time spent after the "Uptime: " line is printed.

After a long debugging session I've pinpointed it to ZFS freeing the "zio" cache, which, after many levels of function calls, ends up it in pmap_remove() where some calls to it takes approximately 1 second. On a basically idle test server it can take up to 10-20 minutes for the server to "shut down" (or even more - the time seems to be relative to the server uptime). We've seen production machines that seem to be "hung" (atleast an hour or more) so we gave up and sent them a "hard reset" via IPMI.

Hardware:
Dell PowerEdge R730xd with dual Intel Xeon E5-2620v4 CPUs (32 "cpus") and 256GB of RAM. No swap.

Software:
FreeBSD 11.3-RELEASE-p5. ZFS on boot & data. ZFS ARC limited to 128GB. Approx 24000 ZFS filesystems (empty on this test server). Snapshots taken every hour.


An example of how long it can take (I'm only printing timing info for calls that take >=1s (the top 4) or >=2s (the rest):

kmem_unback: pmap_remove(kernel_pmap, 18446741877714755584, 18446741877714767872) took 1 seconds
kmem_free: kmem_unback(kmem_object, 18446741877714755584, 12288) took 1 seconds
page_free: kmem_free(kmem_arena, 18446741877714755584, 12288) took 1 seconds
keg_free_slab: keg->uk_freef(mem) {page_free} took 1 seconds

keg_drain: while-keg_free_slab-loop took 14 seconds [20021 loops, 14 slow calls]
zone_drain_wait: zone_foreach_keg(zone, &keg_drain) took 14 seconds
zone_dtor: zone_drain_wait(zone, M_WAITOK) took 14 seconds
zone_free_item(zone=UMA Zones): zone->uz_dtor() took 14 seconds
uma_zdestroy(zio_buf_12288) took 14 seconds
kmem_cache_destroy: uma_zdestroy(0xfffff803467c8ac0) [zio_buf_12288] took 14 seconds
kmem_cache_destroy(zio_buf_cache[20]) took 14 seconds

Called from kern_shutdown() -> EVENTHANDLER_INVOKE(shutdown_post_sync) -> zfsshutdown()


Ie, ~14 out of 20021 calls to keg_free_slab() takes 1 second instead of executing really quickly (in this case).
But some kmem_cache:s are much bigger causing delays for 300-600 seconds (or more).

(I use "time_second" for time measurements, should probably use something with better granularity for the top 4 calls :-)

I added a sysctl() kern.shutdown.verbose that I can set to a number to make it be more verbose (and added a lot of printf()s to get this info)... With it I now can see the number of filesystems being unmounted (since that too can take a little while - nothing close to the times above though).

- Peter
Comment 1 Mark Johnston freebsd_committer 2019-12-04 15:58:18 UTC
My naive guess is that we're either spending this time in the pmap_invalidate_all() call or in the pmap_delayed_invl_finished() call.  Can you determine whether this is the case?  Do you have PTI enabled (vm.pmap.pti)?
Comment 2 Konstantin Belousov freebsd_committer 2019-12-04 16:15:40 UTC
(In reply to Mark Johnston from comment #1)
This is 11.3, right ?  How any CPUs do you have ?

How do you measure that pmap_remove() takes 1sec ?

Please recompile with INVARIANTS at least and see if anything pops up.
Comment 3 Peter Eriksson 2019-12-04 17:00:37 UTC
(In reply to Mark Johnston from comment #1)

# sysctl vm.pmap
vm.pmap.pdpe.demotions: 3
vm.pmap.pde.promotions: 18422
vm.pmap.pde.p_failures: 1050
vm.pmap.pde.mappings: 11065
vm.pmap.pde.demotions: 1546
vm.pmap.allow_2m_x_ept: 1
vm.pmap.pcid_save_cnt: 2609830
vm.pmap.pti: 1
vm.pmap.invpcid_works: 1
vm.pmap.pcid_enabled: 1
vm.pmap.pg_ps_enabled: 1
vm.pmap.pat_works: 1

As for the questsions from Konstantin:

CPU: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz (2100.04-MHz K8-class CPU)
FreeBSD/SMP: 2 package(s) x 8 core(s) x 2 hardware threads

hw.ncpu: 32

FreeBSD 11.3-RELEASE-p5


The measurements are very simple, one example from sys/vm/vm_kern.c

> extern int verbose_shutdown; /* sysctl kern.shutdown.verbose, defined elsewhere */
>
> void
> kmem_unback(vm_object_t object, vm_offset_t addr, vm_size_t size)
> {
>        vm_page_t m, next;
>        vm_offset_t end, offset;
>        time_t t0;
>        int dt;
>
>        KASSERT(object == kmem_object || object == kernel_object,
>            ("kmem_unback: only supports kernel objects."));
>
>        if (verbose_shutdown > 3)
>          t0 = time_second;
>
>        pmap_remove(kernel_pmap, addr, addr + size);
>
>        if (verbose_shutdown > 3) {
>          dt = time_second - t0;
>          if (dt > 0)
>            printf("kmem_unback: pmap_remove(kernel_pmap, %lu, %lu) took %d seconds\n",
>                   (unsigned long) addr, (unsigned long) addr+size, dt);
>          t0 = time_second;
>        }

The printed messages when dt is 1 (or more, but that hasn't happened yet) perfectly matches with the time spent (when measured "higher up" in the call chain and measured by a wall clock).

(The resultion of time_second is seconds so probably should change to something with a bit higher resolution if I'd really wanna know how long time pmap_remove() takes)
 
I haven't tested an INVARIANTS build yet, but will do that...
Comment 4 Peter Eriksson 2019-12-04 17:23:12 UTC
(In reply to Peter Eriksson from comment #3)

Before someone asks - yes, in theory since I'm using the "if (time_second-t0>0)" check, then it _could_ be just coincidence and that the clock steps 1s just between the two reads of time_second. However, it is consistent (all 4 levels of calls return 1 (and they read the time spent separately) - and I see it printed 14 times), and the loop in keg_drain reports 14 seconds - so I doubt that it would be the case. Not all the time atleast... :-)

> kmem_unback: pmap_remove(kernel_pmap, 18446741877714755584, 18446741877714767872) took 1 seconds
> kmem_free: kmem_unback(kmem_object, 18446741877714755584, 12288) took 1 seconds
> page_free: kmem_free(kmem_arena, 18446741877714755584, 12288) took 1 seconds
> keg_free_slab: keg->uk_freef(mem) {page_free} took 1 seconds
(The four lines above is repeated 14 times, with different addresses of course)

> keg_drain: while-keg_free_slab-loop took 14 seconds [20021 loops, 14 slow calls]
Comment 5 Konstantin Belousov freebsd_committer 2019-12-04 17:30:09 UTC
I am sure that you are detecting the edge of second and not the actual sleeping place.  It probably something that happens once in second (some callout) which allows the system to make the progress.

Use precise timing functions like getnanouptime() to find the slow calls.
Comment 6 Peter Eriksson 2019-12-05 15:31:02 UTC
(In reply to Konstantin Belousov from comment #5)

Yes, you are correct (and I was wrong). Sorry about that. 

I've rewritten the timing tests now to use nanouptime() instead, and now things look slightly different. Albeit still takes a long time...

Here's a sample from a server that has been up for some hours:

       keg_free_slab: keg->uk_freef(mem) {page_free} took 2077 µs
       keg_free_slab: keg->uk_freef(mem) {page_free} took 2123 µs
       keg_free_slab: keg->uk_freef(mem) {page_free} took 2107 µs
       keg_free_slab: keg->uk_freef(mem) {page_free} took 2190 µs
      keg_drain: while()-keg_free_slab-loop took 163 s (163765748 µs) [240297 loops, 4 slow, avg=681 µs, min=246 µs, max=7922 µs]
     zone_drain_wait(): zone_foreach_keg(zone, &keg_drain) took 163 s
    zone_dtor(): zone_drain_wait(zone, M_WAITOK) took 163 s
   zone_free_item(zone=UMA Zones): zone->uz_dtor() took 163 s
  uma_zdestroy(zio_buf_16384) took 163 s
 kmem_cache_destroy: uma_zdestroy(0xfffff80346601880) [zio_buf_16384] took 164 seconds
zio_fini: kmem_cache_destroy(zio_buf_cache[28]) took 164 seconds

keg_free_slab only prints the keg->uk_freef(mem) (calls page_free which calls kmem_free) call timings if it takes over 2000µs. Most of them are not shown, the majority seems to be around 1000-1500µs.


Anyway, I wonder if one couldn't just skip all this freeing of kernel memory at reboot/shutdown time. Perhaps it could conditionally be disabled via a sysctl setting? 

In order to test this I added such a sysctl and with the calls to kmem_cache_destroy() in zio_fini() disabled this machine reboots in under 1 minute instead of atleast 10 minutes (for a newly booted server). (The LSI SAS HBA shutdown and the ZFS unmounting takes the most of that time).


(I suppose the current code is needed for the case where you have dynamically loaded zfs and wishes to kldunload it (without rebooting) - or one will have a massive memory leak, but since we're rebooting almost immediately after this I don't really see why we have to spend 10-60 minutes freeing all this memory :-).

zio.fini() is in /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c
The rest are in /usr/src/sys/vm/uma_core.c & vm_kern.c


A slightly longer output (verbose-level 2 so skipping some timing measurements in keg_free_slab and below that might inflate the timing numbers) for reference from a freshly rebooted server:

Freeing ZFS ZIO caches:

keg_drain: while()-keg_free_slab-loop took 994 ms (993875 µs) [1450 loops, 0 slow, avg=685 µs, min=383 µs, max=1713 µs, 0 zero]
zone_drain_wait(): zone_foreach_keg(zone, &keg_drain) took 7319 µs
zone_dtor(): zone_drain_wait(zone, M_WAITOK) took 14 ms
uma_zdestroy(zio_buf_8192) took 19 ms

keg_drain: while()-keg_free_slab-loop took 15 s (15886587 µs) [23217 loops, 0 slow, avg=684 µs, min=279 µs, max=1745 µs, 0 zero]
zone_drain_wait(): zone_foreach_keg(zone, &keg_drain) took 15 s
zone_dtor(): zone_drain_wait(zone, M_WAITOK) took 15 s
uma_zdestroy(zio_buf_12288) took 15 s
kmem_cache_destroy: uma_zdestroy(0xfffff803465f6980) [zio_buf_12288] took 16 seconds
zio_fini: kmem_cache_destroy(zio_buf_cache[20]) took 16 seconds

keg_drain: while()-keg_free_slab-loop took 61 s (61698566 µs) [86029 loops, 11 slow, avg=717 µs, min=265 µs, max=2252 µs, 0 zero]
zone_drain_wait(): zone_foreach_keg(zone, &keg_drain) took 61 s
zone_dtor(): zone_drain_wait(zone, M_WAITOK) took 61 s
uma_zdestroy(zio_buf_16384) took 61 s
kmem_cache_destroy: uma_zdestroy(0xfffff803465f6880) [zio_buf_16384] took 62 seconds
zio_fini: kmem_cache_destroy(zio_buf_cache[28]) took 62 seconds

keg_drain: while()-keg_free_slab-loop took 87 s (86986351 µs) [123793 loops, 24 slow, avg=702 µs, min=297 µs, max=2398 µs, 0 zero]
zone_drain_wait(): zone_foreach_keg(zone, &keg_drain) took 87 s
zone_dtor(): zone_drain_wait(zone, M_WAITOK) took 87 s
uma_zdestroy(zio_buf_131072) took 87 s
kmem_cache_destroy: uma_zdestroy(0xfffff8034c904640) [zio_buf_131072] took 87 seconds
zio_fini: kmem_cache_destroy(zio_buf_cache[224]) took 87 seconds

keg_drain: while()-keg_free_slab-loop took 344 ms (5340963 µs) [7730 loops, 0 slow, avg=690 µs, min=371 µs, max=1739 µs, 0 zero]
zone_drain_wait(): zone_foreach_keg(zone, &keg_drain) took 357 ms
zone_dtor(): zone_drain_wait(zone, M_WAITOK) took 364 ms
uma_zdestroy(zio_data_buf_131072) took 370 ms
kmem_cache_destroy: uma_zdestroy(0xfffff8034c904600) [zio_data_buf_131072] took 6 seconds
zio_fini: kmem_cache_destroy(zio_data_buf_cache[224]) took 6 seconds
Comment 7 Peter Eriksson 2019-12-05 16:23:58 UTC
>      keg_free_slab: keg->uk_freef(mem) {page_free} took 2190 µs
>      keg_drain: while()-keg_free_slab-loop took 163 s (163765748 µs) [240297 loops, 4 slow, avg=681 µs, min=246 µs, max=7922 µs]

Just a FYI:

The "max" value should probably be subtracted with ~6000µs - calling printf() with a line of text seems to take about that time - so if I run the same test without printing anything in keg_free_slab() the max is around 2000µs instead.
Comment 8 Mark Johnston freebsd_committer 2019-12-08 00:22:21 UTC
(In reply to Peter Eriksson from comment #6)
Do your timings still indicate that most of the time is still spent in pmap_remove()?
Comment 9 Peter Eriksson 2019-12-08 23:46:45 UTC
(In reply to Mark Johnston from comment #8)

Yes and no... I don't see any really slow calls to pmap_remove() but they are slow enough that the whole delete-all-zfs-caches take 10min (for a newly rebooted test server with 24k filesystems), and more if I leave it up for a day or so.

Here's some timing output from a test run:

  zio_buf_512: 22221 slabs destroyed in 8651 µs [time per slab: min=0 µs, avg=0 µs, max=13 µs]
  zio_data_buf_512: 120 slabs destroyed in 81 µs [time per slab: min=0 µs, avg=0 µs, max=17 µs]
  zio_buf_1024: 6159 slabs destroyed in 2518 µs [time per slab: min=0 µs, avg=0 µs, max=14 µs]
  zio_data_buf_1024: 127 slabs destroyed in 46 µs [time per slab: min=0 µs, avg=0 µs, max=1 µs]
  zio_buf_1536: 12236 slabs destroyed in 4919 µs [time per slab: min=0 µs, avg=0 µs, max=2 µs]
  zio_data_buf_1536: 153 slabs destroyed in 55 µs [time per slab: min=0 µs, avg=0 µs, max=1 µs]
  zio_buf_2048: 12219 slabs destroyed in 4837 µs [time per slab: min=0 µs, avg=0 µs, max=14 µs]
  zio_data_buf_2048: 95 slabs destroyed in 38 µs [time per slab: min=0 µs, avg=0 µs, max=1 µs]
  zio_buf_2560: 54 slabs destroyed in 20 µs [time per slab: min=0 µs, avg=0 µs, max=0 µs]
  zio_data_buf_2560: 138 slabs destroyed in 53 µs [time per slab: min=0 µs, avg=0 µs, max=0 µs]
  zio_buf_3072: 39 slabs destroyed in 13 µs [time per slab: min=0 µs, avg=0 µs, max=0 µs]
  zio_data_buf_3072: 132 slabs destroyed in 49 µs [time per slab: min=0 µs, avg=0 µs, max=1 µs]
  zio_buf_3584: 18 slabs destroyed in 8305 ns [time per slab: min=0 µs, avg=0 µs, max=1 µs]
  zio_data_buf_3584: 77 slabs destroyed in 35 µs [time per slab: min=0 µs, avg=0 µs, max=1 µs]
  zio_buf_4096: 5459 slabs destroyed in 1615 µs [time per slab: min=0 µs, avg=0 µs, max=13 µs]
  zio_data_buf_4096: 872 slabs destroyed in 202 µs [time per slab: min=0 µs, avg=0 µs, max=0 µs]

  zio_buf_5120: 26 slabs destroyed in 29 ms [time per slab: min=555 µs, avg=1131 µs, max=1712 µs]
  zio_buf_5120: pmap_remove: n=26, tot=29348843 ns, min=552941 ns, avg=1128801 ns, max=1710594 ns

  zio_data_buf_5120: 96 slabs destroyed in 95 ms [time per slab: min=489 µs, avg=993 µs, max=1661 µs]
  zio_data_buf_5120: pmap_remove: n=96, tot=95210566 ns, min=488086 ns, avg=991776 ns, max=1658944 ns

  zio_buf_6144: 18 slabs destroyed in 16 ms [time per slab: min=546 µs, avg=935 µs, max=1352 µs]
  zio_buf_6144: pmap_remove: n=18, tot=16821582 ns, min=545705 ns, avg=934532 ns, max=1349915 ns
...
  zio_buf_12288: 23202 slabs destroyed in 15 s [time per slab: min=332 µs, avg=666 µs, max=2110 µs]
  zio_buf_12288: pmap_remove: n=23202, tot=15444385621 ns, min=331665 ns, avg=665648 ns, max=2108369 ns

  zio_data_buf_12288: 109 slabs destroyed in 78 ms [time per slab: min=483 µs, avg=715 µs, max=1593 µs]
  zio_data_buf_12288: pmap_remove: n=109, tot=77850972 ns, min=481789 ns, avg=714229 ns, max=1591361 ns

  zio_buf_14336: 35 slabs destroyed in 31 ms [time per slab: min=470 µs, avg=911 µs, max=1858 µs]
  zio_buf_14336: pmap_remove: n=35, tot=31853738 ns, min=468547 ns, avg=910106 ns, max=1856448 ns

  zio_data_buf_14336: 94 slabs destroyed in 67 ms [time per slab: min=432 µs, avg=722 µs, max=1570 µs]
  zio_data_buf_14336: pmap_remove: n=94, tot=67799962 ns, min=430979 ns, avg=721276 ns, max=1568713 ns

  zio_buf_16384: 86047 slabs destroyed in 61 s [time per slab: min=300 µs, avg=710 µs, max=2620 µs]
  zio_buf_16384: pmap_remove: n=86047, tot=61023321118 ns, min=298983 ns, avg=709185 ns, max=2618340 ns
...(many more buffers freed of various sizes)...

I only print the time for calls to pmap_remove() if there has been atleast one - so apparently for ZFS caches for sizes at 4096 and less pmap_remove() never gets called. I haven't analyzed that code.

The "time per slab" is the higher level code in keg_drain() that eventually calls pmap_remove(), I time both since I wanted to make sure the time wasn't spent elsewhere. Now, as an example - zio_buf_16384 at 86047 calls to pmap_remove at an average time of about 0.7 ms does add up to quite some time.... 61 seconds in that case.

If one could optimize pmap_remove() to be much faster then this code would gain a lot... Or perhaps rewrite the zfs cache handling stuff to allocate stuff differently so fewer pmap_remove() calls is needed...

Anyway I have a patch that completely skips spending all this time freeing these caches when rebooting now. And another that enables a bit more verbose rebooting so one can see that things are progressing and haven't stalled completely. 

I'll attach these patches to this bug when I've polished them a bit. The "skip freeing stuff" patch is very simple though - basically add an "if (rebooting) return;" first to the zio_fini() function in /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c (I added a sysctl to control this behaviour also).
Comment 10 Mark Johnston freebsd_committer 2019-12-10 18:32:14 UTC
(In reply to Peter Eriksson from comment #9)
> I only print the time for calls to pmap_remove() if there has been atleast one - so apparently for ZFS caches for sizes at 4096 and less pmap_remove() never gets called. I haven't analyzed that code.

Right, if the size is <= PAGE_SIZE, the slab is permanently mapped into the direct map.  I believe ABD is the default for ZFS in 12 and later, in which case the ARC uses 4KB buffers in most cases.

pmap_remove() cannot easily be optimized: as part of its operation it must interrupt all other CPUs to cause them to invalidate some per-CPU caches, and wait for the invalidation to complete on all CPUs.  There are schemes where this operation can be performed lazily, but this requires some mechanism to defer reuse of the to-be-freed pages.

I wrote a patch that introduces a "shutdown" state in UMA, entered when the first shutdown_post_sync handler is called.  In this state uma_zdestroy() is a no-op.
Comment 11 Mark Johnston freebsd_committer 2019-12-10 18:33:05 UTC
Created attachment 209839 [details]
proposed diff

Patch attached, I only compile-tested it so far.
Comment 12 Mark Johnston freebsd_committer 2020-01-07 16:58:13 UTC
https://reviews.freebsd.org/D23066
Comment 13 commit-hook freebsd_committer 2020-01-09 19:18:19 UTC
A commit references this bug:

Author: markj
Date: Thu Jan  9 19:17:43 UTC 2020
New revision: 356563
URL: https://svnweb.freebsd.org/changeset/base/356563

Log:
  UMA: Don't destroy zones after the system shutdown process starts.

  Some kernel subsystems, notably ZFS, will destroy UMA zones from a
  shutdown eventhandler.  This causes the zone to be drained.  For slabs
  that are mapped into KVA this can be very expensive and so it needlessly
  delays the shutdown process.

  Add a new state to the "booted" variable, BOOT_SHUTDOWN.  Once
  kern_reboot() starts invoking shutdown handlers, turn uma_zdestroy()
  into a no-op, provided that the zone does not have a custom finalization
  routine.

  PR:		242427
  Reviewed by:	jeff, kib, rlibby
  MFC after:	2 weeks
  Sponsored by:	The FreeBSD Foundation
  Differential Revision:	https://reviews.freebsd.org/D23066

Changes:
  head/sys/vm/uma_core.c