Bug 230704 - All the memory eaten away by ZFS 'solaris' malloc
Status: Closed
Resolution: FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 11.2-RELEASE
Hardware: Any
OS: Any
Importance: --- Affects Some People
Assignee: Mark Johnston
Keywords: regression
Reported: 2018-08-17 13:33 UTC by Mark.Martinec
Modified: 2018-09-24 14:53 UTC
CC: 4 users

Attachments
proposed patch (2.07 KB, patch)
2018-09-14 15:56 UTC, Mark Johnston

Description Mark.Martinec 2018-08-17 13:33:29 UTC
Affected: ZFS on 11.2-RELEASE and 11.1-RELEASE-p11 (but not on 10.3).

Running commands like 'zpool list', 'zpool status' or 'zpool iostat'
on a defunct pool (with broken underlying disks) leaks memory.

When such a command runs frequently (e.g. invoked by a monitoring
tool like 'telegraf'), the system runs out of memory within a couple
of days: applications start swapping and eventually everything grinds
to a standstill, requiring a forced reboot.

After a few days, shortly before the freeze, 'vmstat -m' shows the
'solaris' malloc type approaching the total size of physical memory
(before that, the number grows steadily and linearly):

  $ vmstat -m
         Type InUse MemUse HighUse Requests  Size(s)
       solaris 39359484 2652696K       - 234986296  ...


How to repeat:

  # create a test pool on md
  mdconfig -a -t swap -s 1g
  gpart create -s gpt /dev/md0
  gpart add -t freebsd-zfs -a 4k /dev/md0
  zpool create test /dev/md0p1
  # destroy the disk underneath the pool, making it "unavailable"
  mdconfig -d -u 0 -o force

  # reboot the machine now (the trouble does not start until after a reboot)

  Now run 'zpool list' periodically, monitoring the growth of
  the 'solaris' malloc type:

  (while true; do zpool list >/dev/null; vmstat -m | \
     fgrep solaris; sleep 0.5; done) | awk '{print $2-a; a=$2}'
  12224540
  2509
  3121
  5022
  2507
  1834
  2508
  2505


As suggested by Mark Johnston, here is the dtrace output

  https://www.ijs.si/usr/mark/tmp/dtrace-cmd.out.bz2

from the following command:

  # dtrace -c "zpool list -Hp" -x temporal=off -n '
      dtmalloc::solaris:malloc
        /pid == $target/ {@allocs[stack(), args[3]] = count();}
      dtmalloc::solaris:free
        /pid == $target/ {@frees[stack(), args[3]] = count();}'

This records all allocations and frees from a single instance of
"zpool list".



Andriy Gapon wrote on the mailing list:

I see one memory leak, not sure if it's the only one.
It looks like vdev_geom_read_config() leaks all parsed vdev nvlists
but the last.  The problem seems to come from r316760.  Before that
commit the function would return upon finding the first valid config,
but now it keeps iterating.

The memory leak should not be a problem when vdevs are probed
sufficiently rarely, but it appears that with an unhealthy pool the
probing can happen much more frequently (e.g., every time pools are
listed).
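
A condensed sketch of the leak pattern described above (an
illustration only, not the actual vdev_geom.c code; read_label(),
label_size() and config_passes_checks() are hypothetical stand-ins
for the label I/O and the pool_state/txg validation, while
nvlist_unpack()/nvlist_free() are the real nvpair API):

  static nvlist_t *
  read_config_sketch(void)
  {
      nvlist_t *config = NULL;
      int l;

      for (l = 0; l < VDEV_LABELS; l++) {
          /*
           * BUG: nvlist_unpack() stores a freshly allocated nvlist
           * in "config" on each iteration, overwriting the nvlist
           * parsed from the previous label without nvlist_free() --
           * every parsed config except the last one leaks.
           */
          if (nvlist_unpack(read_label(l), label_size(l), &config, 0) != 0)
              continue;

          if (!config_passes_checks(config)) {
              nvlist_free(config);
              config = NULL;
          }
      }
      /*
       * Secondary issue: if the last label fails the checks, NULL is
       * returned even when an earlier label held a valid config.
       */
      return (config);
  }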



The whole discussion leading to the above findings is on
the stable@freebsd.org mailing list, 2018-07-23 to 2018-08-14,
subject:
  "All the memory eaten away by ZFS 'solaris' malloc - on 11.1-R amd64"
Comment 1 Mark Linimon 2018-08-17 19:47:46 UTC
Apparently a regression in ZFS vs. 10.x.
Comment 2 Peter Eriksson 2018-08-17 20:20:54 UTC
Running zfsd also triggers this memory leak (or a similar one) when you have a defunct pool. It seems zfsd probes the pool fairly often as well... So a machine with 256 GB of RAM ran out of it in a day or so :)
Comment 3 Mark Johnston 2018-09-14 15:56:20 UTC
Created attachment 197091
proposed patch

Mark, are you still able to reproduce the problem?  Could you give the attached patch a try?
Comment 4 Mark.Martinec 2018-09-14 21:32:17 UTC
Thanks, great! Will try it on Monday (I'm on vacation now, with limited connectivity).
Comment 5 Mark.Martinec 2018-09-17 14:34:11 UTC
Tried it now (based on 11.2-RELEASE-p3). It is a major improvement:
the 'vmstat -m' InUse count for the 'solaris' malloc type came down
to 18 (from about 520 previously) per invocation of 'zpool list' on a
defunct pool:

# (while true; do zpool list stuff >/dev/null; vmstat -m | fgrep solaris; sleep 1; done) | awk '{print $2-a, $3; a=$2}'
41167 20851K
18 20854K
18 20856K
18 20859K
18 20861K
18 20863K
18 20866K
18 20868K
18 20870K
18 20873K
18 20875K
18 20878K
18 20880K
18 20882K
18 20885K

So instead of four days, this host should now stay up roughly 30 times as long (520/18 ≈ 29).
Comment 6 Mark Johnston 2018-09-17 15:11:27 UTC
(In reply to Mark.Martinec from comment #5)
This might be the result of zfs_dbgmsg().  If so, the memory usage will stop increasing once zfs_dbgmsg_size hits the 4MB limit.  Could you run 'zpool list -Hp' for a while and see if the memory usage stops increasing?
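
For context, a condensed sketch of why the growth plateaus:
zfs_dbgmsg() appends each message to an in-kernel list and trims the
oldest entries once the accumulated size exceeds zfs_dbgmsg_maxsize
(4 MB by default). Roughly this pattern (simplified; zdm_size() is a
hypothetical helper, and exact names vary by branch):

  mutex_enter(&zfs_dbgmsgs_lock);
  list_insert_tail(&zfs_dbgmsgs, zdm);    /* append newest message */
  zfs_dbgmsg_size += size;
  while (zfs_dbgmsg_size > zfs_dbgmsg_maxsize) {
      zfs_dbgmsg_t *old = list_remove_head(&zfs_dbgmsgs);

      /*
       * Old entries are freed as new ones arrive, so the "solaris"
       * MemUse levels off instead of growing without bound.
       */
      zfs_dbgmsg_size -= zdm_size(old);
      kmem_free(old, zdm_size(old));
  }
  mutex_exit(&zfs_dbgmsgs_lock);
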
Comment 7 Mark Johnston 2018-09-17 15:45:47 UTC
https://reviews.freebsd.org/D17202
Comment 8 Mark.Martinec 2018-09-17 15:47:33 UTC
> This might be the result of zfs_dbgmsg().  If so, the memory usage
> will stop increasing once zfs_dbgmsg_size hits the 4MB limit.
> Could you run 'zpool list -Hp' for a while and see if the memory usage
> stops increasing?

Indeed this seems to be the case.
The growth stopped when the MemUse on solaris reached about 5 MB:

18 20854K
18 20856K
18 20859K
[...]
18 26371K
18 26373K
18 26376K
18 26378K
18 26381K
18 26383K
18 26385K
18 26388K
18 26390K
-1 26390K
1 26390K
0 26390K
2 26390K
0 26390K
0 26390K
0 26390K
0 26390K
0 26390K
Comment 9 Mark Johnston 2018-09-17 15:48:41 UTC
(In reply to Mark.Martinec from comment #8)
Great, thank you!
Comment 10 commit-hook 2018-09-17 16:17:05 UTC
A commit references this bug:

Author: markj
Date: Mon Sep 17 16:16:58 UTC 2018
New revision: 338724
URL: https://svnweb.freebsd.org/changeset/base/338724

Log:
  Fix an nvpair leak in vdev_geom_read_config().

  Also change the behaviour slightly: instead of freeing "config" if the
  last nvlist doesn't pass the tests, return the last config that did pass
  those tests.  This matches the comment at the beginning of the function.

  PR:		230704
  Diagnosed by:	avg
  Reviewed by:	asomers, avg
  Tested by:	Mark Martinec <Mark.Martinec@ijs.si>
  Approved by:	re (gjb)
  MFC after:	1 week
  Sponsored by:	The FreeBSD Foundation
  Differential revision:  https://reviews.freebsd.org/D17202

Changes:
  head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
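
For comparison, the shape of the fix as a condensed sketch (again not
the committed diff; the hypothetical helpers are the same as in the
earlier sketch): unpack into a local nvlist, free the previously kept
config before replacing it, and keep the last config that passed the
checks even if a later label fails them:

  static nvlist_t *
  read_config_fixed_sketch(void)
  {
      nvlist_t *config = NULL;
      int l;

      for (l = 0; l < VDEV_LABELS; l++) {
          nvlist_t *cfg;

          if (nvlist_unpack(read_label(l), label_size(l), &cfg, 0) != 0)
              continue;

          if (!config_passes_checks(cfg)) {
              nvlist_free(cfg);
              continue;
          }

          /* Drop the config kept so far; nvlist_free(NULL) is a no-op. */
          nvlist_free(config);
          config = cfg;
      }
      /* The last config that passed the checks, or NULL. */
      return (config);
  }
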
Comment 11 commit-hook 2018-09-24 14:48:32 UTC
A commit references this bug:

Author: markj
Date: Mon Sep 24 14:48:27 UTC 2018
New revision: 338904
URL: https://svnweb.freebsd.org/changeset/base/338904

Log:
  MFC r338724:
  Fix an nvpair leak in vdev_geom_read_config().

  PR:	230704

Changes:
_U  stable/11/
  stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
Comment 12 commit-hook 2018-09-24 14:51:38 UTC
A commit references this bug:

Author: markj
Date: Mon Sep 24 14:50:45 UTC 2018
New revision: 338905
URL: https://svnweb.freebsd.org/changeset/base/338905

Log:
  MFC r338724:
  Fix an nvpair leak in vdev_geom_read_config().

  PR:	230704

Changes:
_U  stable/10/
  stable/10/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
Comment 13 Mark Johnston 2018-09-24 14:53:46 UTC
Thanks for your help in tracking this down, Mark.