Bug 144330 - [nfs] mbuf leakage in nfsd with zfs
Summary: [nfs] mbuf leakage in nfsd with zfs
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: Unspecified
Hardware: Any Any
Importance: Normal Affects Only Me
Assignee: freebsd-fs (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-02-26 21:40 UTC by Gerrit Kühn
Modified: 2010-04-29 19:22 UTC (History)
CC List: 0 users

See Also:


Attachments
svc.c.svc_getreq.patch (295 bytes, patch)
2010-03-21 13:43 UTC, Mikolaj Golub
no flags Details | Diff

Description Gerrit Kühn 2010-02-26 21:40:02 UTC
When serving NFS directories from ZFS storage, mbuf clusters keep being consumed but are never freed. See http://www.pmp.uni-hannover.de/test/Mitarbeiter/g_kuehn/data/mbuf.pdf as an example: the y-axis shows the total mbuf clusters reported by netstat -m, the x-axis is time in minutes. The flat region in the upper right corner corresponds to approx. 10 minutes with nfsd turned off.
Several others have reported similar problems (see the thread on -stable with the subject "mbuf leakage with nfs/zfs? (was: em0 freezes on ZFS server)").

Fix: 

Unknown, but urgently needed. :-)
How-To-Repeat: Serve NFS directories from ZFS storage with 8-STABLE and monitor mbuf usage.
Comment 1 Mark Linimon freebsd_committer freebsd_triage 2010-02-26 22:25:29 UTC
Responsible Changed
From-To: freebsd-bugs->freebsd-fs

Over to maintainer(s).
Comment 2 kkockro 2010-03-17 10:43:51 UTC
Hi,

same problem here.


netstat -m ( after 6 days )
140889/35376/176265 mbufs in use (current/cache/total)
139381/4689/144070/262144 mbuf clusters in use (current/cache/total/max)
137206/3594 mbuf+clusters out of packet secondary zone in use (current/cache)
2/245/247/131072 4k (page size) jumbo clusters in use (current/cache/total/max)
0/0/0/65536 9k jumbo clusters in use (current/cache/total/max)
0/0/0/32768 16k jumbo clusters in use (current/cache/total/max)
313992K/19202K/333194K bytes allocated to network (current/cache/total)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
0/0/0 sfbufs in use (current/peak/max)
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile
0 calls to protocol drain routines

It's an NFS-exported zpool (14 TB, 40 clients, no UDP mounts).

#uname -a
FreeBSD xxxx 8.0-STABLE FreeBSD 8.0-STABLE #0: Fri Feb 12 03:27:54 CET 2010     
root@xxxx:/usr/obj/usr/src/sys/GENERIC  amd64


#cat /boot/loader.conf
kern.maxdsiz="3G"
kern.maxssiz="2G"

vfs.zfs.prefetch_disable="1"
kern.ipc.nmbclusters="262144"
Comment 3 Mikolaj Golub 2010-03-21 13:43:21 UTC
The test setup was the pf rule

block out on $if from any to <nfs>

together with a loop that periodically added an address to the <nfs> table and removed it again:

while sleep 3; do 
    sudo pfctl -t nfs -vT add 10.0.0.217;
    sleep 2; 
    sudo pfctl -t nfs -vT show;
    sudo pfctl -t nfs -vT delete 10.0.0.217;
done

The idea was to drop NFS server responses to make the client resend requests,
so that the RPC replies would come from the replay cache. After that, mbuf
usage started to grow:

09:30: 2806/1589/4395 mbufs in use (current/cache/total)
10:00: 5397/1068/6465 mbufs in use (current/cache/total)
10:30: 7945/1760/9705 mbufs in use (current/cache/total)
11:00: 9560/1435/10995 mbufs in use (current/cache/total)
11:30: 10337/2113/12450 mbufs in use (current/cache/total)

Although it might be another issue than the one reported in this PR :-).

Reviewing rpc/svc.c:svc_getreq(), it looks to me like args are never freed
in the RS_DONE case. Shouldn't it be like in the attached patch?

Running the above test on the patched kernel, no growth has been observed
so far:

13:00: 1501/2219/3720 mbufs in use (current/cache/total)
13:30: 1514/2971/4485 mbufs in use (current/cache/total)
14:00: 1096/3389/4485 mbufs in use (current/cache/total)
14:30: 1107/3378/4485 mbufs in use (current/cache/total)
15:00: 1105/3380/4485 mbufs in use (current/cache/total)
15:30: 1105/3380/4485 mbufs in use (current/cache/total)

-- 
Mikolaj Golub
Comment 4 rmacklem 2010-03-22 00:12:22 UTC
On Sun, 21 Mar 2010, Mikolaj Golub wrote:

[good stuff snipped]
>
> Although it might be another issue than the one reported in this PR :-).
>

I think it's the same one, since disabling the replay cache made the
leak go away.

> Reviewing rpc/svc.c:svc_getreq(), it looks to me like args are never freed
> in the RS_DONE case. Shouldn't it be like in the attached patch?
>
Good catch!! It certainly looks like what would have caused the leak
to me. Since r_args has not been set to args for that case, svc_freereq()
wouldn't free args, just as you observed.
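For illustration, a minimal sketch of why that leaks (kernel context assumed; this
is not the verbatim sys/rpc/svc.c code, and the rq_args field name is inferred by
analogy with the rq_addr seen in the diff): svc_freereq() releases only what was
attached to the svc_req, so an args chain that was never stored there is lost.

/*
 * Simplified sketch, not the real sys/rpc/svc.c routine: the cleanup
 * path frees only what the request structure points at.
 */
static void
svc_freereq_sketch(struct svc_req *rqstp)
{
	if (rqstp->rq_addr != NULL)
		free(rqstp->rq_addr, M_SONAME);
	if (rqstp->rq_args != NULL)		/* stays NULL in the RS_DONE path */
		m_freem(rqstp->rq_args);
	/* ...plus release of the transport and the svc_req itself... */
}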

Hopefully Jeremy can test this, but I suspect you've found/fixed the
culprit.

Sorry, I can't remember if you are a committer? (If not, I'll try and
get dfr to review it and then get it committed.)

Again, good job, rick
ps: I was looking for a leak of the copy in the cache and didn't think
     of the request coming in.
Comment 5 rmacklem 2010-03-22 00:23:02 UTC
On Sun, 21 Mar 2010, Mikolaj Golub wrote:

>
> Reviewing rpc/svc.c:svc_getreq(), it looks to me like args are never freed
> in the RS_DONE case. Shouldn't it be like in the attached patch?
>
Oops, I meant to ask Daniel Braniss (not Jeremy) w.r.t testing the patch,
since he can easily reproduce the problem. Of course, I'd appreciate
anyone who can test it to do so and let us know how it goes.

Daniel, here's the patch just in case you didn't see Mikolaj's email.

rick
Mikolaj's patch:
--- sys/rpc/svc.c.orig	2010-03-21 10:17:20.000000000 +0200
+++ sys/rpc/svc.c	2010-03-21 10:20:05.000000000 +0200
@@ -819,6 +819,7 @@ svc_getreq(SVCXPRT *xprt, struct svc_req
  					free(r->rq_addr, M_SONAME);
  					r->rq_addr = NULL;
  				}
+				m_freem(args);
  				goto call_done;

  			default:
Comment 6 kkockro 2010-03-22 06:26:05 UTC
I will test it tonight on our ZFS storage servers.

kai


On Monday, 22 March 2010 01:23:02, Rick Macklem wrote:
> On Sun, 21 Mar 2010, Mikolaj Golub wrote:
> > Reviewing rpc/svc.c:svc_getreq(), it looks to me like args are never
> > freed in the RS_DONE case. Shouldn't it be like in the attached patch?
>
> Oops, I meant to ask Daniel Braniss (not Jeremy) w.r.t testing the patch,
> since he can easily reproduce the problem. Of course, I'd appreciate
> anyone who can test it to do so and let us know how it goes.
>
> Daniel, here's the patch just in case you didn't see Mikolaj's email.
>
> rick
> Mikolaj's patch:
> --- sys/rpc/svc.c.orig	2010-03-21 10:17:20.000000000 +0200
> +++ sys/rpc/svc.c	2010-03-21 10:20:05.000000000 +0200
> @@ -819,6 +819,7 @@ svc_getreq(SVCXPRT *xprt, struct svc_req
>   					free(r->rq_addr, M_SONAME);
>   					r->rq_addr = NULL;
>   				}
> +				m_freem(args);
>   				goto call_done;
>
>   			default:
>
Comment 7 danny 2010-03-22 11:09:53 UTC
> 
> 
> On Sun, 21 Mar 2010, Mikolaj Golub wrote:
> 
> >
> > Reviewing rpc/svc.c:svc_getreq(), it looks to me like args are never freed
> > in the RS_DONE case. Shouldn't it be like in the attached patch?
> >
> Oops, I meant to ask Daniel Braniss (not Jeremy) w.r.t testing the patch,
> since he can easily reproduce the problem. Of course, I'd appreciate
> anyone who can test it to do so and let us know how it goes.
> 
> Daniel, here's the patch just in case you didn't see Mikolaj's email.
> 
> rick
> Mikolaj's patch:
> --- sys/rpc/svc.c.orig	2010-03-21 10:17:20.000000000 +0200
> +++ sys/rpc/svc.c	2010-03-21 10:20:05.000000000 +0200
> @@ -819,6 +819,7 @@ svc_getreq(SVCXPRT *xprt, struct svc_req
>   					free(r->rq_addr, M_SONAME);
>   					r->rq_addr = NULL;
>   				}
> +				m_freem(args);
>   				goto call_done;
> 
>   			default:

well, it's much better!, but no cookies yet :-)

from comparing graphs in
	ftp://ftp.cs.huji.ac.il/users/danny/freebsd/mbuf-leak/
store-01-e.ps: a production server running newfsd - now up almost 20 days
	notice that the average used mbuf is below 1000!

store-02.ps: kernel without last patch, classic nfsd
	the leak is huge.

store-02++.ps: with latest patch
	the leak is much smaller but I see 2 issues:
		- the initial leap to over 2000, then a smaller leak.

could someone explain replay_prune() to me?

cheers,
	danny
Comment 8 rmacklem 2010-03-22 14:04:46 UTC
On Mon, 22 Mar 2010, Daniel Braniss wrote:

>
> well, it's much better!, but no cookies yet :-)
>

Well, that's good news. I'll try and get dfr to review it and then
commit it. Thanks Mikolaj, for finding this.

> from comparing graphs in
> 	ftp://ftp.cs.huji.ac.il/users/danny/freebsd/mbuf-leak/
> store-01-e.ps: a production server running newfsd - now up almost 20 days
> 	notice that the average used mbuf is below 1000!
>
> store-02.ps: kernel without last patch, classic nfsd
> 	the leak is huge.
>
> store-02++.ps: with latest patch
> 	the leak is much smaller but I see 2 issues:
> 		- the initial leap to over 2000, then a smaller leak.

The initial leap doesn't worry me. That's just a design constraint.
A slow leak after that is still a problem. (I might have seen the
slow leak in testing here. I'll poke at it and see if I can reproduce
that.)

>
> could someone explain replay_prune() to me?
>
I just looked at it and I think it does the following:
 	- when it thinks the cache is too big (either too many entries
           or too much mbuf data) it loops around until:
 		- no longer too much or can't free any more
                 (when an entry is free'd, rc_size and rc_count are
                  reduced)
           (the loop is from the end of the tailq, so it is freeing
            the least recently used entries)
 	- the test for rce_repmsg.rm_xid != 0 avoids freeing ones
           that are in progress, since rce_repmsg is all zeroed until
           the reply has been generated
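A self-contained sketch of that pruning logic, for illustration only; the struct
and field names below are stand-ins (the real code lives in sys/rpc/replay.c and
also m_freem()s the cached reply mbufs):

#include <sys/queue.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Stand-in types; only the rc_count/rc_size bookkeeping and the xid
 * check mirror the description above.
 */
struct replay_entry {
	TAILQ_ENTRY(replay_entry) rce_link;	/* LRU list linkage */
	uint32_t	rce_xid;		/* 0 until a reply has been cached */
	size_t		rce_size;		/* bytes of cached reply data */
};
TAILQ_HEAD(replay_list, replay_entry);

struct replay_cache_sketch {
	struct replay_list rc_all;		/* most recently used at the head */
	int	rc_count, rc_maxcount;
	size_t	rc_size, rc_maxsize;
};

static void
replay_prune_sketch(struct replay_cache_sketch *rc)
{
	struct replay_entry *rce;
	bool freed;

	do {
		if (rc->rc_count <= rc->rc_maxcount &&
		    rc->rc_size <= rc->rc_maxsize)
			return;			/* back under both limits */
		freed = false;
		/* Walk from the tail, i.e. the least recently used entries. */
		TAILQ_FOREACH_REVERSE(rce, &rc->rc_all, replay_list, rce_link) {
			if (rce->rce_xid == 0)
				continue;	/* reply not generated yet; skip */
			TAILQ_REMOVE(&rc->rc_all, rce, rce_link);
			rc->rc_count--;
			rc->rc_size -= rce->rce_size;
			free(rce);		/* the real code also frees the cached reply */
			freed = true;
			break;			/* restart from the tail after each free */
		}
	} while (freed);			/* stop when under the limits or nothing is freeable */
}

The restart-from-the-tail-after-each-free structure is also what Daniel asks
about in the next comment.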

I did notice that the call to replay_prune() from replay_setsize() does 
not lock the mutex before calling it, so it doesn't look smp safe to me 
for this case, but I doubt that would cause a slow leak. (I think this is
only called when the number of mbuf clusters in the kernel changes and
might cause a kernel crash if the tailq wasn't in a consistent state as
it rattled through the list in the loop.)

rick
Comment 9 danny 2010-03-22 15:04:40 UTC
> 
> 
> On Mon, 22 Mar 2010, Daniel Braniss wrote:
> 
> >
> > well, it's much better!, but no cookies yet :-)
> >
> 
> Well, that's good news. I'll try and get dfr to review it and then
> commit it. Thanks Mikolaj, for finding this.
> 
> > from comparing graphs in
> > 	ftp://ftp.cs.huji.ac.il/users/danny/freebsd/mbuf-leak/
> > store-01-e.ps: a production server running newfsd - now up almost 20 days
> > 	notice that the average used mbuf is below 1000!
> >
> > store-02.ps: kernel without last patch, classic nfsd
> > 	the leak is huge.
> >
> > store-02++.ps: with latest patch
> > 	the leak is much smaller but I see 2 issues:
> > 		- the initial leap to over 2000, then a smaller leak.
> 
> The initial leap doesn't worry me. That's just a design constraint.
yes, but new-nfsd does it better.

> A slow leak after that is still a problem. (I might have seen the
> slow leak in testing here. I'll poke at it and see if I can reproduce
> that.)

all I do is mount udp on a client and start a write process.

> 
> >
> > could someone explain replay_prune() to me?
> >
> I just looked at it and I think it does the following:
>  	- when it thinks the cache is too big (either too many entries
>            or too much mbuf data) it loops around until:
>  		- no longer too much or can't free any more
>                  (when an entry is free'd, rc_size and rc_count are
>                   reduced)
>            (the loop is from the end of the tailq, so it is freeing
>             the least recently used entries)
>  	- the test for rce_repmsg.rm_xid != 0 avoids freeing ones
>            that are in progress, since rce_repmsg is all zeroed until
>            the reply has been generated

thanks for the information, it's what I thought, but the coding made it look
as if something else could happen - why else restart the search of the queue
after each match?

> I did notice that the call to replay_prune() from replay_setsize() does 
> not lock the mutex before calling it, so it doesn't look smp safe to me 
> for this case, but I doubt that would cause a slow leak. (I think this is
> only called when the number of mbuf clusters in the kernel changes and
> might cause a kernel crash if the tailq wasn't in a consistent state as
> it rattled through the list in the loop.)
> 
there seems to be an NFSLOCK involved before calling replay_setsize ...

well, the server is a 2 cpu quad nehalem, so maybe I should try several 
clients ...

> rick
> 
btw, the new-nfsd has been running on a production server for almost 20 days
and all seems fine.

anyway, things are looking better,
cheers,
	danny
Comment 10 rmacklem 2010-03-23 00:48:50 UTC
On Mon, 22 Mar 2010, Daniel Braniss wrote:

[good stuff snipped]
>>
>> The initial leap doesn't worry me. That's just a design constraint.
> yes, but new-nfsd does it better.
>

It's the classic tradeoff between a generic tool and one designed for
a specific case. Because of quirks in NFSv4, the experimental server
has no choice but to use a replay cache designed specifically for it
and it knows assorted things about NFS. The one in sys/rpc/replay.c
doesn't know anything about NFS, so it will be less efficient w.r.t.
NFS.

>> A slow leak after that is still a problem. (I might have seen the
>> slow leak in testing here. I'll poke at it and see if I can reproduce
>> that.)
>
> all I do is mount udp on a client and start a write process.
>

I only have a FreeBSD client at this point, and it doesn't cause the
leak for nfsv3,udp for me here.

Doug Rabson pointed out that there would be a leak for the "default:"
case too, although didn't know if that would occur in practice.

So, maybe you could test this variant of the patch (just in case that
was the slow leak...):
--- rpc/svc.c.sav	2010-03-21 18:46:20.000000000 -0400
+++ rpc/svc.c	2010-03-22 19:00:17.000000000 -0400
@@ -819,9 +819,11 @@
  					free(r->rq_addr, M_SONAME);
  					r->rq_addr = NULL;
  				}
+				m_freem(args);
  				goto call_done;

  			default:
+				m_freem(args);
  				goto call_done;
  			}
  		}
> there seems to be an NFSLOCK involved before calling replay_setsize ...
>

Ah, thanks for pointing that out.

Thanks for the good testing. At least we're down to a slow leak..rick
Comment 11 danny 2010-03-23 08:34:33 UTC
> 
> 
> On Mon, 22 Mar 2010, Daniel Braniss wrote:
> 
> [good stuff snipped]
> I only have a FreeBSD client at this point, and it doesn't cause the
> leak for nfsv3,udp for me here.
my client is also FreeBSD 8.0, strange

> 
> Doug Rabson pointed out that there would be a leak for the "default:"
> case too, although didn't know if that would occur in practice.
it does! :-)

> 
> So, maybe you could test this variant of the patch (just in case that
> was the slow leak...):
> --- rpc/svc.c.sav	2010-03-21 18:46:20.000000000 -0400
> +++ rpc/svc.c	2010-03-22 19:00:17.000000000 -0400
> @@ -819,9 +819,11 @@
>   					free(r->rq_addr, M_SONAME);
>   					r->rq_addr = NULL;
>   				}
> +				m_freem(args);
>   				goto call_done;
> 
>   			default:
> +				m_freem(args);
>   				goto call_done;
>   			}
>   		}
that plugged it!
see
	ftp://ftp.cs.huji.ac.il/users/danny/freebsd/mbuf-leak/store-02+++.ps

[...]
> Thanks for the good testing. At least we're down to a slow leak..rick
thanks to you for taking time off of your retirement :-)

danny
Comment 12 rmacklem 2010-03-23 13:34:39 UTC
On Tue, 23 Mar 2010, Daniel Braniss wrote:

>> I only have a FreeBSD client at this point, and it doesn't cause the
>> leak for nfsv3,udp for me here.
> my client is also FreeBSD 8.0, strange
>

I was already using the patch below when I tested and couldn't see it,
so I guess it now appears that the patch works.

>>
>> Doug Rabson pointed out that there would be a leak for the "default:"
>> case too, although didn't know if that would occur in practice.
> it does! :-)
>
>>
>> So, maybe you could test this variant of the patch (just in case that
>> was the slow leak...):
>> --- rpc/svc.c.sav	2010-03-21 18:46:20.000000000 -0400
>> +++ rpc/svc.c	2010-03-22 19:00:17.000000000 -0400
>> @@ -819,9 +819,11 @@
>>   					free(r->rq_addr, M_SONAME);
>>   					r->rq_addr = NULL;
>>   				}
>> +				m_freem(args);
>>   				goto call_done;
>>
>>   			default:
>> +				m_freem(args);
>>   				goto call_done;
>>   			}
>>   		}
> that plugged it!
> see
> 	ftp://ftp.cs.huji.ac.il/users/danny/freebsd/mbuf-leak/store-02+++.ps

Good work with the testing. I'll get it committed and put it up on the
nfs patches page I have under http://people.freebsd.org/~rmacklem

> thanks to you for taking time off of your retirement :-)
>
I plan on doing quite a bit of FreeBSD/NFS stuff during it, rick
Comment 13 dfilter service freebsd_committer freebsd_triage 2010-03-23 23:03:52 UTC
Author: rmacklem
Date: Tue Mar 23 23:03:30 2010
New Revision: 205562
URL: http://svn.freebsd.org/changeset/base/205562

Log:
  When the regular NFS server replied to a UDP client out of the replay
  cache, it did not free the request argument mbuf list, resulting in a leak.
  This patch fixes that leak.
  
  Tested by:	danny AT cs.huji.ac.il
  PR:		kern/144330
  Submitted by:	to.my.trociny AT gmail.com (earlier version)
  Reviewed by:	dfr
  MFC after:	2 weeks

Modified:
  head/sys/rpc/svc.c

Modified: head/sys/rpc/svc.c
==============================================================================
--- head/sys/rpc/svc.c	Tue Mar 23 23:00:35 2010	(r205561)
+++ head/sys/rpc/svc.c	Tue Mar 23 23:03:30 2010	(r205562)
@@ -819,9 +819,11 @@ svc_getreq(SVCXPRT *xprt, struct svc_req
 					free(r->rq_addr, M_SONAME);
 					r->rq_addr = NULL;
 				}
+				m_freem(args);
 				goto call_done;
 
 			default:
+				m_freem(args);
 				goto call_done;
 			}
 		}
Comment 14 kkockro 2010-03-24 06:39:04 UTC
Hi,

after 3 days with the first patch ( FreeBSD 8-STABLE AMD64, old nfsd ):

netstat -m
5732/10528/16260 mbufs in use (current/cache/total)
4825/10131/14956/262144 mbuf clusters in use (current/cache/total/max)

It looks very fine. I'll patch with the final corrections and then test again.

But why do I have the same issue? I don't use UDP connections, only TCP?!

Greetings,
Kai


On Tuesday, 23 March 2010 14:34:39, Rick Macklem wrote:
> On Tue, 23 Mar 2010, Daniel Braniss wrote:
> >> I only have a FreeBSD client at this point, and it doesn't cause the
> >> leak for nfsv3,udp for me here.
> >
> > my client is also FreeBSD 8.0, strange
>
> I was already using the patch below when I tested and couldn't see it,
> so I guess it now appears that the patch works.
>
> >> Doug Rabson pointed out that there would be a leak for the "default:"
> >> case too, although didn't know if that would occur in practice.
> >
> > it does! :-)
> >
> >> So, maybe you could test this variant of the patch (just in case that
> >> was the slow leak...):
> >> --- rpc/svc.c.sav	2010-03-21 18:46:20.000000000 -0400
> >> +++ rpc/svc.c	2010-03-22 19:00:17.000000000 -0400
> >> @@ -819,9 +819,11 @@
> >>   					free(r->rq_addr, M_SONAME);
> >>   					r->rq_addr = NULL;
> >>   				}
> >> +				m_freem(args);
> >>   				goto call_done;
> >>
> >>   			default:
> >> +				m_freem(args);
> >>   				goto call_done;
> >>   			}
> >>   		}
> >
> > that plugged it!
> > see
> > 	ftp://ftp.cs.huji.ac.il/users/danny/freebsd/mbuf-leak/store-02+++.ps
>
> Good work with the testing. I'll get it committed and put it up on the
> nfs patches page I have under http://people.freebsd.org/~rmacklem
>
> > thanks to you for taking time off of your retirement :-)
>
> I plan on doing quite a bit of FreeBSD/NFS stuff during it, rick
>
Comment 15 rmacklem 2010-03-25 01:04:53 UTC
On Wed, 24 Mar 2010, Kai Kockro wrote:

> Hi,
>
> after 3 days with the first patch ( FreeBSD 8-STABLE AMD64, old nfsd ):
>
> netstat -m
> 5732/10528/16260 mbufs in use (current/cache/total)
> 4825/10131/14956/262144 mbuf clusters in use (current/cache/total/max)
>
> It looks very fine. I'll patch with the final corrections and then test again.
>

Sounds good. Thanks for letting us know.

> But why do I have the same issue? I don't use UDP connections, only TCP?!
>
Hmm, from what I can see, the replay cache is used for both UDP and
TCP in the regular NFS server. If I'm correct on that, any retry of
an RPC over TCP could cause the leak. (Retries of an RPC over TCP
are infrequent, with the likelihood differing between clients. I
can only guess that your TCP clients do retries?)

rick
Comment 16 dfilter service freebsd_committer freebsd_triage 2010-04-07 00:03:39 UTC
Author: rmacklem
Date: Tue Apr  6 23:03:20 2010
New Revision: 206288
URL: http://svn.freebsd.org/changeset/base/206288

Log:
  MFC: r205562
  When the regular NFS server replied to a UDP client out of the replay
  cache, it did not free the request argument mbuf list, resulting in a leak.
  This patch fixes that leak.
  
  PR:	kern/144330

Modified:
  stable/8/sys/rpc/svc.c
Directory Properties:
  stable/8/sys/   (props changed)
  stable/8/sys/amd64/include/xen/   (props changed)
  stable/8/sys/cddl/contrib/opensolaris/   (props changed)
  stable/8/sys/contrib/dev/acpica/   (props changed)
  stable/8/sys/contrib/pf/   (props changed)
  stable/8/sys/dev/xen/xenpci/   (props changed)

Modified: stable/8/sys/rpc/svc.c
==============================================================================
--- stable/8/sys/rpc/svc.c	Tue Apr  6 21:39:18 2010	(r206287)
+++ stable/8/sys/rpc/svc.c	Tue Apr  6 23:03:20 2010	(r206288)
@@ -819,9 +819,11 @@ svc_getreq(SVCXPRT *xprt, struct svc_req
 					free(r->rq_addr, M_SONAME);
 					r->rq_addr = NULL;
 				}
+				m_freem(args);
 				goto call_done;
 
 			default:
+				m_freem(args);
 				goto call_done;
 			}
 		}
Comment 17 Jaakko Heinonen freebsd_committer freebsd_triage 2010-04-29 19:22:50 UTC
State Changed
From-To: open->closed

Fixed in head and stable/8.