When serving NFS directories from ZFS storage, mbufs keep being drained but are never freed. See http://www.pmp.uni-hannover.de/test/Mitarbeiter/g_kuehn/data/mbuf.pdf as an example: the y-axis is the total mbuf clusters reported by netstat -m, the x-axis is time in minutes. The flat region in the upper right corner corresponds to approx. 10 min with nfsd turned off. Several others report similar problems (see the thread on -stable with the subject "mbuf leakage with nfs/zfs? (was: em0 freezes on ZFS server)").

Fix: Unknown, but urgently needed. :-)

How-To-Repeat: Serve NFS directories from ZFS storage with 8-stable and monitor mbuf usage.
Responsible Changed From-To: freebsd-bugs->freebsd-fs

Over to maintainer(s).
Hi,

same problem here. netstat -m (after 6 days):

140889/35376/176265 mbufs in use (current/cache/total)
139381/4689/144070/262144 mbuf clusters in use (current/cache/total/max)
137206/3594 mbuf+clusters out of packet secondary zone in use (current/cache)
2/245/247/131072 4k (page size) jumbo clusters in use (current/cache/total/max)
0/0/0/65536 9k jumbo clusters in use (current/cache/total/max)
0/0/0/32768 16k jumbo clusters in use (current/cache/total/max)
313992K/19202K/333194K bytes allocated to network (current/cache/total)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
0/0/0 sfbufs in use (current/peak/max)
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile
0 calls to protocol drain routines

It's an NFS-exported zpool (14 TB, 40 clients, no UDP mounts).

# uname -a
FreeBSD xxxx 8.0-STABLE FreeBSD 8.0-STABLE #0: Fri Feb 12 03:27:54 CET 2010 root@xxxx:/usr/obj/usr/src/sys/GENERIC amd64

# cat /boot/loader.conf
kern.maxdsiz="3G"
kern.maxssiz="2G"
vfs.zfs.prefetch_disable="1"
kern.ipc.nmbclusters="262144"
To reproduce, I used the following pf rule and loop:

block out on $if from any to <nfs>

while sleep 3; do
    sudo pfctl -t nfs -vT add 10.0.0.217
    sleep 2
    sudo pfctl -t nfs -vT show
    sudo pfctl -t nfs -vT delete 10.0.0.217
done

The idea was to drop NFS server responses so that the client resends requests and the RPC reply comes from the cache. And mbuf usage growth started to be observed:

09:30: 2806/1589/4395 mbufs in use (current/cache/total)
10:00: 5397/1068/6465 mbufs in use (current/cache/total)
10:30: 7945/1760/9705 mbufs in use (current/cache/total)
11:00: 9560/1435/10995 mbufs in use (current/cache/total)
11:30: 10337/2113/12450 mbufs in use (current/cache/total)

Although it might be another issue than the one reported in this PR :-).

Reviewing rpc/svc.c:svc_getreq(), it looks to me like args are never freed in the RS_DONE case. Shouldn't it be like in the attached patch?

Running the above test on the patched kernel, the growth has not been observed so far:

13:00: 1501/2219/3720 mbufs in use (current/cache/total)
13:30: 1514/2971/4485 mbufs in use (current/cache/total)
14:00: 1096/3389/4485 mbufs in use (current/cache/total)
14:30: 1107/3378/4485 mbufs in use (current/cache/total)
15:00: 1105/3380/4485 mbufs in use (current/cache/total)
15:30: 1105/3380/4485 mbufs in use (current/cache/total)

-- 
Mikolaj Golub
On Sun, 21 Mar 2010, Mikolaj Golub wrote:

[good stuff snipped]
>
> Although it might be another issue than the one reported in this PR :-).
>
I think it's the same one, since disabling the replay cache made the leak go away.

> Reviewing rpc/svc.c:svc_getreq(), it looks to me like args are never freed
> in the RS_DONE case. Shouldn't it be like in the attached patch?
>
Good catch!! It certainly looks like what would have caused the leak to me. Since r_args has not been set to args for that case, svc_freereq() wouldn't free args, just as you observed.

Hopefully Jeremy can test this, but I suspect you've found/fixed the culprit. Sorry, I can't remember if you are a committer? (If not, I'll try and get dfr to review it and then get it committed.)

Again, good job, rick

ps: I was looking for a leak of the copy in the cache and didn't think of the request coming in.
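[For reference: below is a condensed sketch of the replay-cache branch of svc_getreq() that Mikolaj and Rick are discussing. It is a paraphrase for illustration only, not the verbatim sys/rpc/svc.c source; surrounding declarations and error handling are omitted, and the comments restate the ownership argument made above.]

    switch (rs) {       /* rs is the result of replay_find() */
    case RS_NEW:
        /*
         * Not in the cache: fall through and dispatch the call.  On
         * this path args is attached to the request (the field Rick
         * calls r_args), so svc_freereq() frees it later.
         */
        break;

    case RS_DONE:
        /* Already answered: just resend the cached reply. */
        SVC_REPLY(xprt, &repmsg, r->rq_addr, repbody);
        if (r->rq_addr) {
            free(r->rq_addr, M_SONAME);
            r->rq_addr = NULL;
        }
        /*
         * args was never attached to the request here, so
         * svc_freereq() at call_done has nothing to free and the
         * mbuf chain leaks -- hence Mikolaj's added m_freem().
         */
        m_freem(args);
        goto call_done;

    default:
        /*
         * The remaining replay states have the same ownership gap;
         * the follow-up patch later in this thread adds the same
         * m_freem() here.
         */
        m_freem(args);
        goto call_done;
    }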
On Sun, 21 Mar 2010, Mikolaj Golub wrote:
>
> Reviewing rpc/svc.c:svc_getreq(), it looks to me like args are never freed
> in the RS_DONE case. Shouldn't it be like in the attached patch?
>
Oops, I meant to ask Daniel Braniss (not Jeremy) w.r.t. testing the patch, since he can easily reproduce the problem. Of course, I'd appreciate anyone who can test it to do so and let us know how it goes.

Daniel, here's the patch just in case you didn't see Mikolaj's email.

rick

Mikolaj's patch:

--- sys/rpc/svc.c.orig	2010-03-21 10:17:20.000000000 +0200
+++ sys/rpc/svc.c	2010-03-21 10:20:05.000000000 +0200
@@ -819,6 +819,7 @@ svc_getreq(SVCXPRT *xprt, struct svc_req
                 free(r->rq_addr, M_SONAME);
                 r->rq_addr = NULL;
             }
+            m_freem(args);
             goto call_done;
 
         default:
I will test it tonight on our ZFS storages.

kai

On Monday, 22 March 2010 01:23:02, Rick Macklem wrote:
> On Sun, 21 Mar 2010, Mikolaj Golub wrote:
> > Reviewing rpc/svc.c:svc_getreq(), it looks to me like args are never freed
> > in the RS_DONE case. Shouldn't it be like in the attached patch?
>
> Oops, I meant to ask Daniel Braniss (not Jeremy) w.r.t. testing the patch,
> since he can easily reproduce the problem. Of course, I'd appreciate
> anyone who can test it to do so and let us know how it goes.
>
> Daniel, here's the patch just in case you didn't see Mikolaj's email.
>
> rick
> Mikolaj's patch:
> --- sys/rpc/svc.c.orig	2010-03-21 10:17:20.000000000 +0200
> +++ sys/rpc/svc.c	2010-03-21 10:20:05.000000000 +0200
> @@ -819,6 +819,7 @@ svc_getreq(SVCXPRT *xprt, struct svc_req
>                  free(r->rq_addr, M_SONAME);
>                  r->rq_addr = NULL;
>              }
> +            m_freem(args);
>              goto call_done;
>
>          default:
> On Sun, 21 Mar 2010, Mikolaj Golub wrote:
> > Reviewing rpc/svc.c:svc_getreq(), it looks to me like args are never freed
> > in the RS_DONE case. Shouldn't it be like in the attached patch?
>
> Oops, I meant to ask Daniel Braniss (not Jeremy) w.r.t. testing the patch,
> since he can easily reproduce the problem. Of course, I'd appreciate
> anyone who can test it to do so and let us know how it goes.
>
> Daniel, here's the patch just in case you didn't see Mikolaj's email.
>
> rick
> Mikolaj's patch:
> --- sys/rpc/svc.c.orig	2010-03-21 10:17:20.000000000 +0200
> +++ sys/rpc/svc.c	2010-03-21 10:20:05.000000000 +0200
> @@ -819,6 +819,7 @@ svc_getreq(SVCXPRT *xprt, struct svc_req
>                  free(r->rq_addr, M_SONAME);
>                  r->rq_addr = NULL;
>              }
> +            m_freem(args);
>              goto call_done;
>
>          default:

well, it's much better!, but no cookies yet :-)

From comparing graphs in ftp://ftp.cs.huji.ac.il/users/danny/freebsd/mbuf-leak/:

store-01-e.ps: a production server running the new nfsd - now up almost 20 days.
  Notice that the average number of used mbufs is below 1000!

store-02.ps: kernel without the last patch, classic nfsd.
  The leak is huge.

store-02++.ps: with the latest patch.
  The leak is much smaller, but I see 2 issues:
  - the initial leap to over 2000, then a smaller leak.

Could someone explain replay_prune() to me?

cheers,
	danny
On Mon, 22 Mar 2010, Daniel Braniss wrote:
>
> well, it's much better!, but no cookies yet :-)
>
Well, that's good news. I'll try and get dfr to review it and then commit it. Thanks, Mikolaj, for finding this.

> From comparing graphs in
> ftp://ftp.cs.huji.ac.il/users/danny/freebsd/mbuf-leak/:
> store-01-e.ps: a production server running the new nfsd - now up almost 20 days.
>   Notice that the average number of used mbufs is below 1000!
>
> store-02.ps: kernel without the last patch, classic nfsd.
>   The leak is huge.
>
> store-02++.ps: with the latest patch.
>   The leak is much smaller, but I see 2 issues:
>   - the initial leap to over 2000, then a smaller leak.

The initial leap doesn't worry me. That's just a design constraint. A slow leak after that is still a problem. (I might have seen the slow leak in testing here. I'll poke at it and see if I can reproduce that.)

> Could someone explain replay_prune() to me?
>
I just looked at it and I think it does the following:
- when it thinks the cache is too big (either too many entries
  or too much mbuf data) it loops around until:
  - no longer too much or can't free any more
    (when an entry is free'd, rc_size and rc_count are reduced)
    (the loop is from the end of the tailq, so it is freeing
    the least recently used entries)
  - the test for rce_repmsg.rm_xid != 0 avoids freeing ones
    that are in progress, since rce_repmsg is all zeroed until
    the reply has been generated

I did notice that the call to replay_prune() from replay_setsize() does not lock the mutex before calling it, so it doesn't look SMP safe to me for this case, but I doubt that would cause a slow leak. (I think this is only called when the number of mbuf clusters in the kernel changes and might cause a kernel crash if the tailq wasn't in a consistent state as it rattled through the list in the loop.)

rick
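[For reference: a simplified sketch of the pruning loop Rick describes above. It is a paraphrase for illustration, not the verbatim sys/rpc/replay.c source; identifiers not mentioned in the mail (replay_cache_entry, rc_all, rce_alllink, rc_maxsize, REPLAY_MAX, replay_free) are assumed names.]

static void
replay_prune(struct replay_cache *rc)
{
    struct replay_cache_entry *rce;
    bool_t freed_one;

    /* Nothing to do unless there are too many entries or too much
     * cached mbuf data. */
    if (rc->rc_count < REPLAY_MAX && rc->rc_size <= rc->rc_maxsize)
        return;

    do {
        freed_one = FALSE;
        /*
         * Walk from the tail of the tailq (least recently used) and
         * free the first completed entry.  rce_repmsg.rm_xid == 0
         * means the reply is still being generated, so such entries
         * are skipped.
         */
        TAILQ_FOREACH_REVERSE(rce, &rc->rc_all, replay_cache_list,
            rce_alllink) {
            if (rce->rce_repmsg.rm_xid != 0) {
                /* Freeing the entry reduces rc_count and rc_size. */
                replay_free(rc, rce);
                freed_one = TRUE;
                break;
            }
        }
        /* The walk restarts from the tail after every free, which is
         * the pattern Daniel asks about below. */
    } while (freed_one &&
        (rc->rc_count >= REPLAY_MAX || rc->rc_size > rc->rc_maxsize));
}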
> On Mon, 22 Mar 2010, Daniel Braniss wrote:
> > well, it's much better!, but no cookies yet :-)
>
> Well, that's good news. I'll try and get dfr to review it and then
> commit it. Thanks, Mikolaj, for finding this.
>
> > From comparing graphs in
> > ftp://ftp.cs.huji.ac.il/users/danny/freebsd/mbuf-leak/:
> > store-01-e.ps: a production server running the new nfsd - now up almost 20 days.
> >   Notice that the average number of used mbufs is below 1000!
> >
> > store-02.ps: kernel without the last patch, classic nfsd.
> >   The leak is huge.
> >
> > store-02++.ps: with the latest patch.
> >   The leak is much smaller, but I see 2 issues:
> >   - the initial leap to over 2000, then a smaller leak.
>
> The initial leap doesn't worry me. That's just a design constraint.

yes, but the new nfsd does it better.

> A slow leak after that is still a problem. (I might have seen the
> slow leak in testing here. I'll poke at it and see if I can reproduce
> that.)

all I do is mount udp on a client and start a write process.

> > Could someone explain replay_prune() to me?
> >
> I just looked at it and I think it does the following:
> - when it thinks the cache is too big (either too many entries
>   or too much mbuf data) it loops around until:
>   - no longer too much or can't free any more
>     (when an entry is free'd, rc_size and rc_count are reduced)
>     (the loop is from the end of the tailq, so it is freeing
>     the least recently used entries)
>   - the test for rce_repmsg.rm_xid != 0 avoids freeing ones
>     that are in progress, since rce_repmsg is all zeroed until
>     the reply has been generated

thanks for the information, it's what I thought, but the coding made it look as if something else could happen - why else start the search of the queue after each match?

> I did notice that the call to replay_prune() from replay_setsize() does
> not lock the mutex before calling it, so it doesn't look SMP safe to me
> for this case, but I doubt that would cause a slow leak. (I think this is
> only called when the number of mbuf clusters in the kernel changes and
> might cause a kernel crash if the tailq wasn't in a consistent state as
> it rattled through the list in the loop.)

there seems to be an NFSLOCK involved before calling replay_setsize ... well, the server is a 2-CPU quad Nehalem, so maybe I should try several clients ...

> rick

btw, the new nfsd has been running on a production server for almost 20 days and all seems fine.

anyways, things are looking better,
cheers,
	danny
On Mon, 22 Mar 2010, Daniel Braniss wrote:

[good stuff snipped]
>>
>> The initial leap doesn't worry me. That's just a design constraint.
> yes, but the new nfsd does it better.
>
It's the classic tradeoff between a generic tool and one designed for a specific case. Because of quirks in NFSv4, the experimental server has no choice but to use a replay cache designed specifically for it, and it knows assorted things about NFS. The one in sys/rpc/replay.c doesn't know anything about NFS, so it will be less efficient w.r.t. NFS.

>> A slow leak after that is still a problem. (I might have seen the
>> slow leak in testing here. I'll poke at it and see if I can reproduce
>> that.)
>
> all I do is mount udp on a client and start a write process.
>
I only have a FreeBSD client at this point, and it doesn't cause the leak for nfsv3,udp for me here.

Doug Rabson pointed out that there would be a leak for the "default:" case too, although he didn't know whether it would occur in practice. So, maybe you could test this variant of the patch (just in case that was the slow leak...):

--- rpc/svc.c.sav	2010-03-21 18:46:20.000000000 -0400
+++ rpc/svc.c	2010-03-22 19:00:17.000000000 -0400
@@ -819,9 +819,11 @@
                 free(r->rq_addr, M_SONAME);
                 r->rq_addr = NULL;
             }
+            m_freem(args);
             goto call_done;
 
         default:
+            m_freem(args);
             goto call_done;
         }
     }

> there seems to be an NFSLOCK involved before calling replay_setsize ...
>
Ah, thanks for pointing that out.

Thanks for the good testing. At least we're down to a slow leak..rick
> On Mon, 22 Mar 2010, Daniel Braniss wrote:
>
> [good stuff snipped]
>
> I only have a FreeBSD client at this point, and it doesn't cause the
> leak for nfsv3,udp for me here.

my client is also FreeBSD 8.0, strange

> Doug Rabson pointed out that there would be a leak for the "default:"
> case too, although he didn't know whether it would occur in practice.

it does! :-)

> So, maybe you could test this variant of the patch (just in case that
> was the slow leak...):
> --- rpc/svc.c.sav	2010-03-21 18:46:20.000000000 -0400
> +++ rpc/svc.c	2010-03-22 19:00:17.000000000 -0400
> @@ -819,9 +819,11 @@
>                  free(r->rq_addr, M_SONAME);
>                  r->rq_addr = NULL;
>              }
> +            m_freem(args);
>              goto call_done;
>
>          default:
> +            m_freem(args);
>              goto call_done;
>          }
>      }

that plugged it!
see ftp://ftp.cs.huji.ac.il/users/danny/freebsd/mbuf-leak/store-02+++.ps

[...]

> Thanks for the good testing. At least we're down to a slow leak..rick

thanks to you for taking time off of your retirement :-)

	danny
On Tue, 23 Mar 2010, Daniel Braniss wrote:

>> I only have a FreeBSD client at this point, and it doesn't cause the
>> leak for nfsv3,udp for me here.
> my client is also FreeBSD 8.0, strange
>
I was already using the patch below when I tested and couldn't see it, so I guess it now appears that the patch works.

>> Doug Rabson pointed out that there would be a leak for the "default:"
>> case too, although he didn't know whether it would occur in practice.
> it does! :-)
>
>> So, maybe you could test this variant of the patch (just in case that
>> was the slow leak...):
>> --- rpc/svc.c.sav	2010-03-21 18:46:20.000000000 -0400
>> +++ rpc/svc.c	2010-03-22 19:00:17.000000000 -0400
>> @@ -819,9 +819,11 @@
>>                  free(r->rq_addr, M_SONAME);
>>                  r->rq_addr = NULL;
>>              }
>> +            m_freem(args);
>>              goto call_done;
>>
>>          default:
>> +            m_freem(args);
>>              goto call_done;
>>          }
>>      }
> that plugged it!
> see
> ftp://ftp.cs.huji.ac.il/users/danny/freebsd/mbuf-leak/store-02+++.ps

Good work with the testing. I'll get it committed and put it up on the NFS patches page I have under http://people.freebsd.org/~rmacklem

> thanks to you for taking time off of your retirement :-)
>
I plan on doing quite a bit of FreeBSD/NFS stuff during it, rick
Author: rmacklem
Date: Tue Mar 23 23:03:30 2010
New Revision: 205562

URL: http://svn.freebsd.org/changeset/base/205562

Log:
  When the regular NFS server replied to a UDP client out of the replay
  cache, it did not free the request argument mbuf list, resulting in a
  leak. This patch fixes that leak.

  Tested by:	danny AT cs.huji.ac.il
  PR:		kern/144330
  Submitted by:	to.my.trociny AT gmail.com (earlier version)
  Reviewed by:	dfr
  MFC after:	2 weeks

Modified:
  head/sys/rpc/svc.c

Modified: head/sys/rpc/svc.c
==============================================================================
--- head/sys/rpc/svc.c	Tue Mar 23 23:00:35 2010	(r205561)
+++ head/sys/rpc/svc.c	Tue Mar 23 23:03:30 2010	(r205562)
@@ -819,9 +819,11 @@ svc_getreq(SVCXPRT *xprt, struct svc_req
                 free(r->rq_addr, M_SONAME);
                 r->rq_addr = NULL;
             }
+            m_freem(args);
             goto call_done;
 
         default:
+            m_freem(args);
             goto call_done;
         }
     }
Hi,

after 3 days with the first patch (FreeBSD 8-STABLE AMD64, old nfsd):

netstat -m
5732/10528/16260 mbufs in use (current/cache/total)
4825/10131/14956/262144 mbuf clusters in use (current/cache/total/max)

It looks very fine. I'll patch with the final corrections and then test again. But why do I have the same issues? I don't use UDP connects, only TCP?!

Greetings,
Kai

On Tuesday, 23 March 2010 14:34:39, Rick Macklem wrote:
> On Tue, 23 Mar 2010, Daniel Braniss wrote:
> >> I only have a FreeBSD client at this point, and it doesn't cause the
> >> leak for nfsv3,udp for me here.
> >
> > my client is also FreeBSD 8.0, strange
>
> I was already using the patch below when I tested and couldn't see it,
> so I guess it now appears that the patch works.
>
> >> Doug Rabson pointed out that there would be a leak for the "default:"
> >> case too, although he didn't know whether it would occur in practice.
> >
> > it does! :-)
> >
> >> So, maybe you could test this variant of the patch (just in case that
> >> was the slow leak...):
> >> --- rpc/svc.c.sav	2010-03-21 18:46:20.000000000 -0400
> >> +++ rpc/svc.c	2010-03-22 19:00:17.000000000 -0400
> >> @@ -819,9 +819,11 @@
> >>                  free(r->rq_addr, M_SONAME);
> >>                  r->rq_addr = NULL;
> >>              }
> >> +            m_freem(args);
> >>              goto call_done;
> >>
> >>          default:
> >> +            m_freem(args);
> >>              goto call_done;
> >>          }
> >>      }
> >
> > that plugged it!
> > see
> > ftp://ftp.cs.huji.ac.il/users/danny/freebsd/mbuf-leak/store-02+++.ps
>
> Good work with the testing. I'll get it committed and put it up on the
> NFS patches page I have under http://people.freebsd.org/~rmacklem
>
> > thanks to you for taking time off of your retirement :-)
>
> I plan on doing quite a bit of FreeBSD/NFS stuff during it, rick
On Wed, 24 Mar 2010, Kai Kockro wrote:

> Hi,
>
> after 3 days with the first patch (FreeBSD 8-STABLE AMD64, old nfsd):
>
> netstat -m
> 5732/10528/16260 mbufs in use (current/cache/total)
> 4825/10131/14956/262144 mbuf clusters in use (current/cache/total/max)
>
> It looks very fine. I'll patch with the final corrections and then test again.
>
Sounds good. Thanks for letting us know.

> But why do I have the same issues? I don't use UDP connects, only TCP?!
>
Hmm, from what I can see, the replay cache is used for both UDP and TCP in the regular NFS server. If I'm correct on that, any retry of an RPC over TCP could cause the leak. (Retries of an RPC over TCP are infrequent, with the likelihood differing between clients. I can only guess that your TCP clients do retries?)

rick
Author: rmacklem
Date: Tue Apr 6 23:03:20 2010
New Revision: 206288

URL: http://svn.freebsd.org/changeset/base/206288

Log:
  MFC: r205562
  When the regular NFS server replied to a UDP client out of the replay
  cache, it did not free the request argument mbuf list, resulting in a
  leak. This patch fixes that leak.

  PR:	kern/144330

Modified:
  stable/8/sys/rpc/svc.c

Directory Properties:
  stable/8/sys/   (props changed)
  stable/8/sys/amd64/include/xen/   (props changed)
  stable/8/sys/cddl/contrib/opensolaris/   (props changed)
  stable/8/sys/contrib/dev/acpica/   (props changed)
  stable/8/sys/contrib/pf/   (props changed)
  stable/8/sys/dev/xen/xenpci/   (props changed)

Modified: stable/8/sys/rpc/svc.c
==============================================================================
--- stable/8/sys/rpc/svc.c	Tue Apr 6 21:39:18 2010	(r206287)
+++ stable/8/sys/rpc/svc.c	Tue Apr 6 23:03:20 2010	(r206288)
@@ -819,9 +819,11 @@ svc_getreq(SVCXPRT *xprt, struct svc_req
                 free(r->rq_addr, M_SONAME);
                 r->rq_addr = NULL;
             }
+            m_freem(args);
             goto call_done;
 
         default:
+            m_freem(args);
             goto call_done;
         }
     }
State Changed From-To: open->closed

Fixed in head and stable/8.