Bug 131360 - [nfs] poor scaling behavior of the NFS server under load
Summary: [nfs] poor scaling behavior of the NFS server under load
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 7.1-RELEASE
Hardware: Any
OS: Any
Importance: Normal (Affects Only Me)
Assignee: freebsd-bugs (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-02-04 12:40 UTC by Martin Birgmeier
Modified: 2018-01-03 05:13 UTC

Description Martin Birgmeier 2009-02-04 12:40:01 UTC
	[Please note that the email address given is not valid due
	to security reasons. Just reply to the gnats entry, and
	I'll follow it via the web interface.]

	Between FreeBSD 6.3 and 7.1, the behavior of the NFS server
	changed for the worse. Under 6.3, the load generated by the
	nfsd's would never exceed their number (I am using
	nfs_server_flags="-u -t -n 8"). With 7.1, when the client
	generates a lot of requests, it seems that the load on the
	NFS server can grow nearly without bounds, rendering the
	server all but unusable.
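
	For reference, the server-side setup described above roughly
	corresponds to /etc/rc.conf entries like the following (a sketch;
	only the nfs_server_flags value is taken from this report, the
	other knobs are the usual companions):

	rpcbind_enable="YES"
	mountd_enable="YES"
	nfs_server_enable="YES"
	nfs_server_flags="-u -t -n 8"   # serve UDP and TCP, run 8 nfsd threads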

	I have been testing this with two scenarios:

	- running a FreeBSD buildworld -j4 with src on NFS and obj
	  local, so there are only NFS reads, and

	- compiling openoffice.org-3-devel with WRKDIRPREFIX set
	  to an NFS-mounted directory, and MAXMODULES=3, MAXPROCESSES=3,
	  resulting in both reads and writes.
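
	A sketch of the first scenario, using the source path that appears
	later in this report (/vol/SRC/FreeBSD/HEAD/src); the local object
	directory /usr/obj.local is only a placeholder:

	# export MAKEOBJDIRPREFIX=/usr/obj.local   # objects stay on a local disk
	# cd /vol/SRC/FreeBSD/HEAD/src
	# make -j4 buildworld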

	In both cases, the NFS server reaches absurdly high load
	values: with 'buildworld -j4' on the client I currently read
	from 'systat -vm 1' on the server:

	- a load just passing the 15 mark

	- nearly 100% 'sys'

	- there seems to be nearly no disk i/o activity

	- very poor general network throughput: an xterm from the
	  server to the client is nearly unresponsive (in fact, I
	  again have to suspend the build on the client right now
	  in order to be able to write this bug report in an xterm
	  running on the server).

	- resuming the build immediately sends the load skyrocketing
	  again.

	I am quite sure that this is a server issue: for a while
	I had the client running 7.0 and the server 6.3, and in that
	scenario this did not happen.

	BTW, the NFS mounts on the client are done via amd - this
	probably influences which mount type is used.
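
	One quick way to see which transport amd actually chose is to look
	for a TCP connection to the server's NFS port from the client (a
	sketch; the grep pattern simply matches port 2049):

	# netstat -an | grep '\.2049'

	A line such as 'tcp4 ... <server>.2049 ... ESTABLISHED' indicates a
	TCP mount, while traffic only on udp4 sockets would indicate UDP.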

Fix: 

Unknown.
How-To-Repeat:
	Generate a high NFS load on a client, and watch the server
	load skyrocket.
Comment 1 Martin Birgmeier 2009-02-04 13:38:06 UTC
I just realized that there might also be an interaction with pppoa2
(from /usr/ports/net/pppoa). The excessive load seems more likely to
happen if I simultaneously have heavy traffic through my ADSL
connection, which goes through an Alcatel ADSL modem, pppoa2, and
user-mode ppp.

Syslog continually shows messages like the following:

Feb  4 13:56:59 gandalf pppoa2[1491]: write_dest: 11800 ENOBUFS errors
Feb  4 13:57:03 gandalf pppoa2[1491]: write_dest: 11900 ENOBUFS errors
Feb  4 13:57:08 gandalf pppoa2[1491]: write_dest: 12000 ENOBUFS errors
Feb  4 13:57:11 gandalf pppoa2[1491]: write_dest: 12100 ENOBUFS errors
Feb  4 13:57:15 gandalf pppoa2[1491]: write_dest: 12200 ENOBUFS errors

At the same time, netstat -m shows the following:

# netstat -m
233/412/645 mbufs in use (current/cache/total)
207/241/448/25600 mbuf clusters in use (current/cache/total/max)
204/180 mbuf+clusters out of packet secondary zone in use (current/cache)
1/19/20/12800 4k (page size) jumbo clusters in use (current/cache/total/max)
0/0/0/6400 9k jumbo clusters in use (current/cache/total/max)
0/0/0/3200 16k jumbo clusters in use (current/cache/total/max)
476K/661K/1137K bytes allocated to network (current/cache/total)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
0/5/6656 sfbufs in use (current/peak/max)
0 requests for sfbufs denied
0 requests for sfbufs delayed
15 requests for I/O initiated by sendfile
0 calls to protocol drain routines
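
For what it's worth, ENOBUFS can point at mbuf cluster exhaustion, though
the "denied" counters above are still zero here, so the errors may instead
come from the interface send queue filling up. Checking and, if the denied
counters ever grow, raising the cluster limit would look roughly like this
(65536 is only an example value):

# sysctl kern.ipc.nmbclusters
kern.ipc.nmbclusters: 25600
# echo 'kern.ipc.nmbclusters=65536' >> /boot/loader.conf

Set this way (as a loader tunable), the new limit only takes effect after
a reboot.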
Comment 2 Mark Linimon freebsd_committer freebsd_triage 2009-02-04 16:13:29 UTC
Responsible Changed
From-To: freebsd-bugs->freebsd-fs

Over to maintainer(s).
Comment 3 Martin Birgmeier 2009-02-07 15:31:21 UTC
I am now very sure that this is an interaction with pppoa, and it is
also worse than I originally thought: it will even lead to failed NFS
transactions for the client.

Here is what I have:

Machine A ('server', a mini home server) does the following:
  - connecting to the Internet using usermode ppp over pppoa over an
    Alcatel ADSL modem
  - NFS serving FreeBSD sources

Machine B does the following:
  - Mounting the FreeBSD sources from A (using amd), under directory
    /vol/SRC/FreeBSD/HEAD/src
  - Compiling the FreeBSD sources: make -j4 buildworld, such that the
    corresponding obj is local (via amd again)

Especially in the first phase of the buildworld (clean, depend, obj),
there is a lot of simultaneous NFS traffic from B to A. As soon as
a download is started at A (going via pppoa, of course), the load on
A rises to very high values (> 20 not uncommon).

This may lead to B aborting the compile; it just did so with a
"directory not found" error.

Both machines are running 7.1.0.

No such problem happened when both were running 6.3.0.
Comment 4 Martin Birgmeier 2009-02-07 16:32:15 UTC
o.k. pppoa does not have (much) to do with it... even when it is not running,
the excessive load happens.
Comment 5 Martin Birgmeier 2009-02-07 16:59:05 UTC
o.k. more info... I have this on the server (machine A in my previous post);
it is showing quite a high number for "Server Ret-Failed":

# nfsstat
Client Info:
Rpc Counts:
  Getattr   Setattr    Lookup  Readlink      Read     Write    Create    Remove
  2304000         0       221       165         0         0         0         0
   Rename      Link   Symlink     Mkdir     Rmdir   Readdir  RdirPlus    Access
        0         0         0         0         0        22         0         0
    Mknod    Fsstat    Fsinfo  PathConf    Commit
        0        58         0         0         0
Rpc Info:
 TimedOut   Invalid X Replies   Retries  Requests
        0         0        17        17   2304466
Cache Info:
Attr Hits    Misses Lkup Hits    Misses BioR Hits    Misses BioW Hits    Misses
        0   2304000    885499       221         0         0         0         0
BioRLHits    Misses BioD Hits    Misses DirE Hits    Misses
   532202       165        89        11        50         0

Server Info:
  Getattr   Setattr    Lookup  Readlink      Read     Write    Create    Remove
        8        19   3123468        82     34382      3156        19        11
   Rename      Link   Symlink     Mkdir     Rmdir   Readdir  RdirPlus    Access
        0         0        11         0         0      9522         0    868616
    Mknod    Fsstat    Fsinfo  PathConf    Commit
        0     31396        23         0       147
Server Ret-Failed
          3046139
Server Faults
            0
Server Cache Stats:
   Inprog      Idem  Non-idem    Misses
        0        51         0       412
Server Write Gathering:
 WriteOps  WriteRPC   Opsaved
     3156      3156         0


and this on the client (machine B):

# nfsstat
Client Info:
Rpc Counts:
  Getattr   Setattr    Lookup  Readlink      Read     Write    Create    Remove
 48679649         0   3035823        58     26400         0         0         0
   Rename      Link   Symlink     Mkdir     Rmdir   Readdir  RdirPlus    Access
        0         0         0         0         0      7255         0    767940
    Mknod    Fsstat    Fsinfo  PathConf    Commit
        0     27347         1         0         0
Rpc Info:
 TimedOut   Invalid X Replies   Retries  Requests
        0         0         0         1  52543821
Cache Info:
Attr Hits    Misses Lkup Hits    Misses BioR Hits    Misses BioW Hits    Misses
 46870061  49414909  64135869   3035823   1144711     26389         0         0
BioRLHits    Misses BioD Hits    Misses DirE Hits    Misses
  8748412        58     43987      7218     25358         0

Server Info:
  Getattr   Setattr    Lookup  Readlink      Read     Write    Create    Remove
        0         0         0         0         0         0         0         0
   Rename      Link   Symlink     Mkdir     Rmdir   Readdir  RdirPlus    Access
        0         0         0         0         0         0         0         0
    Mknod    Fsstat    Fsinfo  PathConf    Commit
        0         0         0         0         0
Server Ret-Failed
                0
Server Faults
            0
Server Cache Stats:
   Inprog      Idem  Non-idem    Misses
        0         0         0         0
Server Write Gathering:
 WriteOps  WriteRPC   Opsaved
        0         0         0
Comment 6 Martin Birgmeier 2009-02-08 07:40:31 UTC
Yet more info... here is output from top.

Also, the following just happened:
  - I am editing this mail on the NFS server.
  - Together with the top output below, I pasted a total of 1000
    lines (my XTerm scroll size).
  - This caused the load on this server to effectively double again (over
    the pasted values shown below).

Basically, I can only continue editing this mail if I suspend the
build on the client machine, in which case the server immediately
becomes responsive again.

So maybe it is not a pppoa interaction with NFS serving; rather, any
additional load on the server combined with NFS serving makes the load
go to insane values. Or maybe it is just additional TCP load, because I
am displaying this XTerm on the NFS client (where the X server is
running), and all the pasting has to go via the X server's TCP
connection.

Also, I have the impression that as long as only one of the 8 nfsd's
on the server is busy, things are mostly normal, but as soon as
more than one starts doing work (as seen in the output below), the
load on the server goes way up. And regarding "mostly normal": even
if only one nfsd seems to be active, the load on the server is
already close to one - assuming that an nfsd does not do much more
than network and disk i/o, this really should not be the case (and
was not under 6.3, where the load was low even under quite heavy
NFS i/o).

So maybe it is a ULE problem, after all?

last pid:  2527;  load averages: 14.71, 10.36,  6.13    up 0+01:04:43  08:21:08
111 processes: 9 running, 102 sleeping
CPU:  1.4% user,  0.0% nice, 90.5% system,  8.1% interrupt,  0.0% idle
Mem: 135M Active, 745M Inact, 119M Wired, 1012K Cache, 112M Buf, 248M Free
Swap: 2048M Total, 2048M Free

  PID USERNAME    THR PRI NICE   SIZE    RES STATE    TIME   WCPU COMMAND
  971 root          1   4    0  3128K   944K -       13:45 40.28% nfsd
  972 root          1   4    0  3128K   944K -        2:19 15.09% nfsd
  973 root          1   4    0  3128K   944K -        1:31 10.16% nfsd
  974 root          1   4    0  3128K   944K -        1:03  6.05% nfsd
  975 root          1   4    0  3128K   944K -        0:49  4.59% nfsd
  977 root          1   4    0  3128K   944K -        0:41  3.56% nfsd
  978 root          1   4    0  3128K   944K -        0:35  2.64% nfsd
  976 root          1   4    0  3128K   944K -        0:31  1.81% nfsd
 2527 root          1  96    0  3164K   992K RUN      0:00  1.54% rsh
 1471 root          1  81  -15  5032K  2716K select   0:05  0.05% ppp
  919 root          1  96    0  3128K  3148K select   2:16  0.00% amd
 1539 root          1  96    0  6508K  4964K RUN      0:10  0.00% xterm
 1140 squid         1   4    0 12000K 10152K sbwait   0:05  0.00% perl5.8.9
 1130 squid         1  96    0 15660K 10820K RUN      0:05  0.00% squid
 1141 squid         1   4    0 12000K 10148K sbwait   0:04  0.00% perl5.8.9
 1142 squid         1   4    0 12000K 10148K sbwait   0:04  0.00% perl5.8.9
 1143 squid         1   4    0 12000K 10104K sbwait   0:03  0.00% perl5.8.9
Comment 7 Martin Birgmeier 2009-02-08 08:00:20 UTC
Sorry for the many posts... this problem is really nagging me,
and I need to clarify an error I made: sysctl kern.sched.name on
the server reports '4BSD', so it's not ULE as I wrote in my previous
posting.
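
For anyone trying to reproduce this, the active scheduler can be checked
at runtime, and switching schedulers means rebuilding the kernel with the
other option (a sketch; MYKERNEL is a placeholder kernel config name):

# sysctl kern.sched.name
kern.sched.name: 4BSD

To switch, change the scheduler line in the kernel config file, e.g.
'options SCHED_4BSD' instead of 'options SCHED_ULE', then:

# cd /usr/src
# make buildkernel KERNCONF=MYKERNEL
# make installkernel KERNCONF=MYKERNEL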
Comment 8 Martin Birgmeier 2009-02-08 08:12:46 UTC
Another top snapshot on the server, scenario is the following:
  - make -j4 buildworld running on the client, currently in
    "===> kerberos5/lib/libkadm5srv (all)"
  - on the server, in addition to serving NFS, also running
    a 'svn log' command, where the repository is also served
    by the server via http (httpd, subversion repo, and svn log
    all running on the server, displaying in an XTerm running
    on the client).

If I do the 'svn log' without NFS load from the client (buildworld
stopped), there is nearly instantaneous output. If I do it with a
running buildworld on the client, I get a top output similar
to the one below (actually, it is very hard to capture the "worst"
moments, as the server is so unresponsive - in fact the load was
something like 25, and each of the 8 nfsds consumed about 10% of
CPU).

last pid:  2527;  load averages: 14.71, 10.36,  6.13    up 0+01:04:43  08:21:08
111 processes: 9 running, 102 sleeping
CPU:  1.4% user,  0.0% nice, 90.5% system,  8.1% interrupt,  0.0% idle
Mem: 135M Active, 745M Inact, 119M Wired, 1012K Cache, 112M Buf, 248M Free
Swap: 2048M Total, 2048M Free

  PID USERNAME    THR PRI NICE   SIZE    RES STATE    TIME   WCPU COMMAND
  971 root          1   4    0  3128K   944K -       13:45 40.28% nfsd
  972 root          1   4    0  3128K   944K -        2:19 15.09% nfsd
  973 root          1   4    0  3128K   944K -        1:31 10.16% nfsd
  974 root          1   4    0  3128K   944K -        1:03  6.05% nfsd
  975 root          1   4    0  3128K   944K -        0:49  4.59% nfsd
  977 root          1   4    0  3128K   944K -        0:41  3.56% nfsd
  978 root          1   4    0  3128K   944K -        0:35  2.64% nfsd
  976 root          1   4    0  3128K   944K -        0:31  1.81% nfsd
 2527 root          1  96    0  3164K   992K RUN      0:00  1.54% rsh
 1471 root          1  81  -15  5032K  2716K select   0:05  0.05% ppp
  919 root          1  96    0  3128K  3148K select   2:16  0.00% amd
 1539 root          1  96    0  6508K  4964K RUN      0:10  0.00% xterm
 1140 squid         1   4    0 12000K 10152K sbwait   0:05  0.00% perl5.8.9
 1130 squid         1  96    0 15660K 10820K RUN      0:05  0.00% squid
 1141 squid         1   4    0 12000K 10148K sbwait   0:04  0.00% perl5.8.9
 1142 squid         1   4    0 12000K 10148K sbwait   0:04  0.00% perl5.8.9
 1143 squid         1   4    0 12000K 10104K sbwait   0:03  0.00% perl5.8.9
Comment 9 freebsd 2009-02-26 15:50:18 UTC
We experience exactly the same problem. With FreeBSD 7.0 everything was
alright, but with the upgrade to 7.1 performance became abysmal. Normally
it seems to work for some time; but after a few hours, the load climbs
and the 4 nfsd processes each use 100% CPU (on a 4-way system). The system
only serves as nfs-root (ro) for about 20 systems, so there's not too much
NFS traffic going on.

I wasn't able to capture it when every nfsd was using 100% CPU (because, well,
some thousands of users depend on it); this is what it looks like when it's
half-working (hundreds of timeouts on the clients):
37907 root        1   4    0  4604K  1112K -      2  46:29 33.59% nfsd
37908 root        1   4    0  4604K  1112K -      2  15:29 17.38% nfsd
37909 root        1   4    0  4604K  1112K -      3   5:59  5.66% nfsd
37910 root        1   4    0  4604K  1112K -      2   2:38  0.00% nfsd

We suspect the ULE scheduler is responsible for the mess. Right
now we're testing a kernel with the 4BSD scheduler; we'll know more
next week.

But now it looks like this:
 1040 root        1   4    0  4604K  1068K -      3   5:49  6.49% nfsd
 1039 root        1   4    0  4604K  1068K -      3   2:20  1.86% nfsd
 1042 root        1   4    0  4604K  1068K -      1   0:48  0.05% nfsd
 1041 root        1   4    0  4604K  1068K -      2   0:15  0.00% nfsd

Promising.

Regards
Peter
Comment 10 Martin Birgmeier 2009-02-28 08:16:44 UTC
To add to what Peter Keel is writing: My kernels *did* still use
the 4BSD scheduler, so I am quite sure that Peter will not see an
improvement when switching to it from the ULE scheduler.

Next observation: My server, aside from serving NFS, is also serving
samba clients. Yesterday, from a single Windows 98 host, a directory
on the server containing approx. 100 files was deleted. During this
time, the server was completely unresponsive (except that I could
still ping it). It was not even possible to contact the DNS server
running on it.

After a few minutes (presumably when the Windows 98 host had
finished deleting the directory; I did not watch this directly),
things returned to normal. However, the "xload" display from the
server then refreshed again and indicated a truly gigantic load
peak - it must have been greater than 50, as the background of the
xload window was completely filled with y-axis lines (the horizontal
lines dividing load levels).

Something has been messed up horribly with multiprocessing on 7.1.
Comment 11 Robert Watson freebsd_committer freebsd_triage 2009-03-01 19:06:25 UTC
On Wed, 4 Feb 2009, Martin Birgmeier wrote:

> 	[Please note that the email address given is not valid due
> 	to security reasons. Just reply to the gnats entry, and
> 	I'll follow it via the web interface.]
>
> 	Between FreeBSD 6.3 and 7.1, the behavior of the NFS server
> 	changed for the worse. Under 6.3, the load generated by the
> 	nfsd's would never exceed their number (I am using
> 	nfs_server_flags="-u -t -n 8"). With 7.1, when the client
> 	generates a lot of requests, it seems that the load on the
> 	NFS server can grow nearly without bounds, rendering the
> 	server all but unusable.

Hi Martin:

Could I ask you to clarify a few things about the NFS server configuration:

- What device/device driver is on the server?
- Is the mount over TCP or UDP?

There are a number of changes between 7.0 and 7.1; one of the potential changes to
look at is the increased UDP parallelism support due to read-locking in the
UDP stack overwhelming the capabilities of some devices/device drivers.  We've
had a couple of reports of this specifically with the bge driver.  If you're
using UDP, converting to TCP would be interesting; if it's bge, then we could
give you a patch to locally revert the read-locking changes to see if that
helps manage contention better (i.e., more contention higher up on a set of locks
moves the contention off the single device driver lock).

Robert N M Watson
Computer Laboratory
University of Cambridge
Comment 12 Martin Birgmeier 2009-03-07 09:23:54 UTC
Robert,

Thanks for taking this.

First off, the problem described in my previous message (similar issues
with samba) might in fact also be caused by NFS - samba is running on
this server, but it was configured to serve a directory which is itself
served by amd(8) on this server.

Now to your questions:

device/driver (grep sis /var/run/dmesg.boot):

sis0: <NatSemi DP8381[56] 10/100BaseTX> port 0xa400-0xa4ff mem 0xd5800000-0xd5800fff irq 9 at device 9.0 on pci0
sis0: Silicon Revision: DP83815D
miibus0: <MII bus> on sis0
sis0: Ethernet address: 00:40:f4:1a:42:ba
sis0: [ITHREAD]
sis0: Applying short cable fix (reg=f4)
sis0: Applying short cable fix (reg=f5)

mount:

It is what amd(8) instructs it to be, and from running tcpdump I believe
it is over TCP.
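
The TCP-vs-UDP question can also be confirmed from the server side with
tcpdump (a sketch; the interface name sis0 is taken from the dmesg lines
above, the filters are generic):

# tcpdump -n -i sis0 'tcp port 2049'
# tcpdump -n -i sis0 'udp port 2049'

NFS traffic showing up only under the first filter means the mounts are
over TCP.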
Comment 13 freebsd 2009-03-11 16:03:36 UTC
Addendum: 

- Changing the scheduler yielded nothing, as you expected.

- This was over TCP, we never used UDP. We mounted the systems with 
  these options: rw,nfsv3,tcp,intr,nolockd,bg 

- The Ethernet device drivers are "em" on all machines.
  em0: <Intel(R) PRO/1000 Network Connection 6.9.6> 
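
For comparison with the amd-driven mounts in the original report, a static
mount with these options would look roughly like this in /etc/fstab (a
sketch; the host and paths are placeholders):

server:/export/src  /usr/src  nfs  rw,nfsv3,tcp,intr,nolockd,bg  0  0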

Regards
Peter
Comment 14 Martin Birgmeier 2010-08-04 09:42:18 UTC
You can probably close this; I cannot reproduce it any more on the
current setup: RELENG_7_3_0_RELEASE.

Regards,

Martin
Comment 15 Eitan Adler freebsd_committer freebsd_triage 2017-12-31 08:00:15 UTC
For bugs matching the following criteria:

Status: In Progress Changed: (is less than) 2014-06-01

Reset to default assignee and clear in-progress tags.

Mail being skipped