Bug 183381

Summary: [em] [patch] Use of 9k buffers in if_em.c hangs with resource starvation
Product: Base System
Reporter: David Gilbert <dgilbert>
Component: kern
Assignee: freebsd-net (Nobody) <net>
Status: Closed FIXED
Severity: Affects Only Me
CC: marius, sbruno
Priority: Normal
Keywords: IntelNetworking
Version: 9.2-STABLE   
Hardware: Any   
OS: Any   
Attachments:
Description Flags
file.diff none

Description David Gilbert 2013-10-28 05:10:00 UTC
I have a ZFS server here that also runs a number of other services
and has 9k packets turned on.  It hangs every day or two with its
current load.  I had a number of theories on the problem, but it
seems that 9k buffer allocation is a problem when other resources
are stressed.  In the mailing-list discussion, which I will attach to
this PR after I submit it, GAWollman noted that the 9k buffer
allocation may not be required by the code in if_em.c, and,
indeed, when I removed it, it wasn't.  Not only was it not required,
but removing it fixed the hangs.

I suppose the argument, then, is that the more efficient path
is to use page-sized buffers and scatter-gather, which
apparently everything supports (GAWollman is my only reference
for this statement).

How-To-Repeat: This might be challenging.  The server has 8 GB of RAM, 17 disks in ZFS
and another 2 disks in UFS service.  ZFS serves SMB, NFS (v3), and
iSCSI to a GigE LAN with 9k packets enabled.  The system also
runs (significant) PostgreSQL, rtorrent and apache loads.  Of all of
these, the rtorrent and iSCSI loads seem to be most involved in
replicating this problem.
Comment 1 dave 2013-10-28 05:12:12 UTC
As promised, here is the email conversation:

Subject: *Or it could be ZFS memory starvation and 9k packets (was Re:
istgt causes massive jumbo nmbclusters loss)*
------------------------

From: *Zaphod Beeblebrox* <zbeeble@gmail.com>
Date: Sat, Oct 26, 2013 at 1:16 AM
To: FreeBSD Net <freebsd-net@freebsd.org>, freebsd-fs <freebsd-fs@freebsd.org>


At first I thought this was entirely the interaction of istgt and 9k
packets, but after some observation (and a few more hangs) I'm
reasonably positive it's a form of resource starvation related to ZFS
and 9k packets.

To reliably trigger the hang, I need to do something that triggers a
demand for 9k packets (like istgt traffic, but also bit torrent traffic
--- as you see the MTU is 9014) and it must have been some time since
the system booted.  ZFS is fairly busy (with both NFS and SMB guests),
so it generally takes quite a bit of the 8G of memory for itself.

Now... below the netstat -m shows 1399 9k bufs with 376 available.  When
the network gets busy, I've seen 4k or even 5k bufs in total... never
near the 77k max.  After some time of lesser activity, the number of 9k
buffers returns to this level.

When the problem occurs, the number of denied buffers will shoot up at
the rate of several hundred or even several thousand per second, but the
system will not be "out" of memory.  Top will show 800 meg often in the
free column when this happens.  While it's happening, when I'm logged
into the console, none of these stats seem out of place, save the number
of denied 9k buffer allocations and the "cache" of 9k buffers will be
less than 10 (but I've never seen it at 0).
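
The "several hundred or even several thousand per second" figure can be measured rather than eyeballed. The following is a minimal sketch, assuming two `netstat -m` samples taken a known number of seconds apart; the helper names `parse_jumbo_denied` and `denied_per_second` are hypothetical and not part of any FreeBSD tool. It relies only on the line format shown in the `netstat -m` output quoted in this thread.

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical helper: parse the
 *   "A/B/C requests for jumbo clusters denied (4k/9k/16k)"
 * line of `netstat -m` output. Leading whitespace is skipped.
 * On success, fills denied[0..2] (4k, 9k, 16k) and returns true.
 */
static bool
parse_jumbo_denied(const char *line, unsigned long denied[3])
{
	int consumed = 0;

	/*
	 * %n is only stored if the literal text after the three numbers
	 * matched, which lets us reject the similar "delayed" line.
	 */
	if (sscanf(line, " %lu/%lu/%lu requests for jumbo clusters denied%n",
	    &denied[0], &denied[1], &denied[2], &consumed) != 3)
		return (false);
	return (consumed > 0);
}

/* Hypothetical helper: denial rate between two counter samples. */
static double
denied_per_second(unsigned long before, unsigned long after, double seconds)
{
	if (seconds <= 0.0 || after < before)
		return (0.0);
	return ((double)(after - before) / seconds);
}
```

Feeding each line of two successive `netstat -m` runs through `parse_jumbo_denied` and passing the two 9k counters (index 1) to `denied_per_second` gives the rate of denied 9k allocations over the sampling interval.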


On Tue, Oct 22, 2013 at 3:42 PM, Zaphod Beeblebrox <zbeeble@gmail.com> wrote:

    I have a server

    FreeBSD virtual.accountingreality.com 9.2-STABLE FreeBSD 9.2-STABLE
    #13 r256549M: Tue Oct 15 16:29:48 EDT 2013
    root@virtual.accountingreality.com:/usr/obj/usr/src/sys/VRA  amd64

    That has an em0 with jumbo packets enabled:

    em0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 9014

    It has (among other things): ZFS, NFS, iSCSI (via istgt) and Samba.

    Every day or two, it loses its ability to talk to the network.
    ifconfig down/up on em0 gives the message about not being able to
    allocate the receive buffers...

    With everything running, but with specifically iSCSI not used,
    everything seems good.  When I start hitting istgt, I see the denied
    stat for 9k mbufs rise very rapidly (this amount only took a few
    seconds):

    [1:47:347]root@virtual:/usr/local/etc/iet> netstat -m
    1313/877/2190 mbufs in use (current/cache/total)
    20/584/604/523514 mbuf clusters in use (current/cache/total/max)
    20/364 mbuf+clusters out of packet secondary zone in use (current/cache)
    239/359/598/261756 4k (page size) jumbo clusters in use (current/cache/total/max)
    1023/376/1399/77557 9k jumbo clusters in use (current/cache/total/max)
    0/0/0/43626 16k jumbo clusters in use (current/cache/total/max)
    10531K/6207K/16738K bytes allocated to network (current/cache/total)
    0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
    0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
    0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
    0/50199/0 requests for jumbo clusters denied (4k/9k/16k)
    0/0/0 sfbufs in use (current/peak/max)
    0 requests for sfbufs denied
    0 requests for sfbufs delayed
    0 requests for I/O initiated by sendfile
    0 calls to protocol drain routines

    ... the denied number rises... and somewhere in the millions or more
    the machine stops --- but even with the large number of denied 9k
    clusters, the "9k jumbo clusters in use" line will always indicate
    some available.

    ... so is this a tuning or a bug issue?  I've tried ietd ---
    basically it doesn't want to work with a zfs zvol, it seems (refuses
    to use it).



----------
From: *Garrett Wollman* <wollman@hergotha.csail.mit.edu>
Date: Sat, Oct 26, 2013 at 1:52 AM
To: zbeeble@gmail.com
Cc: net@freebsd.org


In article
<CACpH0MfEy50Y5QOZCdn2co_JmY_QPfVRxYwK-73W0WYsHB-Fqw@mail.gmail.com>
you write:

>Now... below the netstat -m shows 1399 9k bufs with 376 available.  When
>the network gets busy, I've seen 4k or even 5k bufs in total... never near
>the 77k max.  After some time of lesser activity, the number of 9k buffers
>returns to this level.

The network interface (driver) almost certainly should not be using 9k
mbufs.  These buffers are physically contiguous, and after not too
much activity, it will be nearly impossible to allocate three
physically contiguous pages.
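
To illustrate why a contiguous allocation can fail while plenty of memory is still free, here is a toy sketch. It is not the kernel's allocator: the page map and the helper names `find_contig_run` and `count_free` are hypothetical, and it assumes 4 KiB pages so that one 9k cluster needs a run of three adjacent free pages.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy model: page_free[i] is true when physical page i is free.
 * Returns the index of the first run of `need` consecutive free pages,
 * or -1 if no such run exists.
 */
static long
find_contig_run(const bool *page_free, size_t npages, size_t need)
{
	size_t run = 0;

	if (need == 0)
		return (-1);
	for (size_t i = 0; i < npages; i++) {
		/* Extend the current run, or reset it on a used page. */
		run = page_free[i] ? run + 1 : 0;
		if (run == need)
			return ((long)(i + 1 - need));
	}
	return (-1);
}

/* Count free pages regardless of where they sit. */
static size_t
count_free(const bool *page_free, size_t npages)
{
	size_t n = 0;

	for (size_t i = 0; i < npages; i++) {
		if (page_free[i])
			n++;
	}
	return (n);
}
```

With a map that repeats the pattern free, free, used, two thirds of the pages are free, yet `find_contig_run(map, npages, 3)` returns -1. A request for a single page (`need` of 1) succeeds immediately on the same map, which is the case for page-sized clusters.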

>> That has an em0 with jumbo packets enabled:
>>
>> em0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 9014

I don't know for certain about em(4), but it very likely should not be
using 9k mbufs.  Intel network hardware has done scatter-gather since
nearly the year dot.  (Seriously, I wrote a network driver for the
i82586 back at the very beginning of FreeBSD's existence, and *that*
part had scatter-gather.  No jumbo frames, though!)

The entire existence of 9k and 16k mbufs is probably a mistake.  There
should not be any network interfaces that are modern enough to do
jumbo frames but ancient enough to require physically contiguous pages
for each frame.  I don't know whether the em(4) driver is written such that
you can just disable the use of those mbufs, but you could try making
this change.  Look for this code in if_em.c:

        /*
        ** Figure out the desired mbuf
        ** pool for doing jumbos
        */
        if (adapter->max_frame_size <= 2048)
                adapter->rx_mbuf_sz = MCLBYTES;
        else if (adapter->max_frame_size <= 4096)
                adapter->rx_mbuf_sz = MJUMPAGESIZE;
        else
                adapter->rx_mbuf_sz = MJUM9BYTES;

Comment out the last two lines and change the else if (...) to else.
It's not obvious that the rest of the code can cope with this, but it
does work that way on other Intel hardware so it seems like it may be
worth a shot.

-GAWollman

----------
From: *Zaphod Beeblebrox* <zbeeble@gmail.com>
Date: Sat, Oct 26, 2013 at 2:55 PM
To: Garrett Wollman <wollman@hergotha.csail.mit.edu>
Cc: net@freebsd.org


To be clear, I made just this patch:

Index: if_em.c
===================================================================
--- if_em.c     (revision 256870)
+++ if_em.c     (working copy)
@@ -1343,10 +1343,10 @@
        */
        if (adapter->hw.mac.max_frame_size <= 2048)
                adapter->rx_mbuf_sz = MCLBYTES;
-       else if (adapter->hw.mac.max_frame_size <= 4096)
+       else /*if (adapter->hw.mac.max_frame_size <= 4096) */
                adapter->rx_mbuf_sz = MJUMPAGESIZE;
-       else
-               adapter->rx_mbuf_sz = MJUM9BYTES;
+       /* else
+               adapter->rx_mbuf_sz = MJUM9BYTES; */

        /* Prepare receive descriptors and buffers */
        if (em_setup_receive_structures(adapter)) {

(which is against 9.2-STABLE if you're looking).

The result is that no 9k clusters appear to be allocated.  I'm still
running the system as before, but so far the problem has not recurred. 
Of note, given your comment, is that this patch doesn't appear to break
anything, either.  Should I send-pr it?
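
For readers who want the effect of the patch in isolation, the selection logic after the change can be sketched as a standalone function. This is an illustration under stated assumptions, not driver code: the `SKETCH_` constants mirror the usual values of MCLBYTES and MJUMPAGESIZE on a system with 4 KiB pages, the function names are hypothetical, and the buffers-per-frame figure assumes the hardware spreads one frame across several receive buffers (scatter-gather), as described earlier in the thread.

```c
#include <stddef.h>

/* Assumed values; the real definitions live in the kernel headers. */
#define SKETCH_MCLBYTES		2048u	/* standard mbuf cluster */
#define SKETCH_MJUMPAGESIZE	4096u	/* page-sized jumbo cluster */

/*
 * Receive buffer size chosen after the patch: anything larger than a
 * standard cluster falls back to page-sized clusters, never 9k ones.
 */
static unsigned int
rx_mbuf_size(unsigned int max_frame_size)
{
	if (max_frame_size <= SKETCH_MCLBYTES)
		return (SKETCH_MCLBYTES);
	return (SKETCH_MJUMPAGESIZE);
}

/* Number of receive buffers one maximum-sized frame would occupy. */
static unsigned int
rx_buffers_per_frame(unsigned int max_frame_size)
{
	unsigned int sz = rx_mbuf_size(max_frame_size);

	return ((max_frame_size + sz - 1) / sz);	/* round up */
}
```

Under these assumptions a 9014-byte frame occupies three page-sized buffers that need not be adjacent in physical memory, instead of one 9k cluster that must be physically contiguous.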

----------
From: *Garrett Wollman* <wollman@bimajority.org>
Date: Sat, Oct 26, 2013 at 7:18 PM
To: Zaphod Beeblebrox <zbeeble@gmail.com>
Cc: net@freebsd.org


<<On Sat, 26 Oct 2013 14:55:19 -0400, Zaphod Beeblebrox
<zbeeble@gmail.com> said:

> The result is that no 9k clusters appear to be allocated.  I'm still
> running the system as before, but so far the problem has not recurred.  Of
> note, given your comment, is that this patch doesn't appear to break
> anything, either.  Should I send-pr it?

You bet.  Otherwise it will get lost.  Hopefully it can be assigned to
whoever is maintaining this driver as a reminder.

-GAWollman
Comment 2 Mark Linimon freebsd_committer freebsd_triage 2014-05-04 06:24:11 UTC
Responsible Changed
From-To: freebsd-bugs->freebsd-net

Over to maintainer(s).
Comment 3 Eitan Adler freebsd_committer freebsd_triage 2018-05-28 19:41:00 UTC
batch change:

For bugs that match the following
-  Status Is In progress 
AND
- Untouched since 2018-01-01.
AND
- Affects Base System OR Documentation

DO:

Reset to open status.


Note:
I did a quick pass but if you are getting this email it might be worthwhile to double check to see if this bug ought to be closed.
Comment 4 Marius Strobl freebsd_committer freebsd_triage 2019-02-13 14:50:48 UTC
Close; this has been addressed with r340148 and r342790 in stable/11
and stable/10 respectively and doesn't apply to later versions of
FreeBSD.