I have a ZFS server here that also runs a number of other services and has 9k packets turned on. Under its current load it hangs every day or two. I had a number of theories about the problem, but it seems that 9k buffer allocation is a problem when other resources are stressed. In the mailing list discussion, which I will attach to this PR after I submit it, GAWollman noted that the 9k buffer allocation may not be required by the code in if_em.c, and, indeed, when I removed it, it wasn't. Not only was it not required, but removing it fixed the hangs. The argument, then, is that the more efficient path is to use page-sized buffers and scatter-gather, which apparently everything supports (he is my only reference for this statement).

How-To-Repeat:
This might be challenging. The server has 8 GB of RAM, 17 disks in ZFS and another 2 disks in UFS service. ZFS serves SMB, NFS (v3), and iSCSI to a GigE LAN with 9k packets enabled. The system also runs significant PostgreSQL, rtorrent and Apache loads. Of all of these, the rtorrent and iSCSI loads seem to be most involved in replicating the problem.
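For context on why the 9k path is fragile: a 9k cluster (MJUM9BYTES) spans three physically contiguous pages, while a page-sized cluster (MJUMPAGESIZE) is a single page and never depends on physical contiguity. Below is a minimal sketch of the two allocation paths using the standard m_getjcl(9) mbuf API; it is illustrative only, not code from if_em.c, and grab_rx_buffer is a made-up name:

	/*
	 * Sketch: the two receive-buffer strategies at issue.
	 * m_getjcl() allocates an mbuf with a jumbo cluster attached.
	 */
	#include <sys/param.h>
	#include <sys/systm.h>
	#include <sys/malloc.h>
	#include <sys/mbuf.h>

	static struct mbuf *
	grab_rx_buffer(int want_9k)
	{
		if (want_9k) {
			/*
			 * MJUM9BYTES requires three physically contiguous
			 * pages.  Once memory fragments (e.g. under a busy
			 * ZFS ARC), these allocations fail even with plenty
			 * of free memory, and the failures show up in the
			 * 9k "denied" column of netstat -m.
			 */
			return (m_getjcl(M_NOWAIT, MT_DATA, M_PKTHDR,
			    MJUM9BYTES));
		}
		/*
		 * MJUMPAGESIZE is one page, so it is immune to
		 * fragmentation; hardware with scatter-gather DMA can
		 * reassemble a 9014-byte frame from several such buffers.
		 */
		return (m_getjcl(M_NOWAIT, MT_DATA, M_PKTHDR, MJUMPAGESIZE));
	}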
As promised, here is the email conversation:

Subject: Or it could be ZFS memory starvation and 9k packets (was Re: istgt causes massive jumbo nmbclusters loss)

------------------------
From: Zaphod Beeblebrox <zbeeble@gmail.com>
Date: Sat, Oct 26, 2013 at 1:16 AM
To: FreeBSD Net <freebsd-net@freebsd.org>, freebsd-fs <freebsd-fs@freebsd.org>

At first I thought this was entirely the interaction of istgt and 9k packets, but after some observation (and a few more hangs) I'm reasonably positive it's a form of resource starvation related to ZFS and 9k packets.

To reliably trigger the hang, I need to do something that triggers a demand for 9k packets (like istgt traffic, but also BitTorrent traffic --- as you can see, the MTU is 9014), and some time must have passed since the system booted. ZFS is fairly busy (with both NFS and SMB guests), so it generally takes quite a bit of the 8G of memory for itself.

Now... the netstat -m below shows 1399 9k bufs with 376 available. When the network gets busy, I've seen 4k or even 5k bufs in total --- never near the 77k max. After some time of lesser activity, the number of 9k buffers returns to this level.

When the problem occurs, the number of denied buffers shoots up at the rate of several hundred or even several thousand per second, but the system will not be "out" of memory: top will often show 800 meg in the free column when this happens. While it's happening, when I'm logged into the console, none of these stats seem out of place, save the number of denied 9k buffer allocations, and the "cache" of 9k buffers will be less than 10 (but I've never seen it at 0).

On Tue, Oct 22, 2013 at 3:42 PM, Zaphod Beeblebrox <zbeeble@gmail.com> wrote:

I have a server:

	FreeBSD virtual.accountingreality.com 9.2-STABLE FreeBSD 9.2-STABLE #13 r256549M: Tue Oct 15 16:29:48 EDT 2013 root@virtual.accountingreality.com:/usr/obj/usr/src/sys/VRA amd64

that has an em0 with jumbo packets enabled:

	em0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 9014

It has (among other things): ZFS, NFS, iSCSI (via istgt) and Samba. Every day or two, it loses its ability to talk to the network. ifconfig down/up on em0 gives the message about not being able to allocate the receive buffers... With everything running, but with specifically iSCSI not used, everything seems good.
When I start hitting istgt, I see the denied stat for 9k mbufs rise very rapidly (this amount only took a few seconds):

[1:47:347]root@virtual:/usr/local/etc/iet> netstat -m
1313/877/2190 mbufs in use (current/cache/total)
20/584/604/523514 mbuf clusters in use (current/cache/total/max)
20/364 mbuf+clusters out of packet secondary zone in use (current/cache)
239/359/598/261756 4k (page size) jumbo clusters in use (current/cache/total/max)
1023/376/1399/77557 9k jumbo clusters in use (current/cache/total/max)
0/0/0/43626 16k jumbo clusters in use (current/cache/total/max)
10531K/6207K/16738K bytes allocated to network (current/cache/total)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
0/50199/0 requests for jumbo clusters denied (4k/9k/16k)
0/0/0 sfbufs in use (current/peak/max)
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile
0 calls to protocol drain routines

... the denied number rises... and somewhere in the millions or more the machine stops --- but even with the large number of denied 9k clusters, the "9k jumbo clusters in use" line will always indicate some available.

... so is this a tuning issue or a bug? I've tried ietd --- basically it doesn't want to work with a ZFS zvol, it seems (refuses to use it).

----------
From: Garrett Wollman <wollman@hergotha.csail.mit.edu>
Date: Sat, Oct 26, 2013 at 1:52 AM
To: zbeeble@gmail.com
Cc: net@freebsd.org

In article <CACpH0MfEy50Y5QOZCdn2co_JmY_QPfVRxYwK-73W0WYsHB-Fqw@mail.gmail.com> you write:

>Now... below the netstat -m shows 1399 9k bufs with 376 available. When
>the network gets busy, I've seen 4k or even 5k bufs in total... never near
>the 77k max. After some time of lesser activity, the number of 9k buffers
>returns to this level.

The network interface (driver) almost certainly should not be using 9k mbufs. These buffers are physically contiguous, and after not too much activity, it will be nearly impossible to allocate three physically contiguous pages.

>> That has an em0 with jumbo packets enabled:
>>
>> em0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 9014

I don't know for certain about em(4), but it very likely should not be using 9k mbufs. Intel network hardware has done scatter-gather since nearly the year dot. (Seriously, I wrote a network driver for the i82586 back at the very beginning of FreeBSD's existence, and *that* part had scatter-gather. No jumbo frames, though!)

The entire existence of 9k and 16k mbufs is probably a mistake. There should not be any network interfaces that are modern enough to do jumbo frames but ancient enough to require physically contiguous pages for each frame. I don't know if the em(4) driver is written such that you can just disable the use of those mbufs, though.

You could try making this change, though. Look for this code in if_em.c:

	/*
	** Figure out the desired mbuf
	** pool for doing jumbos
	*/
	if (adapter->max_frame_size <= 2048)
		adapter->rx_mbuf_sz = MCLBYTES;
	else if (adapter->max_frame_size <= 4096)
		adapter->rx_mbuf_sz = MJUMPAGESIZE;
	else
		adapter->rx_mbuf_sz = MJUM9BYTES;

Comment out the last two lines and change the "else if (...)" to "else".
It's not obvious that the rest of the code can cope with this, but it does work that way on other Intel hardware so it seems like it may be worth a shot.

-GAWollman

----------
From: Zaphod Beeblebrox <zbeeble@gmail.com>
Date: Sat, Oct 26, 2013 at 2:55 PM
To: Garrett Wollman <wollman@hergotha.csail.mit.edu>
Cc: net@freebsd.org

To be clear, I made just this patch:

Index: if_em.c
===================================================================
--- if_em.c	(revision 256870)
+++ if_em.c	(working copy)
@@ -1343,10 +1343,10 @@
 	 */
 	if (adapter->hw.mac.max_frame_size <= 2048)
 		adapter->rx_mbuf_sz = MCLBYTES;
-	else if (adapter->hw.mac.max_frame_size <= 4096)
+	else /* if (adapter->hw.mac.max_frame_size <= 4096) */
 		adapter->rx_mbuf_sz = MJUMPAGESIZE;
-	else
-		adapter->rx_mbuf_sz = MJUM9BYTES;
+	/* else
+		adapter->rx_mbuf_sz = MJUM9BYTES; */
 
 	/* Prepare receive descriptors and buffers */
 	if (em_setup_receive_structures(adapter)) {

(which is against 9.2-STABLE if you're looking). The result is that no 9k clusters appear to be allocated. I'm still running the system as before, but so far the problem has not recurred. Of note, given your comment, is that this patch doesn't appear to break anything, either. Should I send-pr it?

----------
From: Garrett Wollman <wollman@bimajority.org>
Date: Sat, Oct 26, 2013 at 7:18 PM
To: Zaphod Beeblebrox <zbeeble@gmail.com>
Cc: net@freebsd.org

<<On Sat, 26 Oct 2013 14:55:19 -0400, Zaphod Beeblebrox <zbeeble@gmail.com> said:

> The result is that no 9k clusters appear to be allocated. I'm still
> running the system as before, but so far the problem has not recurred. Of
> note, given your comment, is that this patch doesn't appear to break
> anything, either. Should I send-pr it?

You bet. Otherwise it will get lost. Hopefully it can be assigned to whoever is maintaining this driver as a reminder.

-GAWollman
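For reference, the mbuf-size selection with the above patch applied reduces to the following (the comments here are editorial, not part of the patch):

	/*
	 * Figure out the desired mbuf pool for doing jumbos.  With the
	 * patch, the driver never asks for MJUM9BYTES clusters: any
	 * frame larger than a standard cluster uses page-sized buffers
	 * and relies on the hardware's scatter-gather across them.
	 */
	if (adapter->hw.mac.max_frame_size <= 2048)
		adapter->rx_mbuf_sz = MCLBYTES;		/* standard 2k cluster */
	else
		adapter->rx_mbuf_sz = MJUMPAGESIZE;	/* one page per buffer */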
Responsible Changed From-To: freebsd-bugs->freebsd-net
Over to maintainer(s).
batch change: For bugs that match the following criteria:

- Status is "In Progress", AND
- untouched since 2018-01-01, AND
- affects Base System or Documentation

DO: reset to open status.

Note: I did a quick pass, but if you are getting this email it might be worthwhile to double-check whether this bug ought to be closed.
Close; this has been addressed by r340148 and r342790 in stable/11 and stable/10 respectively, and doesn't apply to later versions of FreeBSD.