I fully realize that this is going to be a bear to track down. The machine in question is my home server ... so it runs a bit of everything. The trigger for the behavior seems to be more than 1000 torrents running. Stats on that later.
The kicker is: replace em0 (the PCIe card) with re0 (the motherboard RealTek NIC) and it goes away.
I have the em0 in the machine because I believe it's a better card. Sigh.
SO WHAT HAPPENS?
When em0 is misbehaving, the "mild" symptoms are local LAN latency from 500 ms to 5000 ms (this is why I jokingly accuse em0 of storing the packets). Beyond about 5000 ms, the packets seem to be dropped. This can be observed by pinging out from the console of the box, or by pinging it from another box.
I often first notice the box is having trouble when the UPS monitor loses network connectivity with the UPS.
Salient details I can think of? The output of "netstat -an | grep tcp4 | grep -v LISTEN | wc" shows 320 lines. 1000 torrents configured doesn't mean that many streams are active at once. It is possible that the behavior is related to the number of torrent streams in progress... or the number of TCP streams with small transmit queues. Mixed in with all that is some moderately fast SMB service for various media devices around the house.
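For the record, that figure is just a line count over socket state. Here is a minimal sketch of the same filter, run over a made-up sample of `netstat -an` output (the addresses and ports are invented for illustration; on the real host you would pipe `netstat -an` itself into the greps):

```shell
# Hypothetical sample of `netstat -an` output lines, for illustration only.
sample='tcp4  0  0  192.168.1.2.51413  203.0.113.7.6881   ESTABLISHED
tcp4  0  0  192.168.1.2.445    192.168.1.30.49152 ESTABLISHED
tcp4  0  0  *.22               *.*                LISTEN
udp4  0  0  *.514              *.*'

# Keep tcp4 sockets, drop listeners, count what is left.
count=$(printf '%s\n' "$sample" | grep tcp4 | grep -v LISTEN | wc -l)
echo "$count"   # 2 for this sample; the real box reported 320
```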
The em0 behavior is 100% related to the large number of torrents (running using rtorrent). Stopping rtorrent makes the host good again, starting rtorrent hoses it.
So the machine is:
FreeBSD virtual.xxx.xxx 10.3-RELEASE-p5 FreeBSD 10.3-RELEASE-p5 #4 r301872: Mon Jun 13 14:35:24 EDT 2016 firstname.lastname@example.org:/usr/obj/usr/src/sys/GENERIC amd64
CPU: AMD FX(tm)-9590 Eight-Core Processor (4716.02-MHz K8-class CPU)
Origin="AuthenticAMD" Id=0x600f20 Family=0x15 Model=0x2 Stepping=0
Structured Extended Features=0x8<BMI1>
TSC: P-state invariant, performance statistics
real memory = 34359738368 (32768 MB)
avail memory = 33186353152 (31648 MB)
The ethernet cards are:
em0: <Intel(R) PRO/1000 Network Connection 7.6.1-k> port 0x8000-0x801f mem 0xfe340000-0xfe35ffff,0xfe320000-0xfe33ffff irq 20 at device 0.0 on pci9
em0: Using an MSI interrupt
em0: Ethernet address: 00:15:17:0d:04:a8
em0@pci0:9:0:0: class=0x020000 card=0x10838086 chip=0x10b98086 rev=0x06 hdr=0x00
vendor = 'Intel Corporation'
device = '82572EI Gigabit Ethernet Controller (Copper)'
re0: <RealTek 8168/8111 B/C/CP/D/DP/E/F/G PCIe Gigabit Ethernet> port 0x7000-0x70ff mem 0xd2104000-0xd2104fff,0xd2100000-0xd2103fff irq 21 at device 0.0 on pci10
re0: Using 1 MSI-X message
re0: Chip rev. 0x48000000
re0: MAC rev. 0x00000000
miibus0: <MII bus> on re0
rgephy0: <RTL8169S/8110S/8211 1000BASE-T media interface> PHY 1 on miibus0
rgephy0: none, 10baseT, 10baseT-FDX, 10baseT-FDX-flow, 100baseTX, 100baseTX-FDX, 100baseTX-FDX-flow, 1000baseT-FDX, 1000baseT-FDX-master, 1000baseT-FDX-flow, 1000baseT-FDX-flow-master, auto, auto-flow
re0: Using defaults for TSO: 65518/35/2048
re0: Ethernet address: 10:c3:7b:9d:8b:6d
re0@pci0:10:0:0: class=0x020000 card=0x85051043 chip=0x816810ec rev=0x09 hdr=0x00
vendor = 'Realtek Semiconductor Co., Ltd.'
device = 'RTL8111/8168B PCI Express Gigabit Ethernet controller'
A couple of things to try while we are doing an overhaul of em(4):
Can you retry with 12-CURRENT?
This was finally shown to be due to jumbo allocation failing for >4k (greater-than-page-size) mbufs. As such, it would presumably still happen on 12-CURRENT if that allocation behavior is unchanged.
That said, I don't have a means to test on 12-CURRENT.
I suspect this will work a lot better on newer releases. If you have time to retest please let us know.
Have things changed significantly? I abandoned jumbo packets because they were more trouble than they were worth. With Covid, my wife is working at home now --- so the server affects more than one person if I get it to misbehave. I can't even logically delegate this to a VM ... as the outer system would need jumbo packets. Are pages even still 4k? So much information I don't have.
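On the page-size question: the base page size on amd64 is still 4k (superpages aside). A quick, portable way to check, using the POSIX getconf utility, which works on both FreeBSD and Linux:

```shell
# Query the base VM page size in bytes; 4096 on amd64 FreeBSD and Linux.
pagesize=$(getconf PAGESIZE)
echo "$pagesize"
```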
This wasn't difficult to trigger --- and if I'm the only one up for testing, I suppose I can make time --- but someone with a lab of systems might be a better choice.
The system in question had 32G ram then. It has 128G now.
(In reply to dgilbert from comment #5)
Yes, the Intel driver was substantially rewritten, so retesting would be necessary to continue, and some active data collection will be needed on top of that once the symptoms are reproduced.
Jumbo frames are a big topic to discuss in general; I would agree with your assessment not to use them. I have a secondary concern that you might have run into memory fragmentation issues, since the way larger frames are handled in FreeBSD is sub-optimal. That is not a driver bug, though, so it still makes sense to close this PR out.
This sounds a lot like an interrupt problem to me. I'm NOT suggesting that a newer FreeBSD/driver version won't help mitigate this. Just that it sounds to me like an interrupt "thing" -- especially when I see:
em0: Using an MSI interrupt
If you add ... to loader.conf(5), do you see a line with Quirks in it related to your card? I would expect to see MSI-X associated with your (em) card.
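For what it's worth, if you want to rule interrupts in or out by hand, pci(4) documents loader tunables that disable message-signaled interrupts globally. This is a blunt diagnostic rather than a fix --- every device in the system falls back to legacy INTx, not just em0:

```
# /boot/loader.conf -- pci(4) tunables; a reboot is required.
hw.pci.enable_msix="0"   # disable MSI-X allocation system-wide
hw.pci.enable_msi="0"    # also disable plain MSI, forcing legacy interrupts
```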
(In reply to Chris Hutchinson from comment #7)
There is an MSI erratum on this particular chip; see issue 63 in the specification update: https://raw.githubusercontent.com/tpn/pdfs/master/Intel%20Ethernet%20Controller%2082571EB%2C%2082572EI%2C%2082571GB%2C%2082571GI%20-%20Specification%20Update%20-%20Rev%206.8%20Nov%202014.pdf
I could hack something together for you right now to force a legacy interrupt if you want to investigate that. But first we need to see if the problem still exists on 12 or 13 for our effort to be useful.
To be clear, this didn't seem to be interrupt-related at all. It was resource starvation caused by memory fragmentation and the bone-headed way we handle 9k jumbo frames. The whole problem boiled down to having a scarce few 9k jumbo frame buffers servicing all the packets coming in (and the few large ones going out).
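For anyone reproducing this on a FreeBSD box, that starvation is visible from userland with stock tools; a sketch (the exact zone names in the output can vary by release):

```shell
# Summary of mbuf usage, including "requests for jumbo clusters denied"
# broken out by 4k/9k/16k -- a climbing 9k denied count is the smoking gun.
netstat -m

# Per-UMA-zone detail; watch the FAIL column for the 9k jumbo zone.
vmstat -z | egrep 'ITEM|jumbo'
```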
The system now has a Mellanox card (mlxen driver) in it. That's even more reason to prefer jumbo frames with 10GE, but, again... I haven't bothered, as it appeared that the jumbo frames were the issue, not the interrupts.
So... it's very possible that this ticket should be closed --- but only if another regarding jumbo frames > 4k is opened.
*** This bug has been marked as a duplicate of bug 255070 ***
(In reply to dgilbert from comment #9)
I agree with your assessment; I opened a tracking bug for the general issue of jumbos.