Bug 206932 - Realtek 8111 card stops responding under high load in netmap mode
Summary: Realtek 8111 card stops responding under high load in netmap mode
Status: In Progress
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: CURRENT
Hardware: Any
OS: Any
Importance: --- Affects Only Me
Assignee: freebsd-net mailing list
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-02-04 21:53 UTC by Olivier - interfaSys sàrl
Modified: 2019-01-11 09:46 UTC
CC: 1 user

See Also:


Description Olivier - interfaSys sàrl 2016-02-04 21:53:27 UTC
I've filed a bug report with the netmap project, but it seems the FreeBSD project uses a different tree, so I'm reporting it here as well.
I've reproduced the problem with:
* 10.1
* 10.2
* 10.2 with the netmap + re code from 11-CURRENT
* 10.2 with netmap from the official repository (master)

The problem is always the same: using pkt-gen, after 20 or so "batches" the card is overloaded and stops responding. I've tried various driver settings (polling, fast queue, no MSI, IRQ filtering, etc.), but nothing helped.
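
For reference, the transmit path that pkt-gen exercises is essentially the loop below (a minimal sketch using the public netmap_user.h helpers; the interface name, frame length and frame contents are placeholders, not the exact parameters I used):

```
/*
 * Minimal netmap transmit loop, roughly one "batch" per poll() wakeup.
 * Sketch only: error handling trimmed, frame contents are dummies.
 */
#define NETMAP_WITH_LIBS
#include <net/netmap_user.h>
#include <poll.h>
#include <string.h>

int
main(void)
{
	struct nm_desc *d = nm_open("netmap:re0", NULL, 0, NULL);
	struct netmap_ring *ring;
	struct pollfd pfd;

	if (d == NULL)
		return (1);
	ring = NETMAP_TXRING(d->nifp, d->first_tx_ring);
	pfd.fd = d->fd;
	pfd.events = POLLOUT;

	for (;;) {
		poll(&pfd, 1, 1000);			/* kernel txsync happens here */
		while (nm_ring_space(ring) > 0) {	/* fill every free slot */
			struct netmap_slot *slot = &ring->slot[ring->cur];
			char *buf = NETMAP_BUF(ring, slot->buf_idx);

			memset(buf, 0, 60);		/* dummy 60-byte frame */
			slot->len = 60;
			ring->cur = nm_ring_next(ring, ring->cur);
		}
		ring->head = ring->cur;			/* hand the batch to the kernel */
	}
	nm_close(d);	/* not reached */
	return (0);
}
```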

There is a driver from Realtek, but it doesn't support netmap, so I tried to patch it; I got exactly the same results as described in other netmap issues. Only one batch makes it through. If I limit the rate, it fails once the total sent matches the size of a default batch.

One thing I've noticed in my tests is that the generic software implementation (which works flawlessly, but eats a lot of CPU) uses 1024 slots, and when looking at the number of mbufs used with netstat, I can see that 1024 are in use.
In dmesg, I can see that the re(4) driver reports 256 netmap slots, but in netstat it uses 512 mbufs and sometimes even more (erratic changes, up to 600+, at which point things fail).

Could this be the reason? Is this fixable in netmap, or is this a driver issue that should be reported against the FreeBSD re(4) driver?
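
To tell "queues" (rings) apart from slots per ring, a small program along these lines prints what netmap actually exposes for re0 (a sketch, assuming the standard netmap_user.h wrappers). My understanding is that the 1024 figure seen with the generic adapter is its per-ring slot count (the dev.netmap.generic_ringsize sysctl), not a number of queues:

```
/* Sketch: print the TX rings and slots-per-ring netmap exposes for re0. */
#define NETMAP_WITH_LIBS
#include <net/netmap_user.h>
#include <stdio.h>

int
main(void)
{
	struct nm_desc *d = nm_open("netmap:re0", NULL, 0, NULL);
	unsigned int i;

	if (d == NULL)
		return (1);
	printf("TX rings: %u\n", d->nifp->ni_tx_rings);
	for (i = d->first_tx_ring; i <= d->last_tx_ring; i++)
		printf("ring %u: %u slots\n", i,
		    NETMAP_TXRING(d->nifp, i)->num_slots);
	nm_close(d);
	return (0);
}
```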

Details about the card:

```
re0: <RealTek 8168/8111 B/C/CP/D/DP/E/F/G PCIe Gigabit Ethernet> port 0xe000-0xe0ff mem 0x81300000-0x81300fff,0xa0100000-0xa0103fff irq 17 at device 0.0 on pci2
re0: Using 1 MSI-X message
re0: turning off MSI enable bit.
re0: Chip rev. 0x4c000000
re0: MAC rev. 0x00000000
miibus0: <MII bus> on re0
rgephy0: <RTL8251 1000BASE-T media interface> PHY 1 on miibus0
rgephy0:  none, 10baseT, 10baseT-FDX, 10baseT-FDX-flow, 100baseTX, 100baseTX-FDX, 100baseTX-FDX-flow, 1000baseT-FDX, 1000baseT-FDX-master, 1000baseT-FDX-flow, 1000baseT-FDX-flow-master, auto, auto-flow
re0: Using defaults for TSO: 65518/35/2048
re0: netmap queues/slots: TX 1/256, RX 1/256

re0@pci0:2:0:0: class=0x020000 card=0x012310ec chip=0x816810ec rev=0x0c hdr=0x00
    vendor     = 'Realtek Semiconductor Co., Ltd.'
    device     = 'RTL8111/8168B PCI Express Gigabit Ethernet controller'
    class      = network
    subclass   = ethernet
    bar   [10] = type I/O Port, range 32, base 0xe000, size 256, enabled
    bar   [18] = type Memory, range 64, base 0x81300000, size 4096, enabled
    bar   [20] = type Prefetchable Memory, range 64, base 0xa0100000, size 16384, enabled
```
Comment 1 Olivier - interfaSys sàrl 2016-02-05 02:09:19 UTC
I've just tested on 11-CURRENT and got the same results.
Comment 2 Olivier - interfaSys sàrl 2016-02-05 15:38:00 UTC
With re0 set to an MTU of 9000, the connection stays alive. Instead of timing out, the packet rate drops drastically once and then things go back to normal.
The main difference in netstat is that the mbuf clusters are split between standard and jumbo frames:

```
768/2787/3555 mbufs in use (current/cache/total)
256/1524/1780/500200 mbuf clusters in use (current/cache/total/max)
256/1515 mbuf+clusters out of packet secondary zone in use (current/cache)
0/46/46/250099 4k (page size) jumbo clusters in use (current/cache/total/max)
256/65/321/74103 9k jumbo clusters in use (current/cache/total/max)
0/0/0/41683 16k jumbo clusters in use (current/cache/total/max)
3008K/4513K/7521K bytes allocated to network (current/cache/total)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile
```

The interrupt rate shown by vmstat -i keeps rising, but that doesn't seem to be a problem:

```
interrupt                          total       rate
irq16: sdhci_pci0                      1          0
cpu0:timer                       3008083       1113
irq256: ahci0                      10125          3
irq257: xhci0                      11363          4
irq258: hdac0                          3          0
irq259: re0                     13105929       4850
irq260: re1                       101440         37
cpu2:timer                       1095578        405
cpu1:timer                       1083354        400
cpu3:timer                       1123144        415
Total                           19539020       7231
```
Comment 3 Olivier - interfaSys sàrl 2016-02-05 16:52:56 UTC
I think this is what gets logged when things start failing:
```
231.020147 [2925] netmap_transmit           re0 full hwcur 0 hwtail 0 qlen 255 len 42 m 0xfffff80052005400
235.997171 [2925] netmap_transmit           re0 full hwcur 0 hwtail 0 qlen 255 len 42 m 0xfffff80023ec9c00
240.989245 [2925] netmap_transmit           re0 full hwcur 0 hwtail 0 qlen 255 len 42 m 0xfffff800521ad500
247.887586 [2925] netmap_transmit           re0 full hwcur 0 hwtail 0 qlen 255 len 42 m 0xfffff80023da9b00
253.069781 [2925] netmap_transmit           re0 full hwcur 0 hwtail 0 qlen 255 len 42 m 0xfffff80023ec7700
258.110746 [2925] netmap_transmit           re0 full hwcur 0 hwtail 0 qlen 255 len 42 m 0xfffff800521ade00
263.188076 [2925] netmap_transmit           re0 full hwcur 0 hwtail 0 qlen 255 len 42 m 0xfffff800237d6900
```
Comment 4 Olivier - interfaSys sàrl 2016-05-24 09:43:14 UTC
Using fresh netmap from FreeBSD 11 and a newer pkt-gen, this is what I see.

```
986.519903 [2163] netmap_ioctl              nr_cmd must be 0 not 12
047.486179 [1481] nm_txsync_prologue        fail head < kring->rhead || head > kring->rtail
047.510386 [1511] nm_txsync_prologue        re0 TX0 kring error: head 107 cur 107 tail 106 rhead 52 rcur 52 rtail 106 hwcur 52 hwtail 106
047.534818 [1612] netmap_ring_reinit        called for re0 TX0
051.945718 [1481] nm_txsync_prologue        fail head < kring->rhead || head > kring->rtail
051.990215 [1511] nm_txsync_prologue        re0 TX0 kring error: head 225 cur 225 tail 224 rhead 223 rcur 223 rtail 224 hwcur 223 hwtail 224
052.009143 [1612] netmap_ring_reinit        called for re0 TX0
```
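
For reference, the invariant those "kring error" lines report is roughly the following (my own sketch of the check, not the actual kernel code; it ignores ring wrap-around, which the real code handles):

```
#include <stdint.h>

/*
 * The application may only move the TX ring's head forward through
 * slots the kernel has already released, i.e. head must stay between
 * the last seen rhead and rtail (simplified, no roll-over).
 */
static int
tx_head_is_valid(uint32_t head, uint32_t rhead, uint32_t rtail)
{
	return (!(head < rhead || head > rtail));
}

/*
 * From the log above: head 107 with rhead 52 and rtail 106 fails because
 * head > rtail, so netmap_ring_reinit() resets the ring; head 225 against
 * rtail 224 fails the same way.
 */
```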

Once these errors appear, pkt-gen exits with an error.

I've also tried using the netmap software emulation and it crashes even earlier.
Comment 5 Vincenzo Maffione freebsd_committer 2019-01-11 09:46:32 UTC
The current netmap code in HEAD, stable/11 and stable/12 is aligned with the GitHub repository (and the code has changed quite a lot since 2016).
I just tried to run pkt-gen (tx or rx) in a VM with an r8169 emulated NIC, and everything seems to work fine for me.

Can you check if the issue is still there?