Bug 197535

Summary: [re] [panic] if_re (Realtek 8168) causes memory write after free and kernel panic
Product: Base System
Component: kern
Version: CURRENT
Hardware: Any
OS: Any
Status: Closed FIXED
Severity: Affects Some People
Priority: ---
Reporter: luca.pizzamiglio
Assignee: freebsd-net (Nobody) <net>
CC: chris, marius, ngie, ohartmann, yongari
Keywords: patch
Flags: luca.pizzamiglio: mfc-stable10?
Attachments:
Description                                      Flags
Dmesg and kernel panic on CURRENT                none
dmesg (verbose) and kernel panic on 10-STABLE    none
re_cfgv2.diff                                    none
patch solving the issue using this hardware      none

Description luca.pizzamiglio 2015-02-11 11:07:15 UTC
Created attachment 152865 [details]
Dmesg and kernel panic on CURRENT

When I set the network interface address, I get a bunch of "Memory modified after free" messages:
Memory modified after free 0xfffff800039de800(2048) val=ffffffff @ 0xfffff800039de800
Memory modified after free 0xfffff800039d4800(2048) val=ffffffff @ 0xfffff800039d4800

If I wait long enough (a couple of minutes) I get a kernel panic.

I have attached an example (dmesg + kernel panic).


I've also tested 10.1-STABLE: the same messages appear after ifconfig, but the kernel panic is different.

On 10, I very often see the value 0x3201c040 causing a segmentation fault (!), but I don't know where it comes from.

As for the messages, it could be that the init procedure of re(4) cannot correctly stop the device (a normal Realtek 8168), so the DMA addresses are overwritten by incoming packets.
Comment 1 luca.pizzamiglio 2015-02-11 11:08:08 UTC
Created attachment 152866 [details]
dmesg (verbose) and kernel panic on 10-STABLE
Comment 2 luca.pizzamiglio 2015-02-11 11:16:23 UTC
I'm adding the pciconf output:

pciconf -lvbce pci0:3:0:0
re0@pci0:3:0:0: class=0x020000 card=0x012310ec chip=0x816810ec rev=0x0c hdr=0x00
    vendor     = 'Realtek Semiconductor Co., Ltd.'
    device     = 'RTL8111/8168B PCI Express Gigabit Ethernet controller'
    class      = network
    subclass   = ethernet
    bar   [10] = type I/O Port, range 32, base rx1000, size 256, enabled
    bar   [18] = type Memory, range 64, base rx90500000, size 4096, enabled
    bar   [20] = type Prefetchable Memory, range 64, base rx90400000, size 16384, enabled
    cap 01[40] = powerspec 3  supports D0 D1 D2 D3  current D0
    cap 05[50] = MSI supports 1 message, 64 bit
    cap 10[70] = PCI-Express 2 endpoint IRQ 1 max data 128(128) link x1(x1)
                 speed 2.5(2.5) ASPM disabled(L0s/L1)
    cap 11[b0] = MSI-X supports 4 messages, enabled
                 Table in map 0x20[0x0], PBA in map 0x20[0x800]
    cap 03[d0] = VPD
    ecap 0001[100] = AER 1 0 fatal 0 non-fatal 2 corrected
    ecap 0002[140] = VC 1 max VC0
    ecap 0003[160] = Serial 1 01000000684ce000
    ecap 0018[170] = LTR 1
  PCI-e errors = Correctable Error Detected
     Corrected = Receiver Error
                 Bad DLLP
Comment 3 luca.pizzamiglio 2015-02-13 15:59:34 UTC
More information:
* The chip is recognized as 8168G
* Neither Rx nor Tx works
* While debugging the driver, I discovered that 0x3201c040 is the value of rl_desc->cmd_stat
* The Rx path gets 256 Rx descriptors, after which there are no more packets (RL_RDESC_STAT_OWN is always set)

It seems to me that the DMA/device is not correctly initialized and the communication with the device works in an unexpected way.
Comment 4 Marius Strobl freebsd_committer freebsd_triage 2015-02-15 00:00:59 UTC
Created attachment 152990 [details]
re_cfgv2.diff

You could give this patch a try (note that the location of if_rlreg.h depends
on the version of FreeBSD). It fixes a couple of bugs, mainly in the area of
receiver configuration of newer chips. Given that these configuration bits got
repurposed and it's unknown what both the old and new bits do exactly in later
MACs, this patch might make a difference for you.
That said, generally your problem appears to be caused by a hardware defect of
some sort. For one, rev. 0x4c000000 chips are known to work at this time. Also,
the memory used for descriptors shouldn't suddenly go away and cause a page
fault when accessed. Moreover, the freed memory neither containing the expected
0xdeadc0de nor some random bits but just always all ones in your case is very suspicious. Have you tested whether that piece of hardware works with Linux?
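For readers unfamiliar with the diagnostic, "Memory modified after free" comes from a debug check that fills freed buffers with a known pattern and verifies it when the buffer is reused. The following is a minimal user-space sketch of that idea, not the kernel's actual code; `POISON`, `poison_fill`, and `poison_check` are hypothetical names, and the pattern value is the 0xdeadc0de mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define POISON 0xdeadc0deU  /* fill pattern mentioned in the comment above */

/* Fill a freed buffer of nwords 32-bit words with the poison pattern. */
static void
poison_fill(uint32_t *buf, size_t nwords)
{
    assert(buf != NULL);
    for (size_t i = 0; i < nwords; i++)
        buf[i] = POISON;
}

/*
 * Return the index of the first word that no longer holds the pattern,
 * or -1 if the buffer is intact.
 */
static long
poison_check(const uint32_t *buf, size_t nwords)
{
    assert(buf != NULL);
    for (size_t i = 0; i < nwords; i++)
        if (buf[i] != POISON)
            return ((long)i);
    return (-1);
}
```

In this sketch, a device that keeps writing to a buffer after the driver has released it would show up as a non-negative index from `poison_check`, with the offending word holding whatever the device wrote (0xffffffff in the messages reported here).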
Comment 5 luca.pizzamiglio 2015-02-16 11:58:18 UTC
Thanks for the patch!
I tested it and unfortunately it doesn't make any difference.

I tested the board on Linux and it works.
I briefly looked at the Linux counterpart of the re driver, and it seems that it can dynamically load known firmware onto the chip, and that it does so for this specific chip. It could be that my 0x4c000000 card has a different firmware version.
I'll try to get more information on that side.

It really seems that the network card is writing received packets (via DMA?) to unexpected memory addresses, overwriting portions of the kernel and causing the panics.
Comment 6 Marius Strobl freebsd_committer freebsd_triage 2015-02-16 22:48:16 UTC
Your rev. 0x4c000000 MAC coming up with broken firmware could be another
reason, which would be unfortunate, though, as these images are GPLed.

However, the Atom E3800 errata has an entry (VLI30 in the non-NDA October
2013 version of that document) suggesting that the MMU will not behave
correctly when employing super pages, which would be a more plausible
cause for the problems you are experiencing. Thus, I'd give a kernel
with super page usage disabled a try. Last time I tested, unfortunately,
the corresponding loader tunable didn't take effect. So manually setting
pg_ps_enabled to 0 in sys/amd64/amd64/pmap.c and recompiling likely is
safest in order to do so.
Comment 7 luca.pizzamiglio 2015-02-17 17:30:58 UTC
I tested with super pages disabled and got the same behavior. I checked the sysctl and it was indeed disabled.
I also tested with TSO disabled (as suggested on the mailing list, even though TSO was already disabled) and got the same behavior.

I would love to implement the API to load this firmware blob in order to test it. It could take some time, but it seems doable. The firmware blobs are not GPLed; they are distributed as binaries with the vendor's permission. In any case, I only want to test it.

In parallel, I'll try to understand where these new rx_desc entries are written and whether there is a logical connection with the loaded DMA maps.
Comment 8 O. Hartmann 2015-02-17 18:30:30 UTC
See also Bug 193743 - RTL8111/8168B PCI Express Gigabit Ethernet controller: doesn't work properly, problems getting UP automatically
Comment 9 luca.pizzamiglio 2015-02-19 17:50:55 UTC
I collected some more information about this error, and I'm a little confused.
I added some printf calls to the re_rxeof() function to find out what is overwriting the memory after free (my suspicion was that DMA was writing to the wrong place):

re0: idx 22 - rxstat 0x3201C040 - cur_rx at 0xfffff8000179b160
re0: rl_rx_list 0xfffff8000179b000 - rl_rx_list_addr 0x179b000
Memory modified after free 0xfffff800069da800(2048) val=ffffffff @ 0xfffff800069da800
re0: newbuf m 0xfffff800069f0600 - segs.ds_addr 0x00000000069da800
re0: newbuf m->m_data 0xfffff800069da800

rl_rx_list is the pointer to the rl_desc list
rl_rx_list_addr is the physical address used by the device (DMA segment)
The newbuf function allocates a new mbuf and loads it into a new DMA segment:
m->m_data is the virtual address
segs.ds_addr is the physical address used by the device (DMA segment)
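The virtual/physical pairs in the log above differ by a constant offset, and the descriptor address together with its index gives the descriptor size. A small sketch of that arithmetic follows; `DMAP_BASE` is an assumption inferred from the logged values (it looks like the amd64 direct map), and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: constant offset between the logged virtual and physical addresses. */
#define DMAP_BASE 0xfffff80000000000ULL

/* Translate a direct-map virtual address to the physical address the device sees. */
static uint64_t
dmap_to_phys(uint64_t va)
{
    assert(va >= DMAP_BASE);
    return (va - DMAP_BASE);
}

/* Size in bytes of one descriptor, derived from a descriptor's address and index. */
static uint64_t
desc_size(uint64_t ring_va, uint64_t desc_va, uint64_t idx)
{
    assert(idx != 0 && desc_va >= ring_va);
    return ((desc_va - ring_va) / idx);
}
```

With the logged values, rl_rx_list 0xfffff8000179b000 maps to 0x179b000 and m->m_data 0xfffff800069da800 maps to 0x69da800, matching rl_rx_list_addr and segs.ds_addr. Descriptor 22 at 0xfffff8000179b160 sits 0x160 bytes into the ring, i.e. 16 bytes per descriptor, so the addresses handed to the device look consistent.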

Apparently, the driver really can get data via DMA.
I still have to explain:
* XX packets were received, yet "netstat -s" shows 0 packets received
* after the first 256 packets, the first descriptor of the ring is overwritten by a new one whose rl_desc->cmdstat is always 0x80000800 => no packets
* why some rl_desc entries go astray and cause page faults

The Linux firmware is not actually real firmware; it seems to me to be a way to encode and hide the chip initialization.
Comment 10 luca.pizzamiglio 2015-02-25 14:48:53 UTC
I finally understand the behavior of my re0 interface, but unfortunately I still have no solution.

When the driver gets a new descriptor on the rxring, the mbuf is still full of 0xdeadc0de.

The Memory modified after free:
Memory modified after free 0xfffff800039de800(2048) val=ffffffff @ 0xfffff800039de800
The value is the first 4 bytes of a broadcast Ethernet packet, and the address is one previously used to store an mbuf.

The first conclusion is that the packet data has not yet been written when an rxring entry is handed back to the driver.

I've also checked why rxring entries go astray and cause page faults: simply put, the ring is being treated as an array :) In other words, new entries are always written sequentially, running past the end of the ring.
When the last element of the ring (idx=255) is used, it is correctly rewritten with 0xC0000800 instead of 0x80000800, signalling that it is the last entry of the ring.
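To make the two values concrete, here is a small sketch that decodes them and shows what honoring the end-of-ring mark means for the next index. The bit meanings are inferred from the values quoted in this report; the macro and function names are illustrative only, and the width of the length field is an assumption.

```c
#include <assert.h>
#include <stdint.h>

#define RING_SIZE   256          /* ring length reported above (last idx = 255) */
#define DESC_OWN    0x80000000U  /* descriptor currently belongs to the device */
#define DESC_EOR    0x40000000U  /* end of ring: wrap to index 0 after this one */
#define DESC_BUFLEN 0x00003fffU  /* assumed width of the buffer length field */

static int
desc_owned_by_nic(uint32_t cmdstat)
{
    return ((cmdstat & DESC_OWN) != 0);
}

static int
desc_is_last(uint32_t cmdstat)
{
    return ((cmdstat & DESC_EOR) != 0);
}

static uint32_t
desc_buflen(uint32_t cmdstat)
{
    return (cmdstat & DESC_BUFLEN);
}

/* Next index for a consumer that honors the end-of-ring mark. */
static unsigned
ring_next(unsigned idx, uint32_t cmdstat)
{
    assert(idx < RING_SIZE);
    return (desc_is_last(cmdstat) ? 0 : idx + 1);
}
```

Under these assumptions, 0x80000800 is a device-owned descriptor with a 2048-byte buffer, and 0xC0000800 is the same with the end-of-ring mark. A consumer that sees the mark at idx=255 wraps to 0; one that misses it continues to index 256, past the end of the ring, which matches the behavior described above.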

Any ideas?
Comment 11 luca.pizzamiglio 2015-03-04 15:14:01 UTC
Created attachment 153780 [details]
patch solving the issue using this hardware

Explanation and solution

The actual problem was that the updates the driver made to the rx ring (and probably the tx ring too) were not reaching the device.
That explains why the last entry in the ring was ignored and why only a few packets were really copied via DMA.

I found this commit in the Linux git repository: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=d6e572911a4cb2b9fcd1c26a38d5317a3971f2fd

It seems that for "some" chips, Rx and Tx should be enabled in the RL_COMMAND register later, after the rings have been configured and so on.

The patch only shows how I applied this hint; I don't know how it could affect other devices.
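The ordering constraint can be illustrated with a toy model. This is only a sketch: the step names are hypothetical, it touches no hardware, and the interrupt-related steps are taken from the wording of the commit log that appears later in this report rather than from the attached patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical init steps; the real driver programs device registers. */
enum init_step {
    STEP_SETUP_RINGS,       /* program the Rx/Tx descriptor rings */
    STEP_INTR_MODERATION,   /* set up interrupt moderation */
    STEP_ENABLE_RXTX,       /* set the Rx/Tx enable bits in the command register */
    STEP_ENABLE_INTR        /* turn interrupts on */
};

#define NSTEPS 4

/* Ordering after the fix: Rx/Tx are enabled only once configuration is done. */
static const enum init_step fixed_order[NSTEPS] = {
    STEP_SETUP_RINGS, STEP_INTR_MODERATION, STEP_ENABLE_RXTX, STEP_ENABLE_INTR
};

/* Return 1 if step a occurs before step b in order[], 0 otherwise. */
static int
step_before(const enum init_step *order, size_t n,
    enum init_step a, enum init_step b)
{
    size_t ia = n, ib = n;

    assert(order != NULL);
    for (size_t i = 0; i < n; i++) {
        if (order[i] == a)
            ia = i;
        if (order[i] == b)
            ib = i;
    }
    return (ia < n && ib < n && ia < ib);
}

/* Check the constraint: rings and moderation, then Rx/Tx, then interrupts. */
static int
order_is_safe(const enum init_step *order, size_t n)
{
    return (step_before(order, n, STEP_SETUP_RINGS, STEP_ENABLE_RXTX) &&
        step_before(order, n, STEP_INTR_MODERATION, STEP_ENABLE_RXTX) &&
        step_before(order, n, STEP_ENABLE_RXTX, STEP_ENABLE_INTR));
}
```

A sequence that enables Rx/Tx first fails `order_is_safe`, which is the situation on the affected chips: the device starts running before it has picked up the ring configuration.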
Comment 12 Enji Cooper freebsd_committer freebsd_triage 2015-04-09 16:59:10 UTC
The patch (minus the commented out code) LGTM. I’ll test it out on my workstation (a few kldunload/kldload/ifconfig/dhclient runs) and commit it if no one objects in a few days.
Comment 13 Enji Cooper freebsd_committer freebsd_triage 2015-04-09 17:00:05 UTC
And just for reference, here's what I'm running:

re0@pci0:6:0:0: class=0x020000 card=0x83671043 chip=0x816810ec rev=0x02 hdr=0x00
    vendor     = 'Realtek Semiconductor Co., Ltd.'
    device     = 'RTL8111/8168B PCI Express Gigabit Ethernet controller'
    class      = network
    subclass   = ethernet
$ uname -a
FreeBSD bayonetta.local 9.3-RELEASE FreeBSD 9.3-RELEASE #0 r268512: Thu Jul 10 23:44:39 UTC 2014     root@snap.freebsd.org:/usr/obj/usr/src/sys/GENERIC  amd64
Comment 14 commit-hook freebsd_committer freebsd_triage 2015-04-09 21:36:29 UTC
A commit references this bug:

Author: marius
Date: Thu Apr  9 21:35:45 UTC 2015
New revision: 281337
URL: https://svnweb.freebsd.org/changeset/base/281337

Log:
  Don't enable RX and TX before their initial configuration is done, i. e.
  after setting up interrupt moderation but before turning interrupts on.
  This matches what Realtek's r8168 Linux driver does as of version 8.039.00
  and fixes problems with certain incarnations of certain MAC revisions
  like the interface requiring an extra up/down-cycle after boot to start
  working or DMA configuration not being adhered to.

  PR:		193743, 197535
  MFC after:	1 week

Changes:
  head/sys/dev/re/if_re.c
Comment 15 commit-hook freebsd_committer freebsd_triage 2015-07-05 20:17:38 UTC
A commit references this bug:

Author: marius
Date: Sun Jul  5 20:16:39 UTC 2015
New revision: 285177
URL: https://svnweb.freebsd.org/changeset/base/285177

Log:
  MFC: r281337

  Don't enable RX and TX before their initial configuration is done, i. e.
  after setting up interrupt moderation but before turning interrupts on.
  This matches what Realtek's r8168 Linux driver does as of version 8.039.00
  and fixes problems with certain incarnations of certain MAC revisions
  like the interface requiring an extra up/down-cycle after boot to start
  working or DMA configuration not being adhered to.

  PR:		193743, 197535
  Approved by:	re (kib)

Changes:
_U  stable/10/
  stable/10/sys/dev/re/if_re.c
Comment 16 commit-hook freebsd_committer freebsd_triage 2015-07-05 20:17:43 UTC
A commit references this bug:

Author: marius
Date: Sun Jul  5 20:16:46 UTC 2015
New revision: 285178
URL: https://svnweb.freebsd.org/changeset/base/285178

Log:
  MFC: r281337

  Don't enable RX and TX before their initial configuration is done, i. e.
  after setting up interrupt moderation but before turning interrupts on.
  This matches what Realtek's r8168 Linux driver does as of version 8.039.00
  and fixes problems with certain incarnations of certain MAC revisions
  like the interface requiring an extra up/down-cycle after boot to start
  working or DMA configuration not being adhered to.

  PR:	193743, 197535

Changes:
_U  stable/9/sys/
_U  stable/9/sys/dev/
  stable/9/sys/dev/re/if_re.c
Comment 17 commit-hook freebsd_committer freebsd_triage 2015-07-05 20:17:47 UTC
A commit references this bug:

Author: marius
Date: Sun Jul  5 20:16:52 UTC 2015
New revision: 285179
URL: https://svnweb.freebsd.org/changeset/base/285179

Log:
  MFC: r281337

  Don't enable RX and TX before their initial configuration is done, i. e.
  after setting up interrupt moderation but before turning interrupts on.
  This matches what Realtek's r8168 Linux driver does as of version 8.039.00
  and fixes problems with certain incarnations of certain MAC revisions
  like the interface requiring an extra up/down-cycle after boot to start
  working or DMA configuration not being adhered to.

  PR:	193743, 197535

Changes:
_U  stable/8/sys/
_U  stable/8/sys/dev/
_U  stable/8/sys/dev/re/
  stable/8/sys/dev/re/if_re.c