202680 – Silent data corruption on em(4) interfaces

Status:	Closed Not Enough Information

Alias:	None

Product:	Base System
Classification:	Unclassified
Component:	kern
Version:	10.2-STABLE
Hardware:	Any Any

Importance:	--- Affects Some People
Assignee:	freebsd-net (Nobody)

URL:
Keywords:	IntelNetworking

Depends on:
Blocks:

Reported:	2015-08-26 16:35 UTC by Dmitry Afanasiev
Modified:	2021-12-12 19:13 UTC
CC List:	5 users

See Also:

Attachments
Corrupted data in transferred file (3.12 KB, application/octet-stream) 2015-08-31 11:11 UTC, Dmitry Afanasiev
pciconf -lvbc output (9.83 KB, text/plain) 2015-09-01 18:27 UTC, Dmitry Afanasiev
boot -v output (36.19 KB, text/plain) 2015-09-01 18:28 UTC, Dmitry Afanasiev
output of pciconf -lv (5.31 KB, text/plain) 2015-11-10 18:30 UTC, Yaniv Kunda
output of dmesg -a (19.49 KB, text/plain) 2015-11-10 18:30 UTC, Yaniv Kunda

Description Dmitry Afanasiev 2015-08-26 16:35:06 UTC

I am seeing silent data corruption when transferring data over em(4) interfaces on 10.2-STABLE r286912.
It does not occur immediately after a reboot: right after booting I can transfer files without any errors.
The problem occurs *only* after a few hours of uptime. On a system with uptime > 1 day I observed:
1. A large file transferred via FTP arrived corrupted (MD5 checksum mismatch).
For example:
Original MD5(eis-dvd.iso)= 3705908d84f3c0f9f922a00e6e9332d2
Downloaded MD5 (eis-dvd.iso) = 1f7de1fdc682ae9ca26ec0621b377e89

2. SSH disconnects during large transfers with the message:
Corrupted MAC on input.
Disconnecting: Packet corrupt
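
The MD5 comparison in item 1 can be scripted so that it is repeatable after every transfer. This is only a sketch: the function names md5_of and verify_md5 are made up for illustration, and the fallback to md5sum(1) is an assumption for hosts that lack FreeBSD's md5(1).

```shell
#!/bin/sh
# md5_of FILE: print the hex MD5 digest of FILE.
# Uses md5(1) where available (FreeBSD) and falls back to md5sum(1).
md5_of() {
    if command -v md5 >/dev/null 2>&1; then
        md5 -q "$1"
    else
        md5sum "$1" | awk '{print $1}'
    fi
}

# verify_md5 FILE EXPECTED: return 0 on a match, 1 on a mismatch,
# 2 if the digest tool itself reported a failure.
verify_md5() {
    actual=$(md5_of "$1") || return 2
    if [ "$actual" = "$2" ]; then
        echo "OK $1"
    else
        echo "MISMATCH $1: expected $2, got $actual"
        return 1
    fi
}
```

On the receiving host this would be called as, for example, verify_md5 eis-dvd.iso 3705908d84f3c0f9f922a00e6e9332d2.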

I am using a Sun Fire X4100 server with four built-in Intel 82546EB Ethernet ports:
em0@pci0:1:1:0: class=0x020000 card=0x10118086 chip=0x10108086 rev=0x03 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '82546EB Gigabit Ethernet Controller (Copper)'
    class      = network
    subclass   = ethernet
This server works under Solaris without any problems.

I tried to use:
1. lagg0.2 (lagg0 with vlan tag) over em0 and em2 interfaces:
cloned_interfaces="lagg0 lagg0.2"
ifconfig_em0="up"
ifconfig_em2="up"
create_args_lagg0="laggproto lacp laggport em0 laggport em2"

2. a tagged em1.2 interface, created manually with ifconfig em1.2 create 192.168.X.Y

I got the same problem in both cases.

*Once*, while transferring a file, I got these messages:
em0: Interface stopped DISTRIBUTING, possible flapping
em0: Watchdog timeout -- resetting
em2: Interface stopped DISTRIBUTING, possible flapping
em2: Watchdog timeout -- resetting
But this happened only once.

netstat -in does not show any errors on the interfaces:
Name    Mtu Network       Address              Ipkts Ierrs Idrop    Opkts Oerrs  Coll
em0    1500 <Link#1>      00:14:4f:01:3f:7a  6689452     0     0   146720     0     0
em1    1500 <Link#2>      00:14:4f:01:3f:7b  5732168     0     0  2865912     0     0
em2    1500 <Link#3>      00:14:4f:01:3f:7c   501817     0     0  3392333     0     0

TCP_OFFLOAD is disabled in the kernel config.

The system is not in production use, so I can try rolling back to pre-r269196 code or collect any additional diagnostic information required, but I can't provide remote access to this system.

Comment 1 Dmitry Afanasiev 2015-08-31 11:11:04 UTC

Created attachment 160557
Corrupted data in transferred file

I created a 16 GiB file "hugefile" containing 16777216 lines, each consisting of 1023 spaces and one '\n' (16777216 * (1023 + 1) = 17179869184 bytes = 16 GiB), and transferred this file via FTP.
On the destination machine I checked the resulting file with this command:
$ grep '[^ ]' hugefile > wrong

The result (the wrong data found in hugefile) is in this attachment.
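
For anyone who wants to repeat this test, here is a rough sketch of how such a file might be generated and checked. The helper names make_space_file and count_bad_lines are hypothetical, and using awk to produce the lines is an assumption; the comment above does not say how the original file was created.

```shell
#!/bin/sh
# make_space_file PATH LINES: write LINES lines, each 1023 spaces plus '\n'
# (1024 bytes per line, so 16777216 lines give 17179869184 bytes).
make_space_file() {
    awk -v n="$2" 'BEGIN {
        line = sprintf("%1023s", "")     # 1023 spaces
        for (i = 0; i < n; i++) print line
    }' > "$1"
}

# count_bad_lines PATH: print the number of lines containing anything other
# than spaces, the same test as grep '[^ ]' above.
count_bad_lines() {
    # grep -c exits with 1 when the count is 0; only a status above 1 is an error.
    n=$(grep -c '[^ ]' "$1") || [ $? -eq 1 ] || return 2
    echo "$n"
}
```

A small run such as make_space_file /tmp/spaces 1024 is enough to try the check locally; passing 16777216 as the line count gives the full-size file described above.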

As far as I know, errors are checked at:
- the Ethernet level
- the IP level
- the TCP level,
but the data delivered by TCP was still corrupted.
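
For reference, the IP and TCP checks listed above both use the 16-bit ones' complement Internet checksum (RFC 1071), which is much weaker than the Ethernet CRC. The sketch below computes it over 16-bit words given as hex and shows one known blind spot: reordering whole words leaves the checksum unchanged. The function name inet_checksum and the sample words are illustrative only, and nothing here is a diagnosis of this bug.

```shell
#!/bin/sh
# inet_checksum WORD...: print the 16-bit ones' complement checksum of the
# given 16-bit words, each written as 4 hex digits.
inet_checksum() {
    sum=0
    for w in "$@"; do
        sum=$((sum + 0x$w))
    done
    # Fold any carries above bit 15 back into the low 16 bits.
    while [ $((sum >> 16)) -ne 0 ]; do
        sum=$(( (sum & 0xFFFF) + (sum >> 16) ))
    done
    printf '%04x\n' $(( ~sum & 0xFFFF ))
}

# A sample IPv4 header with its checksum field set to 0000.
inet_checksum 4500 0073 0000 4000 4011 0000 c0a8 0001 c0a8 00c7
# The same words with two of them swapped give the same result.
inet_checksum 4500 0073 0000 4000 4011 0000 c0a8 00c7 c0a8 0001
```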

Comment 2 Hiren Panchasara freebsd_committer

2015-08-31 16:26:19 UTC

I asked this on the mailing list and am repeating it here: do you have TSO turned on? If so, can you try disabling it to rule out that possibility?
For example, 'ifconfig em0 -tso' should disable TSO on em0.

Comment 3 Dmitry Afanasiev 2015-09-01 12:30:44 UTC

(In reply to Hiren Panchasara from comment #2)

As I wrote above, TSO is disabled on this system.
I had problems with TSO on other systems before, so I commented out TCP_OFFLOAD in my kernel config.

# ifconfig | grep ^em
em0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
em1: flags=8802<BROADCAST,SIMPLEX,MULTICAST> metric 0 mtu 1500
em2: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
em3: flags=8802<BROADCAST,SIMPLEX,MULTICAST> metric 0 mtu 1500

# ifconfig | grep -i tso | wc -l
0

Also, "ifconfig em0 -tso" does nothing, because TSO is already disabled.
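
A slightly stricter variant of the grep check above looks only at the options= line of one interface and matches the TSO4 or TSO6 capability flags, so that other flags containing "TSO" (such as VLAN_HWTSO) are not counted. This is a sketch; the helper name has_tso is hypothetical.

```shell
#!/bin/sh
# has_tso: read the output of `ifconfig IFACE` on stdin and succeed only if
# its options= line lists TSO4 or TSO6 as a separate flag.
has_tso() {
    grep 'options=' | grep -Eq '[<,]TSO[46][,>]'
}
```

Typical use would be: ifconfig em0 | has_tso && echo "TSO is enabled on em0".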

Comment 4 Sean Bruno freebsd_committer

2015-09-01 16:21:41 UTC

The 82546EB is an if_lem.c consumer, and I haven't touched that driver yet.

I suspect that this might be a similar issue to PR 196980.

Can you post the output of "pciconf -lvbc" for the host as well as a full, boot verbose dmesg?

Comment 5 Dmitry Afanasiev 2015-09-01 18:27:40 UTC

Created attachment 160604
pciconf -lvbc output

Comment 6 Dmitry Afanasiev 2015-09-01 18:28:11 UTC

Created attachment 160605
boot -v output

Comment 7 Dmitry Afanasiev 2015-09-01 18:42:49 UTC

(In reply to Sean Bruno from comment #4)
The "pciconf -lvbc" output and dmesg.boot from "boot -v" are attached.

I'm not sure that the problem is similar to PR 196980, because I have no "Allocation of PCI resources failed" errors and all four interfaces in my Sun Fire X4100 are working.

Also, I think the data corruption problem exists in 9-STABLE too.
I tried to scp from another host to 9.3-STABLE r274181 on another X4100 server and got:
Corrupted MAC on input.
Disconnecting: Packet corrupt
lost connection

Comment 8 Sean Bruno freebsd_committer

2015-09-02 16:35:45 UTC

Can you try to apply three changesets from head and see if the condition still exists?

https://svnweb.freebsd.org/base?view=revision&revision=286831
https://svnweb.freebsd.org/base?view=revision&revision=285879
https://svnweb.freebsd.org/base?view=revision&revision=287330

Comment 9 Sean Bruno freebsd_committer

2015-09-02 16:36:32 UTC

(In reply to Sean Bruno from comment #8)
Wait, never mind.  This is a lem(4) device, not an em(4) device, so those patches are not applicable.  I'll think about it some more.

Comment 10 Dmitry Afanasiev 2015-09-14 20:46:59 UTC

I also installed an old Sun gigabit Ethernet adapter (cas(4) driver) in the affected FreeBSD 10.2 system and reconfigured lagg to use only this adapter.
With system uptime > 10 days, I transferred huge files several times via ssh and ftp without any problems.
I think the problem is specific to the em(4) network driver and not related to other parts of the system.

Comment 11 Sean Bruno freebsd_committer

2015-09-15 18:59:38 UTC

(In reply to Dmitry Afanasiev from comment #10)
Can you see if this change in head helps with your issue?  https://svnweb.freebsd.org/base?view=revision&revision=287330

Comment 12 Dmitry Afanasiev 2015-09-15 19:44:15 UTC

(In reply to Sean Bruno from comment #11)
sys/dev/e1000 in head differs from stable/10 in many places.
For example, the lem driver version is different:
< char lem_driver_version[] = "1.0.6";
---
> char lem_driver_version[] = "1.1.0";

Should I use the entire sys/dev/e1000 directory from head in my stable/10 system?

Comment 13 Dmitry Afanasiev 2015-09-15 20:16:20 UTC

(In reply to Dmitry Afanasiev from comment #12)
The sys/dev/e1000 tree from head is not compatible with stable/10 :(

Comment 14 Dmitry Afanasiev 2015-10-19 08:15:39 UTC

I tried FreeBSD-CURRENT from a nightly snapshot:
FreeBSD sunrise0 11.0-CURRENT FreeBSD 11.0-CURRENT #0 r289044: Thu Oct  8 21:21:40 UTC 2015     root@releng2.nyi.freebsd.org:/usr/obj/usr/src/sys/GENERIC  amd64

But nothing changed: after 2 days of uptime I again got an incorrect MD5 checksum and these messages from ssh:
Corrupted MAC on input.
Disconnecting: Packet corrupt

Comment 15 Yaniv Kunda 2015-11-09 21:23:26 UTC

This happens to me as well, on NAS4Free 10.2.0.2.1906 (based on FreeBSD 10.2-RELEASE-p5), and it also happened previously on several 9.3.x versions (based on FreeBSD 9.3-RELEASE-y).

Here is my device info from pciconf:
em0@pci0:3:0:0: class=0x020000 card=0x10828086 chip=0x107d8086 rev=0x06 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '82572EI Gigabit Ethernet Controller (Copper)'
    class      = network
    subclass   = ethernet
    bar   [10] = type Memory, range 32, base rxfbce0000, size 131072, enabled
    bar   [14] = type Memory, range 32, base rxfbcc0000, size 131072, enabled
    bar   [18] = type I/O Port, range 32, base rxdf00, size 32, enabled
    cap 01[c8] = powerspec 2  supports D0 D3  current D0
    cap 05[d0] = MSI supports 1 message, 64 bit enabled with 1 message
    cap 10[e0] = PCI-Express 1 endpoint max data 128(256) link x1(x1)
                 speed 2.5(2.5) ASPM disabled(L0s)
    ecap 0001[100] = AER 1 1 fatal 1 non-fatal 5 corrected
    ecap 0003[140] = Serial 1 001b21ffffb43a7f

Comment 16 Sean Bruno freebsd_committer

2015-11-10 16:29:05 UTC

(In reply to Yaniv Kunda from comment #15)
What type of hardware are you using?  Can you dump a full pciconf -lv and dmesg into this ticket?

Comment 17 Yaniv Kunda 2015-11-10 18:30:02 UTC

Created attachment 162976
output of pciconf -lv

Comment 18 Yaniv Kunda 2015-11-10 18:30:34 UTC

Created attachment 162977
output of dmesg -a

Comment 19 Yaniv Kunda 2015-11-10 18:33:57 UTC

(In reply to Sean Bruno from comment #16)
I'm using commodity desktop PC hardware, with NAS4Free installed on a USB stick.
The outputs of pciconf -lv and dmesg -a are attached.

I'd like to help find the root cause of this, so please tell me what else I can do to assist.

Comment 20 Yaniv Kunda 2015-11-10 22:11:33 UTC

(In reply to Yaniv Kunda from comment #19)

I also tried again after running ifconfig em0 -tso (and verified that it removed TSO4 from em0's options), but with the same result: a big file copy was silently corrupted.

Comment 21 Sean Bruno freebsd_committer

2015-11-11 15:22:25 UTC

Can you folks try setting the processing limit to -1?  I'm curious whether what you're seeing is a symptom of conflicting processing between interrupt context and the deferred handling in the taskqueue threads.

Change:
hw.em.rx_process_limit: 100
to:
hw.em.rx_process_limit = -1
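
One way to apply this as a boot-time tunable is a line of the form hw.em.rx_process_limit="-1" in /boot/loader.conf. The sketch below adds or updates such a line without duplicating it. The helper name set_loader_tunable is hypothetical, and the thread does not confirm whether the same tunable name applies to devices handled by lem(4).

```shell
#!/bin/sh
# set_loader_tunable FILE NAME VALUE: make sure FILE contains exactly one
# NAME="VALUE" line, replacing any earlier assignment of NAME.
set_loader_tunable() {
    file=$1; name=$2; value=$3
    [ -f "$file" ] || : > "$file" || return 1
    tmp=$(mktemp) || return 1
    # Keep every line that does not start with NAME=, then append the new one.
    awk -v prefix="$name=" 'index($0, prefix) != 1' "$file" > "$tmp"
    printf '%s="%s"\n' "$name" "$value" >> "$tmp"
    cat "$tmp" > "$file"
    rm -f "$tmp"
}
```

Run as root, this would be set_loader_tunable /boot/loader.conf hw.em.rx_process_limit -1, followed by a reboot and sysctl hw.em.rx_process_limit to confirm the value in effect.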

Comment 22 Yaniv Kunda 2015-11-11 18:24:55 UTC

(In reply to Sean Bruno from comment #21)

I set the value to -1 (in loader.conf), rebooted, and verified the new value using sysctl.
Unlike Dmitry, I see the problem even right after a reboot, not only after some uptime.

I copied a bunch of files and checked their MD5s. One of them was wrong, so this is probably not a symptom of what you described.
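
When a copy fails the MD5 check and the original is still at hand, POSIX cmp -l can show how many bytes differ and where the damage starts, which may help characterise the corruption. This is a sketch with a hypothetical helper name, diff_summary; it only compares files of equal size.

```shell
#!/bin/sh
# diff_summary ORIGINAL COPY: report how many bytes differ between two files
# of equal size and the 1-based offset of the first differing byte.
diff_summary() {
    size_a=$(( $(wc -c < "$1") ))
    size_b=$(( $(wc -c < "$2") ))
    if [ "$size_a" -ne "$size_b" ]; then
        echo "sizes differ: $size_a vs $size_b bytes"
        return 1
    fi
    # cmp -l prints one line per differing byte: offset, then both byte values in octal.
    cmp -l "$1" "$2" | awk '
        NR == 1 { first = $1 }
        END {
            if (NR == 0) print "identical"
            else printf "%d bytes differ, first at offset %d\n", NR, first
        }'
}
```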

P.S. I think the importance of this issue should be raised from "Affects Some People". Since the corruption is silent, I assume a lot of people with Intel adapters have corrupted data and do not even know it.

Comment 23 Dmitry Afanasiev 2015-11-19 20:35:38 UTC

(In reply to Yaniv Kunda from comment #22)
Yaniv, is pf(4) used on your system with the em interfaces?

I tested another machine, a Sun X4200 with em(4) interfaces, and it has *NO* data corruption problem!

On this system I checked:
10.1-STABLE r280356 with uptime 231 days
10.2-STABLE r290385 with uptime 8 days

The Sun X4200 has the same motherboard, BIOS patch level and network interfaces as the X4100.
The ONLY difference is that PF is disabled on the X4200.

In a week I can enable PF on my X4200 and recheck.

Also, on the X4100 I have seen data corruption several times immediately after the system booted.

Comment 24 Yaniv Kunda 2015-11-21 18:34:15 UTC

(In reply to Dmitry Afanasiev from comment #23)
I don't even have PF loaded... :-(

Comment 25 Kevin Bowling freebsd_committer

2017-01-10 11:33:46 UTC

Can you retest with 12-CURRENT?

Comment 26 Dmitry Afanasiev 2017-01-10 16:27:02 UTC

(In reply to Kevin Bowling from comment #25)
I started rechecking this problem on stable/10 r306961 and currently see no problems.
I need more time for further tests, but maybe the problem was already solved in r294958 (or in later changes?).
I will report back later.

Comment 27 Kevin Bowling freebsd_committer

2017-01-10 19:56:56 UTC

It's possible things were fixed in past versions, but r311849 in -CURRENT has a large cleanup and optimization of the FreeBSD part of the driver.

Comment 28 Dmitry Afanasiev 2017-02-01 22:44:01 UTC

The bug still exists on stable/10 r309900: I got 3 ssh disconnections and 2 errors in plain TCP data transfers per ~3 TB of transmitted data.

To test on CURRENT I have to move some services off this server and find time to run the tests, but I will try to do it this month.

Comment 29 Eitan Adler freebsd_committer

2018-05-28 19:47:28 UTC

batch change:

For bugs that match the following
-  Status Is In progress 
AND
- Untouched since 2018-01-01.
AND
- Affects Base System OR Documentation

DO:

Reset to open status.


Note:
I did a quick pass but if you are getting this email it might be worthwhile to double check to see if this bug ought to be closed.

Comment 30 Kevin Bowling freebsd_committer

2021-04-15 23:43:49 UTC

I'll need more information. Per the 2017 comment, this needs to be retested on newer code to confirm it is still a problem.  Please reopen and assign to me if this happens on FreeBSD 12 or 13.