Bug 202680 - Silent data corruption on em(4) interfaces
Summary: Silent data corruption on em(4) interfaces
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 10.2-STABLE
Hardware: Any Any
Importance: --- Affects Some People
Assignee: freebsd-net (Nobody)
Keywords: IntelNetworking
Depends on:
Reported: 2015-08-26 16:35 UTC by Dmitry Afanasiev
Modified: 2018-05-28 19:47 UTC

Corrupted data in transferred file (3.12 KB, application/octet-stream)
2015-08-31 11:11 UTC, Dmitry Afanasiev
pciconf -lvbc output (9.83 KB, text/plain)
2015-09-01 18:27 UTC, Dmitry Afanasiev
boot -v output (36.19 KB, text/plain)
2015-09-01 18:28 UTC, Dmitry Afanasiev
output of pciconf -lv (5.31 KB, text/plain)
2015-11-10 18:30 UTC, Yaniv Kunda
output of dmesg -a (19.49 KB, text/plain)
2015-11-10 18:30 UTC, Yaniv Kunda

Description Dmitry Afanasiev 2015-08-26 16:35:06 UTC
I get silent data corruption when transferring data over em(4) interfaces on 10.2-STABLE r286912.
It does not occur immediately after a reboot: right after booting I can transfer files without any errors.
The problem occurs *only* after a few hours of uptime. On a system with uptime > 1 day I got:
1. A corrupted large file transferred via FTP (MD5 checksum mismatch).
For example:
Original MD5(eis-dvd.iso)= 3705908d84f3c0f9f922a00e6e9332d2
Downloaded MD5 (eis-dvd.iso) = 1f7de1fdc682ae9ca26ec0621b377e89

2. SSH disconnects when transferring large amounts of data, with the message:
Corrupted MAC on input.
Disconnecting: Packet corrupt
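
The MD5 comparison from item 1 can be scripted so that every transferred file is checked the same way. This is only a minimal sketch and is not taken from this report: the function names md5_of_file and transfer_is_intact are hypothetical, and nothing beyond Python's standard hashlib module is assumed.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at path, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so multi-gigabyte files do not need to fit in RAM.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def transfer_is_intact(expected_md5, path):
    """Compare a known-good digest with the digest of the received file."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For the file above, one would run transfer_is_intact("3705908d84f3c0f9f922a00e6e9332d2", "eis-dvd.iso") on the receiving host after each transfer.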

I am using a Sun Fire X4100 server with four built-in Intel 82546EB Ethernet ports:
em0@pci0:1:1:0: class=0x020000 card=0x10118086 chip=0x10108086 rev=0x03 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '82546EB Gigabit Ethernet Controller (Copper)'
    class      = network
    subclass   = ethernet
This server works under Solaris without any problems.

I tried to use:
1. lagg0.2 (lagg0 with vlan tag) over em0 and em2 interfaces:
cloned_interfaces="lagg0 lagg0.2"
create_args_lagg0="laggproto lacp laggport em0 laggport em2"

2. a tagged em1.2 interface, created manually with ifconfig em1.2 create 192.168.X.Y

I got the same problem in both cases.

*Once*, while transferring a file, I got these messages:
em0: Interface stopped DISTRIBUTING, possible flapping
em0: Watchdog timeout -- resetting
em2: Interface stopped DISTRIBUTING, possible flapping
em2: Watchdog timeout -- resetting
But this happened only once.

netstat -in does not show any problems on the interfaces:
Name    Mtu Network       Address              Ipkts Ierrs Idrop    Opkts Oerrs  Coll
em0    1500 <Link#1>      00:14:4f:01:3f:7a  6689452     0     0   146720     0     0
em1    1500 <Link#2>      00:14:4f:01:3f:7b  5732168     0     0  2865912     0     0
em2    1500 <Link#3>      00:14:4f:01:3f:7c   501817     0     0  3392333     0     0

TCP_OFFLOAD is disabled in the kernel config.

The system is not in production use, so I can try rolling back to pre-r269196 code or collect any additional diagnostic information required, but I can't provide remote access to it.
Comment 1 Dmitry Afanasiev 2015-08-31 11:11:04 UTC
Created attachment 160557 [details]
Corrupted data in transferred file

I created a 16 GiB file "hugefile" containing 16777216 lines, each consisting of 1023 spaces and one '\n' (16777216 * (1023 + 1) = 17179869184 bytes = 16 GiB), and transferred it via FTP.
On the destination machine I checked the resulting file with this command:
$ grep '[^ ]' hugefile > wrong

The result (the wrong data found in hugefile) is in this attachment.
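
The same kind of test can be scripted end to end. The following is a minimal sketch, assuming Python 3; the names LINE, write_pattern_file and find_corrupt_lines are hypothetical. Note that it swaps the grep check for a different technique: it compares fixed 1024-byte records against the expected line, which reports where each damaged record sits but would misalign if bytes were inserted or dropped rather than overwritten.

```python
LINE = b" " * 1023 + b"\n"  # one 1024-byte line, as in the test file described above

# Size check from the description: 16777216 lines of 1024 bytes is exactly 16 GiB.
assert 16777216 * len(LINE) == 17179869184 == 16 * 2**30


def write_pattern_file(path, line_count):
    """Write line_count lines, each 1023 spaces followed by a newline."""
    with open(path, "wb") as f:
        for _ in range(line_count):
            f.write(LINE)


def find_corrupt_lines(path):
    """Yield (line_index, byte_offset) for each 1024-byte record that differs from LINE.

    line_index is zero-based; byte_offset is where that record starts in the file.
    """
    with open(path, "rb") as f:
        line_index = 0
        while True:
            record = f.read(len(LINE))
            if not record:
                break
            if record != LINE:
                yield line_index, line_index * len(LINE)
            line_index += 1
```

On the sending side one would call write_pattern_file("hugefile", 16777216), and on the receiving side iterate over find_corrupt_lines("hugefile") to list the damaged records.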

As far as I know, errors are checked at:
- the Ethernet level
- the IP level
- the TCP level

Yet I still received corrupted data from the TCP level.
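
As general background on the checks listed above (not a diagnosis of this bug): the TCP checksum is a 16-bit ones' complement sum as described in RFC 1071, so it cannot detect every possible change to a segment. The sketch below, with the hypothetical function name internet_checksum and made-up sample bytes, shows one such case: exchanging two 16-bit words leaves the checksum unchanged.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # add each big-endian 16-bit word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back into 16 bits
    return ~total & 0xFFFF


# Illustrative sample data only; these bytes are not from the attachment.
original = b"\x12\x34\x56\x78\x9a\xbc"
swapped = b"\x56\x78\x12\x34\x9a\xbc"  # first two 16-bit words exchanged

# The sum does not depend on word order, so both payloads pass the same check.
same_checksum = internet_checksum(original) == internet_checksum(swapped)
```

Here same_checksum is True even though the payloads differ. Whether anything like this is involved in the corruption reported here is not established by the thread.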
Comment 2 Hiren Panchasara freebsd_committer 2015-08-31 16:26:19 UTC
I asked this on the mailing list and am repeating it here. Do you have TSO turned on? If so, can you try disabling it to rule out that possibility?
'ifconfig em0 -tso' should disable tso on em0, for example.
Comment 3 Dmitry Afanasiev 2015-09-01 12:30:44 UTC
(In reply to Hiren Panchasara from comment #2)

As I wrote above, TSO is disabled on this system.
I had some problems with TSO before on other systems, so I commented out TCP_OFFLOAD in my kernel config.

# ifconfig | grep ^em
em1: flags=8802<BROADCAST,SIMPLEX,MULTICAST> metric 0 mtu 1500
em3: flags=8802<BROADCAST,SIMPLEX,MULTICAST> metric 0 mtu 1500

# ifconfig | grep -i tso | wc -l

Also, "ifconfig em0 -tso" does nothing, because TSO is already disabled.
Comment 4 Sean Bruno freebsd_committer 2015-09-01 16:21:41 UTC
The 82546EB is an if_lem.c consumer and I haven't touched that driver yet.

I suspect that this might be a similar issue to PR 196980.

Can you post the output of "pciconf -lvbc" for the host as well as a full verbose-boot dmesg?
Comment 5 Dmitry Afanasiev 2015-09-01 18:27:40 UTC
Created attachment 160604 [details]
pciconf -lvbc output
Comment 6 Dmitry Afanasiev 2015-09-01 18:28:11 UTC
Created attachment 160605 [details]
boot -v output
Comment 7 Dmitry Afanasiev 2015-09-01 18:42:49 UTC
(In reply to Sean Bruno from comment #4)
"pciconf -lvbc" output and dmesg.boot from "boot -v" attached.

I'm not sure that the problem is similar to PR 196980, because I have no "Allocation of PCI resources failed" problems and all four interfaces in my Sun Fire X4100 are working.

Also, I think the data corruption problem exists in 9-STABLE too.
I tried to scp from another host to 9.3-STABLE r274181 on another X4100 server and got:
Corrupted MAC on input.
Disconnecting: Packet corrupt
lost connection
Comment 9 Sean Bruno freebsd_committer 2015-09-02 16:36:32 UTC
(In reply to Sean Bruno from comment #8)
Wait, never mind. This is a lem(4) device, not an em(4) device, so those patches are not applicable. I'll think more.
Comment 10 Dmitry Afanasiev 2015-09-14 20:46:59 UTC
I also installed an old Sun gigabit Ethernet adapter, which uses the cas(4) driver, into the buggy FreeBSD 10.2 system and reconfigured lagg to use only this adapter.
With system uptime > 10 days, I transferred huge files several times via ssh and ftp without any problems.
I think the problem is related only to the em(4) network driver, not to other parts of the system.
Comment 11 Sean Bruno freebsd_committer 2015-09-15 18:59:38 UTC
(In reply to Dmitry Afanasiev from comment #10)
Can you see if this change in head helps with your issue?  https://svnweb.freebsd.org/base?view=revision&revision=287330
Comment 12 Dmitry Afanasiev 2015-09-15 19:44:15 UTC
(In reply to Sean Bruno from comment #11)
sys/dev/e1000 in head has many differences from stable/10.
For example, the lem driver version differs:
< char lem_driver_version[] = "1.0.6";
> char lem_driver_version[] = "1.1.0";

Should I use the entire sys/dev/e1000 directory from head in my stable/10 system?
Comment 13 Dmitry Afanasiev 2015-09-15 20:16:20 UTC
(In reply to Dmitry Afanasiev from comment #12)
The sys/dev/e1000 tree from head is not compatible with stable/10 :(
Comment 14 Dmitry Afanasiev 2015-10-19 08:15:39 UTC
I tried FreeBSD-CURRENT from a nightly snapshot:
FreeBSD sunrise0 11.0-CURRENT FreeBSD 11.0-CURRENT #0 r289044: Thu Oct  8 21:21:40 UTC 2015     root@releng2.nyi.freebsd.org:/usr/obj/usr/src/sys/GENERIC  amd64

But nothing changed: after 2 days of uptime I again got an incorrect MD5 checksum and these messages from ssh:
Corrupted MAC on input.
Disconnecting: Packet corrupt
Comment 15 Yaniv Kunda 2015-11-09 21:23:26 UTC
Happens to me as well, on NAS4Free (based on FreeBSD 10.2-RELEASE-p5) - but also previously on several 9.3.x versions (based on FreeBSD 9.3-RELEASE-y)

Here is my device info from pciconf:
em0@pci0:3:0:0: class=0x020000 card=0x10828086 chip=0x107d8086 rev=0x06 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '82572EI Gigabit Ethernet Controller (Copper)'
    class      = network
    subclass   = ethernet
    bar   [10] = type Memory, range 32, base 0xfbce0000, size 131072, enabled
    bar   [14] = type Memory, range 32, base 0xfbcc0000, size 131072, enabled
    bar   [18] = type I/O Port, range 32, base 0xdf00, size 32, enabled
    cap 01[c8] = powerspec 2  supports D0 D3  current D0
    cap 05[d0] = MSI supports 1 message, 64 bit enabled with 1 message
    cap 10[e0] = PCI-Express 1 endpoint max data 128(256) link x1(x1)
                 speed 2.5(2.5) ASPM disabled(L0s)
    ecap 0001[100] = AER 1 1 fatal 1 non-fatal 5 corrected
    ecap 0003[140] = Serial 1 001b21ffffb43a7f
Comment 16 Sean Bruno freebsd_committer 2015-11-10 16:29:05 UTC
(In reply to Yaniv Kunda from comment #15)
What type of hardware are you using?  Can you dump a full pciconf -lv and dmesg into this ticket?
Comment 17 Yaniv Kunda 2015-11-10 18:30:02 UTC
Created attachment 162976 [details]
output of pciconf -lv
Comment 18 Yaniv Kunda 2015-11-10 18:30:34 UTC
Created attachment 162977 [details]
output of dmesg -a
Comment 19 Yaniv Kunda 2015-11-10 18:33:57 UTC
(In reply to Sean Bruno from comment #16)
I'm using commodity desktop PC hardware, installed with NAS4Free on a USB stick.
I have attached the outputs of pciconf -lv and dmesg -a.

I'd like to help find the root cause of this, so please tell me what else I can provide or test!
Comment 20 Yaniv Kunda 2015-11-10 22:11:33 UTC
(In reply to Yaniv Kunda from comment #19)

I also tried after running ifconfig em0 -tso (and verified that it removed TSO4 from em0's options), but with the same result: a big file copy was silently corrupted.
Comment 21 Sean Bruno freebsd_committer 2015-11-11 15:22:25 UTC
Can you folks try setting the processing limits to -1?  I'm curious if what you're seeing is a symptom of conflicting processing during interrupt context and deferred handling to the taskqueue threads.

hw.em.rx_process_limit: 100

hw.em.rx_process_limit = -1
Comment 22 Yaniv Kunda 2015-11-11 18:24:55 UTC
(In reply to Sean Bruno from comment #21)

I've set the value to -1 (in loader.conf), rebooted, and verified the new value using sysctl.
In contrast to Dmitry, the problem happens for me even right after reboot, and not only after some uptime.

I copied a bunch of files and checked their MD5s; one of them was wrong, so it is probably not a symptom of what you described.

P.S. I think the importance of this issue should be raised from "Affects Some People": since part of the problem is that the corruption is silent, I assume a lot of people with Intel adapters have corrupted data and do not even know it.
Comment 23 Dmitry Afanasiev 2015-11-19 20:35:38 UTC
(In reply to Yaniv Kunda from comment #22)
Yaniv, is PF(4) used on your system with em interfaces?

I tested another machine, a Sun X4200 with em(4) interfaces, and had *NO* problems with data corruption!

On this system I checked:
10.1-STABLE r280356 with uptime 231 days
10.2-STABLE r290385 with uptime 8 days

The Sun X4200 has the same motherboard, BIOS patch level and network interfaces as the X4100.
There is only ONE difference: on the X4200, PF was disabled.

In a week I can enable PF on my X4200 and recheck.

Also, on the X4100 I have gotten data corruption several times immediately after boot.
Comment 24 Yaniv Kunda 2015-11-21 18:34:15 UTC
(In reply to Dmitry Afanasiev from comment #23)
I don't even have PF loaded... :-(
Comment 25 Kevin Bowling freebsd_committer 2017-01-10 11:33:46 UTC
Can you retest with 12-CURRENT?
Comment 26 Dmitry Afanasiev 2017-01-10 16:27:02 UTC
(In reply to Kevin Bowling from comment #25)
I have started rechecking this problem on stable/10 r306961 and currently see no problems.
I need more time for more tests, but maybe the problem was already solved in r294958 (or in later changes?).
I will report back later.
Comment 27 Kevin Bowling freebsd_committer 2017-01-10 19:56:56 UTC
It's possible things were fixed in past versions, but r311849 in -CURRENT has a large cleanup and optimization of the FreeBSD part of the driver.
Comment 28 Dmitry Afanasiev 2017-02-01 22:44:01 UTC
The bug still exists on stable/10 r309900: I got ssh disconnections 3 times, and 2 errors in plain TCP data transfers, per ~3 TB of transmitted data.

To test on CURRENT I need to move some services off this server and find time to run the tests, but I will try to do so this month.
Comment 29 Eitan Adler freebsd_committer freebsd_triage 2018-05-28 19:47:28 UTC
batch change:

For bugs that match the following
-  Status Is In progress 
- Untouched since 2018-01-01.
- Affects Base System OR Documentation


Reset to open status.

I did a quick pass but if you are getting this email it might be worthwhile to double check to see if this bug ought to be closed.