Summary: Silent data corruption on em(4) interfaces
Product: Base System
Component: kern
Severity: Affects Some People
Reporter: Dmitry Afanasiev <KOT>
Assignee: freebsd-net (Nobody) <net>
CC: kbowling, ports, sbruno, yaniv
Description Dmitry Afanasiev 2015-08-26 16:35:06 UTC
I am seeing silent data corruption when transferring data over em(4) interfaces on 10.2-STABLE r286912. It does not occur immediately after a reboot: right after boot I can transfer files without any errors. The problem occurs *only* after a few hours of uptime. On a system with uptime > 1 day I observed:

1. A large file transferred via FTP arrived broken (MD5 checksum mismatch). For example:
   Original MD5(eis-dvd.iso)= 3705908d84f3c0f9f922a00e6e9332d2
   Downloaded MD5 (eis-dvd.iso) = 1f7de1fdc682ae9ca26ec0621b377e89
2. SSH disconnects while transferring large amounts of data, with the message:
   Corrupted MAC on input.
   Disconnecting: Packet corrupt

I am using a Sun Fire X4100 server with 4 built-in Intel 82546EB Ethernet ports:

em0@pci0:1:1:0: class=0x020000 card=0x10118086 chip=0x10108086 rev=0x03 hdr=0x00
    vendor   = 'Intel Corporation'
    device   = '82546EB Gigabit Ethernet Controller (Copper)'
    class    = network
    subclass = ethernet

This server works under Solaris without any problems.

I tried two configurations:

1. lagg0.2 (lagg0 with a VLAN tag) over the em0 and em2 interfaces:
   cloned_interfaces="lagg0 lagg0.2"
   ifconfig_em0="up"
   ifconfig_em2="up"
   create_args_lagg0="laggproto lacp laggport em0 laggport em2"
2. A tagged em1.2 interface created manually:
   ifconfig em1.2 create 192.168.X.Y

I got the same problem in both cases.

*Once*, while transferring a file, I got these messages:

em0: Interface stopped DISTRIBUTING, possible flapping
em0: Watchdog timeout -- resetting
em2: Interface stopped DISTRIBUTING, possible flapping
em2: Watchdog timeout -- resetting

But this happened only once.

netstat -in does not show any problems on the interfaces:

Name  Mtu  Network   Address            Ipkts    Ierrs  Idrop  Opkts    Oerrs  Coll
em0   1500 <Link#1>  00:14:4f:01:3f:7a  6689452  0      0      146720   0      0
em1   1500 <Link#2>  00:14:4f:01:3f:7b  5732168  0      0      2865912  0      0
em2   1500 <Link#3>  00:14:4f:01:3f:7c  501817   0      0      3392333  0      0

TCP_OFFLOAD was disabled in the kernel config.
The system is not in production use, so I can try to roll back to pre-r269196 code or collect any additional diagnostic information required, but I can't provide remote access to this system.
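The MD5 comparison used above to detect the broken transfer can be reproduced with a short script. This is a minimal sketch, not the reporter's actual procedure: the function names `md5_of_file` and `same_content` are made up for illustration, and only the Python standard library is assumed.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks
    so that multi-gigabyte files do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True if both files have the same MD5 digest."""
    return md5_of_file(path_a) == md5_of_file(path_b)
```

Run on the source and destination hosts, the two digests should be identical; a mismatch like the one shown in the description means at least one byte changed in transit or on disk.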
Comment 1 Dmitry Afanasiev 2015-08-31 11:11:04 UTC
Created attachment 160557 [details] Corrupted data in transferred file

I created a 16 GB file "hugefile" containing 16777216 lines, each with 1023 spaces and one '\n' (16777216 * (1023 + 1) = 17179869184 bytes = 16 GB), and transferred this file via FTP. On the destination machine I checked the resulting file with this command:

$ grep '[^ ]' hugefile > wrong

The result (the wrong data found in hugefile) is this attachment. As far as I know, errors are checked at:
- the Ethernet level
- the IP level
- the TCP level

yet I still received corrupted data from the TCP level.
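The test file described above, and a check equivalent in spirit to the grep command, can be sketched as follows. This is an illustration under assumptions, not the reporter's tooling: the helper names are hypothetical, and the checker compares fixed 1024-byte records instead of text lines so that a corrupted newline is also caught.

```python
# One record: 1023 spaces plus a newline, 1024 bytes in total.
LINE = b" " * 1023 + b"\n"


def write_pattern_file(path, line_count):
    """Write line_count copies of LINE. With line_count = 16777216 the file
    is 16777216 * 1024 = 17179869184 bytes, as in the report."""
    with open(path, "wb") as f:
        for _ in range(line_count):
            f.write(LINE)


def find_corrupted_records(path):
    """Return a list of (record_index, byte_offset) for every 1024-byte
    record that differs from LINE. A short trailing record is also reported."""
    bad = []
    with open(path, "rb") as f:
        index = 0
        while True:
            block = f.read(len(LINE))
            if not block:
                break
            if block != LINE:
                bad.append((index, index * len(LINE)))
            index += 1
    return bad
```

Generate the file on the sender with `write_pattern_file("hugefile", 16777216)`, transfer it, and run `find_corrupted_records("hugefile")` on the receiver; an empty list means the copy is intact. The byte offsets of damaged records may help show whether corruption clusters at particular boundaries.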
Comment 2 Hiren Panchasara 2015-08-31 16:26:19 UTC
I asked this on the mailing list and am repeating it here. Do you have TSO turned on? If so, can you try disabling it to rule out that possibility? 'ifconfig em0 -tso' should disable TSO on em0, for example.
Comment 3 Dmitry Afanasiev 2015-09-01 12:30:44 UTC
(In reply to Hiren Panchasara from comment #2)
As I wrote above, TSO is disabled on this system. I have had problems with TSO before on other systems, so I commented out TCP_OFFLOAD in my kernel config.

# ifconfig | grep ^em
em0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
em1: flags=8802<BROADCAST,SIMPLEX,MULTICAST> metric 0 mtu 1500
em2: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
em3: flags=8802<BROADCAST,SIMPLEX,MULTICAST> metric 0 mtu 1500
# ifconfig | grep -i tso | wc -l
0

Also, "ifconfig em0 -tso" does nothing, because TSO is already disabled.
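The "grep -i tso" check above can also be done per interface. The sketch below parses ifconfig output that has already been captured as text; it assumes the usual FreeBSD layout in which each interface header is followed by an indented "options=...<...>" line. The function names and the sample text in the usage note are hypothetical.

```python
import re

# Interface header, e.g. "em0: flags=8943<UP,...> metric 0 mtu 1500"
HEADER = re.compile(r"^(\S+): flags=")
# Capability line, e.g. "\toptions=9b<RXCSUM,TXCSUM>"; lines such as
# "\tnd6 options=29<...>" do not match because of the "nd6 " prefix.
OPTIONS = re.compile(r"^\s+options=[0-9a-fA-F]+<([^>]*)>")


def interface_options(ifconfig_text):
    """Map each interface name to the set of option flags listed for it."""
    result = {}
    current = None
    for line in ifconfig_text.splitlines():
        m = HEADER.match(line)
        if m:
            current = m.group(1)
            result[current] = set()
            continue
        m = OPTIONS.match(line)
        if m and current is not None:
            result[current].update(o for o in m.group(1).split(",") if o)
    return result


def tso_enabled(ifconfig_text):
    """Map each interface name to True if any TSO flag (TSO4, TSO6) is set."""
    return {
        name: any(opt.startswith("TSO") for opt in opts)
        for name, opts in interface_options(ifconfig_text).items()
    }
```

For example, feeding it text in which em0 lists "TSO4" among its options and em1 does not would give `{"em0": True, "em1": False}`. On the system in this report every value would be expected to be False.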
Comment 4 Sean Bruno 2015-09-01 16:21:41 UTC
The 82546EB is an if_lem.c consumer and I haven't touched that driver yet. I suspect that this might be a similar issue to PR 196980. Can you post the output of "pciconf -lvbc" for the host as well as a full, boot verbose dmesg?
Comment 5 Dmitry Afanasiev 2015-09-01 18:27:40 UTC
Created attachment 160604 [details] pciconf -lvbc output
Comment 6 Dmitry Afanasiev 2015-09-01 18:28:11 UTC
Created attachment 160605 [details] boot -v output
Comment 7 Dmitry Afanasiev 2015-09-01 18:42:49 UTC
(In reply to Sean Bruno from comment #4)
The "pciconf -lvbc" output and dmesg.boot from "boot -v" are attached. I'm not sure that the problem is similar to PR196980, because I have no "Allocation of PCI resources failed" problems and all four interfaces in my Sun Fire X4100 are working.

Also... I think the data corruption problem exists in 9-STABLE too. I tried to scp from another host to 9.3-STABLE r274181 on another X4100 server and got:

Corrupted MAC on input.
Disconnecting: Packet corrupt
lost connection
Comment 8 Sean Bruno 2015-09-02 16:35:45 UTC
Can you try to apply three changesets from head and see if the condition still exists?

https://svnweb.freebsd.org/base?view=revision&revision=286831
https://svnweb.freebsd.org/base?view=revision&revision=285879
https://svnweb.freebsd.org/base?view=revision&revision=287330
Comment 9 Sean Bruno 2015-09-02 16:36:32 UTC
(In reply to Sean Bruno from comment #8)
wait ... never mind. This is a lem(4) device, not an em(4) device, so those patches are not applicable. I'll think more.
Comment 10 Dmitry Afanasiev 2015-09-14 20:46:59 UTC
I also installed an old Sun Gigabit Ethernet adapter, which uses the cas(4) driver, into the affected FreeBSD 10.2 system and reconfigured lagg to use only this adapter. With system uptime > 10 days I transferred huge files several times via ssh and ftp without any problems. I think the problem is related only to the em(4) network driver, not to other parts of the system.
Comment 11 Sean Bruno 2015-09-15 18:59:38 UTC
(In reply to Dmitry Afanasiev from comment #10)
Can you see if this change in head helps with your issue?

https://svnweb.freebsd.org/base?view=revision&revision=287330
Comment 12 Dmitry Afanasiev 2015-09-15 19:44:15 UTC
(In reply to Sean Bruno from comment #11)
sys/dev/e1000 in head has many differences from stable/10. For example, the lem driver version differs:

< char lem_driver_version = "1.0.6";
---
> char lem_driver_version = "1.1.0";

Should I use the entire sys/dev/e1000 directory from head in my stable/10 system?
Comment 13 Dmitry Afanasiev 2015-09-15 20:16:20 UTC
(In reply to Dmitry Afanasiev from comment #12)
The sys/dev/e1000 tree from head is not compatible with stable/10 :(
Comment 14 Dmitry Afanasiev 2015-10-19 08:15:39 UTC
I tried FreeBSD-CURRENT from a nightly snapshot:

FreeBSD sunrise0 11.0-CURRENT FreeBSD 11.0-CURRENT #0 r289044: Thu Oct 8 21:21:40 UTC 2015 firstname.lastname@example.org:/usr/obj/usr/src/sys/GENERIC amd64

But nothing changed: after 2 days of uptime I again got an incorrect MD5 checksum and these messages from ssh:

Corrupted MAC on input.
Disconnecting: Packet corrupt
Comment 15 Yaniv Kunda 2015-11-09 21:23:26 UTC
This happens to me as well, on NAS4Free 10.2.0.2.1906 (based on FreeBSD 10.2-RELEASE-p5), and previously also on several 9.3.x versions (based on FreeBSD 9.3-RELEASE-y). Here is my device info from pciconf:

em0@pci0:3:0:0: class=0x020000 card=0x10828086 chip=0x107d8086 rev=0x06 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '82572EI Gigabit Ethernet Controller (Copper)'
    class      = network
    subclass   = ethernet
    bar  = type Memory, range 32, base 0xfbce0000, size 131072, enabled
    bar  = type Memory, range 32, base 0xfbcc0000, size 131072, enabled
    bar  = type I/O Port, range 32, base 0xdf00, size 32, enabled
    cap 01[c8] = powerspec 2 supports D0 D3 current D0
    cap 05[d0] = MSI supports 1 message, 64 bit enabled with 1 message
    cap 10[e0] = PCI-Express 1 endpoint max data 128(256) link x1(x1) speed 2.5(2.5) ASPM disabled(L0s)
    ecap 0001 = AER 1 1 fatal 1 non-fatal 5 corrected
    ecap 0003 = Serial 1 001b21ffffb43a7f
Comment 16 Sean Bruno 2015-11-10 16:29:05 UTC
(In reply to Yaniv Kunda from comment #15) What type of hardware are you using? Can you dump a full pciconf -lv and dmesg into this ticket?
Comment 17 Yaniv Kunda 2015-11-10 18:30:02 UTC
Created attachment 162976 [details] output of pciconf -lv
Comment 18 Yaniv Kunda 2015-11-10 18:30:34 UTC
Created attachment 162977 [details] output of dmesg -a
Comment 19 Yaniv Kunda 2015-11-10 18:33:57 UTC
(In reply to Sean Bruno from comment #16)
I'm using commodity desktop PC hardware, with NAS4Free installed on a USB stick. I have attached the outputs of pciconf -lv and dmesg -a. I'd like to help find the root cause of this, so please tell me how I can assist in any way needed!
Comment 20 Yaniv Kunda 2015-11-10 22:11:33 UTC
(In reply to Yaniv Kunda from comment #19) Also tried after running ifconfig em0 -tso (verified that it removed TSO4 from em0's options) but with the same result - a big file copy resulted in a quiet error.
Comment 21 Sean Bruno 2015-11-11 15:22:25 UTC
Can you folks try setting the processing limits to -1? I'm curious whether what you're seeing is a symptom of conflicting processing between interrupt context and deferred handling in the taskqueue threads.

Change:
hw.em.rx_process_limit: 100
to:
hw.em.rx_process_limit = -1
Comment 22 Yaniv Kunda 2015-11-11 18:24:55 UTC
(In reply to Sean Bruno from comment #21)
I've set the value to -1 (in loader.conf), rebooted, and verified the new value using sysctl. In contrast to Dmitry, the problem happens for me even right after reboot, and not only after some uptime. I copied a bunch of files and checked their MD5s: one of them was wrong, so this is probably not a symptom of what you described.

p.s. I think the importance of this issue should be raised from "Affects Some People". Since part of this problem is quiet corruption, I assume a lot of people with Intel adapters have corrupted data and do not even know.
Comment 23 Dmitry Afanasiev 2015-11-19 20:35:38 UTC
(In reply to Yaniv Kunda from comment #22)
Yaniv, is PF(4) used on your system with the em interfaces?

I tested another machine, a Sun X4200 with em(4) interfaces, and I have *NO* problem with data corruption there! On this system I checked:

10.1-STABLE r280356 with uptime 231 days
10.2-STABLE r290385 with uptime 8 days

The Sun X4200 has the same motherboard, BIOS patch level and network interfaces as the X4100. There is only ONE difference: on the X4200, PF functionality was disabled. In a week I can enable PF on my X4200 and recheck.

Also, on the X4100 I got data corruption several times immediately after system boot.
Comment 24 Yaniv Kunda 2015-11-21 18:34:15 UTC
(In reply to Dmitry Afanasiev from comment #23) I don't even have PF loaded... :-(
Comment 25 Kevin Bowling 2017-01-10 11:33:46 UTC
Can you retest with 12-CURRENT?
Comment 26 Dmitry Afanasiev 2017-01-10 16:27:02 UTC
(In reply to Kevin Bowling from comment #25)
I started to recheck this problem on stable/10 r306961, and currently have no problems. I need more time for more tests, but maybe the problem was already solved in r294958 (or in later changes?). I will report later.
Comment 27 Kevin Bowling 2017-01-10 19:56:56 UTC
It's possible things were fixed in past versions, but r311849 in -CURRENT has a large cleanup and optimization of the FreeBSD part of the driver.
Comment 28 Dmitry Afanasiev 2017-02-01 22:44:01 UTC
The bug still exists on stable/10 r309900: per ~3 TB of transmitted data I got 3 ssh disconnections and 2 errors in a dumb TCP data transfer. To check on -CURRENT I need to move some services off this server and find time to run the tests, but I will try to test this month.
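A "dumb TCP data transfer" test of the kind mentioned above can be sketched as a sender that streams a known byte pattern and a receiver that compares every chunk against it. This is a hypothetical illustration, not the tool used in this report: the names `CHUNK`, `receive_and_verify` and `run_transfer` are made up, and the demo runs over loopback, which does not exercise the NIC. For a real test the two halves would run on different hosts.

```python
import socket
import threading

# 64 KiB deterministic pattern known to both sender and receiver.
CHUNK = bytes(range(256)) * 256


def receive_and_verify(conn, total_chunks, result):
    """Read total_chunks chunks from conn and count those that differ from CHUNK."""
    bad = 0
    for _ in range(total_chunks):
        buf = bytearray()
        # recv() may return partial data, so loop until a full chunk arrives.
        while len(buf) < len(CHUNK):
            data = conn.recv(len(CHUNK) - len(buf))
            if not data:
                raise ConnectionError("peer closed the connection early")
            buf += data
        if bytes(buf) != CHUNK:
            bad += 1
    result["bad_chunks"] = bad


def run_transfer(host="127.0.0.1", total_chunks=16):
    """Send total_chunks chunks to a local receiver thread; return the number of bad chunks."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind((host, 0))  # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]
    result = {}

    def serve():
        conn, _ = server.accept()
        with conn:
            receive_and_verify(conn, total_chunks, result)

    receiver = threading.Thread(target=serve)
    receiver.start()
    with socket.create_connection((host, port)) as client:
        for _ in range(total_chunks):
            client.sendall(CHUNK)
    receiver.join()
    server.close()
    return result["bad_chunks"]
```

Unlike ssh, which aborts on the first corrupted packet because its MAC check fails, a plain TCP stream like this keeps going, so the count of bad chunks gives a rough error rate per volume transferred.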
Comment 29 Eitan Adler 2018-05-28 19:47:28 UTC
batch change:

For bugs that match the following
- Status Is In progress
AND
- Untouched since 2018-01-01.
AND
- Affects Base System OR Documentation

DO:
Reset to open status.

Note: I did a quick pass but if you are getting this email it might be worthwhile to double check to see if this bug ought to be closed.