I am seeing silent data corruption when transferring data over em(4) interfaces on 10.2-STABLE r286912. It does not occur immediately after a reboot: right after boot I can transfer files without any errors. The problem occurs *only* after a few hours of uptime. On a system with uptime > 1 day I got:

1. A broken large file transferred via FTP (MD5 checksum mismatch). For example:

    Original   MD5 (eis-dvd.iso) = 3705908d84f3c0f9f922a00e6e9332d2
    Downloaded MD5 (eis-dvd.iso) = 1f7de1fdc682ae9ca26ec0621b377e89

2. SSH disconnects when transferring large amounts of data, with the message:

    Corrupted MAC on input.
    Disconnecting: Packet corrupt

I am using a Sun Fire X4100 server with 4 built-in Intel 82546EB Ethernet ports:

    em0@pci0:1:1:0: class=0x020000 card=0x10118086 chip=0x10108086 rev=0x03 hdr=0x00
        vendor     = 'Intel Corporation'
        device     = '82546EB Gigabit Ethernet Controller (Copper)'
        class      = network
        subclass   = ethernet

This server works under Solaris without any problems.

I tried two configurations:

1. lagg0.2 (lagg0 with a VLAN tag) over the em0 and em2 interfaces:

    cloned_interfaces="lagg0 lagg0.2"
    ifconfig_em0="up"
    ifconfig_em2="up"
    create_args_lagg0="laggproto lacp laggport em0 laggport em2"

2. A tagged em1.2 interface created manually:

    ifconfig em1.2 create 192.168.X.Y

I got the same problem in both cases.

*Once*, while transferring a file, I got these messages:

    em0: Interface stopped DISTRIBUTING, possible flapping
    em0: Watchdog timeout -- resetting
    em2: Interface stopped DISTRIBUTING, possible flapping
    em2: Watchdog timeout -- resetting

But this happened only once.

netstat -in does not show any problems on the interfaces:

    Name  Mtu  Network   Address            Ipkts    Ierrs Idrop Opkts   Oerrs Coll
    em0   1500 <Link#1>  00:14:4f:01:3f:7a  6689452  0     0     146720  0     0
    em1   1500 <Link#2>  00:14:4f:01:3f:7b  5732168  0     0     2865912 0     0
    em2   1500 <Link#3>  00:14:4f:01:3f:7c  501817   0     0     3392333 0     0

TCP_OFFLOAD was disabled in the kernel config.
The system is not in production use, so I can try rolling back to pre-r269196 code or collect any additional diagnostic information required, but I can't provide remote access to this system.
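The MD5 comparison described in this report can be reproduced with a short script. The following is only a minimal sketch, not the tooling the reporter used: the helper names `md5_of_file` and `first_mismatch_offset` are hypothetical, and the second helper assumes that both the original and the transferred copy can be read from the same machine (for example after copying one of them over a link that is known to be good).

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def first_mismatch_offset(path_a, path_b, chunk_size=1 << 20):
    """Return the byte offset of the first difference, or None if identical.

    Hypothetical helper: it only says where the two copies start to diverge.
    """
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if a != b:
                for i, (x, y) in enumerate(zip(a, b)):
                    if x != y:
                        return offset + i
                # No differing byte in the common part: one file is shorter.
                return offset + min(len(a), len(b))
            if not a:
                return None  # both files ended without a difference
            offset += len(a)
```

Comparing `md5_of_file()` for the source and the downloaded copy answers "is it corrupted?", while `first_mismatch_offset()` gives a starting point for inspecting what the damage looks like.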
Created attachment 160557: Corrupted data in transferred file

I created a 16 GiB file "hugefile" containing 16777216 lines, each with 1023 spaces and one '\n' (16777216 * (1023 + 1) = 17179869184 bytes = 16 GiB), and transferred this file via FTP. On the destination machine I checked the resulting file with this command:

    $ grep '[^ ]' hugefile > wrong

The result (the wrong data in hugefile) is this attachment.

As far as I know, errors are checked at:
- the Ethernet level
- the IP level
- the TCP level

yet I still got corrupted data delivered by TCP.
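The "hugefile" test above can be sketched in Python as follows. This is an illustrative variant, not the reporter's procedure: the function names are hypothetical, and instead of `grep` it compares fixed-size 1024-byte records, which is an assumption made so that a damaged or missing '\n' is also reported together with its byte offset.

```python
LINE_WIDTH = 1024  # 1023 spaces plus '\n', as in the report


def write_pattern_file(path, num_lines, line_width=LINE_WIDTH):
    """Write num_lines records of (line_width - 1) spaces followed by '\n'."""
    line = b" " * (line_width - 1) + b"\n"
    with open(path, "wb") as f:
        for _ in range(num_lines):
            f.write(line)


def find_corrupt_records(path, line_width=LINE_WIDTH):
    """Yield (byte_offset, record) for every record that is not the pattern."""
    expected = b" " * (line_width - 1) + b"\n"
    offset = 0
    with open(path, "rb") as f:
        while True:
            record = f.read(line_width)
            if not record:
                break
            if record != expected:
                yield offset, record
            offset += len(record)
```

With `num_lines=16777216` the generated file has the same size as in the report (17179869184 bytes). Running `find_corrupt_records()` on the received copy lists where the damage is, which makes it easier to see whether the bad regions share a size or alignment.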
I asked this on the mailing list and am repeating it here. Do you have TSO turned on? If so, can you try disabling it to rule out that possibility? 'ifconfig em0 -tso' should disable TSO on em0, for example.
(In reply to Hiren Panchasara from comment #2)
As I wrote above, TSO is disabled on this system. I had problems with TSO before on other systems, so I commented out TCP_OFFLOAD in my kernel config.

    # ifconfig | grep ^em
    em0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
    em1: flags=8802<BROADCAST,SIMPLEX,MULTICAST> metric 0 mtu 1500
    em2: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
    em3: flags=8802<BROADCAST,SIMPLEX,MULTICAST> metric 0 mtu 1500
    # ifconfig | grep -i tso | wc -l
    0

Also, "ifconfig em0 -tso" does nothing, because TSO is already disabled.
The 82546EB is an if_lem.c consumer and I haven't touched that driver yet. I suspect that this might be a similar issue to PR 196980. Can you post the output of "pciconf -lvbc" for the host as well as a full, boot verbose dmesg?
Created attachment 160604: pciconf -lvbc output
Created attachment 160605: boot -v output
(In reply to Sean Bruno from comment #4)
The "pciconf -lvbc" output and dmesg.boot from "boot -v" are attached. I'm not sure that the problem is similar to PR196980, because I have no "Allocation of PCI resources failed" errors and all four interfaces in my Sun Fire X4100 are working.

Also, I think the data corruption problem exists in 9-STABLE too. I tried to scp from another host to 9.3-STABLE r274181 on another X4100 server and got:

    Corrupted MAC on input.
    Disconnecting: Packet corrupt
    lost connection
Can you try applying these three changesets from head and see if the condition still exists?

https://svnweb.freebsd.org/base?view=revision&revision=286831
https://svnweb.freebsd.org/base?view=revision&revision=285879
https://svnweb.freebsd.org/base?view=revision&revision=287330
(In reply to Sean Bruno from comment #8)
Wait ... never mind. This is a lem(4) device, not an em(4) device, so those patches are not applicable. I'll think about it more.
I also installed an old Sun gigabit Ethernet adapter, driven by cas(4), in the affected FreeBSD 10.2 system and reconfigured lagg to use only this adapter. With system uptime >10 days I transferred huge files several times via ssh and ftp without any problems. I think the problem is related only to the em(4) network driver, not to other parts of the system.
(In reply to Dmitry Afanasiev from comment #10)
Can you see if this change in head helps with your issue?

https://svnweb.freebsd.org/base?view=revision&revision=287330
(In reply to Sean Bruno from comment #11)
sys/dev/e1000 in head has many differences from stable/10. For example, the lem driver version differs:

    < char lem_driver_version[] = "1.0.6";
    ---
    > char lem_driver_version[] = "1.1.0";

Should I use the entire sys/dev/e1000 directory from head in my stable/10 system?
(In reply to Dmitry Afanasiev from comment #12)
The sys/dev/e1000 tree from head is not compatible with stable/10 :(
I tried FreeBSD-CURRENT from a nightly snapshot:

    FreeBSD sunrise0 11.0-CURRENT FreeBSD 11.0-CURRENT #0 r289044: Thu Oct 8 21:21:40 UTC 2015 root@releng2.nyi.freebsd.org:/usr/obj/usr/src/sys/GENERIC amd64

But nothing changed. After 2 days of uptime I again got an incorrect MD5 checksum and these messages from ssh:

    Corrupted MAC on input.
    Disconnecting: Packet corrupt
This happens to me as well, on NAS4Free 10.2.0.2.1906 (based on FreeBSD 10.2-RELEASE-p5), and previously on several 9.3.x versions (based on FreeBSD 9.3-RELEASE-y).

Here is my device info from pciconf:

    em0@pci0:3:0:0: class=0x020000 card=0x10828086 chip=0x107d8086 rev=0x06 hdr=0x00
        vendor     = 'Intel Corporation'
        device     = '82572EI Gigabit Ethernet Controller (Copper)'
        class      = network
        subclass   = ethernet
        bar   [10] = type Memory, range 32, base 0xfbce0000, size 131072, enabled
        bar   [14] = type Memory, range 32, base 0xfbcc0000, size 131072, enabled
        bar   [18] = type I/O Port, range 32, base 0xdf00, size 32, enabled
        cap 01[c8] = powerspec 2 supports D0 D3 current D0
        cap 05[d0] = MSI supports 1 message, 64 bit enabled with 1 message
        cap 10[e0] = PCI-Express 1 endpoint max data 128(256) link x1(x1) speed 2.5(2.5) ASPM disabled(L0s)
        ecap 0001[100] = AER 1 1 fatal 1 non-fatal 5 corrected
        ecap 0003[140] = Serial 1 001b21ffffb43a7f
(In reply to Yaniv Kunda from comment #15) What type of hardware are you using? Can you dump a full pciconf -lv and dmesg into this ticket?
Created attachment 162976: output of pciconf -lv
Created attachment 162977: output of dmesg -a
(In reply to Sean Bruno from comment #16)
I'm using commodity desktop PC hardware, with NAS4Free installed on a USB stick. I have attached the outputs of pciconf -lv and dmesg -a. I'd like to help find the root cause of this, so please tell me what else I can do to assist!
(In reply to Yaniv Kunda from comment #19)
I also tried again after running ifconfig em0 -tso (and verified that it removed TSO4 from em0's options), but with the same result: a large file copy was silently corrupted.
Can you folks try setting the processing limits to -1? I'm curious if what you're seeing is a symptom of conflicting processing during interrupt context and deferred handling to the taskqueue threads.

change:

    hw.em.rx_process_limit: 100

to:

    hw.em.rx_process_limit = -1
(In reply to Sean Bruno from comment #21)
I've set the value to -1 (in loader.conf), rebooted, and verified the new value using sysctl. In contrast to Dmitry, I see the problem even right after a reboot, not only after some uptime. I copied a bunch of files and checked their MD5s, and one of them was wrong, so it is probably not a symptom of what you described.

P.S. I think the importance of this issue should be raised from "Affects Some People". Since part of this problem is silent corruption, I assume a lot of people with Intel adapters have corrupted data and do not even know it.
(In reply to Yaniv Kunda from comment #22)
Yaniv, is PF(4) used on your system with the em interfaces? I tested another machine, a Sun X4200 with em(4) interfaces, and I have *NO* data corruption there! On this system I checked:

    10.1-STABLE r280356 with uptime 231 days
    10.2-STABLE r290385 with uptime 8 days

The Sun X4200 has the same motherboard, BIOS patch level and network interfaces as the X4100. The only difference is that PF was disabled on the X4200. In a week I can enable PF on my X4200 and recheck. Also, on the X4100 I have several times seen data corruption immediately after boot.
(In reply to Dmitry Afanasiev from comment #23) I don't even have PF loaded... :-(
Can you retest with 12-CURRENT?
(In reply to Kevin Bowling from comment #25)
I started rechecking this problem on stable/10 r306961 and currently see no problems. I need more time for more tests, but maybe the problem was already solved in r294958 (or in later changes?). I will report later.
It's possible things were fixed in past versions, but r311849 in -CURRENT has a large cleanup and optimization of the FreeBSD part of the driver.
The bug still exists on stable/10 r309900: I got 3 ssh disconnections and 2 errors in a dumb TCP data transfer per ~3 TB of transmitted data. To check on current I need to move some services off this server and find time to run the tests, but I will try to test this month.
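A "dumb TCP data transfer" check of the kind mentioned above could look like the sketch below. It is not the reporter's actual test program: all function names are hypothetical, and the pattern (blocks derived from a SHA-256 of the block index) is an assumption chosen so that both ends can regenerate the expected data without storing it.

```python
import hashlib
import socket


def pattern_block(index, size=4096):
    """Deterministic block that either end can rebuild from its index."""
    seed = hashlib.sha256(index.to_bytes(8, "big")).digest()  # 32 bytes
    return (seed * (size // len(seed) + 1))[:size]


def send_pattern(sock, num_blocks, block_size=4096):
    """Send num_blocks pattern blocks over a connected stream socket."""
    for i in range(num_blocks):
        sock.sendall(pattern_block(i, block_size))


def recv_exact(sock, n):
    """Read exactly n bytes, or fewer if the peer closes the connection."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            break
        buf.extend(chunk)
    return bytes(buf)


def verify_pattern(sock, num_blocks, block_size=4096):
    """Return the indices of blocks that arrived damaged or truncated."""
    bad = []
    for i in range(num_blocks):
        if recv_exact(sock, block_size) != pattern_block(i, block_size):
            bad.append(i)
    return bad


def run_sender(host, port, num_blocks):
    """Connect to a listening receiver (host and port are placeholders)."""
    with socket.create_connection((host, port)) as sock:
        send_pattern(sock, num_blocks)
```

The receiver would accept a connection on the machine under test and call `verify_pattern()` on the accepted socket. Because no encryption or application-level checksum is involved, any block index it returns was delivered by TCP with wrong contents.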
batch change:

For bugs that match the following
- Status Is In progress
AND
- Untouched since 2018-01-01.
AND
- Affects Base System OR Documentation

DO:

Reset to open status.

Note: I did a quick pass, but if you are getting this email it might be worthwhile to double-check whether this bug ought to be closed.
I'll need more information. Per the 2017 comment, we need to get onto newer code to make sure this is still a problem. Please reopen and assign to me if this happens on FreeBSD 12 or 13.