Bug 240944 - em(4): Crash with Intel 82571EB NIC with AMD Piledriver and Steamroller APUs
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 12.0-RELEASE
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: freebsd-net (Nobody)
URL: https://www.reddit.com/r/PFSENSE/comm...
Keywords: IntelNetworking, crash, needs-qa
Depends on:
Blocks:
 
Reported: 2019-09-30 16:50 UTC by tinfever
Modified: 2021-09-08 22:17 UTC
CC List: 7 users

See Also:
koobs: mfc-stable12?
koobs: mfc-stable11?


Attachments
pciconf -l -vbc and dmesg from AMD Piledriver APU and HP NC364T NIC (29.50 KB, text/plain)
2019-10-01 03:47 UTC, Ron
pciconf -l -vbc from NC364T and AMD RX-427BB (16.35 KB, text/plain)
2019-10-01 05:57 UTC, tinfever
dmesg from NC364T and AMD RX-427BB Steamroller (11.37 KB, text/plain)
2019-10-01 06:01 UTC, tinfever
dmesg output (10.66 KB, text/plain)
2021-09-05 17:04 UTC, Ace
ifconfig em0 output (535 bytes, text/plain)
2021-09-05 17:05 UTC, Ace
pciconf output (16.36 KB, text/plain)
2021-09-05 17:06 UTC, Ace

Description tinfever 2019-09-30 16:50:46 UTC
When using an Intel 82571EB quad-port gigabit NIC (HP NC364T) in a system with an AMD Piledriver or Steamroller APU, any taxing Ethernet workload will cause the entire system to either permanently lock up (requiring a hard reboot by holding down the power button to reset), or crash and reboot.

Initial discussion of this issue can be found in the pfSense subreddit: https://www.reddit.com/r/PFSENSE/comments/da6nh7/multiport_intel_82571ebbased_network_cards_not/

I have confirmed this issue is present even on the FreeBSD 12.0 release so it is not an issue introduced by pfSense or OPNsense.

Hardware used for testing and reproduction:

HP T730 thin client
AMD RX-427BB APU (4 core)
4GB RAM
16GB SSD
HP NC364T Network Controller (Uses Intel 82571EB chipset)

Steps to reproduce:

1. Download "FreeBSD-12.0-RELEASE-amd64-mini-memstick.img".
2. Use Etcher to write the image to a flash drive.
3. Boot from the flash drive in UEFI or Legacy mode.
4. Run as live CD instead of the installer.
5. Run "dhclient em3" to get network access.
6. Run the command "fetch http://speedtest.tele2.net/1GB.zip -o /dev/null".
7. The crash happens within the first 60 seconds.

When it crashes, the system goes 100% unresponsive. No amount of Ctrl-Alt-Del will do anything. I have to hold the power button to do a hard reboot. The system has been left in the unresponsive state for 12+ hours with no change. There are no log entries or bug checks present (that I have seen).

Troubleshooting done:

1. Issue reoccurs using both UEFI and Legacy boot methods
2. Issue reoccurs with onboard NIC disabled
3. Issue reoccurs using included 7.6.1 and latest 7.7.5 Intel drivers
4. Issue reoccurs when using either USB or SSD boot media
5. Hardware works perfectly fine when booting to Ubuntu live USB for testing
6. Hardware passed memtest64+ @ 4 passes

Other user reports (see Reddit discussion link) indicate that:

1. Issue reoccurs on every NC364T card when testing several of them
2. Issue occurs on NC360T (dual-port Intel 82571EB NIC)
3. Issue does not occur with NC112T (Intel 82574L-based single port NIC)
4. Issue occurs with other versions of pfSense including 2.1.5, 2.2.4, 2.3.3, and 2.5.0 (20190928). I have not looked up the correlation between pfSense and FreeBSD versions.

I'm no Linux/Unix guru, nor do I have very much time I can allocate to seeing this through to a resolution, but I will try to assist in any testing or data collection needed to hopefully address this issue. Thanks.
Comment 1 Krzysztof Galazka 2019-09-30 19:11:30 UTC
(In reply to tinfever from comment #0)
Could you please provide the output of pciconf -l -vbc and dmesg for that NIC?
Comment 2 Ron 2019-10-01 03:47:28 UTC
Created attachment 207973
pciconf -l -vbc and dmesg from AMD Piledriver APU and HP NC364T NIC

pciconf -l -vbc and dmesg as requested in comment #1
Comment 3 tinfever 2019-10-01 05:57:04 UTC
Created attachment 207977
pciconf -l -vbc from NC364T and AMD RX-427BB

Attached copy of requested output of "pciconf -l -vbc" on system using NC364T and AMD RX-427BB.
Comment 4 tinfever 2019-10-01 06:01:30 UTC
Created attachment 207978
dmesg from NC364T and AMD RX-427BB Steamroller

Attached copy of requested output of dmesg on system using NC364T and AMD RX-427BB.

I had surprising difficulty getting these logs off the machine since even an ssh session seems to be enough to crash everything sometimes. I've also noticed that if you catch it crashing and start mashing buttons on the keyboard, you can see it register the key presses really slowly for a second until it eventually registers nothing at all.
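
One way to avoid pulling logs over a network session is to write them to local storage and copy them off after a reboot. The following is only a minimal sketch under stated assumptions: `collect_diag` is a hypothetical helper name, the output file names are arbitrary, and the target directory is assumed to be on writable media such as a second flash drive.

```sh
# Hypothetical helper: save diagnostic output to local files so it does not
# have to be copied over SSH. The file names below are illustrative choices.
collect_diag() {
    outdir=$1
    mkdir -p "$outdir" || return 1
    # pciconf(8) is FreeBSD-specific; skip it quietly where it is missing.
    if command -v pciconf >/dev/null 2>&1; then
        pciconf -l -vbc > "$outdir/pciconf.txt" 2>&1 || true
    fi
    if command -v dmesg >/dev/null 2>&1; then
        dmesg > "$outdir/dmesg.txt" 2>&1 || true
    fi
    # Flush buffers in case the machine locks up shortly afterwards.
    sync
    return 0
}
```

For example, `collect_diag /mnt/usb/diag` would leave the two text files in that directory, where `/mnt/usb` is an assumed mount point.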
Comment 5 bhert 2020-05-08 23:09:37 UTC
Dropping in to state that I am seeing the same issue with my HP t730 and both an HP NC364T and an NC360T. I followed the same reproduction steps as tinfever but with the 12.1 release. Also tested with pfSense 2.5 (based on FreeBSD 12.1) running as an iperf server.

I do also see this with pfSense 2.4.5 (based on FreeBSD 11.3).

If any further information is needed, please let me know.
Comment 6 Alin 2021-04-10 11:48:07 UTC
Same issue happens to me :| I am running it on HP T730 with HP NC365T Network Controller. 32GB SSD, 2x4GB RAM (brand new) 

Trying to make it work with pfSense 2.4.5 and 2.5 (FreeBSD 12.2-Stable). Changed different RAM sticks, SSDs, NICs. The only thing I have not changed is the CPU.

It works for about an hour, maybe less, and then becomes unresponsive and requires a hard reboot.

Let me know if you found any solution/workaround, or whether I need to repurpose the box for something else.

Thanks :)
Comment 7 Ace 2021-09-05 12:44:31 UTC
Stumbled upon the same issue today. Took the card out and it works fine again. Happy to provide any details if necessary.
Comment 8 Kevin Bowling freebsd_committer 2021-09-05 16:01:17 UTC
(In reply to Ace from comment #7)
Please do, I'd like to see the output of 'ifconfig em0' and a 'dmesg'.
Comment 9 Ace 2021-09-05 17:04:44 UTC
Created attachment 227688
dmesg output
Comment 10 Ace 2021-09-05 17:05:31 UTC
Created attachment 227689
ifconfig em0 output
Comment 11 Ace 2021-09-05 17:06:05 UTC
Created attachment 227690
pciconf output
Comment 12 Ace 2021-09-05 17:07:22 UTC
(In reply to Kevin Bowling from comment #8)
Thanks, I've attached them. I've also included pciconf -l -vbc output for good measure as asked in comment #1
Comment 13 Kevin Bowling freebsd_committer 2021-09-05 18:27:39 UTC
(In reply to Ace from comment #12)
> ecap 0001[100] = AER 1 0 fatal 1 non-fatal 2 corrected

You have some fatal PCI errors occurring on the card, and that looks consistent with the other pciconf reports. Just to start with a low-effort guess, can you try disabling PCI Link Power management (ASPM) and/or AER (advanced error reporting) in the system's firmware and see what happens?

Beyond that, there are a number of relevant errata we may need to check off in the driver to see if we are missing some mitigation (see http://iommu.com/datasheets/e1000-datasheets/82571eb-82572ei-gbe-controller-spec-update.pdf). The above two firmware changes stand out to me as eliminating some possible issues.
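
To compare the AER counters across the attached pciconf reports, the quoted line can be split into its fields with a short filter. This is a sketch, not part of any existing tool: `aer_counts` is a hypothetical helper name, and it assumes the line is laid out as `AER <version> <n> fatal <n> non-fatal <n> corrected`, with each count preceding its label.

```sh
# Hypothetical helper: read `pciconf -l -vbc` style text on stdin and print
# the AER counters, one line per AER capability found.
# Assumed layout: "AER <version> <n> fatal <n> non-fatal <n> corrected".
aer_counts() {
    sed -n 's/.*= AER [0-9][0-9]* \([0-9][0-9]*\) fatal \([0-9][0-9]*\) non-fatal \([0-9][0-9]*\) corrected.*/fatal=\1 non-fatal=\2 corrected=\3/p'
}
```

Under that assumption, feeding it the line quoted above prints `fatal=0 non-fatal=1 corrected=2`. On a live system it could be used as `pciconf -l -vbc | aer_counts`; lines without an AER capability produce no output.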
Comment 14 Ace 2021-09-08 20:39:20 UTC
> try disabling PCI Link Power management (ASPM) and/or AER (advanced error reporting) in the system's firmware and see what happens?

Sorry mate, I'm struggling to figure out how to do this. Sorry if the following sounds dumb in this context. I'm using UEFI but don't see any such option, nor do I find anything on the internet on how to disable these.
Comment 15 Kevin Bowling freebsd_committer 2021-09-08 21:14:40 UTC
(In reply to Ace from comment #14)
If the UEFI had options for it, I think it would be obvious, so it may not have the knobs exposed. It will be tricky to proceed and make any fixes without a card.
Comment 16 Ace 2021-09-08 22:17:30 UTC
(In reply to Kevin Bowling from comment #15)

I just bought a new Intel i350-T4, which users online have reported no issues with in combination with the HP T730 and OPNsense, so fingers crossed that it will fare better.

Having said that, if you are UK based, I'd be happy to post the HP NC364T card to you if it helps other users since I'll have no use for it. Please contact me directly via email if you're up for that.