Bug 240658

Summary: iflib: if_igb(4) and some if_em(4) devices don't recognize/report carrier loss.
Product: Base System Reporter: Harald Schmalzbauer <bugzilla.freebsd>
Component: kernAssignee: freebsd-net (Nobody) <net>
Status: Closed DUPLICATE    
Severity: Affects Many People CC: chris, emaste, krzysztof.galazka, net, webmaster
Priority: --- Keywords: needs-qa
Version: 12.0-STABLE   
Hardware: Any   
OS: Any   
URL: https://lists.freebsd.org/pipermail/freebsd-net/2018-May/050497.html
See Also: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=239240
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=236724
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=240818

Description Harald Schmalzbauer 2019-09-18 06:20:09 UTC
Hello,

reported back in May 2018 with a Clarkville MAC:
https://lists.freebsd.org/pipermail/freebsd-net/2018-May/050497.html

Currently I see this again in 12.1-pro with Powerville (i211), while other if_igb(4) NICs are not affected (i350).

I don't have a i217 handy to verify if this bug reoccured, or if my "not reproducable anymore" was simply wrong.

Can anybody remember when/who worked on that issue?

This is a major issue, since if_lagg(4) clones depend on the link state information and pulling a cable leads to interrupted communication – the opposite of what if_lagg(4) should provide! i211 is a widely used on-board server nic, so most likely many were affected by this issue if shipped with 12.1.

Thanks,

-Harry
Comment 1 Harald Schmalzbauer 2019-09-18 07:25:18 UTC
Sorry, need to correct myself, much more if_em(4) _and_ if_igb(4) NICs are affected!

Took all Intel NICs I have available to verify (in an S1200v3RP server):

82574L (Hartwell) - if_em(4) attaches
And the followig if_igb(4) NICs:
82576 (Kawela)
i210 (Springville)
i350 (StonyLake)

According to ifconfig(8), all NICs above report active link state, with the initial media speed, no matter if a cable is attached at all, or the link speed changed (100->1000 but ifconfig still shows 100Base-TX).

So currently, with 12.1-PRERELEASE r352354, I can see the issue with all Intel NICs I have available for testing, with one exception:
On a different machine with on-board i217-V (Clarkville, the one I reported in my initial post), the link state swiftly and correctly changes.

Unfortunately I won't find time to look into the svn hisroty to see if somebody fixed i217 back then or inspect anything in more detail – for the next few weeks at least :-(

Hope this can be fixed before 12.1 ships.

Thanks,

-harry
Comment 2 Krzysztof Galazka 2019-09-19 14:41:39 UTC
This issue was also reported in this PR:https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=239240
We were able to reproduce it and are working on a fix.
Comment 3 Kubilay Kocak freebsd_committer freebsd_triage 2019-09-20 04:24:43 UTC
Closing per comment 2

*** This bug has been marked as a duplicate of bug 239240 ***
Comment 4 Harald Schmalzbauer 2019-09-24 17:22:04 UTC
Initially, https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=239240 describes a different issue and I'd like to keep that separate, although John Delano added comment #3, where he mentions the same link state issue. But the proposed fix indicates this is an independent issue!

The proposed phabriactor patch in https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=239240#c13 (https://reviews.freebsd.org/D21769) doens't fix the problem.

There's progress though: After once marking the interface down and up again, the link state correctly changes from there on.  But right after loading the kernel module, link state still doesn't change when physical connection was lost (repeatedly won't change, unless you manually down/up).
Tested with 12.1-beta1, D21769+D21712 and i210, i350, 82576 (all hw.mac.type igb ) and 82574L (em) - consistently the same behaviour.

Thanks,

-Harry
Comment 5 Harald Schmalzbauer 2019-09-25 07:21:38 UTC
(In reply to John Delano from comment #16) in Bug 239240

Thanks for your details.  Besides the link state issue and the initial issue reported in https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=240658 we are referring a third issue/fix with your latest comment – the first-packet completion issue.
Since you mention that you didn't apply the proposed fix for the real 240658 issue, I use this bug report for my comment, which describes one of the issues you addressed and to which your feedback corresponds better.

Regarding the 1st-packet-fix:
Thanks to the very detailed commit log in r343369 from erj@, he explains why r343369 replaces r341156, r340310 and r341156.

This new fix (r343369) was MFCd in 344097 by marius, back in february.
How/why do you apply https://reviews.freebsd.org/D18545?


Regarding the topic (and link-state-fix https://reviews.freebsd.org/D21769):
I cannot reproduce this odd 1st-linkdetection-failure in my ESXi passthrough environment!  But I intentionally took the plain iron and packed all intel NICs I have into it, to do the test without any hypervisor/passthrough in between.
I did the test twice, since it seemed very odd to me too: After cold-start booting from ISO with latest 12.1-BETA1 with https://reviews.freebsd.org/D21769 applied, the link state still reports the last state/speed before unplugging, and does not catch the line drop, which changes when I once bring the interface down/up.
In this scenario, rc(8) hasn't configured the NICs!  Mabye that's the difference to your setup.
Will see to find time in the evening to further isolate the special condition.

Thanks,

-harry
Comment 6 Kubilay Kocak freebsd_committer freebsd_triage 2019-09-25 10:57:12 UTC
Reset assignee after re-open
Comment 7 John Delano 2019-09-25 13:32:29 UTC
(In reply to Harald Schmalzbauer from comment #5)

https://reviews.freebsd.org/D18545 is listed in comment #6 in https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=239240. Since it was supposed to fix the problem, I applied the patch.
Comment 8 Harald Schmalzbauer 2019-09-25 14:43:17 UTC
(In reply to John Delano from comment #7)

To what FreeBSD version did you apply D18545 from comment #6 of the queue-detection-issue bug report?
As expected, it's not applicable to 12.1 for me and should also fail on all other versions newer than stable/12 from february:

patch -C -p1 -i D18545.diff
Hmm...  Looks like a unified diff to me...
The text leading up to this was:
--------------------------
|Index: head/sys/dev/e1000/em_txrx.c
|===================================================================
|--- head/sys/dev/e1000/em_txrx.c
|+++ head/sys/dev/e1000/em_txrx.c
--------------------------
Patching file sys/dev/e1000/em_txrx.c using Plan A...
Reversed (or previously applied) patch detected!  Assume -R? [y] n
Apply anyway? [n] y
Hunk #1 failed at 457.
1 out of 1 hunks failed while patching sys/dev/e1000/em_txrx.c
:
:
:
--------------------------
|Index: head/sys/dev/ixl/ixl_txrx.c
|===================================================================
|--- head/sys/dev/ixl/ixl_txrx.c
|+++ head/sys/dev/ixl/ixl_txrx.c
--------------------------
Patching file sys/dev/ixl/ixl_txrx.c using Plan A...
Reversed (or previously applied) patch detected!  Assume -R? [y] n
Apply anyway? [n] y
Hunk #1 failed at 515.
Hunk #2 failed at 788.
Hunk #3 failed at 805.
3 out of 3 hunks failed while patching sys/dev/ixl/ixl_txrx.c
done
Comment 9 Harald Schmalzbauer 2019-09-25 14:43:54 UTC
Update:

The link-state-change issue doesn't affect _configured_ interfaces, so the fix in https://reviews.freebsd.org/D21769 is most likely perfectly valid.

I guess the remaining issue is by design.
As long as the interface was never configured, state change only works once in one direction -> recorded as active.
If I have the interface configured before or afterwards, connection change/loss will be recognized.  But not for unconfigured interfaces.
So the issue has no production influencing effects.  It's might just caus confusion on systems with (cold) standby interfaces, since the admin might assume there's a link active and searches all switches unsuccessfully for that link…

During this test I saw that assigning an IP address to an already link-established em(4) or igb(4) interface, causes loss/interruption of the establisehd ethernet link.
Don't know if this is desired…
Guess these facts are related and as long as the internal steps configuring a em/igb(4) (or probably iflib generic?) NIC stay as they are, it won't be possible to catch the link state loss with unconfigured/cold-stanby interfaces, right?

Thanks,

-Harry(In reply to John Delano from comment #7)
Comment 10 John Delano 2019-09-25 16:08:06 UTC
(In reply to Harald Schmalzbauer from comment #8)

Installed on Release 12.0
Comment 11 Harald Schmalzbauer 2019-09-26 17:39:07 UTC
(In reply to Harald Schmalzbauer from comment #9)

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=240818 confirms that the remaining issue is by design (iflib igb).
Since the fix for my issue is also reported in Bug 236724, I'll close this as duplicate.

*** This bug has been marked as a duplicate of bug 236724 ***