Bug 211836 - Kernel panic with net.isr enabled
Summary: Kernel panic with net.isr enabled
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: 11.0-RC1
Hardware: amd64 Any
: Normal Affects Many People
Assignee: Adrian Chadd
URL:
Keywords: crash, patch
Depends on:
Blocks:
 
Reported: 2016-08-14 10:58 UTC by Babak Farrokhi
Modified: 2017-08-24 14:09 UTC (History)
6 users (show)

See Also:
koobs: mfc-stable11?
koobs: mfc-stable10?


Attachments
textdump output (86.32 KB, text/plain)
2016-08-14 10:58 UTC, Babak Farrokhi
no flags Details
textdump after applying patch from @ae (78.48 KB, text/plain)
2016-08-15 14:44 UTC, Babak Farrokhi
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Babak Farrokhi freebsd_committer 2016-08-14 10:58:20 UTC
Created attachment 173659 [details]
textdump output

Enabling net.isr.* in loader.conf causes kernel panic. Here is loader.conf:

net.isr.numthreads=2
net.isr.maxthreads=2
net.isr.bindthreads=1

I also attached textdump output that includes stacktrace as well as kernel config.
Comment 1 Enji Cooper freebsd_committer 2016-08-14 19:07:11 UTC
Fatal trap 12 - so bad memory access. I can't immediately repro this on my i386/current VM...

It would be good to know what networking services were in use when the system panicked.

Passing over to -net@ for further inspection.
Comment 2 Enji Cooper freebsd_committer 2016-08-14 19:59:37 UTC
(In reply to Ngie Cooper from comment #1)

Sidenote: I'm using GENERIC-NODEBUG, so that might affect the repro. I'll look at the KERNCONF later.
Comment 3 Babak Farrokhi freebsd_committer 2016-08-15 03:37:58 UTC
Nothing fancy in kernel except for RSS.
I could reproduce this on three different machines with three different NICs on each (ix, bge, igb).
Machine panicked within less than a minute after bootup.
Comment 4 Andrey V. Elsukov freebsd_committer 2016-08-15 10:55:25 UTC
I guess it is related to RSS option.
Your system has many CPU cores, so RSS uses all of them. But netisr has only 2 initialized threads. 
netisr_select_cpuid() uses rss_soft_m2cpuid_v4() to determine cpuid. Since the last one doesn't limit cpuid with nws_count, we can get cpuid for which we didn't initialized any data. 
This works for one netisr thread, because we return *cpuidp = nws_array[0], when nws_count == 1.
Comment 5 Babak Farrokhi freebsd_committer 2016-08-15 12:06:41 UTC
(In reply to Andrey V. Elsukov from comment #4)
So you suggest setting net.isr.maxthreads=32 would fix this (aside from locking overhead)? Isn't RSS aware of maximum CPUs used by netisr?
Comment 6 Andrey V. Elsukov freebsd_committer 2016-08-15 12:16:49 UTC
(In reply to Babak Farrokhi from comment #5)
> (In reply to Andrey V. Elsukov from comment #4)
> So you suggest setting net.isr.maxthreads=32 would fix this (aside from
> locking overhead)? Isn't RSS aware of maximum CPUs used by netisr?

I think netisr code should be fixed, we even have in the comment suggested solution: DPCPU_ID_GET(nws_array[arbitraryvalue % nws_count], nws)

So, probably if we define macro:
#define NETISR_WS(_cpuid)   DPCPU_ID_GET(nws_array[(_cpuid) % nws_count], nws)

Then replace all DPCPU_ID_GET(xxx, nws) with NETISR_WS(xxx), I think this will help. But, maybe Robert or Adrian will suggest something better.
Comment 7 Andrey V. Elsukov freebsd_committer 2016-08-15 13:44:41 UTC
So, it looks like this is enough to fix the panic:

Index: sys/net/netisr.c
===================================================================
--- sys/net/netisr.c	(revision 304101)
+++ sys/net/netisr.c	(working copy)
@@ -1014,7 +1014,7 @@ netisr_queue_internal(u_int proto, struct mbuf *m,
 
 	dosignal = 0;
 	error = 0;
-	nwsp = DPCPU_ID_PTR(cpuid, nws);
+	nwsp = DPCPU_ID_PTR(nws_array[cpuid % nws_count], nws);
 	npwp = &nwsp->nws_work[proto];
 	NWS_LOCK(nwsp);
 	error = netisr_queue_workstream(nwsp, proto, npwp, m, &dosignal);
Comment 8 Babak Farrokhi freebsd_committer 2016-08-15 14:17:42 UTC
(In reply to Andrey V. Elsukov from comment #7)
Worked for me. Thank you!
Comment 9 Babak Farrokhi freebsd_committer 2016-08-15 14:44:08 UTC
Created attachment 173703 [details]
textdump after applying patch from @ae

Panicked again after patch
Comment 10 Andrey V. Elsukov freebsd_committer 2016-08-15 16:03:50 UTC
(In reply to Babak Farrokhi from comment #9)
> Created attachment 173703 [details]
> textdump after applying patch from @ae
> 
> Panicked again after patch

Ok, this is yet another case :)
Can you test this patch instead? It should cover both (all) cases.
The NETISR_DISPATCH_HYBRID case looks not quite correct, but it is better than panic...

Index: sys/net/netisr.c
===================================================================
--- sys/net/netisr.c	(revision 304101)
+++ sys/net/netisr.c	(working copy)
@@ -810,10 +810,12 @@ netisr_select_cpuid(struct netisr_proto *npp, u_in
 		 * dispatch.  In the queued case, fall back on the SOURCE
 		 * policy.
 		 */
-		if (*cpuidp != NETISR_CPUID_NONE)
+		if (*cpuidp != NETISR_CPUID_NONE) {
+			*cpuidp = nws_array[ *cpuidp % nws_count ];
 			return (m);
+		}
 		if (dispatch_policy == NETISR_DISPATCH_HYBRID) {
-			*cpuidp = curcpu;
+			*cpuidp = nws_array[ curcpu % nws_count ];
 			return (m);
 		}
 		policy = NETISR_POLICY_SOURCE;
Comment 11 Adrian Chadd freebsd_committer 2016-08-15 17:54:49 UTC
hiya,

Hm, I thought i captured/fixed this.

I'd prefer the solution involve writing a function that takes the CPU and returns the pointer in question, and that function enforces the maximum netisr CPU count. That way we can fix it in one place and then I can change the behaviour in one place in the future.
Comment 12 Babak Farrokhi freebsd_committer 2016-08-16 06:39:06 UTC
(In reply to Andrey V. Elsukov from comment #10)
No crash with this patch for over an hour (it usually crashed within a few minutes after boot).
Comment 13 commit-hook freebsd_committer 2016-08-17 20:21:49 UTC
A commit references this bug:

Author: ae
Date: Wed Aug 17 20:21:34 UTC 2016
New revision: 304313
URL: https://svnweb.freebsd.org/changeset/base/304313

Log:
  Teach netisr_get_cpuid() to limit a given value to supported by netisr.
  Use netisr_get_cpuid() in netisr_select_cpuid() to limit cpuid value
  returned by protocol to be sure that it is not greather than nws_count.

  PR:		211836
  Reviewed by:	adrian
  MFC after:	3 days

Changes:
  head/sys/net/if_epair.c
  head/sys/net/netisr.c
Comment 14 Kubilay Kocak freebsd_committer freebsd_triage 2016-08-18 14:02:42 UTC
Assign to committer resolving
Comment 15 Robert Watson freebsd_committer 2016-08-18 16:56:36 UTC
While not crashing is definitely better than crashing, this patch would benefit from some further refinement with respect to the hybrid case. It's probably necessary for us to instantiate a workstream for each CPU regardless of whether there are enough netisr threads to fully occupy them. In direct dispatch and hybrid modes (when directly dispatching), the CPU-local workstream should be used. When doing a deferred dispatch, we should select instead, the target workstream of the netisr thread we want to do the work. I believe the current change (without having read too closely) may unnecessarily force deferred dispatch in hybrid mode on the basis that there isn't a local netisr -- even though a netisr thread isn't required to do a direct dispatch. However, that could be a misreading as I've not had time to replumb the code in any detail (it's been a few years!).
Comment 16 Adrian Chadd freebsd_committer 2016-08-18 20:25:10 UTC
We can revert it if you would like me to take a closer look at the hybrid / direct dispatch modes.

I have all the stuff to test it here..
Comment 17 Robert Watson freebsd_committer 2016-08-19 19:56:20 UTC
I'm fine with the current commit staying in for now, as crashing is definitely worse than not crashing -- but think it would make sense to take a closer look.
Comment 18 commit-hook freebsd_committer 2016-08-21 17:26:43 UTC
A commit references this bug:

Author: ae
Date: Sun Aug 21 17:26:16 UTC 2016
New revision: 304568
URL: https://svnweb.freebsd.org/changeset/base/304568

Log:
  MFC r304313:
    Teach netisr_get_cpuid() to limit a given value to supported by netisr.
    Use netisr_get_cpuid() in netisr_select_cpuid() to limit cpuid value
    returned by protocol to be sure that it is not greather than nws_count.

    PR:		211836

Changes:
_U  stable/11/
  stable/11/sys/net/if_epair.c
  stable/11/sys/net/netisr.c
Comment 19 Mark Linimon freebsd_committer freebsd_triage 2017-08-24 14:09:22 UTC
Committed back in 2016.