Created attachment 173659 [details] textdump output Enabling net.isr.* in loader.conf causes kernel panic. Here is loader.conf: net.isr.numthreads=2 net.isr.maxthreads=2 net.isr.bindthreads=1 I also attached textdump output that includes stacktrace as well as kernel config.
Fatal trap 12 - so bad memory access. I can't immediately repro this on my i386/current VM... It would be good to know what networking services were in use when the system panicked. Passing over to -net@ for further inspection.
(In reply to Ngie Cooper from comment #1) Sidenote: I'm using GENERIC-NODEBUG, so that might affect the repro. I'll look at the KERNCONF later.
Nothing fancy in kernel except for RSS. I could reproduce this on three different machines with three different NICs on each (ix, bge, igb). Machine panicked within less than a minute after bootup.
I guess it is related to RSS option. Your system has many CPU cores, so RSS uses all of them. But netisr has only 2 initialized threads. netisr_select_cpuid() uses rss_soft_m2cpuid_v4() to determine cpuid. Since the last one doesn't limit cpuid with nws_count, we can get cpuid for which we didn't initialized any data. This works for one netisr thread, because we return *cpuidp = nws_array[0], when nws_count == 1.
(In reply to Andrey V. Elsukov from comment #4) So you suggest setting net.isr.maxthreads=32 would fix this (aside from locking overhead)? Isn't RSS aware of maximum CPUs used by netisr?
(In reply to Babak Farrokhi from comment #5) > (In reply to Andrey V. Elsukov from comment #4) > So you suggest setting net.isr.maxthreads=32 would fix this (aside from > locking overhead)? Isn't RSS aware of maximum CPUs used by netisr? I think netisr code should be fixed, we even have in the comment suggested solution: DPCPU_ID_GET(nws_array[arbitraryvalue % nws_count], nws) So, probably if we define macro: #define NETISR_WS(_cpuid) DPCPU_ID_GET(nws_array[(_cpuid) % nws_count], nws) Then replace all DPCPU_ID_GET(xxx, nws) with NETISR_WS(xxx), I think this will help. But, maybe Robert or Adrian will suggest something better.
So, it looks like this is enough to fix the panic: Index: sys/net/netisr.c =================================================================== --- sys/net/netisr.c (revision 304101) +++ sys/net/netisr.c (working copy) @@ -1014,7 +1014,7 @@ netisr_queue_internal(u_int proto, struct mbuf *m, dosignal = 0; error = 0; - nwsp = DPCPU_ID_PTR(cpuid, nws); + nwsp = DPCPU_ID_PTR(nws_array[cpuid % nws_count], nws); npwp = &nwsp->nws_work[proto]; NWS_LOCK(nwsp); error = netisr_queue_workstream(nwsp, proto, npwp, m, &dosignal);
(In reply to Andrey V. Elsukov from comment #7) Worked for me. Thank you!
Created attachment 173703 [details] textdump after applying patch from @ae Panicked again after patch
(In reply to Babak Farrokhi from comment #9) > Created attachment 173703 [details] > textdump after applying patch from @ae > > Panicked again after patch Ok, this is yet another case :) Can you test this patch instead? It should cover both (all) cases. The NETISR_DISPATCH_HYBRID case looks not quite correct, but it is better than panic... Index: sys/net/netisr.c =================================================================== --- sys/net/netisr.c (revision 304101) +++ sys/net/netisr.c (working copy) @@ -810,10 +810,12 @@ netisr_select_cpuid(struct netisr_proto *npp, u_in * dispatch. In the queued case, fall back on the SOURCE * policy. */ - if (*cpuidp != NETISR_CPUID_NONE) + if (*cpuidp != NETISR_CPUID_NONE) { + *cpuidp = nws_array[ *cpuidp % nws_count ]; return (m); + } if (dispatch_policy == NETISR_DISPATCH_HYBRID) { - *cpuidp = curcpu; + *cpuidp = nws_array[ curcpu % nws_count ]; return (m); } policy = NETISR_POLICY_SOURCE;
hiya, Hm, I thought i captured/fixed this. I'd prefer the solution involve writing a function that takes the CPU and returns the pointer in question, and that function enforces the maximum netisr CPU count. That way we can fix it in one place and then I can change the behaviour in one place in the future.
(In reply to Andrey V. Elsukov from comment #10) No crash with this patch for over an hour (it usually crashed within a few minutes after boot).
A commit references this bug: Author: ae Date: Wed Aug 17 20:21:34 UTC 2016 New revision: 304313 URL: https://svnweb.freebsd.org/changeset/base/304313 Log: Teach netisr_get_cpuid() to limit a given value to supported by netisr. Use netisr_get_cpuid() in netisr_select_cpuid() to limit cpuid value returned by protocol to be sure that it is not greather than nws_count. PR: 211836 Reviewed by: adrian MFC after: 3 days Changes: head/sys/net/if_epair.c head/sys/net/netisr.c
Assign to committer resolving
While not crashing is definitely better than crashing, this patch would benefit from some further refinement with respect to the hybrid case. It's probably necessary for us to instantiate a workstream for each CPU regardless of whether there are enough netisr threads to fully occupy them. In direct dispatch and hybrid modes (when directly dispatching), the CPU-local workstream should be used. When doing a deferred dispatch, we should select instead, the target workstream of the netisr thread we want to do the work. I believe the current change (without having read too closely) may unnecessarily force deferred dispatch in hybrid mode on the basis that there isn't a local netisr -- even though a netisr thread isn't required to do a direct dispatch. However, that could be a misreading as I've not had time to replumb the code in any detail (it's been a few years!).
We can revert it if you would like me to take a closer look at the hybrid / direct dispatch modes. I have all the stuff to test it here..
I'm fine with the current commit staying in for now, as crashing is definitely worse than not crashing -- but think it would make sense to take a closer look.
A commit references this bug: Author: ae Date: Sun Aug 21 17:26:16 UTC 2016 New revision: 304568 URL: https://svnweb.freebsd.org/changeset/base/304568 Log: MFC r304313: Teach netisr_get_cpuid() to limit a given value to supported by netisr. Use netisr_get_cpuid() in netisr_select_cpuid() to limit cpuid value returned by protocol to be sure that it is not greather than nws_count. PR: 211836 Changes: _U stable/11/ stable/11/sys/net/if_epair.c stable/11/sys/net/netisr.c
Committed back in 2016.