Bug 228076

Summary: [sched_ule] Not stealing process from loaded CPU
Product: Base System Reporter: Mit Matelske <mit>
Component: kern Assignee: freebsd-bugs (Nobody) <bugs>
Status: New ---    
Severity: Affects Only Me CC: cem, eadler, emaste, fidaj, markj, truckman
Priority: --- Keywords: patch
Version: 11.1-RELEASE   
Hardware: amd64   
OS: Any   
Attachments:
Description Flags
fix sched_balance none

Description Mit Matelske 2018-05-08 18:13:04 UTC
We are having a problem with the ULE scheduler: it schedules a process onto a processor that is monopolized by a realtime thread, and the process then never runs and never gets "stolen" from that processor's queue to run on another.

This happens readily in our application utilizing FreeBSD.  Since it would be impossible for a third party to run our software, I came up with a test suite that reproduces the issue, but it needs to run for a while before the lockup happens.

To reproduce the issue, run three shells. One runs "many" which starts realtime threads on all but one available processor.  Next run "runOne" which starts a realtime thread on the last available processor for a period of time, then exits, repeating the process.  Finally, run "runIfconfig" which just calls ifconfig in a loop.  Eventually ifconfig will get stuck in "RUN" state.

Changing the "kern.sched.steal_thresh" to 1 has no effect.

The source for the programs and scripts follows the test machine's details.

Snapshot of top showing the locked process (ifconfig) in the test:

last pid: 89788;  load averages:  8.02,  8.09,  9.71                                                                                                        up 0+19:15:59  14:40:09
49 processes:  9 running, 40 sleeping
CPU: 87.5% user,  0.0% nice,  0.0% system,  0.0% interrupt, 12.5% idle
Mem: 6784K Active, 50M Inact, 3087M Wired, 286M Buf, 28G Free
Swap: 4096M Total, 4096M Free

  PID USERNAME   PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
89782 root        20    0 19600K  3660K ttyin   7   0:00   0.00% csh
89779 root        20    0 83092K  8288K select  7   0:00   0.00% sshd
89725 root        52    0  6244K  1996K nanslp  7   0:00   0.00% sleep
89717 root        52    0 13144K  2832K wait    7   0:00   0.00% sh
89125 root        20    0 19600K  3704K ttyin   7   0:00   0.00% csh
53534 root        20    0  8392K  2220K piperd  7   0:00   0.00% mail
53533 root        52    0 13144K  2816K wait    7   0:00   0.00% sh
53526 root        52    0 13144K  2808K wait    7   0:00   0.00% sh
53525 root        21    0  6244K  2004K wait    7   0:00   0.00% lockf
53523 root        21    0 13144K  2808K wait    7   0:00   0.00% sh
53521 root        20    0 13144K  2812K wait    7   0:00   0.00% sh
51764 root        20    0  8392K  2216K piperd  7   0:00   0.00% mail
51763 root        20    0 13144K  2816K wait    7   0:00   0.00% sh
51754 root        52    0 13144K  2808K wait    7   0:00   0.00% sh
51752 root        21    0  6244K  2004K wait    7   0:00   0.00% lockf
51748 root        21    0 13144K  2808K wait    7   0:00   0.00% sh
51746 root        20    0 12564K  2564K piperd  7   0:00   0.00% cron
38859 root        72    0 19104K  2768K RUN     6   0:00   0.00% ifconfig
 6945 root        20    0 13460K  2452K nanslp  7   0:00   0.00% many{many}
 6945 root       -21 r31F 13460K  2452K CPU0    0  19.2H  99.06% many{many}
 6945 root       -21 r31F 13460K  2452K CPU1    1  19.2H  89.40% many{many}
 6945 root       -21 r31F 13460K  2452K CPU2    2  19.2H 100.00% many{many}
 6945 root       -21 r31F 13460K  2452K CPU3    3  19.2H  89.41% many{many}
 6945 root       -21 r31F 13460K  2452K CPU4    4  19.2H  89.40% many{many}
 6945 root       -21 r31F 13460K  2452K CPU5    5  19.2H  89.39% many{many}
 6945 root       -21 r31F 13460K  2452K CPU6    6  19.1H  99.31% many{many}
  887 root        52    0 16664K  3984K wait    7   2:36   0.00% ksh93
...

Details of the test system:

uname -a:

FreeBSD fbdev 11.1-RELEASE-p9 FreeBSD 11.1-RELEASE-p9 #0: Tue Apr  3 16:59:16 UTC 2018     root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC  amd64

root@fbl:~ # pciconf -l
hostb0@pci0:0:0:0:	class=0x060000 card=0x01588086 chip=0x01588086 rev=0x09 hdr=0x00
pcib1@pci0:0:1:0:	class=0x060400 card=0x01518086 chip=0x01518086 rev=0x09 hdr=0x01
vgapci0@pci0:0:2:0:	class=0x030000 card=0x21118086 chip=0x016a8086 rev=0x09 hdr=0x00
pcib2@pci0:0:6:0:	class=0x060400 card=0x015d8086 chip=0x015d8086 rev=0x09 hdr=0x01
none0@pci0:0:22:0:	class=0x078000 card=0x1c3a8086 chip=0x1c3a8086 rev=0x04 hdr=0x00
ehci0@pci0:0:26:0:	class=0x0c0320 card=0x1c2d8086 chip=0x1c2d8086 rev=0x05 hdr=0x00
pcib3@pci0:0:28:0:	class=0x060400 card=0x1c108086 chip=0x1c108086 rev=0xb5 hdr=0x01
ehci1@pci0:0:29:0:	class=0x0c0320 card=0x1c268086 chip=0x1c268086 rev=0x05 hdr=0x00
pcib4@pci0:0:30:0:	class=0x060401 card=0x244e8086 chip=0x244e8086 rev=0xa5 hdr=0x01
isab0@pci0:0:31:0:	class=0x060100 card=0x1c568086 chip=0x1c568086 rev=0x05 hdr=0x00
ahci0@pci0:0:31:2:	class=0x010601 card=0x1c028086 chip=0x1c028086 rev=0x05 hdr=0x00
none1@pci0:0:31:3:	class=0x0c0500 card=0x1c228086 chip=0x1c228086 rev=0x05 hdr=0x00
ix0@pci0:1:0:0:	class=0x020000 card=0xffffffff chip=0x10fb8086 rev=0x01 hdr=0x00
ix1@pci0:1:0:1:	class=0x020000 card=0xffffffff chip=0x10fb8086 rev=0x01 hdr=0x00
igb0@pci0:2:0:0:	class=0x020000 card=0x00008086 chip=0x150e8086 rev=0x01 hdr=0x00
igb1@pci0:2:0:1:	class=0x020000 card=0x00008086 chip=0x150e8086 rev=0x01 hdr=0x00
igb2@pci0:3:0:0:	class=0x020000 card=0x00008086 chip=0x150e8086 rev=0x01 hdr=0x00
igb3@pci0:3:0:1:	class=0x020000 card=0x00008086 chip=0x150e8086 rev=0x01 hdr=0x00
none2@pci0:4:0:0:	class=0x020000 card=0x012310ec chip=0x816710ec rev=0x10 hdr=0x00

dmesg -a:

Copyright (c) 1992-2017 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
	The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 11.1-RELEASE-p9 #0: Tue Apr  3 16:59:16 UTC 2018
    root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC amd64
FreeBSD clang version 4.0.0 (tags/RELEASE_400/final 297347) (based on LLVM 4.0.0)
VT(vga): resolution 640x480
CPU: Intel(R) Xeon(R) CPU E3-1275 V2 @ 3.50GHz (3492.14-MHz K8-class CPU)
  Origin="GenuineIntel"  Id=0x306a9  Family=0x6  Model=0x3a  Stepping=9
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x7fbae3ff<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,POPCNT,TSCDLT,AESNI,XSAVE,OSXSAVE,AVX,F16C,RDRAND>
  AMD Features=0x28100800<SYSCALL,NX,RDTSCP,LM>
  AMD Features2=0x1<LAHF>
  Structured Extended Features=0x281<FSGSBASE,SMEP,ERMS>
  XSAVE Features=0x1<XSAVEOPT>
  VT-x: PAT,HLT,MTF,PAUSE,EPT,UG,VPID
  TSC: P-state invariant, performance statistics
real memory  = 34359738368 (32768 MB)
avail memory = 33212313600 (31673 MB)
Event timer "LAPIC" quality 600
ACPI APIC Table: <ALASKA A M I>
FreeBSD/SMP: Multiprocessor System Detected: 8 CPUs
FreeBSD/SMP: 1 package(s) x 4 core(s) x 2 hardware threads
random: unblocking device.
ioapic0 <Version 2.0> irqs 0-23 on motherboard
SMP: AP CPU #1 Launched!
SMP: AP CPU #4 Launched!
SMP: AP CPU #2 Launched!
SMP: AP CPU #3 Launched!
SMP: AP CPU #5 Launched!
SMP: AP CPU #6 Launched!
SMP: AP CPU #7 Launched!
Timecounter "TSC-low" frequency 1746072482 Hz quality 1000
random: entropy device external interface
kbd1 at kbdmux0
netmap: loaded module
module_register_init: MOD_LOAD (vesa, 0xffffffff80f5eb40, 0) error 19
random: registering fast source Intel Secure Key RNG
random: fast provider: "Intel Secure Key RNG"
 0: virt=0xfffffe0859800000 phys=0x100000000
 1: virt=0xfffffe0899800000 phys=0x140000000
nexus0
vtvga0: <VT VGA driver> on motherboard
cryptosoft0: <software crypto> on motherboard
acpi0: <ALASKA A M I> on motherboard
acpi0: Power Button (fixed)
cpu0: <ACPI CPU> on acpi0
cpu1: <ACPI CPU> on acpi0
cpu2: <ACPI CPU> on acpi0
cpu3: <ACPI CPU> on acpi0
cpu4: <ACPI CPU> on acpi0
cpu5: <ACPI CPU> on acpi0
cpu6: <ACPI CPU> on acpi0
cpu7: <ACPI CPU> on acpi0
hpet0: <High Precision Event Timer> iomem 0xfed00000-0xfed003ff on acpi0
Timecounter "HPET" frequency 14318180 Hz quality 950
Event timer "HPET" frequency 14318180 Hz quality 550
atrtc0: <AT realtime clock> port 0x70-0x77 irq 8 on acpi0
atrtc0: Warning: Couldn't map I/O.
Event timer "RTC" frequency 32768 Hz quality 0
attimer0: <AT timer> port 0x40-0x43,0x50-0x53 irq 0 on acpi0
Timecounter "i8254" frequency 1193182 Hz quality 0
Event timer "i8254" frequency 1193182 Hz quality 100
Timecounter "ACPI-fast" frequency 3579545 Hz quality 900
acpi_timer0: <24-bit timer at 3.579545MHz> port 0x408-0x40b on acpi0
pcib0: <ACPI Host-PCI bridge> port 0xcf8-0xcff on acpi0
pcib0: _OSC returned error 0x10
pci0: <ACPI PCI bus> on pcib0
pcib1: <ACPI PCI-PCI bridge> irq 16 at device 1.0 on pci0
pci1: <ACPI PCI bus> on pcib1
ix0: <Intel(R) PRO/10GbE PCI-Express Network Driver, Version - 3.1.13-k> port 0xe020-0xe03f mem 0xf7d20000-0xf7d3ffff,0xf7d44000-0xf7d47fff irq 16 at device 0.0 on pci1
ix0: Using MSIX interrupts with 9 vectors
ix0: Ethernet address: 08:00:8a:00:10:e0
ix0: PCI Express Bus: Speed 5.0GT/s Width x8
ix0: netmap queues/slots: TX 8/2048, RX 8/2048
ix1: <Intel(R) PRO/10GbE PCI-Express Network Driver, Version - 3.1.13-k> port 0xe000-0xe01f mem 0xf7d00000-0xf7d1ffff,0xf7d40000-0xf7d43fff irq 17 at device 0.1 on pci1
ix1: Using MSIX interrupts with 9 vectors
ix1: Ethernet address: 08:00:8a:00:10:e1
ix1: PCI Express Bus: Speed 5.0GT/s Width x8
ix1: netmap queues/slots: TX 8/2048, RX 8/2048
vgapci0: <VGA-compatible display> port 0xf000-0xf03f mem 0xf7400000-0xf77fffff,0xe0000000-0xefffffff irq 16 at device 2.0 on pci0
vgapci0: Boot video device
pcib2: <ACPI PCI-PCI bridge> irq 19 at device 6.0 on pci0
pci2: <ACPI PCI bus> on pcib2
igb0: <Intel(R) PRO/1000 Network Connection, Version - 2.5.3-k> port 0xd020-0xd03f mem 0xf7a80000-0xf7afffff,0xf7b04000-0xf7b07fff irq 19 at device 0.0 on pci2
igb0: Using MSIX interrupts with 9 vectors
igb0: Ethernet address: 08:00:8a:00:00:82
igb0: Bound queue 0 to cpu 0
igb0: Bound queue 1 to cpu 1
igb0: Bound queue 2 to cpu 2
igb0: Bound queue 3 to cpu 3
igb0: Bound queue 4 to cpu 4
igb0: Bound queue 5 to cpu 5
igb0: Bound queue 6 to cpu 6
igb0: Bound queue 7 to cpu 7
igb0: netmap queues/slots: TX 8/1024, RX 8/1024
igb1: <Intel(R) PRO/1000 Network Connection, Version - 2.5.3-k> port 0xd000-0xd01f mem 0xf7a00000-0xf7a7ffff,0xf7b00000-0xf7b03fff irq 16 at device 0.1 on pci2
igb1: Using MSIX interrupts with 9 vectors
igb1: Ethernet address: 08:00:8a:00:00:83
igb1: Bound queue 0 to cpu 0
igb1: Bound queue 1 to cpu 1
igb1: Bound queue 2 to cpu 2
igb1: Bound queue 3 to cpu 3
igb1: Bound queue 4 to cpu 4
igb1: Bound queue 5 to cpu 5
igb1: Bound queue 6 to cpu 6
igb1: Bound queue 7 to cpu 7
igb1: netmap queues/slots: TX 8/1024, RX 8/1024
pci0: <simple comms> at device 22.0 (no driver attached)
ehci0: <Intel Cougar Point USB 2.0 controller> mem 0xf7e04000-0xf7e043ff irq 16 at device 26.0 on pci0
usbus0: EHCI version 1.0
usbus0 on ehci0
usbus0: 480Mbps High Speed USB v2.0
pcib3: <ACPI PCI-PCI bridge> irq 16 at device 28.0 on pci0
pci3: <ACPI PCI bus> on pcib3
igb2: <Intel(R) PRO/1000 Network Connection, Version - 2.5.3-k> port 0xc020-0xc03f mem 0xf7880000-0xf78fffff,0xf7904000-0xf7907fff irq 16 at device 0.0 on pci3
igb2: Using MSIX interrupts with 9 vectors
igb2: Ethernet address: 08:00:8a:00:00:80
igb2: Bound queue 0 to cpu 0
igb2: Bound queue 1 to cpu 1
igb2: Bound queue 2 to cpu 2
igb2: Bound queue 3 to cpu 3
igb2: Bound queue 4 to cpu 4
igb2: Bound queue 5 to cpu 5
igb2: Bound queue 6 to cpu 6
igb2: Bound queue 7 to cpu 7
igb2: netmap queues/slots: TX 8/1024, RX 8/1024
igb3: <Intel(R) PRO/1000 Network Connection, Version - 2.5.3-k> port 0xc000-0xc01f mem 0xf7800000-0xf787ffff,0xf7900000-0xf7903fff irq 17 at device 0.1 on pci3
igb3: Using MSIX interrupts with 9 vectors
igb3: Ethernet address: 08:00:8a:00:00:81
igb3: Bound queue 0 to cpu 0
igb3: Bound queue 1 to cpu 1
igb3: Bound queue 2 to cpu 2
igb3: Bound queue 3 to cpu 3
igb3: Bound queue 4 to cpu 4
igb3: Bound queue 5 to cpu 5
igb3: Bound queue 6 to cpu 6
igb3: Bound queue 7 to cpu 7
igb3: netmap queues/slots: TX 8/1024, RX 8/1024
ehci1: <Intel Cougar Point USB 2.0 controller> mem 0xf7e03000-0xf7e033ff irq 23 at device 29.0 on pci0
usbus1: EHCI version 1.0
usbus1 on ehci1
usbus1: 480Mbps High Speed USB v2.0
pcib4: <ACPI PCI-PCI bridge> at device 30.0 on pci0
pci4: <ACPI PCI bus> on pcib4
re0: <RealTek 8169SC/8110SC Single-chip Gigabit Ethernet> port 0xb000-0xb0ff mem 0xf7c20000-0xf7c200ff irq 16 at device 0.0 on pci4
re0: Chip rev. 0x18000000
re0: MAC rev. 0x00000000
miibus0: <MII bus> on re0
rgephy0: <RTL8169S/8110S/8211 1000BASE-T media interface> PHY 1 on miibus0
rgephy0:  none, 10baseT, 10baseT-FDX, 10baseT-FDX-flow, 100baseTX, 100baseTX-FDX, 100baseTX-FDX-flow, 1000baseT, 1000baseT-master, 1000baseT-FDX, 1000baseT-FDX-master, 1000baseT-FDX-flow, 1000baseT-FDX-flow-master, auto, auto-flow
re0: Using defaults for TSO: 65518/35/2048
re0: Ethernet address: 00:90:0b:29:07:b2
re0: netmap queues/slots: TX 1/256, RX 1/256
isab0: <PCI-ISA bridge> at device 31.0 on pci0
isa0: <ISA bus> on isab0
ahci0: <Intel Cougar Point AHCI SATA controller> port 0xf0b0-0xf0b7,0xf0a0-0xf0a3,0xf090-0xf097,0xf080-0xf083,0xf060-0xf07f mem 0xf7e02000-0xf7e027ff irq 19 at device 31.2 on pci0
ahci0: AHCI v1.30 with 6 6Gbps ports, Port Multiplier not supported
ahcich2: <AHCI channel> at channel 2 on ahci0
ahciem0: <AHCI enclosure management bridge> on ahci0
acpi_button0: <Power Button> on acpi0
acpi_tz0: <Thermal Zone> on acpi0
acpi_tz1: <Thermal Zone> on acpi0
ppc1: <Parallel port> port 0x378-0x37f irq 5 on acpi0
ppc1: Generic chipset (NIBBLE-only) in COMPATIBLE mode
ppbus0: <Parallel port bus> on ppc1
lpt0: <Printer> on ppbus0
lpt0: Interrupt-driven port
ppi0: <Parallel I/O> on ppbus0
uart0: <16550 or compatible> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0
atkbdc0: <Keyboard controller (i8042)> at port 0x60,0x64 on isa0
atkbd0: <AT Keyboard> irq 1 on atkbdc0
kbd0 at atkbd0
atkbd0: [GIANT-LOCKED]
ppc0: cannot reserve I/O port range
est0: <Enhanced SpeedStep Frequency Control> on cpu0
est1: <Enhanced SpeedStep Frequency Control> on cpu1
est2: <Enhanced SpeedStep Frequency Control> on cpu2
est3: <Enhanced SpeedStep Frequency Control> on cpu3
est4: <Enhanced SpeedStep Frequency Control> on cpu4
est5: <Enhanced SpeedStep Frequency Control> on cpu5
est6: <Enhanced SpeedStep Frequency Control> on cpu6
est7: <Enhanced SpeedStep Frequency Control> on cpu7
Timecounters tick every 1.000 msec
nvme cam probe device init
ugen1.1: <Intel EHCI root HUB> at usbus1
ugen0.1: <Intel EHCI root HUB> at usbus0
uhub0: <Intel EHCI root HUB, class 9/0, rev 2.00/1.00, addr 1> on usbus1
uhub1: <Intel EHCI root HUB, class 9/0, rev 2.00/1.00, addr 1> on usbus0
ses0 at ahciem0 bus 0 scbus1 target 0 lun 0
ses0: <AHCI SGPIO Enclosure 1.00 0001> SEMB S-E-S 2.00 device
ses0: SEMB SES Device
ada0 at ahcich2 bus 0 scbus0 target 0 lun 0
ada0: <TS16GCF160 20111219> ATA SATA 1.x device
ada0: Serial Number A2323035180354000072
ada0: 150.000MB/s transfers (SATA 1.x, UDMA2, PIO 512bytes)
ada0: 15280MB (31293440 512 byte sectors)
Trying to mount root from ufs:/dev/ada0p3 [rw]...
Setting hostuuid: c84483e3-9171-11e6-a133-08008a0010e0.
Setting hostid: 0x224a41c7.
Starting file system checks:
/dev/ada0p3: FILE SYSTEM CLEAN; SKIPPING CHECKS
/dev/ada0p3: clean, 519318 free (6518 frags, 64100 blocks, 0.2% fragmentation)
uhub0: 2 ports with 2 removable, self powered
uhub1: 2 ports with 2 removable, self powered
Mounting local filesystems:.
ELF ldconfig path: /lib /usr/lib /usr/lib/compat /usr/local/lib /usr/local/lib/gcc49 /usr/local/lib/perl5/5.20/mach/CORE /etc/ld-elf.so.conf
32-bit compatibility ldconfig path: /usr/lib32
Setting hostname: fbl.
Setting up harvesting: [UMA],[FS_ATIME],SWI,INTERRUPT,NET_NG,NET_ETHER,NET_TUN,MOUSE,KEYBOARD,ATTACH,CACHED
Feeding entropy: 
ugen0.2: <vendor 0x8087 product 0x0024> at usbus0
uhub2 on uhub1
uhub2: <vendor 0x8087 product 0x0024, class 9/0, rev 2.00/0.00, addr 2> on usbus0
ugen1.2: <vendor 0x8087 product 0x0024> at usbus1
uhub3 on uhub0
uhub3: <vendor 0x8087 product 0x0024, class 9/0, rev 2.00/0.00, addr 2> on usbus1
.
uhub2: 6 ports with 6 removable, self powered
uhub3: 8 ports with 8 removable, self powered
ugen1.3: <vendor 0x0d3d USBPS2> at usbus1
ukbd0 on uhub3
ukbd0: <EP1> on usbus1
kbd2 at ukbd0
igb2: link state changed to UP
Starting Network: lo0 ix0 ix1 igb0 igb1 igb2 igb3 re0.
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384
	options=600003<RXCSUM,TXCSUM,RXCSUM_IPV6,TXCSUM_IPV6>
	inet6 ::1 prefixlen 128 
	inet6 fe80::1%lo0 prefixlen 64 scopeid 0x8 
	inet 127.0.0.1 netmask 0xff000000 
	nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
	groups: lo 
ix0: flags=8802<BROADCAST,SIMPLEX,MULTICAST> metric 0 mtu 1500
	options=e407bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
	ether 08:00:8a:00:10:e0
	hwaddr 08:00:8a:00:10:e0
	nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
	media: Ethernet autoselect
	status: no carrier
ix1: flags=8802<BROADCAST,SIMPLEX,MULTICAST> metric 0 mtu 1500
	options=e407bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
	ether 08:00:8a:00:10:e1
	hwaddr 08:00:8a:00:10:e1
	nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
	media: Ethernet autoselect
	status: no carrier
igb0: flags=8c02<BROADCAST,OACTIVE,SIMPLEX,MULTICAST> metric 0 mtu 1500
	options=6403bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
	ether 08:00:8a:00:00:82
	hwaddr 08:00:8a:00:00:82
	nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
	media: Ethernet autoselect
	status: no carrier
igb1: flags=8c02<BROADCAST,OACTIVE,SIMPLEX,MULTICAST> metric 0 mtu 1500
	options=6403bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
	ether 08:00:8a:00:00:83
	hwaddr 08:00:8a:00:00:83
	nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
	media: Ethernet autoselect
	status: no carrier
igb2: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
	options=6403bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
	ether 08:00:8a:00:00:80
	hwaddr 08:00:8a:00:00:80
	inet 206.210.221.193 netmask 0xffffff00 broadcast 206.210.221.255 
	nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
	media: Ethernet autoselect (100baseTX <full-duplex>)
	status: active
igb3: flags=8c02<BROADCAST,OACTIVE,SIMPLEX,MULTICAST> metric 0 mtu 1500
	options=6403bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
	ether 08:00:8a:00:00:81
	hwaddr 08:00:8a:00:00:81
	nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
	media: Ethernet autoselect
	status: no carrier
re0: flags=8802<BROADCAST,SIMPLEX,MULTICAST> metric 0 mtu 1500
	options=8209b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,WOL_MAGIC,LINKSTATE>
	ether 00:90:0b:29:07:b2
	hwaddr 00:90:0b:29:07:b2
	nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
re0: link state changed to DOWN
	media: Ethernet autoselect (10baseT/UTP <half-duplex>)
	status: no carrier
Starting devd.
Starting Network: ix0.
ix0: flags=8802<BROADCAST,SIMPLEX,MULTICAST> metric 0 mtu 1500
	options=e407bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
	ether 08:00:8a:00:10:e0
	hwaddr 08:00:8a:00:10:e0
	nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
	media: Ethernet autoselect
	status: no carrier
Starting Network: ix1.
ix1: flags=8802<BROADCAST,SIMPLEX,MULTICAST> metric 0 mtu 1500
	options=e407bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
	ether 08:00:8a:00:10:e1
	hwaddr 08:00:8a:00:10:e1
	nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
	media: Ethernet autoselect
	status: no carrier
Starting Network: igb0.
igb0: flags=8c02<BROADCAST,OACTIVE,SIMPLEX,MULTICAST> metric 0 mtu 1500
	options=6403bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
	ether 08:00:8a:00:00:82
	hwaddr 08:00:8a:00:00:82
	nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
	media: Ethernet autoselect
	status: no carrier
Starting Network: igb1.
igb1: flags=8c02<BROADCAST,OACTIVE,SIMPLEX,MULTICAST> metric 0 mtu 1500
	options=6403bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
	ether 08:00:8a:00:00:83
	hwaddr 08:00:8a:00:00:83
	nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
	media: Ethernet autoselect
	status: no carrier
Starting Network: igb3.
igb3: flags=8c02<BROADCAST,OACTIVE,SIMPLEX,MULTICAST> metric 0 mtu 1500
	options=6403bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
	ether 08:00:8a:00:00:81
	hwaddr 08:00:8a:00:00:81
	nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
	media: Ethernet autoselect
	status: no carrier
Starting Network: re0.
re0: flags=8802<BROADCAST,SIMPLEX,MULTICAST> metric 0 mtu 1500
	options=8209b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,WOL_MAGIC,LINKSTATE>
	ether 00:90:0b:29:07:b2
	hwaddr 00:90:0b:29:07:b2
	nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
	media: Ethernet autoselect (10baseT/UTP <half-duplex>)
	status: no carrier
add host 127.0.0.1: gateway lo0 fib 0: route already in table
add net default: gateway 206.210.221.1
add host ::1: gateway lo0 fib 0: route already in table
add net fe80::: gateway ::1
add net ff02::: gateway ::1
add net ::ffff:0.0.0.0: gateway ::1
add net ::0.0.0.0: gateway ::1
Generating host.conf.
Mounting NFS filesystems:.
ELF ldconfig path: /lib /usr/lib /usr/lib/compat /usr/local/lib /usr/local/lib/gcc49 /usr/local/lib/perl5/5.20/mach/CORE /etc/ld-elf.so.conf
32-bit compatibility ldconfig path: /usr/lib32
Creating and/or trimming log files.
Starting syslogd.

Sources:

many.c:

#include <pthread.h>
#include <pthread_np.h>
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/sysctl.h>

#define MAX_THREADS 32

void *run (void *data) {
   /* The CPU index arrives by value in the thread argument, so the
      threads do not race on a shared counter. */
   int cpu = (int)(uintptr_t)data;
   cpuset_t cpuset;
   struct sched_param scheduleParam;

   CPU_ZERO(&cpuset);
   CPU_SET(cpu, &cpuset);
   int err = pthread_setaffinity_np(pthread_self(), sizeof(cpuset_t), &cpuset);
   printf("pthread_setaffinity_np: %d\n", err);
   sched_getparam(0, &scheduleParam);
   scheduleParam.sched_priority = sched_get_priority_min(SCHED_FIFO);
   sched_setscheduler(0, SCHED_FIFO, &scheduleParam);
   while (1) ;
}

int main (void) {
   pthread_t tp[MAX_THREADS];
   uint32_t nCpu = 0;
   size_t cpuLen = sizeof nCpu;

   if (sysctlbyname("hw.ncpu", &nCpu, &cpuLen, NULL, 0) >= 0 && cpuLen == sizeof nCpu) {
      printf("%u CPUs available, starting realtime processes on %u of them\n", nCpu, nCpu - 1);
      /* Pin one spinning realtime thread to each CPU except the last. */
      for (uint32_t ix = 0; ix + 1 < nCpu && ix < MAX_THREADS; ++ix) {
         pthread_create(&tp[ix], NULL, run, (void *)(uintptr_t)ix);
      }

      while (1) sleep(1);
   }

   return (0);
}

one.c:

#include <pthread.h>
#include <pthread_np.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <net/if.h>
#include <net/if_mib.h>
#include <sys/sysctl.h>

int getIfData (struct ifmibdata *mib, char *name) {
   size_t len;
   int idx, pkt[6];

   idx = if_nametoindex(name);
   if (idx == 0) {
      return -1;
   }
   memset(mib, 0, sizeof(struct ifmibdata));
   pkt[0] = CTL_NET;
   pkt[1] = PF_LINK;
   pkt[2] = NETLINK_GENERIC;
   pkt[3] = IFMIB_IFDATA;
   pkt[4] = idx;
   pkt[5] = IFDATA_GENERAL;
   len = sizeof(struct ifmibdata);
   if (sysctl(pkt, 6, mib, &len, NULL, 0) == -1) {
      return -1;
   }
   return idx;
}

uint64_t nanoseconds (struct timespec* ts) {
    return ts->tv_sec * (uint64_t)1000000000L + ts->tv_nsec;
}

int main (int argc, char *argv[]) {
   if (argc < 2) {
      printf("Please supply number of seconds to run\n");
      exit(1);
   }

   double secs = atof(argv[1]);

   uint32_t nCpu = 0;
   size_t cpuLen = sizeof nCpu;

   if (sysctlbyname("hw.ncpu", &nCpu, &cpuLen, NULL, 0) >= 0 && cpuLen == sizeof nCpu) {
      cpuset_t cpuset;
      struct sched_param scheduleParam;

      CPU_ZERO(&cpuset);
      CPU_SET(nCpu - 1, &cpuset);
      pthread_setaffinity_np(pthread_self(), sizeof(cpuset_t), &cpuset);
      sched_getparam(0, &scheduleParam);
      scheduleParam.sched_priority = sched_get_priority_min(SCHED_FIFO);
      sched_setscheduler(0, SCHED_FIFO, &scheduleParam);

      printf("Running for %s second(s) on cpu %d\n", argv[1], nCpu - 1);

      struct timespec tsNow, tsEnd;
      tsEnd.tv_sec = (int)secs;
      tsEnd.tv_nsec = (double)(secs - tsEnd.tv_sec) * (uint64_t)1000000000L; 
      clock_gettime(CLOCK_REALTIME, &tsNow);
      uint64_t start = nanoseconds(&tsNow);
      uint64_t stop = start + nanoseconds(&tsEnd);
      while (clock_gettime(CLOCK_REALTIME, &tsNow) == 0 && nanoseconds(&tsNow) < stop) {
         struct ifmibdata mib;

         getIfData(&mib, "lo0"); 
      }
   }
   
   return (0);
}

compile:

cc one.c -lthr -o one
cc many.c -lthr -o many 

runOne:

#!/bin/ksh

while true; do ./one 1.1; done;

runIfconfig:

#!/bin/ksh

while true; do ifconfig; done;
Comment 1 Conrad Meyer freebsd_committer freebsd_triage 2018-05-08 18:38:38 UTC
Don, I've CC'ed you as you have previously had some interest in ULE.  Feel free to un-CC if you're not interested :-).
Comment 2 Don Lewis freebsd_committer freebsd_triage 2018-05-08 22:37:05 UTC
Thread stealing is only done in the CPU idle loop (or more recently, when ULE detects that the CPU is about to go idle).  Since none of the CPUs is idle in this case, stealing will not happen.

Setting kern.sched.steal_thresh=1 won't make a difference since the stuck thread raises the load to 2 on the CPU where it is assigned.

In theory sched_balance() should eventually move the stuck thread to the CPU that is running "runOne", but it runs infrequently and if the destination CPU is chosen totally randomly, that could take a while.  There could be an issue in how it chooses the destination CPU that prevents it from ever choosing the CPU where "runOne" is running ...
Comment 3 Mit Matelske 2018-05-09 14:03:51 UTC
Don-

My problem is that once it is "stuck" I can stop the runOne script and free up the remaining CPU and the ifconfig process will remain on the busy CPU.  The machine is sitting 12.5% idle but the process never migrates.  The "top" snapshot I included shows this.
Comment 4 Mark Johnston freebsd_committer freebsd_triage 2018-05-09 15:13:23 UTC
Created attachment 193209 [details]
fix sched_balance

(In reply to Don Lewis from comment #2)
Hmm, I think sched_balance() is actually busted: the initial call is performed while smp_started == 0, so balance_ticks is never set to a non-zero value.

The attached patch seems to address that at least.
Comment 5 Don Lewis freebsd_committer freebsd_triage 2018-05-09 16:32:39 UTC
(In reply to Mark Johnston from comment #4)
Looks reasonable to me.
Comment 6 Ivan Klymenko 2018-05-09 16:42:27 UTC
(In reply to Mark Johnston from comment #4)
So is this caused by the issue in this PR?
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=223914
Comment 7 Mark Johnston freebsd_committer freebsd_triage 2018-05-09 19:17:52 UTC
(In reply to Ivan Klymenko from comment #6)
Could be, I'm hoping that the submitter tries applying the patch to see if the
problem persists.

I didn't realize a PR already existed for sched_balance(). I'll work on getting
it fixed.
Comment 8 Mit Matelske 2018-05-09 20:00:57 UTC
Sorry for the delay!  

I tried the patch out and sched_balance_group now runs, but I can still get a process stuck in the RUN state on a busy CPU.  FWIW, it doesn't seem like that mechanism ever moves processes.  All my calls to tdq_move come from the tdq_idled loop.

I'm debugging why the process doesn't get moved now.  Weird thing is other processes are successfully getting moved from the busy CPUs from the tdq_idled loop.  Just not the stuck process.

Don't know if this helps, but it seems like if I start another process, something triggers and it will move the stuck process and everything will run fine for a while.  Any clues from that?
Comment 9 Don Lewis freebsd_committer freebsd_triage 2018-05-09 20:06:03 UTC
(In reply to Mit Matelske from comment #3)
Hmm, the idle CPU should then be able to steal the stuck process.  Maybe it is stuck in some state where migration is forbidden ...
Comment 10 Mit Matelske 2018-05-10 15:05:33 UTC
Looks like td_pinned gets set so the thread can't move from the busy processor.  Not sure why that occasionally happens?
Comment 11 Mark Johnston freebsd_committer freebsd_triage 2018-05-10 15:09:34 UTC
(In reply to Mit Matelske from comment #10)
The kernel stack of the thread will provide a clue, if you're able to obtain it. procstat -kk <pid> can grab the kernel stack of a running thread.
Comment 12 Mit Matelske 2018-05-10 15:53:26 UTC
# procstat -kk 1557
  PID    TID COMM                TDNAME              KSTACK                       
 1557 100825 ifconfig            -                   mi_switch+0xe5 critical_exit+0x7a ipi_bitmap_handler+0x79 ipi_intr_bitmap_handler_u+0x9c userland_sysctl+0x157 sys___sysctl+0x5f amd64_syscall+0x6c4 fast_syscall_common+0x106
Comment 13 Mark Johnston freebsd_committer freebsd_triage 2018-05-15 15:48:50 UTC
I'm not able to reproduce this on HEAD. I'll try setting up an 11.1 VM for
testing.
Comment 14 Mark Johnston freebsd_committer freebsd_triage 2018-05-15 17:01:22 UTC
I'm a bit confused by the repro. I presume the expected behaviour is that the ifconfig thread should be migrated to the CPU running "one". But in your sample "top" output, the last CPU is idle even though you're running "one" in a loop there - how can that be?
Comment 15 Mit Matelske 2018-05-15 17:58:08 UTC
In the top output, I had stopped the "one" program to show that even if the processor goes idle the ifconfig never gets migrated to 7.  Sorry for the confusion.
Comment 16 Ed Maste freebsd_committer freebsd_triage 2021-12-01 14:18:30 UTC
Is this reproducible on 12.2 or later?
Comment 17 Don Lewis freebsd_committer freebsd_triage 2021-12-02 00:17:13 UTC
I think td_pinned is the problem.  The thread can't move until that flag is cleared, and that can't happen until it is able to get some CPU cycles to get through the section of code that td_pinned is protecting.

I can see how this could happen in the general case, but in this particular case how does the ifconfig thread manage to make enough progress to set td_pinned while competing with the hard realtime thread that is hogging the cpu?