Bug 219216 - sched_bind() blocks if the entropy pool is starved
Summary: sched_bind() blocks if the entropy pool is starved
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 11.0-STABLE
Hardware: amd64 Any
Importance: --- Affects Many People
Assignee: freebsd-bugs (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-05-11 09:24 UTC by Dominic Fandrey
Modified: 2019-05-19 16:14 UTC
CC: 7 users

See Also:


Attachments
shell script to reproduce the problem (773 bytes, text/plain)
2017-05-11 17:19 UTC, Dominic Fandrey

Description Dominic Fandrey 2017-05-11 09:24:07 UTC
I recently updated my 11-stable system:

FreeBSD AprilRyan.norad 11.0-STABLE FreeBSD 11.0-STABLE #3 r318143: Wed May 10 17:56:12 CEST 2017     root@AprilRyan.norad:/usr/obj/S403/amd64/usr/src/sys/S403  amd64

I immediately noticed that rand_harvestq is now permanently running, consuming a small but significant amount of CPU time. To investigate, I started `dd bs=1m < /dev/random > /dev/null`

Coincidentally I was running a release candidate of powerd++ in foreground mode with temperature throttling at the same time:
https://github.com/lonkamikaze/powerdxx/releases/tag/0.3.0-rc1

The following happened when I started the `dd`:

- Two cores were fully consumed, one by dd, one by random_harvestq
- powerd++ started to stutter and then completely freeze

After I killed the `dd` process the following happened:

- random_harvestq continued to consume an entire core for a long time
- powerd++ remained frozen

By erratically swiping my fingers over the touch screen I got powerd++ to resume operation in a stuttering fashion. It took several minutes before the system acted normally again.

The two surprising conclusions so far:

- /dev/random blocks
- powerd++ consumes randomness

So I investigated and found that it is access to the following sysctls that blocks:

dev.cpu.0.temperature
dev.cpu.1.temperature
dev.cpu.2.temperature
dev.cpu.3.temperature

Unloading the coretemp module in the blocked state resulted in a kernel panic that told me coretemp was stuck in coretemp_get_val_sysctl().

With an unhealthy dose of uprintf() calls I figured out that the block happens in coretemp_get_thermal_msr() (see /usr/src/sys/dev/coretemp/coretemp.c:306).

The problem is the following code:

311         thread_lock(curthread);
312         sched_bind(curthread, cpu);
313         thread_unlock(curthread);

The call to sched_bind() blocks when the entropy pool is starved (I suspect only if the thread is not currently running on the right core anyway).
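For readers unfamiliar with the pattern: sched_bind() pins the current kernel thread to one CPU so the subsequent MSR read hits that core, and sched_unbind() releases the pin afterwards. A rough userland analogue can be sketched in Python (an illustration only, not the kernel code; it assumes a platform that exposes `os.sched_setaffinity`, e.g. Linux):

```python
import os

def run_on_cpu(cpu, func, *args):
    """Run func pinned to one CPU, then restore the previous mask.

    Loosely mirrors the thread_lock/sched_bind ... sched_unbind
    pattern in coretemp_get_thermal_msr(); names here are
    illustrative, not taken from the kernel source.
    """
    old = os.sched_getaffinity(0)        # remember the current CPU mask
    os.sched_setaffinity(0, {cpu})       # "bind": restrict to one CPU
    try:
        return func(*args)               # e.g. read a per-core counter
    finally:
        os.sched_setaffinity(0, old)     # "unbind": restore the mask
```

The key point of the bug report is that the kernel-side equivalent of the "bind" step should never have to wait on the random subsystem.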

Because I cannot fiddle with and replace sched_ule at runtime, I have decided this is as far as I'm digging.

I think that the scheduler depending on entropy is very worrying, not to say a bug, especially if randomness is a scarce resource. I got the system to panic many times during this investigation, mostly because locks were held too long. E.g.:

spin lock 0xffffffff81c8e380 (sched lock 3) held by 0xfffff80028b19560 (tid 100196) too long
spin lock 0xffffffff81c8e380 (sched lock 3) held by 0xfffff80028b19560 (tid 100196) too long
panic: spin lock held too long
cpuid = 0
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe045154f850
vpanic() at vpanic+0x186/frame 0xfffffe045154f8d0
panic() at panic+0x43/frame 0xfffffe045154f930
_mtx_lock_spin_cookie() at _mtx_lock_spin_cookie+0x311/frame 0xfffffe045154f9a0
sched_idletd() at sched_idletd+0x3aa/frame 0xfffffe045154fa70
fork_exit() at fork_exit+0x85/frame 0xfffffe045154fab0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe045154fab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
KDB: enter: panic
Uptime: 4m31s


I also find it questionable that entropy harvesting continues after initially seeding the RNG, making /dev/random susceptible to entropy poisoning by a malicious process that feeds bad entropy into /dev/random.
Comment 1 Konstantin Belousov 2017-05-11 15:05:43 UTC
The description of the problem is not very useful.

When the panic 'spinlock held too long' occurs, presumably for the sched locks, catch exact kernel messages (same as you did) and also the backtrace for the thread which is reported as the owner of the spinlock.  I will see if more information is needed after that.

Note that if CPU speed is artificially reduced, then spinlock sections might take much longer time to execute, which would, in turn, trigger the detector for the stuck spinlock.  I do not claim that this is your case, but it could be.

sched_bind() does not depend on rnd.
Comment 2 Dominic Fandrey 2017-05-11 16:31:53 UTC
The panic is not the issue I am reporting. The problem is that sched_bind() does not return while I am reading from /dev/random.
Comment 3 Dominic Fandrey 2017-05-11 17:19:45 UTC
Created attachment 182516 [details]
shell script to reproduce the problem

This script reproduces PR 219216.

You need to be on an Intel platform with coretemp loaded.

Every .5 seconds it prints all core temperatures. Every 5 seconds
it consumes 1 GiB from /dev/random.

If you are affected you will see access to the dev.cpu.%d.temperature
sysctls start to stutter and freeze.

To terminate the script press CTRL+C and move your mouse to feed
some entropy into the RNG.
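The attached script's logic can be sketched as follows (a Python re-creation based only on the description above; the real attachment is a 773-byte shell script and may differ in detail):

```python
import itertools
import subprocess
import time

TEMP_INTERVAL = 0.5   # seconds between temperature polls
CONSUME_EVERY = 10    # every 10th poll, i.e. every 5 seconds

def should_consume(tick):
    """Pure scheduling helper: consume entropy on every 10th poll
    after the first."""
    return tick > 0 and tick % CONSUME_EVERY == 0

def poll_temperatures(ncpus=4):
    """Print dev.cpu.N.temperature via sysctl(8); assumes FreeBSD
    with coretemp loaded."""
    names = ["dev.cpu.%d.temperature" % i for i in range(ncpus)]
    subprocess.call(["sysctl"] + names)

def consume_random():
    """Read 1 GiB from /dev/random in the background, like the dd
    in the original report (bs=1m is FreeBSD dd syntax)."""
    subprocess.Popen(["dd", "if=/dev/random", "of=/dev/null",
                      "bs=1m", "count=1024"])

if __name__ == "__main__":
    for tick in itertools.count():
        poll_temperatures()
        if should_consume(tick):
            consume_random()
        time.sleep(TEMP_INTERVAL)
```

On an affected system, the sysctl output should begin to stutter shortly after each consume_random() kicks in.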
Comment 4 Konstantin Belousov 2017-05-11 17:36:00 UTC
Peter, could you please try to reproduce this?
Comment 5 Peter Holm 2017-05-12 06:31:39 UTC
(In reply to Konstantin Belousov from comment #4)
Running the script, or powerd++ together with a dd from /dev/random, does not trigger
any problems on my setup.

Here's some of the output from powerd++:

power: unknown, load:  129 MHz,  40 C, cpu0.freq: 1200 MHz, wanted:  344 MHz,  81 C
power: unknown, load:  137 MHz,  40 C, cpu0.freq: 1200 MHz, wanted:  365 MHz,  81 C
power: unknown, load:   94 MHz,  40 C, cpu0.freq: 1200 MHz, wanted:  250 MHz,  81 C
power: unknown, load:   71 MHz,  40 C, cpu0.freq: 1200 MHz, wanted:  189 MHz,  81 C
power: unknown, load:   47 MHz,  40 C, cpu0.freq: 1200 MHz, wanted:  125 MHz,  81 C
power: unknown, load:    4 MHz,  40 C, cpu0.freq: 1200 MHz, wanted:   10 MHz,  81 C
power: unknown, load:    4 MHz,  40 C, cpu0.freq: 1200 MHz, wanted:   10 MHz,  81 C
power: unknown, load:   17 MHz,  39 C, cpu0.freq: 1200 MHz, wanted:   45 MHz,  81 C
power: unknown, load:   17 MHz,  40 C, cpu0.freq: 1200 MHz, wanted:   45 MHz,  81 C
power: unknown, load:   17 MHz,  40 C, cpu0.freq: 1200 MHz, wanted:   45 MHz,  81 C
power: unknown, load:   17 MHz,  41 C, cpu0.freq: 1200 MHz, wanted:   45 MHz,  81 C
power: unknown, load:    5 MHz,  41 C, cpu0.freq: 1200 MHz, wanted:   13 MHz,  81 C
power: unknown, load:    0 MHz,  41 C, cpu0.freq: 1200 MHz, wanted:    0 MHz,  81 C

My test box:
$ sysctl dev.cpu.0
dev.cpu.0.temperature: 35.0C
dev.cpu.0.coretemp.throttle_log: 0
dev.cpu.0.coretemp.tjmax: 91.0C
dev.cpu.0.coretemp.resolution: 1
dev.cpu.0.coretemp.delta: 56
dev.cpu.0.cx_method: C1/mwait/hwc C2/mwait/hwc/bma
dev.cpu.0.cx_usage_counters: 0 60
dev.cpu.0.cx_usage: 0.00% 100.00% last 13171us
dev.cpu.0.cx_lowest: C2
dev.cpu.0.cx_supported: C1/1/1 C2/2/41
dev.cpu.0.freq_levels: 2001/95000 2000/95000 1900/89224 1800/83561 1700/78011 1600/72573 1500/67246 1400/62042 1300/56932 1200/51931
dev.cpu.0.freq: 2001
dev.cpu.0.%domain: 0
dev.cpu.0.%parent: acpi0
dev.cpu.0.%pnpinfo: _HID=ACPI0007 _UID=0
dev.cpu.0.%location: handle=\_SB_.SCK0.CP00 _PXM=0
dev.cpu.0.%driver: cpu
dev.cpu.0.%desc: ACPI CPU
$
Comment 6 Dominic Fandrey 2017-05-12 07:19:28 UTC
Maybe it's something Haswell specific?

CPU: Intel(R) Core(TM) i7-4500U CPU @ 1.80GHz (2394.51-MHz K8-class CPU)
Comment 7 Peter Holm 2017-05-12 08:13:46 UTC
(In reply to Dominic Fandrey from comment #6)

On my host you cannot throttle the CPU to less than 1200 MHz, but powerd++
seems very aggressive. What is the value of dev.cpu.0.freq_levels on your box?
Comment 8 Konstantin Belousov 2017-05-12 09:37:11 UTC
(In reply to Dominic Fandrey from comment #6)
I might not have been very clear the first time: can you reproduce your problem without the frequency adjustment code running in parallel?
Comment 9 Fabian Keil 2017-05-12 11:29:19 UTC
I can't reproduce the issue either. I'm using a kernel based on r318145:

CPU: Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz (2491.96-MHz K8-class CPU)

I let the script run through 1680 cycles without noticeable issues.

While dd keeps a core busy as expected, rand_harvestq stays below 3%
CPU use according to top. I use vanilla powerd(8), which changes the
CPU frequency between 800 and 2501 MHz.

The fact that rand_harvestq is busy doesn't necessarily indicate that
the entropy pool is starved.

You could try running:
sudo dtrace -n 'fbt:kernel:random_harvest_*:entry {@[probefunc, stack()] = count(); } tick-60sec {exit(0)}'
to see which random_harvest_* functions are called, from where and how often.

Additionally you could experiment with modifying the entropy sources with:
kern.random.harvest.mask
to see if it makes a difference.

Newly harvested entropy is mixed into the pool so adding more shouldn't
lower the "entropy quality" in the pool. Therefore it's not obvious to me how
the "poisoning" you mentioned would work.
Comment 10 Conrad Meyer 2019-05-19 16:11:41 UTC
rand_harvestq's CPU usage has been reduced significantly in the 12.x timeframe; previously it was attempting to obtain far more entropy, far more often, than it actually needed.  This was in part due to conservative estimates (r338293) and in part due to bugs in random read (r338292, r339492).

r287023, r334450 may also be relevant.

If you have a concurrent reader, rand_harvestq will fight it for the global Fortuna state mutex on head today (and spin while waiting).  You are somewhat saved by READ_RANDOM_UIO dropping the Fortuna lock every 4kB (PAGE_SIZE).  I have a revision proposed to reduce this conflict further:

https://people.freebsd.org/~cem/rand_harvestq.png
https://reviews.freebsd.org/D20313
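The effect of dropping the lock every PAGE_SIZE can be quantified with a trivial model (an illustration, not the kernel code): each lock release during a large read is a window in which rand_harvestq, or another reader, can take the global mutex.

```python
PAGE_SIZE = 4096  # bytes per chunk between lock drops

def lock_drop_windows(request_bytes, chunk=PAGE_SIZE):
    # Number of times a READ_RANDOM_UIO-style loop releases the global
    # Fortuna lock while serving one read request (ceiling division).
    # Without chunking there would be exactly one release, at the end.
    return -(-request_bytes // chunk)

# A single 1 GiB dd read yields 262144 windows for other lock waiters.
assert lock_drop_windows(1 << 30) == 262144
```

This is why the reporter is "somewhat saved" by the per-page lock drop: contention is bounded per page, not per request.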
Comment 11 Conrad Meyer 2019-05-19 16:14:27 UTC
(In reply to Fabian Keil from comment #9)
> While dd keeps a core busy as expected, rand_harvestq stays below 3%
> cpu use according to top.
> ...
> 
> The fact that rand_harvestq is busy doesn't necessarily indicate that
> the entropy pool is starved.

True.  On 12.0 through head, if you have concurrent random readers (especially 2+), rand_harvestq CPU time may just indicate it is spinning on the global Fortuna lock while a reader generates random data.  Some more detail in my earlier comment #10.