Bug 195882 - Local DoS from unprivileged user
Summary: Local DoS from unprivileged user
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 10.1-RELEASE
Hardware: Any Any
Importance: --- Affects Many People
Assignee: freebsd-bugs (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-12-11 08:31 UTC by Ivan Rozhuk
Modified: 2021-05-19 22:21 UTC
5 users

See Also:


Attachments
example code (1.99 KB, text/plain)
2014-12-11 08:31 UTC, Ivan Rozhuk

Description Ivan Rozhuk 2014-12-11 08:31:53 UTC
Created attachment 150465 [details]
example code

System does not free memory after munmap().


Before building, set:
	const char *file_name = (const char *)"/testvn.tmp";
	off_t file_size = (4 * 1024 * mb); /* Set to x2 RAM size. */

Replace the 4 with twice your host's RAM size in GiB.
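
For orientation (the attachment itself is not reproduced here), a minimal sketch of what such a test program might look like, assembled from the fragments quoted in this report; file_name, file_size, the mb constant, write_size, and MAP_NOCORE all appear in the report, while the 64 MB window size, the error handling, and the overall loop structure are assumptions:

    #include <sys/types.h>
    #include <sys/mman.h>

    #include <err.h>
    #include <fcntl.h>
    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>

    int
    main(void)
    {
        const off_t mb = (1024 * 1024);
        const char *file_name = (const char *)"/testvn.tmp";
        off_t file_size = (4 * 1024 * mb); /* Set to x2 RAM size. */
        size_t write_size = (size_t)(64 * mb); /* Window size: an assumption. */
        uint8_t *mem;
        off_t i;
        int fd;

        fd = open(file_name, (O_RDWR | O_CREAT), 0600);
        if (fd == -1)
            err(1, "open");
        if (ftruncate(fd, file_size) == -1)
            err(1, "ftruncate");

        /* Map the file one window at a time and dirty every page. */
        for (i = 0; (i * (off_t)write_size) < file_size; i++) {
            mem = mmap(NULL, write_size, (PROT_READ | PROT_WRITE),
                (MAP_SHARED | MAP_NOCORE), fd, (i * (off_t)write_size));
            if (mem == MAP_FAILED)
                err(1, "mmap");
            memset(mem, 0xff, write_size);
            /* The attachment apparently passes file_size here; see comment 2. */
            munmap(mem, write_size);
        }
        close(fd);

        return (0);
    }
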
Comment 1 Ivan Rozhuk 2014-12-11 08:38:17 UTC
Swap is off.
Comment 2 Mark Millard 2018-11-05 23:40:55 UTC
		mem = mmap(NULL, write_size, (PROT_READ | PROT_WRITE),
		    (MAP_SHARED | MAP_NOCORE), fd, (i * write_size));
. . .
		//msync(mem, file_size, MS_SYNC);
		//posix_madvise(mem, file_size, MADV_FREE);
		munmap(mem, file_size);

write_size is used as the len for mmap(), but file_size
as the len for munmap()?

Quoting the man page for munmap:

     The munmap() system call will fail if:

     [EINVAL]           The addr argument was not page aligned, the len
                        argument was zero or negative, or some part of the
                        region being unmapped is outside the valid address
                        range for a process.

As near as I can tell, the munmap calls were returning
EINVAL and possibly not doing the unmap at all.

A correct len would be needed for the munmap calls to be
guaranteed to unmap the region without leaving any pages
mapped.
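
(A quick way to confirm whether a given call is failing would be to check its return value and errno, e.g. a sketch:)

        if (munmap(mem, file_size) == -1)
            warn("munmap"); /* prints the errno, e.g. EINVAL */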

For the denial of service, the munmap could just be
commented out, like the msync and posix_madvise are.
Whether munmap frees RAM (or not) is a separate issue.

You would probably need distinct submittals for the
two issues if both really apply.
Comment 3 Ivan Rozhuk 2018-11-06 00:12:11 UTC
Process memory usage does not grow; munmap() should be working fine.
Comment 4 Mark Millard 2018-11-06 01:27:52 UTC
(In reply to rozhuk.im from comment #3)

Looks like I misinterpreted the man page's description;
munmap does return 0 in the example.

Sorry for the noise on that point.

But maybe I can make up for it . . .
Maybe the following example makes the report clearer?

I ran an a.out built from the source with 16 instead of 4,
i.e. sized for an 8 GiByte aarch64 system.

Before the a.out run:

Mem: 13M Active, 1240K Inact, 108M Wired, 28M Buf, 7757M Free
Swap: 28G Total, 28G Free

After the a.out had finished, from a separate top
run:

Mem: 2197M Active, 4937M Inact, 255M Wired, 186M Buf, 629M Free
Swap: 28G Total, 11M Used, 28G Free

15+ minutes later, with the system left idle: not much
change from what is shown above.
Comment 5 Mark Johnston freebsd_committer freebsd_triage 2018-11-06 01:48:31 UTC
(In reply to Mark Millard from comment #4)
This behaviour is expected.  The kernel is caching the file's contents even after the file is unmapped.
Comment 6 Ivan Rozhuk 2018-11-06 02:03:38 UTC
(In reply to Mark Johnston from comment #5)

On 10.1 this cache eats all free memory and the whole system freezes if swap is not enabled.

Now it is better, but the disk cache still causes swap usage, or freezes (though less than on 10.x), if no swap is present.

The disk cache should have very low priority and behave like free memory - available to any application for allocation - or have a hard limit via sysctl, so that it does not consume all free memory.

Another thing: I do not want to use swap, but I need kernel core dumps; how can I do this?
Comment 7 Mark Johnston freebsd_committer freebsd_triage 2018-11-06 02:22:26 UTC
(In reply to rozhuk.im from comment #6)
Your test program is dirtying pages by writing to them, so the OS is forced to
flush them to disk before they can be reused.  It is easy to dirty pages more quickly than they can be written back and freed, in which case the system will continually be starved for free pages.  Currently I don't believe we have any mechanism to restrict the amount of dirty memory mapped into a given process.
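
(For illustration only: at the application level, one sketch for bounding the amount of unwritten dirty memory would be to write each window back synchronously before unmapping it, reusing the names from the fragment quoted in comment 2; this is not a kernel-enforced limit.)

        /* Force this window's dirty pages to be written back before the
         * mapping is dropped; unwritten dirty memory then stays roughly
         * bounded by a single write_size window. */
        if (msync(mem, write_size, MS_SYNC) == -1)
            err(1, "msync");
        munmap(mem, write_size);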

Regarding kernel dumps, if you have space on a raw partition, you can simply point dumpon(8) at that.  If not, you might consider using netdump(4).
Comment 8 Ivan Rozhuk 2018-11-06 03:15:17 UTC
(In reply to Mark Johnston from comment #7)

The main problem is that dirty pages do not become free after they are flushed to disk, and this causes swap usage.
On 10.1 the system cannot allocate memory even after it stops writing to disk.

Probably now I should count free memory as: free + laundry.


Another bug: I have 6.5 GB free; the program writes 6 GB, then I rename the file and restart the program. The program fails some time later, but the system cannot flush the pages to disk - there is no space - and it keeps shuffling memory the whole time:
...
CPU:  0.0% user,  0.0% nice, 10.5% system,  7.0% interrupt, 82.5% idle
Mem: 1945M Active, 120K Inact, 1537M Laundry, 315M Wired, 199M Buf, 47M Free
...

CPU:  0.0% user,  0.0% nice, 14.3% system,  6.3% interrupt, 79.5% idle
Mem: 2765M Active, 36K Inact, 717M Laundry, 315M Wired, 199M Buf, 48M Free
   16 root         49    -      0    72K CPU3     3  12:26  59.96% [pagedaemon{laundry: dom0}]
...
and eats CPU, until the file was deleted.


About the kernel dump.
I tried:
swapoff -a
swapoff: /dev/gptid/0714a812-b98e-11e8-a831-7085c2375722.eli: Cannot allocate memory
The system has 32 GB RAM, 1.8 GB in swap, and the sum of RES for all running apps is less than 20 GB.
Only after I stopped one vbox VM and got 4+ GB of free memory did it work without error.
Comment 9 Mark Millard 2018-11-06 04:11:48 UTC
(In reply to rozhuk.im from comment #6)

> On 10.1 this cache eat all free mem and all system freeze, if swap not enabled.

In what I describe below I was testing a head -r339076
based FreeBSD on an aarch64 8 GiByte system. So I used
16 instead of 4 to scale to twice the RAM size. swapoff -a
had been used first, so no swap was enabled.

First I give the ideas, then comment on the tests
that used them.

If the file is to stick around and should be fully
updated, an fsync before closing would seem to handle
updating the file at that point. If so, the RAM pages
should no longer be dirty, which in turn should allow
such pages to be converted later to other uses in any
process without requiring I/O at the time. This is true
even if the program is started after swapoff -a, so that
no swap is available.
If the file is to stick around but does not need to be
(fully) updated, posix_madvise(mem, write_size, MADV_FREE)
before each unmap, plus the fsync before close (to be sure
any remaining dirty pages are handled if the hint is not
taken), might be how to hint the intent to the system.
Both Active and Inact might end up containing such clean
RAM pages that could be put to direct use for other memory
requests, if I understand correctly. Again, this is true
even if the program is started after swapoff -a, so that
no swap is available.
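
(A sketch of how the loop body and teardown might look with those hints, reusing the names from the quoted fragments; this illustrates the idea and is not a tested patch:)

            /* Hint that this window's contents need not be preserved;
             * clean pages can then be reclaimed without further I/O. */
            posix_madvise(mem, write_size, MADV_FREE);
            munmap(mem, write_size);
        }
        /* Write back any remaining dirty pages so they are clean by the
         * time the descriptor is closed. */
        if (fsync(fd) == -1)
            err(1, "fsync");
        close(fd);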

So I tried those combinations:

The MADV_FREE and fsync combination allowed me to
(with swapoff -a before): run, rename the file,
run, . . .

The same was true for the just fsync variant:
run, rename the file, run, . . .

So the presence of multiple such 16 GiByte files
from past runs (no boots between) did not prevent
the next file from being created and used the same
way on the 8 GiByte RAM box with no swap space
enabled. Clearly the RAM is being made available
as needed for this program.

In at least the MADV_FREE involved case, both
Active and Inact could be large after a run but
that did not interfere with later runs: the RAM
pages became available as needed without needing
to be swapped first.

If there is something I missed in the test structure,
let me know.
Comment 10 Mark Millard 2018-11-06 22:15:41 UTC
(In reply to rozhuk.im from comment #8)

One difference between my test context and yours appears
to be that I'm not using any encryption layer, but the .eli
in:

/dev/gptid/0714a812-b98e-11e8-a831-7085c2375722.eli

suggests that you are using geli-based encryption.

My context is basic, simple UFS, where the I/O is fairly
low latency and fairly high rate (an SSD). So, for example,
the fsync activity does not last long. I have no experience
with such issues for geli encryption and have no clue how
your I/O subsystem's latency and bandwidth might compare.

I'm also probably less likely to see the file system
try to allocate memory during its attempt to fsync
or otherwise write out dirty RAM pages (making them
clean).

All of this may make it harder for me to replicate
the behavior that you would see for the same test
program run the same way but in your context.
Comment 11 Mark Millard 2018-11-06 22:27:07 UTC
(In reply to Mark Johnston from comment #5)

Thanks for that note. Learning about the caching status
after unmap, after close, and after process exit has
helped clear out some bad assumptions of mine, including
about just what top's Active, Inact, and Buf mean in
various contexts. My assumptions were tied to the observed
behavior of the limited range of my typical workloads.