Bug 209475 - pf didn't check if enough free RAM for net.pf.states_hashsize
Summary: pf didn't check if enough free RAM for net.pf.states_hashsize
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: 10.3-RELEASE
Hardware: Any Any
: --- Affects Only Me
Assignee: freebsd-pf
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-05-12 23:03 UTC by Olivier Cochard
Modified: 2018-01-17 21:54 UTC (History)
4 users (show)

See Also:


Attachments
Patch for the proposed fix. (1.16 KB, patch)
2016-05-29 21:38 UTC, fehmi noyan isi
no flags Details | Diff
The latest patch with console and logging facility outputs (1.51 KB, patch)
2016-11-02 08:09 UTC, fehmi noyan isi
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Olivier Cochard freebsd_committer 2016-05-12 23:03:40 UTC
I'm playing with benching some pf limits and I've found an easy way to block kldload pf.ko:

We just need to set a net.pf.states_hashsize value too big for the system available RAM.
This table is a static table with each entry consuming about 80 bytes of RAM.
Manual page ask to use power of 2 values, then I've try this big value on my 8GB RAM server:

echo 'net.pf.states_hashsize="67108864"' >>/boot/loader.conf
Notice that 67108864 * 80 should be close to 5 GB of RAM reserved.

Then when I run a "service pf onestart", this start a "kldload pf.ko", but kldload stuck and never give the console back on my 8GB RAM server (we can escape with a Ctrl+Z).
But if pf is enabled in rc.conf, this prevent the system to correctly boot.

A Ctrl+D give this kind of information:

load: 0.02  cmd: kldload 1166 [kmem arena] 180.93r 0.00u 0.00s 0% 1700k
load: 0.02  cmd: kldload 1166 [kmem arena] 182.69r 0.00u 0.00s 0% 1700k
load: 0.02  cmd: kldload 1166 [kmem arena] 182.96r 0.00u 0.00s 0% 1700k
load: 0.02  cmd: kldload 1166 [kmem arena] 183.50r 0.00u 0.00s 0% 1700k

During this status, issuing command "pfctl" will crash the system:

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address   = 0x0
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff81a311b4
stack pointer           = 0x28:0xfffffe00003aba60
frame pointer           = 0x28:0xfffffe00003ac7f0
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 1869 (pfctl)
trap number             = 12
panic: page fault
cpuid = 0
KDB: stack backtrace:
#0 0xffffffff807f5c70 at kdb_backtrace+0x60
#1 0xffffffff807bc276 at vpanic+0x126
#2 0xffffffff807bc143 at panic+0x43
#3 0xffffffff80b4874b at trap_fatal+0x36b
#4 0xffffffff80b48a4d at trap_pfault+0x2ed
#5 0xffffffff80b480ca at trap+0x47a
#6 0xffffffff80b2e2c2 at calltrap+0x8
#7 0xffffffff80715779 at devfs_ioctl_f+0x139
#8 0xffffffff8081013f at kern_ioctl+0x20f
#9 0xffffffff8080fe8c at sys_ioctl+0x15c
#10 0xffffffff80b490b8 at amd64_syscall+0x3a8
#11 0xffffffff80b2e5ab at Xfast_syscall+0xfb
Uptime: 1m44s
Comment 1 david.marec 2016-05-13 20:27:23 UTC
While  investigating, Olivier has suspected the hash table allocation to be responsible of the loading freeze.

by the fact, the allocation process within  pf_initialize() (sys/netpfil/pf/pf.c) function sets the "M_WAITOK" flag. This way, the initialization process might hang to wait for the requested resources.
Comment 2 Olivier Cochard freebsd_committer 2016-05-13 20:52:47 UTC
If I configure net.pf.states_hashsize to 134,217,728, this mean using 10GB of RAM on my 8GB only server, it will directly panic:
KDB: stack backtrace:
#0 0xffffffff807ecbb0 at kdb_backtrace+0x60
#1 0xffffffff807b4086 at vpanic+0x126
#2 0xffffffff807b3f53 at panic+0x43
#3 0xffffffff809b9b7b at vm_fault_hold+0x1bbb
#4 0xffffffff80b3d10c at trap_pfault+0x19c
#5 0xffffffff80b3c8fa at trap+0x47a
#6 0xffffffff80b22de2 at calltrap+0x8
#7 0xffffffff81811509 at pf_initialize+0x1c9
#8 0xffffffff81826775 at pf_modevent+0x125
#9 0xffffffff8079e0db at module_register_init+0xfb
#10 0xffffffff80793887 at linker_load_module+0xc07
#11 0xffffffff80794cd3 at kern_kldload+0xc3
#12 0xffffffff80794dab at sys_kldload+0x5b
#13 0xffffffff80b3d817 at amd64_syscall+0x2f7
#14 0xffffffff80b230cb at Xfast_syscall+0xfb
Uptime: 19s
Comment 3 fehmi noyan isi 2016-05-29 21:38:17 UTC
Created attachment 170810 [details]
Patch for the proposed fix.

Patch file is created with
$ diff -u pf.c pf.c.fni > pf.c.diff

where pf.c.fni is the modified version of the pf.c
Comment 4 fehmi noyan isi 2016-05-29 21:39:01 UTC
Hi,

In this forum post [1] from David, there is a bit of discussion about this PR (apart from the original question).

Would checking the requested amount of memory by malloc(9) against the available RAM (obtained via sysctl hw.physmem) be a good approach to avoid this problem?

To test this, I setup two identical VMs (in fact, VM2 is the copied & renamed version of VM1 bhyve image) with 512MB of RAM, running FreeBSD-CURRENT, single CPU core.

VM1 has the pf.c that comes with the FreeBSD-CURRENT source, whereas VM2 has a patched version of the file, which performs a sanity check on the requested memory by malloc(9) against the RAM size. If the requested memory is more than the available memory, ph_hashsize is set to PF_HASHSIZ, which 32768.

On both VMs, I set net.pf.states_hashsize to 2147483648 (way more than the RAM on the VM) via /boot/loader.conf. pf(4) is loaded with service(8) command. With this configuration on each VM, VM1 fails to start pf(4), whereas VM2 is successful to load the kernel module with net.pf.states_hashsize set to 32768.

VM1
----------------------------------------------------------
vm1 # uname -a
FreeBSD test-pf 11.0-CURRENT FreeBSD 11.0-CURRENT #0 r297692: Fri Apr  8 03:07:13 UTC 2016  root@releng2.nyi.freebsd.org:/usr/obj/usr/src/sys/GENERIC  amd64
vm1 # cat /boot/loader.conf
net.pf.states_hashsize: 2147483648
vm1 # sysctl hw.realmem
hw.realmem: 536870912
vm1 # kldstat
Id Refs Address  Size  Name
1  3 0xffffffff80200000 1ee2bc0  kernel
vm1 # service pf onestart
Enabling pfKernel page fault with the following non-sleepable locks held:
exclusive rw pf rulesets (pf rulesets) r = 0 (0xffffffff822466e0) locked @ /usr/src/sys/modules/pf/../../netpfil/pf/pf_ioctl.c:2901
stack backtrace:
#0 0xffffffff80a91a90 at witness_debugger+0x70
#1 0xffffffff80a92d77 at witness_warn+0x3d7
#2 0xffffffff80e92817 at trap_pfault+0x57
#3 0xffffffff80e91ea4 at trap+0x284
#4 0xffffffff80e71ea7 at calltrap+0x8
#5 0xffffffff8090b646 at devfs_ioctl_f+0x156
#6 0xffffffff80a97106 at kern_ioctl+0x246
#7 0xffffffff80a96e51 at sys_ioctl+0x171
#8 0xffffffff80e92f6b at amd64_syscall+0x2db
#9 0xffffffff80e7218b at Xfast_syscall+0xfb


Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address   = 0x0
fault code     = supervisor read data, page not present
instruction pointer   = 0x20:0xffffffff8223001f
stack pointer    = 0x28:0xfffffe002b724310
frame pointer    = 0x28:0xfffffe002b724800
code segment     = base rx0, limit 0xfffff, type 0x1b
       = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags   = interrupt enabled, resume, IOPL = 0
current process     = 635 (pfctl)
[ thread pid 635 tid 100074 ]
Stopped at  pfioctl+0x4ef:  movq  (%rdi),%rsi
db> 


VM2
----------------------------------------------------------
vm2 # uname -a
FreeBSD test-pf 11.0-CURRENT FreeBSD 11.0-CURRENT #2: Sun May 29 12:06:57 NZST 2016  test@test-pf:/usr/obj/usr/src/sys/GENERIC  amd64
vm2 # cat /boot/loader.conf
net.pf.states_hashsize="2147483648"
vm2 # sysctl hw.realmem
hw.realmem: 536870912
vm2 # kldstat
Id Refs Address  Size  Name
1  1 0xffffffff80200000 1ee2bc0  kernel
vm2 # service pf onestart
Enabling pf.
vm2 # kldstat
Id Refs Address  Size  Name
1  3 0xffffffff80200000 1ee2bc0  kernel
2  1 0xffffffff82219000 34c30  pf.ko

[1] https://forums.freebsd.org/threads/56250/
Comment 5 Olivier Cochard freebsd_committer 2016-06-03 07:54:33 UTC
Thanks: Your patch works great!
Could you add a log warning message when the not enough memory condition is triggered ?
Comment 6 fehmi noyan isi 2016-08-15 10:25:00 UTC
Hi,

Somehow, I missed the notification for the latest comment! Sorry for the late reply....

It is possible to add a log warning message by using printf(9) or log(9), but I have noticed that pf_initialize() does things quietly most of the times.

Do you reckon using printf(9) or log(9) would be the proper approach for adding a log warning message for the not enough memory condition?
Comment 7 fehmi noyan isi 2016-11-02 08:09:35 UTC
Created attachment 176410 [details]
The latest patch with console and logging facility outputs

Hi there, 

I setup another VM running FreeBSD 12.0-CURRENT and updated the previous patch to reflect Olivier's suggestion with regards to logging.

I used printf(9) to print an error message similar to this

pf: Invalid states hash table size : 2147483648. Using the default value 32768.

Not mentioned in the original bug report, but value of f_srchashsize also needs to be verified against the available memory as setting net.pf.source_nodes_hashsize tunable to a value larger than malloc(9) can allocate causes the system to crash.

The snippet below was captured with script(1) utility from a FreeBSD 12.0-CURRENT system patched with attached diff file (note that console output is not captured, only tty)

VM3
----------------------------------------------------------
Script started on Wed Nov  2 20:07:09 2016
root@vm3:~ # uname -a
FreeBSD vm3 12.0-CURRENT FreeBSD 12.0-CURRENT #2: Tue Nov  1 20:27:46 NZDT 2016     root@vm3:/usr/obj/usr/src/sys/GENERIC  amd64
root@vm3:~ # cat /boot/loader.conf
net.pf.states_hashsize="2147483648"
root@vm3:~ # sysctl hw.realmem
hw.realmem: 536805376
root@vm3:~ # kldstat
Id Refs Address            Size     Name
 1    3 0xffffffff80200000 1fdc408  kernel
 2    1 0xffffffff82419000 2ac85    vboxguest.ko
root@vm3:~ # service pf onestart
Enabling pf.
root@vm3:~ # kldstat
Id Refs Address            Size     Name
 1    5 0xffffffff80200000 1fdc408  kernel
 2    1 0xffffffff82419000 2ac85    vboxguest.ko
 3    1 0xffffffff82444000 351a7    pf.ko
root@vm3:~ # dmesg | grep pf
pf: Invalid states hash table size : 2147483648. Using the default value 32768.
pf: Invalid source node hash table size : 0. Using the default value 8192.
root@vm3:~ # exit
exit

Script done on Wed Nov  2 20:08:23 2016
----------------------------------------------------------
Comment 8 fehmi noyan isi 2018-01-01 00:48:44 UTC
Hi there,

Just checking whether the last submitted patch is an appropriate solution to the problem and satisfies the requirement set by Olivier in his comment on 2016-06-03.

Is there anything that is needed to be done / updated?

Thanks
Comment 9 Kristof Provost freebsd_committer 2018-01-03 12:48:02 UTC
(In reply to fehmi noyan isi from comment #8)
I don't think this is the correct approach. There are no guarantees that the memory would not be allocated for something else.

I currently don't have the time to look at this in more depth, but do feel free to chase me again in a couple of weeks.
Comment 10 fehmi noyan isi 2018-01-16 01:12:52 UTC
(In reply to Kristof Provost from comment #9)
Hi Kristof,

>>> There are no guarantees that the memory would not be allocated for something else.

If my interpretation of your comment is correct (and if not please correct me), you mean one should perform the memory check against the "free" memory rather than the installed memory, which is hw.physmem in this case.

However, I could not find any sysctl tunables that I can get the correct amount of free memory directly. Looking at top(1) source code, it requires extensive calculations to come up with the amount of wired, free or active memory.

Sticking with hw.physmem was the best option I could think of. Any other suggestions are welcome, so that I can work it out...
Comment 11 Kristof Provost freebsd_committer 2018-01-16 08:43:45 UTC
(In reply to fehmi noyan isi from comment #10)
Yes, but I wouldn't check the free memory.

I've not yet looked at the code, but the original report suggests there are a couple of related issues:

> kldload stuck and never give the console back on my 8GB RAM server
This suggests that pf is doing an allocation with M_WAITOK. This means the malloc() call will keep waiting until enough memory is found to satisfy the request. If that's for more memory than is available it'll never return

> During this status, issuing command "pfctl" will crash the system:
This suggests that there's a race in pf initialisation where the ioctl() handler is already registered before all initialisation is done (i.e. the large allocation is still blocked). That's a bug too, and one that could potentially occur even if the large allocation succeeds (though in a very small time window).

My initial approach would be to find the allocation that blocks startup, presumably where the state hash table is allocated, and remove the M_WAITOK flag. Then the code needs to be changed to ensure that it can handle an allocation failure.
The second step is to audit the initialisation code to find out why ioctls() can be done before the state hash table is allocated.
Comment 12 fehmi noyan isi 2018-01-16 09:23:13 UTC
(In reply to Kristof Provost from comment #11)
For the first issue you mentioned,

I think it is this line blocking the startup

V_pf_srchash = malloc(pf_srchashsize * sizeof(struct pf_srchash),M_PFHASH,M_WAITOK|M_ZERO);

then maybe something like below?

if ((V_pf_srchash = malloc(..., ... ,M_NOWAIT|M_ZERO)) == NULL) {
 /* error handling, cleanup & return */
}
Comment 13 Kristof Provost freebsd_committer 2018-01-16 09:46:07 UTC
(In reply to fehmi noyan isi from comment #12)
Yes, that's probably one of them.

There are a couple of allocations in pf_initialize() and they all use M_WAITOK.

I do see why, because it's non-trivial to cope with allocation failures in that part of the code. It gets called from a VNET_SYSINIT(), so it doesn't have a reasonable way of aborting the initialisation.

As an aside, the allocations should probably be changed to use mallocarray() too, because the multiplication they do (pf_hashsize * sizeof(pf_idhash) for example) might overflow. pf_hashsize can be set by the user (although only root, because it's a sysctl, fortunately), so we must verify it won't overflow before trying to allocate it.
Comment 14 fehmi noyan isi 2018-01-16 10:19:07 UTC
(In reply to Kristof Provost from comment #13)

>> I do see why, because it's non-trivial to cope with allocation failures in that part of the code. It gets called from a VNET_SYSINIT(), so it doesn't have a reasonable way of aborting the initialisation.

So, does this come down to supplying a default value and re-attempting malloc() again?

>> As an aside, the allocations should probably be changed to use mallocarray() too
Yes, that might help to avoid this bug, but I do not think FreeBSD has mallocarray()?
Comment 15 Kristof Provost freebsd_committer 2018-01-16 10:23:54 UTC
(In reply to fehmi noyan isi from comment #14)
> So, does this come down to supplying a default value and re-attempting malloc() again?

I was thinking in that direction as well, yes. It's either that, or not activating pf at all. Running it with a smaller state table might not be ideal, but it's bound to be a lot better than running without firewall at all.
 
> Yes, that might help to avoid this bug, but I do not think FreeBSD has mallocarray()?

It does now. It was added very recently (in head). man 9 mallocarray. It might be worth doing that change in a separate commit.
Comment 16 fehmi noyan isi 2018-01-16 22:20:11 UTC
(In reply to Kristof Provost from comment #15)
>> It does now. It was added very recently (in head). man 9 mallocarray. It might be worth doing that change in a separate commit.

This is news to me! A quick look at the head revealed quite a few commits that replaces malloc(9) with mallocarray(9). That's good...

However, I would say even the first problem you mentioned needs to be split into two as we can have cases where the requested amount of memory does not overflow but exceeds the available memory.

Attempting malloc() (or using mallocarray() in this situation would yield the same result) and checking return value, then in case NULL is returned, re-attempting another allocation does not sound like the best approach that could possibly used here.

How about the two-step process below?

1 ) making sure the requested allocation size does not exceed the system memory or the available free memory- we need to check this

2 ) avoiding overflow - mallocarray(9) would do this

which can be represented in pseudo-code as given below

pf_initiazlize(){
   /* the first check */
   alloc_size = check_if_we_have_enough_memory(pf_hashsize * sizeof(pf_idhash))

   /* it should be safe at this point, but better to be cautious than sad */
   V_pf_srchash = mallocarray(alloc_size, ... ,M_NOWAIT|M_ZERO)
}

size_t check_if_we_have_enough_memory(u_long size) {
   return (size > available_or_free_memory)? PF_HASHSIZ : size;
}

I think the second issue below can be addressed independently from the one I mentioned above.

>This suggests that there's a race in pf initialisation where the ioctl() handler is already registered before all initialisation is done (i.e. the large allocation is still blocked). That's a bug too, and one that could potentially occur even if the large allocation succeeds (though in a very small time window).
Comment 17 Kristof Provost freebsd_committer 2018-01-17 10:46:50 UTC
(In reply to fehmi noyan isi from comment #16)
Yes, absolutely. The potential for integer overflow is a separate issue.

I wouldn't try check_if_we_have_enough_memory() though. Memory allocation is very complicated and it's basically impossible to predict if there'd be enough free memory to satisfy a request without actually doing it. Moreover, it'd always be susceptible to races.

I'd try to allocate the requested size with malloc(size * sizeof(foo), M_NOWAIT). If that fails we can fall back on malloc(DEFAULT_SIZE * sizeof(foo), M_WAITOK). Even that allocation might block forever, but at that point the system is so low on memory that we'll be in all sorts of trouble.
Comment 18 fehmi noyan isi 2018-01-17 19:45:37 UTC
(In reply to Kristof Provost from comment #17)
Sounds a better approach than I suggested, thanks!

As the allocation size can be set by the user, don't you think mallocarray(9) would be a better choice than malloc(9) in this case?

I see that you mentioned using mallocarray(9) could be a separate commit, but once the code is updated to fix a memory allocation issue, it may pay to use the new system call we have. What do you think?
Comment 19 Kristof Provost freebsd_committer 2018-01-17 21:50:44 UTC
(In reply to fehmi noyan isi from comment #18)
Yes, mallocarray() would indeed be better. I'd suggest doing that in a separate commit mostly because mallocarray() currently only exists in head, and this patch may be useful in stable/11 as well.

It's not a hard requirement from my point of view though.
Comment 20 fehmi noyan isi 2018-01-17 21:54:02 UTC
ok, if a patch with malloc(9) could find its way into stable/11, it would pay to use malloc(9).