Summary: | pf didn't check if enough free RAM for net.pf.states_hashsize | ||
---|---|---|---|
Product: | Base System | Reporter: | Olivier Cochard <olivier> |
Component: | kern | Assignee: | freebsd-pf (Nobody) <pf> |
Status: | Closed FIXED | ||
Severity: | Affects Only Me | CC: | david.marec, farrokhi, fnoyanisi, gonzo, kp |
Priority: | --- | ||
Version: | 10.3-RELEASE | ||
Hardware: | Any | ||
OS: | Any | ||
Attachments: |
Description
Olivier Cochard
2016-05-12 23:03:40 UTC
While investigating, Olivier has suspected the hash table allocation to be responsible of the loading freeze. by the fact, the allocation process within pf_initialize() (sys/netpfil/pf/pf.c) function sets the "M_WAITOK" flag. This way, the initialization process might hang to wait for the requested resources. If I configure net.pf.states_hashsize to 134,217,728, this mean using 10GB of RAM on my 8GB only server, it will directly panic: KDB: stack backtrace: #0 0xffffffff807ecbb0 at kdb_backtrace+0x60 #1 0xffffffff807b4086 at vpanic+0x126 #2 0xffffffff807b3f53 at panic+0x43 #3 0xffffffff809b9b7b at vm_fault_hold+0x1bbb #4 0xffffffff80b3d10c at trap_pfault+0x19c #5 0xffffffff80b3c8fa at trap+0x47a #6 0xffffffff80b22de2 at calltrap+0x8 #7 0xffffffff81811509 at pf_initialize+0x1c9 #8 0xffffffff81826775 at pf_modevent+0x125 #9 0xffffffff8079e0db at module_register_init+0xfb #10 0xffffffff80793887 at linker_load_module+0xc07 #11 0xffffffff80794cd3 at kern_kldload+0xc3 #12 0xffffffff80794dab at sys_kldload+0x5b #13 0xffffffff80b3d817 at amd64_syscall+0x2f7 #14 0xffffffff80b230cb at Xfast_syscall+0xfb Uptime: 19s Created attachment 170810 [details]
Patch for the proposed fix.
Patch file is created with
$ diff -u pf.c pf.c.fni > pf.c.diff
where pf.c.fni is the modified version of the pf.c
Hi, In this forum post [1] from David, there is a bit of discussion about this PR (apart from the original question). Would checking the requested amount of memory by malloc(9) against the available RAM (obtained via sysctl hw.physmem) be a good approach to avoid this problem? To test this, I setup two identical VMs (in fact, VM2 is the copied & renamed version of VM1 bhyve image) with 512MB of RAM, running FreeBSD-CURRENT, single CPU core. VM1 has the pf.c that comes with the FreeBSD-CURRENT source, whereas VM2 has a patched version of the file, which performs a sanity check on the requested memory by malloc(9) against the RAM size. If the requested memory is more than the available memory, ph_hashsize is set to PF_HASHSIZ, which 32768. On both VMs, I set net.pf.states_hashsize to 2147483648 (way more than the RAM on the VM) via /boot/loader.conf. pf(4) is loaded with service(8) command. With this configuration on each VM, VM1 fails to start pf(4), whereas VM2 is successful to load the kernel module with net.pf.states_hashsize set to 32768. VM1 ---------------------------------------------------------- vm1 # uname -a FreeBSD test-pf 11.0-CURRENT FreeBSD 11.0-CURRENT #0 r297692: Fri Apr 8 03:07:13 UTC 2016 root@releng2.nyi.freebsd.org:/usr/obj/usr/src/sys/GENERIC amd64 vm1 # cat /boot/loader.conf net.pf.states_hashsize: 2147483648 vm1 # sysctl hw.realmem hw.realmem: 536870912 vm1 # kldstat Id Refs Address Size Name 1 3 0xffffffff80200000 1ee2bc0 kernel vm1 # service pf onestart Enabling pfKernel page fault with the following non-sleepable locks held: exclusive rw pf rulesets (pf rulesets) r = 0 (0xffffffff822466e0) locked @ /usr/src/sys/modules/pf/../../netpfil/pf/pf_ioctl.c:2901 stack backtrace: #0 0xffffffff80a91a90 at witness_debugger+0x70 #1 0xffffffff80a92d77 at witness_warn+0x3d7 #2 0xffffffff80e92817 at trap_pfault+0x57 #3 0xffffffff80e91ea4 at trap+0x284 #4 0xffffffff80e71ea7 at calltrap+0x8 #5 0xffffffff8090b646 at devfs_ioctl_f+0x156 #6 0xffffffff80a97106 at kern_ioctl+0x246 #7 0xffffffff80a96e51 at sys_ioctl+0x171 #8 0xffffffff80e92f6b at amd64_syscall+0x2db #9 0xffffffff80e7218b at Xfast_syscall+0xfb Fatal trap 12: page fault while in kernel mode cpuid = 0; apic id = 00 fault virtual address = 0x0 fault code = supervisor read data, page not present instruction pointer = 0x20:0xffffffff8223001f stack pointer = 0x28:0xfffffe002b724310 frame pointer = 0x28:0xfffffe002b724800 code segment = base rx0, limit 0xfffff, type 0x1b = DPL 0, pres 1, long 1, def32 0, gran 1 processor eflags = interrupt enabled, resume, IOPL = 0 current process = 635 (pfctl) [ thread pid 635 tid 100074 ] Stopped at pfioctl+0x4ef: movq (%rdi),%rsi db> VM2 ---------------------------------------------------------- vm2 # uname -a FreeBSD test-pf 11.0-CURRENT FreeBSD 11.0-CURRENT #2: Sun May 29 12:06:57 NZST 2016 test@test-pf:/usr/obj/usr/src/sys/GENERIC amd64 vm2 # cat /boot/loader.conf net.pf.states_hashsize="2147483648" vm2 # sysctl hw.realmem hw.realmem: 536870912 vm2 # kldstat Id Refs Address Size Name 1 1 0xffffffff80200000 1ee2bc0 kernel vm2 # service pf onestart Enabling pf. vm2 # kldstat Id Refs Address Size Name 1 3 0xffffffff80200000 1ee2bc0 kernel 2 1 0xffffffff82219000 34c30 pf.ko [1] https://forums.freebsd.org/threads/56250/ Thanks: Your patch works great! Could you add a log warning message when the not enough memory condition is triggered ? Hi, Somehow, I missed the notification for the latest comment! Sorry for the late reply.... It is possible to add a log warning message by using printf(9) or log(9), but I have noticed that pf_initialize() does things quietly most of the times. Do you reckon using printf(9) or log(9) would be the proper approach for adding a log warning message for the not enough memory condition? Created attachment 176410 [details]
The latest patch with console and logging facility outputs
Hi there,
I setup another VM running FreeBSD 12.0-CURRENT and updated the previous patch to reflect Olivier's suggestion with regards to logging.
I used printf(9) to print an error message similar to this
pf: Invalid states hash table size : 2147483648. Using the default value 32768.
Not mentioned in the original bug report, but value of f_srchashsize also needs to be verified against the available memory as setting net.pf.source_nodes_hashsize tunable to a value larger than malloc(9) can allocate causes the system to crash.
The snippet below was captured with script(1) utility from a FreeBSD 12.0-CURRENT system patched with attached diff file (note that console output is not captured, only tty)
VM3
----------------------------------------------------------
Script started on Wed Nov 2 20:07:09 2016
root@vm3:~ # uname -a
FreeBSD vm3 12.0-CURRENT FreeBSD 12.0-CURRENT #2: Tue Nov 1 20:27:46 NZDT 2016 root@vm3:/usr/obj/usr/src/sys/GENERIC amd64
root@vm3:~ # cat /boot/loader.conf
net.pf.states_hashsize="2147483648"
root@vm3:~ # sysctl hw.realmem
hw.realmem: 536805376
root@vm3:~ # kldstat
Id Refs Address Size Name
1 3 0xffffffff80200000 1fdc408 kernel
2 1 0xffffffff82419000 2ac85 vboxguest.ko
root@vm3:~ # service pf onestart
Enabling pf.
root@vm3:~ # kldstat
Id Refs Address Size Name
1 5 0xffffffff80200000 1fdc408 kernel
2 1 0xffffffff82419000 2ac85 vboxguest.ko
3 1 0xffffffff82444000 351a7 pf.ko
root@vm3:~ # dmesg | grep pf
pf: Invalid states hash table size : 2147483648. Using the default value 32768.
pf: Invalid source node hash table size : 0. Using the default value 8192.
root@vm3:~ # exit
exit
Script done on Wed Nov 2 20:08:23 2016
----------------------------------------------------------
Hi there, Just checking whether the last submitted patch is an appropriate solution to the problem and satisfies the requirement set by Olivier in his comment on 2016-06-03. Is there anything that is needed to be done / updated? Thanks (In reply to fehmi noyan isi from comment #8) I don't think this is the correct approach. There are no guarantees that the memory would not be allocated for something else. I currently don't have the time to look at this in more depth, but do feel free to chase me again in a couple of weeks. (In reply to Kristof Provost from comment #9) Hi Kristof, >>> There are no guarantees that the memory would not be allocated for something else. If my interpretation of your comment is correct (and if not please correct me), you mean one should perform the memory check against the "free" memory rather than the installed memory, which is hw.physmem in this case. However, I could not find any sysctl tunables that I can get the correct amount of free memory directly. Looking at top(1) source code, it requires extensive calculations to come up with the amount of wired, free or active memory. Sticking with hw.physmem was the best option I could think of. Any other suggestions are welcome, so that I can work it out... (In reply to fehmi noyan isi from comment #10) Yes, but I wouldn't check the free memory. I've not yet looked at the code, but the original report suggests there are a couple of related issues: > kldload stuck and never give the console back on my 8GB RAM server This suggests that pf is doing an allocation with M_WAITOK. This means the malloc() call will keep waiting until enough memory is found to satisfy the request. If that's for more memory than is available it'll never return > During this status, issuing command "pfctl" will crash the system: This suggests that there's a race in pf initialisation where the ioctl() handler is already registered before all initialisation is done (i.e. the large allocation is still blocked). That's a bug too, and one that could potentially occur even if the large allocation succeeds (though in a very small time window). My initial approach would be to find the allocation that blocks startup, presumably where the state hash table is allocated, and remove the M_WAITOK flag. Then the code needs to be changed to ensure that it can handle an allocation failure. The second step is to audit the initialisation code to find out why ioctls() can be done before the state hash table is allocated. (In reply to Kristof Provost from comment #11) For the first issue you mentioned, I think it is this line blocking the startup V_pf_srchash = malloc(pf_srchashsize * sizeof(struct pf_srchash),M_PFHASH,M_WAITOK|M_ZERO); then maybe something like below? if ((V_pf_srchash = malloc(..., ... ,M_NOWAIT|M_ZERO)) == NULL) { /* error handling, cleanup & return */ } (In reply to fehmi noyan isi from comment #12) Yes, that's probably one of them. There are a couple of allocations in pf_initialize() and they all use M_WAITOK. I do see why, because it's non-trivial to cope with allocation failures in that part of the code. It gets called from a VNET_SYSINIT(), so it doesn't have a reasonable way of aborting the initialisation. As an aside, the allocations should probably be changed to use mallocarray() too, because the multiplication they do (pf_hashsize * sizeof(pf_idhash) for example) might overflow. pf_hashsize can be set by the user (although only root, because it's a sysctl, fortunately), so we must verify it won't overflow before trying to allocate it. (In reply to Kristof Provost from comment #13) >> I do see why, because it's non-trivial to cope with allocation failures in that part of the code. It gets called from a VNET_SYSINIT(), so it doesn't have a reasonable way of aborting the initialisation. So, does this come down to supplying a default value and re-attempting malloc() again? >> As an aside, the allocations should probably be changed to use mallocarray() too Yes, that might help to avoid this bug, but I do not think FreeBSD has mallocarray()? (In reply to fehmi noyan isi from comment #14) > So, does this come down to supplying a default value and re-attempting malloc() again? I was thinking in that direction as well, yes. It's either that, or not activating pf at all. Running it with a smaller state table might not be ideal, but it's bound to be a lot better than running without firewall at all. > Yes, that might help to avoid this bug, but I do not think FreeBSD has mallocarray()? It does now. It was added very recently (in head). man 9 mallocarray. It might be worth doing that change in a separate commit. (In reply to Kristof Provost from comment #15) >> It does now. It was added very recently (in head). man 9 mallocarray. It might be worth doing that change in a separate commit. This is news to me! A quick look at the head revealed quite a few commits that replaces malloc(9) with mallocarray(9). That's good... However, I would say even the first problem you mentioned needs to be split into two as we can have cases where the requested amount of memory does not overflow but exceeds the available memory. Attempting malloc() (or using mallocarray() in this situation would yield the same result) and checking return value, then in case NULL is returned, re-attempting another allocation does not sound like the best approach that could possibly used here. How about the two-step process below? 1 ) making sure the requested allocation size does not exceed the system memory or the available free memory- we need to check this 2 ) avoiding overflow - mallocarray(9) would do this which can be represented in pseudo-code as given below pf_initiazlize(){ /* the first check */ alloc_size = check_if_we_have_enough_memory(pf_hashsize * sizeof(pf_idhash)) /* it should be safe at this point, but better to be cautious than sad */ V_pf_srchash = mallocarray(alloc_size, ... ,M_NOWAIT|M_ZERO) } size_t check_if_we_have_enough_memory(u_long size) { return (size > available_or_free_memory)? PF_HASHSIZ : size; } I think the second issue below can be addressed independently from the one I mentioned above. >This suggests that there's a race in pf initialisation where the ioctl() handler is already registered before all initialisation is done (i.e. the large allocation is still blocked). That's a bug too, and one that could potentially occur even if the large allocation succeeds (though in a very small time window). (In reply to fehmi noyan isi from comment #16) Yes, absolutely. The potential for integer overflow is a separate issue. I wouldn't try check_if_we_have_enough_memory() though. Memory allocation is very complicated and it's basically impossible to predict if there'd be enough free memory to satisfy a request without actually doing it. Moreover, it'd always be susceptible to races. I'd try to allocate the requested size with malloc(size * sizeof(foo), M_NOWAIT). If that fails we can fall back on malloc(DEFAULT_SIZE * sizeof(foo), M_WAITOK). Even that allocation might block forever, but at that point the system is so low on memory that we'll be in all sorts of trouble. (In reply to Kristof Provost from comment #17) Sounds a better approach than I suggested, thanks! As the allocation size can be set by the user, don't you think mallocarray(9) would be a better choice than malloc(9) in this case? I see that you mentioned using mallocarray(9) could be a separate commit, but once the code is updated to fix a memory allocation issue, it may pay to use the new system call we have. What do you think? (In reply to fehmi noyan isi from comment #18) Yes, mallocarray() would indeed be better. I'd suggest doing that in a separate commit mostly because mallocarray() currently only exists in head, and this patch may be useful in stable/11 as well. It's not a hard requirement from my point of view though. ok, if a patch with malloc(9) could find its way into stable/11, it would pay to use malloc(9). mallocarray() is now available for 11-Stable. https://svnweb.freebsd.org/changeset/base/328210 Hi there. I patched pf_initialize() in head (r328383, which has mallocarray(9)) by implementing a fallback mallocarray(9) call (as discussed in the comments)
if ((V_pf_idhash = mallocarray(pf_hashsize, sizeof(pf_idhash), M_PFHASH, M_WAITOK|M_ZERO)) == NULL) {
V_pf_idhash = mallocarray(PH_HASHSIZ, sizeof(pf_idhash), M_PFHASH, M_NOWAIT|M_ZERO)
}
// and all other malloc(9) calls
Although, this change prevents the initialisation process against any overflow issues, the fallback mallocarray(...,M_WAITOK) calls still block in case too much memory is requested.
>> Memory allocation is very complicated and it's basically impossible to predict if there'd be enough free memory to satisfy a request without actually doing it. Moreover, it'd always be susceptible to races.
Based on Kristof's comment above, shall we assume this is as far as we can go?
(In reply to fehmi noyan isi from comment #22) Yeah, that's probably the best we can do for the allocation. It'd also be good to log that we haven't allocated all of the requested memory. That's part of the problem. Another part (for another patch) is that the pf ioctl() handler seems to get registered before these allocations are done. Created attachment 190429 [details]
patch that uses mallocarray(9) and printf(9)
I have been tied by other commitments, so could not send this patch, which I wrote about a a week or so ago.
This patch attempts to allocate requested memory by using mallocarray(9) with NO_WAIT flag and if the call fails, it falls back and makes another mallocarray(9) call. Second call, however, is done by using the default PF_HASHSIZ value and M_WAITOK flag.
Please note, even with this patch, pf.ko still crashes during initialisation phase if you set the value of net.pf.states_hashsize sysctl tunable large enough to consume all physical memory, something like 2147483648 on a machine with 512MB RAM (this check has been opted out during the discussions).
The patch also adds some logging mechanism, by using printf(9), in case any of the mallocarray(9) calls fails.
The patch was generated the pf.c found in 12.0-CURRENT
(In reply to fehmi noyan isi from comment #24) I'd change the error message ('invalid states hashtable size(%lu)', and friends) to say there's not enough memory. It's not so much that the value is incorrect, just that it's more than we have memory for. What crash are you seeing? Note that mallocarray() will panic if you trigger an integer overflow, which could well happen with 2147483648 on a 32-bit machine. Created attachment 190476 [details] Kernel Crash Dump (In reply to Kristof Provost from comment #25) The VM I am running the tests is 64-bit,so I do not think the panic is triggered by mallocarray(9). However, I see that the mtx_init(9) in the for loop causes the crash. I attach textdump output for your reference.... Here is the problem; if ((V_pf_keyhash = mallocarray(pf_hashsize, sizeof(struct pf_keyhash), M_PFHASH, M_NOWAIT | M_ZERO)) == NULL){ V_pf_keyhash = mallocarray(PF_HASHSIZ, sizeof(struct pf_keyhash), M_PFHASH, M_WAITOK | M_ZERO); printf(...); } if ((V_pf_idhash = mallocarray(pf_hashsize, sizeof(struct pf_idhash), M_PFHASH, M_NOWAIT | M_ZERO)) == NULL){ V_pf_idhash = mallocarray(PF_HASHSIZ, sizeof(struct pf_idhash), M_PFHASH, M_WAITOK | M_ZERO); printf(...); } pf_hashmask = pf_hashsize - 1; // pf_hashsize is 2147483648 for (i = 0, kh = V_pf_keyhash, ih = V_pf_idhash; i <= pf_hashmask; i++, kh++, ih++) { mtx_init(&kh->lock, "pf_keyhash", NULL, MTX_DEF | MTX_DUPOK); mtx_init(&ih->lock, "pf_idhash", NULL, MTX_DEF); } In the code above, V_ph_idhash and V_pf_keyhash are allocated PF_HASHSIZ * sizeof(struct pf_keyhash) and PF_HASHSIZ * sizeof(struct pf_idhash) amount of memory respectively. The for loop following the mallocarray(9) calls expects the allocated memory to be aligned with pf_hashsize variable, which is usd in the loop and set to 2147483648 in our example. On the other hand, PH_HASHSIZ is 32768. This mismatch causes the initialisation to fail. Apparently, the value of pf_hashsize needs to be set and it should be used in mallocarray(9) calls rather than PF_HASHSIZ. Although, sizeof(struct pf_keyhash) = sizeof(struct pf_idhash) = 40, we cannot guarantee that the size of structs will stay the same (please correct me if I am wrong). Given that the for loop assumes V_ph_idhash and V_pf_keyhash are allocated memory by using the same multiplier, which is pf_hashsize, I think we either * should make a test before the mallocarray() calls and set pf_hashsize accordingly (how?) * make two mallocarray(...,M_NOWAIT) calls and test return values in a single if() statement. If either or both of these pointers is NULL, we should fallback and re-allocate memory for _both_ V_ph_idhash and V_pf_keyhash by using a single pf_hashsize value. (In reply to fehmi noyan isi from comment #26) Yes, your analysis looks to be correct. I'd go for the second option: try to allocate both keyhash and idhash with the requested size. If either one fails free both and re-try with the default size. I don't think the sizes of the structs are relevant here. We allocate 'pf_hashsize' elements of both, but we don't care how much memory each allocation takes, just that we know how many there are. Created attachment 190574 [details]
patch with mallocarray(9), printf(9) and a single fallback block
Attached is the latest patch with a single fallback block which is executed depending on the return values from mallcoarray(... | M_NOWAIT) calls. The code sets the value of respective variable that is passed to mallocarray(9) rather than directly using PH_HASHSIZ in the mallocarray(9) call.
I tried the patch on a 512MB VM running FreeBSD-12.0CURRENT and verified that PF initialisation went okay. A log message indicating the memory allocation issue is present in /var/log/messages (and in the console).
root@test-vm:~ # uname -a
FreeBSD test-vm 12.0-CURRENT FreeBSD 12.0-CURRENT #13: Wed Feb 14 13:28:52 NZDT 2018 root@test-vm:/usr/obj/usr/src/amd64.amd64/sys/GENERIC amd64
root@test-vm:~ # cat /boot/loader.conf
net.pf.states_hashsize="2147483648"
root@test-vm:~ # sysctl hw.realmem
hw.realmem: 536805376
root@test-vm:~ # kldstat
Id Refs Address Size Name
1 1 0xffffffff80200000 20bf2d0 kernel
root@test-vm:~ # service pf onestart
Enabling pf.
root@test-vm:~ # kldstat
Id Refs Address Size Name
1 3 0xffffffff80200000 20bf2d0 kernel
2 1 0xffffffff82419000 33590 pf.ko
root@test-vm:~ # grep pf_initialize /var/log/messages
Feb 14 13:40:20 test-vm kernel: pf_initialize : Not enough memory for 85899345920 bytes.
(In reply to fehmi noyan isi from comment #28) It's possible that one of the V_pf_keyhash or V_pf_idhash allocations succeeded, but not the other. That means you may have to free one of them. (Note that free(NULL, M_PFHASH) is safe.) It may be easier to discuss the patch if you post it on http://reviews.freebsd.org. (In reply to Kristof Provost from comment #29) Calling free(9) before reallocating the memory was a no brainer. Thanks! Below is the diff https://reviews.freebsd.org/D14367 A commit references this bug: Author: kp Date: Sun Feb 25 08:56:44 UTC 2018 New revision: 329950 URL: https://svnweb.freebsd.org/changeset/base/329950 Log: pf: Cope with overly large net.pf.states_hashsize If the user configures a states_hashsize or source_nodes_hashsize value we may not have enough memory to allocate this. This used to lock up pf, because these allocations used M_WAITOK. Cope with this by attempting the allocation with M_NOWAIT and falling back to the default sizes (with M_WAITOK) if these fail. PR: 209475 Submitted by: Fehmi Noyan Isi <fnoyanisi AT yahoo.com> MFC after: 3 weeks Differential Revision: https://reviews.freebsd.org/D14367 Changes: head/sys/net/pfvar.h head/sys/netpfil/pf/pf.c Thank you! A commit references this bug: Author: kp Date: Sun Mar 18 11:25:40 UTC 2018 New revision: 331116 URL: https://svnweb.freebsd.org/changeset/base/331116 Log: MFC r329950: pf: Cope with overly large net.pf.states_hashsize If the user configures a states_hashsize or source_nodes_hashsize value we may not have enough memory to allocate this. This used to lock up pf, because these allocations used M_WAITOK. Cope with this by attempting the allocation with M_NOWAIT and falling back to the default sizes (with M_WAITOK) if these fail. PR: 209475 Submitted by: Fehmi Noyan Isi <fnoyanisi AT yahoo.com> Changes: _U stable/11/ stable/11/sys/net/pfvar.h stable/11/sys/netpfil/pf/pf.c A commit references this bug: Author: kp Date: Sun Mar 18 11:26:07 UTC 2018 New revision: 331117 URL: https://svnweb.freebsd.org/changeset/base/331117 Log: MFC r329950: pf: Cope with overly large net.pf.states_hashsize If the user configures a states_hashsize or source_nodes_hashsize value we may not have enough memory to allocate this. This used to lock up pf, because these allocations used M_WAITOK. Cope with this by attempting the allocation with M_NOWAIT and falling back to the default sizes (with M_WAITOK) if these fail. PR: 209475 Submitted by: Fehmi Noyan Isi <fnoyanisi AT yahoo.com> Changes: _U stable/10/ stable/10/sys/net/pfvar.h stable/10/sys/netpfil/pf/pf.c There is a commit referencing this PR, but it's still not closed and has been inactive for some time. Closing the PR as fixed but feel free to re-open it if the issue hasn't been completely resolved. Thanks |