Created attachment 191980 [details]
backtrace

Hello, trying to boot a recent 12-CURRENT snapshot memstick image results in a kernel panic. The system is actually HardenedBSD, but the affected code suggests the problem is not related to HBSD-specific changes. Attached are photos of the backtrace; sadly I can't save a dump because it's an installer image. The pictured snapshot is from March 21; I tried a recent one from today as well. This is on 2x Xeon E5-2683 v3, so it's a NUMA configuration. The kernel in question has NUMA off and is effectively equivalent to GENERIC minus old FreeBSD compat plus hardening. I can try a vanilla FreeBSD -CURRENT snapshot later and see whether it reproduces there as well.
Created attachment 191983 [details]
backtrace on vanilla

Confirmed on a vanilla 12-CURRENT memstick snapshot as well. 11-STABLE boots.
cc'd relevant people who modified the affected code recently
Look up the source line of the trap (against the FreeBSD sources, of course). I suspect it is one of the several howmany() calls, with the denominator ending up zero. How many VM domains do you have? If 4 domains, i.e. 'cluster on die' is turned on in the BIOS, does turning it off help with the boot?
(In reply to Konstantin Belousov from comment #3)

I believe CoD is disabled right now, considering numactl on Linux shows two memory domains total (one per CPU).

$ numactl -s
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55
cpubind: 0 1
nodebind: 0 1
membind: 0 1

If it were on, it would have shown four, two per CPU, right? This is my primary desktop/workstation computer with mostly desktop-style workloads (i.e. largely NUMA-unaware), so it's best to keep it disabled. Should I try enabling it and see what happens?
Oh, and yes, I did look up the source of the error and tried disassembling the function to check what's wrong, but I cannot make much sense of it; it looks like a divide by zero, though I don't understand why there would be a zero. It seems to me that glebius's recent changes (from February) could have introduced the breakage, but I cannot figure out what's wrong with them. I'm unfamiliar with FreeBSD's internals, and even less familiar with the kernel subsystems :)
(In reply to Daniel Kolesa from comment #5)

Yes, cluster on die creates two NUMA domains per socket, multiplied by the number of sockets. Load the failing kernel into kgdb, then do 'list *uma_startup_count+0xe6'; the expression is the symbolic address where the trap occurred. You might need to use kernel.full or put kernel.debug in the right place.
(In reply to Konstantin Belousov from comment #6)

The problem really is that I have no working FreeBSD setup on this hardware, only an installer image; the filesystem in the image is read-only, so I cannot save a crash dump, and I have no debug version of the kernel. What I could try is to install 11-STABLE on a flash drive, build a CURRENT kernel from fresh sources, reboot into that and save a dump in the process, then boot back into kernel.old (i.e. the 11-STABLE kernel) and run kgdb from there. I was hoping to avoid having to go through all that :) But if it's necessary, I'll try.
(In reply to Daniel Kolesa from comment #7) You do not need the crash dump to get the information I asked about. I suspect that you can even use linux gdb.
Actually, I managed to obtain a kernel-dbg.txz of the same snapshot and kgdb it on my server, which runs HardenedBSD 11-STABLE. This is the output:

(kgdb) list *uma_startup_count+0xe6
0xffffffff80e13296 is in uma_startup_count (/usr/src/sys/vm/uma_core.c:1827).
1822
1823            bucket_init();
1824
1825            booted = UMA_STARTUP;
1826
1827    #ifdef UMA_DEBUG
1828            printf("UMA startup complete.\n");
1829    #endif
1830    }
1831

The printed lines appear to be wrong. The actual line (1827) does appear correct, though, and is consistent with what I guessed from my disassembly:

1827            pages += howmany(zones,
1828                UMA_SLAB_SPACE / roundup2(zsize, UMA_BOOT_ALIGN));

Sadly, without a dump it won't be possible to inspect the memory, and I realized I cannot obtain a crash dump because the panic happens immediately after the kernel is loaded, before userspace starts and therefore before any swap/dump device could be set up.
The incorrect lines seem to be coming from the 11-STABLE source tree on the server, which I guess makes sense. But line 1827 itself should still be correct...
(In reply to Konstantin Belousov from comment #8)

In fact it is clear where things go wrong: UMA_SLAB_SPACE is UMA_SLAB_SIZE - sizeof(struct uma_slab) == 3984. You have 2 14-core CPUs, hyperthreaded, which gives mp_maxid = 2 * 2 * 14 == 56. The number of vm domains is 2. Then zsize, according to gdb, is 3984. UMA_BOOT_ALIGN is 32, and roundup2(zsize, UMA_BOOT_ALIGN) is 4000. So the kernel tries to calculate howmany(zones, 3984 / 4000) == howmany(zones, 0). If you look at the definition of howmany, you will see the division by the second argument.
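For reference, here is a minimal userland sketch of that arithmetic with the values above hard-coded (the zone count is a placeholder; the two macros are as defined in sys/sys/param.h):

---
#include <stdio.h>

/* Macro definitions as in sys/sys/param.h. */
#define howmany(x, y)   (((x) + ((y) - 1)) / (y))
#define roundup2(x, y)  (((x) + ((y) - 1)) & (~((y) - 1)))	/* y must be a power of 2 */

int
main(void)
{
	int zsize = 3984;		/* value reported by gdb */
	int uma_slab_space = 3984;	/* UMA_SLAB_SIZE (4096) - sizeof(struct uma_slab) (112) */
	int uma_boot_align = 32;	/* UMA_BOOT_ALIGN */
	int zones = 4;			/* placeholder; the real count is computed in uma_startup_count() */

	int per_slab = uma_slab_space / roundup2(zsize, uma_boot_align);	/* 3984 / 4000 == 0 */
	printf("items per slab: %d\n", per_slab);
	/* howmany() divides by its second argument, so this is the divide-by-zero trap. */
	printf("pages: %d\n", howmany(zones, per_slab));
	return (0);
}
---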
Thanks Kostik. I was just preparing a patch to print out all the variables. So, the fix should be:

Index: uma_core.c
===================================================================
--- uma_core.c	(revision 331844)
+++ uma_core.c	(working copy)
@@ -1820,7 +1820,7 @@ uma_startup_count(int vm_zones)
 #endif
 
 	/* Memory for the rest of startup zones, UMA and VM, ... */
-	if (zsize > UMA_SLAB_SIZE)
+	if (roundup2(zsize, UMA_BOOT_ALIGN) > UMA_SLAB_SIZE)
 		pages += (zones + vm_zones) *
 		    howmany(roundup2(zsize, UMA_BOOT_ALIGN), UMA_SLAB_SIZE);
 	else
(In reply to Konstantin Belousov from comment #11)

That's amusing, I just came to the same conclusion by isolating the code into a standalone .c program with stubbed-out variables.

---
	/* Memory for the rest of startup zones, UMA and VM, ... */
	if (zsize > UMA_SLAB_SIZE) {
		printf("branch1\n");
		pages += (zones + vm_zones) *
		    howmany(roundup2(zsize, UMA_BOOT_ALIGN), UMA_SLAB_SIZE);
	} else {
		printf("branch2\n");
		printf("roundup2 %d\n", roundup2(zsize, UMA_BOOT_ALIGN));
		pages += howmany(zones,
		    UMA_SLAB_SPACE / roundup2(zsize, UMA_BOOT_ALIGN));
	}
---

---
$ ./foo
KSIZE 304
ZSIZE 3984
UMA_SLAB_SIZE 4096
sizeof uma_slab 112
UMA_SLAB_SPACE 3984
branch2
roundup2 4000
zsh: floating point exception  ./foo
---

So yeah, indeed, this seems to be the source of the error.
(In reply to Gleb Smirnoff from comment #12) That's great. Looks like we've come to a conclusion pretty quickly after all :)
Can you please confirm that the patch works for you?
(In reply to Gleb Smirnoff from comment #15) Daniel is relying on me to do custom builds for him to test. I'll perform a build this evening (US Eastern) for him to test with the patch.
(In reply to Gleb Smirnoff from comment #15) I can try, but if you could perhaps upload a kernel binary somewhere, that'd be great - otherwise I'll have to build on my HardenedBSD server, which has slow hardware.
Created attachment 192014 [details]
patched kernel

I built the kernel on my server and replaced it in the installer image, but it still fails at the same place. A backtrace photo is in the attachment; the kgdb output is here:

(kgdb) list *uma_startup_count+0xd6
0xffffffff80e3e7b6 is in uma_startup_count (/usr/src/sys/vm/uma_core.c:1827).
1822            /* Memory for the rest of startup zones, UMA and VM, ... */
1823            if (roundup2(zsize, UMA_BOOT_ALIGN) > UMA_SLAB_SIZE)
1824                    pages += (zones + vm_zones) *
1825                        howmany(roundup2(zsize, UMA_BOOT_ALIGN), UMA_SLAB_SIZE);
1826            else
1827                    pages += howmany(zones,
1828                        UMA_SLAB_SPACE / roundup2(zsize, UMA_BOOT_ALIGN));
1829
1830            /* ... and their kegs. Note that zone of zones allocates a keg! */
1831            pages += howmany(zones + 1,

As you can see from the source, it still fails in the very same place, and you can also see how I applied the patch. The uname in the photo confirms it is indeed the patched kernel.
Which, now that I'm looking at it, is actually pretty obvious: UMA_SLAB_SIZE is 4096 (i.e. the page size), and roundup2(zsize, UMA_BOOT_ALIGN) is 4000, which is bigger than zsize (3984) but still smaller than UMA_SLAB_SIZE, so the patched condition is still false and the else branch still divides by zero.
(In reply to Daniel Kolesa from comment #19)

Maybe the condition should instead be:

	if (roundup2(zsize, UMA_BOOT_ALIGN) > UMA_SLAB_SPACE)
Created attachment 192015 [details]
different panic after patch

I rebuilt the kernel once again, this time with my patch (the updated condition, see the comment above). It didn't panic there, but it did panic in another place. Backtrace attached. Here's the relevant kgdb listing:

0xffffffff80e3f015 is in keg_ctor (/usr/src/sys/vm/uma_core.c:1294).
1289                    shsize = 0;
1290            else
1291                    shsize = sizeof(struct uma_slab);
1292
1293            keg->uk_ipers = (slabsize - shsize) / rsize;
1294            KASSERT(keg->uk_ipers > 0 && keg->uk_ipers <= SLAB_SETSIZE,
1295                ("%s: keg->uk_ipers %u", __func__, keg->uk_ipers));
1296
1297            memused = keg->uk_ipers * rsize + shsize;
1298            wastedspace = slabsize - memused;

Looks like the assertion is failing this time.
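For reference, the failing computation boils down to the following (a minimal userland sketch; the rsize value of 4000 is an assumption, namely zsize 3984 rounded up to the 32-byte boot alignment):

---
#include <assert.h>
#include <stdio.h>

int
main(void)
{
	int slabsize = 4096;	/* UMA_SLAB_SIZE */
	int shsize = 112;	/* sizeof(struct uma_slab) */
	int rsize = 4000;	/* assumed: zsize 3984 rounded up to 32-byte alignment */

	int ipers = (slabsize - shsize) / rsize;	/* 3984 / 4000 == 0 */
	printf("uk_ipers = %d\n", ipers);
	assert(ipers > 0);	/* mirrors the KASSERT that fires in keg_ctor */
	return (0);
}
---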
I see the problem. The size of the zone fits into a single slab before alignment, but after alignment it doesn't. However, since we have only 1 item per slab, we don't need the alignment correction at all. I'm working on a patch.
(In reply to Gleb Smirnoff from comment #22)

Alright. Let me know when you have a patch so I can test it; since I'm not familiar with the source, this is not something I can patch myself, and blind patching wouldn't yield any reasonable results because buildkernel takes nearly 30 minutes on my server. So when there is a patch I'll try again and report back...
Created attachment 192017 [details]
Add a case for a situation when zsize fits into slab space, but not after an alignment correction

Let's try this. All previous changes need to be reverted.
(In reply to Gleb Smirnoff from comment #24) Running a fresh kernel build now.
(In reply to Gleb Smirnoff from comment #24)

This patch results in the same assertion failure (keg->uk_ipers 0) as my previous patch did. I'm not posting a backtrace because it's the same.
I'll try adding some debug logging tomorrow in order to try and figure out why the assertion is failing... unless anyone has a better idea.
Created attachment 192075 [details]
Add a case for a situation when zsize fits into slab space, but not after an alignment correction #2

Hmm, it looks like not only is there a bug in the fairly new uma_startup_count() function, but the old UMA core also needs special handling for this edge case. I have tested this patch while emulating your number of CPUs and domains. Please test it on the real hardware.
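For anyone reading along without the attachment, the idea of the special case is roughly the following (a hypothetical sketch only, not the attached patch; items_per_slab() is a made-up helper):

---
#include <stdio.h>

/*
 * Hypothetical sketch of the idea, not the attached patch: compute
 * items-per-slab as before, but if only the alignment padding pushed the
 * item size past the usable slab space, fall back to one item per slab,
 * since a single item placed at the start of a slab is naturally aligned.
 */
static int
items_per_slab(int size, int rsize, int slabsize, int shsize)
{
	int ipers = (slabsize - shsize) / rsize;

	if (ipers == 0 && size <= slabsize - shsize)
		ipers = 1;
	return (ipers);
}

int
main(void)
{
	/* Values from this report: zsize 3984, rounded up to 32 -> 4000. */
	printf("uk_ipers = %d\n", items_per_slab(3984, 4000, 4096, 112));
	return (0);
}
---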
(In reply to Gleb Smirnoff from comment #28) Alright, thanks, I'll test this tomorrow. If that still fails I'll at least provide a bunch of logs with the values so things can be debugged better.
Created attachment 192077 [details]
successful boot

(In reply to Gleb Smirnoff from comment #28)

Seems like there will be no need for that. I decided I wasn't going to wait until tomorrow and did a fresh build, and it booted successfully all the way into the installer now :) So unless there is breakage on other hardware, this is probably okay to commit. Thanks for the fast solution :)
A commit references this bug:

Author: glebius
Date: Mon Apr 2 05:12:00 UTC 2018
New revision: 331871
URL: https://svnweb.freebsd.org/changeset/base/331871

Log:
  Handle a special case when a slab can fit only one allocation, and zone
  has a large alignment. With alignment taken into account uk_rsize will
  be greater than space in a slab. However, since we have only one item
  per slab, it is always naturally aligned.

  Code that will panic before this change with 4k page:

	z = uma_zcreate("test", 3984, NULL, NULL, NULL, NULL, 31, 0);
	uma_zalloc(z, M_WAITOK);

  A practical scenario to hit the panic is a machine with 56 CPUs and 2
  NUMA domains, which yields in zone size of 3984.

  PR:		227116
  MFC after:	2 weeks

Changes:
  head/sys/vm/uma_core.c
A commit references this bug:

Author: glebius
Date: Mon Apr 2 05:14:32 UTC 2018
New revision: 331872
URL: https://svnweb.freebsd.org/changeset/base/331872

Log:
  In uma_startup_count() handle special case when zone will fit into
  single slab, but with alignment adjustment it won't. Again, when there
  is only one item in a slab alignment can be ignored. See previous
  revision of this file for more info.

  PR:		227116

Changes:
  head/sys/vm/uma_core.c
A commit references this bug:

Author: glebius
Date: Mon Apr 16 15:07:20 UTC 2018
New revision: 332572
URL: https://svnweb.freebsd.org/changeset/base/332572

Log:
  Merge r331871:

  Handle a special case when a slab can fit only one allocation, and zone
  has a large alignment. With alignment taken into account uk_rsize will
  be greater than space in a slab. However, since we have only one item
  per slab, it is always naturally aligned.

  Code that will panic before this change with 4k page:

	z = uma_zcreate("test", 3984, NULL, NULL, NULL, NULL, 31, 0);
	uma_zalloc(z, M_WAITOK);

  A practical scenario to hit the panic is a machine with 56 CPUs and 2
  NUMA domains, which yields in zone size of 3984 (on head).

  PR:		227116

Changes:
  _U  stable/11/
  stable/11/sys/vm/uma_core.c