Hello, when I try to boot the latest current image on my AMD Epyc system, I get this panic message on boot:

"cannot find a large enough size"

My specs:

Super Micro H11DSi motherboard: https://www.supermicro.com/en/products/motherboard/H11DSi
Dual AMD Epyc 7601 32 core processors @ 2.20GHz base frequency.
128 GB of ECC RAM, 4x 32GB.
Hm, the panic and panic message are less than ideal. This indicates that an allocation targeted at a specific memory domain couldn't find a big enough span during early boot. Do you have some memory domains that are not populated? In fact, I am sure that is true given you've got a 2P configuration with only 4 DIMMs. So the bug here seems to be that this allocation is attempted, and panics, when a memory domain can legitimately be unpopulated (such as on 2990WX). We used to boot fine on 2990WX, so perhaps this is a regression.
Just tried booting a 12.1-STABLE stick, and that booted fine!
(In reply to Conrad Meyer from comment #1)
Hmm. In the one place where this function is used, we are allocating memory to back the vm page array. The selected domain is that of the corresponding page frame, which means that the domain contains at least one physical page, so an empty domain shouldn't be causing problems. This code has also been in the tree for a while, and I feel like we usually find out pretty quickly when a regression causes 2990WX to fail to boot.

(In reply to Rafael Kitover from comment #2)
Could you grab the output of sysctl vm.phys_segs while you have 12.1 booted?
(In reply to Mark Johnston from comment #3) Might be the e820 or EFI metadata is somewhat erroneous? Just speculating at this point. phys_segs should help. @Rafael, bootverbose mode should print out the BIOS or EFI metadata we use for understanding physical memory layout before the panic. 12.1 may print the same information, but I'm not sure. And thanks for the report, and your help investigating!
Here you go:

vm.phys_segs:
SEGMENT 0:  start: 0x10000       end: 0x9f000       domain: 1  free list: 0xffffffff81f0a170
SEGMENT 1:  start: 0x100000      end: 0x200000      domain: 1  free list: 0xffffffff81f0a170
SEGMENT 2:  start: 0x200000      end: 0x1000000     domain: 0  free list: 0xffffffff81f09a20
SEGMENT 3:  start: 0x1000000     end: 0x2483000     domain: 0  free list: 0xffffffff81f097b0
SEGMENT 4:  start: 0x248b000     end: 0x24bf000     domain: 0  free list: 0xffffffff81f097b0
SEGMENT 5:  start: 0x2700000     end: 0x75db0000    domain: 1  free list: 0xffffffff81f09f00
SEGMENT 6:  start: 0x76000000    end: 0xb9eb9000    domain: 1  free list: 0xffffffff81f09f00
SEGMENT 7:  start: 0xb9f31000    end: 0xb9fe5000    domain: 1  free list: 0xffffffff81f09f00
SEGMENT 8:  start: 0xbad36000    end: 0xbc000000    domain: 1  free list: 0xffffffff81f09f00
SEGMENT 9:  start: 0x100000000   end: 0x840000000   domain: 1  free list: 0xffffffff81f09c90
SEGMENT 10: start: 0x840000000   end: 0x1040000000  domain: 2  free list: 0xffffffff81f0a3e0
SEGMENT 11: start: 0x1040000000  end: 0x1840000000  domain: 4  free list: 0xffffffff81f0b280
SEGMENT 12: start: 0x1840000000  end: 0x1f71b2e000  domain: 7  free list: 0xffffffff81f0c870
So yep, we have 5 domains but only 4 DIMMs. Domains 2 and 4 each have exactly 32 GB. Domain 7 has ~28 GB, and domain 1 has ~32 GB. Domain 0 has only ~34 MB. That one seems problematic.
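The per-domain figures above can be checked by summing segment sizes from the vm.phys_segs output. The following is only a minimal sketch: the function name `bytes_per_domain` and the regular expression are illustrative assumptions, not part of any FreeBSD tool, and the sample text is an excerpt of the output posted in this report with the free-list column left out.

```python
import re
from collections import defaultdict

# Excerpt of the vm.phys_segs output from this report (free-list column omitted).
PHYS_SEGS = """
SEGMENT 2: start: 0x200000 end: 0x1000000 domain: 0
SEGMENT 3: start: 0x1000000 end: 0x2483000 domain: 0
SEGMENT 4: start: 0x248b000 end: 0x24bf000 domain: 0
SEGMENT 10: start: 0x840000000 end: 0x1040000000 domain: 2
SEGMENT 11: start: 0x1040000000 end: 0x1840000000 domain: 4
"""

# Hypothetical pattern: matches one "start ... end ... domain" triple,
# whether the fields sit on one line or are split across several.
SEGMENT_RE = re.compile(
    r"start:\s*(0x[0-9a-fA-F]+)\s+end:\s*(0x[0-9a-fA-F]+)\s+domain:\s*(\d+)"
)


def bytes_per_domain(text):
    """Return {domain: total bytes} by summing (end - start) per segment."""
    totals = defaultdict(int)
    for start, end, domain in SEGMENT_RE.findall(text):
        totals[int(domain)] += int(end, 16) - int(start, 16)
    return dict(totals)


if __name__ == "__main__":
    for domain, size in sorted(bytes_per_domain(PHYS_SEGS).items()):
        print(f"domain {domain}: {size / 2**20:.1f} MiB")
```

With the three domain-0 segments from the report, the total is 36401152 bytes, roughly 34.7 MiB, while domains 2 and 4 each come out to exactly 32 GiB.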
Just a guess, but could the 34MB be the video memory of the crappy video chip on my server motherboard? Or something like that anyway. Here is a photo of the panic, by the way: https://photos.app.goo.gl/c6neWKTFFnyzdQpy6 I was not able to capture the earlier bootup logs: in 12.1, the /var/run/dmesg.boot file does not go back far enough with boot_verbose set, and my keyboard doesn't work during the bootup process, so I cannot pause it.
(In reply to Rafael Kitover from comment #7) Probably the 34MB is wrongly carved from some other domain to avoid having an unpopulated domain 0. Is it possible the DIMMs are installed in an order not recommended by the motherboard vendor? Usually crappy on-board video chips just borrow some system RAM, and that allocation wouldn't be visible to the host OS (IIRC). Re: capturing earlier bootup logs, you can put 'kern.msgbufsize="512k"' in /boot/loader.conf (or similar) to use a larger-than-default msgbufsize. The default msgbuf size on CURRENT and STABLE/12 is 96 kB unless configured otherwise.
Created attachment 214737: 12.1-STABLE /var/run/dmesg.boot
I set this variable at the loader prompt, and my full dmesg.boot for 12.1-STABLE is attached.
Hm, so this must be us synthesizing a bogus 0-domain:

SRAT: Found memory domain 1 addr 0x0 len 0xa0000: enabled
SRAT: Found memory domain 1 addr 0x100000 len 0xbff00000: enabled
SRAT: Found memory domain 1 addr 0x100000000 len 0x740000000: enabled
SRAT: Found memory domain 2 addr 0x840000000 len 0x800000000: enabled
SRAT: Found memory domain 4 addr 0x1040000000 len 0x800000000: enabled
SRAT: Found memory domain 7 addr 0x1840000000 len 0x800000000: enabled
SRAT: mem dom 0 is empty
SRAT: mem dom 3 is empty
SRAT: mem dom 5 is empty
SRAT: mem dom 6 is empty
(In reply to Conrad Meyer from comment #11) Ah, it'll be the code which registers vm_phys segments for preloaded data and the kernel page tables. It hard-codes domain 0. Maybe we can defer the registration until after SRAT has been parsed.
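To illustrate the ordering problem described here, the following toy model contrasts registering a segment before the SRAT is known (domain 0 assumed) with registering it afterwards (domain looked up from the SRAT ranges). It is a sketch only, not the kernel code or the patch: `parse_srat`, `domain_of`, `register_early`, and `register_deferred` are hypothetical names, and the lookup uses only the segment's start address, which is a simplification.

```python
import re
from typing import List, NamedTuple, Tuple


class SratRange(NamedTuple):
    start: int
    end: int      # exclusive
    domain: int


# SRAT lines as printed in the verbose boot log from this report.
SRAT_LOG = """
SRAT: Found memory domain 1 addr 0x0 len 0xa0000: enabled
SRAT: Found memory domain 1 addr 0x100000 len 0xbff00000: enabled
SRAT: Found memory domain 1 addr 0x100000000 len 0x740000000: enabled
SRAT: Found memory domain 2 addr 0x840000000 len 0x800000000: enabled
SRAT: Found memory domain 4 addr 0x1040000000 len 0x800000000: enabled
SRAT: Found memory domain 7 addr 0x1840000000 len 0x800000000: enabled
"""

SRAT_RE = re.compile(
    r"Found memory domain (\d+) addr (0x[0-9a-fA-F]+) len (0x[0-9a-fA-F]+): enabled"
)


def parse_srat(log: str) -> List[SratRange]:
    """Turn 'addr/len' log lines into [start, end) ranges tagged with a domain."""
    ranges = []
    for domain, addr, length in SRAT_RE.findall(log):
        start = int(addr, 16)
        ranges.append(SratRange(start, start + int(length, 16), int(domain)))
    return ranges


def domain_of(addr: int, ranges: List[SratRange]) -> int:
    """Return the domain whose SRAT range contains addr."""
    for r in ranges:
        if r.start <= addr < r.end:
            return r.domain
    raise LookupError(f"address {addr:#x} is not covered by any SRAT range")


def register_early(start: int, end: int) -> Tuple[int, int, int]:
    # Models the problematic ordering: SRAT not parsed yet, domain 0 hard-coded.
    return (start, end, 0)


def register_deferred(start: int, end: int, ranges: List[SratRange]) -> Tuple[int, int, int]:
    # Models the proposed ordering: register only once the SRAT ranges are known.
    return (start, end, domain_of(start, ranges))


if __name__ == "__main__":
    srat = parse_srat(SRAT_LOG)
    print(register_early(0x200000, 0x1000000))
    print(register_deferred(0x200000, 0x1000000, srat))
```

In this model, the segment starting at 0x200000 (reported in domain 0 by the 12.1 output) falls inside the SRAT range for domain 1, so a deferred registration would place it there instead of creating a tiny domain 0.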
Should be fixed by https://reviews.freebsd.org/D25001
I have current on my laptop, I will build a kernel with this patch and put it on the stick to try on my epyc as soon as possible.
(In reply to Rafael Kitover from comment #14) Awesome, thank you!
I built current master with the patch. It boots and works fine on my Intel + NVIDIA laptop. Unfortunately, it produces pretty much the same panic on my Epyc. Here is a screenshot: https://photos.app.goo.gl/q1W2crGNMnJt8gxW8

Initially this was the panic: https://photos.app.goo.gl/c6neWKTFFnyzdQpy6

The only address that is different in the backtrace is the one for pmap_page_array_startup().

In case I did anything wrong, this is the process that I used to test:

1. I expanded the UFS filesystem on my current stick.
2. I applied the patch I downloaded from that change review page to my clone in /usr/src.
3. I ran `sudo make -j12 kernel` to build and install the kernel on my laptop.
4. With the stick mounted at /mnt, I ran `sudo make installkernel DESTDIR=/mnt` to install the kernel to the stick.
(In reply to Rafael Kitover from comment #16) I see, thanks. I uploaded a new version of the patch to that review - could you please test it? I believe your procedure for updating the kernel is correct.
Just tried, it boots and works!

sysctl -n hw.physmem:
137303121920

sysctl -n vm.phys_segs:
SEGMENT 0:  start: 0x10000       end: 0x9f000       domain: 1  free list: 0xffffffff81db5670
SEGMENT 1:  start: 0x100000      end: 0x1000000     domain: 1  free list: 0xffffffff81db5670
SEGMENT 2:  start: 0x1000000     end: 0x212a000     domain: 1  free list: 0xffffffff81db5400
SEGMENT 3:  start: 0x2132000     end: 0x2164000     domain: 1  free list: 0xffffffff81db5400
SEGMENT 4:  start: 0x2300000     end: 0x75db0000    domain: 1  free list: 0xffffffff81db5400
SEGMENT 5:  start: 0x76000000    end: 0xb9eb9000    domain: 1  free list: 0xffffffff81db5400
SEGMENT 6:  start: 0xb9f31000    end: 0xb9fe5000    domain: 1  free list: 0xffffffff81db5400
SEGMENT 7:  start: 0xbad36000    end: 0xbc000000    domain: 1  free list: 0xffffffff81db5400
SEGMENT 8:  start: 0x100001000   end: 0x80c000000   domain: 1  free list: 0xffffffff81db5190
SEGMENT 9:  start: 0x840001000   end: 0x1009e00000  domain: 2  free list: 0xffffffff81db58e0
SEGMENT 10: start: 0x103f800000  end: 0x103f984000  domain: 2  free list: 0xffffffff81db58e0
SEGMENT 11: start: 0x1040001000  end: 0x180b5ec000  domain: 4  free list: 0xffffffff81db6780
SEGMENT 12: start: 0x1840001000  end: 0x200ae00000  domain: 7  free list: 0xffffffff81db7d70
SEGMENT 13: start: 0x203ee00000  end: 0x203efe6000  domain: 7  free list: 0xffffffff81db7d70
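One way to confirm that the patched kernel no longer synthesizes a domain is to check that no segment is assigned to a domain the SRAT reported as empty. The sketch below does this for the segment listing in this comment and for an excerpt of the earlier 12.1 listing. The helper names `domains_in` and `bogus_domains` are assumptions made for illustration, and the free-list column is omitted from the sample text.

```python
import re

# Domains the SRAT reported as empty in the verbose boot log from this report.
SRAT_EMPTY_DOMAINS = {0, 3, 5, 6}

# Excerpt of the 12.1-STABLE listing posted earlier in this report.
BEFORE_PATCH = """
SEGMENT 1: start: 0x100000 end: 0x200000 domain: 1
SEGMENT 2: start: 0x200000 end: 0x1000000 domain: 0
SEGMENT 3: start: 0x1000000 end: 0x2483000 domain: 0
"""

# Listing from the patched kernel in this comment.
AFTER_PATCH = """
SEGMENT 0: start: 0x10000 end: 0x9f000 domain: 1
SEGMENT 1: start: 0x100000 end: 0x1000000 domain: 1
SEGMENT 2: start: 0x1000000 end: 0x212a000 domain: 1
SEGMENT 3: start: 0x2132000 end: 0x2164000 domain: 1
SEGMENT 4: start: 0x2300000 end: 0x75db0000 domain: 1
SEGMENT 5: start: 0x76000000 end: 0xb9eb9000 domain: 1
SEGMENT 6: start: 0xb9f31000 end: 0xb9fe5000 domain: 1
SEGMENT 7: start: 0xbad36000 end: 0xbc000000 domain: 1
SEGMENT 8: start: 0x100001000 end: 0x80c000000 domain: 1
SEGMENT 9: start: 0x840001000 end: 0x1009e00000 domain: 2
SEGMENT 10: start: 0x103f800000 end: 0x103f984000 domain: 2
SEGMENT 11: start: 0x1040001000 end: 0x180b5ec000 domain: 4
SEGMENT 12: start: 0x1840001000 end: 0x200ae00000 domain: 7
SEGMENT 13: start: 0x203ee00000 end: 0x203efe6000 domain: 7
"""


def domains_in(text):
    """Return the set of domain numbers that appear in a phys_segs listing."""
    return {int(d) for d in re.findall(r"domain:\s*(\d+)", text)}


def bogus_domains(text, empty=SRAT_EMPTY_DOMAINS):
    """Return domains that own segments even though the SRAT says they are empty."""
    return domains_in(text) & set(empty)


if __name__ == "__main__":
    print("before patch:", sorted(bogus_domains(BEFORE_PATCH)))
    print("after patch: ", sorted(bogus_domains(AFTER_PATCH)))
```

For the listings in this report, the earlier output flags domain 0, while the patched output uses only domains 1, 2, 4, and 7, which are the domains the SRAT lists as populated.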
(In reply to Rafael Kitover from comment #18) Perfect. Thank you!
I forgot to tag the PR. Committed in https://svnweb.freebsd.org/changeset/base/361595