Hello, I experienced kernel panics after an update on real hardware. Please see https://moonfall.crypts.me/ieDae8HaN4Jo/crash.tar.bz2
Please add the backtrace / panic message as a comment to the PR
Created attachment 225621
panic message 1

This is what I've managed to capture on the remote screen.
Created attachment 225622
panic message 2
The server works under load for some time, then crashes and reboots. It's hard to get much info besides what ended up in /var/crash, though I'm not a FreeBSD expert and may be missing some debugging steps.
Please compile a kernel with

options INVARIANT_SUPPORT
options INVARIANTS

added to GENERIC, per https://docs.freebsd.org/en/books/handbook/kernelconfig/ Once that's done, boot the new kernel and try to reproduce the original panic. Please report the panic message and stack here.
Well, actually this system is in use! :) It's not for debugging purposes! I thought crash dumps were for cases like this.
(In reply to crypt47 from comment #6) That's true, but some panics are much easier to debug with INVARIANTS turned on. In any case, to look at the cores I need the matching contents of /boot/kernel and /usr/lib/debug/boot/kernel, assuming that you are indeed running a stable/13 kernel. Also the core.txt.* files are not populated, try installing the gdb package.
I see your point. Will think about that.

> assuming that you are indeed running a stable/13 kernel.

I always try to do things by the book, especially while I'm still learning the system. Everything (binaries, debugging symbols, etc.) should be as in default 13.0-RELEASE after freebsd-update.
(In reply to crypt47 from comment #8) Ok, I was going off of the fact that the version field is set to 13.0-STABLE. Based on the vmcores you're running unpatched 13.0-RELEASE. There's one bug fixed in p1 which might be relevant, since you appear to be using ipfw: https://www.freebsd.org/security/advisories/FreeBSD-EN-21:12.divert.asc Do you have any rules involving divert sockets?
One more thing, Mark. I can't run the new release for more than 10-15 minutes before it hangs. And this is a remote system. So if I need a new kernel after the upgrade, I probably can't build it right there; I have to build it somewhere else, in a VM. And currently I don't know FreeBSD well enough to do such tricks.
> Do you have any rules involving divert sockets?

I haven't added any rules for that explicitly. I use standard firewall rules for filtering and NAT. But I guess the system hangs during poudriere activity. And the error on the screenshot says 'page fault' in a related utility.
(In reply to crypt47 from comment #11) Ok, I see that ipdivert.ko isn't loaded so the EN I mentioned isn't relevant. Looking at the cores I see a number of different stack traces that seem to indicate that mbuf chains are being corrupted. It'll be hard to narrow this down without extra debugging options enabled in the kernel.
What is the best way to build a FreeBSD kernel on one machine and then install it on another, together with all the supplementary debug files (/usr/lib/debug/boot/kernel)? As far as I understand, this cannot be done via packages? Do I have to manually copy the custom files over the default GENERIC and then reboot? How do two module directories (/boot/kernel) coexist on the same system?
Do I have to update the base system up to FreeBSD 13 too, or will just the kernel be enough to test?
(In reply to crypt47 from comment #13)
On one host you'd run

$ <check out FreeBSD sources, use the releng/13.0 branch>
$ <edit sys/amd64/conf/GENERIC to enable INVARIANTS>
$ make buildkernel
$ tmpdir=$(mktemp -d)
# make installkernel DESTDIR=$tmpdir

Then copy over the ${tmpdir}/boot/kernel and ${tmpdir}/usr/lib/debug/boot/kernel directories to the affected host. Beforehand, rename the existing ones to /boot/kernel.old and /usr/lib/debug/boot/kernel.old. This way you can still select the old kernel from the loader. There's no need to update userspace, just the kernel.
Thank you for the detailed instruction. Please, see https://moonfall.crypts.me/ohuvi8vaeDoh/
Thanks, this helps. So we're getting #GP in the bridge transmit code, seemingly because the mbuf was freed at some point. With INVARIANTS enabled, UMA trashing makes the panic deterministic, all stacks look like this:

#7 <signal handler called>
#8 bridge_rthash (sc=0xfffff8000fdca400, addr=0xdeadc0dedeadc0de <error: Cannot access memory at address 0xdeadc0dedeadc0de>) at /freebsdsrc/sys/net/if_bridge.c:2970
#9 bridge_rtnode_lookup (sc=sc@entry=0xfffff8000fdca400, addr=addr@entry=0xdeadc0dedeadc0de <error: Cannot access memory at address 0xdeadc0dedeadc0de>, vlan=vlan@entry=1) at /freebsdsrc/sys/net/if_bridge.c:3011
#10 0xffffffff82b2d3b2 in bridge_rtlookup (sc=0xfffff8000fdca400, addr=0xdeadc0dedeadc0de <error: Cannot access memory at address 0xdeadc0dedeadc0de>, vlan=1) at /freebsdsrc/sys/net/if_bridge.c:2769
#11 bridge_transmit (ifp=0xfffff8000f92b000, m=0xfffff800233f5000) at /freebsdsrc/sys/net/if_bridge.c:2170
#12 0xffffffff80d1bb1b in ether_output_frame (ifp=ifp@entry=0xfffff8000f92b000, m=0x0) at /freebsdsrc/sys/net/if_ethersubr.c:511
#13 0xffffffff80d1ba21 in ether_output (ifp=<optimized out>, m=0x0, dst=0xfffffe00351795a0, ro=<optimized out>) at /freebsdsrc/sys/net/if_ethersubr.c:438
#14 0xffffffff80db199f in ip_output_send (inp=inp@entry=0x0, ifp=0xffffffff81d38ef0 <w_locklistdata+276896>, ifp@entry=0xfffff8000f92b000, m=m@entry=0xfffff80023804e00, gw=gw@entry=0xfffffe00351795a0, ro=0x246, ro@entry=0x0, stamp_tag=<optimized out>) at /freebsdsrc/sys/netinet/ip_output.c:275
#15 0xffffffff80db1655 in ip_output (m=m@entry=0xfffff80023804e00, opt=opt@entry=0x0, ro=<optimized out>, ro@entry=0x0, flags=<optimized out>, flags@entry=0, imo=imo@entry=0x0, inp=<optimized out>, inp@entry=0x0) at /freebsdsrc/sys/netinet/ip_output.c:812
#16 0xffffffff80dabf8a in icmp_send (m=0xfffff80023804e00, opts=0x0) at /freebsdsrc/sys/netinet/ip_icmp.c:1017
#17 icmp_reflect (m=<optimized out>, m@entry=0xfffff80023804e00) at /freebsdsrc/sys/netinet/ip_icmp.c:929
#18 0xffffffff80dab9ce in icmp_error (n=0xfffff80023804b00, type=type@entry=5, code=<optimized out>, code@entry=1, dest=0, mtu=<optimized out>, mtu@entry=0) at /freebsdsrc/sys/netinet/ip_icmp.c:393
#19 0xffffffff80daafd7 in ip_tryforward (m=<optimized out>, m@entry=0xfffff8007db10c00) at /freebsdsrc/sys/netinet/ip_fastfwd.c:511
#20 0xffffffff80dad930 in ip_input (m=0xfffff8007db10c00) at /freebsdsrc/sys/netinet/ip_input.c:579
#21 0xffffffff80d38b31 in netisr_dispatch_src (proto=1, source=source@entry=0, m=0xfffff8007db10c00) at /freebsdsrc/sys/net/netisr.c:1143
#22 0xffffffff80d38e7f in netisr_dispatch (proto=2177714816, m=0x1) at /freebsdsrc/sys/net/netisr.c:1234
#23 0xffffffff80d1bcbe in ether_demux (ifp=ifp@entry=0xfffff8000f92b000, m=0x0) at /freebsdsrc/sys/net/if_ethersubr.c:923
#24 0xffffffff80d1d371 in ether_input_internal (ifp=0xfffff8000f92b000, m=0x0) at /freebsdsrc/sys/net/if_ethersubr.c:709
#25 ether_nh_input (m=<optimized out>) at /freebsdsrc/sys/net/if_ethersubr.c:739
#26 0xffffffff80d38b31 in netisr_dispatch_src (proto=proto@entry=5, source=source@entry=0, m=m@entry=0xfffff8007db10c00) at /freebsdsrc/sys/net/netisr.c:1143
#27 0xffffffff80d38e7f in netisr_dispatch (proto=2177714816, proto@entry=5, m=0x1, m@entry=0xfffff8007db10c00) at /freebsdsrc/sys/net/netisr.c:1234
#28 0xffffffff80d1c1b1 in ether_input (ifp=0xfffff80003ec3800, m=0xfffff8007db10c00) at /freebsdsrc/sys/net/if_ethersubr.c:830
#29 0xffffffff80d34bf7 in iflib_rxeof (rxq=<optimized out>, rxq@entry=0xfffff80003ec3000, budget=<optimized out>) at /freebsdsrc/sys/net/iflib.c:3006
#30 0xffffffff80d2e76a in _task_fn_rx (context=0xfffff80003ec3000) at /freebsdsrc/sys/net/iflib.c:3949
#31 0xffffffff80c439e7 in gtaskqueue_run_locked (queue=queue@entry=0xfffff800039af300) at /freebsdsrc/sys/kern/subr_gtaskqueue.c:371
#32 0xffffffff80c437e4 in gtaskqueue_thread_loop (arg=arg@entry=0xfffffe0038ff2008) at /freebsdsrc/sys/kern/subr_gtaskqueue.c:547
#33 0xffffffff80bb6120 in fork_exit (callout=0xffffffff80c43750 <gtaskqueue_thread_loop>, arg=0xfffffe0038ff2008, frame=0xfffffe0035179c00)
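For readers wondering why every stack shows addr=0xdeadc0dedeadc0de: the idea behind trashing is that a released object is overwritten with a recognizable junk pattern, so a stale pointer read from it faults on a predictable address instead of silently using leftover data. The following is only a minimal userland sketch of that idea, not the kernel's UMA code; struct pkt, trash_fill() and looks_trashed() are hypothetical names, and the pattern is simply the value seen in the backtrace.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Junk pattern; chosen to match the address seen in the backtrace. */
#define TRASH_PATTERN UINT64_C(0xdeadc0dedeadc0de)

/* Hypothetical object that carries a pointer, loosely like a packet header. */
struct pkt {
	const uint8_t *addr;
	uint64_t       len;
};

/* Overwrite a released object with the junk pattern, one 64-bit word at a time. */
static void
trash_fill(void *mem, size_t size)
{
	uint64_t pat = TRASH_PATTERN;
	unsigned char *p = mem;
	size_t i;

	assert(mem != NULL);
	for (i = 0; i + sizeof(pat) <= size; i += sizeof(pat))
		memcpy(p + i, &pat, sizeof(pat));
}

/* Return nonzero if a pointer value looks like it was read from trashed memory. */
static int
looks_trashed(const void *ptr)
{
	return ((uintptr_t)ptr == (uintptr_t)TRASH_PATTERN);
}
```

Calling trash_fill(&p, sizeof(p)) on a struct pkt and then reading p.addr yields the pattern, which is the userland analogue of bridge_rthash() receiving 0xdeadc0dedeadc0de from a freed mbuf.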
Glad to hear! :) I don't know what #GP is, but the same setup works flawlessly on 12.2, so a diff may help as well.
(In reply to crypt47 from comment #18) Could you please confirm the exact FreeBSD version (uname -a) and attach the files from the previous external links here as attachments? Thank you!
Yes, here is uname:

moonfall ~ # uname -a
FreeBSD moonfall.crypts.me 12.2-RELEASE-p7 FreeBSD 12.2-RELEASE-p7 GENERIC amd64
moonfall ~ # uname -UK
1202000 1202000

As for attachments, the site says (File size limit: 1000 KB), so I didn't even try.
(In reply to crypt47 from comment #20) Did the original panic occur on 12.2-p7, or 13.0 ? Trying to verify the version, as it was reported under 13.0-STABLE, but comment 9 references 13.0-RELEASE (no patches)
> Did the original panic occur on 12.2-p7, or 13.0 ?

It certainly didn't occur with the 12.2 kernel, but it did occur with the 12.2 userland. As for STABLE and RELEASE, I always mix up these FreeBSD terms. :( It first occurred after I executed (and rebooted)

> freebsd-update -F --currently-running 12.2-RELEASE upgrade -r 13.0-RELEASE

and the latest crash dumps are from the latest kernel built from releng/13.0, as Mark instructed.
(In reply to crypt47 from comment #18) #GP is general protection fault in Intel terminology. All of the stack traces go through icmp_error(). ip_output() is transmitting a valid-looking mbuf, but when control flow reaches bridge_transmit(), we have a different, trashed mbuf. I'd guess that the problem is in a packet filter, it seems you have pf and ipfw both loaded. There's a large diff between 12.2 and 13.0 in that code. Would you be willing to share your rulesets somehow?
> Would you be willing to share your rulesets somehow?

Yes, I simplified my running config just for testing, as follows:

1) I've disabled the pf module. Here is what I managed to get just before the crash (sorry, linux alias):

moonfall ~ # lsmod
Id Refs Address                Size Name
 1   51 0xffffffff80200000  1f18f80 kernel
 2    1 0xffffffff82119000   6724a0 zfs.ko
 3    2 0xffffffff8278c000    4eaf0 ipfw.ko
 4    1 0xffffffff827db000     8ba8 ipfw_nat.ko
 5    2 0xffffffff827e4000    176a8 libalias.ko
 6    1 0xffffffff827fc000     4240 coretemp.ko
 7    1 0xffffffff82b20000     3378 acpi_wmi.ko
 8    1 0xffffffff82b24000     3250 ichsmb.ko
 9    1 0xffffffff82b28000     2180 smbus.ko
10    1 0xffffffff82b2b000     7638 if_bridge.ko
11    1 0xffffffff82b33000     50d8 bridgestp.ko
12    1 0xffffffff82b39000    378f8 linux.ko
13    2 0xffffffff82b71000     db70 linux_common.ko
14    1 0xffffffff82b7f000    30ac8 linux64.ko
15    1 0xffffffff82bb0000     2260 pty.ko
16    1 0xffffffff82c00000   53f420 vmm.ko
17    1 0xffffffff82bb3000     21cc nmdm.ko

2) The only (mandatory) rules in the config that should have been loaded at the moment of testing were:

00009     2        0 deny ip from any to any MAC any 02:00:00:00:00:00/8 layer2 via em0
65534 62027 22596555 allow ip from any to any
65535    13      320 allow ip from any to any
Command format for the ipfw rule above, just in case:

ipfw add 9 deny all from any to any MAC any "02:68:39:f0:44:0b&ff:00:00:00:00:00" layer2 via em0
BTW, can somebody point me at an ipfw packet flow diagram? It drives me crazy that one packet may be injected into ipfw several times, and that some directives stop its further flow and some don't. Example for just one ICMP packet that passed two rules 5 times:

00302 2 168 setfib 1 ip from any to any tagged 1
00304 3 252 allow tag 1 icmp from x.x.x.x to any
00306 0   0 setfib 1 ip from any to any tagged 1
00308 0   0 allow log icmp from x.x.x.x to any
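The rule above is given with an address and a mask (ff:00:00:00:00:00), while the rule listing shows it as 02:00:00:00:00:00/8. Assuming the usual address-and-mask semantics, where only the bits selected by the mask are compared, that would explain the /8: only the first octet matters. A minimal sketch of such a comparison follows; mac_match_masked() is a hypothetical helper written for illustration and is not taken from the ipfw sources.

```c
#include <stdint.h>

#define MAC_LEN 6

/* Return 1 if addr equals want on every bit selected by mask, else 0. */
static int
mac_match_masked(const uint8_t addr[MAC_LEN], const uint8_t want[MAC_LEN],
    const uint8_t mask[MAC_LEN])
{
	int i;

	for (i = 0; i < MAC_LEN; i++) {
		/* Bits cleared in the mask are ignored on both sides. */
		if ((addr[i] & mask[i]) != (want[i] & mask[i]))
			return (0);
	}
	return (1);
}
```

Under that assumption, any address whose first octet is 0x02 matches the rule's pattern, regardless of the remaining five octets.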
(In reply to crypt47 from comment #26) I don't know of any diagrams except the one in the ipfw man page, in the packet flows section. I've been staring at the ipfw code for a while, especially anything related to link-layer processing, without any luck so far. Are you able to verify that the crashes go away if you disable ipfw entirely?
Eh, I see a problem now. ipfw_check_frame() is failing to update *p.m after calling ipfw_chk(). Looks like a bug in https://svnweb.freebsd.org/base?view=revision&revision=345166 The L3 filter, ipfw_check_packet(), handles this correctly.
Created attachment 225811
proposed patch

Could you please test the attached patch?
The server has been working with this patch for some time, so it probably fixes the problem. But I couldn't do full testing because the first thing I noticed was broken NAT with pf. :(
(In reply to crypt47 from comment #30) You mean, there is some new regression in pf when the patch is applied? Are you still testing 13.0-RELEASE?
I haven't had much time for testing. I used the FreeBSD 13 kernel with the 12.2 userland. One of the jails is set up to use pf for NAT. With the test kernel it didn't start, with an error like "kernel module is unavailable" or something. Since it didn't run smoothly, I had to switch back, at least until a more suitable time for testing, in case that is required.
^Triage: Loop in committer of base r345166. No MFC in commit log message, unsure if it ended up merged or not.
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=bc6a2267fffeafd3946637607a74cfd639398f9d

commit bc6a2267fffeafd3946637607a74cfd639398f9d
Author:     Mark Johnston <markj@FreeBSD.org>
AuthorDate: 2021-06-16 13:46:56 +0000
Commit:     Mark Johnston <markj@FreeBSD.org>
CommitDate: 2021-06-16 13:46:56 +0000

    ipfw: Update the pfil mbuf pointer in ipfw_check_frame()

    ipfw_chk() might call m_pullup() and thus can change the mbuf chain
    head. In this case, the new chain head has to be returned to the pfil
    hook caller, otherwise the pfil hook caller is left with a dangling
    pointer.

    Note that this affects only the link-layer hooks installed when the
    net.link.ether.ipfw sysctl is set to 1.

    PR: 256439, 254015, 255069, 255104
    Fixes: f355cb3e6
    Reviewed by: ae
    MFC after: 3 days
    Sponsored by: The FreeBSD Foundation
    Differential Revision: https://reviews.freebsd.org/D30764

 sys/netpfil/ipfw/ip_fw_pfil.c | 2 ++
 1 file changed, 2 insertions(+)
So, this is already committed somewhere? No need to check the pf NAT issue?
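The failure mode described in the commit message can be modelled in userland C. This is a simplified sketch under stated assumptions, not the actual ipfw code: struct buf, buf_pullup(), filter_chk() and hook_fixed() are hypothetical stand-ins for the mbuf chain head, m_pullup(), ipfw_chk() and ipfw_check_frame(), and the stand-in pullup always reallocates so that the head always changes.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an mbuf chain head. */
struct buf {
	size_t        len;
	unsigned char data[64];
};

/*
 * Stand-in for a pullup-style helper: it moves the contents into a new
 * buffer and frees the old one. On allocation failure it frees the old
 * buffer and returns NULL. Callers must use the returned pointer.
 */
static struct buf *
buf_pullup(struct buf *b)
{
	struct buf *n = malloc(sizeof(*n));

	if (n != NULL)
		memcpy(n, b, sizeof(*n));
	free(b);
	return (n);
}

/* Stand-in for the inner filter: it may replace the head through bp. */
static int
filter_chk(struct buf **bp)
{
	*bp = buf_pullup(*bp);
	return (*bp != NULL);	/* 1 = pass, 0 = dropped */
}

/*
 * Stand-in for the hook. The caller hands in a pointer to its own head
 * pointer; the hook works on a local copy and must write it back.
 */
static int
hook_fixed(struct buf **headp)
{
	struct buf *m;
	int pass;

	assert(headp != NULL && *headp != NULL);
	m = *headp;
	pass = filter_chk(&m);
	/* Without this write-back the caller keeps a pointer to freed memory. */
	*headp = m;
	return (pass);
}
```

Removing the write-back line in hook_fixed() reproduces the shape of the bug: the caller continues with the old head, which has already been freed.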
^Triage: Assign to committer resolving, update status, pending MFC
(In reply to crypt47 from comment #35) The ipfw bug is fixed in the development branch now, it will be fixed shortly in stable/13 and will be included in 13.0-RELEASE-p3. I have no idea what the pf nat issue is, comment 32 does not make it clear to me. Can you show the errors you get when trying to load the pf kernel module? It sounds like there is some mismatch between the running kernel and the installed kernel modules. In any case, this is probably best discussed in a separate PR.
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=ed1acef3fe3053b418ce3e41036ccf24957253a4

commit ed1acef3fe3053b418ce3e41036ccf24957253a4
Author:     Mark Johnston <markj@FreeBSD.org>
AuthorDate: 2021-06-16 13:46:56 +0000
Commit:     Mark Johnston <markj@FreeBSD.org>
CommitDate: 2021-06-19 14:08:49 +0000

    ipfw: Update the pfil mbuf pointer in ipfw_check_frame()

    ipfw_chk() might call m_pullup() and thus can change the mbuf chain
    head. In this case, the new chain head has to be returned to the pfil
    hook caller, otherwise the pfil hook caller is left with a dangling
    pointer.

    Note that this affects only the link-layer hooks installed when the
    net.link.ether.ipfw sysctl is set to 1.

    PR: 256439, 254015, 255069, 255104
    Fixes: f355cb3e6
    Reviewed by: ae
    Sponsored by: The FreeBSD Foundation

    (cherry picked from commit bc6a2267fffeafd3946637607a74cfd639398f9d)

 sys/netpfil/ipfw/ip_fw_pfil.c | 2 ++
 1 file changed, 2 insertions(+)
*** This bug has been marked as a duplicate of bug 254015 ***
A commit in branch releng/13.0 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=4647d115ff849534c9d6712cc2da32509721e20e

commit 4647d115ff849534c9d6712cc2da32509721e20e
Author:     Mark Johnston <markj@FreeBSD.org>
AuthorDate: 2021-06-16 13:46:56 +0000
Commit:     Mark Johnston <markj@FreeBSD.org>
CommitDate: 2021-06-29 17:09:43 +0000

    ipfw: Update the pfil mbuf pointer in ipfw_check_frame()

    ipfw_chk() might call m_pullup() and thus can change the mbuf chain
    head. In this case, the new chain head has to be returned to the pfil
    hook caller, otherwise the pfil hook caller is left with a dangling
    pointer.

    Note that this affects only the link-layer hooks installed when the
    net.link.ether.ipfw sysctl is set to 1.

    Approved by: so
    Security: EN-21:21.ipfw
    PR: 256439, 254015, 255069, 255104
    Fixes: f355cb3e6
    Reviewed by: ae
    Sponsored by: The FreeBSD Foundation

    (cherry picked from commit bc6a2267fffeafd3946637607a74cfd639398f9d)
    (cherry picked from commit ed1acef3fe3053b418ce3e41036ccf24957253a4)

 sys/netpfil/ipfw/ip_fw_pfil.c | 2 ++
 1 file changed, 2 insertions(+)
Hello, I'm the bug reporter. Please tell me whether it's safe to use the latest stable FreeBSD 13 release now, or should I wait for 13.1 this spring?
(In reply to crypt47 from comment #41) The bug is fixed in 14.0-CURRENT, 13.0-STABLE, and 13.0-RELEASE-pN for N >= 3.
I see, thank you.
I'm also facing kernel panics, only in FreeBSD 13, with a Supermicro Xeon 1518 board. It dies once a day with this:

Fatal trap 12: page fault while in kernel mode
cpuid = 2; apic id = 02
fault virtual address = 0xfffff80e00000004
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff80f1c50d
stack pointer = 0x28:0xfffffe0144d3cc00
frame pointer = 0x28:0xfffffe0144d3cca0
code segment = base rx0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 2653 (g_eli[2] gptid/e2d4)
trap number = 12
panic: page fault
cpuid = 2
time = 1648393340
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0144d3c9c0
vpanic() at vpanic+0x17f/frame 0xfffffe0144d3ca10
panic() at panic+0x43/frame 0xfffffe0144d3ca70
trap_fatal() at trap_fatal+0x385/frame 0xfffffe0144d3cad0
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe0144d3cb30
calltrap() at calltrap+0x8/frame 0xfffffe0144d3cb30
--- trap 0xc, rip = 0xffffffff80f1c50d, rsp = 0xfffffe0144d3cc00, rbp = 0xfffffe0144d3cca0 ---
aesni_crypt_xts() at aesni_crypt_xts+0x17d/frame 0xfffffe0144d3cca0
aesni_decrypt_xts() at aesni_decrypt_xts+0xe/frame 0xfffffe0144d3ccc0
aesni_cipher_crypt() at aesni_cipher_crypt+0x2f1/frame 0xfffffe0144d3cd70
aesni_process() at aesni_process+0x159/frame 0xfffffe0144d3cdc0
crypto_dispatch() at crypto_dispatch+0x118/frame 0xfffffe0144d3cdf0
g_eli_crypto_run() at g_eli_crypto_run+0x178/frame 0xfffffe0144d3ce90
g_eli_worker() at g_eli_worker+0x328/frame 0xfffffe0144d3cef0
fork_exit() at fork_exit+0x7e/frame 0xfffffe0144d3cf30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0144d3cf30
--- trap 0x80af5f94, rip = 0, rsp = 0, rbp = 0 ---
KDB: enter: panic
(In reply to Krautmaster from comment #44) This looks unrelated to the rest of the PR. Could you please submit a new bug report?
Will do, thanks.