Hello, When compiled without INVARIANTS and INVARIANT_SUPPORT, my production armv6 kernel updated as of yesterday's latest commit panics (cf screenshot). When I compile with INVARIANTS and INVARIANT_SUPPORT, I don't have the panic, everything boots fine till the login prompt. I suspect the culprit might be https://reviews.freebsd.org/rS319722 Gleb, any idea?
Created attachment 184040 [details] Backtrace Backtrace
I am using a raspberry pi 2 as the test machine.
I see the same on powerpc32, armv6 (wandboard) and also on i386. On i386 I have a usable vmcore.
I confirmed that 319721 boots and is stable on i386. 319722 panics when I ssh to the i386 box.
There is a #ifdef INVARIANTS bzero(&so->so_rcv, sizeof(struct socket) - offsetof(struct socket, so_rcv)); #endif sys/kern/uipc_socket.c line 839 in the code written by Gleb, maybe the ifdef should be always true and the memory always zeroed. Sorry no time to test that this week.
This 220452 is likely a duplicate of my earlier 220404 report that was based on a detailed vmcore.* inspection on 32-bit powerpc: just more examples. I've only done the detailed analysis for the 32-bit powerpc context. As of now there are reports for 32-bit powerpc, armv6/7, and i386 spread over 4 people.
I can confirm that r319722 is the source of panics on i386 while r319721 is stable. Please see https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=220358 ( it seems duplicate of this bug )
My targets are running fine now with the removed ifdef from #5. All are under stress test now, buildworld/portbuild from nfs. Though, it is not clear to me why all my 64-bit targets(x86_64, powerpc64 and arm64) have had no issues with this ifdef.
(In reply to oleg.nauman from comment #7) Can you see if the patch suggested by comment #5 fixes the 220358 behavior? Right now 220404 and 220452 are tied by backtrace commonality not showing in 220358. So the duplication status is clearer. Given the bad handling of a union I expect that 220358 will be fixed as well. But I've not seen the specific 220358 backtraces or anything like them: I do not get that far prior to the suggested patch.
Please note that removing the #ifdef INVARIANTS to always bzero the memory is not a suggested patch by me - I was just wondering why the code works with invariants, so I intended to try that. Bug analysis and understanding why removing the #ifdef works should be performed by the relevant skilled people on this part of the kernel - that will lead I hope to a good patch and fix for this problem in production :-)
(In reply to Sylvain Garrigues from comment #10) Generally agreed. But I will note that the code is switching which anonymous struct's in the new union that is in struct socket now.. The bzero is preventing using values from the first (and initially used) anonymous struct being used by the second anonymous struct in the union. It gives much of the union a "clean slate" status of being zero for the use of the second anonymous struct. (This is after some other "close out" activities on the original content in the first anonymous struct.) Side note: The syntax for the union is not compliant with allowing C++ to compile the header: the enum syntax is a problem in that place in C++. So likely the header will change as well at some point.
Regarding comment #10, I fully agree. Regarding comment #9, I expect that is 'solved' too with removing the ifdef. The backtrace on my i386 was similar: .. ti_locked=<error reading variable: Cannot access memory at address 0x1>) .. Coming back to comment #10, it is a start to dig in and try to understand what is happening. Gleb is away till August. So for now it seems to be a band-aid to continue working.
As this bugzilla is getting most of the activity the following changes might be appropriate given the reports present here: Hardware: Any (not just arm anyway) Importance: Affects some people
(In reply to Andreas Tobler from comment #12) In my view leaving things as they are until August is a worse alternative than having a possibly temporary update to uipc_socket.c that is later updated some more. Is there someone appropriate to make that judgment and if they agree to apply the patch for now? As additional evidence (confirming repeatability of the improvement in another context that had the failure): I've finally gotten the 32-bit powerpc updated with the patch applied to -r320570 (-r320570 has another major-consequences bug fix for a TARET_ARCH=powerpc64 example (old PowerMac G5 so-called "Quad Core") and a armv6/7 example (a rpi2 that bob prohaska reported on), and likely more). The patch for 220452/220404/220358 as I've done it is: Index: /usr/src/sys/kern/uipc_socket.c =================================================================== --- /usr/src/sys/kern/uipc_socket.c (revision 320570) +++ /usr/src/sys/kern/uipc_socket.c (working copy) @@ -836,10 +836,8 @@ SOCKBUF_LOCK_DESTROY(&so->so_snd); SOCKBUF_LOCK_DESTROY(&so->so_rcv); -#ifdef INVARIANTS bzero(&so->so_rcv, sizeof(struct socket) - offsetof(struct socket, so_rcv)); -#endif so->sol_sbrcv_lowat = sbrcv_lowat; so->sol_sbsnd_lowat = sbsnd_lowat; and the patched -r320570 is now working for the TARGET_ARCH=powerpc context that previously could not finish a boot to the login prompt. Also, powerpc64 -r320570 continues to appear to work with the patch applied as well. My amd64 use did not show symptoms for this or for the major breakage that is in -r320509 through -r320561 (fixed by -r320570). So it is not a great test case other than to say that the fixes have had no observed negative consequences so far. I've yet to update from well before INO64 on aarch64 and armv6/7 so I've not tested those yet. I do not plan on proving failure on these: just eventually jumping to -r320570 (or later) with the patch and reporting on the result.
Does FreeBSD ever remove published snapshots for specific TARGET_ARCH's for vintages that have major problems on those TARGET_ARCH's? Does it do such for head? If yes, should some TARGET_ARCH's have head (12) snapshots removed for anything from -r319722 until sometime in August (or until a preliminary applied patch is in the build)? A similar point goes for the narrower range -r320509 through -r320561 for what -r320570 fixed. (One ends up unable to self-host in at least some cases: required programs crash.)
To the best of my knowledge, published 12-CURRENT ISOs on FreeBSD mirrors have a kernel with INVARIANTS config enabled so the symptoms described here may not appear.
Adding hselasky@ in the hope a fix can be found before glebius@ is back in 1 month
Created attachment 184050 [details] Fix for panic Can you try this patch?
https://reviews.freebsd.org/D11475
i386 is fine with the patch from #18. The tests for the other archs will be done later today. Thanks!
Hans, I have partial success with your patch ( i386 r319722M ). Unfortunately 'hald' service still panics the system: Fatal trap 12: page fault while in kernel mode cpuid = 0; apic id = 00 fault virtual address = 0x28c fault code = supervisor read, page not present instruction pointer = 0x20:0xc06d26a3 stack pointer = 0x28:0xefdf4a28 frame pointer = 0x28:0xefdf4a58 code segment = base rx0, limit 0xfffff, type 0x1b = DPL 0, pres 1, def32 1, gran 1 processor eflags = interrupt enabled, resume, IOPL = 0 current process = 871 (hald) trap number = 12 panic: page fault __curthread () at ./machine/pcpu.h:225 225 __asm("movl %%fs:%1,%0" : "=r" (td) (kgdb) #0 __curthread () at ./machine/pcpu.h:225 #1 doadump (textdump=-968621952) at /disk3/src319721/sys/kern/kern_shutdown.c:318 #2 0xc06e82e4 in kern_reboot (howto=<optimized out>) at /disk3/src319721/sys/kern/kern_shutdown.c:386 #3 0xc06e867b in vpanic (fmt=<optimized out>, ap=0xefdf488c "\233\324\235\300\350\033\211\306\001") at /disk3/src319721/sys/kern/kern_shutdown.c:779 #4 0xc06e853b in panic (fmt=0xc092d62e "%s") at /disk3/src319721/sys/kern/kern_shutdown.c:710 #5 0xc08ee111 in trap_fatal (frame=0xefdf49e8, eva=<optimized out>) at /disk3/src319721/sys/i386/i386/trap.c:978 #6 0xc08ee24d in trap_pfault (frame=0xefdf49e8, usermode=0, eva=<optimized out>) at /disk3/src319721/sys/i386/i386/trap.c:786 #7 0xc08ed87e in trap (frame=<optimized out>) at /disk3/src319721/sys/i386/i386/trap.c:512 #8 <signal handler called> #9 __mtx_lock_sleep (c=0xc6bd5cc8, v=<optimized out>, tid=<optimized out>) at /disk3/src319721/sys/kern/kern_mutex.c:538 #10 0xc0746959 in soo_stat (fp=0xc6893b28, ub=0xefdf4ae0, active_cred=0xc6ea3100, td=0xc6891a20) at /disk3/src319721/sys/kern/sys_socket.c:305 #11 0xc06a34ca in fo_stat (fp=0xc6893b28, sb=0xefdf4ae0, active_cred=0xc6bd5cb8, td=0xc6891a20) at /disk3/src319721/sys/sys/file.h:339 #12 kern_fstat (td=0xc6891a20, fd=<optimized out>, sbp=<optimized out>) at /disk3/src319721/sys/kern/kern_descrip.c:1362 #13 0xc06a35da in sys_fstat (td=0xefdf4ae0, uap=0xefdf4c00) at /disk3/src319721/sys/kern/kern_descrip.c:1341 #14 0xc08eeabd in syscallenter (td=<optimized out>, sa=<optimized out>) at /disk3/src319721/sys/i386/i386/../../kern/subr_syscall.c:136 #15 syscall (frame=<optimized out>) at /disk3/src319721/sys/i386/i386/trap.c:1102 #16 <signal handler called> #17 0x284404af in ?? () Backtrace stopped: Cannot access memory at address 0xbfbfe9bc Please ignore 'src319721' in kernel source path ; sources were updated to 319722 plus your patch applied. Thank you
arm is also fine with the patch in #18.
Oleg, would you mind updating to a more recent src tree? There were some more commits from glebius@ in this area: 319754, 319988, 320324. I don't know if they are really related. At least 319754 touches the sys_socket.c I'm right now on 320607
(In reply to Andreas Tobler from comment #23) Ok, tomorrow I will update my i386 CURRENT box to 320607 with patch from Hans.
A commit references this bug: Author: hselasky Date: Tue Jul 4 18:23:17 UTC 2017 New revision: 320652 URL: https://svnweb.freebsd.org/changeset/base/320652 Log: After r319722 two fields were left uninitialized when transforming a socket structure into a listening socket. This resulted in an invalid instruction fault for all 32-bit platforms. When INVARIANTS is set the union where the two uninitialized fields reside gets properly zeroed. This patch ensures the two uninitialized fields are zeroed when INVARIANTS is undefined. For 64-bit platforms this issue was not visible because so->sol_upcall which is uninitialized overlaps with so->so_rcv.sb_state which is already zero during soalloc(); For 32-bit platforms this issue was visible and resulted in an invalid instruction fault, because so->sol_upcall overlaps with so->so_rcv.sb_sel which is always initialized to a valid data pointer during soalloc(). Verifying the offset locations mentioned above are identical is left as an exercise to the reader. PR: 220452 PR: 220358 Reviewed by: ae (network), gallatin Differential Revision: https://reviews.freebsd.org/D11475 Sponsored by: Mellanox Technologies Changes: head/sys/kern/uipc_socket.c
Please re-open if this is still an issue with the latest version of FreeBSD-12-current. Thank you!
Thank *you* Hans!