Bug 220452 - Kernel panic in soisconnected during sysinit: invalid instruction
Summary: Kernel panic in soisconnected during sysinit: invalid instruction
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: arm (show other bugs)
Version: CURRENT
Hardware: arm Any
: --- Affects Only Me
Assignee: freebsd-arm (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-07-03 17:47 UTC by Sylvain Garrigues
Modified: 2017-07-05 08:50 UTC (History)
7 users (show)

See Also:


Attachments
Backtrace (722.25 KB, image/jpeg)
2017-07-03 17:49 UTC, Sylvain Garrigues
no flags Details
Fix for panic (383 bytes, patch)
2017-07-04 08:45 UTC, Hans Petter Selasky
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Sylvain Garrigues 2017-07-03 17:47:01 UTC
Hello,

When compiled without INVARIANTS and INVARIANT_SUPPORT, my production armv6 kernel  updated as of yesterday's latest commit panics (cf screenshot).

When I compile with INVARIANTS and INVARIANT_SUPPORT, I don't have the panic, everything boots fine till the login prompt.

I suspect the culprit might be https://reviews.freebsd.org/rS319722

Gleb, any idea?
Comment 1 Sylvain Garrigues 2017-07-03 17:49:44 UTC
Created attachment 184040 [details]
Backtrace

Backtrace
Comment 2 Sylvain Garrigues 2017-07-03 17:53:52 UTC
I am using a raspberry pi 2 as the test machine.
Comment 3 Andreas Tobler freebsd_committer freebsd_triage 2017-07-03 17:59:45 UTC
I see the same on powerpc32, armv6 (wandboard) and also on i386. On i386 I have a usable vmcore.
Comment 4 Andreas Tobler freebsd_committer freebsd_triage 2017-07-03 18:00:49 UTC
I confirmed that 319721 boots and is stable on i386. 319722 panics when I ssh to the i386 box.
Comment 5 Sylvain Garrigues 2017-07-03 18:03:57 UTC
There is a 

#ifdef INVARIANTS
	bzero(&so->so_rcv,
	    sizeof(struct socket) - offsetof(struct socket, so_rcv));
#endif

sys/kern/uipc_socket.c line 839 in the code written by Gleb, maybe the ifdef should be always true and the memory always zeroed. Sorry no time to test that this week.
Comment 6 Mark Millard 2017-07-03 18:11:54 UTC
This 220452 is likely a duplicate of my earlier
220404 report that was based on a detailed vmcore.*
inspection on 32-bit powerpc: just more examples.

I've only done the detailed analysis for the 32-bit
powerpc context.

As of now there are reports for 32-bit powerpc,
armv6/7, and i386 spread over 4 people.
Comment 7 oleg.nauman 2017-07-03 19:50:59 UTC
I can confirm that r319722 is the source of panics on i386 while r319721 is stable.
Please see https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=220358 ( it seems duplicate of this bug )
Comment 8 Andreas Tobler freebsd_committer freebsd_triage 2017-07-03 20:58:08 UTC
My targets are running fine now with the removed ifdef from #5.
All are under stress test now, buildworld/portbuild from nfs.
Though, it is not clear to me why all my 64-bit targets(x86_64, powerpc64 and arm64) have had no issues with this ifdef.
Comment 9 Mark Millard 2017-07-03 21:04:49 UTC
(In reply to oleg.nauman from comment #7)

Can you see if the patch suggested by comment #5
fixes the 220358 behavior?

Right now 220404 and 220452 are tied by backtrace
commonality not showing in 220358. So the
duplication status is clearer.

Given the bad handling of a union I expect that
220358 will be fixed as well. But I've not seen
the specific 220358 backtraces or anything like
them: I do not get that far prior to the suggested
patch.
Comment 10 Sylvain Garrigues 2017-07-03 21:10:11 UTC
Please note that removing the #ifdef INVARIANTS to always bzero the memory is not a suggested patch by me - I was just wondering why the code works with invariants, so I intended to try that.

Bug analysis and understanding why removing the #ifdef works should be performed by the relevant skilled people on this part of the kernel - that will lead I hope to a good patch and fix for this problem in production :-)
Comment 11 Mark Millard 2017-07-03 21:19:48 UTC
(In reply to Sylvain Garrigues from comment #10)

Generally agreed.

But I will note that the code is switching which
anonymous struct's in the new union that is in
struct socket now.. The bzero is preventing using
values from the first (and initially used) anonymous
struct being used by the second anonymous struct in
the union.

It gives much of the union a "clean slate" status
of being zero for the use of the second anonymous
struct. (This is after some other "close
out" activities on the original content in the
first anonymous struct.)


Side note:

The syntax for the union is not compliant with
allowing C++ to compile the header: the enum
syntax is a problem in that place in C++.

So likely the header will change as well at some
point.
Comment 12 Andreas Tobler freebsd_committer freebsd_triage 2017-07-03 21:21:02 UTC
Regarding comment #10, I fully agree.

Regarding comment #9, I expect that is 'solved' too with removing the ifdef.
The backtrace on my i386 was similar:

..
ti_locked=<error reading variable: Cannot access memory at address 0x1>)
..

Coming back to comment #10, it is a start to dig in and try to understand what is happening. Gleb is away till August. So for now it seems to be a band-aid to continue working.
Comment 13 Mark Millard 2017-07-03 21:27:24 UTC
As this bugzilla is getting most of the activity
the following changes might be appropriate given
the reports present here:

Hardware: Any (not just arm anyway)
Importance: Affects some people
Comment 14 Mark Millard 2017-07-04 03:05:44 UTC
(In reply to Andreas Tobler from comment #12)

In my view leaving things as they are until
August is a worse alternative than having
a possibly temporary update to uipc_socket.c
that is later updated some more.

Is there someone appropriate to make that
judgment and if they agree to apply the
patch for now?


As additional evidence (confirming
repeatability of the improvement in
another context that had the failure):

I've finally gotten the 32-bit powerpc updated
with the patch applied to -r320570 (-r320570
has another major-consequences bug fix for
a TARET_ARCH=powerpc64 example (old PowerMac
G5 so-called "Quad Core") and a armv6/7
example (a rpi2 that bob prohaska reported
on), and likely more).

The patch for 220452/220404/220358 as I've
done it is:

Index: /usr/src/sys/kern/uipc_socket.c
===================================================================
--- /usr/src/sys/kern/uipc_socket.c     (revision 320570)
+++ /usr/src/sys/kern/uipc_socket.c     (working copy)
@@ -836,10 +836,8 @@
        SOCKBUF_LOCK_DESTROY(&so->so_snd);
        SOCKBUF_LOCK_DESTROY(&so->so_rcv);
 
-#ifdef INVARIANTS
        bzero(&so->so_rcv,
            sizeof(struct socket) - offsetof(struct socket, so_rcv));
-#endif
 
        so->sol_sbrcv_lowat = sbrcv_lowat;
        so->sol_sbsnd_lowat = sbsnd_lowat;


and the patched -r320570 is now working
for the TARGET_ARCH=powerpc context that
previously could not finish a boot to
the login prompt. Also, powerpc64 -r320570
continues to appear to work with the patch
applied as well.

My amd64 use did not show symptoms for this or
for the major breakage that is in -r320509
through -r320561 (fixed by -r320570). So it is
not a great test case other than to say that
the fixes have had no observed negative
consequences so far.

I've yet to update from well before INO64 on
aarch64 and armv6/7 so I've not tested those
yet. I do not plan on proving failure on these:
just eventually jumping to -r320570 (or later)
with the patch and reporting on the result.
Comment 15 Mark Millard 2017-07-04 04:23:14 UTC
Does FreeBSD ever remove published
snapshots for specific TARGET_ARCH's
for vintages that have major problems
on those TARGET_ARCH's? Does it do
such for head?

If yes, should some TARGET_ARCH's have
head (12) snapshots removed for anything
from -r319722 until sometime in August
(or until a preliminary applied patch
is in the build)?

A similar point goes for the narrower
range -r320509 through -r320561 for
what -r320570 fixed. (One ends up
unable to self-host in at least some
cases: required programs crash.)
Comment 16 Sylvain Garrigues 2017-07-04 07:46:20 UTC
To the best of my knowledge, published 12-CURRENT ISOs on FreeBSD mirrors have a kernel with INVARIANTS config enabled so the symptoms described here may not appear.
Comment 17 Sylvain Garrigues 2017-07-04 08:10:19 UTC
Adding hselasky@ in the hope a fix can be found before glebius@ is back in 1 month
Comment 18 Hans Petter Selasky freebsd_committer freebsd_triage 2017-07-04 08:45:23 UTC
Created attachment 184050 [details]
Fix for panic

Can you try this patch?
Comment 19 Hans Petter Selasky freebsd_committer freebsd_triage 2017-07-04 08:54:05 UTC
https://reviews.freebsd.org/D11475
Comment 20 Andreas Tobler freebsd_committer freebsd_triage 2017-07-04 11:00:02 UTC
i386 is fine with the patch from #18.
The tests for the other archs will be done later today.
Thanks!
Comment 21 oleg.nauman 2017-07-04 15:56:20 UTC
Hans, 

I have partial success with your patch ( i386 r319722M ).
Unfortunately 'hald' service still panics the system:


Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address   = 0x28c
fault code              = supervisor read, page not present
instruction pointer     = 0x20:0xc06d26a3
stack pointer           = 0x28:0xefdf4a28
frame pointer           = 0x28:0xefdf4a58
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, def32 1, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 871 (hald)
trap number             = 12
panic: page fault

__curthread () at ./machine/pcpu.h:225
225             __asm("movl %%fs:%1,%0" : "=r" (td)
(kgdb) #0  __curthread () at ./machine/pcpu.h:225
#1  doadump (textdump=-968621952)
    at /disk3/src319721/sys/kern/kern_shutdown.c:318
#2  0xc06e82e4 in kern_reboot (howto=<optimized out>)
    at /disk3/src319721/sys/kern/kern_shutdown.c:386
#3  0xc06e867b in vpanic (fmt=<optimized out>,
    ap=0xefdf488c "\233\324\235\300\350\033\211\306\001")
    at /disk3/src319721/sys/kern/kern_shutdown.c:779
#4  0xc06e853b in panic (fmt=0xc092d62e "%s")
    at /disk3/src319721/sys/kern/kern_shutdown.c:710
#5  0xc08ee111 in trap_fatal (frame=0xefdf49e8, eva=<optimized out>)
    at /disk3/src319721/sys/i386/i386/trap.c:978
#6  0xc08ee24d in trap_pfault (frame=0xefdf49e8, usermode=0,
    eva=<optimized out>) at /disk3/src319721/sys/i386/i386/trap.c:786
#7  0xc08ed87e in trap (frame=<optimized out>)
    at /disk3/src319721/sys/i386/i386/trap.c:512
#8  <signal handler called>
#9  __mtx_lock_sleep (c=0xc6bd5cc8, v=<optimized out>, tid=<optimized out>)
    at /disk3/src319721/sys/kern/kern_mutex.c:538
#10 0xc0746959 in soo_stat (fp=0xc6893b28, ub=0xefdf4ae0,
    active_cred=0xc6ea3100, td=0xc6891a20)
    at /disk3/src319721/sys/kern/sys_socket.c:305
#11 0xc06a34ca in fo_stat (fp=0xc6893b28, sb=0xefdf4ae0,
    active_cred=0xc6bd5cb8, td=0xc6891a20)
    at /disk3/src319721/sys/sys/file.h:339
#12 kern_fstat (td=0xc6891a20, fd=<optimized out>, sbp=<optimized out>)
    at /disk3/src319721/sys/kern/kern_descrip.c:1362
#13 0xc06a35da in sys_fstat (td=0xefdf4ae0, uap=0xefdf4c00)
    at /disk3/src319721/sys/kern/kern_descrip.c:1341
#14 0xc08eeabd in syscallenter (td=<optimized out>, sa=<optimized out>)
    at /disk3/src319721/sys/i386/i386/../../kern/subr_syscall.c:136
#15 syscall (frame=<optimized out>)
    at /disk3/src319721/sys/i386/i386/trap.c:1102
#16 <signal handler called>
#17 0x284404af in ?? ()
Backtrace stopped: Cannot access memory at address 0xbfbfe9bc

Please ignore 'src319721' in kernel source path ; sources were updated to 319722 plus your patch applied.

Thank you
Comment 22 Andreas Tobler freebsd_committer freebsd_triage 2017-07-04 16:22:25 UTC
arm is also fine with the patch in #18.
Comment 23 Andreas Tobler freebsd_committer freebsd_triage 2017-07-04 16:26:14 UTC
Oleg, 
would you mind updating to a more recent src tree? There were some more commits from glebius@ in this area: 319754, 319988, 320324. I don't know if they are really related. At least 319754 touches the sys_socket.c

I'm right now on 320607
Comment 24 oleg.nauman 2017-07-04 16:31:35 UTC
(In reply to Andreas Tobler from comment #23)
 Ok, tomorrow I will update my i386 CURRENT box to 320607 with patch from Hans.
Comment 25 commit-hook freebsd_committer freebsd_triage 2017-07-04 18:23:52 UTC
A commit references this bug:

Author: hselasky
Date: Tue Jul  4 18:23:17 UTC 2017
New revision: 320652
URL: https://svnweb.freebsd.org/changeset/base/320652

Log:
  After r319722 two fields were left uninitialized when transforming a
  socket structure into a listening socket. This resulted in an invalid
  instruction fault for all 32-bit platforms.

  When INVARIANTS is set the union where the two uninitialized fields
  reside gets properly zeroed. This patch ensures the two uninitialized
  fields are zeroed when INVARIANTS is undefined.

  For 64-bit platforms this issue was not visible because so->sol_upcall
  which is uninitialized overlaps with so->so_rcv.sb_state which is
  already zero during soalloc();

  For 32-bit platforms this issue was visible and resulted in an invalid
  instruction fault, because so->sol_upcall overlaps with
  so->so_rcv.sb_sel which is always initialized to a valid data pointer
  during soalloc().

  Verifying the offset locations mentioned above are identical is left
  as an exercise to the reader.

  PR: 220452
  PR: 220358
  Reviewed by:	ae (network), gallatin
  Differential Revision:	https://reviews.freebsd.org/D11475
  Sponsored by:	Mellanox Technologies

Changes:
  head/sys/kern/uipc_socket.c
Comment 26 Hans Petter Selasky freebsd_committer freebsd_triage 2017-07-04 18:26:10 UTC
Please re-open if this is still an issue with the latest version of FreeBSD-12-current. Thank you!
Comment 27 Sylvain Garrigues 2017-07-05 08:50:56 UTC
Thank *you* Hans!