Bug 96981

Summary: reproducible instant reboot by unprivileged user
Product: Base System
Reporter: Lodewijk Voege <lvoege>
Component: amd64
Assignee: freebsd-amd64 (Nobody) <amd64>
Status: Closed FIXED    
Severity: Affects Only Me    
Priority: Normal    
Version: Unspecified   
Hardware: Any   
OS: Any   

Description Lodewijk Voege 2006-05-08 17:00:36 UTC
as a normal, unprivileged user, running GDB on the mono binary, breaking anywhere, then continuing makes my machine reboot instantly. no panic, no nothing. I hit 'c' for continue, then enter, and the next screen I see is the BIOS booting up.

Fix: 

dunno.

How-To-Repeat: compile the following using mcs:

public class Test {
	public static void Main(string[] args) {
		System.Console.WriteLine("hello");
	}
}

into a binary. run gdb /usr/local/bin/mono and 'r'. for me, it then SIGSEGVs somewhere in the garbage collector, but the problem occurs even if you break back into gdb by other means. now 'c'ontinue and it reboots.

I assume the problem could be distilled into a smaller example program, not using mono or gdb, but that unfortunately exceeds my ability at this time.
Comment 1 Lodewijk Voege 2006-05-15 16:09:54 UTC
I tried to reproduce on an i386, and could not. I upgraded my amd64 to
6.1-RELEASE and the recipe still works, it still instantly reboots.

Lodewijk
Comment 2 Gavin Atkinson 2007-04-26 13:58:19 UTC
Hi,

Is this still a problem under 6.2?  If so, is there any chance you could
wire up a serial console to the machine as described in
http://www.freebsd.org/doc/en_US.ISO8859-1/articles/console-server/freebsd.html
and follow the instructions at
http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-online-ddb.html
in order to obtain a stack trace once it has crashed?  It's likely that
your machine is crashing and rebooting straight away as you are in X,
however the above may well provide enough information to help debug this
issue further.

Thanks,

Gavin
Comment 3 Lodewijk Voege 2007-04-26 15:28:02 UTC
On 26-apr-2007, at 8:58, Gavin Atkinson wrote:

> Is this still a problem under 6.2?

I will check.

> If so, is there any chance you could wire up a serial console to
> the machine as described in
> http://www.freebsd.org/doc/en_US.ISO8859-1/articles/console-server/freebsd.html
> and follow the instructions at
> http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-online-ddb.html
> in order to obtain a stack trace once it has crashed?

I believe I did try DDB at the time, but the lock was hard and solid  
with no trapping to DDB. but if it's still a problem in 6.2 I will try.

> It's likely that your machine is crashing and rebooting straight
> away as you are in X,
> however the above may well provide enough information to help debug
> this issue further.

this was in text mode, even with X not running at all.

Lodewijk
Comment 4 Lodewijk Voege 2007-04-26 15:35:04 UTC
On 26-apr-2007, at 8:58, Gavin Atkinson wrote:

> Is this still a problem under 6.2?

yep, immediate reboot from the console. no messages, no trapping to  
ddb, no nothing. I hear my monitor change sync rate before my finger  
has left the enter key.

I'll try the serial line if I can find a nullmodem.

Lodewijk
Comment 5 Mark Linimon freebsd_committer freebsd_triage 2007-04-26 23:42:51 UTC
State Changed
From-To: open->analyzed

Although the problem has not been 'analyzed' as such, it has been confirmed, 
so set this to flag it.
Comment 6 Lodewijk Voege 2007-04-27 01:11:40 UTC
On 26-apr-2007, at 8:58, Gavin Atkinson wrote:

> If so, is there any chance you could wire up a serial console to
> the machine

done, the serial console is set up and works. dumpdev is set to AUTO,
debug.debugger_on_panic to 1, but it won't dump or break to ddb. the
only behavioral difference I see with the serial console is that it
hangs instead of rebooting, and it says "kernel trap 9 with
interrupts disabled" on the serial console.

if I copy&paste a kdb_trap(type, 0, &frame); right after that printf()
in trap() in /usr/src/sys/amd64/amd64/trap.c it does break to ddb.

I have three mono threads, two of which have a trace with only
doreti_iret(). the third has:

sched_switch() at sched_switch+0x11f
mi_switch() at mi_switch+0x153
sleepq_timedwait_sig() at sleepq_timedwait_sig+0x2b
msleep() at msleep+0x39a
kse_release() at kse_release+0xe0
syscall() at syscall+0x629
Xfast_syscall() at Xfast_syscall+0xa8
--- syscall (383, FreeBSD ELF64, kse_release), rip = 0x800fb285c, rsp = 0x7fffffbfef38, rbp = 0x81 ---

if I then make it panic and kgdb the core file against kernel.debug
and 'bt', this appears:

#0  doadump () at pcpu.h:172
#1  0xffffffff802832f3 in boot (howto=260)
     at ../../../kern/kern_shutdown.c:409
#2  0xffffffff80283927 in panic (fmt=0xffffff002533ebe0 "°6\215+")
     at ../../../kern/kern_shutdown.c:565
#3  0xffffffff801aa1a2 in db_panic (addr=0, have_addr=0, count=0, modif=0x0)
     at ../../../ddb/db_command.c:438
#4  0xffffffff801aa6e5 in db_command_loop () at ../../../ddb/db_command.c:350
#5  0xffffffff801ac5fd in db_trap (type=-1462293744, code=0)
     at ../../../ddb/db_main.c:222
#6  0xffffffff802a1bab in kdb_trap (type=9, code=0, tf=0xffffffffa8d72c10)
     at ../../../kern/subr_kdb.c:473
#7  0xffffffff8041305c in trap (frame=
       {tf_rdi = 34366898272, tf_rsi = 34376163152, tf_rdx = 140737488348840,
        tf_rcx = 0, tf_r8 = 0, tf_r9 = 0, tf_rax = 0, tf_rbx = 140737488348824,
        tf_rbp = 140737488348824, tf_r10 = 0, tf_r11 = 0, tf_r12 = 0, tf_r13 = 0,
        tf_r14 = 0, tf_r15 = 0, tf_trapno = 9, tf_addr = 0, tf_flags = 0,
        tf_err = 0, tf_rip = -2143296837, tf_cs = 8, tf_rflags = 65670,
        tf_rsp = -1462293288, tf_ss = 16})
     at ../../../amd64/amd64/trap.c:219
#8  0xffffffff803fd2cb in calltrap () at ../../../amd64/amd64/exception.S:168
#9  0xffffffff803fe2bb in doreti_exit ()
     at ../../../amd64/amd64/exception.S:496
#10 0x0000000801101e20 in ?? ()
#11 0x000000000000002b in ?? ()
#12 0x0000000000000202 in ?? ()
#13 0x00007fffffffe698 in ?? ()
#14 0x0000000000000000 in ?? ()
#15 0x00000000006eaff8 in ?? ()
#16 0x0000000000000023 in ?? ()
#17 0x0000000000000000 in ?? ()
#18 0x0000000000000000 in ?? ()
#19 0x0000000000000000 in ?? ()
#20 0x0000000000000000 in ?? ()
#21 0x0000000000000000 in ?? ()
#22 0x0000000000000000 in ?? ()
#23 0x0000000000000000 in ?? ()
#24 0x0000000000000000 in ?? ()
#25 0x0000000025c0a000 in ?? ()
#26 0xffffff002533ebe0 in ?? ()
#27 0x0000000000000001 in ?? ()
#28 0xffffff002b8d36b0 in ?? ()
#29 0xffffff00264d1000 in ?? ()
#30 0xffffffffa8d726d0 in ?? ()
#31 0xffffffffa8d726a8 in ?? ()
#32 0xffffff002533ebe0 in ?? ()
#33 0xffffffff80298f6f in sched_switch (td=0x7fffffffe698, newtd=0x0, flags=0)
     at ../../../kern/sched_4bsd.c:973
Previous frame inner to this frame (corrupt stack?)
(kgdb)
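
As a cross-check on the trace above, the signed decimal values kgdb prints for the trap frame in frame #7 can be turned back into addresses. The following is a minimal C sketch and not part of the original report; the helper names as_address and interrupts_enabled are made up for illustration. It reinterprets tf_rip and tf_rsp as unsigned 64-bit values and tests the IF bit (bit 9, mask 0x200) of tf_rflags.

```c
#include <stdint.h>

/* IF (interrupt enable) is bit 9 of RFLAGS on x86-64. */
#define RFLAGS_IF UINT64_C(0x200)

/*
 * Hypothetical helper: kgdb printed the register as a signed decimal;
 * the cast recovers the 64-bit address that value stands for.
 */
uint64_t as_address(int64_t printed)
{
    return (uint64_t)printed;
}

/* Hypothetical helper: nonzero if the saved RFLAGS had interrupts enabled. */
int interrupts_enabled(uint64_t rflags)
{
    return (rflags & RFLAGS_IF) != 0;
}
```

With the values from frame #7, as_address(-2143296837) gives 0xffffffff803fe2bb, the same address kgdb reports for doreti_exit in frame #9, and tf_rflags (65670, i.e. 0x10086) has IF clear, which agrees with the "interrupts disabled" message on the serial console.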

let me know if I can provide more help.

Lodewijk
Comment 7 Lodewijk Voege 2007-04-28 18:20:59 UTC
hello,

please disregard the previous stack trace. while it's the same
process and a much fuller stack trace, the problem really seems to be
the thread that's hitting doreti_iret with the uninteresting trace of
"doreti_iret() at doreti_iret"

I've been sprinkling printf()s around in sys/amd64/amd64/trap.c, and  
as far as I can tell:

   - it passes through the block that printf()s "kernel trap 9 with  
interrupts disabled"
   - then to the block with /* kernel trap */ at the top
   - to the T_PROTFLT clause in the switch, as type is 9
   - it enters the block that points the frame's tf_rip field to  
doreti_iret_fault. that block goto's to the end of trap(), which does  
a return
   - lockup

so my layman's suspicion is that it's a problem with doreti_iret_fault.
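
The path described above can be modelled in user space. The following C sketch is only an illustration under stated assumptions, not the code in sys/amd64/amd64/trap.c: the struct, the enum, the function model_trap and the two address constants are invented stand-ins, and only the names T_PROTFLT, tf_trapno, tf_rip, doreti_iret and doreti_iret_fault come from this thread. The check that the fault happened exactly at doreti_iret is an assumption about why that block is entered.

```c
#include <stddef.h>
#include <stdint.h>

/* Trap number 9, a protection fault; the thread calls it T_PROTFLT. */
#define T_PROTFLT 9

/* Hypothetical stand-ins for the kernel addresses of the two labels. */
#define DORETI_IRET_ADDR       UINT64_C(0x1000)
#define DORETI_IRET_FAULT_ADDR UINT64_C(0x2000)

/* Reduced trap frame: only the two fields this model needs. */
struct model_frame {
    int      tf_trapno;
    uint64_t tf_rip;
};

enum model_result {
    MODEL_NOT_KERNEL,       /* user-mode trap: outside this model */
    MODEL_RESUMED_AT_FIXUP, /* tf_rip redirected, trap() returns */
    MODEL_FATAL             /* no recovery path matched */
};

/* Models only the "kernel trap" branch walked through above. */
enum model_result model_trap(struct model_frame *frame, int from_user)
{
    if (frame == NULL)
        return MODEL_FATAL;
    if (from_user)
        return MODEL_NOT_KERNEL;

    switch (frame->tf_trapno) {
    case T_PROTFLT:
        /* Assumed condition: the faulting instruction is at doreti_iret. */
        if (frame->tf_rip == DORETI_IRET_ADDR) {
            /* Point the frame at the fixup label, then leave trap(). */
            frame->tf_rip = DORETI_IRET_FAULT_ADDR;
            return MODEL_RESUMED_AT_FIXUP;
        }
        break;
    default:
        break;
    }
    return MODEL_FATAL;
}
```

In the model, a kernel-mode trap 9 at the doreti_iret address returns with tf_rip rewritten, so whatever happens next depends entirely on the code at doreti_iret_fault.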

Lodewijk
Comment 8 Gavin Atkinson 2007-05-01 12:35:56 UTC
This PR should be refiled as amd64. Neither I nor the submitter can
recreate it on i386, the submitter has some evidence to suggest the
problem is in or around the MD amd64 doreti_iret_fault code, and my
instinct suggests the same given it's an issue involving tracing code.
Comment 9 Mark Linimon freebsd_committer freebsd_triage 2007-05-01 16:17:22 UTC
Responsible Changed
From-To: freebsd-bugs->freebsd-amd64

gavin suggests this may be amd64-specific.
Comment 10 Gavin Atkinson 2007-05-10 17:17:49 UTC
I have been able to recreate this on an amd64 system of my own as
follows:

Fresh install of 6.2-RELEASE-amd64
ports tree from 2007.02.13.06.40.00 (newer trees have the port BROKEN on
amd64, for unrelated reasons)

cd /usr/ports/lang/mono && make install
(accept default options for all ports)

As regular user:
Create the example file from the PR, saved as 96981.m

$ mcs 96981.m
$ gdb /usr/local/bin/mono
(gdb) set args 96981.exe
(gdb) r
Starting program: /usr/local/bin/mono 96981.exe
(no debugging symbols found)...(no debugging symbols found)...(no
debugging symbols found)...(no debugging symbols found)...(no debugging
symbols found)...(no debugging symbols found)...(no debugging symbols
found)...warning: Unable to get location for thread creation breakpoint:
generic error
[New LWP 100094]
(no debugging symbols found)...(no debugging symbols found)...[New
Thread 0x6eb000 (LWP 100094)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x6eb000 (LWP 100094)]
0x00000000004cec66 in GC_push_all_stack ()
(gdb) c
Continuing.
Read from remote host ga9test3.york.ac.uk: Connection reset by peer

Sadly, I don't currently have an amd64 system with a serial port
available.
Comment 11 Lodewijk Voege 2007-08-31 16:10:46 UTC
hello,

this no longer occurs on CURRENT. sys/amd64/amd64/mp_machdep.c r1.261
might have been the change that fixed it; it talks of faults and silent
resets.

Lodewijk
Comment 12 Mark Linimon freebsd_committer freebsd_triage 2007-08-31 18:28:52 UTC
State Changed
From-To: analyzed->closed

Submitter notes that this has been resolved in -current.