Bug 212607

Summary: devel/gdb: debugging threaded process broken
Product: Ports & Packages
Component: Individual Port(s)
Status: Closed FIXED
Severity: Affects Only Me
Priority: ---
Version: Latest
Hardware: Any
OS: Any
Reporter: Tijl Coosemans <tijl>
Assignee: freebsd-ports-bugs mailing list <ports-bugs>
CC: eric, jhb, luca.pizzamiglio, misc-freebsd-bugzilla, olivier, volkovdablo
Flags: luca.pizzamiglio: maintainer-feedback+

Description Tijl Coosemans freebsd_committer 2016-09-12 13:55:58 UTC
I'm running into a problem debugging threaded programs on FreeBSD head amd64 r304294 and i386 r305230.

The following program reproduces it, but not always:

% cat test.c
#include <pthread.h>

void *
thr( void *arg ) {
	return( arg );
}

int
main( void ) {
	pthread_t pthr[ 4 ];

	pthread_create( &pthr[ 0 ], NULL, thr, NULL );
	pthread_create( &pthr[ 1 ], NULL, thr, NULL );
	pthread_create( &pthr[ 2 ], NULL, thr, NULL );
	pthread_create( &pthr[ 3 ], NULL, thr, NULL );
	pthread_join( pthr[ 0 ], NULL );
	pthread_join( pthr[ 1 ], NULL );
	pthread_join( pthr[ 2 ], NULL );
	pthread_join( pthr[ 3 ], NULL );
	return( 0 );
}
% cc -ggdb -o test test.c -lpthread
% gdb ./test
Reading symbols from ./test...done.
(gdb) b thr
Breakpoint 1 at 0x4007d8: file test.c, line 5.
(gdb) r
Starting program: /usr/home/tijl/test 
[New LWP 100221 of process 974]
[New LWP 100222 of process 974]
[Switching to LWP 100221 of process 974]

Thread 2 hit Breakpoint 1, thr (arg=0x0) at test.c:5
5		return( arg );
(gdb) c
Continuing.
[Switching to LWP 100222 of process 974]

Thread 3 hit Breakpoint 1, thr (arg=0x0) at test.c:5
5		return( arg );
(gdb) c
Continuing.
[LWP 100221 of process 974 exited]
[LWP 100222 of process 974 exited]
[New LWP 100223 of process 974]
[Switching to LWP 100223 of process 974]
0x0000000800828990 in ?? () from /lib/libthr.so.3
ptrace: No such process.

At this point gdb seems to be in an inconsistent state.

(gdb) bt
#0  0x0000000800828990 in ?? () from /lib/libthr.so.3
#1  0x0000000000000000 in ?? ()
Backtrace stopped: Cannot access memory at address 0x7fffdfbfc000
(gdb) q
A debugging session is active.

	Inferior 1 [process 974] will be killed.

Quit anyway? (y or n) y

Here gdb locks up and has to be killed with SIGKILL.

Ports r411099 is the first commit that shows this behaviour.
Comment 1 luca.pizzamiglio 2016-09-17 11:47:07 UTC
I'm currently extremely busy (my girlfriend is in hospital), but I can take a look at this problem next week.

Thanks for reporting.
Comment 2 DomF 2016-10-31 18:59:04 UTC
Hello, I'm getting "ptrace: no such process" errors too.

I filed a bug report upstream (whoops?):

https://sourceware.org/bugzilla/show_bug.cgi?id=20743

I'm not sure if this deals with the "gdb locks up" issue too though (which I also experience).

FreeBSD 11.0-RELEASE-p1 with gdb 7.11.1_3
Comment 3 Javier Bizcocho 2016-11-14 22:57:50 UTC
Hello there,
Unfortunately I'm having exactly the same issue.

I have two computers: one running FreeBSD 11.0-RELEASE-p1 (amd64), where I found the error, and a laptop running FreeBSD 10.3-RELEASE (i386), where I tried a few times but could not reproduce the issue.

I also tried linking with -lthr, -lpthread and -pthread, but the underlying implementation appears to be libthr in every case.

I'm willing to lend a hand if needed.

After some testing I can conclude:

- GDB 6.6 works without problems, although it uses libthread_db.so to manage threads.

- GDB 7.x uses a different implementation: it no longer uses libthread_db.so but calls ptrace directly (especially since the recent addition of LWP events).

- Initially I ruled out a kernel problem because GDB 6.6 was working, but now that I realize the two versions use different code paths I'm not sure; maybe the new LWP code is broken.

- I tried a vanilla GDB from elsewhere: exact same problem.

If I can help in any way, just let me know.

Cheers,

Javi.
Comment 4 Javier Bizcocho 2016-12-09 12:46:17 UTC
Pinging.... 
I'm still hitting this issue.

Did you have time to take a look, Luca?

Thanks in advance

Javi.
Comment 5 luca.pizzamiglio 2016-12-09 15:58:19 UTC
I'm having real trouble reproducing the problem. After 50 runs, no error.

I've spoken with jhb@ and he's not sure it's a gdb problem; it could be related to ptrace or the kernel. AFAIK, the base gdb6 supports threads in a different way.

I'm running CURRENT, which could be why I have trouble reproducing the error.

@Javier: do you have another test case that triggers the error more often? I could use it to debug this.
Comment 6 DomF 2016-12-12 14:57:11 UTC
(In reply to Tijl Coosemans from comment #0)
(In reply to Javier Bizcocho from comment #4)

Did you try the patch I linked to in comment #2 ?

The crux of the patch is to change the top of resume_all_threads_cb() in fbsd-nat.c to:

resume_all_threads_cb (struct thread_info *tp, void *data)
{
  ptid_t *filter = (ptid_t *) data;

  /* don't resume an exited thread */
  if (tp->state == THREAD_EXITED)
    return 0;

[existing code, starting with if() continues from here]

I'm not able to run CURRENT right now but did suffer this problem with GDB v7 on 11-RC and 11-RELEASE.
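
For anyone who wants to see the idea in isolation, here is a minimal sketch of the "skip exited threads while resuming" check. It is not the actual fbsd-nat.c code: the mock_* types and functions are hypothetical stand-ins for GDB's thread list and its iterate-over-threads callback, and "resuming" is modelled as nothing more than a state change.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for GDB's thread bookkeeping; not the real types. */
enum mock_state { MOCK_STOPPED, MOCK_RUNNING, MOCK_EXITED };

struct mock_thread {
    int lwp;                 /* LWP id as it would appear in "[New LWP ...]" */
    enum mock_state state;
};

/*
 * Callback in the style of an iterate-over-threads hook.
 * Returning 0 means "keep iterating".
 */
int
mock_resume_cb(struct mock_thread *tp, void *data)
{
    int *resumed = data;

    assert(tp != NULL && resumed != NULL);

    /* Don't resume an exited thread: its LWP no longer exists. */
    if (tp->state == MOCK_EXITED)
        return 0;

    tp->state = MOCK_RUNNING;    /* models the real resume request */
    (*resumed)++;
    return 0;
}

/* Walk the thread array and return how many threads were resumed. */
int
mock_resume_all(struct mock_thread *threads, size_t n)
{
    int resumed = 0;

    for (size_t i = 0; i < n; i++) {
        if (mock_resume_cb(&threads[i], &resumed) != 0)
            break;
    }
    return resumed;
}
```

With a thread list in which one LWP has already exited, mock_resume_all() touches only the remaining threads, which is the behaviour the early return in the patch is meant to give.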
Comment 7 Tijl Coosemans freebsd_committer 2016-12-18 18:13:31 UTC
(In reply to misc-freebsd-bugzilla from comment #6)
Your patch seems to fix the test case.  I can keep continuing until the process exits and then cleanly quit gdb.  However, when I try to quit gdb while the process is still running, it still locks up and needs to be killed with SIGKILL.
Comment 8 John Baldwin freebsd_committer freebsd_triage 2016-12-19 16:02:59 UTC
Sorry that I haven't sat down to test this yet.  The hang on kill seems to be an issue with PT_KILL (and I'm not sure if it's gdb or the kernel that is broken).
Comment 9 commit-hook freebsd_committer 2017-01-12 21:40:29 UTC
A commit references this bug:

Author: olivier
Date: Thu Jan 12 21:40:07 UTC 2017
New revision: 431323
URL: https://svnweb.freebsd.org/changeset/ports/431323

Log:
  Add MIPS support and other fixes

  PR:		215938
  - Main PR that merge all
  Submitted by:   luca.pizzamiglio@gmail.com (maintainer)

  PR:		215783
  - Add MIPS support
  Submitted by:   jhb
  Sponsored by:   DARPA / AFRL

  PR:		215868
  - Fix build on powerpc architecture
  Reported by:    Mark Millard

  PR:		212607
  - Add a workaround to mitigate gdb hangs under some circumstances
  with multi-threaded applications
  (thanks to misc-freebsd-bugzilla@talk2dom.com)
  Reported by:    tijl

  PR:		215578
  - Fix build by removing option to use system readline
  Reported by:    rozhuk.im@gmail.com

Changes:
  head/devel/gdb/Makefile
  head/devel/gdb/distinfo
  head/devel/gdb/files/commit-387360daf9
  head/devel/gdb/files/commit-b268007c68
  head/devel/gdb/files/extrapatch-base-readline
  head/devel/gdb/files/extrapatch-kgdb
  head/devel/gdb/files/kgdb/mipsfbsd-kern.c
  head/devel/gdb/files/kgdb/ppcfbsd-kern.c
  head/devel/gdb/files/kgdb/sparc64fbsd-kern.c
  head/devel/gdb/files/patch-gdb-fbsd-nat.c
Comment 10 Olivier Cochard freebsd_committer 2017-01-12 21:52:14 UTC
Is the committed workaround enough to close this PR?
Comment 11 luca.pizzamiglio 2017-01-12 21:58:47 UTC
(In reply to Olivier Cochard from comment #10)
No, the root cause of the problem is still unknown, so this PR should stay open.
Comment 12 commit-hook freebsd_committer 2017-02-20 15:53:51 UTC
A commit references this bug:

Author: badger
Date: Mon Feb 20 15:53:17 UTC 2017
New revision: 313992
URL: https://svnweb.freebsd.org/changeset/base/313992

Log:
  Defer ptracestop() signals that cannot be delivered immediately

  When a thread is stopped in ptracestop(), the ptrace(2) user may request
  a signal be delivered upon resumption of the thread. Heretofore, those signals
  were discarded unless ptracestop()'s caller was issignal(). Fix this by
  modifying ptracestop() to queue up signals requested by the ptrace user that
  will be delivered when possible. Take special care when the signal is SIGKILL
  (usually generated from a PT_KILL request); no new stop events should be
  triggered after a PT_KILL.

  Add a number of tests for the new functionality. Several tests were authored
  by jhb.

  PR:		212607
  Reviewed by:	kib
  Approved by:	kib (mentor)
  MFC after:	2 weeks
  Sponsored by:	Dell EMC
  In collaboration with:	jhb
  Differential Revision:	https://reviews.freebsd.org/D9260

Changes:
  head/sys/kern/kern_fork.c
  head/sys/kern/kern_sig.c
  head/sys/kern/kern_thr.c
  head/sys/kern/subr_syscall.c
  head/sys/kern/sys_process.c
  head/sys/sys/signalvar.h
  head/tests/sys/kern/Makefile
  head/tests/sys/kern/ptrace_test.c
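
As a rough illustration of the behaviour the log describes (queue a requested signal instead of discarding it, and stop reporting new stop events once a kill has been requested), here is a toy user-space model. It makes no claim about the kernel's actual data structures or about what ptracestop() looks like after this commit; every model_* name and the MODEL_MAXSIG bound are hypothetical.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAXSIG 64    /* hypothetical bound for this toy model */

struct model_thread {
    bool pending[MODEL_MAXSIG + 1];  /* signals queued for later delivery */
    bool killed;                     /* set once SIGKILL has been requested */
};

/*
 * The tracer asks for `sig` to be delivered when the stopped thread
 * resumes.  Rather than dropping the request, remember it.
 */
void
model_request_signal(struct model_thread *t, int sig)
{
    assert(t != NULL && sig > 0 && sig <= MODEL_MAXSIG);

    t->pending[sig] = true;
    if (sig == SIGKILL)
        t->killed = true;
}

/* Would a new stop event be reported?  Not after a kill request. */
bool
model_may_stop(const struct model_thread *t)
{
    assert(t != NULL);
    return !t->killed;
}

/* Deliver (clear) every queued signal; return how many were delivered. */
int
model_deliver_pending(struct model_thread *t)
{
    int delivered = 0;

    assert(t != NULL);
    for (int sig = 1; sig <= MODEL_MAXSIG; sig++) {
        if (t->pending[sig]) {
            t->pending[sig] = false;
            delivered++;
        }
    }
    return delivered;
}
```

The point of the model is only the ordering: a request made while the thread is stopped survives until delivery is possible, and a SIGKILL request changes what the tracer can expect to see afterwards.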
Comment 13 Eric Badger 2017-02-20 16:00:51 UTC
(In reply to commit-hook from comment #12)

This commit addresses the hang when trying to quit gdb. I still see the original problem (ptrace errors) on FreeBSD 10.3, I presume because it lacks thread events. I think a patch like the one alluded to by jhb here: https://sourceware.org/bugzilla/show_bug.cgi?id=20743#c2 is required to fix that.
Comment 14 John Baldwin freebsd_committer freebsd_triage 2017-02-20 20:22:44 UTC
(In reply to Eric Badger from comment #13)
I don't have a patch for the non-LWP events errors with 10.3.  My inclination at this point is to tell folks to just backport the LWP commits from stable/10 to 10.3. :-/  The errors GDB gets in the non-LWP event case do not have an obvious solution.
Comment 15 commit-hook freebsd_committer 2017-03-25 13:34:12 UTC
A commit references this bug:

Author: badger
Date: Sat Mar 25 13:33:25 UTC 2017
New revision: 315949
URL: https://svnweb.freebsd.org/changeset/base/315949

Log:
  MFC r313992, r314075, r314118, r315484:

  r315484:
      ptrace_test: eliminate assumption about thread scheduling

      A couple of the ptrace tests make assumptions about which thread in a
      multithreaded process will run after a halt. This makes the tests less
      portable across branches, and susceptible to future breakage. Instead,
      twiddle thread scheduling and priorities to match the tests'
      expectation.

  r314118:
      Actually fix buildworlds other than i386/amd64/sparc64 after r313992

      Disable offending test for platforms without a userspace visible
      breakpoint().

  r314075:
      Fix world build for archs where __builtin_debugtrap() does not work.

      The offending code was introduced in r313992.

  r313992:
      Defer ptracestop() signals that cannot be delivered immediately

      When a thread is stopped in ptracestop(), the ptrace(2) user may request
      a signal be delivered upon resumption of the thread. Heretofore, those signals
      were discarded unless ptracestop()'s caller was issignal(). Fix this by
      modifying ptracestop() to queue up signals requested by the ptrace user that
      will be delivered when possible. Take special care when the signal is SIGKILL
      (usually generated from a PT_KILL request); no new stop events should be
      triggered after a PT_KILL.

      Add a number of tests for the new functionality. Several tests were authored
      by jhb.

  PR:		212607
  Sponsored by:	Dell EMC

Changes:
_U  stable/10/
  stable/10/sys/kern/kern_fork.c
  stable/10/sys/kern/kern_sig.c
  stable/10/sys/kern/kern_thr.c
  stable/10/sys/kern/subr_syscall.c
  stable/10/sys/kern/sys_process.c
  stable/10/sys/sys/signalvar.h
  stable/10/tests/sys/kern/Makefile
  stable/10/tests/sys/kern/ptrace_test.c
_U  stable/11/
  stable/11/sys/kern/kern_fork.c
  stable/11/sys/kern/kern_sig.c
  stable/11/sys/kern/kern_thr.c
  stable/11/sys/kern/subr_syscall.c
  stable/11/sys/kern/sys_process.c
  stable/11/sys/sys/signalvar.h
  stable/11/tests/sys/kern/Makefile
  stable/11/tests/sys/kern/ptrace_test.c
Comment 16 John Baldwin freebsd_committer freebsd_triage 2017-04-18 16:57:16 UTC
FYI, a final patch has been merged to GDB master and will appear in the 8.0 release.  The final patch is a bit different from the one in the port, but is functionally identical.  I don't think we need to rework the patch in the current port, but we can drop it when we import 8.0.

The GDB hangs were due to issues in the kernel that Eric Badger has thankfully fixed.