273688 – sysutils/pstack: does not work with Valgrind

Status:	Open

Alias:	None

Product:	Ports & Packages
Classification:	Unclassified
Component:	Individual Port(s)
Version:	Latest
Hardware:	Any Any

Importance:	--- Affects Only Me
Assignee:	freebsd-ports-bugs (Nobody)

URL:
Keywords:

Depends on:
Blocks:

Reported:	2023-09-10 07:54 UTC by Paul Floyd
Modified:	2023-10-02 08:12 UTC
CC List:	2 users

See Also:

Flags:	fernape: maintainer-feedback? (pizzamig)

Attachments
patch for notype functions (468 bytes, patch) 2023-09-25 05:45 UTC, Paul Floyd

Description Paul Floyd 2023-09-10 07:54:37 UTC

I get

paulf@freebsd:~ $ pstack 33671
33671: /home/paulf/valgrind/drd/drd-amd64-freebsd
----------------- thread -1 (running) -----------------
  0x3815ed66 ???????? (3815dcde, 0, 238c470, 10, 380bc159, 0)

which isn't much help.

Info on the exe:

paulf@freebsd:~ $ file /home/paulf/valgrind/drd/drd-amd64-freebsd
/home/paulf/valgrind/drd/drd-amd64-freebsd: ELF 64-bit LSB executable, x86-64, version 1 (FreeBSD), statically linked, for FreeBSD 15.0 (1500000), FreeBSD-style, with debug_info, not stripped

The executable is not just statically linked; it does not link with libc at all. It has its own _start and uses a load address of 0x38000000.

Comment 1 Paul Floyd 2023-09-10 08:01:09 UTC

For comparison, pstack on Illumos gives me

25702:  /export/home/paulf/valgrind/./coregrind/valgrind --command-line-only=y
 00000000580749b6 pollsys  (5a0b6050, 1, 0, 0)
 000000005806308a vgPlain_poll () + 6a
 000000005811de35 vgPlain_poll_no_eintr () + 65
 000000005811ea1e readchar.part.0 () + 2e
 000000005811f3a8 getpkt () + 28
 00000000580afc57 server_main () + 67
 00000000580f2af8 call_gdbserver () + 118
 00000000580f35fa vgPlain_gdbserver () + 7a
 00000000580b69b3 vgPlain_scheduler () + 83
 00000000580c7f01 run_a_thread_NORETURN () + b1

Comment 2 Paul Floyd 2023-09-24 10:26:21 UTC

Some more analysis

In the first window:

./vg-in-place sleep 100000

That's my dev build of Valgrind; plain 'valgrind' from the devel/valgrind-devel package should work as well.

In the second window, the GDB test:

(gdb) attach 48907
Attaching to process 48907
Reading symbols from /usr/home/paulf/scratch/valgrind/memcheck/memcheck-amd64-freebsd...
[Switching to LWP 104670 of process 48907]
vgModuleLocal_do_syscall_for_client_WRK () at m_syswrap/syscall-amd64-freebsd.S:144
144        setc  0(%rsp)         /* stash returned carry flag */
(gdb) bt
#0  vgModuleLocal_do_syscall_for_client_WRK () at m_syswrap/syscall-amd64-freebsd.S:144
#1  0x000000003819f27a in do_syscall_for_client (syscallno=240, tst=0x1002024f10, syscall_mask=0x1002ca9e20) at m_syswrap/syswrap-main.c:368
#2  vgPlain_client_syscall (tid=tid@entry=1, trc=trc@entry=73) at m_syswrap/syswrap-main.c:2341
#3  0x000000003819b150 in handle_syscall (tid=tid@entry=1, trc=trc@entry=73) at m_scheduler/scheduler.c:1206
#4  0x0000000038199223 in vgPlain_scheduler (tid=tid@entry=1) at m_scheduler/scheduler.c:1552
#5  0x00000000381ab33c in thread_wrapper (tidW=1) at m_syswrap/syswrap-freebsd.c:112
#6  run_a_thread_NORETURN (tidW=1) at m_syswrap/syswrap-freebsd.c:166
#7  0x0000000000000000 in ?? ()

That's what I'd expect.

And lldb:
(lldb) attach 48907
This version of LLDB has no plugin for the language "assembler". Inspection of frame variables will be limited.
Process 48907 stopped
* thread #1, name = 'memcheck-amd64-f', stop reason = signal SIGSTOP
    frame #0: 0x00000000381a03a6 memcheck-amd64-freebsd`vgModuleLocal_do_syscall_for_client_WRK at syscall-amd64-freebsd.S:144
   141        but hasn't been committed to RAX. */
   142 
   143     /* stack contents: 3 words for syscall above, plus our prologue */
-> 144     setc  0(%rsp)         /* stash returned carry flag */
   145 
   146     movq  -16(%rbp), %r11 /* r11 = VexGuestAMD64State * */
   147     movq  %rax, OFFSET_amd64_RAX(%r11)    /* save back to RAX */
Executable module set to "/usr/home/paulf/scratch/valgrind/memcheck/memcheck-amd64-freebsd".
Architecture set to: x86_64-unknown-freebsd13.2.
(lldb) bt
* thread #1, name = 'memcheck-amd64-f', stop reason = signal SIGSTOP
  * frame #0: 0x00000000381a03a6 memcheck-amd64-freebsd`vgModuleLocal_do_syscall_for_client_WRK at syscall-amd64-freebsd.S:144
    frame #1: 0x000000003819f27a memcheck-amd64-freebsd`vgPlain_client_syscall [inlined] do_syscall_for_client(syscallno=240, tst=0x0000001002024f10, syscall_mask=0x0000001002ca9e20) at syswrap-main.c:368:10
    frame #2: 0x000000003819f232 memcheck-amd64-freebsd`vgPlain_client_syscall(tid=1, trc=73) at syswrap-main.c:2341:10
    frame #3: 0x000000003819b150 memcheck-amd64-freebsd`handle_syscall(tid=1, trc=73) at scheduler.c:1206:4
    frame #4: 0x0000000038199223 memcheck-amd64-freebsd`vgPlain_scheduler(tid=1) at scheduler.c:1552:3
    frame #5: 0x00000000381ab33c memcheck-amd64-freebsd`run_a_thread_NORETURN [inlined] thread_wrapper(tidW=1) at syswrap-freebsd.c:112:10
    frame #6: 0x00000000381ab2c6 memcheck-amd64-freebsd`run_a_thread_NORETURN(tidW=1) at syswrap-freebsd.c:166:10

Again, that's OK.

I've downloaded the pstack source from GitHub and built it.

In gdb, looking at elfFindSymbolByAddress, I see that the address pstack is using is the same as the address I see when attaching gdb, namely 0x381a03a6.

There is no .dynsym so elfFindSectionByName finds .symtab.

symStrings looks OK to me. The first entry is nil, and after that there is

(gdb) x /s obj->fileData + shdrs[symSection->sh_link]->sh_offset+1
0x8027e5829:    "mc_leakcheck.c"

That matches what I see with objdump -t:

paulf> objdump -t /usr/home/paulf/scratch/valgrind/memcheck/memcheck-amd64-freebsd | less                   

/usr/home/paulf/scratch/valgrind/memcheck/memcheck-amd64-freebsd:     file format elf64-x86-64-freebsd

SYMBOL TABLE:
0000000000000000 l    df *ABS*  0000000000000000 mc_leakcheck.c

And the function where it is sleeping is

00000000381a034c g       .text  0000000000000000 vgModuleLocal_do_syscall_for_client_WRK
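
The string-table lookup checked above (the symbol section's sh_link names the string table, whose first byte is the empty string) can be sketched roughly as follows. This is only an illustration: sk_strtab_name and its parameters are hypothetical names, not part of pstack, and it assumes the string table has already been read into memory.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not pstack's API): return the NUL-terminated name
 * stored at byte offset `off` of an ELF string table, or NULL if the
 * offset is out of range or the string is not terminated inside the table.
 * In a real ELF file `off` would come from the symbol's st_name field.
 */
const char *
sk_strtab_name(const char *strtab, size_t strtab_size, uint32_t off)
{
    if (off >= strtab_size)
        return NULL;
    /* Make sure a terminator exists before the end of the table. */
    if (memchr(strtab + off, '\0', strtab_size - off) == NULL)
        return NULL;
    return strtab + off;
}
```

Offset 0 yields the empty string, which is why the gdb expression above adds 1 to reach "mc_leakcheck.c".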

I've done some more debugging and I've seen one error.

@@ -196,11 +198,23 @@ elfFindSymbolByAddress(struct ElfObject *obj, Elf_Addr addr,
                    symSection->sh_offset + symSection->sh_size);
 
                for (; sym < endSym; sym++) {
-                       if ((type == STT_NOTYPE ||
+                       if ((ELF_ST_TYPE(sym->st_info) == STT_NOTYPE ||

elfFindSymbolByAddress is only ever called with type == STT_FUNC, so
the test type == STT_NOTYPE is always false and functions whose symbol type is
STT_NOTYPE are never matched. I suppose STT_NOTYPE also means a size of 0.

With the above change I get

  0x381a03a6 vgModuleLocal_do_syscall_for_client_WRK (3819f27a, 0, 1301, 0, 3812ddf8, 0) + 5a

but only the one line.
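
For illustration, here is a minimal sketch of an address-to-symbol lookup that accepts untyped, zero-size symbols. It is not pstack's code: the sk_ names and the simplified symbol struct are made up for this example, and a real implementation would work on Elf_Sym entries and also check st_shndx.

```c
#include <stddef.h>
#include <stdint.h>

/* Same values as STT_NOTYPE / STT_FUNC / ELF_ST_TYPE in the ELF spec. */
#define SK_STT_NOTYPE 0
#define SK_STT_FUNC   2
#define SK_ST_TYPE(info) ((info) & 0xf)

/* Simplified, hypothetical stand-in for an ELF symbol table entry. */
struct sk_sym {
    const char   *name;
    uint64_t      value;  /* start address */
    uint64_t      size;   /* 0 for many assembler-defined symbols */
    unsigned char info;   /* binding in the high nibble, type in the low */
};

/*
 * Return the symbol covering addr. Symbols of the requested type and
 * untyped (STT_NOTYPE) symbols are both candidates. A sized symbol that
 * contains addr wins outright; otherwise the closest preceding zero-size
 * symbol is returned.
 */
const struct sk_sym *
sk_find_by_addr(const struct sk_sym *syms, size_t n, uint64_t addr, int type)
{
    const struct sk_sym *best = NULL;

    for (size_t i = 0; i < n; i++) {
        const struct sk_sym *s = &syms[i];
        int st = SK_ST_TYPE(s->info);

        if (st != type && st != SK_STT_NOTYPE)
            continue;
        if (s->value > addr)
            continue;
        if (s->size != 0) {
            if (addr - s->value < s->size)
                return s;               /* exact containment */
        } else if (best == NULL || s->value > best->value) {
            best = s;                   /* nearest zero-size symbol so far */
        }
    }
    return best;
}
```

With the symbol at 0x381a034c and the address 0x381a03a6, the offset is 0x381a03a6 - 0x381a034c = 0x5a, which matches the "+ 5a" in the output above.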

I need to do some more debugging of elfFindSymbolByAddress() to see why it's not getting the full callstack.

Comment 3 Paul Floyd 2023-09-24 20:16:37 UTC

procReadThread() is what I meant at the end of comment 2, not elfFindSymbolByAddress().

(gdb) info frame
Stack level 0, frame at 0x1002ca9d90:
 rip = 0x381a03a6 in vgModuleLocal_do_syscall_for_client_WRK (m_syswrap/syscall-amd64-freebsd.S:144); saved rip = 0x3819f27a
 called by frame at 0x1002ca9ea0
 source language asm.
 Arglist at 0x1002ca9d80, args: 
 Locals at 0x1002ca9d80, Previous frame's sp is 0x1002ca9d90
 Saved registers:
  rbp at 0x1002ca9d80, rip at 0x1002ca9d88

procReadThread breaks out of its loop on the first pass.

Here's a printf that I added

DEBUG: procReadThread bp 1002024f20 <= frame->bp 1002ca9d80
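
The check behind that DEBUG line can be sketched as below. This is an assumed simplification of a frame-pointer unwinder, not pstack's actual procReadThread: the stack is modelled as an array of 64-bit words, and all sk_ names are hypothetical. It assumes the usual amd64 layout where the saved rbp sits at [rbp] and the return address at [rbp + 8].

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical snapshot of stack memory: 8-byte words starting at base. */
struct sk_stack {
    uint64_t        base;
    const uint64_t *words;
    size_t          nwords;
};

/* Read one aligned word at addr; return -1 if it lies outside the snapshot. */
static int
sk_read_word(const struct sk_stack *st, uint64_t addr, uint64_t *out)
{
    if (addr < st->base || (addr - st->base) % 8 != 0)
        return -1;
    size_t idx = (size_t)((addr - st->base) / 8);
    if (idx >= st->nwords)
        return -1;
    *out = st->words[idx];
    return 0;
}

/*
 * Follow the saved-rbp chain, storing one instruction pointer per frame.
 * The stack grows down, so a caller's frame must be at a higher address;
 * a saved rbp that is not above the current one ends the walk.
 */
size_t
sk_walk_frames(const struct sk_stack *st, uint64_t ip, uint64_t bp,
    uint64_t *ips, size_t max)
{
    size_t n = 0;

    while (n < max) {
        uint64_t next_bp, ret;

        ips[n++] = ip;
        if (sk_read_word(st, bp, &next_bp) != 0 ||
            sk_read_word(st, bp + 8, &ret) != 0)
            break;
        if (next_bp <= bp)  /* the "bp <= frame->bp" case in the DEBUG line */
            break;
        ip = ret;
        bp = next_bp;
    }
    return n;
}
```

Fed the values from this session (saved rbp 0x1002024f20 stored at 0x1002ca9d80), the walk stops after a single frame, which is the one-line output seen from pstack.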


Back over in gdb

current rbp:
(gdb) p $rbp
$5 = (void *) 0x1002ca9fa0

(gdb) info frame 0
Stack frame at 0x1002ca9d90:
 rip = 0x381a03a6 in vgModuleLocal_do_syscall_for_client_WRK (m_syswrap/syscall-amd64-freebsd.S:144); saved rip = 0x3819f27a
 called by frame at 0x1002ca9ea0
 source language asm.
 Arglist at 0x1002ca9d80, args: 
 Locals at 0x1002ca9d80, Previous frame's sp is 0x1002ca9d90
 Saved registers:
  rbp at 0x1002ca9d80, rip at 0x1002ca9d88
(gdb) info frame 1
Stack frame at 0x1002ca9ea0:
 rip = 0x3819f27a in do_syscall_for_client (m_syswrap/syswrap-main.c:368); saved rip = 0x3819b150
 inlined into frame 2, caller of frame at 0x1002ca9d90
 source language c.
 Arglist at unknown address.
 Locals at unknown address, Previous frame's sp is 0x1002ca9d90
 Saved registers:
  rbp at 0x1002ca9d80, rip at 0x1002ca9d88
(gdb) info frame 3
Stack frame at 0x1002ca9f10:
 rip = 0x3819b150 in handle_syscall (m_scheduler/scheduler.c:1206); saved rip = 0x38199223
 called by frame at 0x1002ca9fb0, caller of frame at 0x1002ca9ea0
 source language c.
 Arglist at 0x1002ca9ea0, args: tid=tid@entry=1, trc=trc@entry=73
 Locals at 0x1002ca9ea0, Previous frame's sp is 0x1002ca9f10
 Saved registers:
  rbx at 0x1002ca9ef0, rbp at 0x1002ca9f00, r14 at 0x1002ca9ef8, rip at 0x1002ca9f08

I'm beginning to wonder if the saved $rbps aren't just nonsense and whether gdb and lldb are getting their stack traces from debuginfo.

Valgrind is a bit of a monster.

On startup (its own _start in assembler) it allocates its own small bootstrap stack of 1M. Once it has done all its initialization it allocates another stack with guard pages and does a stack transfer to that stack via a mov to rsp then a ret. The new stack appears from then on as the bottom of the callstack.

On top of that, the callstack that I've been looking at consists of

assembler function
inlined C function
C function called via longjmp
then "normal" C functions up to the re-rooted bottom of callstack.

It looks like the strange $rbp happens after the longjmp.

Comment 4 Paul Floyd 2023-09-25 05:45:23 UTC

Created attachment 245205 [details]
patch for notype functions

Fixes lookup for functions with notype and 0 size.

Comment 5 Paul Floyd 2023-09-25 06:42:09 UTC

In the end two changes are required. There's the patch that I just uploaded that fixes the problem with notype functions.

The other thing is that I needed to build Valgrind with -fno-omit-frame-pointer. From what I've seen most functions _are_ using the frame pointer. I need to do a bit more work to see if I can just tweak a few things to get the base pointer used everywhere.

Yesterday I stumbled across the bstack package when looking to see if there is a FreeBSD version of gstack. Using that is probably the best solution. pstack is not doing a great job of reinventing the wheel, plus upstream pstack seems unmaintained (no changes in 11 years).

Comment 6 Fernando Apesteguía freebsd_committer 2023-10-02 06:55:00 UTC

Hi Paul,

I think this patch should be reported upstream since it is not a FreeBSD *port* issue per se. What do you think?

Comment 7 Paul Floyd 2023-10-02 08:12:51 UTC

(In reply to Fernando Apesteguía from comment #6)

Yes, but the upstream repo hasn't changed for 11 years:
https://github.com/z0nt/pstack

I'll try a pull request, but I'm not hopeful.

It's not a big issue; "bstack" works much better and I'll be using that from now on. I was only using pstack because I have a lot of Solaris baggage.