Bug 273688 - sysutils/pstack: does not work with Valgrind
Summary: sysutils/pstack: does not work with Valgrind
Status: Open
Alias: None
Product: Ports & Packages
Classification: Unclassified
Component: Individual Port(s)
Version: Latest
Hardware: Any Any
Importance: --- Affects Only Me
Assignee: freebsd-ports-bugs (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-09-10 07:54 UTC by Paul Floyd
Modified: 2023-10-02 08:12 UTC (History)

See Also:
fernape: maintainer-feedback? (pizzamig)


Attachments
patch for notype functions (468 bytes, patch)
2023-09-25 05:45 UTC, Paul Floyd

Description Paul Floyd 2023-09-10 07:54:37 UTC
I get

paulf@freebsd:~ $ pstack 33671
33671: /home/paulf/valgrind/drd/drd-amd64-freebsd
----------------- thread -1 (running) -----------------
  0x3815ed66 ???????? (3815dcde, 0, 238c470, 10, 380bc159, 0)

which isn't much help.

Info on the exe:

paulf@freebsd:~ $ file /home/paulf/valgrind/drd/drd-amd64-freebsd
/home/paulf/valgrind/drd/drd-amd64-freebsd: ELF 64-bit LSB executable, x86-64, version 1 (FreeBSD), statically linked, for FreeBSD 15.0 (1500000), FreeBSD-style, with debug_info, not stripped

The executable is not just statically linked: it does not link with libc at all. It has its own _start and additionally uses a load address of 0x38000000.
Comment 1 Paul Floyd 2023-09-10 08:01:09 UTC
For comparison, pstack on Illumos gives me

25702:  /export/home/paulf/valgrind/./coregrind/valgrind --command-line-only=y
 00000000580749b6 pollsys  (5a0b6050, 1, 0, 0)
 000000005806308a vgPlain_poll () + 6a
 000000005811de35 vgPlain_poll_no_eintr () + 65
 000000005811ea1e readchar.part.0 () + 2e
 000000005811f3a8 getpkt () + 28
 00000000580afc57 server_main () + 67
 00000000580f2af8 call_gdbserver () + 118
 00000000580f35fa vgPlain_gdbserver () + 7a
 00000000580b69b3 vgPlain_scheduler () + 83
 00000000580c7f01 run_a_thread_NORETURN () + b1
Comment 2 Paul Floyd 2023-09-24 10:26:21 UTC
Some more analysis

In the first window:

./vg-in-place sleep 100000

That's my dev build of Valgrind; plain 'valgrind' from the devel/valgrind-devel pkg should work just as well.

In the second window, GDB test



(gdb) attach 48907
Attaching to process 48907
Reading symbols from /usr/home/paulf/scratch/valgrind/memcheck/memcheck-amd64-freebsd...
[Switching to LWP 104670 of process 48907]
vgModuleLocal_do_syscall_for_client_WRK () at m_syswrap/syscall-amd64-freebsd.S:144
144        setc  0(%rsp)         /* stash returned carry flag */
(gdb) bt
#0  vgModuleLocal_do_syscall_for_client_WRK () at m_syswrap/syscall-amd64-freebsd.S:144
#1  0x000000003819f27a in do_syscall_for_client (syscallno=240, tst=0x1002024f10, syscall_mask=0x1002ca9e20) at m_syswrap/syswrap-main.c:368
#2  vgPlain_client_syscall (tid=tid@entry=1, trc=trc@entry=73) at m_syswrap/syswrap-main.c:2341
#3  0x000000003819b150 in handle_syscall (tid=tid@entry=1, trc=trc@entry=73) at m_scheduler/scheduler.c:1206
#4  0x0000000038199223 in vgPlain_scheduler (tid=tid@entry=1) at m_scheduler/scheduler.c:1552
#5  0x00000000381ab33c in thread_wrapper (tidW=1) at m_syswrap/syswrap-freebsd.c:112
#6  run_a_thread_NORETURN (tidW=1) at m_syswrap/syswrap-freebsd.c:166
#7  0x0000000000000000 in ?? ()

That's what I'd expect.

And lldb:
(lldb) attach 48907
This version of LLDB has no plugin for the language "assembler". Inspection of frame variables will be limited.
Process 48907 stopped
* thread #1, name = 'memcheck-amd64-f', stop reason = signal SIGSTOP
    frame #0: 0x00000000381a03a6 memcheck-amd64-freebsd`vgModuleLocal_do_syscall_for_client_WRK at syscall-amd64-freebsd.S:144
   141        but hasn't been committed to RAX. */
   142 
   143     /* stack contents: 3 words for syscall above, plus our prologue */
-> 144     setc  0(%rsp)         /* stash returned carry flag */
   145 
   146     movq  -16(%rbp), %r11 /* r11 = VexGuestAMD64State * */
   147     movq  %rax, OFFSET_amd64_RAX(%r11)    /* save back to RAX */
Executable module set to "/usr/home/paulf/scratch/valgrind/memcheck/memcheck-amd64-freebsd".
Architecture set to: x86_64-unknown-freebsd13.2.
(lldb) bt
* thread #1, name = 'memcheck-amd64-f', stop reason = signal SIGSTOP
  * frame #0: 0x00000000381a03a6 memcheck-amd64-freebsd`vgModuleLocal_do_syscall_for_client_WRK at syscall-amd64-freebsd.S:144
    frame #1: 0x000000003819f27a memcheck-amd64-freebsd`vgPlain_client_syscall [inlined] do_syscall_for_client(syscallno=240, tst=0x0000001002024f10, syscall_mask=0x0000001002ca9e20) at syswrap-main.c:368:10
    frame #2: 0x000000003819f232 memcheck-amd64-freebsd`vgPlain_client_syscall(tid=1, trc=73) at syswrap-main.c:2341:10
    frame #3: 0x000000003819b150 memcheck-amd64-freebsd`handle_syscall(tid=1, trc=73) at scheduler.c:1206:4
    frame #4: 0x0000000038199223 memcheck-amd64-freebsd`vgPlain_scheduler(tid=1) at scheduler.c:1552:3
    frame #5: 0x00000000381ab33c memcheck-amd64-freebsd`run_a_thread_NORETURN [inlined] thread_wrapper(tidW=1) at syswrap-freebsd.c:112:10
    frame #6: 0x00000000381ab2c6 memcheck-amd64-freebsd`run_a_thread_NORETURN(tidW=1) at syswrap-freebsd.c:166:10

Again, that's OK.

I've downloaded the pstack source from GitHub and built it.

In gdb, looking at elfFindSymbolByAddress, I see that the address pstack is using is the same as the address I see when attaching gdb, namely 0x381a03a6.

There is no .dynsym so elfFindSectionByName finds .symtab.

symStrings looks OK to me. The first entry is the empty string, and after that there is

(gdb) x /s obj->fileData + shdrs[symSection->sh_link]->sh_offset+1
0x8027e5829:    "mc_leakcheck.c"

That matches what I see with objdump -t:

paulf> objdump -t /usr/home/paulf/scratch/valgrind/memcheck/memcheck-amd64-freebsd | less                   

/usr/home/paulf/scratch/valgrind/memcheck/memcheck-amd64-freebsd:     file format elf64-x86-64-freebsd

SYMBOL TABLE:
0000000000000000 l    df *ABS*  0000000000000000 mc_leakcheck.c

And the function where it is sleeping is

00000000381a034c g       .text  0000000000000000 vgModuleLocal_do_syscall_for_client_WRK

I've done some more debugging and I've seen one error.

@@ -196,11 +198,23 @@ elfFindSymbolByAddress(struct ElfObject *obj, Elf_Addr addr,
                    symSection->sh_offset + symSection->sh_size);
 
                for (; sym < endSym; sym++) {
-                       if ((type == STT_NOTYPE ||
+                       if ((ELF_ST_TYPE(sym->st_info) == STT_NOTYPE ||

elfFindSymbolByAddress is only ever called with type == STT_FUNC, so the
original test (type == STT_NOTYPE) is always false and any function whose
symbol type is STT_NOTYPE is never processed. I suppose STT_NOTYPE also means a size of 0.

With the above change I get

  0x381a03a6 vgModuleLocal_do_syscall_for_client_WRK (3819f27a, 0, 1301, 0, 3812ddf8, 0) + 5a

but only the one line.
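
To make the lookup rule concrete, here is a minimal sketch in C. It is not the attached patch and not pstack's real code: struct sk_sym, the SK_* macros and sk_find_by_addr() are hypothetical names made up for illustration (only the numeric values of STT_NOTYPE and STT_FUNC come from the ELF spec). It assumes that a sized symbol matches when the address falls inside its range, and that a notype or zero-size symbol can only match as "nearest symbol at or below the address".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the ELF definitions (values as in the ELF spec). */
#define SK_STT_NOTYPE 0
#define SK_STT_FUNC   2
#define SK_ST_TYPE(info) ((info) & 0xf)   /* low 4 bits of st_info */

/* Hypothetical, simplified symbol record; the name is already resolved. */
struct sk_sym {
    const char   *name;
    uint64_t      value;  /* start address (st_value) */
    uint64_t      size;   /* 0 for the assembler symbol in this report */
    unsigned char info;   /* binding in the high bits, type in the low bits */
};

/*
 * Return the symbol covering addr, or NULL if there is none.
 * The type test looks at each symbol's own type, not only at the
 * requested type, so STT_NOTYPE symbols stay in the running.
 */
const struct sk_sym *
sk_find_by_addr(const struct sk_sym *syms, size_t n, uint64_t addr,
    int want_type)
{
    const struct sk_sym *best = NULL;

    assert(syms != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        const struct sk_sym *s = &syms[i];
        int st = SK_ST_TYPE(s->info);

        if (st != want_type && st != SK_STT_NOTYPE)
            continue;           /* wrong kind of symbol */
        if (s->value > addr)
            continue;           /* starts after the address */
        if (s->size != 0) {
            if (addr - s->value < s->size)
                return s;       /* addr is inside a sized symbol */
            continue;
        }
        /* Zero-size symbol: remember the closest one below addr. */
        if (best == NULL || s->value > best->value)
            best = s;
    }
    return best;
}
```

With a table holding vgModuleLocal_do_syscall_for_client_WRK at 0x381a034c (notype, size 0), looking up 0x381a03a6 returns that symbol, and 0x381a03a6 - 0x381a034c gives the + 5a offset shown above.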

I need to do some more debugging of elfFindSymbolByAddress() to see why it's not getting the full callstack.
Comment 3 Paul Floyd 2023-09-24 20:16:37 UTC
Correction to my last comment: procReadThread() is the function I meant to debug, not elfFindSymbolByAddress().

(gdb) info frame
Stack level 0, frame at 0x1002ca9d90:
 rip = 0x381a03a6 in vgModuleLocal_do_syscall_for_client_WRK (m_syswrap/syscall-amd64-freebsd.S:144); saved rip = 0x3819f27a
 called by frame at 0x1002ca9ea0
 source language asm.
 Arglist at 0x1002ca9d80, args: 
 Locals at 0x1002ca9d80, Previous frame's sp is 0x1002ca9d90
 Saved registers:
  rbp at 0x1002ca9d80, rip at 0x1002ca9d88

procReadThread breaks from the first pass through the loop.

Here's a printf that I added

DEBUG: procReadThread bp 1002024f20 <= frame->bp 1002ca9d80
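
For anyone following along, this is roughly the kind of walk involved, as a sketch in C. It is not pstack's procReadThread(): struct sk_mem, sk_read_word() and sk_walk_frames() are hypothetical, and target memory is modelled as a plain array rather than read from the process. I'm assuming the usual amd64 frame layout (saved rbp at [rbp], return address at [rbp+8]) and reading the DEBUG line as "the saved bp is not above the current frame's bp, so stop".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of target memory: 8-byte words starting at base. */
struct sk_mem {
    uint64_t        base;
    const uint64_t *words;
    size_t          nwords;
};

/* Read one aligned word; returns 0 on success, -1 if out of range. */
int
sk_read_word(const struct sk_mem *m, uint64_t addr, uint64_t *out)
{
    size_t idx;

    if (addr < m->base || (addr - m->base) % 8 != 0)
        return -1;
    idx = (size_t)((addr - m->base) / 8);
    if (idx >= m->nwords)
        return -1;
    *out = m->words[idx];
    return 0;
}

/*
 * Record the current pc, then follow the saved-rbp chain for as long as
 * each saved bp is strictly above the current one. Returns the number of
 * program counters stored in pcs (at most max).
 */
size_t
sk_walk_frames(const struct sk_mem *m, uint64_t pc, uint64_t bp,
    uint64_t *pcs, size_t max)
{
    size_t n = 0;

    assert(pcs != NULL || max == 0);
    while (n < max) {
        uint64_t next_bp, ret;

        pcs[n++] = pc;
        if (sk_read_word(m, bp, &next_bp) != 0 ||
            sk_read_word(m, bp + 8, &ret) != 0)
            break;              /* cannot read the frame */
        if (next_bp <= bp)
            break;              /* chain does not move up the stack */
        pc = ret;
        bp = next_bp;
    }
    return n;
}
```

If the word at the first bp holds something that is not a frame pointer, a walker like this stops after recording a single frame, which matches the one-line output I'm getting.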


Back over in gdb

current rbp:
(gdb) p $rbp
$5 = (void *) 0x1002ca9fa0

(gdb) info frame 0
Stack frame at 0x1002ca9d90:
 rip = 0x381a03a6 in vgModuleLocal_do_syscall_for_client_WRK (m_syswrap/syscall-amd64-freebsd.S:144); saved rip = 0x3819f27a
 called by frame at 0x1002ca9ea0
 source language asm.
 Arglist at 0x1002ca9d80, args: 
 Locals at 0x1002ca9d80, Previous frame's sp is 0x1002ca9d90
 Saved registers:
  rbp at 0x1002ca9d80, rip at 0x1002ca9d88
(gdb) info frame 1
Stack frame at 0x1002ca9ea0:
 rip = 0x3819f27a in do_syscall_for_client (m_syswrap/syswrap-main.c:368); saved rip = 0x3819b150
 inlined into frame 2, caller of frame at 0x1002ca9d90
 source language c.
 Arglist at unknown address.
 Locals at unknown address, Previous frame's sp is 0x1002ca9d90
 Saved registers:
  rbp at 0x1002ca9d80, rip at 0x1002ca9d88
(gdb) info frame 3
Stack frame at 0x1002ca9f10:
 rip = 0x3819b150 in handle_syscall (m_scheduler/scheduler.c:1206); saved rip = 0x38199223
 called by frame at 0x1002ca9fb0, caller of frame at 0x1002ca9ea0
 source language c.
 Arglist at 0x1002ca9ea0, args: tid=tid@entry=1, trc=trc@entry=73
 Locals at 0x1002ca9ea0, Previous frame's sp is 0x1002ca9f10
 Saved registers:
  rbx at 0x1002ca9ef0, rbp at 0x1002ca9f00, r14 at 0x1002ca9ef8, rip at 0x1002ca9f08

I'm beginning to wonder if the saved $rbps aren't just nonsense and whether gdb and lldb are getting their stack traces from debuginfo.

Valgrind is a bit of a monster.

On startup (its own _start in assembler) it allocates its own small bootstrap stack of 1M. Once it has done all its initialization it allocates another stack with guard pages and does a stack transfer to that stack via a mov to rsp then a ret. The new stack appears from then on as the bottom of the callstack.

On top of that, the callstack that I've been looking at consists of

assembler function
inlined C function
C function called via longjmp
then "normal" C functions up to the re-rooted bottom of callstack.

It looks like the strange $rbp happens after the longjmp.
Comment 4 Paul Floyd 2023-09-25 05:45:23 UTC
Created attachment 245205 [details]
patch for notype functions

Fixes lookup for functions with notype and 0 size.
Comment 5 Paul Floyd 2023-09-25 06:42:09 UTC
In the end two changes are required. There's the patch that I just uploaded that fixes the problem with notype functions.

The other thing is that I needed to build Valgrind with -fno-omit-frame-pointer. From what I've seen most functions _are_ using the frame pointer. I need to do a bit more work to see if I can just tweak a few things to get the base pointer used everywhere.

Yesterday I stumbled across the bstack package when looking to see if there is a FreeBSD version of gstack. Using that is probably the best solution. pstack is not doing a great job of reinventing the wheel, plus upstream pstack seems unmaintained (no changes in 11 years).
Comment 6 Fernando Apesteguía freebsd_committer freebsd_triage 2023-10-02 06:55:00 UTC
Hi Paul,

I think this patch should be reported upstream since it is not a FreeBSD *port* issue per se. What do you think?
Comment 7 Paul Floyd 2023-10-02 08:12:51 UTC
(In reply to Fernando Apesteguía from comment #6)

Yes, but the upstream repo hasn't changed for 11 years:
https://github.com/z0nt/pstack

I'll try a pull request, but I'm not hopeful.

It's not a big issue, "bstack" works much better and I'll be using that from now on. I was only using pstack because I have a lot of Solaris baggage.