Summary: | devel/git: crashes on start in rtld on QEMU with Hypervisor.framework on an M2 MacBook Pro (virtualised AArch64) | ||||||
---|---|---|---|---|---|---|---|
Product: | Ports & Packages | Reporter: | David Chisnall <theraven> | ||||
Component: | Individual Port(s) | Assignee: | David Chisnall <theraven> | ||||
Status: | Open --- | ||||||
Severity: | Affects Some People | CC: | Andrew, grahamperrin, theraven | ||||
Priority: | --- | Keywords: | crash, needs-qa | ||||
Version: | Latest | Flags: | bugzilla: maintainer-feedback? (garga) | ||||
Hardware: | arm64 | ||||||
OS: | Any | ||||||
URL: | https://www.freshports.org/devel/git/ | ||||||
Attachments: |
|
Description
David Chisnall
I have also tested with hypervisor support disabled (QEMU in pure emulation mode, which is much slower) and can confirm that the problem persists.

I'm not able to reproduce this on 14-CURRENT, 13-STABLE, or 13.2-RELEASE.

The store instruction is using q0 as the data, and x0 as the address. These registers don't alias. There are a few cases where the kernel can raise a SIGBUS that are not directly from the trap, e.g. vm_fault returns KERN_RESOURCE_SHORTAGE or KERN_OUT_OF_BOUNDS.

(In reply to Andrew Turner from comment #2)
> I'm not able to reproduce this on 14-CURRENT, 13-STABLE, or 13.2-RELEASE.

For me, it is 100% deterministic on 13.2-RELEASE (with and without running FreeBSD update).

> The store instruction is using q0 as the data, and x0 as the address.

Yup, Renato showed me that, it looks as if this spelling of the Neon store is barely documented.

> These registers don't alias. There are a few cases where the kernel can raise a SIGBUS that are not directly from the trap, e.g. vm_fault returns KERN_RESOURCE_SHORTAGE or KERN_OUT_OF_BOUNDS.

This made me wonder if the problem was the VirtIO balloon driver responding too slowly, but disabling the balloon driver doesn't fix it. A clean boot with minimal things running (sshd is about the only thing) shows 31G free in top and still crashes at this point.

I also wondered if it was a problem with the wrong flavour of memory, so I tried reducing the total memory in the VM from 32 GiB to 768 MiB but that didn't make a difference.

What would be helpful for me to try?

I now have sources installed, which makes the debugger a bit more useful (though rtld is somewhat too optimised for it to be very useful). The fault in rtld is in map_objects.c:262 (the call to memset in the middle of BSS setup). I vaguely remember from some snmalloc debugging a while ago that libpcre2 hits some rtld paths on x86 that almost nothing else (except tcl?) does.
The value of mapbase at that point (according to lldb) is 0x000000008178c000, which looks sensible (the base address of /usr/local/lib/libpcre2-8.so.0.11.2), but unfortunately clear_vaddr is optimised away. The procstat output for the region containing the address has C but not N in the flags, which I believe means that it is a CoW page that has already been copied (REF is 1, which supports this?). I'm not sure if this was the first write to that page; if so, it looks as if the CoW is proceeding correctly but we're then receiving a signal on the way back anyway.

Setting a breakpoint in rtld doesn't seem to work here for me (maybe lldb is using hardware breakpoints and they are not working in the VM? Switching to emulation mode doesn't make them work either, so maybe they just don't work on AArch64?).

Not sure if there's anything useful in these bits of dmesg that might help isolate anything CPU-model specific:

```
CPU 0: Unknown Implementer (midr: 00000000) affinity: 0
Cache Type = <64 byte D-cacheline,64 byte I-cacheline,PIPT ICache,64 byte ERG,64 byte CWG>
Instruction Set Attributes 0 = <TLBI-OSR,CondM-8.5,FHM,DP,SHA3,RDM,Atomic,CRC32,SHA2+SHA512,SHA1,AES+PMULL>
Instruction Set Attributes 1 = <PredInv,SB,FRINTTS,GPI,RCPC-8.4,FCMA,JSCVT,Impl PAuth+FPAC,DCCVADP>
Instruction Set Attributes 2 = <>
Processor Features 0 = <CSV3,CSV2,PSTATE.DIT,RAS,AdvSIMD+HP,FP+HP,EL1,EL0>
Processor Features 1 = <>
Memory Model Features 0 = <ExS,TGran4,TGran16,8bit ASID,4TB PA,0x100000000000000>
Memory Model Features 1 = <XNX,SpecSEI,PAN+ATS1E1,LO,HPD+TTPBHA,8bit VMID>
Memory Model Features 2 = <E0PD,TTL,IDS,AT,32bit CCIDX,48bit VA,IESB,UAO,CnP>
Debug Features 0 = <DoubleLock,2 CTX BKPTs,4 Watchpoints,6 Breakpoints,Debugv8>
Debug Features 1 = <>
Auxiliary Features 0 = <>
Auxiliary Features 1 = <>
```

I tried replacing the memset in rtld with a trivial C one. This still faults in roughly the same place.
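For anyone reading the procstat flags mentioned above, here is a minimal sketch of a decoder for the FLAGS column of `procstat -v`. The letter meanings are taken from the procstat(1) manual page as best I recall it, so treat the table as an assumption and check it against the man page on the release in question. The helper name `decode_vm_flags` is hypothetical.

```python
# Flag letters for the FLAGS column of `procstat -v`.
# Assumption: meanings as listed in procstat(1); verify on your release.
PROCSTAT_VM_FLAGS = {
    "C": "copy-on-write",
    "N": "needs copy",
    "S": "superpage mapping(s) in use",
    "D": "grows down",
    "U": "grows up",
    "W": "wired (mlock/mlockall)",
}


def decode_vm_flags(flags: str) -> list[str]:
    """Describe each letter set in a FLAGS field such as 'C----'."""
    return [
        PROCSTAT_VM_FLAGS.get(ch, f"unknown flag {ch!r}")
        for ch in flags
        if ch != "-"
    ]


# A region with C set and N clear, as described in the comment above.
print(decode_vm_flags("C----"))
```

A field with C but without N decodes to copy-on-write only, which matches the reading that the mapping is CoW and no longer waiting for its shadow object to be created.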
I added a debug printf in front of the memset and see (ASLR changes the exact numbers each run):

```
Clearing 1848 bytes from 0x56b8c8
Clearing 1472 bytes from 0x20428e53ca40
```

And then the crash. When I run clang, I see that this code path is hit multiple times. This address is in the same place as before (from procstat -v):

```
41103 0x20428e53c000 0x20428e53d000 rw- 0 0 1 0 C---- vn /usr/local/lib/libpcre2-8.so.0.11.2
```

With the C version, I can see that it is faulting on the *first* byte write. Not sure what's happening here.

Can you print the signal info? In gdb it's in $_siginfo, I'm not sure what it is in lldb. There is some extra information in it that will help narrow down what is causing the SIGBUS, e.g. trapno has the exception number.

(In reply to Andrew Turner from comment #4)
Unfortunately, gdb crashes on start with the same error while loading libiconv. I can see some commits for LLDB to support siginfo but I can't figure out how it is exposed in the UI. Ktrace shows:

```
1489 git CALL mmap(0x83057000,0x1000,0x3<PROT_READ|PROT_WRITE>,0x40012<MAP_PRIVATE|MAP_FIXED|MAP_PREFAULT_READ>,0x3,0xa6000)
1489 git RET mmap 2198171648/0x83057000
1489 git PFLT 0x83057a40 0x2<VM_PROT_WRITE>
1489 git PRET KERN_OUT_OF_BOUNDS
1489 git PSIG SIGBUS SIG_DFL code=BUS_OBJERR
```

Not sure if that helps? It looks like the page fault is failing for some reason (not sure what KERN_OUT_OF_BOUNDS means in this context).

It looks like vm_fault can return KERN_OUT_OF_BOUNDS in a few places. It looks like it could be from vm_fault_allocate or vm_fault_getpages. Was there anything printed to the console? In the vm_fault_getpages case it could print "vm_fault: pager read error, pid <pid> (<proc name>)" in one of the failure cases that would lead to an out of bounds error.

(In reply to Andrew Turner from comment #6)
Nothing in the console or dmesg.
I don’t mind trashing this VM and it’s very fast to build a kernel in it, so if there are places in the kernel where you want me to stick some printfs, let me know where they are. I’ll try upgrading to CURRENT soon as well.

Created attachment 243072
Extra debugging for releng/13.2

Something like the attached patch.
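The addresses quoted in the debug printf, procstat and ktrace output earlier in the thread can be cross-checked with a little page arithmetic. The sketch below assumes 4 KiB pages (an assumption, though consistent with the 0x1000-byte mmap in the ktrace); the helper names are hypothetical and every numeric value is copied from the output quoted above.

```python
PAGE_SIZE = 0x1000  # assumption: 4 KiB pages (matches the 0x1000-byte mmap in the ktrace)


def page_base(addr: int) -> int:
    """Round an address down to the start of its page."""
    return addr & ~(PAGE_SIZE - 1)


def bytes_to_page_end(addr: int) -> int:
    """Number of bytes from addr up to (not including) the next page boundary."""
    return page_base(addr) + PAGE_SIZE - addr


# (address, length) pairs copied from the debug printf output.
clears = [(0x56B8C8, 1848), (0x20428E53CA40, 1472)]

# Start and end of the mapping copied from the procstat -v line.
map_start, map_end = 0x20428E53C000, 0x20428E53D000

# mmap address, file offset and faulting address copied from the ktrace (a different run).
mmap_addr, mmap_offset, fault_addr = 0x83057000, 0xA6000, 0x83057A40

for addr, length in clears:
    # Printed length next to the computed distance to the page boundary.
    print(hex(addr), length, bytes_to_page_end(addr))

print(hex(fault_addr - mmap_addr))    # offset of the fault within the mapped page
print(hex(mmap_offset // PAGE_SIZE))  # page index of the mapping within the file
```

In both printed cases the clear length equals the distance to the next page boundary, so each memset runs exactly to the end of its page. The faulting address in the ktrace has the same in-page offset (0xa40) as the second clear, which is consistent with the fault happening on the first byte written.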
Testing with the 20230622 -CURRENT VM image does not reproduce this failure. It seems to have been fixed in something that has not been MFC'd. With that patch on 13.2-RELEASE, I see this in my dmesg / console: ``` vm_fault_allocate:1152 FAULT_OUT_OF_BOUNDS: a6 60 vm_fault:1566 KERN_OUT_OF_BOUNDS b vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 
4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault:1587 KERN_OUT_OF_BOUNDS c 4 vm_fault_allocate:1152 FAULT_OUT_OF_BOUNDS: a6 60 vm_fault:1566 KERN_OUT_OF_BOUNDS b vm_fault_allocate:1152 FAULT_OUT_OF_BOUNDS: a6 60 
vm_fault:1566 KERN_OUT_OF_BOUNDS b ``` This is from a single run of git.

Assigning to David since this is not a problem in the git port and I'm not working on this issue at all. |
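The console output from the patched kernel is dominated by one repeated line, so a run-length summary makes it easier to compare runs. This is a minimal sketch using `itertools.groupby`; the `collapse_repeats` helper is hypothetical, and the sample is a shortened excerpt in the same shape as the log above (the real log repeats the third line many more times).

```python
from itertools import groupby


def collapse_repeats(lines):
    """Collapse consecutive identical lines into (line, count) pairs, like `uniq -c`."""
    return [(line, sum(1 for _ in group)) for line, group in groupby(lines)]


# Shortened excerpt; not the full log from the bug report.
sample = """\
vm_fault_allocate:1152 FAULT_OUT_OF_BOUNDS: a6 60
vm_fault:1566 KERN_OUT_OF_BOUNDS b
vm_fault:1587 KERN_OUT_OF_BOUNDS c 4
vm_fault:1587 KERN_OUT_OF_BOUNDS c 4
vm_fault:1587 KERN_OUT_OF_BOUNDS c 4
"""

for line, count in collapse_repeats(sample.splitlines()):
    print(f"{count:4d}  {line}")
```

Only consecutive duplicates are merged, so the ordering of the `vm_fault_allocate` and `vm_fault` messages is preserved in the summary.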