I am experiencing a consistent kernel panic on FreeBSD 15.0-RELEASE-p4 when running a heavy file scanning tool (csync2) on a local SSD while MooseFS (FUSE) mounts are active (even if the FUSE mounts are excluded from the scan).

Panic details:
Panic String: page fault
Fault virtual address: 0x78 (NULL pointer dereference)
Instruction pointer: fuse_vnop_write+0x1d5

Stack trace:
#0 vpanic
#1 panic
#2 trap_pfault
#3 calltrap
#4 fuse_vnop_write+0x1d5
#5 VOP_WRITE_APV
#6 vn_write
#7 vn_io_fault
#8 dofilewrite
#9 sys_write
#10 amd64_syscall

Observed behavior:
The panic occurs when vfs.numvnodes grows rapidly due to local disk activity. It seems like the vnode reclamation (recycling) process triggers a write operation in the fusefs module, which then dereferences a NULL pointer (offset 0x78).

System environment:
FreeBSD 15.0-RELEASE-p4 GENERIC
RAM: 16GB
FUSE mounts: MooseFS (mfsmount)

Settings tried (but it still panics):
vfs.fusefs.data_cache_mode=0
kern.maxvnodes=2000000
vfs.fusefs.iov_credit=4

Dump header from device: /dev/gpt/swap
  Architecture: amd64
  Architecture Version: 2
  Dump Length: 2126323712
  Blocksize: 512
  Compression: none
  Dumptime: 2026-03-21 17:37:36 +0100
  Hostname: almach.zjk.pl
  Magic: FreeBSD Kernel Dump
  Version String: FreeBSD 15.0-RELEASE-p4 GENERIC
  Panic String: page fault
  Dump Parity: 1743253822
  Bounds: 7
  Dump Status: good
(In reply to zjk from comment #0) Do you have a full core dump? If so, if you look in /var/crash/core.txt.* , there should be a more complete stack trace, with function arguments and stuff. The stack trace that you showed definitely didn't come from the vnode reclamation process, because it starts with sys_write. That means it's a write from userland.
(In reply to Alan Somers from comment #1) Yes, you're right. At first glance it looks like FUSE (from the MooseFS mount), but after extensive testing I suspect the issue is related to memory management - this still needs to be verified:
1. Situation: I started synchronizing a website (40GB, several hundred files) from one server to multiple servers.
2. The website content contains mount points onto the network MooseFS filesystem - but I configured synchronization exclusions for the entire mounted resource (this filesystem is mounted the same way on the other servers).
3. csync2 causes panics not only during uploads, but also during local file scanning alone. I've now confirmed that rsync also causes panics. (Ultimately, after many restarts, I have the servers syncing, but it's not suitable for production - the restarts are regular/repeatable during sync.) The HTTP server runs constantly, and sometimes the core dump trace leads to it.
4. I varied kern.maxvnodes from 500000 to 2000000 (best at 2000000), vfs.fusefs.iov_credit from 4 to 128 (best at 64), and vfs.fusefs.data_cache_mode from 0 to 1 (best at 0). vfs.numvnodes ranges from 60000 to 250000 when idle; when csync2 is running it gets close to 800000.
5. For MooseFS, I varied mfsentrycacheto, mfsattrcacheto, and mfsdirentrycacheto from 5 to 600 (best at 60). MooseFS mount options: rw,late,failok,noatime,mfsdelayedinit,mfscachemode=DIRECT,mfsport=9421,mountprog=/usr/local/bin/mfsmount,mfsentrycacheto=60,mfsattrcacheto=60,mfsdirentrycacheto=60,mfstimeout=30,mfsioretries=60
6. I still have free memory (1GB) and swap usage is at 0.
I'm sending two example core dumps: https://www.filemail.com/d/sqswmdirehrziwk
Best regards
zjk
Subject: Additional observations regarding vnode recycling and stability

I have more data points that might help pinpoint the problem in fusefs.

Observations:
The panic trigger: I've confirmed that the panic (page fault at 0x78 in fuse_vnop_write) consistently occurred exactly when vfs.numvnodes reached the kern.maxvnodes limit (initially set to 500000, then 2,000,000).
Workload context: This happens on 2 out of 8 hosts in a cluster. These 2 hosts serve heavy Apache/httpd traffic from MooseFS (FUSE) mounts. The other 6 hosts, which have lower file activity (around 500,000 vnodes), remained perfectly stable even with the 500000 limit (these 6 include the 3 MooseFS file servers).
Current stability: After increasing kern.maxvnodes to 3,000,000 on the 2 Apache servers and setting vfs.fusefs.data_cache_mode=0, the systems have become stable. But I have not run a sync (csync2/rsync) today.
Current stats: On the problematic hosts, vfs.numvnodes has now stabilized at approximately 2,050,000. When the limit was 2,000,000, the system was perhaps forced into aggressive vnode reclamation/recycling, which triggered the crash.

Hypothesis: The crash seems to be a race condition triggered by the vnode reclamation process. When the kernel tries to retire an inactive FUSE vnode under memory pressure, fuse_vnop_write is called (likely for a dirty page or metadata sync) and dereferences a NULL pointer (offset 0x78).

Current workaround settings:
vfs.fusefs.data_cache_mode=0
vfs.fusefs.iov_credit=4 (reduced from 128)
kern.maxvnodes=3000000
Mount options: noatime (to reduce metadata writes)

But - once more - I have not run a sync test with csync2/rsync today (I need more time, and it must be a window where crashes are acceptable for the www servers ;) ).
Best regards
zjk
Those core dump files are too big for me to download. They take over an hour apiece, and the download failed. Instead of the full core dump, can you send the summary? It should be in /var/crash/core.txt.* . If it's not there, you can generate it by running crashinfo against the vmcore, for example:

crashinfo /var/crash/vmcore.0
(In reply to Alan Somers from comment #4) I have two full core dumps available. Since they are very large, please let me know if you need specific information from them (such as variable values from kgdb) or whether I should upload the full files to a specific location. I was able to extract the stack backtraces from the vmcores using strings(1); the relevant portions are below (I've trimmed the unrelated kernel string-table noise and ipfw log lines that strings also extracted).

System version: FreeBSD almach.zjk.pl 15.0-RELEASE-p4 FreeBSD 15.0-RELEASE-p4 GENERIC amd64

################################################################################
cat panic_log_0.txt

panic: page fault
cpuid = 1
time = 1774124510
KDB: stack backtrace:
#0 0xffffffff80bbe1ed at kdb_backtrace+0x5d
#1 0xffffffff80b71576 at vpanic+0x136
#2 0xffffffff80b71433 at panic+0x43
#3 0xffffffff81079f69 at trap_pfault+0x3c9
#4 0xffffffff81050008 at calltrap+0x8
#5 0xffffffff839c9ad5 at fuse_vnop_write+0x1d5
#6 0xffffffff81140944 at VOP_WRITE_APV+0x94
#7 0xffffffff80c7d7f7 at vn_write+0x287
#8 0xffffffff80c750d1 at vn_io_fault+0x1c1
#9 0xffffffff80be0581 at dofilewrite+0x81
#10 0xffffffff80bdff87 at sys_write+0x127
#11 0xffffffff8107a8a6 at amd64_syscall+0x126
#12 0xffffffff810508fb at fast_syscall_common+0xf8
Uptime: 1h6m2s
Dumping 2083 out of 16168 MB

################################################################################
cat panic_log_9.txt

panic: page fault
cpuid = 2
time = 1774120492
KDB: stack backtrace:
#0 0xffffffff80bbe1ed at kdb_backtrace+0x5d
#1 0xffffffff80b71576 at vpanic+0x136
#2 0xffffffff80b71433 at panic+0x43
#3 0xffffffff81079f69 at trap_pfault+0x3c9
#4 0xffffffff81050008 at calltrap+0x8
#5 0xffffffff8341aad5 at fuse_vnop_write+0x1d5
#6 0xffffffff81140944 at VOP_WRITE_APV+0x94
#7 0xffffffff80c7d7f7 at vn_write+0x287
#8 0xffffffff80c750d1 at vn_io_fault+0x1c1
#9 0xffffffff80be0581 at dofilewrite+0x81
#10 0xffffffff80bdff87 at sys_write+0x127
#11 0xffffffff8107a8a6 at amd64_syscall+0x126
#12 0xffffffff810508fb at fast_syscall_common+0xf8
Uptime: 2h12m36s
Dumping 2251 out of 16168 MB
(In reply to zjk from comment #5) I don't want the full core dumps. I just want the crashinfo output. They should be files with names like /var/crash/core.txt.* and they should be hundreds of kB apiece, not GB. Do you have those?
(In reply to Alan Somers from comment #6) I have them, but they are all the same and do not carry any information - there is a version error:

cat core.txt.9
'version' has unknown type; cast it to its declared type
'version' has unknown type; cast it to its declared type
'version' has unknown type; cast it to its declared type
Unable to find matching kernel for /var/crash/vmcore.9
(In reply to zjk from comment #7) I am using a standard GENERIC kernel installed via freebsd-update. It seems it lacks the debug symbols required by crashinfo, which is why it fails with 'unknown type' and 'matching kernel' errors. However, the manual backtrace I provided earlier was extracted directly from the same vmcore and clearly shows the fault at fuse_vnop_write+0x1d5.
(In reply to zjk from comment #8) Do you have a bunch of *.debug files in /usr/lib/debug/boot/kernel ? If not, then that may be the problem. The easiest way to fix it is to download https://download.freebsd.org/releases/amd64/15.0-RELEASE/kernel-dbg.txz , then run "tar -C / -xf kernel-dbg.txz". Then run "freebsd-update fetch; freebsd-update install". Finally, you'll be able to run crashinfo again. However, that will only work if the crash happened while running the latest release kernel. If your uptime was more than a few months, then it's likely that an older kernel was running at the time. The reason I'd like to see the full crashinfo output is that it contains much more information than the stack trace shown by ddb - for instance, line numbers and argument values.
(In reply to Alan Somers from comment #9) OK - I prepared and sent it. I delete the VMcore periodically, but these are the last two dumps I have. https://www.filemail.com/d/patrqvzolwazjlo If you need more core dumps, let me know. I can produce them for various scenarios by the end of the week. zjk
(In reply to zjk from comment #10) That's perfect. But it would be even better if you could provide one more thing. If the moosefs fuse daemon has some kind of --verbose or --debug output that prints a list of every FUSE opcode as it arrives, could you please provide whatever that shows before the crash? If not, that's ok. I can make do without.
(In reply to Alan Somers from comment #11) I understand what you're asking for. Yes, mfsmaster logs information to syslog (maybe not a full dump of every operation, but I see some interesting things there). I'll run csync2 tomorrow (it will probably cause a crash), collect the information, and maybe something interesting will be logged by MooseFS. Maybe a "spontaneous" dump will be created? I'll also check the syslog. I'll post it when I have the complete file.
It's obvious that getblk failed, but I really don't have any idea why. Could you please try reproducing the bug with this patch applied and share the output with me? This would help to narrow it down.

diff --git a/sys/kern/vfs_bio.c b/sys/kern/vfs_bio.c
index 2368f863a248..6f5ac2b5f363 100644
--- a/sys/kern/vfs_bio.c
+++ b/sys/kern/vfs_bio.c
@@ -3928,8 +3928,10 @@ getblk(struct vnode *vp, daddr_t blkno, int size, int slpflag, int slptimeo,
 	int error;
 
 	error = getblkx(vp, blkno, blkno, size, slpflag, slptimeo, flags, &bp);
-	if (error != 0)
+	if (error != 0) {
+		printf("getblk failed: \n", strerror(error));
 		return (NULL);
+	}
 	return (bp);
 }
(In reply to Alan Somers from comment #13) Since this is a production machine, I'd prefer to avoid a custom kernel build if possible - but it isn't a big problem; I can compile a new kernel too. Also, your patch has a small typo (a missing %s in the printf format string). Could I use DTrace instead? Something like this probe pair (an fbt return probe can't see the entry arguments, so the vnode pointer has to be cached in the entry probe):

dtrace -n 'fbt::getblkx:entry { self->vp = arg0; } fbt::getblkx:return /arg1 != 0/ { printf("getblkx failed for vnode %p, error %d", (void *)self->vp, arg1); self->vp = 0; }'

If the system panics, this message should still be visible in the dmesg buffer within the vmcore (which I can extract with strings). Would the return value of getblkx and the vnode pointer be enough for you to investigate further? I also have new core dumps with messages from the moosefs master - it looks like the master drops communication with the chunkservers - a network problem.
(In reply to zjk from comment #14) The environment uses Intel i219 NICs which are known to be unstable under FreeBSD. I am experiencing frequent network glitches/timeouts on the lagg0 interface. It seems that fuse_vnop_write panics instead of returning an error when the underlying network connection to the MooseFS storage is interrupted. This confirms your suspicion about getblk failing due to these network-induced timeouts.
(In reply to zjk from comment #15) Alan - I built the patched kernel and crashed it with rsync. But nothing interesting was printed by the debug patch, so I ran kgdb. Of course, if you want, I can send you the core.txt, but: I've analyzed the vmcore with kgdb and found that the panic does not occur in getblk. The crash happens in fuse_inval_buf_range (called by fuse_vnop_write) at fuse_vnops.c:359.

Key findings from kgdb:
- vp is a valid VREG vnode.
- vp->v_data is a valid fuse_vnode_data (nid: 78819, size: 484059039).
- The panic is a page fault in frame #8.
- It looks like fuse_inval_buf_range is trying to invalidate a buffer range that causes a fault, even though the vnode structures seem intact.
- The getblk debug patch was never reached.

Please see the bt and p *vp output below.

bt
#0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:57
#1  doadump (textdump=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:399
#2  0xffffffff80b710f9 in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:519
#3  0xffffffff80b71607 in vpanic (fmt=0xffffffff811d12fe "%s", ap=ap@entry=0xfffffe00ff360920) at /usr/src/sys/kern/kern_shutdown.c:974
#4  0xffffffff80b71433 in panic (fmt=<unavailable>) at /usr/src/sys/kern/kern_shutdown.c:887
#5  0xffffffff81079f69 in trap_fatal (frame=<optimized out>, eva=<optimized out>) at /usr/src/sys/amd64/amd64/trap.c:969
#6  0xffffffff81079f69 in trap_pfault (frame=0xfffffe00ff3609a0, usermode=false, signo=<optimized out>, ucode=<optimized out>)
#7  <signal handler called>
#8  fuse_inval_buf_range (vp=vp@entry=0xfffff8007502f898, filesize=<optimized out>, start=<optimized out>, end=<optimized out>) at /usr/src/sys/fs/fuse/fuse_vnops.c:359
#9  0xffffffff8396bad5 in fuse_vnop_write (ap=<optimized out>) at /usr/src/sys/fs/fuse/fuse_vnops.c:2545
#10 0xffffffff81140944 in VOP_WRITE_APV (vop=0xffffffff83976898 <fuse_vnops>, a=a@entry=0xfffffe00ff360c40) at vnode_if.c:1096
#11 0xffffffff80c7d887 in VOP_WRITE (vp=0xfffff8007502f898, uio=0xfffffe00ff360da8, ioflag=131075, cred=0x0) at ./vnode_if.h:595
#12 vn_write (fp=fp@entry=0xfffff801833b7af0, uio=uio@entry=0xfffffe00ff360da8, active_cred=active_cred@entry=0xfffff802c2b86600, flags=<optimized out>, td=<optimized out>) at /usr/src/sys/kern/vfs_vnops.c:1300
#13 0xffffffff80c75161 in vn_io_fault (fp=0xfffff801833b7af0, uio=0xfffffe00ff360da8, active_cred=0xfffff802c2b86600, flags=0, td=<optimized out>) at /usr/src/sys/kern/vfs_vnops.c:1614
#14 0xffffffff80be0581 in fo_write (fp=0xfffff801833b7af0, uio=0xfffffe00ff360da8, active_cred=0x0, td=0xfffff8027eb6a780, flags=<optimized out>) at /usr/src/sys/sys/file.h:362
#15 dofilewrite (td=td@entry=0xfffff8027eb6a780, fd=fd@entry=152, fp=0xfffff801833b7af0, auio=auio@entry=0xfffffe00ff360da8, offset=offset@entry=-1, flags=flags@entry=0) at /usr/src/sys/kern/sys_generic.c:565
#16 0xffffffff80bdff87 in kern_writev (td=0xfffff8027eb6a780, fd=152, auio=0xfffffe00ff360da8) at /usr/src/sys/kern/sys_generic.c:492
#17 sys_write (td=0xfffff8027eb6a780, uap=<optimized out>) at /usr/src/sys/kern/sys_generic.c:407
#18 0xffffffff8107a8a6 in syscallenter (td=0xfffff8027eb6a780) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:193
#19 amd64_syscall (td=0xfffff8027eb6a780, traced=0) at /usr/src/sys/amd64/amd64/trap.c:1208
#20 <signal handler called>
#21 0x000000082a78516a in ?? ()
Backtrace stopped: Cannot access memory at address 0x82078fe18

f 8
#8  fuse_inval_buf_range (vp=vp@entry=0xfffff8007502f898, filesize=<optimized out>, start=<optimized out>, end=<optimized out>) at /usr/src/sys/fs/fuse/fuse_vnops.c:359
359	if ((bp->b_flags & B_CACHE) != 0 && bp->b_dirtyoff < right_on) {

p *vp
$1 = {v_type = VREG, v_state = VSTATE_CONSTRUCTED, v_irflag = 0, v_seqc = 0, v_nchash = 443112150, v_hash = 309852434, v_op = 0xffffffff83976898 <fuse_vnops>, v_data = 0xfffff8015a143a80, v_mount = 0xfffffe00c1910b00, v_nmntvnodes = {tqe_next = 0xfffff8007541b898, tqe_prev = 0xfffff800755a8028}, {v_mountedhere = 0x0, v_unpcb = 0x0, v_rdev = 0x0, v_fifoinfo = 0x0}, v_hashlist = {le_next = 0x0, le_prev = 0xfffffe00f35ef4b8}, v_cache_src = {lh_first = 0x0}, v_cache_dst = {tqh_first = 0xfffff8030ee0ff58, tqh_last = 0xfffff8030ee0ff68}, v_cache_dd = 0xfffff8030ee0ff58, v_lock = {lock_object = {lo_name = 0xffffffff839723ff "fuse", lo_flags = 116588544, lo_data = 0, lo_witness = 0x0}, lk_lock = 18446735288332363648, lk_exslpfail = 0, lk_pri = 44, lk_timo = 51}, v_interlock = {lock_object = {lo_name = 0xffffffff811d76f7 "vnode interlock", lo_flags = 16973824, lo_data = 0, lo_witness = 0x0}, mtx_lock = 0}, v_vnlock = 0xfffff8007502f908, v_vnodelist = {tqe_next = 0x0, tqe_prev = 0xfffff8007502fcc8}, v_lazylist = {tqe_next = 0x0, tqe_prev = 0xfffffe00c1911550}, v_bufobj = {bo_lock = {lock_object = {lo_name = 0xffffffff81217a74 "bufobj interlock", lo_flags = 86179840, lo_data = 0, lo_witness = 0x0}, rw_lock = 1}, bo_ops = 0xffffffff818e46d0 <buf_ops_bio>, bo_object = 0xfffff803903c02e8, bo_synclist = {le_next = 0x0, le_prev = 0x0}, bo_private = 0xfffff8007502f898, bo_clean = {bv_hd = {tqh_first = 0x0, tqh_last = 0xfffff8007502f9c0}, bv_root = {pt_root = 0x1}, bv_cnt = 0}, bo_dirty = {bv_hd = {tqh_first = 0x0, tqh_last = 0xfffff8007502f9e0}, bv_root = {pt_root = 0x1}, bv_cnt = 0}, bo_numoutput = 0, bo_flag = 0, bo_domain = 4, bo_bsize = 65536}, v_pollinfo = 0x0, v_label = 0x0, v_lockf = 0x0, v_rl = {head = 1, sleepers = false}, v_holdcnt = 1, v_usecount = 1, v_iflag = 0, v_vflag = 0, v_mflag = 1, v_dbatchcpu = 0, v_writecount = 1, v_seqc_users = 0}

(p *(struct vnode *)vp printed the identical structure.)
= 0, v_vflag = 0, v_mflag = 1, v_dbatchcpu = 0, v_writecount = 1, v_seqc_users = 0} p *(struct fuse_vnode_data *)vp->v_data $3 = {nid = 78819, generation = 1, parent_nid = 0, handles = {lh_first = 0xfffff80075db0900}, flag = 6656, attr_cache_timeout = {sec = 1623, frac = 14421952452054373985}, entry_cache_timeout = {sec = 1623, frac = 14421952452054373985}, last_local_modify = {tv_sec = 1563, tv_nsec = 892812274}, cached_attrs = {va_type = VREG, va_mode = 416, va_bsdflags = 0, va_uid = 80, va_gid = 80, va_nlink = 1, va_fsid = 18446744073390849800, va_fileid = 78819, va_size = 484059039, va_blocksize = 65536, va_atime = {tv_sec = 1774496793, tv_nsec = 0}, va_mtime = {tv_sec = 1774526210, tv_nsec = 0}, va_ctime = { tv_sec = 1774526210, tv_nsec = 0}, va_birthtime = {tv_sec = -1, tv_nsec = 0}, va_gen = 0, va_flags = 0, va_rdev = 0, va_bytes = 484054528, va_filerev = 18446744073709551615, va_vaflags = 0, va_spare = 0}, nlookup = 34, vtype = VNON, clusterw = {v_cstart = 0, v_lasta = 0, v_lastw = 0, v_clen = 0}}
(In reply to zjk from comment #16) Alan, I have analyzed the vmcore from the latest crash. It appears the panic occurs in fuse_inval_buf_range at fuse_vnops.c:359, where getblk returns NULL. Here is the relevant part of the disassembly from my crash (frame #8):

358             bp = getblk(vp, right_lbn, right_blksize, PCATCH, 0, 0);
   0xffffffff8396e3f0 <+304>:   call   0xffffffff80c380f0 <getblk>
359             if ((bp->b_flags & B_CACHE) != 0 && bp->b_dirtyoff < right_on) {
=> 0xffffffff8396e3f5 <+309>:   testb  $0x20,0x78(%rax)

At the time of the panic, the register %rax is 0x0 (as returned by getblk), which causes the page fault when the code immediately attempts to access bp->b_flags. Values from kgdb at frame #8:

bp = 0x0
right_blksize = 65536
right_lbn = <optimized out>
vp = 0xfffff8007502f898 (VREG)

The fuse_inval_buf_range function appears to dereference bp without checking it for NULL. This also explains why the debug patch in vfs_bio.c was never triggered - the system panics here first.
1) You can't use dtrace in this case because dtrace works asynchronously. The message probably won't be printed before the panic happens.
2) The fact that you never saw that message printed even after patching the kernel tells us that getblkx did not return an error; it just didn't initialize bp. That's valuable to know. But I don't see how that can happen. I must be missing something. Either that, or you put the debug printf that I supplied in the wrong location by mistake. Because if getblkx returned an error, then that printf definitely should've printed.
3) At the very end of getblkx, just before the return, is a KASSERT that dereferences bp. That dereference ought to trigger a page fault, but the KASSERT is disabled by default on release kernels. If you can reproduce the problem with 16.0-CURRENT, or with a 15.0-RELEASE kernel built using the GENERIC-DEBUG configuration, that would get us closer to the cause of the problem.
4) getblkx also includes some CTR trace points. Those things are a bit like an older, crappier version of dtrace. But you should be able to view them from a core file. That would also reveal some information. But you'll have to rebuild the kernel to enable it, placing these options in your kernel config file:

options KTR
options KTR_BOOT_ENTRIES=1024
options KTR_ENTRIES=(128*1024)
options KTR_COMPILE=(KTR_ALL)
options KTR_MASK=KTR_BUF
options KTR_CPUMASK=0xFFFF
Alan - I have the results. I've also improved your debug printf (it now prints only the error number - and it works): it appears twice, at startup and then during the panic. The numbers are 60 (ETIMEDOUT - at the panic) and 16 (EBUSY - at startup). You'll also find both instances in the core under the tag ALAN-DEBUG. I also put together a package: core + KTR dump + an image of system startup where it appears. Sending it now. If I'm doing something wrong, let me know; I'm not sure this is what you wanted. https://www.filemail.com/d/cquppwqeturkouo
Created attachment 269230 [details] Handle memory allocation failures more gracefully That ETIMEDOUT clue was key. It tells us that we're looking at a memory allocation failure. That makes sense, considering your workload. Unfortunately such failures are difficult to simulate during testing. But could you please try the attached patch? I think it should fix your problem. With this patch, writes that would previously page fault should now fail with EINTR. Also, would you mind telling me how much RAM and swap you have?
The patch won't compile because the kernel doesn't have the strerror function - it throws a compilation error, and I don't know where to find a replacement for it in the kernel libraries. I have 16GB of RAM and 16GB of swap - now empty, 3 hours after the reboot. If there's anything in swap, it's usually around 0.5GB after many hours of work. Furthermore, I've seen error 16 displayed on the console twice during this time - but everything works stably despite KTR.
(In reply to zjk from comment #21) Ah, OK - now I understand: it's not about the previous patch, but the new one, attached at the beginning after the signature. OK, I'll apply it and let you know.
You know what - unfortunately, the file versions are different. As I mentioned, I have 15.0-RELEASE-p4 (I held off on the latest p5). So one file is OK and patches automatically, but the other one has to be patched by hand. I won't be able to do that today.
That patch applies fine for me on the releng/15.0 branch. What kind of conflict are you getting? And if the printf patch is the problem, then don't bother with it.
(In reply to Alan Somers from comment #24)
uname -a
FreeBSD almach.zjk.pl 15.0-RELEASE-p4 FreeBSD 15.0-RELEASE-p4 ALAN_DEBUG amd64

I ran an mcdiff - it looks like completely different file versions. Mine are from February 11, 2026. I'm updating them with freebsd-update. Maybe I shouldn't wait and apply the p5 patch (I've been holding off on that)? Anyway, I can manually see what I need to change, and it's visible in the code, but in completely different places. However, I need to do it exactly right.

Hmm...  Looks like a unified diff to me...
The text leading up to this was:
--------------------------
| sys/fs/fuse/fuse_node.c  |  6 +-----
| sys/fs/fuse/fuse_vnops.c | 24 +++++++++++++++++-------
| 2 files changed, 18 insertions(+), 12 deletions(-)
|
|diff --git a/sys/fs/fuse/fuse_node.c b/sys/fs/fuse/fuse_node.c
|index f4fb993a7ca1..6768e87fdef1 100644
|--- a/sys/fs/fuse/fuse_node.c
|+++ b/sys/fs/fuse/fuse_node.c
--------------------------
Patching file sys/fs/fuse/fuse_node.c using Plan A...
Hunk #1 failed at 506.
1 out of 1 hunks failed--saving rejects to sys/fs/fuse/fuse_node.c.rej
Hmm...  The next patch looks like a unified diff to me...
The text leading up to this was:
--------------------------
|diff --git a/sys/fs/fuse/fuse_vnops.c b/sys/fs/fuse/fuse_vnops.c
|index ad7a2dcf84ef..dd3cc5f16092 100644
|--- a/sys/fs/fuse/fuse_vnops.c
|+++ b/sys/fs/fuse/fuse_vnops.c
--------------------------
Patching file sys/fs/fuse/fuse_vnops.c using Plan A...
Hunk #1 succeeded at 324 with fuzz 2 (offset -3 lines).
Hunk #2 failed at 337.
Hunk #3 failed at 358.
Hunk #4 failed at 728.
Hunk #5 failed at 1025.
Hunk #6 failed at 2677.
Hunk #7 failed at 3211.
6 out of 7 hunks failed--saving rejects to sys/fs/fuse/fuse_vnops.c.rej
Alan - I've already made the changes and the compilation is ok. But I'll test it tomorrow :)
(In reply to zjk from comment #26) If I've made all the changes correctly (I still need to check), we still get the ALAN_DEBUG error 16 at startup (about 32 instances). I'll start stress testing with rsync this afternoon.
I have one more observation regarding system performance - I always balanced HTTP threads, for example, so that all RAM was used but no swap. Now, without any configuration changes, I see this:

last pid: 20964;  Load averages: 0.40, 0.70, 1.08  up 0+01:04:20  09:29:36
128 processes: 1 running, 127 sleeping
CPU: 1.6% user, 0.0% nice, 0.8% system, 0.8% interrupt, 96.9% idle
Mem: 2162M Active, 2388M Inact, 2595M Wired, 1545M Buf, 8649M Free
ARC:
Swap: 16G Total, 16G Free

So I suddenly have over 8.5 GB of free memory! Question: is this the result of better memory management by FUSE, or is there, for example, a decrease in the number of buffers associated with a degradation in overall system performance? Should we explore a more dynamic way for FUSE to handle request buffers?
OK - I ran a few tests, all with positive results:
1. rsync - passes, although it prints a few error 60 messages
2. csync2 - OK
3. find - passes, with error 60 messages

So far there has been no system crash. Other results: besides the massive 8.5G of free memory, "find" is significantly slower. Scanning from "/" used to take 2-3 minutes; now it takes 12 (but maybe that's a result of KTR still being enabled?). I'll continue testing the system today. Additionally, I have the procstat -kk output for the "frozen" find process. It's stuck in fdisp_wait_answ: the process is constantly sleeping in sleepq_timedwait_sig, waiting for FUSE answers.
Hi Alan, Quick update after 8 days of testing:

Stability: The system is rock solid. Zero panics or reboots on 15.0-RELEASE-p4 under production web traffic. Your patch effectively fixed the immediate crash.

Performance: As mentioned, find and rsync are still significantly slower (12 min vs. the 2 min baseline). Since this is a production machine, I'm staying on this patch for now to avoid crashes, but the performance hit remains. Do you plan a version 2 of the patch? I'm ready to test it.
(In reply to zjk from comment #30) It's good to hear that the patch is working for you. As for performance, this patch should have no effect. If you're noticing a performance difference, it might be because you have INVARIANTS enabled now and you didn't before. When you switched from the GENERIC to the GENERIC-DEBUG kernel, that enabled invariants, which can definitely slow things down. Also, it may be that your moosefs server is more responsive now than it was two weeks ago. I'd really like to commit this in time for 15.1-RELEASE. That means we have about one week. Would you like to try building a GENERIC kernel with the patch to see if performance improves? If not I'll commit as is.
Hi Alan. I'm not an expert, but I just recompiled the kernel with the new options and specifically removed the invariants for fear of performance loss on a production machine. So the configuration looked like this:

cat /usr/src/sys/amd64/conf/ALAN_DEBUG
# Alan's debug configuration for FreeBSD
include GENERIC
ident ALAN_DEBUG

# Debugging options from Alan
options KTR
options KTR_BOOT_ENTRIES=1024
options KTR_ENTRIES=(128*1024)
options KTR_COMPILE=(KTR_ALL)
options KTR_MASK=KTR_BUF
options KTR_CPUMASK=0xFFFF

# Invariants to catch silent errors
#options INVARIANTS
#options INVARIANT_SUPPORT

i.e. only KTR is permanently enabled. I also looked at other results: my lockstat shows that while the system is stable, we have high contention on so_rcv, so_snd, and the CAM device lock. It looks like the patch effectively serialized I/O and network operations to prevent the panic, but it's hitting performance hard on UFS(?). The latency in g_io_request (21k+ nsec) suggests the I/O path is being throttled by these new locking constraints. But you'll have to judge for yourself. I tried it several times, and the rsync time increased from 2 to 12 minutes (more or less). I can of course test this further to confirm, but I think it comes at the expense of speed.
I forgot about KTR. That could definitely affect performance too. I don't use it much, because dtrace is usually superior.
OK - I'll drop the KTR options and run some tests - we'll see what happens.
Hi Alan. A bit rushed, but I did the tests. I reverted to GENERIC with your fix, and I'm still getting "getblk failed" messages. I synced my 40GB drive and several hundred thousand web content files several times. During one sync I received about 10-20 "getblk failed" messages, but of course there was no kernel panic. Regarding the times: after removing KTR, the synchronization time returned to about 2 minutes (it was also about 2 minutes with your fix - almost identical). But keep in mind that my synchronization is between two web servers: a directory on UFS to a directory on UFS. MooseFS is bypassed (excluded from synchronization), yet there are still error 60 messages. I also have such synchronization on MooseFS, but I didn't measure the time for comparison; I only have the rsync time of the backup with hardlinks - in any case, it's several dozen minutes, over an hour (and there are error 16 messages here).
You can remove those debugging error messages, if they're bothering you. There are legitimate reasons for that code path to be reached. I'm glad to hear that the performance is back to normal. Are you satisfied that your original problem is fixed by this patch? If so I'll commit it, and it will be available in FreeBSD 15.1.
Let's briefly summarize:
- FreeBSD 15.0 with MooseFS shares on ZFS (including home, network share)
- random panics throughout the day
- panics forced by a simple find
- and almost 100% of panics forced by rsync or csync2 (even with the MooseFS shares excluded)

After the patch:
- no spontaneous or forced panics
- stable operation
- for operations on UFS, no performance degradation

Concerns:
- for MooseFS shares on ZFS, unfortunately I don't have full results yet. What I have now are the find results on the MooseFS share from identical machines with and without the patch: interestingly, the one with the patch runs about 10-30% faster (the results are surprising, although repeatable). I still need to calmly generate the read, write, and copy results using the MooseFS shares. This suggests there's probably still potential for further improvements in FUSE.

Alan - thanks for the fix, and please consider further improvements to FUSE :) zjk
I have performed a side-by-side performance comparison on FreeBSD 15.0-RELEASE using MooseFS (FUSE). The results clearly show that the patch not only fixes the panics but also significantly improves metadata operation performance.

Test Environment:
OS: FreeBSD 15.0-RELEASE (GENERIC)
Filesystem: MooseFS (via FUSE) with 2x replication on SSD
Test Case: 5,000 small files creation/deletion and 512MB sequential I/O.

Performance Comparison: Alan's Patch vs. Unpatched

Operation            With Patch (Alan)  Unpatched (Baseline)  Improvement
Create 5,000 files   10.27s             24.64s                2.4x faster
Delete 5,000 files   1.67s              11.48s                6.8x faster
Write 512MB (seq)    4.64s              4.19s                 No significant impact
Read 512MB (seq)     4.70s              4.47s                 No significant impact
Kernel CPU (sys)     3.20s              9.06s                 65% less overhead

Maybe I need to do more test runs...

Observations:
Metadata efficiency: The most striking improvement is in file creation and deletion. Without the patch, the system experiences heavy lock contention, evidenced by the high sys time (9.06s). Alan's patch reduces this overhead to 3.20s.
Unlink performance: Deleting files is nearly 7 times faster with the patch.
Throughput: Large-file I/O remains stable with negligible differences, showing the patch doesn't introduce any regressions in data throughput.

The system remains rock-solid during these tests, with no "getblk failed" errors or panics.
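For reference, the small-file create/delete phase can be reproduced with a loop along these lines (directory path and default count are examples, not my exact script; wrap it in time(1) to measure each phase):

```shell
#!/bin/sh
# Approximate small-file metadata benchmark.
# DIR and N are example values, not the exact ones from the runs above.
DIR="${1:-./benchdir}"
N="${2:-5000}"

mkdir -p "$DIR"

# create N empty files (metadata-heavy: one CREATE request per file)
i=0
while [ "$i" -lt "$N" ]; do
	: > "$DIR/f$i"
	i=$((i+1))
done
echo "created $N files"

# delete them again (one UNLINK request per file)
rm -f "$DIR"/f*
rmdir "$DIR"
echo "deleted $N files"
```

Pointing DIR at a MooseFS path exercises the FUSE metadata path; pointing it at UFS gives a local baseline.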
FIO Benchmark Results: Observations from a user's perspective

Hi Alan, I am just a regular user, not a specialist, but I wanted to provide you with some solid data from my MooseFS/FUSE setup. I compared two systems: almach (with your patch, p4) and almbis (unpatched, p2). The results look very promising, especially regarding system stability and latency spikes.

Comparison Table (Patched vs. Unpatched)

Metric                       Patched (almach)  Unpatched (almbis)  Change
Mixed: Read Throughput (BW)  2617 KiB/s        1665 KiB/s          +57.2%
Mixed: Avg Read Latency      12.1 ms           19.1 ms             -36.6%
RandWrite: IOPS (4k)         2665              2080                +28.1%
RandWrite: Max Latency       1.64 s            3.40 s              -51.7%
RandRead: IOPS (4k)          6119              5322                +15.0%
RandRead: 99.99th Latency    35.9 ms           240.1 ms            -85.0%
SeqRead: Max Latency         2.83 s            10.53 s             -73.1%

Simple Observations:
Stability: The most important thing for me is that the massive 10-second hangs are gone. With the patch, the system feels much more "fluid" and consistent.
Speed: I was surprised to see a 57% gain in mixed workloads. It seems the patch helps the system handle reading and writing at the same time much better.
Reliability: No more panics or weird error messages during these heavy tests.

In my humble opinion, this patch is a huge improvement. Even as a non-expert, I can see that there might be even more potential for FUSE optimizations in the future, but this fix is already a great step forward. Thank you for your hard work on this! zjk
System Efficiency & Stability Metrics:

Metric                       Patched (almach)  Unpatched (almbis)  Change
RandWrite: Context Switches  143,017           183,178             -22.0%
RandWrite: Latency stdev     18.2 ms           27.8 ms             -34.5%
SeqWrite: 99.9th Latency     859 ms            1,010 ms            -15.0%

From a user's perspective, the system with the patch feels much more stable and "lighter." The 22% reduction in context switches suggests that the kernel handles I/O operations more efficiently, without wasting resources on lock contention. Most importantly, the dramatic reduction in latency spikes (jitter) means the system no longer "freezes" during heavy workloads, which was the main cause of previous issues. zjk