Hello everyone, I have a server running on FreeBSD 13.0, the server randomly freeze after some days or a month. Here is the phenomenon: 1. Unable to connect to the ssh, when input ssh command, no any response. 2. Alot of services can not be visit, some simple service like static nginx page can be opened in a short time, but if you refreshed page some times, the page will be stuck, and have no response, some other services is the same. 3. Another server has a always logged ssh to FreeBSD Server, and opened a top command, when FreeBSD freeze, this ssh can still visit and top command can refresh and output system status,the memory is normal, cpu usage is normal, ZFS ARC is normal, swap is normal, clock is normal, looks like anything is normal. Any hot key for top can use, but when press q to quit top, and type other command, like "systat -ifstat", the command stuck, no any output, Ctrl + Z or C no response. 4. Ping server always normal. 5. The redis-server on freebsd is normal, because redis service can response and very good. 6. Unable to login from console, when type username and password, press enter, no any output. Environment: FreeBSD 13.0,Intel Xeon 4Core + 16GB Memory, Two 2T Disk, ZFS Mirror, Root on ZFS. It's a new machine, it's been less than half a year since we bought it. Main system only running sshguard+ipfw, mount a nfs and use nullfs to a jail, jail file system running on zfs dataset clone, services all running in this jail. Server has two bge network interface, one for lan, one for wan, the services is network heavy service. In jail, running nginx, php-fpm, php cli server, mysql, redis-server, there is alot of nfs write, read by php. Some try: At first it was suspected to be a ZFS ARC problem, and I set arc max to 2G, but in top ARC is very normal.. When look at dmesg, or any log by system or services, every log stopped record when system freeze, means there is no any abnormal log.. but looks like some service that no need read or write file is normal. Some try 2: Before configure kern.ipc.somaxconn, when system hang, I can't login system, can't do any operate. But after change somaxconn, when system hang (worker processes freeze), I can still login to system, and do some operate that not touch the NFS mountpoint, I found that freeze because of nfs mount is dead. I found some same problems: https://emby.media/community/index.php?/topic/74175-freebsd-jail-with-nfsv4-share-causes-system-to-hang/ https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=251347 https://redmine.ixsystems.com/issues/2068 Some try 3: I have move my program that read/write nfs out of jail, add intr to nfs mount options, more system crash happened, and every crash stack trace is different, sometimes is arc_write, or nfscl, sometime is zio_execute, I now suspect that the possibility of hardware failure is quite high. - uname -a FreeBSD ppbsd 13.0-RELEASE-p4 FreeBSD 13.0-RELEASE-p4 #0: Tue Aug 24 07:33:27 UTC 2021 root@amd64-builder.daemonology.net:/usr/obj/usr/src/amd64.amd64/sys/GENERIC amd64 - loader.conf kern.geom.label.disk_ident.enable="0" kern.geom.label.gptid.enable="0" cryptodev_load="YES" zfs_load="YES" coretemp_load="YES" net.inet.ip.fw.default_to_accept=1 vfs.zfs.arc_max="2G" # Increase dmesg buffer to fit longer boot output. kern.msgbufsize="524288" - sysctl.conf # $FreeBSD$ # # This file is read when going to multi-user and its contents piped thru # ``sysctl'' to adjust kernel values. ``man 5 sysctl.conf'' for details. # # Uncomment this to prevent users from seeing information about processes that # are being run under another UID. #security.bsd.see_other_uids=0 #vfs.zfs.min_auto_ashift=12 kern.ipc.somaxconn=4096 Here is the crash log 1: Dec 17 00:47:02 ppbsd kernel: Fatal trap 12: page fault while in kernel mode Dec 17 00:47:02 ppbsd kernel: cpuid = 2; apic id = 04 Dec 17 00:47:02 ppbsd kernel: fault virtual address = 0x28 Dec 17 00:47:02 ppbsd kernel: fault code = supervisor read data, page not present Dec 17 00:47:02 ppbsd kernel: instruction pointer = 0x20:0xffffffff821495f8 Dec 17 00:47:02 ppbsd kernel: stack pointer = 0x0:0xfffffe010fae48d0 Dec 17 00:47:02 ppbsd kernel: frame pointer = 0x0:0xfffffe010fae48d0 Dec 17 00:47:02 ppbsd kernel: code segment = base rx0, limit 0xfffff, type 0x1b Dec 17 00:47:02 ppbsd kernel: = DPL 0, pres 1, long 1, def32 0, gran 1 Dec 17 00:47:02 ppbsd kernel: processor eflags = interrupt enabled, resume, IOPL = 0 Dec 17 00:47:02 ppbsd kernel: current process = 0 (z_wr_int_3) Dec 17 00:47:02 ppbsd kernel: trap number = 12 Dec 17 00:47:02 ppbsd kernel: panic: page fault Dec 17 00:47:02 ppbsd kernel: cpuid = 2 Dec 17 00:47:02 ppbsd kernel: time = 1639673010 Dec 17 00:47:02 ppbsd kernel: KDB: stack backtrace: Dec 17 00:47:02 ppbsd kernel: #0 0xffffffff80c574c5 at kdb_backtrace+0x65 Dec 17 00:47:02 ppbsd kernel: #1 0xffffffff80c09ea1 at vpanic+0x181 Dec 17 00:47:02 ppbsd kernel: #2 0xffffffff80c09d13 at panic+0x43 Dec 17 00:47:02 ppbsd kernel: #3 0xffffffff8108b1b7 at trap_fatal+0x387 Dec 17 00:47:02 ppbsd kernel: #4 0xffffffff8108b20f at trap_pfault+0x4f Dec 17 00:47:02 ppbsd kernel: #5 0xffffffff8108a86d at trap+0x27d Dec 17 00:47:02 ppbsd kernel: #6 0xffffffff81061958 at calltrap+0x8 Dec 17 00:47:02 ppbsd kernel: #7 0xffffffff821a4d3e at dbuf_write_done+0x9e Dec 17 00:47:02 ppbsd kernel: #8 0xffffffff82190c5c at arc_write_done+0x33c Dec 17 00:47:02 ppbsd kernel: #9 0xffffffff822f920d at zio_done+0xd9d Dec 17 00:47:02 ppbsd kernel: #10 0xffffffff822f2d5c at zio_execute+0x3c Dec 17 00:47:02 ppbsd kernel: #11 0xffffffff80c6b161 at taskqueue_run_locked+0x181 Dec 17 00:47:02 ppbsd kernel: #12 0xffffffff80c6c47c at taskqueue_thread_loop+0xac Dec 17 00:47:02 ppbsd kernel: #13 0xffffffff80bc7dde at fork_exit+0x7e Dec 17 00:47:02 ppbsd kernel: #14 0xffffffff810629de at fork_trampoline+0xe Dec 17 00:47:02 ppbsd kernel: Uptime: 1h44m58s Dec 17 00:47:02 ppbsd kernel: Dumping 2511 out of 16190 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%---<<BOOT>>--- Crash log 2: Dec 21 14:09:50 ppbsd syslogd: kernel boot file is /boot/kernel/kernel Dec 21 14:09:50 ppbsd kernel: Dec 21 14:09:50 ppbsd syslogd: last message repeated 1 times Dec 21 14:09:50 ppbsd kernel: Fatal trap 12: page fault while in kernel mode Dec 21 14:09:50 ppbsd kernel: cpuid = 2; apic id = 04 Dec 21 14:09:50 ppbsd kernel: fault virtual address = 0x0 Dec 21 14:09:50 ppbsd kernel: fault code = supervisor write data, page not present Dec 21 14:09:50 ppbsd kernel: instruction pointer = 0x20:0xffffffff80ac9e26 Dec 21 14:09:50 ppbsd kernel: stack pointer = 0x28:0xfffffe011bf165b0 Dec 21 14:09:50 ppbsd kernel: frame pointer = 0x28:0xfffffe011bf165f0 Dec 21 14:09:50 ppbsd kernel: code segment = base rx0, limit 0xfffff, type 0x1b Dec 21 14:09:50 ppbsd kernel: = DPL 0, pres 1, long 1, def32 0, gran 1 Dec 21 14:09:50 ppbsd kernel: processor eflags = interrupt enabled, resume, IOPL = 0 Dec 21 14:09:50 ppbsd kernel: current process = 4541 (newnfs 3) Dec 21 14:09:50 ppbsd kernel: trap number = 12 Dec 21 14:09:50 ppbsd kernel: panic: page fault Dec 21 14:09:50 ppbsd kernel: cpuid = 2 Dec 21 14:09:50 ppbsd kernel: time = 1640066765 Dec 21 14:09:50 ppbsd kernel: KDB: stack backtrace: Dec 21 14:09:50 ppbsd kernel: #0 0xffffffff80c574c5 at kdb_backtrace+0x65 Dec 21 14:09:50 ppbsd kernel: #1 0xffffffff80c09ea1 at vpanic+0x181 Dec 21 14:09:50 ppbsd kernel: #2 0xffffffff80c09d13 at panic+0x43 Dec 21 14:09:50 ppbsd kernel: #3 0xffffffff8108b1b7 at trap_fatal+0x387 Dec 21 14:09:50 ppbsd kernel: #4 0xffffffff8108b20f at trap_pfault+0x4f Dec 21 14:09:50 ppbsd kernel: #5 0xffffffff8108a86d at trap+0x27d Dec 21 14:09:50 ppbsd kernel: #6 0xffffffff81061958 at calltrap+0x8 Dec 21 14:09:50 ppbsd kernel: #7 0xffffffff80acc5d9 at nfscl_hasexpired+0x709 Dec 21 14:09:50 ppbsd kernel: #8 0xffffffff80add066 at nfsrpc_read+0x316 Dec 21 14:09:50 ppbsd kernel: #9 0xffffffff80aee349 at ncl_readrpc+0x89 Dec 21 14:09:50 ppbsd kernel: #10 0xffffffff80b01443 at ncl_doio+0xe3 Dec 21 14:09:50 ppbsd kernel: #11 0xffffffff80b03b32 at nfssvc_iod+0x232 Dec 21 14:09:50 ppbsd kernel: #12 0xffffffff80bc7dde at fork_exit+0x7e Dec 21 14:09:50 ppbsd kernel: #13 0xffffffff810629de at fork_trampoline+0xe Dec 21 14:09:50 ppbsd kernel: Uptime: 2d17h24m52s Dec 21 14:09:50 ppbsd kernel: Dumping 3243 out of 16190 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%---<<BOOT>>--- Crash log 3: Dec 22 16:30:19 ppbsd syslogd: kernel boot file is /boot/kernel/kernel Dec 22 16:30:19 ppbsd kernel: Dec 22 16:30:19 ppbsd syslogd: last message repeated 1 times Dec 22 16:30:19 ppbsd kernel: Fatal trap 12: page fault while in kernel mode Dec 22 16:30:19 ppbsd kernel: cpuid = 2; apic id = 04 Dec 22 16:30:19 ppbsd kernel: fault virtual address = 0x2000 Dec 22 16:30:19 ppbsd kernel: fault code = supervisor read instruction, page not present Dec 22 16:30:19 ppbsd kernel: instruction pointer = 0x20:0x2000 Dec 22 16:30:19 ppbsd kernel: stack pointer = 0x28:0xfffffe010f9a4998 Dec 22 16:30:19 ppbsd kernel: frame pointer = 0x28:0xfffffe010f9a49d0 Dec 22 16:30:19 ppbsd kernel: code segment = base rx0, limit 0xfffff, type 0x1b Dec 22 16:30:19 ppbsd kernel: = DPL 0, pres 1, long 1, def32 0, gran 1 Dec 22 16:30:19 ppbsd kernel: processor eflags = interrupt enabled, resume, IOPL = 0 Dec 22 16:30:19 ppbsd kernel: current process = 0 (z_wr_int_2) Dec 22 16:30:19 ppbsd kernel: trap number = 12 Dec 22 16:30:19 ppbsd kernel: panic: page fault Dec 22 16:30:19 ppbsd kernel: cpuid = 2 Dec 22 16:30:19 ppbsd kernel: time = 1640161595 Dec 22 16:30:19 ppbsd kernel: KDB: stack backtrace: Dec 22 16:30:19 ppbsd kernel: #0 0xffffffff80c574c5 at kdb_backtrace+0x65 Dec 22 16:30:19 ppbsd kernel: #1 0xffffffff80c09ea1 at vpanic+0x181 Dec 22 16:30:19 ppbsd kernel: #2 0xffffffff80c09d13 at panic+0x43 Dec 22 16:30:19 ppbsd kernel: #3 0xffffffff8108b1b7 at trap_fatal+0x387 Dec 22 16:30:19 ppbsd kernel: #4 0xffffffff8108b20f at trap_pfault+0x4f Dec 22 16:30:19 ppbsd kernel: #5 0xffffffff8108a86d at trap+0x27d Dec 22 16:30:19 ppbsd kernel: #6 0xffffffff81061958 at calltrap+0x8 Dec 22 16:30:19 ppbsd kernel: #7 0xffffffff822e2d5c at zio_execute+0x3c Dec 22 16:30:19 ppbsd kernel: #8 0xffffffff80c6b161 at taskqueue_run_locked+0x181 Dec 22 16:30:19 ppbsd kernel: #9 0xffffffff80c6c47c at taskqueue_thread_loop+0xac Dec 22 16:30:19 ppbsd kernel: #10 0xffffffff80bc7dde at fork_exit+0x7e Dec 22 16:30:19 ppbsd kernel: #11 0xffffffff810629de at fork_trampoline+0xe Dec 22 16:30:19 ppbsd kernel: Uptime: 23h2m59s Dec 22 16:30:19 ppbsd kernel: Dumping 3138 out of 16190 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%---<<BOOT>>---
@Reporter Please include additional information: - /var/run/dmesg.boot (as an attachment) - pciconf -lv output (as an attachment)
I'll make a few random comments. 1 - Since the crashes occur at different places, I agree with you that it might be a hardware problem. It would be nice if you could set up another machine with exactly the same software/usage and see if it crashes as well. 2 - You should never use "intr" nor "soft" on NFSv4 mounts. This is mentioned in the BUGS section at the end of "man mount_nfs". To get rid of a hung NFS mount, use "umount -N <mnt_path>. (It can take a couple of minutes, but normally succeeds. Note that any file writing that was in progress when you do this will get lost.) 3 - If you can still log in and do these when "hung", capture the output of: # ps axHl # procstat -a -kk # netstat -a # ping <nfs-server> 4 - nfscl_hasexpired() only gets called when the NFS client does not get a response from the server for minutes and then gets a NFSERR_EXPIRED reply from the server. You did not mention what NFS server you are using. If it is a FreeBSD 13.0 server, then see PR#256280. If you get the output from #3, please post it here.
Created attachment 230434 [details] dmesg.boot file
Created attachment 230435 [details] pciconf output
(In reply to Kubilay Kocak from comment #1) Thank, I have add the attachments of those two files.
(In reply to Rick Macklem from comment #2) NFS Server is two linux servers, one is an old serve running almost 10 years, another is a new server we set recently. I think the NFS servers are normal because of there is about 6 linux servers mount these NFS servers and running same program, and they did not freeze or crash. It just so happened that the system crashed today, I will add some output file attachments as you said.
Created attachment 230436 [details] ps axHl
Created attachment 230437 [details] procstat -a -kk
Created attachment 230438 [details] netstat -a
Would it be possible for you to upgrade your kernel from the sources for stable/13? There is a patch in stable/13 that fixed a problem where open/create would defer to an exclusive lock request and cause a deadlock. It is commit 701eb03cc0dc in stable/13, but I do not know if it will apply to 13.0 kernel sources cleanly? I thought this could only occur when delegations were being issued by the server and, since you are not running nfscbd(8), delegations should never be issued to the client, even for non-FreeBSD servers. You never mentioned what NFS server you are using. If it is a FreeBSD one, make sure delegations are not enabled. The sysctl vfs.nfsd.enable_delegations should be set to 0 on the server.
Ok, it does look like the commit in stable/13 would fix this hang. It happens because read calls nfscl_hasexpired(), which tries to acquire the exclusive lock (similar to delegation return cases). However, calling nfscl_hasexpired() should *almost never* happen. It happens when the client has been partitioned from the NFSv4 server for at least a minute. For the FreeBSD NFSv4 server (which is what is called a courteous server), the expired only happens when a conflicting open/lock request is done by another client or when open/lock resources become exhausted. When recovery from "expired" is done, all byte range locks are lost, so getting the client/server into this state is to be avoiding if at all possible. If your NFSv4 server is a Linux one, expired will happen when the client is network partitioned from the server for over 60sec. In other words, I think you have some sort of network connectivity problem to the NFS server. As an alternative to upgrading to stable/13, you could switch to using NFSv3 mounts to avoid the hang.
Oh, and try disabling TSO. It's often broken for various net drivers/chips.
(In reply to Rick Macklem from comment #11) Thank you, I 'm now trying mount nfs as nfsv3. One of server mount as nfsv3 is ok. Another one report: RPCPROG_NFS: RPC: Program not registered Do showmount to that server: showmount -e 192.168.1.2 RPC: Program not registered showmount: can't do exports rpc However, this nfs is not strongly dependent, so I don't need to mount it for now, continue to observe for a while.
If you are going to do an NFSv3 mount, the NFS server must be configured for that. I can't comment further, since I do not know what kind of NFS server you are using. For the client, rpcbind must be running for NFSv3 mounts.
(In reply to Rick Macklem from comment #10) It's been 28 days ago after I switch to NFSv3, everything's normal. Server is very stable. I think that is the problem.
It's been almost 49 days ago with NFSv3, everything's normal.
Since I believe this is fixed in the 13.1 release, close the PR.