Bug 248659 - random system freezes
Summary: random system freezes
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: CURRENT
Hardware: Any Any
Importance: --- Affects Only Me
Assignee: freebsd-bugs (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-08-14 18:27 UTC by Dries Michiels
Modified: 2020-09-23 23:19 UTC

See Also:


Description Dries Michiels 2020-08-14 18:27:06 UTC
When doing very low-intensity work (such as checking out the ports tree over Wi-Fi (802.11g) at around 6 MB/s) on my laptop (Lenovo T490) running KDE5, my disk I/O seems to partially stall for a few seconds and then just continues. I have also observed frequent system freezes/deadlocks with no debug info whatsoever, no kernel dump, etc. I'm leaning towards VFS deadlocks, although I'm kind of in the dark on how to proceed further to debug this issue. The reason I suspect the VFS stack is that I also observe the random freezes when not running any GUI, just console-based interaction. Maybe a driver that I'm loading?

See video link for what I'm experiencing (https://youtu.be/1_ll4OBefjo).

I am running 13-CURRENT and have a Samsung NVMe drive (PM981a) with the UFS file system. I have disabled SU journaling, but that doesn't seem to help. I really like the look and feel of KDE5 on my laptop, although it's just not usable in this state, with frequent data loss due to the system freezes and hard resets. TRIM is disabled through tunefs on this drive, as I've read that it could be a cause of the problem I describe. However, none of the filesystem settings seem to help.

I'd very much appreciate someone bearing with me in debugging this issue.
I have also tried disabling all debugging-related features in the kernel that should be disabled in stable branches, but that didn't help either, so I will probably re-enable them to debug.
Comment 1 Dries Michiels 2020-08-14 18:35:17 UTC
Re-uploaded the video in better quality.
Comment 2 Dries Michiels 2020-08-19 07:00:40 UTC
The NVMe hiccups are resolved after disabling TRIM (probably it was being thrashed with TRIM requests).
I still have the random system freezes and have tried to narrow down their scope.
At the moment the system freezes also occur in single-user mode (read-only mount of my rootfs).
Comment 3 rkoberman 2020-09-12 21:09:19 UTC
Seeing the same issue on my Lenovo L15 running current on a Comet Lake CPU. Problem occurs in single-user mode and is mitigated by keyboard activity. This is a regression as the problem does not occur on 12.1-RELEASE-P8. Problem has occurred twice during boot, most recently after lo0 came up but before em0 or rtwn were started. Happens when system is busy (compiles on all cores) or idle. GENERIC-NODEBUG kernel except using SCHED_4BSD. Will shortly build GENERIC-SCHED_4BSD.

System gets very warm after the freeze and then cools, probably due to the firmware slowing the processor. This seems to indicate a possible loop in the kernel.

As long as I keep the keyboard reasonably active, the system does not seem to hang. While several times the freeze occurred when I was not closely watching the system, I am not aware of the system staying up longer than 10 minutes.

When running in graphic mode (X11/MATE), it has frozen when I was typing on at least one occasion. This could have been simple bad luck. Seems somewhat more stable when in graphic mode. 

From dmesg:
FreeBSD 13.0-CURRENT #2 r365481M: Tue Sep  8 20:16:02 PDT 2020
    root@ptavv:/usr/obj/usr/src/amd64.amd64/sys/GENERIC-NODEBUG amd64
FreeBSD clang version 11.0.0 (git@github.com:llvm/llvm-project.git llvmorg-11.0.0-rc2-0-g414f32a9e86)
VT(efifb): resolution 1920x1080
CPU: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz (2112.11-MHz K8-class CPU)
  Origin="GenuineIntel"  Id=0xa0660  Family=0x6  Model=0xa6  Stepping=0
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x7ffafbbf<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,EST,TM2,SSSE3,SDBG,FMA,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,TSCDLT,AESNI,XSAVE,OSXSAVE,AVX,F16C,RDRAND>
  AMD Features=0x2c100800<SYSCALL,NX,Page1GB,RDTSCP,LM>
  AMD Features2=0x121<LAHF,ABM,Prefetch>
  Structured Extended Features=0x29c67af<FSGSBASE,TSCADJ,SGX,BMI1,AVX2,SMEP,BMI2,ERMS,INVPCID,NFPUSG,MPX,RDSEED,ADX,SMAP,CLFLUSHOPT,PROCTRACE>
  Structured Extended Features3=0xbc000400<MD_CLEAR,IBPB,STIBP,L1DFL,ARCH_CAP,SSBD>
  XSAVE Features=0xf<XSAVEOPT,XSAVEC,XINUSE,XSAVES>
  IA32_ARCH_CAPS=0x2b<RDCL_NO,IBRS_ALL,SKIP_L1DFL_VME,MDS_NO>
  VT-x: PAT,HLT,MTF,PAUSE,EPT,UG,VPID
  TSC: P-state invariant, performance statistics
real memory  = 4294967296 (4096 MB)
avail memory = 3746226176 (3572 MB)
Event timer "LAPIC" quality 600
ACPI APIC Table: <LENOVO TP-R17  >
Comment 4 Dries Michiels 2020-09-12 21:24:07 UTC
I can confirm that this summarizes my issue perfectly.
Comment 5 rkoberman 2020-09-17 23:31:54 UTC
I've continued to analyze the problem. I don't know if this will help track it down, but I have noted the following:
System is substantially more stable under X (Mate) than under just VT. Once I start Mate, I often have the system stay up and running for over an hour. When X11 is not running, I have not seen the system stay operational for over 10 minutes when the keyboard is not active. Also, when working on X, I have had the system lock up even when the keyboard is active.

I have not tested the reliability of the system when using a vty while X is also running.

The freeze is not instantaneous. With my system monitor running and a bulk disk-to-disk data transfer running (rsync of 190 GB of media files) from a USB disk to the system disk, I note that the transfer slows dramatically a few seconds before the complete freeze. The write rate slowly declines from over 50 MB/s to zero. When it reaches zero, the system may be barely alive. I have managed to do a sync(1) once or twice, which greatly reduces the number of corrections needed by fsck when I reboot. Even when the keyboard stops responding, my system monitor (gkrellm) will continue to update for a few seconds. On a couple of occasions, Ctrl-C to a frozen process caused the system to become responsive again for a couple of seconds, with the system monitor updating and commands that had been typed but not echoed appearing in terminal windows.
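
One way to capture the slow decline in write rate described above is to log it with timestamps. The following is a minimal Python sketch, not a tool that was used in this report: the function name `timed_write`, the chunk size, and the sampling interval are all illustrative assumptions. It writes a scratch file, forces each chunk to disk with fsync, and prints the throughput once per interval.

```python
import os
import time


def timed_write(path, total_bytes, chunk_bytes=1 << 20, interval=1.0,
                clock=time.monotonic):
    """Write total_bytes of zeros to path; return (elapsed_s, MB_per_s) samples.

    Hypothetical helper for illustration only. `clock` is injectable so the
    sampling logic can be exercised without waiting on a real disk.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    chunk = b"\0" * chunk_bytes
    samples = []
    start = last = clock()
    written = since_last = 0
    with open(path, "wb") as f:
        while written < total_bytes:
            n = f.write(chunk[:total_bytes - written])
            f.flush()
            os.fsync(f.fileno())  # push data to the device, not just the cache
            written += n
            since_last += n
            now = clock()
            if now - last >= interval:
                rate = since_last / (now - last) / 1e6  # MB/s since last sample
                samples.append((now - start, rate))
                # flush=True so each line leaves the process immediately
                print(f"{now - start:8.1f}s {rate:8.2f} MB/s", flush=True)
                last, since_last = now, 0
    return samples
```

A call such as `timed_write("/var/tmp/scratch.bin", 2 * 1024**3)` (the path is only an example) would write 2 GiB. Running it over ssh or a serial console means the last samples printed before a freeze are kept on another machine.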

My system has only 4 GB of RAM, so it is rather restricted. I have not been able to even try running a VM on it. I have 16 GB on order, but it is lost in the mail.

The system has a WD Black 2TB drive. Could a drive issue be at the root? The initial report was also on a fairly recent Lenovo system. Could a bad disk batch be the trigger? But, if it is a disk issue, why no problems on 12.1?
Comment 6 rkoberman 2020-09-23 23:19:13 UTC
I had an epiphany yesterday and may have figured out something that could help track this down. It appears that this is likely tied to the system disk, which is an ATAPI (SATA) drive.

I can keep the system from freezing indefinitely by either typing or moving the cursor. I realized that these two devices (kbd and psm) are, to the best of my knowledge, the last physical devices still GIANT locked and the last ISA devices, as well. It is my suspicion that taking the GIANT lock now and then causes something to clear out that would otherwise eventually livelock the system.

I previously noted that the disk transfer rate would deteriorate over time, leading to a livelock. I can now report that, if I see the transfer rate declining, I can suspend the job and, after a few seconds, the system returns to normal. If I let the problem continue for more than a few seconds, the keyboard will be locked up and the system will be livelocked and require a power down.
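
The suspend-and-resume workaround above could be automated along these lines. This is only a sketch under stated assumptions: the names `is_declining` and `pause_job`, the window size, and the 50% threshold are hypothetical, and SIGSTOP is used as a stand-in for however the job was actually suspended (Ctrl-Z in a shell sends SIGTSTP instead).

```python
import os
import signal
import time


def is_declining(rates, window=5, drop_ratio=0.5):
    """Return True when the mean of the last `window` rate samples has
    fallen below drop_ratio times the highest rate seen so far.

    Illustrative heuristic only; the thresholds are guesses.
    """
    if len(rates) < window:
        return False  # not enough samples to judge
    recent = sum(rates[-window:]) / window
    return recent < drop_ratio * max(rates)


def pause_job(pid, seconds=5.0):
    """Stop process `pid`, wait `seconds`, then let it continue."""
    os.kill(pid, signal.SIGSTOP)
    try:
        time.sleep(seconds)
    finally:
        os.kill(pid, signal.SIGCONT)  # always resume, even if interrupted
```

A monitoring loop would append one rate sample per second (in MB/s, for example) and call `pause_job` on the copy's PID whenever `is_declining` returns True. Whether a few seconds of pause is enough is taken from the observation above, not from any measurement.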

I am unable to un-tar the firefox source tarball or any other large tarball. Even with typing or moving the mouse, the transfer rate will start declining. I can suspend that, but it seems to start declining again very soon when the tar is resumed, and I was unable to complete the restore. I have seen similar behavior with other large tarballs. Oddly, when copying from USB disk to system disk, I see similar issues, but a suspend seems to allow them to return to full speed for a while upon resume. Copies from the system disk to the system disk seem to be the worst problem.

It also appears that an inactive system usually does not lock up. I can boot to the single-user or the login prompt and the system will remain usable for a long time. It does seem to eventually lock up, but it can take hours. It had never locked until I left it at the single-user prompt after it finished fscking the system. When I got back to the system today, after about 13 hours, it was frozen.

Any suggestions on tracking this down would really be appreciated as this system is replacing an old one which has a failing fan and may become useless at any time.