Bug 236961

Summary: [VFS] vfs_bio_getpages: infinite loop
Product: Base System
Reporter: Alexandre martins <alexandre.martins>
Component: kern
Assignee: freebsd-bugs (Nobody) <bugs>
Status: Closed DUPLICATE    
Severity: Affects Only Me
CC: alexandre.martins, markj, ota, pho
Priority: ---    
Version: 12.0-STABLE   
Hardware: amd64   
OS: Any   

Description Alexandre martins 2019-04-02 12:59:58 UTC
Hi,

I am currently having trouble with an amd64 HEAD build machine.

This machine cross-compiles for an ARMv6 host.

During compilation, objcopy enters an infinite loop. The process is stuck (unkillable) in the "RUN" state:

@@
  PID   %CPU %MEM     VSZ    RSS TT  STAT STARTED       TIME COMMAND
93219   99.5  0.1   12844   3580  -  R    11:46    184:34.58 objcopy -j .peh ...
@@

The only thing I can get for now is the kernel backtrace via "procstat -kk 93219", which I run in a loop.

Here is some of the output:
@@
__lockmgr_args+0x62a getblkx+0x154 breadn_flags+0x3d vfs_bio_getpages+0x323 ffs_getpages+0x78 VOP_GETPAGES_APV+0x56 ...
        gbincore+0x38 getblkx+0xab breadn_flags+0x3d vfs_bio_getpages+0x323 ffs_getpages+0x78 VOP_GETPAGES_APV+0x56 ...
                                  breadn_flags+0x1e9 vfs_bio_getpages+0x323 ffs_getpages+0x78 VOP_GETPAGES_APV+0x56 ...
                  __lockmgr_args+0x672 binsfree+0x51 vfs_bio_getpages+0x386 ffs_getpages+0x78 VOP_GETPAGES_APV+0x56 ...
                                   vm_page_grab+0x6b vfs_bio_getpages+0x4ac ffs_getpages+0x78 VOP_GETPAGES_APV+0x56 ...
                                                                            ffs_getpages+0x78 VOP_GETPAGES_APV+0x56 ...
@@

The "common" part is vfs_bio_getpages, which seems to loop endlessly.
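
To make the "common part" observation easier to check, here is a minimal sketch that counts how many of the stack samples above contain each function. The helper names (frame_names, frame_frequency) are hypothetical, and the input is simply the six lines pasted above with their alignment padding removed.

```python
from collections import Counter

# The six "procstat -kk" samples pasted above, alignment padding removed.
SAMPLES = [
    "__lockmgr_args+0x62a getblkx+0x154 breadn_flags+0x3d vfs_bio_getpages+0x323 ffs_getpages+0x78 VOP_GETPAGES_APV+0x56 ...",
    "gbincore+0x38 getblkx+0xab breadn_flags+0x3d vfs_bio_getpages+0x323 ffs_getpages+0x78 VOP_GETPAGES_APV+0x56 ...",
    "breadn_flags+0x1e9 vfs_bio_getpages+0x323 ffs_getpages+0x78 VOP_GETPAGES_APV+0x56 ...",
    "__lockmgr_args+0x672 binsfree+0x51 vfs_bio_getpages+0x386 ffs_getpages+0x78 VOP_GETPAGES_APV+0x56 ...",
    "vm_page_grab+0x6b vfs_bio_getpages+0x4ac ffs_getpages+0x78 VOP_GETPAGES_APV+0x56 ...",
    "ffs_getpages+0x78 VOP_GETPAGES_APV+0x56 ...",
]


def frame_names(sample):
    """Return the function names in one sample, dropping '+0x...' offsets."""
    return [tok.split("+", 1)[0] for tok in sample.split() if tok != "..."]


def frame_frequency(samples):
    """Count, for each function, how many samples contain it at least once."""
    counts = Counter()
    for sample in samples:
        counts.update(set(frame_names(sample)))
    return counts


if __name__ == "__main__":
    total = len(SAMPLES)
    for name, n in frame_frequency(SAMPLES).most_common():
        print(f"{name:20s} {n}/{total}")
```

Functions that appear in nearly every sample while the frames above them keep changing are the likely location of the loop.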

What can I do to provide more information about this issue?

Best regards

Alexandre
Comment 1 Alexandre martins 2019-04-02 13:33:15 UTC
Some additional info:
 - I'm running on UFS
 - I'm running with asynchronous mounts
 - The machine has 12 CPUs
Comment 2 Alexandre martins 2019-04-15 13:57:50 UTC
I updated the issue and changed the affected version: 12-STABLE has the same problem.
Comment 3 Alexandre martins 2019-04-16 07:43:35 UTC
All file systems are OK. I have to run fsck manually each time, because the tool reports the error "PARTIALLY TRUNCATED INODE" and is unable to recover from it.


# mount
/dev/ufs/root on / (ufs, local, noatime)
devfs on /dev (devfs, local, multilabel)
/dev/ufs/var on /var (ufs, local, noatime)
/dev/ufs/tmp on /tmp (ufs, asynchronous, local, noatime)
/dev/ufs/usr on /usr (ufs, asynchronous, local, noatime)
/dev/ufs/home on /home (ufs, asynchronous, local, noatime)

# tunefs -p /dev/ufs/root (all file systems are the same)
Password:
tunefs: POSIX.1e ACLs: (-a)                                disabled
tunefs: NFSv4 ACLs: (-N)                                   disabled
tunefs: MAC multilabel: (-l)                               disabled
tunefs: soft updates: (-n)                                 disabled
tunefs: soft update journaling: (-j)                       disabled
tunefs: gjournal: (-J)                                     disabled
tunefs: trim: (-t)                                         disabled
tunefs: maximum blocks per file in a cylinder group: (-e)  4096
tunefs: average file size: (-f)                            16384
tunefs: average number of files in a directory: (-s)       64
tunefs: minimum percentage of free space: (-m)             8%
tunefs: space to hold for metadata blocks: (-k)            5240
tunefs: optimization preference: (-o)                      time
tunefs: volume label: (-L)                                 root
Comment 4 Peter Holm freebsd_committer freebsd_triage 2019-04-19 06:32:49 UTC
I have not yet been able to reproduce the problem.
I have a core file from Alexandre's host:
https://people.freebsd.org/~pho/bug236961.12.0-STABLE.coredump.txz
Comment 5 Alexandre martins 2019-04-19 15:40:00 UTC
After some investigation, it seems that the condition "if (ma[i]->valid != VM_PAGE_BITS_ALL)" (in vfs_bio_getpages) is always true in my case.
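
As a rough illustration of why a validity check that is always true would show up as a process spinning in the kernel, here is a toy model. It assumes (this is not confirmed in the report) that the function re-reads and retries whenever any requested page is not fully valid. This is not the kernel code: get_pages, read_page and max_retries are hypothetical, and the value given to VM_PAGE_BITS_ALL is a stand-in.

```python
# Stand-in value for illustration only; the real constant depends on the
# kernel's page and block sizes.
VM_PAGE_BITS_ALL = 0xFF


def get_pages(pages, read_page, max_retries=1000):
    """Retry until every page is fully valid.

    Returns the number of passes needed, or None if max_retries passes were
    not enough. A loop without such a bound would never return in that case.
    """
    for attempt in range(1, max_retries + 1):
        # Re-read every page that is not fully valid.
        for i, valid in enumerate(pages):
            if valid != VM_PAGE_BITS_ALL:
                pages[i] = read_page(i)
        # Recheck after the reads: any partially valid page forces a retry.
        if all(valid == VM_PAGE_BITS_ALL for valid in pages):
            return attempt
    return None


if __name__ == "__main__":
    # Healthy case: each read makes the page fully valid.
    print(get_pages([0, 0, 0], lambda i: VM_PAGE_BITS_ALL))
    # Stuck case: page 2 never becomes fully valid, so every pass retries.
    print(get_pages([0, 0, 0],
                    lambda i: 0x0F if i == 2 else VM_PAGE_BITS_ALL,
                    max_retries=5))
```

In the model, a single page that never reaches full validity is enough to keep the whole request retrying.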
Comment 6 Alexandre martins 2019-04-24 14:19:26 UTC
Hello,

The problem disappears when I put the /tmp folder (via a symlink) on the same partition as /home (where the build runs).

To recap my disk configuration:
 - the build (source + objects) runs on /home partition
 - /tmp is on the same disk as /home, but placed before it (/tmp is faster than /home)
 - Both /home and /tmp are "async + noatime"
 - I use ccache (but it does not seem relevant)
 - Swap is not the problem (the freeze still occurs when I disable it)
 - When /tmp is a symlink to a folder in /home, the problem disappears.
Comment 7 ota 2019-05-29 06:13:45 UTC
What does "swapctl -l" show?
"systat -swap" also helps to monitor swap page usage.
Comment 8 Mark Johnston freebsd_committer freebsd_triage 2020-04-03 13:44:03 UTC
We believe this will be fixed by r359464.

*** This bug has been marked as a duplicate of bug 242626 ***