Status: Closed Feedback Timeout
Product: Base System
Component: bin
Version: 12.1-RELEASE
Hardware: amd64 Any
Assignee: Jochen Neumeister
Reported: 2019-12-13 19:34 UTC by Andris Vasers
Modified: 2020-01-20 10:07 UTC
Description Andris Vasers 2019-12-13 19:34:56 UTC
Created attachment 209918
top output showing nginx at 100% CPU usage

Hello. I have been using FreeBSD as a production environment for many years.
I upgraded a web server to 12.1-RELEASE and many problems began. This is the worst BSD release I have used since I started with 4.11. Very unstable.
I wanted to migrate from FreeBSD 11.1 to 12.1 with exactly the same configuration, and now nginx gets stuck and I'm unable to kill the process; only a restart helps.
It seems that nginx causes a kernel/process panic. After init 6 or reboot, the system sometimes hangs and I need to hard-restart the whole virtual machine.

I'm using a VMware vSphere 6.0 environment with open-vm-tools-nox11 installed:
16 GB of RAM, 2 x 4 CPU cores.
em and vmx network interfaces.

I tried several nginx configurations, with no success.
I checked the nginx error logs: no error output, everything seems fine.

nginx is configured with 8 worker processes for the 8 cores, and after 8 or 10 hours a few cores get stuck on nginx until the nginx daemon crashes.

I'm unable to kill the process; even kill -9 doesn't help.

I tried building a kernel from the latest sources, without any success.

So I'm thinking of going back to the 11.x release. I don't know why this release was shipped. I think release approval is getting lazier about catching bugs.
I'm using the following packages:

FreeBSD web1.vilnis.dc.local 12.1-RELEASE FreeBSD 12.1-RELEASE r354233 GENERIC  amd64

I also tried upgrading to the latest nginx-devel-1.17.6_1, with the same result.

I'm against using 12.1-RELEASE in production: even though it is classified as stable, it's unstable.

Lots of errors in dmesg when running in the VM:
Comment 1 Mark Johnston freebsd_committer 2019-12-13 19:37:16 UTC
Can you grab procstat -kk output from some of the stuck processes?
Comment 2 Andris Vasers 2019-12-13 19:40:11 UTC
  PID    TID COMM                TDNAME              KSTACK
 1106 100280 nginx               -                   vfs_bio_getpages ncl_getpages VOP_GETPAGES_APV vop_stdgetpages_async VOP_GETPAGES_ASYNC_APV vnode_pager_getpages_async vn_sendfile sendfile amd64_syscall fast_syscall_common
Comment 3 Andris Vasers 2019-12-13 19:48:04 UTC
PID    TID COMM                TDNAME              KSTACK
 1106 100280 nginx               -                   vfs_bio_getpages+0x1d9 ncl_getpages+0x2be VOP_GETPAGES_APV+0x7c vop_stdgetpages_async+0x49 VOP_GETPAGES_ASYNC_APV+0x7c vnode_pager_getpages_async+0x7e vn_sendfile+0xd9c sendfile+0x12b amd64_syscall+0x364 fast_syscall_common+0x101
Comment 4 Andris Vasers 2019-12-13 20:03:53 UTC
Now 2 cores are stuck:
root@web1:~ # procstat -kk 1106
  PID    TID COMM                TDNAME              KSTACK                     
 1106 100280 nginx               -                   vfs_bio_getpages+0x1d9 ncl_getpages+0x2be VOP_GETPAGES_APV+0x7c vop_stdgetpages_async+0x49 VOP_GETPAGES_ASYNC_APV+0x7c vnode_pager_getpages_async+0x7e vn_sendfile+0xd9c sendfile+0x12b amd64_syscall+0x364 fast_syscall_common+0x101
root@web1:~ # procstat -kk 1104
  PID    TID COMM                TDNAME              KSTACK                     
 1104 100136 nginx               -                   vfs_bio_getpages+0x1d9 ncl_getpages+0x2be VOP_GETPAGES_APV+0x7c vop_stdgetpages_async+0x49 VOP_GETPAGES_ASYNC_APV+0x7c vnode_pager_getpages_async+0x7e vn_sendfile+0xd9c sendfile+0x12b amd64_syscall+0x364 fast_syscall_common+0x101
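To collect the same traces from every nginx worker at once and pick out the threads sitting in the NFS buffer pager, a small filter like the one below could help. It is a sketch only: the helper name `stuck_in_buf_pager` is hypothetical, and it assumes the procstat -kk column layout shown above (PID first, then TID) and uses the `vfs_bio_getpages` frame from these traces as the marker.

```shell
#!/bin/sh
# Hypothetical helper: read "procstat -kk" output on stdin and print
# "PID TID" for every thread whose kernel stack contains vfs_bio_getpages.
# The header line is skipped because its first field ("PID") is not numeric.
stuck_in_buf_pager() {
    awk '$1 ~ /^[0-9]+$/ && /vfs_bio_getpages/ { print $1, $2 }'
}

# Live use (as root, on the affected host); procstat accepts several PIDs:
#   procstat -kk $(pgrep nginx) | stuck_in_buf_pager
```

Running it periodically would show whether the stuck workers are always in the NFS getpages path or, as later comments suggest, in other states as well.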
Comment 5 Andris Vasers 2019-12-13 20:05:04 UTC
Created attachment 209920
top process list
Comment 6 Andris Vasers 2019-12-13 20:17:29 UTC
Created attachment 209923
Stuck for 2 minutes after init 6
Comment 7 Andris Vasers 2019-12-13 20:19:16 UTC
Created attachment 209924
stuck on sync

The process is still stuck after 5 minutes; I need to reboot the server manually.
Comment 8 Andris Vasers 2019-12-13 20:20:20 UTC
Sorry, I forgot to post the vmstat output.
Comment 9 Mark Johnston freebsd_committer 2019-12-13 20:22:39 UTC
As a temporary workaround you can try setting vfs.nfs.use_buf_pager=0, but vfs_bio_getpages() has been the default for some time now.

Are the files that are being served also being modified or truncated?
Comment 10 Andris Vasers 2019-12-13 20:27:11 UTC
The files seem to be OK.
This server is the frontend; files are probably not being written there, apart from log writes.
Comment 11 Konstantin Belousov freebsd_committer 2019-12-13 22:05:51 UTC
Try installing latest stable/12 kernel (no need to rebuild world).  If there is indeed a truncation in parallel with the mapping operation, then I have an expectation that it is fixed.
Comment 12 Andris Vasers 2019-12-13 23:12:07 UTC
OK, the new kernel is built; I'll test it and report any issues that occur:
12.1-STABLE FreeBSD 12.1-STABLE r355737 amd64
Comment 13 Andris Vasers 2019-12-14 12:03:22 UTC
Created attachment 209933
Problem still exists: process stuck on a core with the latest stable kernel
Comment 14 Andris Vasers 2019-12-14 12:05:50 UTC
No success, nginx is still eating CPU time.
The process is stuck, and only a hard reset of the whole VM helped to restart nginx.
Any other ideas?
Comment 15 Jochen Neumeister freebsd_committer 2019-12-14 12:32:00 UTC
It could be the same problem as here: 
Comment 16 Konstantin Belousov freebsd_committer 2019-12-14 17:22:20 UTC
(In reply to Andris Vasers from comment #14)
I suppose you either built the kernel with debug symbols and DDB, or can rebuild it.
After that, please dump core by entering DDB:
# sysctl debug.kdb.enter=1
db> dump
.... output ...
db> c
After that, run savecore(8) and provide me with kernel.full and vmcore.
Also tell me the PID of the stuck nginx.

Do not put the core file onto a public location, it might contain sensitive info.
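The dump procedure above needs a dump device to be configured before `db> dump` is entered, and afterwards the right vmcore has to be located. The sketch below covers both steps under assumptions not stated in the thread: the kernel has KDB/DDB compiled in, a swap partition can serve as the dump device, and savecore writes to the default /var/crash. The function names are hypothetical; kernel.full itself comes from the kernel build's object directory.

```shell
#!/bin/sh
# Hypothetical one-time setup (as root): let rc(8) pick the dump device
# ("AUTO" selects the first swap device) and activate it now.
setup_dump_device() {
    sysrc dumpdev="AUTO"
    service dumpon start
}

# After "db> dump" and "db> c", copy the dump off the device as root,
# e.g. "savecore /var/crash" (savecore also runs at boot).
# This helper prints the newest vmcore.N in the crash directory, i.e. the
# file to send together with kernel.full.
latest_vmcore() {
    dir=${1:-/var/crash}
    ls "$dir" | sed -n 's/^vmcore\.\([0-9][0-9]*\)$/\1/p' |
        sort -n | tail -n 1 | sed "s|^|$dir/vmcore.|"
}
```

As the comment says, the vmcore may contain sensitive data, so it should be shared privately rather than attached here.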
Comment 17 Ed Maste freebsd_committer 2019-12-14 18:01:58 UTC
(In reply to Jochen Neumeister from comment #15)
> It could be the same problem as here: 
> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=242626

It seems you pasted a link to this bug in your comment.
Comment 18 Jochen Neumeister freebsd_committer 2019-12-16 08:54:22 UTC
(In reply to Ed Maste from comment #17)

Oops, I meant https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=235296
Comment 19 Andris Vasers 2019-12-16 23:15:15 UTC
(In reply to Konstantin Belousov from comment #16)
Thanks Konstantin, I'll post the data and attach a link later.
I've now noticed that it doesn't matter which state nginx is in (CPUX or nfs); it gets stuck unexpectedly. I have also tried tuning sysctl values, still with no success. Turning off the NFS buffer pager only helps if the last state was stuck on nfs.
I wouldn't expect a kernel with 11.x compatibility enabled to crash with the same configuration that worked on 11.x RELEASE. Third-party binaries were built from the latest sources. I tried different versions of nginx, even configuring and building it manually without ports. Still the same.
Comment 20 Andris Vasers 2019-12-21 18:46:49 UTC
Now only the workaround suggested in comment 9, setting vfs.nfs.use_buf_pager=0, works for me.
On 11.1-RELEASE it is disabled by default, but on 12.1 it is enabled.
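The workaround can be applied immediately as root with `sysctl vfs.nfs.use_buf_pager=0`; to keep it across reboots it can go in /etc/sysctl.conf. The helper below is a hypothetical sketch that replaces any existing entry for that OID instead of appending duplicates; it assumes the OID is honoured from sysctl.conf, as the thread implies, and takes the file path as a parameter so it can be tried on a copy first.

```shell
#!/bin/sh
# Hypothetical helper: ensure exactly one "vfs.nfs.use_buf_pager=0" line
# in a sysctl.conf-style file, dropping any earlier setting of the same OID.
persist_buf_pager_off() {
    conf=${1:-/etc/sysctl.conf}
    tmp=$(mktemp) || return 1
    # Keep every other line (a missing file just yields an empty copy).
    grep -v '^vfs\.nfs\.use_buf_pager=' "$conf" > "$tmp" 2>/dev/null
    echo 'vfs.nfs.use_buf_pager=0' >> "$tmp"
    # Write back through the original file to keep its owner and mode.
    cat "$tmp" > "$conf" && rm -f "$tmp"
}
```

Rewriting the entry rather than appending avoids a later, conflicting line silently overriding the workaround.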
Comment 21 Konstantin Belousov freebsd_committer 2020-01-04 05:50:14 UTC
(In reply to Andris Vasers from comment #19)
So are you going to provide the requested kernel.full and vmcore files?