Bug 246886 - Nginx + FUSE causes VM stall
Summary: Nginx + FUSE causes VM stall
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 12.1-RELEASE
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: Alan Somers
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-05-31 02:32 UTC by Hiroshi Nishida
Modified: 2020-06-25 18:19 UTC
CC List: 5 users

See Also:


Attachments
Output of procstat -kka (24.98 KB, text/plain)
2020-06-05 14:52 UTC, Hiroshi Nishida

Description Hiroshi Nishida 2020-05-31 02:32:21 UTC
I'm developing a distributed file system using FUSE on FreeBSD 12.1-RELEASE and STABLE. However, when Nginx accesses the FUSE-mounted filesystem, Nginx stalls in the 'grbmaw' state.
grbmaw appears to be the wait message used by vm_page_busy_sleep(), and for some reason the thread never wakes up again.
My FUSE program is just waiting for the next command at fuse_session_loop_mt().

I have tested on 4 different kinds of hardware and get this problem on 3 of them. The biggest difference between those 3 and the remaining 1 is the NIC: the 3 affected machines are wired and the remaining one is wireless.
My guess is that this happens when Nginx has to read or send data at a high rate.
Once it happens, I need to reboot the device but it does not always shut down.
It is easily reproducible.

I think the problem is fuse.ko related and would appreciate it if anybody could fix it.

Thank you.
Comment 1 Mark Linimon freebsd_committer freebsd_triage 2020-05-31 02:59:26 UTC
Since it involves fusefs, over to fs@.
Comment 2 Mark Johnston freebsd_committer 2020-06-01 16:16:10 UTC
Could you provide the stack of the thread stuck in grbmaw?  procstat -kk <pid of stuck process> will show it.

A reproducer would also be useful.
Comment 3 Hiroshi Nishida 2020-06-01 16:44:55 UTC
Thank you for your response.
Here is the stack:

  PID    TID COMM                TDNAME              KSTACK                       
 1190 100092 nginx               -                   mi_switch+0xd4 sleepq_wait+0x2c _sleep+0x253 vm_page_busy_sleep+0xaf vm_page_grab_pages+0x3f2 allocbuf+0x371 getblkx+0x5be breadn_flags+0x3d vfs_bio_getpages+0x33f fuse_vnop_getpages+0x46 VOP_GETPAGES_APV+0x7b vop_stdgetpages_async+0x49 VOP_GETPAGES_ASYNC_APV+0x7b vnode_pager_getpages_async+0x7d vn_sendfile+0xdc2 sendfile+0x12b amd64_syscall+0x387 fast_syscall_common+0x101 

Well, I wish I could share my source code, but unfortunately it's commercial and closed.
However, I may be able to write a similar, simpler program and test with that.
Comment 4 Hiroshi Nishida 2020-06-01 20:47:59 UTC
I added 

sendfile off;

to nginx.conf and tested again.
Interestingly, nginx no longer stalls on any of the devices and seems to be running with no problem.

It looks like sendfile triggers the problem.
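
For readers unfamiliar with what `sendfile off;` changes, here is a minimal Python sketch of the two transmit strategies. It is an illustration only, not nginx's code: the function names `send_with_sendfile` and `send_with_read_write` are hypothetical. With sendfile(2), the kernel pages the file straight into the socket, which on FreeBSD is the vn_sendfile path visible in the stack in comment 3. With it off, the server read()s into a user-space buffer and writes that buffer to the socket, so that path is not taken.

```python
import os
import socket


def send_with_sendfile(sock, path):
    """Send a whole file using sendfile(2); the kernel copies file pages to the socket."""
    sent = 0
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        while sent < size:
            # os.sendfile(out_fd, in_fd, offset, count) returns the bytes sent
            n = os.sendfile(sock.fileno(), f.fileno(), sent, size - sent)
            if n == 0:
                break  # file was truncated underneath us
            sent += n
    return sent


def send_with_read_write(sock, path, bufsize=65536):
    """Send a whole file with plain read() + send(), roughly what 'sendfile off;' falls back to."""
    sent = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(bufsize)  # data passes through a user-space buffer
            if not chunk:
                break
            sock.sendall(chunk)
            sent += len(chunk)
    return sent
```

Both functions deliver the same bytes; they differ only in which kernel code path reads the file, which would be consistent with the stall disappearing once sendfile is turned off.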
Comment 5 Alan Somers freebsd_committer 2020-06-01 20:50:38 UTC
(In reply to Hiroshi Nishida from comment #4)
Yep, and perhaps that's an adequate workaround for you.  But we should still fix the bug.  Have you had any success with reducing your test case?
Comment 6 Hiroshi Nishida 2020-06-01 21:51:07 UTC
No.
It's going to take a while, and there is no guarantee that it will reproduce the same problem.
However, I'll try it.
Comment 7 Julien Cigar 2020-06-02 09:39:36 UTC
Maybe the same issue as bug #244713.
Comment 8 Mark Johnston freebsd_committer 2020-06-02 13:28:01 UTC
(In reply to Julien Cigar from comment #7)
Yes, I suspect that it would be a better use of time to try testing on the latest stable/12.  A number of bugs in sendfile have been fixed in the past few months, most of which were only reproducible with !UFS.
Comment 9 Hiroshi Nishida 2020-06-02 14:05:16 UTC
(In reply to Mark Johnston from comment #8)

The problem is reproducible with 12-STABLE downloaded last week.
I'll start writing a test filesystem today and hope it reproduces the same problem.

Intuitively, it seems to happen when the filesystem is slow.
Comment 10 Mark Johnston freebsd_committer 2020-06-02 14:45:28 UTC
(In reply to Hiroshi Nishida from comment #9)
If it is convenient to test, I would also suggest trying the latest -CURRENT snapshots.  sendfile internals have diverged a fair bit between HEAD and stable/12.  If you are able to reproduce there, the output of "procstat -kka" would probably be sufficient to start investigating.  Otherwise, the same output from the stable/12 system would be helpful.
Comment 11 Hiroshi Nishida 2020-06-03 03:41:40 UTC
I have uploaded my test program to:
https://github.com/scopedog/FreeBSD-FUSE-sendfile

I could reproduce the problem very easily in my office, but it took quite a while at home.
It seems the network bandwidth is somehow related but I'm not sure.
If it is too hard to reproduce, I may write another one.

Please take a look at README for the installation, usage, etc.
You will need to keep seeking in the video until it freezes.
With my original program, it takes only seconds.

Unfortunately, I don't have any devices available for installing CURRENT right now.
I would appreciate it if somebody could test with CURRENT.
Comment 12 Alan Somers freebsd_committer 2020-06-04 01:14:43 UTC
The test case does not work for me.  When I try, I get the following error.  Maybe you need to add the video file to the git repo?

> sudo ./fusetest -d
DEBUG: FuseGetattr: path: /BigBuckBunny-Full-web.mp4
DEBUG: FuseOpen: path: /BigBuckBunny-Full-web.mp4, flags: 0
DEBUG: FuseRead: path: /BigBuckBunny-Full-web.mp4, fi->fh: 0x0, size: 65536, offset: 0
DEBUG: FuseRelease: path: /BigBuckBunny-Full-web.mp4, fi->fh: 0x0
DEBUG: FuseGetattr: path: /BigBuckBunny-Full-web.mp4
DEBUG: FuseOpen: path: /BigBuckBunny-Full-web.mp4, flags: 0
DEBUG: FuseRead: path: /BigBuckBunny-Full-web.mp4, fi->fh: 0x0, size: 65536, offset: 0
Error: FuseRead: curl_easy_perform: Timeout was reached
^C^C^C^CDEBUG: FuseRelease: path: /BigBuckBunny-Full-web.mp4, fi->fh: 0x0
Comment 13 Hiroshi Nishida 2020-06-04 02:23:09 UTC
(In reply to Alan Somers from comment #12)

That may be because rnci002.ddns.net is an IPv6-only server and its IPv4 address is a dummy.

Could you check if your device running fusetest can access
http://rnci002.ddns.net/raw-videos/BigBuckBunny-Full-web.mp4 ?

BigBuckBunny-Full-web.mp4 can be downloaded from there.
It's too large to put on GitHub.
Comment 14 Hiroshi Nishida 2020-06-04 02:29:21 UTC
Well I can put BigBuckBunny-Full-web.mp4 on a different server with an accessible IPv4 address.
Let me know if you need it.
Comment 15 Alan Somers freebsd_committer 2020-06-04 02:39:29 UTC
(In reply to Hiroshi Nishida from comment #14)
Please do.  I'm on an IPv4-only ISP.
Comment 16 Hiroshi Nishida 2020-06-04 03:28:04 UTC
Here you go.
Please update from https://github.com/scopedog/FreeBSD-FUSE-sendfile
Comment 17 Alan Somers freebsd_committer 2020-06-04 03:46:30 UTC
I can't reproduce it so far, on either 12.1-RELEASE or on 13-CURRENT.  What I'm doing is loading the video, pressing seek, and as soon as the next frame renders seeking again.  So far I've done that all the way through about five times with no hangs.
Comment 18 Hiroshi Nishida 2020-06-04 03:55:58 UTC
It took about 10 minutes at home, so please don't give up yet.
That said, please try over your LAN instead.
Put BigBuckBunny-Full-web.mp4 on a machine in your LAN and change

#define URL     "http://rnc02.asusa.net/raw/BigBuckBunny-Full-web.mp4"

in fusetest.h to the URL of the new one.
I used the LAN in my office and the video froze very easily.

If you still cannot reproduce the problem, I'll think about using my original program, which freezes the video extremely easily.
But I would need permission from my boss, and that will be hard to get.
Comment 19 Hiroshi Nishida 2020-06-04 18:38:18 UTC
I have created another test program that reproduces the error more easily than the first one, at least in my environment.

https://github.com/scopedog/FreeBSD-FUSE-sendfile-2

If you have had no luck reproducing the error, please try it.
Just keep seeking and clicking, even while the video is loading.
The video now freezes within 10 seconds at home.

By the way, I run nginx and fusetest on mini PCs such as an Intel NUC.
If possible, run them on a slow PC.
Comment 20 Alan Somers freebsd_committer 2020-06-05 04:14:08 UTC
Still can't reproduce.  I don't have any slow FreeBSD computers, but I tried running a CPU-intensive benchmark in the background and it didn't help.
Comment 21 Hiroshi Nishida 2020-06-05 05:01:43 UTC
Okay, let me think about what to do.
Can you debug remotely?
Alternatively, I can send you one of my servers (maybe an Intel NUC), but either way I need permission from my boss.
Comment 22 Mark Johnston freebsd_committer 2020-06-05 14:47:04 UTC
(In reply to Hiroshi Nishida from comment #21)
I don't mean to interrupt, but in the meantime it would help to see "procstat -kka" output from a system where the deadlock is occurring.  Presumably some other thread is holding the page busy, which is causing nginx to block forever.  procstat output would help identify that thread.
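
As a rough aid for reading such output, the following Python sketch filters saved "procstat -kka" text for threads whose kernel stack contains a given function. It is only a sketch under assumptions: it assumes the column layout shown in comment 3 (PID, TID, COMM, TDNAME, KSTACK) and treats every `name+0xoffset` token as a stack frame. The helper names `parse_kstacks` and `threads_in` are hypothetical, and the sample row is the nginx stack from comment 3.

```python
import re

# A kernel stack frame looks like "vm_page_busy_sleep+0xaf"
FRAME_RE = re.compile(r"^([A-Za-z_][\w.]*)\+0x[0-9a-f]+$")

# Header and row copied from comment 3
SAMPLE = """\
  PID    TID COMM                TDNAME              KSTACK
 1190 100092 nginx               -                   mi_switch+0xd4 sleepq_wait+0x2c _sleep+0x253 vm_page_busy_sleep+0xaf vm_page_grab_pages+0x3f2 allocbuf+0x371 getblkx+0x5be breadn_flags+0x3d vfs_bio_getpages+0x33f fuse_vnop_getpages+0x46 VOP_GETPAGES_APV+0x7b vop_stdgetpages_async+0x49 VOP_GETPAGES_ASYNC_APV+0x7b vnode_pager_getpages_async+0x7d vn_sendfile+0xdc2 sendfile+0x12b amd64_syscall+0x387 fast_syscall_common+0x101
"""


def parse_kstacks(text):
    """Return one dict per thread row: pid, tid, comm, and a list of frame names."""
    threads = []
    for line in text.splitlines():
        tokens = line.split()
        # Skip the header, blank lines, and anything not starting with PID and TID
        if len(tokens) < 3 or not tokens[0].isdigit() or not tokens[1].isdigit():
            continue
        frames = [m.group(1) for m in map(FRAME_RE.match, tokens[3:]) if m]
        threads.append({"pid": int(tokens[0]), "tid": int(tokens[1]),
                        "comm": tokens[2], "frames": frames})
    return threads


def threads_in(threads, function):
    """Threads whose kernel stack passes through the named function."""
    return [t for t in threads if function in t["frames"]]


if __name__ == "__main__":
    for t in threads_in(parse_kstacks(SAMPLE), "vm_page_busy_sleep"):
        print(t["pid"], t["tid"], t["comm"], " <- ".join(t["frames"][:5]))
```

Filtering on `vm_page_busy_sleep` lists the waiters; running the same filter with a fusefs or sendfile function name is one way to look for a candidate thread that might be holding the page busy.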
Comment 23 Hiroshi Nishida 2020-06-05 14:52:39 UTC
Created attachment 215250
Output of procstat -kka

Here you go.
Comment 24 Mark Johnston freebsd_committer 2020-06-05 15:16:47 UTC
(In reply to Hiroshi Nishida from comment #23)
I don't see any other blocked threads, which suggests that the busy lock is being leaked somewhere.

Do any of your sendfile calls result in read errors from fuse?  In other words, do you ever see sendfile_iodone() being called with error != 0?  It can be verified by running:

# dtrace -n 'fbt::sendfile_iodone:entry /args[3] != 0/{stack();}'

while running your test.
Comment 25 Hiroshi Nishida 2020-06-05 15:37:03 UTC
(In reply to Mark Johnston from comment #24)

My FUSE program does not use sendfile, but some of my other programs, such as rncddsd and rncmond, do, so I stopped them.
However, I still get the same error, and dtrace outputs nothing after

dtrace: description 'fbt::sendfile_iodone:entry ' matched 1 probe

The procstat -kka output looks almost the same.
Comment 26 Hiroshi Nishida 2020-06-05 22:52:34 UTC
(In reply to Mark Johnston from comment #24)

I repeatedly ran one of my programs that uses sendfile, but it didn't return an error and all the data were sent correctly, even after nginx froze.

dtrace didn't output anything, either.
Comment 27 Hiroshi Nishida 2020-06-10 15:05:19 UTC
I tested with CURRENT. 
It seems OK with CURRENT; nginx has not frozen so far.
Comment 28 Alan Somers freebsd_committer 2020-06-10 15:24:37 UTC
(In reply to Hiroshi Nishida from comment #27)
Terrific news!  Would you be able to test on the latest stable/12 as well?
Comment 29 Hiroshi Nishida 2020-06-10 15:31:47 UTC
(In reply to Alan Somers from comment #28)
A new SSD will arrive on Saturday.
I will try then.
Comment 30 Hiroshi Nishida 2020-06-15 22:58:38 UTC
(In reply to Alan Somers from comment #28)

Unfortunately, the problem still persists with 12.1-STABLE r362026 dated 20200611.

By the way, I got permission from my boss to let other people log in to my device, or to lend it out for development.
If you need it, let me know.
I've been using FreeBSD for 23 years and am willing to help get this bug fixed.