Created attachment 186420 [details]
firefox patch

Multi-process firefox appears to use a lot of shared memory backed by files on /tmp. If /tmp is not tmpfs but a regular file system this causes significant delays. Scrolling pages can be slow for example.

This can be improved by patching firefox to use MAP_NOSYNC which prevents dirty pages being flushed to disc as long as they are mapped. When they are unmapped (and all descriptors have been closed) FreeBSD still flushes them though and there are several situations where firefox does this (e.g. switching between tabs and minimising and restoring the browser window). The backing files have been unlinked so why doesn't FreeBSD just discard the pages?

I've attached a patch for firefox that works around this problem by using POSIX shared memory with shm_open for the case of anonymous shared memory. Named shared memory is left unchanged but doesn't appear to be used. The patch also removes recording of the inode because I don't think it's valid for shm_open.
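For illustration only, here is a minimal sketch of the approach described above: anonymous shared memory obtained through shm_open instead of a file under /tmp. This is not the attached patch. The helper name anon_shm_create and the name format in the fallback branch are made up for this example. SHM_ANON is a FreeBSD extension; where it is missing, the sketch assumes that creating a uniquely named object and unlinking it straight away is an acceptable substitute.

```c
#include <sys/mman.h>
#include <sys/stat.h>
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper (not taken from the Firefox patch): return a
 * descriptor for `size` bytes of anonymous POSIX shared memory, or -1.
 */
int
anon_shm_create(size_t size)
{
	int fd;

	assert(size > 0);
#ifdef SHM_ANON
	/* FreeBSD extension: an object that never has a name. */
	fd = shm_open(SHM_ANON, O_RDWR, 0600);
#else
	/* Fallback: create under a unique name, then unlink it at once. */
	char name[64];
	static unsigned counter;

	snprintf(name, sizeof(name), "/anon-shm-%ld-%u",
	    (long)getpid(), counter++);
	fd = shm_open(name, O_RDWR | O_CREAT | O_EXCL, 0600);
	if (fd != -1)
		shm_unlink(name);
#endif
	if (fd == -1)
		return (-1);
	/* A new shared memory object has size zero; give it its size. */
	if (ftruncate(fd, (off_t)size) == -1) {
		close(fd);
		return (-1);
	}
	return (fd);
}
```

The returned descriptor can be passed to mmap with MAP_SHARED in the usual way. Because the object is never backed by a file in a regular file system, unmapping and closing it should not trigger file system writes.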
Could it be that msync(2) is called over those mappings? Try this:

diff --git a/sys/vm/vm_object.c b/sys/vm/vm_object.c
index fa197214296..33e7e6f8643 100644
--- a/sys/vm/vm_object.c
+++ b/sys/vm/vm_object.c
@@ -1083,8 +1068,8 @@ vm_object_sync(vm_object_t object, vm_ooffset_t offset, vm_size_t size,
 	 * I/O.
 	 */
 	if (object->type == OBJT_VNODE &&
-	    (object->flags & OBJ_MIGHTBEDIRTY) != 0) {
-		vp = object->handle;
+	    (object->flags & OBJ_MIGHTBEDIRTY) != 0 &&
+	    ((vp = object->handle)->v_vflag & VV_NOSYNC) == 0) {
 		VM_OBJECT_WUNLOCK(object);
 		(void) vn_start_write(vp, &mp, V_WAIT);
 		vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
(In reply to Konstantin Belousov from comment #1)
There are calls to msync and fsync in Firefox but they seem unrelated. The patch makes no difference (with Firefox patched to use MAP_NOSYNC but no other changes). At some point I also got strange build failures building Firefox so I suspect the patch isn't safe.

Firefox wraps shared memory support in a C++ class. The destructor unmaps the memory and closes the descriptor. The implementation isn't smart enough to keep a pool or something. I see a lot of disk activity (process in wdrain state) when switching tabs, probably because some C++ objects related to the now inactive tab are destroyed at that point. And when all mappings are removed and all descriptors are closed FreeBSD flushes pages to disk (vm_object_terminate?).

When a file has zero links this flushing isn't needed. There's no point in writing data to disk that cannot be read again. The file is essentially extra swap space and pages should only be flushed to disk under memory pressure, even without MAP_NOSYNC.

The following test program creates a data file on the first run and on the second run it unlinks the file and uses mmap. The second run shouldn't cause any disk activity but it does on FreeBSD.

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void) {
	struct stat stat;
	int fd;
	size_t sz;
	char *base;

	sz = 1024 * 4096;
	fd = open("nosync.data", O_RDWR | O_CREAT, 0600);
	fstat(fd, &stat);
	if(stat.st_size != sz) {
		ftruncate(fd, sz);
		base = malloc(sz);
		memset(base, '0', sz);
		write(fd, base, sz);
	} else {
		unlink("nosync.data");
		base = mmap(NULL, sz, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_NOSYNC, fd, 0);
		memset(base, (*base - '0' + 1) % 10 + '0', sz);
	}
	return(0);
}
(In reply to Tijl Coosemans from comment #2)
Which file system do you use? The intended mechanism is that when the last on-disk link goes away, vp->v_vflag gets the VV_NOSYNC bit set. Look for instance at ufs_remove(). Then vm_object_page_clean() skips such vnodes, and this function is exactly what the syncer calls to clean a vnode's cached pages.

How did you ensure that the I/O is caused by the write-out of the data? The metadata must be written on the unmap, and even removal of an unlinked vnode causes a lot of metadata writes on UFS. The inode must be prepared for reuse by writing initial data, all inode data blocks must be marked as free in the cylinder group bitmask, the inode must be marked as free in another bitmask, and then the cg summary must be updated to indicate the new number of free blocks and inodes.

As a side note, I do not see how my patch could cause the failures you described. All it does is prevent msync(2) from actually writing pages to disk for the unlinked file.
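The state in which the last on-disk link is gone but the file is still open can be observed from userland as a link count of zero. A small sketch follows. The helper name unlinked_link_count is hypothetical, and the code shows only what fstat(2) reports, not the kernel's VV_NOSYNC flag, which is not visible to a process.

```c
#include <sys/stat.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a temporary file from the mkstemp(3)
 * template `tmpl`, unlink it while keeping it open, and return the link
 * count that fstat(2) reports afterwards, or -1 on error. On success the
 * still-open descriptor is stored in *fdp for the caller to close.
 */
long
unlinked_link_count(char *tmpl, int *fdp)
{
	struct stat st;
	int fd;

	fd = mkstemp(tmpl);
	if (fd == -1)
		return (-1);
	/* Remove the only name; the open descriptor keeps the inode alive. */
	if (unlink(tmpl) == -1 || fstat(fd, &st) == -1) {
		close(fd);
		return (-1);
	}
	*fdp = fd;
	return ((long)st.st_nlink);
}
```

A result of 0 means nothing can ever open the file by name again, which is the situation both test programs in this report set up before dirtying the mapping.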
(In reply to Konstantin Belousov from comment #3)
It's UFS with softupdates and journaling.

This test program is more like what Firefox does. It opens and unlinks a file, sets the size with ftruncate and uses mmap. There's no disk I/O that I can notice except for the final close which causes a lot of I/O. If you put munmap last the I/O happens on that call. The bigger the file the longer it takes. Surely this isn't all metadata?

#include <sys/mman.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main( void ) {
	int fd;
	size_t sz;
	void *base;

	sz = 128 * 1024 * 1024;
	fd = open( "nosync.data", O_RDWR | O_CREAT, 0600 );
	unlink( "nosync.data" );
	ftruncate( fd, sz );
	base = mmap( NULL, sz, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_NOSYNC, fd, 0 );
	puts( "calling memset" );
	memset( base, '0', sz );
	puts( "memset called" );
	sleep( 5 );
	puts( "calling munmap" );
	munmap( base, sz );
	puts( "munmap called" );
	sleep( 5 );
	puts( "calling close" );
	close( fd );
	puts( "close called" );
	sleep( 5 );
	return( 0 );
}

I didn't really investigate the problem with your patch. I could build Firefox fine and then suddenly gmake couldn't find some targets that were clearly defined in the Makefile. Rerunning the make command a couple times always gave this same error. Rebooting with the old kernel fixed it.
(In reply to Tijl Coosemans from comment #4) +J should make the amount of writes on close significantly larger. +J solves the problem of leaked resources only, since the problem of the filesystem metadata correctness is already solved by the SU part. So +J must journal all metadata objects which cannot be reached from the normal metadata accesses. In particular, all freed data blocks must be mentioned in the journal records before cg bitmaps are allowed to be written. The result is that +J at least doubles the amount of writes, and also the data amount is O(file length) at least. Again, your description does not make me see how my patch (which only affects msync(2)) relates to the build failures.
(In reply to Konstantin Belousov from comment #5) Even with journaling disabled the final close still takes about 5 seconds of continuous disk activity, exactly as long as calling write(2) with 128MiB. And if I set sz to 256MiB it takes 10 seconds. Clearing an inode and updating a few bitmaps shouldn't take that long.
(In reply to Tijl Coosemans from comment #6) Ok, then try to catch the backtraces of the places which initiate that io. Also it would be interesting to see the state of the vnode.
(In reply to Konstantin Belousov from comment #7)
Here's the ddb backtrace and vnode info during the call to close. I've also included the kgdb backtrace.

db> ps
  pid  ppid  pgrp   uid  state  wmesg   wchan               cmd
35223 35219 35223  1001  D+     wdrain  0xffffffff80b36320  nosync
db> t 35223
Tracing pid 35223 tid 100231 td 0xfffff80014932560
sched_switch() at sched_switch+0x263/frame 0xfffffe0096f20f60
mi_switch() at mi_switch+0xd4/frame 0xfffffe0096f20f90
sleepq_wait() at sleepq_wait+0x3a/frame 0xfffffe0096f20fc0
_sleep() at _sleep+0x22d/frame 0xfffffe0096f21040
waitrunningbufspace() at waitrunningbufspace+0x77/frame 0xfffffe0096f21060
bufwrite() at bufwrite+0x199/frame 0xfffffe0096f210a0
cluster_wbuild() at cluster_wbuild+0x7dd/frame 0xfffffe0096f21150
cluster_write() at cluster_write+0x5da/frame 0xfffffe0096f21230
ffs_write() at ffs_write+0x3e2/frame 0xfffffe0096f212d0
VOP_WRITE_APV() at VOP_WRITE_APV+0x103/frame 0xfffffe0096f213e0
vnode_pager_generic_putpages() at vnode_pager_generic_putpages+0x2bf/frame 0xfffffe0096f214b0
VOP_PUTPAGES_APV() at VOP_PUTPAGES_APV+0x78/frame 0xfffffe0096f214e0
vnode_pager_putpages() at vnode_pager_putpages+0x86/frame 0xfffffe0096f21550
vm_pageout_flush() at vm_pageout_flush+0xe8/frame 0xfffffe0096f21650
vm_object_page_collect_flush() at vm_object_page_collect_flush+0x216/frame 0xfffffe0096f217c0
vm_object_page_clean() at vm_object_page_clean+0x146/frame 0xfffffe0096f21830
vinactive() at vinactive+0x98/frame 0xfffffe0096f21890
vputx() at vputx+0x256/frame 0xfffffe0096f218f0
vn_close1() at vn_close1+0xf8/frame 0xfffffe0096f21960
vn_closefile() at vn_closefile+0x50/frame 0xfffffe0096f219e0
closef() at closef+0x226/frame 0xfffffe0096f21a70
closefp() at closefp+0x89/frame 0xfffffe0096f21ab0
amd64_syscall() at amd64_syscall+0x562/frame 0xfffffe0096f21bf0
Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe0096f21bf0
--- syscall (6, FreeBSD ELF64, sys_close), rip = 0x80099666a, rsp = 0x7fffffffea78, rbp = 0x7fffffffeaf0 ---
db> show lockedvnods
Locked vnodes
vnode 0xfffff80077f5eb10: tag ufs, type VREG
    usecount 0, writecount 0, refcount 4099 mountedhere 0
    flags (VV_NOSYNC|VI_ACTIVE|VI_DOINGINACT)
    v_object 0xfffff80014b512d0 ref 0 pages 32776 cleanbuf 4096 dirtybuf 1
    lock type ufs: EXCL by thread 0xfffff80014932560 (pid 35223, nosync, tid 100231)
    ino 3945563, on dev ada0p5

(kgdb) bt
#0  sched_switch (td=0xfffff80014932560, newtd=<optimized out>, flags=<optimized out>)
    at /usr/src/sys/kern/sched_ule.c:1988
#1  0xffffffff80400424 in mi_switch (flags=<optimized out>, newtd=0x0)
    at /usr/src/sys/kern/kern_synch.c:438
#2  0xffffffff8044020a in sleepq_wait (wchan=<unavailable>, pri=<unavailable>)
    at /usr/src/sys/kern/subr_sleepqueue.c:687
#3  0xffffffff803ffedd in _sleep (ident=0xffffffff80b36320 <runningbufreq>, lock=0xffffffff808c0840 <rbreqlock>, priority=84, wmesg=0xffffffff806850b3 "wdrain", sbt=0, pr=0, flags=<optimized out>)
    at /usr/src/sys/kern/kern_synch.c:216
#4  0xffffffff80489697 in waitrunningbufspace ()
    at /usr/src/sys/kern/vfs_bio.c:814
#5  0xffffffff80489219 in bufwrite (bp=0xfffffe007aba3cb8)
    at /usr/src/sys/kern/vfs_bio.c:1990
#6  0xffffffff804982ed in cluster_wbuild (vp=<optimized out>, size=<optimized out>, start_lbn=390, len=<optimized out>, gbflags=<optimized out>)
    at /usr/src/sys/kern/vfs_cluster.c:856
#7  0xffffffff80497a9a in cluster_wbuild_wb (start_lbn=<optimized out>, vp=<optimized out>, size=<optimized out>, len=<optimized out>, gbflags=<optimized out>)
    at /usr/src/sys/kern/vfs_cluster.c:625
#8  cluster_write (vp=0xfffff80077f5eb10, bp=<optimized out>, filesize=134217728, seqcount=-2048, gbflags=8)
    at /usr/src/sys/kern/vfs_cluster.c:694
#9  0xffffffff80587bb2 in ffs_write (ap=0xfffffe0096f213f8)
    at /usr/src/sys/ufs/ffs/ffs_vnops.c:817
#10 0xffffffff80627bb3 in VOP_WRITE_APV (vop=<optimized out>, a=0xfffffe0096f213f8)
    at vnode_if.c:1000
#11 0xffffffff805c8def in VOP_WRITE (vp=<unavailable>, uio=0xfffffe0096f21450, ioflag=8323104, cred=<optimized out>)
    at ./vnode_if.h:413
#12 vnode_pager_generic_putpages (vp=<optimized out>, ma=0xfffffe0096f21660, bytecount=<optimized out>, flags=<optimized out>, rtvals=0xfffffe0096f21560)
    at /usr/src/sys/vm/vnode_pager.c:1276
#13 0xffffffff80629d48 in VOP_PUTPAGES_APV (vop=<optimized out>, a=0xfffffe0096f214f0)
    at vnode_if.c:2930
#14 0xffffffff805c6e36 in VOP_PUTPAGES (vp=<optimized out>, m=<optimized out>, count=<optimized out>, sync=<optimized out>, rtvals=<optimized out>)
    at ./vnode_if.h:1224
#15 vnode_pager_putpages (object=<optimized out>, m=0xfffffe0096f21660, count=<optimized out>, flags=8, rtvals=<optimized out>)
    at /usr/src/sys/vm/vnode_pager.c:1176
#16 0xffffffff805bd988 in vm_pager_put_pages (object=0xfffff80014b512d0, m=0xfffffe0096f21660, count=32, flags=8, rtvals=0xfffffe0096f21560)
    at /usr/src/sys/vm/vm_pager.h:129
#17 vm_pageout_flush (mc=0xfffffe0096f21660, count=32, flags=8, mreq=0, prunlen=0xfffffe0096f2177c, eio=0xfffffe0096f217e4)
    at /usr/src/sys/vm/vm_pageout.c:539
#18 0xffffffff805b5886 in vm_object_page_collect_flush (object=<optimized out>, p=<optimized out>, pagerflags=<optimized out>, flags=<optimized out>, clearobjflags=<optimized out>, eio=<optimized out>)
    at /usr/src/sys/vm/vm_object.c:1032
#19 0xffffffff805b55b6 in vm_object_page_clean (object=0xfffff80014b512d0, start=<optimized out>, end=<optimized out>, flags=<optimized out>)
    at /usr/src/sys/vm/vm_object.c:958
#20 0xffffffff804a9228 in vinactive (vp=0xfffff80077f5eb10, td=0xfffff80014932560)
    at /usr/src/sys/kern/vfs_subr.c:3060
#21 0xffffffff804a96e6 in vputx (vp=0xfffff80077f5eb10, func=2)
    at /usr/src/sys/kern/vfs_subr.c:2789
#22 0xffffffff804b9b28 in vn_close1 (vp=0xfffff80077f5eb10, flags=3, file_cred=0xfffff800141a4b00, td=<optimized out>, keep_ref=false)
    at /usr/src/sys/kern/vfs_vnops.c:459
#23 0xffffffff804b8a00 in vn_closefile (fp=0xfffff80039583960, td=<unavailable>)
    at /usr/src/sys/kern/vfs_vnops.c:1578
#24 0xffffffff803b2486 in fo_close (fp=0xfffff80039583960, td=0xfffff80014932560)
    at /usr/src/sys/sys/file.h:346
#25 _fdrop (fp=0xfffff80039583960, td=<optimized out>)
    at /usr/src/sys/kern/kern_descrip.c:2879
#26 closef (fp=0xfffff80039583960, td=0xfffff80014932560)
    at /usr/src/sys/kern/kern_descrip.c:2460
#27 0xffffffff803afbb9 in closefp (fdp=0xfffff8000485f000, fd=<optimized out>, fp=0xfffff80039583960, td=0xfffff80014932560, holdleaders=<optimized out>)
    at /usr/src/sys/kern/kern_descrip.c:1193
#28 0xffffffff805e8c02 in syscallenter (td=0xfffff80014932560)
    at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:132
#29 amd64_syscall (td=0xfffff80014932560, traced=0)
    at /usr/src/sys/amd64/amd64/trap.c:915
(In reply to Tijl Coosemans from comment #8)
I see, this happens during inactivation. Please try this.

diff --git a/sys/kern/vfs_subr.c b/sys/kern/vfs_subr.c
index 2c144ab17c2..2e8324a7bf8 100644
--- a/sys/kern/vfs_subr.c
+++ b/sys/kern/vfs_subr.c
@@ -3055,7 +3055,8 @@ vinactive(struct vnode *vp, struct thread *td)
 	 * pending I/O and dirty pages in the object.
 	 */
 	obj = vp->v_object;
-	if (obj != NULL && (obj->flags & OBJ_MIGHTBEDIRTY) != 0) {
+	if ((vp->v_vflag & VV_NOSYNC) == 0 && obj != NULL &&
+	    (obj->flags & OBJ_MIGHTBEDIRTY) != 0) {
 		VM_OBJECT_WLOCK(obj);
 		vm_object_page_clean(obj, 0, 0, 0);
 		VM_OBJECT_WUNLOCK(obj);
diff --git a/sys/vm/vm_object.c b/sys/vm/vm_object.c
index fa197214296..33e7e6f8643 100644
--- a/sys/vm/vm_object.c
+++ b/sys/vm/vm_object.c
@@ -1083,8 +1068,8 @@ vm_object_sync(vm_object_t object, vm_ooffset_t offset, vm_size_t size,
 	 * I/O.
 	 */
 	if (object->type == OBJT_VNODE &&
-	    (object->flags & OBJ_MIGHTBEDIRTY) != 0) {
-		vp = object->handle;
+	    (object->flags & OBJ_MIGHTBEDIRTY) != 0 &&
+	    ((vp = object->handle)->v_vflag & VV_NOSYNC) == 0) {
 		VM_OBJECT_WUNLOCK(object);
 		(void) vn_start_write(vp, &mp, V_WAIT);
 		vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
(In reply to Konstantin Belousov from comment #9)
That makes Firefox usable again, thanks.

There's still some disk I/O that seems too much to be just metadata, but I could be wrong about that. If I add a call to fsync before close in the test program above there's still a lot of disk I/O that is fixed by the patch below but it made no difference on Firefox. The only other thing I can think of is that write(2) on an unlinked file still goes straight to the file system, but a patch for that wasn't immediately obvious to me. I've added the backtrace below.

Index: sys/kern/vfs_syscalls.c
===================================================================
--- sys/kern/vfs_syscalls.c	(revision 323578)
+++ sys/kern/vfs_syscalls.c	(working copy)
@@ -3346,6 +3346,8 @@ kern_fsync(struct thread *td, int fd, bool fullsync)
 	if (error != 0)
 		return (error);
 	vp = fp->f_vnode;
+	if ((vp->v_vflag & VV_NOSYNC) != 0)
+		goto drop;
 #if 0
 	if (!fullsync)
 		/* XXXKIB: compete outstanding aio writes */;

db> t 982
Tracing pid 982 tid 100143 td 0xfffff800049b8560
sched_switch() at sched_switch+0x263/frame 0xfffffe0096d7d310
mi_switch() at mi_switch+0xd4/frame 0xfffffe0096d7d340
sleepq_wait() at sleepq_wait+0x3a/frame 0xfffffe0096d7d370
_sleep() at _sleep+0x22d/frame 0xfffffe0096d7d3f0
waitrunningbufspace() at waitrunningbufspace+0x77/frame 0xfffffe0096d7d410
bufwrite() at bufwrite+0x199/frame 0xfffffe0096d7d450
cluster_wbuild() at cluster_wbuild+0x7dd/frame 0xfffffe0096d7d500
cluster_write() at cluster_write+0x5da/frame 0xfffffe0096d7d5e0
ffs_write() at ffs_write+0x3e2/frame 0xfffffe0096d7d680
VOP_WRITE_APV() at VOP_WRITE_APV+0x103/frame 0xfffffe0096d7d790
vn_write() at vn_write+0x1b6/frame 0xfffffe0096d7d810
vn_io_fault1() at vn_io_fault1+0x168/frame 0xfffffe0096d7d950
vn_io_fault() at vn_io_fault+0x189/frame 0xfffffe0096d7d9c0
dofilewrite() at dofilewrite+0x89/frame 0xfffffe0096d7da10
kern_writev() at kern_writev+0x68/frame 0xfffffe0096d7da60
sys_write() at sys_write+0x86/frame 0xfffffe0096d7dab0
amd64_syscall() at amd64_syscall+0x562/frame 0xfffffe0096d7dbf0
Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe0096d7dbf0
--- syscall (4, FreeBSD ELF64, sys_write), rip = 0x80099660a, rsp = 0x7fffffffea88, rbp = 0x7fffffffeaf0 ---
db> show lockedvnods
Locked vnodes
vnode 0xfffff800686793b0: tag ufs, type VREG
    usecount 1, writecount 1, refcount 546 mountedhere 0
    flags (VV_NOSYNC|VI_ACTIVE)
    v_object 0xfffff80068684870 ref 0 pages 4352 cleanbuf 543 dirtybuf 1
    lock type ufs: EXCL by thread 0xfffff800049b8560 (pid 982, nosync, tid 100143)
    ino 3945559, on dev ada0p5
(In reply to Tijl Coosemans from comment #10)
Normal writes, as well as writes initiated by pagedaemon pageouts, must be allowed even for unlinked vnodes. It is possible that the system is low on memory or, as in the case of your backtrace, short of non-dirty reusable buffers, which causes writes. In this case pages can be reused and we still need the page content, because the process with the open handle might access the paged-out page again.

Your patch for kern_fsync() looks fine, but I have not looked at it in detail. Since we lock the vnode on the normal path, I prefer not to be racy there and to check for VV_NOSYNC after the vnode is locked.
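The preference for checking VV_NOSYNC only after the vnode is locked is the usual check-under-lock pattern. The sketch below is a userland analogy with a pthread mutex, not kernel code: struct fake_vnode, its fields and fake_fsync are invented for illustration and do not correspond to any FreeBSD interface.

```c
#include <pthread.h>
#include <stdbool.h>

/* Userland stand-in for a vnode; all names here are illustrative only. */
struct fake_vnode {
	pthread_mutex_t lock;	/* plays the role of the vnode lock */
	bool nosync;		/* plays the role of VV_NOSYNC */
	int flushes;		/* counts simulated write-outs */
};

/* Flush unless the nosync flag is set; returns true if a flush happened. */
bool
fake_fsync(struct fake_vnode *vp)
{
	bool flushed = false;

	pthread_mutex_lock(&vp->lock);
	/*
	 * The flag is tested only while the lock is held, so it cannot
	 * change between the test and the decision to flush.
	 */
	if (!vp->nosync) {
		vp->flushes++;
		flushed = true;
	}
	pthread_mutex_unlock(&vp->lock);
	return (flushed);
}
```

Testing the flag before taking the lock, as in the posted kern_fsync() patch, would leave a window in which another thread could change it between the test and the flush.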
A commit references this bug:

Author: kib
Date: Tue Sep 19 16:46:37 UTC 2017
New revision: 323768
URL: https://svnweb.freebsd.org/changeset/base/323768

Log:
  For unlinked files, do not msync(2) or sync on the vnode deactivation.

  One consequence of the patch is that msyncing unlinked file mappings no longer reduces the amount of the dirty memory in the system, but I do not think that there are users of msync(2) that utilize it for such side-effect.

  Reported and tested by: tjil
  PR: 222356
  Reviewed by: alc
  Sponsored by: The FreeBSD Foundation
  MFC after: 2 weeks
  Differential revision: https://reviews.freebsd.org/D12411

Changes:
  head/sys/kern/vfs_subr.c
  head/sys/vm/vm_object.c
Forgotten to close?
I'd like to see the firefox patch attached to this bug committed to the port. No FreeBSD release has these fixes yet. But even with the fixes there are disk accesses that this patch eliminates.
Created attachment 194794 [details]
firefox patch

Patch updated for Firefox 61.
The firefox patch has been committed upstream in 63.0.