For a server having 360GB of RAM and these defaults: kern.maxfiles: 11520879 kern.maxfilesperproc: 10368783 g_spawn_sync() invokes fork_exec_with_pipes() then do_exec() and finally fdwalk (set_cloexec, GINT_TO_POINTER(3)) fdwalk() function loops over all possible file descriptors no matter open or not and does fcntl(X,F_SETFD,FD_CLOEXEC) for every possible X. So, if I run the Midnight Commander and use F3 (view file) for a file called "natt.diff", mc_popen() function calls g_spawn_async_with_pipes() then g_spawn_sync() is called (see above) is over 10 million of system calls (fcntl) are performed for such a machine. This is quite slow and takes several seconds in total. "Internal view" is used (per defaults). This part of glib2 code is not used for Linux, though as it has fdwalk() in its <stdlib.h> Here comes full backtrace of mc binary built with debugging symbols and killed while it was looping: #0 0x0000000801b984fa in _read () from /lib/libc.so.7 (gdb) bt #0 0x0000000801b984fa in _read () from /lib/libc.so.7 #1 0x0000000803220c46 in pthread_suspend_all_np () from /lib/libthr.so.3 #2 0x0000000801397a6e in g_spawn_close_pid () from /usr/local/lib/libglib-2.0.so.0 #3 0x0000000801396b37 in g_spawn_sync () from /usr/local/lib/libglib-2.0.so.0 #4 0x00000008013962e8 in g_spawn_async_with_pipes () from /usr/local/lib/libglib-2.0.so.0 #5 0x000000000046e47e in mc_popen (command=0x803842520 "/bin/sh /tmp/mc-root/mcextZNX6YZ", error=0x7fffffffdd58) at utilunix.c:525 #6 0x00000000004f05be in mcview_load_command_output (view=0x8038ac180, command=0x803842520 "/bin/sh /tmp/mc-root/mcextZNX6YZ") at datasource.c:391 #7 0x000000000048ac39 in mcview_load (view=0x8038ac180, command=0x803842520 "/bin/sh /tmp/mc-root/mcextZNX6YZ", file=0x80381b9c0 "/usr/ports/security/ipsec-tools/natt.diff", start_line=0, search_start=0, search_end=0) at mcviewer.c:312 #8 0x000000000048a9ac in mcview_viewer (command=0x803842520 "/bin/sh /tmp/mc-root/mcextZNX6YZ", file_vpath=0x8038c16e0, start_line=0, search_start=0, search_end=0) at mcviewer.c:251 #9 0x0000000000423ff1 in exec_extension_view (target=0x0, cmd=0x803842520 "/bin/sh /tmp/mc-root/mcextZNX6YZ", filename_vpath=0x8038c16e0, start_line=0) at ext.c:376 #10 0x00000000004234cb in exec_extension (target=0x0, filename_vpath=0x8038c16e0, lc_data=0x8038946a0 "%view{ascii} /usr/local/libexec/mc/ext.d/misc.sh view cat\n\n# ar library\ #11 0x0000000000422c5e in regex_command_for (target=0x0, filename_vpath=0x8038c16e0, action=0x7fffffffe530 "View", script_vpath=0x0) at ext.c:1025 #12 0x00000000004d0664 in regex_command (filename_vpath=0x8038c16e0, action=0x7fffffffe530 "View") at ext.h:30 #13 0x00000000004d053a in view_file_at_line (filename_vpath=0x8038c16e0, plain_view=0, #14 0x00000000004d06a1 in view_file (filename_vpath=0x8038c16e0, plain_view=0, internal=1) at cmd.c:600 #15 0x00000000004d0869 in do_view_cmd (normal=0) at cmd.c:157 #16 0x00000000004d06bb in view_cmd () at cmd.c:610 #17 0x000000000044478c in midnight_execute_cmd (sender=0x8038431c0, command=101) at midnight.c:1380 #18 0x0000000000443dc0 in midnight_callback (w=0x80385d600, sender=0x8038431c0, msg=MSG_ACTION, parm=101, data=0x0) at midnight.c:1580 #19 0x000000000041ad1a in send_message (w=0x80385d600, sender=0x8038431c0, msg=MSG_ACTION, parm=101, data=0x0) at widget-common.h:210 #20 0x000000000041aabf in buttonbar_call (bb=0x8038431c0, i=2) at buttonbar.c:155 #21 0x000000000041a5e7 in buttonbar_callback (w=0x8038431c0, sender=0x0, msg=MSG_HOTKEY, parm=1003, data=0x0) at buttonbar.c:172 #22 0x000000000042003a in send_message (w=0x8038431c0, sender=0x0, msg=MSG_HOTKEY, parm=1003, data=0x0) at widget-common.h:210 #23 0x0000000000421c9a in dlg_try_hotkey (h=0x80385d600, d_key=1003) at dialog.c:435 #24 0x00000000004212d1 in dlg_key_event (h=0x80385d600, d_key=1003) at dialog.c:480 #25 0x0000000000420da7 in dlg_process_event (h=0x80385d600, key=1003, event=0x7fffffffe960) at dialog.c:1162 #26 0x0000000000421678 in frontend_dlg_run (h=0x80385d600) at dialog.c:542 #27 0x00000000004214ae in dlg_run (h=0x80385d600) at dialog.c:1193 #28 0x00000000004436bd in do_nc () at midnight.c:1791 #29 0x000000000041054b in main (argc=1, argv=0x7fffffffeb38) at main.c:409
Forgot to show versions: glib-2.56.3_3,1 mc-4.8.22_1 FreeBSD 11.2-STABLE/amd64 r344922
There exist kind of work around: "limits -n 1000 mc" decreases number of syscalls and delay in 10000 times.
Seems like the best solution to this is to add fdwalk() to libc, inspired by kinfo_getfile from libutil.
(In reply to Steve Wills from comment #3) It would be nice long-term solution for head but for stable branches we need some change in the port.
See also: https://gitlab.gnome.org/GNOME/glib/issues/1638
Created attachment 205914 [details] patch against glib20 in ports. Hi, while trying to understand why the performance of xfce4 (especially xfce4-appfinder) was poor on my system I discovered the same bug as the reporter of thuis bug. I created the attached patch to test whether I could get any significant performance gains by using the sysctl api to enumerate file descriptors. Before patch: - start time of xfce4-appfinder >3s. After patch: - start time of xfce4-appfinder <1s (instant).
Created attachment 205931 [details] proposed fix
(In reply to Peter Niitenberg from comment #6) (In reply to Peter Niitenberg from comment #6) Peter, thank you very much for the patch. I somewhat modified it and tested with great success. It helps for the MC case too. For a test, I used mentioned system based on Intel Xeon Gold 6130 CPU @ 2.10GHz running FreeBSD 11.2-STABLE with memory: real memory = 377945587712 (360437 MB) avail memory = 367256449024 (350243 MB) sysctl defaults are: kern.maxfiles: 11520879 kern.maxfilesperproc: 10368783 For unpatched glib20 port, the command "mcview /usr/ports/security/ipsec-tools/files/natt.diff" takes as much time to start as: real 0m3,341s user 0m1,349s sys 0m2,002s Very noticable delay over 3 seconds. With patch applied: real 0m0,216s user 0m0,023s sys 0m0,035s That's 5714% system time ratio and 1546% real time ratio. I've attached edited version of same code that adds missing check for result of additional malloc() call. Dear maintainer, please consider adding this to the port.
(In reply to Eugene Grosbein from comment #8) Thank you for reviewing the patch Eugene! Missed the malloc check (it was supposed to go into the following if statement), shouldn't create patches after 00:00 anyhow :).
Comment on attachment 205931 [details] proposed fix >--- glib/gspawn.c.orig 2018-09-21 12:29:23.000000000 +0300 >+++ glib/gspawn.c 2019-07-20 15:37:26.923958000 +0300 >@@ -1077,6 +1084,45 @@ set_cloexec (void *data, gint fd) > } > > #ifndef HAVE_FDWALK >+ >+#ifdef __FreeBSD__ >+static int >+fdwalk_s(int (*cb)(void *data, int fd), void *data, gint *resp) >+{ >+ char *bp, *ep; >+ int mib[4] = {CTL_KERN, KERN_PROC, KERN_PROC_FILEDESC, 0}; >+ size_t len = 0; >+ struct kinfo_file kif; >+ >+ mib[3] = (int)getpid(); >+ >+ if (sysctl(mib, nitems(mib), NULL, &len, NULL, 0) != 0) >+ return FALSE; >+ >+ if ((bp = (char*) malloc(len)) == NULL) >+ return FALSE; I am not sure if it is OK to use malloc here. The current fallback fdwalk function says 'Avoid use of opendir/closedir since these are not async-signal-safe'. Since malloc isn't async-signal-safe, using it may break the expectation.
(In reply to Ting-Wei Lan from comment #10) Fall back linux code in fdwalk() does use opendir("/proc/self/fd"), readdir() and closedir(). And I could not find the phrase 'Avoid use of opendir/closedir since these are not async-signal-safe' anywhere in the glib sources.
(In reply to Eugene Grosbein from comment #11) Oh, it was changed during the development of GLib 2.58. https://gitlab.gnome.org/GNOME/glib/merge_requests/490 It said that using malloc could cause deadlock.
Created attachment 205937 [details] proposed fix Protect fdwalk_s() with mutex.
(In reply to Ting-Wei Lan from comment #12) Nice catch. I've updated my patch to wrap new code calling sysctl and malloc() with mutex lock if code runs in multi-threaded mode.
(In reply to Eugene Grosbein from comment #14) Adding a lock doesn't make an async-signal-unsafe function safe to use.
I also ran into this on my POWER9 system. A year ago I thought it was just the bad AST-based VGA framebuffer. But with a Radeon in it I saw the same problem, and spent a bit of time debugging, only to arrive at this exact same result. By using kinfo_getfile() (libutil) in fdwalk() launching mate-calculator from mate-panel went from ~12 seconds down to ~0.5 seconds. kern.maxfilesperproc being ~7.5M.
Another glib20 performance issue: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=240517
Same patch required for: x11-toolkits/vte3: src/vtespawn.cc
(In reply to Ting-Wei Lan from comment #12) IMHO this is wrong or incomplete description. 1. malloc() can not cause deadlock, it is heavy used by many multithread apps. 2. Docs says that opendir() is thread and mt safe: http://man7.org/linux/man-pages/man3/opendir.3.html But readdir() is not: MT-Unsafe: http://man7.org/linux/man-pages/man3/readdir.3.html 3. Reading dir using more than one syscall is bad idea because between read it can change some content is it will not be handled/processed. As I remember this was a reason to drop opendir() in mine FAM backend and some strange crashes: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=214338 So I do not understand why some one can need LOCK() in fdwalk().
Created attachment 207442 [details] glib patch Remove lock, use memcpy() instead of x=*y, add extra buf space, remove duplicate includes.
Created attachment 207479 [details] reworked
https://gitlab.gnome.org/GNOME/glib/merge_requests/1106
(In reply to Ting-Wei Lan from comment #12) > 1. malloc() can not cause deadlock, it is heavy used by many multithread apps. Thread-safe does not imply async-signal-safe. The former can be achieved by using a mutex on shared resources, but the latter cannot. 'man fork' says: If the process has more than one thread, locks and other resources held by the other threads are not released and therefore only async-signal-safe functions (see sigaction(2)) are guaranteed to work in the child process until a call to execve(2) or a similar function. Since fdwalk is called from the child process and it is possible for the parent process to have multiple threads, it can only call async-signal-safe functions. malloc is thread-safe but not async-signal-safe because the implementation is likely to use a lock to protect global data. Assume that there is a thread calling malloc while you are calling fork. If the malloc lock is held while forking, the child process will have the malloc lock held as well. Since the child process only has one thread, which is the thread calling fork, the malloc lock can never be released because the thread holding the lock does not exist in the child process. Therefore, if you call malloc in the child process before exec, it is possible to cause deadlock. > 2. Docs says that opendir() is thread and mt safe: > http://man7.org/linux/man-pages/man3/opendir.3.html > But readdir() is not: MT-Unsafe: > http://man7.org/linux/man-pages/man3/readdir.3.html The current master branch uses neither opendir nor readdir. > 3. Reading dir using more than one syscall is bad idea because between read > it can change some content is it will not be handled/processed. > As I remember this was a reason to drop opendir() in mine FAM backend and > some strange crashes: > https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=214338 > So I do not understand why some one can need LOCK() in fdwalk(). fdwalk should not use LOCK. Where did you find it?
I use mine version of patch last 2 weeks on few workstations - see no crashes/issues. fdwalk() in current, in future, with glib 2.62+ this patch can be deleted, if some one backport fdwalk() to stable. https://reviews.freebsd.org/D21206
What should happen to merge it?
Created attachment 209087 [details] more faster I rework my patch to use struct xfile instead of struct kinfo_file. struct kinfo_file - causes to read many info, in case files on non local fs this produces traffic and lags. For example I open in thunar /usr/ports/devel mounted on NFS, and same folder mounted via sshfs, and got ~9000 opened files descriptors. Then while I try to open any file thunar make call fdwalk() and it freeze on sysctl(CTL_KERN, KERN_PROC, KERN_PROC_FILEDESC) 5-10 seconds, and this happen 2 times, so total freeze time starting from 10 second per one opened file. struct xfile have less info/size and does not cause any side effects.
ping
Created attachment 218218 [details] patch rebased to 2.66.0
There's three patches here that seem to do the same thing, in three wildly different ways. One has a bunch of additional synchronisation that the others do not. There's a related phab review (through PR240548: it is https://reviews.freebsd.org/D21206 ) that solves the problem in libc. I'm inclined to +1 the https://bugs.freebsd.org/bugzilla/attachment.cgi?id=218218 patch version (from 2020-09-23) because it matches the patch in the other PR as well.
A commit references this bug: Author: tcberner Date: Wed Dec 23 17:30:19 UTC 2020 New revision: 559006 URL: https://svnweb.freebsd.org/changeset/ports/559006 Log: devel/glib20: loops over all possible file descriptors - Stop glib from looping over all possible file descriptors. - This should greatly increase performance PR: 236815 Submitted by: rozhuk.im@gmail.com (committed version) Reported by: Eugene Grosbein <eugen@freebsd.org> MFH: 2020Q4 Changes: head/devel/glib20/Makefile head/devel/glib20/files/patch-glib_gspawn.c
A commit references this bug: Author: tcberner Date: Wed Dec 23 17:47:12 UTC 2020 New revision: 559008 URL: https://svnweb.freebsd.org/changeset/ports/559008 Log: MFH: r559006 devel/glib20: loops over all possible file descriptors - Stop glib from looping over all possible file descriptors. - This should greatly increase performance PR: 236815 Submitted by: rozhuk.im@gmail.com (committed version) Reported by: Eugene Grosbein <eugen@freebsd.org> Changes: _U branches/2020Q4/ branches/2020Q4/devel/glib20/Makefile branches/2020Q4/devel/glib20/files/patch-glib_gspawn.c
Committed, thanks!