I'm the maintainer of lang/opencoarrays. tijl@ couldn't reproduce this problem on 12-current. I'm asking for somebody on 11.1-RELEASE to try to reproduce it:

1. Build net/mpich with gcc7.
2. Build lang/opencoarrays with default options (MPICH) with gcc7.
3. Run "make test" under lang/opencoarrays.

On my box a silent reboot follows the successful completion of all tests. My box is an 11.1-RELEASE-p8 amd64 laptop with:

FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs
FreeBSD/SMP: 1 package(s) x 2 core(s) x 2 hardware threads

Thanks,
Anton
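For reference, the reproduction boils down to roughly the following commands (a sketch only; it assumes a standard ports tree under /usr/ports and that forcing gcc7 via DEFAULT_VERSIONS/USE_GCC is acceptable, as in the Makefile.local shown later in this PR):

# Sketch of the reproduction steps; adjust paths/knobs for your system.
echo 'DEFAULT_VERSIONS+=gcc=7' >> /etc/make.conf   # make gcc7 the default GCC

cd /usr/ports/net/mpich
make USE_GCC=7 install clean        # build and install MPICH with gcc7

cd /usr/ports/lang/opencoarrays
make USE_GCC=7                      # default options, i.e. the MPICH backend
make test                           # on the affected box: all tests pass, then a silent reboot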
I can reproduce the panic on my home laptop, amd64 with:

FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
FreeBSD/SMP: 1 package(s) x 2 core(s)

also running 11.1-RELEASE-p8. Actually, "make test" results in a halt, not a reboot.
There is a new update of the port from today, ports r465154.
Sorry, overlooked you are the maintainer.
I've done a bit more digging. "make test" results in a halt only when run as root, which is the default. When I do "chown -R user:user work", where user is some unprivileged user, then "make test" does not cause a hang.

The ideal solution is not to run the tests as root anyway. In fact, if lang/opencoarrays is built with OpenMPI or OpenMPI2, then the tests will not run under root:

Script started on Thu Mar 22 11:41:05 2018
Command: ctest -VV
UpdateCTestConfiguration from :/usr/ports/lang/opencoarrays/work/.build/DartConfiguration.tcl
UpdateCTestConfiguration from :/usr/ports/lang/opencoarrays/work/.build/DartConfiguration.tcl
Test project /usr/ports/lang/opencoarrays/work/.build
Constructing a list of tests
Done constructing a list of tests
Updating test list for fixtures
Added 0 tests to meet fixture requirements
Checking test dependency graph...
Checking test dependency graph end
test 1
    Start 1: initialize_mpi
1: Test command: /usr/local/bin/bash "/usr/ports/lang/opencoarrays/work/.build/bin/cafrun" "-np" "2" "--hostfile" "/usr/ports/lang/opencoarrays/work/.build/hostfile" "/usr/ports/lang/opencoarrays/work/.build/bin/OpenCoarrays-2.0.0-tests/initialize_mpi"
1: Test timeout computed to be: 9.99988e+06
1: --------------------------------------------------------------------------
1: mpiexec has detected an attempt to run as root.
1: Running at root is *strongly* discouraged as any mistake (e.g., in
1: defining TMPDIR) or bug can result in catastrophic damage to the OS
1: file system, leaving your system in an unusable state.
1:
1: You can override this protection by adding the --allow-run-as-root
1: option to your cmd line. However, we reiterate our strong advice
1: against doing so - please do so at your own risk.
1: --------------------------------------------------------------------------
1: Error: Command:
1:    `/usr/local/mpi/openmpi2/bin/mpiexec -n 2 --hostfile /usr/ports/lang/opencoarrays/work/.build/hostfile /usr/ports/lang/opencoarrays/work/.build/bin/OpenCoarrays-2.0.0-tests/initialize_mpi`
1: failed to run.
 1/63 Test #1: initialize_mpi .......................***Failed  Required regular expression not found.Regex=[Test passed.

Yes, adding --allow-run-as-root works, and the tests can be run as root, but it's a bad idea, as the halt shows.

So the question is: is anybody in FreeBSD interested in investigating why the hang happens when the tests are run as root? If so, then this PR should be fixed. If not, meaning that running "make test" as root is asking for trouble, and a halt in an MPI program run as root is not that surprising, then this PR can be closed, as there is no problem. I'll then open another PR asking for help to change the port to *not* run the tests as root.
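The non-root workaround above amounts to roughly this (a sketch; "user" stands for any existing unprivileged account, and the .build path is the one from the log above):

# Hand the work directory to an unprivileged user and run ctest as that user.
cd /usr/ports/lang/opencoarrays
chown -R user:user work
su -m user -c 'cd /usr/ports/lang/opencoarrays/work/.build && ctest'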
Can you attach net/mpich/work/mpich-3.2/src/pm/hydra/config.log? There was a problem with devel/hwloc on FreeBSD 11.1 that has been fixed. MPICH is supposed to use libhwloc from that port but it also contains an embedded copy. Maybe for some reason it's using that in your case.
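For what it's worth, a rough way to check which hwloc the hydra build actually picked up (a sketch; the binary paths assume the default PREFIX of /usr/local):

# Did hydra's configure find the system libhwloc or fall back to the embedded copy?
grep -i hwloc /usr/ports/net/mpich/work/mpich-3.2/src/pm/hydra/config.log | head -n 20

# What do the installed hydra binaries actually link against?
ldd /usr/local/bin/mpiexec | grep hwloc
ldd /usr/local/bin/hydra_pmi_proxy | grep hwloc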
Created attachment 192480 [details]
net/mpich/work/mpich-3.2/src/pm/hydra/config.log

Note that I'm building with:

# cat /usr/ports/net/mpich/Makefile.local
USE_GCC=7
DEFAULT_VERSIONS=gcc=7
#
(In reply to Anton Shterenlikht from comment #6) Hmm, looks normal. You have the latest hwloc installed, right (1.11.7_2)?
20474701626e> pkg info -xo hwloc
hwloc-1.11.7_2                 devel/hwloc

But there was a massive ports update a few days ago. Let me see if I can still reproduce the hang.
yes, still hangs
(In reply to Anton Shterenlikht from comment #9) There are several open issues compiling opencoarrays with gcc7. Is it even supported upstream?
Yes, of course. Upstream's problem is that they test only with GCC, hence our errors with clang. I spoke to one of the opencoarrays developers last week about our problems; he promised to pay more attention to the non-Linux, non-GCC world. Anyway, I think the issue is not limited to opencoarrays: running tests as root seems to be asking for trouble, so I'm keen not to allow this for my port. But then I'm not sure that fits well with package building, etc.
I'm now on 11.2-RELEASE-p2. Running "make test" as root still gives a hang. Running as an unprivileged user is fine. The latest port PR is now: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=230689
(In reply to Anton Shterenlikht from comment #12) It doesn't hang for me in 11.1 or 11.2.
Something must be wrong on my end then. I reproduced the hang again this morning.
11.1 is EOL. GCC_DEFAULT=8. Is this still relevant?
Actually, I want to drop maintainership of this port. I recently changed jobs and really have no time to maintain the port anymore. Sorry... As to whether it's relevant, I guess somebody has to see if the problem exists in 12.0-RELEASE
Yes, it's still there. I just reproduced it on a 12.0-RELEASE-p3 laptop. I think it is very dangerous and would be great to sort out, but I haven't got the skills.
I'm on 12.0-RELEASE-p3 too:

uname -a
FreeBSD hammer 12.0-RELEASE-p3 FreeBSD 12.0-RELEASE-p3 GENERIC amd64

But I still cannot reproduce the issue. "make test" runs fine as a regular user or as root via sudo. See: https://www.dropbox.com/s/4syrn53u9zxuf7d/typescript?dl=0

About the other issue: do you want me to reset your maintainership of the port?
Done per ports r492769.
The port has been updated to 2.3.1 in ports r235686. I'm build-testing now to check whether I can reproduce the problem with the new version.
Have you tried using real root, not sudo?
Yes, an ordinary user is fine. The problem is with root only.
(In reply to Anton Shterenlikht from comment #21) Yes, poudriere testport gives you a "real" root logged into the jail. The last lines from make test:

...
69/78 Test #69: image_status_test_1 .................... Passed 0.08 sec
      Start 70: image_fail_test_1
70/78 Test #70: image_fail_test_1 ...................... Passed 0.03 sec
      Start 71: image_fail_and_sync_test_1
71/78 Test #71: image_fail_and_sync_test_1 ............. Passed 0.03 sec
      Start 72: image_fail_and_sync_test_2
72/78 Test #72: image_fail_and_sync_test_2 ............. Passed 0.03 sec
      Start 73: image_fail_and_sync_test_3
73/78 Test #73: image_fail_and_sync_test_3 ............. Passed 0.03 sec
      Start 74: image_fail_and_status_test_1
74/78 Test #74: image_fail_and_status_test_1 ........... Passed 0.03 sec
      Start 75: image_fail_and_failed_images_test_1
75/78 Test #75: image_fail_and_failed_images_test_1 .... Passed 0.03 sec
      Start 76: image_fail_and_stopped_images_test_1
76/78 Test #76: image_fail_and_stopped_images_test_1 ... Passed 0.03 sec
      Start 77: image_fail_and_get_test_1
77/78 Test #77: image_fail_and_get_test_1 .............. Passed 0.03 sec
      Start 78: test-installation-scripts.sh
78/78 Test #78: test-installation-scripts.sh ........... Passed 0.20 sec

100% tests passed, 0 tests failed out of 78

Total Test time (real) = 8.23 sec
root@12_0amd64-default:/usr/ports/lang/opencoarrays #
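For anyone who wants to retry it the same way, the poudriere run is roughly the following (a sketch; the jail and ports-tree names are examples, adjust to your setup):

# One-time jail setup, then test the port as root inside the jail.
poudriere jail -c -j 12_0amd64 -v 12.0-RELEASE -a amd64
poudriere testport -j 12_0amd64 -p default -o lang/opencoarrays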
What can I say... I just reproduced it again. I've had it with several versions of the port now, on a Dell 4-core laptop under 11.(something)-RELEASE, and now on a Lenovo X240 under 12.0-RELEASE.

% /usr/bin/su -
# cd /usr/ports/lang/opencoarrays
# make test

All tests pass, and about 1 sec later - reboot.

I really now doubt this is me (see above - multiple port versions, 2 hardware specs, 2 major OS revisions). But I don't know what else I can do to help, if nobody else can reproduce it... My boxes are completely standard installs, nothing strange or unusual. There is nothing in /var/crash, and only this in /var/log/messages:

...
Feb 13 21:09:10 z wpa_supplicant[681]: wlan0: WPA: Group rekeying completed with c0:05:c2:6e:e8:19 [GTK=CCMP]
Feb 13 22:09:10 z syslogd: last message repeated 1 times
Feb 13 22:35:47 z su[8340]: as to root on /dev/pts/1
Feb 13 22:36:01 z kernel: , 1211.
Feb 13 22:36:01 z kernel: .
Feb 13 22:36:01 z ntpd[1135]: ntpd exiting on signal 15 (Terminated)
Feb 13 22:36:02 z kernel: , 1135.
Feb 13 22:36:02 z syslogd: exiting on signal 15
Feb 13 22:37:03 z syslogd: kernel boot file is /boot/kernel/kernel
Feb 13 22:37:03 z kernel: wlan0: link state changed to DOWN
Feb 13 22:37:03 z kernel: Waiting (max 60 seconds) for system process `vnlru' to stop... done
Feb 13 22:37:03 z kernel: Waiting (max 60 seconds) for system process `syncer' to stop...
Feb 13 22:37:03 z kernel: Syncing disks, vnodes remaining... 1 0 done
Feb 13 22:37:03 z kernel: Waiting (max 60 seconds) for system thread `bufdaemon' to stop... done
Feb 13 22:37:03 z kernel: Waiting (max 60 seconds) for system thread `bufspacedaemon-0' to stop... done
Feb 13 22:37:03 z kernel: Waiting (max 60 seconds) for system thread `bufspacedaemon-1' to stop... done
Feb 13 22:37:03 z kernel: Waiting (max 60 seconds) for system thread `bufspacedaemon-2' to stop... done
Feb 13 22:37:03 z kernel: Waiting (max 60 seconds) for system thread `bufspacedaemon-3' to stop... done
Feb 13 22:37:03 z kernel: All buffers synced.
Feb 13 22:37:03 z kernel: Uptime: 2h16m9s
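For what it's worth, that messages excerpt looks like an orderly shutdown (daemons receiving SIGTERM, disks syncing) rather than a panic, which would also explain the empty /var/crash. If a later run does panic, a rough way to make sure a dump is captured (a sketch; assumes the swap device is large enough to hold a kernel dump):

# Arm kernel crash dumps so a future panic leaves something in /var/crash.
sysrc dumpdev=AUTO         # persist the setting in /etc/rc.conf
service dumpon restart     # point crash dumps at the swap device now
# savecore/crashinfo then populate /var/crash on the boot after a panic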
Just for completeness, can you try all 3 MPI options? I get:

1. MPICH - all tests pass, silent reboot afterwards.
2. OpenMPI - tests seem to hang.
3. OpenMPI2 - all tests pass, no reboot - all ok.
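One way to cycle through the three backends non-interactively is roughly the following (a sketch; it assumes command-line OPTIONS_SET/OPTIONS_UNSET overrides are acceptable and that no saved options from "make config" get in the way; the option names are the ones shown in the test logs below):

cd /usr/ports/lang/opencoarrays

# MPICH (the default)
make -DBATCH OPTIONS_SET=MPICH OPTIONS_UNSET="OPENMPI OPENMPI2" clean
make -DBATCH OPTIONS_SET=MPICH OPTIONS_UNSET="OPENMPI OPENMPI2" test

# OpenMPI
make -DBATCH OPTIONS_SET=OPENMPI OPTIONS_UNSET="MPICH OPENMPI2" clean
make -DBATCH OPTIONS_SET=OPENMPI OPTIONS_UNSET="MPICH OPENMPI2" test

# OpenMPI2
make -DBATCH OPTIONS_SET=OPENMPI2 OPTIONS_UNSET="MPICH OPENMPI" clean
make -DBATCH OPTIONS_SET=OPENMPI2 OPTIONS_UNSET="MPICH OPENMPI" test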
(In reply to Anton Shterenlikht from comment #25) Sure. tl;dr: I couldn't reproduce your results, no matter what :S

I tried the 3 options both in a 12 amd64 jail (as root) and on my host system (12.0 amd64), also as root. The not so funny thing is how differently the tests behave. I definitely would not trust software with such volatile results.

Jailed
------

TEST - 1
Options:
  MPICH    : on
  OPENMPI  : off
  OPENMPI2 : off

[snip]
68/78 Test #68: issue-515-mimic-mpi-gatherv ............ Passed 0.02 sec
      Start 69: image_status_test_1
69/78 Test #69: image_status_test_1 .................... Passed 0.02 sec
      Start 70: image_fail_test_1
70/78 Test #70: image_fail_test_1 ...................... Passed 0.02 sec
      Start 71: image_fail_and_sync_test_1
71/78 Test #71: image_fail_and_sync_test_1 ............. Passed 0.02 sec
      Start 72: image_fail_and_sync_test_2
72/78 Test #72: image_fail_and_sync_test_2 ............. Passed 0.02 sec
      Start 73: image_fail_and_sync_test_3
73/78 Test #73: image_fail_and_sync_test_3 ............. Passed 0.02 sec
      Start 74: image_fail_and_status_test_1
74/78 Test #74: image_fail_and_status_test_1 ........... Passed 0.02 sec
      Start 75: image_fail_and_failed_images_test_1
75/78 Test #75: image_fail_and_failed_images_test_1 .... Passed 0.02 sec
      Start 76: image_fail_and_stopped_images_test_1
76/78 Test #76: image_fail_and_stopped_images_test_1 ... Passed 0.02 sec
      Start 77: image_fail_and_get_test_1
77/78 Test #77: image_fail_and_get_test_1 .............. Passed 0.02 sec
      Start 78: test-installation-scripts.sh
78/78 Test #78: test-installation-scripts.sh ........... Passed 0.18 sec

100% tests passed, 0 tests failed out of 78

Total Test time (real) = 6.36 sec

TEST - 2
Options:
  MPICH    : off
  OPENMPI  : on
  OPENMPI2 : off

[snip]
58/69 Test #58: static_event_post_issue_293 .......... Passed 0.40 sec
      Start 59: co_reduce-factorial
59/69 Test #59: co_reduce-factorial .................. Passed 0.61 sec
      Start 60: co_reduce-factorial-int8
60/69 Test #60: co_reduce-factorial-int8 ............. Passed 0.52 sec
      Start 61: co_reduce-factorial-int64
61/69 Test #61: co_reduce-factorial-int64 ............ Passed 0.61 sec
      Start 62: issue-493-coindex-slice
62/69 Test #62: issue-493-coindex-slice .............. Passed 1.05 sec
      Start 63: issue-488-multi-dim-cobounds-true
63/69 Test #63: issue-488-multi-dim-cobounds-true .... Passed 1.05 sec
      Start 64: issue-488-multi-dim-cobounds-false
64/69 Test #64: issue-488-multi-dim-cobounds-false ... Passed 1.05 sec
      Start 65: issue-503-multidim-array-broadcast
65/69 Test #65: issue-503-multidim-array-broadcast ... Passed 0.28 sec
      Start 66: issue-503-non-contig-red-ndarray
66/69 Test #66: issue-503-non-contig-red-ndarray ..... Passed 0.28 sec
      Start 67: issue-552-send_by_ref-singleton
67/69 Test #67: issue-552-send_by_ref-singleton ...... Passed 0.27 sec
      Start 68: issue-515-mimic-mpi-gatherv
68/69 Test #68: issue-515-mimic-mpi-gatherv .......... Passed 0.27 sec
      Start 69: test-installation-scripts.sh
69/69 Test #69: test-installation-scripts.sh ......... Passed 0.19 sec

100% tests passed, 0 tests failed out of 69

Total Test time (real) = 36.78 sec

TEST - 3
Options:
  MPICH    : off
  OPENMPI  : off
  OPENMPI2 : on

[snip]
58/69 Test #58: static_event_post_issue_293 .......... Passed 0.41 sec
      Start 59: co_reduce-factorial
59/69 Test #59: co_reduce-factorial .................. Passed 0.53 sec
      Start 60: co_reduce-factorial-int8
60/69 Test #60: co_reduce-factorial-int8 ............. Passed 0.53 sec
      Start 61: co_reduce-factorial-int64
61/69 Test #61: co_reduce-factorial-int64 ............ Passed 0.53 sec
      Start 62: issue-493-coindex-slice
62/69 Test #62: issue-493-coindex-slice .............. Passed 4.17 sec
      Start 63: issue-488-multi-dim-cobounds-true
63/69 Test #63: issue-488-multi-dim-cobounds-true .... Passed 5.73 sec
      Start 64: issue-488-multi-dim-cobounds-false
64/69 Test #64: issue-488-multi-dim-cobounds-false ... Passed 5.08 sec
      Start 65: issue-503-multidim-array-broadcast
65/69 Test #65: issue-503-multidim-array-broadcast ... Passed 0.29 sec
      Start 66: issue-503-non-contig-red-ndarray
66/69 Test #66: issue-503-non-contig-red-ndarray ..... Passed 0.29 sec
      Start 67: issue-552-send_by_ref-singleton
67/69 Test #67: issue-552-send_by_ref-singleton ...... Passed 0.28 sec
      Start 68: issue-515-mimic-mpi-gatherv
68/69 Test #68: issue-515-mimic-mpi-gatherv .......... Passed 0.28 sec
      Start 69: test-installation-scripts.sh
69/69 Test #69: test-installation-scripts.sh ......... Passed 0.19 sec

100% tests passed, 0 tests failed out of 69

Total Test time (real) = 71.52 sec

HOST
----

TEST - 1
Options:
  MPICH    : on
  OPENMPI  : off
  OPENMPI2 : off

[snip]
         71 - image_fail_and_sync_test_1 (Failed)
         72 - image_fail_and_sync_test_2 (Failed)
         73 - image_fail_and_sync_test_3 (Failed)
         74 - image_fail_and_status_test_1 (Failed)
         75 - image_fail_and_failed_images_test_1 (Failed)
         76 - image_fail_and_stopped_images_test_1 (Failed)
         77 - image_fail_and_get_test_1 (Failed)
         79 - shellcheck:caf (Failed)
Errors while running CTest
*** Error code 8

Stop.
make: stopped in /usr/ports/lang/opencoarrays
root@hammer:/usr/ports/lang/opencoarrays#

TEST - 2
Options:
  MPICH    : off
  OPENMPI  : on
  OPENMPI2 : off

[snip]
67/71 Test #67: issue-552-send_by_ref-singleton ...... Passed 0.36 sec
      Start 68: issue-515-mimic-mpi-gatherv
68/71 Test #68: issue-515-mimic-mpi-gatherv .......... Passed 0.33 sec
      Start 69: test-installation-scripts.sh
69/71 Test #69: test-installation-scripts.sh ......... Passed 0.23 sec
      Start 70: shellcheck:caf
70/71 Test #70: shellcheck:caf .......................***Failed 0.11 sec
      Start 71: shellcheck:cafrun
71/71 Test #71: shellcheck:cafrun .................... Passed 0.09 sec

99% tests passed, 1 tests failed out of 71

Total Test time (real) = 45.38 sec

The following tests FAILED:
         70 - shellcheck:caf (Failed)
Errors while running CTest
*** Error code 8

Stop.
make: stopped in /usr/ports/lang/opencoarrays
root@hammer:/usr/ports/lang/opencoarrays#

TEST - 3
Options:
  MPICH    : off
  OPENMPI  : off
  OPENMPI2 : on

[snip]
68/71 Test #68: issue-515-mimic-mpi-gatherv .......... Passed 0.38 sec
      Start 69: test-installation-scripts.sh
69/71 Test #69: test-installation-scripts.sh ......... Passed 0.22 sec
      Start 70: shellcheck:caf
70/71 Test #70: shellcheck:caf .......................***Failed 0.10 sec
      Start 71: shellcheck:cafrun
71/71 Test #71: shellcheck:cafrun .................... Passed 0.09 sec

99% tests passed, 1 tests failed out of 71

Total Test time (real) = 83.32 sec

The following tests FAILED:
         70 - shellcheck:caf (Failed)
Errors while running CTest
*** Error code 8

Stop.
make: stopped in /usr/ports/lang/opencoarrays
root@hammer:/usr/ports/lang/opencoarrays#
I'm glad I'm not the maintainer anymore... Seriously - I give up. Separately I asked Steve Kargl to run the tests - he cannot reproduce my behaviour either. So feel free to close this PR. Many thanks for your time!
OK. It seems we are not able to reproduce either the hangs or the reboots, despite trying different combinations of OS, both on the host and jailed. Closing this PR as requested by the submitter. Thanks to all involved.