Bug 226722

Summary: lang/opencoarrays: "make test" as root causes silent reboot
Product: Ports & Packages
Reporter: Anton Shterenlikht <as>
Component: Individual Port(s)
Assignee: freebsd-ports-bugs (Nobody) <ports-bugs>
Status: Closed Unable to Reproduce
Severity: Affects Only Me
CC: fernape, tijl, w.schwarzenfeld
Priority: ---
Keywords: needs-qa
Version: Latest
Flags: koobs: merge-quarterly?
Hardware: Any
OS: Any
See Also: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=226724
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=230689
Attachments:
Description Flags
net/mpich/work/mpich-3.2/src/pm/hydra/config.log none

Description Anton Shterenlikht 2018-03-19 10:58:12 UTC
I'm the maintainer of lang/opencoarrays.
tijl@ couldn't reproduce this problem on 12-current.

I'm asking for somebody on 11.1-RELEASE to try to reproduce the problem.

1. Build net/mpich with gcc7
2. Build lang/opencoarrays with default options (MPICH) with gcc7.
3. "make test" under lang/opencoarrays.

On my box a silent reboot follows the successful completion of all tests.
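
The three steps above can be sketched as a small script. This is only an illustration: the `run` and `repro` helpers and the `REPRO_RUN` switch are hypothetical names, the choice of the `install` and `build` targets is an assumption, and the gcc7 selection is not shown. By default the script prints the commands instead of running them, since the last step is the one reported to bring the machine down.

```sh
#!/bin/sh
# Sketch of the three reproduction steps. Dry-run by default: each command
# is printed with a leading "+" instead of being executed.
# REPRO_RUN, run and repro are illustrative names, not part of the ports tree.
PORTSDIR="${PORTSDIR:-/usr/ports}"

run() {
    if [ "${REPRO_RUN:-0}" = 1 ]; then
        "$@"
    else
        echo "+ $*"
    fi
}

repro() {
    # Step 1: build and install net/mpich (gcc7 selection not shown here).
    run make -C "$PORTSDIR/net/mpich" install
    # Step 2: build lang/opencoarrays with its default (MPICH) option.
    run make -C "$PORTSDIR/lang/opencoarrays" build
    # Step 3: run the port's test target; as root this is the reported trigger.
    run make -C "$PORTSDIR/lang/opencoarrays" test
}

repro
```

Setting `REPRO_RUN=1` would execute the commands for real, so it is best tried on a machine that can tolerate an unexpected halt.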

My box is a 11.1-RELEASE-p8 amd64 laptop with:

FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs
FreeBSD/SMP: 1 package(s) x 2 core(s) x 2 hardware threads

Thanks

Anton
Comment 1 Anton Shterenlikht 2018-03-21 16:13:11 UTC
I can reproduce the panic on my home laptop, amd64 with:

FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
FreeBSD/SMP: 1 package(s) x 2 core(s)

also running 11.1-RELEASE-p8.

Actually "make test" results in a halt, not a reboot.
Comment 2 Walter Schwarzenfeld freebsd_triage 2018-03-21 16:20:48 UTC
There is a new update ports r465154 from today.
Comment 3 Walter Schwarzenfeld freebsd_triage 2018-03-21 16:23:34 UTC
Sorry, overlooked you are the maintainer.
Comment 4 Anton Shterenlikht 2018-03-23 11:05:14 UTC
I've done a bit more digging.
"make test" results in a halt only when run as root,
which is the default.

When I do "chown -R user:user work", where user is some
unprivileged user, then "make test" does not cause a hang.
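
A rough sketch of that workaround follows. The helper name `test_as_user_cmds` is hypothetical, and using su(1) to switch to the unprivileged user is an assumption, since the comment does not say how the user was switched. The helper only prints the two commands so they can be reviewed before being run by hand.

```sh
#!/bin/sh
# Print (not run) the commands for testing as an unprivileged user.
# The helper name and the use of su(1) are assumptions for illustration.
test_as_user_cmds() {
    user="$1"
    portdir="${2:-/usr/ports/lang/opencoarrays}"
    # Hand the build tree to the unprivileged user, as described above.
    echo "chown -R $user:$user $portdir/work"
    # Run the test target as that user rather than as root.
    echo "su -m $user -c 'make -C $portdir test'"
}

test_as_user_cmds user
```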

The ideal solution is not to run the tests as root anyway.
In fact, if lang/opencoarrays is built with OpenMPI or OpenMPI2,
then the tests will not run as root:

Script started on Thu Mar 22 11:41:05 2018
Command: ctest -VV
UpdateCTestConfiguration  from :/usr/ports/lang/opencoarrays/work/.build/DartConfiguration.tcl
UpdateCTestConfiguration  from :/usr/ports/lang/opencoarrays/work/.build/DartConfiguration.tcl
Test project /usr/ports/lang/opencoarrays/work/.build
Constructing a list of tests
Done constructing a list of tests
Updating test list for fixtures
Added 0 tests to meet fixture requirements
Checking test dependency graph...
Checking test dependency graph end
test 1
      Start  1: initialize_mpi

1: Test command: /usr/local/bin/bash "/usr/ports/lang/opencoarrays/work/.build/bin/cafrun" "-np" "2" "--hostfile" "/usr/ports/lang/opencoarrays/work/.build/hostfile" "/usr/ports/lang/opencoarrays/work/.build/bin/OpenCoarrays-2.0.0-tests/initialize_mpi"
1: Test timeout computed to be: 9.99988e+06
1: --------------------------------------------------------------------------
1: mpiexec has detected an attempt to run as root.
1: Running at root is *strongly* discouraged as any mistake (e.g., in
1: defining TMPDIR) or bug can result in catastrophic damage to the OS
1: file system, leaving your system in an unusable state.
1:
1: You can override this protection by adding the --allow-run-as-root
1: option to your cmd line. However, we reiterate our strong advice
1: against doing so - please do so at your own risk.
1: --------------------------------------------------------------------------
1: Error: Command:
1:    `/usr/local/mpi/openmpi2/bin/mpiexec -n 2 --hostfile /usr/ports/lang/opencoarrays/work/.build/hostfile /usr/ports/lang/opencoarrays/work/.build/bin/OpenCoarrays-2.0.0-tests/initialize_mpi`
1: failed to run.
 1/63 Test  #1: initialize_mpi .......................***Failed  Required regular expression not found.Regex=[Test passed.


Yes, adding --allow-run-as-root works, and
the tests can be run as root, but it's a bad
idea, as the halt shows.

So the question is:

Is anybody in FreeBSD interested in investigating why
the hang happens when the tests are run as root?

If so, then this PR should be fixed.

If not, meaning that running make test under root is
asking for trouble, and a halt in an MPI program
run as root is not that surprising, then
this PR can be closed, as there is no problem.

I'll then open another PR asking for help to
change the port to *not* run the tests as root.
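
One possible shape for such a guard, sketched in plain sh. The `check_not_root` helper is a hypothetical name and is not taken from the port or from any MPI implementation. It only mimics the idea behind the mpiexec refusal quoted above: stop before the tests start when the effective uid is 0.

```sh
#!/bin/sh
# Return non-zero when the given uid (default: the current one) is root.
# check_not_root is a hypothetical helper, not part of the port.
check_not_root() {
    uid="${1:-$(id -u)}"
    if [ "$uid" -eq 0 ]; then
        echo "refusing to run the tests as root" >&2
        return 1
    fi
    return 0
}

# Usage: only start the tests when not root.
if check_not_root; then
    echo "ok to run: make test"
fi
```

Taking the uid as an optional argument keeps the check easy to exercise without actually switching users.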
Comment 5 Tijl Coosemans freebsd_committer freebsd_triage 2018-04-12 21:49:09 UTC
Can you attach net/mpich/work/mpich-3.2/src/pm/hydra/config.log?  There was a problem with devel/hwloc on FreeBSD 11.1 that has been fixed.  MPICH is supposed to use libhwloc from that port but it also contains an embedded copy.  Maybe for some reason it's using that in your case.
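
A first look at that log could be as simple as listing every line that mentions hwloc. This is only a sketch: `hwloc_lines` is a hypothetical helper, and the thread does not say which configure output would indicate the embedded copy, so the matches still have to be read by hand.

```sh
#!/bin/sh
# List every line of a configure log that mentions hwloc, with line numbers.
# Which of those lines points at the embedded copy is not stated in this
# thread, so the output is only a starting point for manual inspection.
hwloc_lines() {
    grep -n -i 'hwloc' "$1"
}

# Default path is the one requested above, relative to the ports directory.
log="${1:-net/mpich/work/mpich-3.2/src/pm/hydra/config.log}"
if [ -r "$log" ]; then
    hwloc_lines "$log" || echo "no hwloc lines in $log"
fi
```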
Comment 6 Anton Shterenlikht 2018-04-13 07:55:03 UTC
Created attachment 192480 [details]
net/mpich/work/mpich-3.2/src/pm/hydra/config.log

Note that I'm building with:

# cat /usr/ports/net/mpich/Makefile.local 
USE_GCC=7
DEFAULT_VERSIONS=gcc=7
#
Comment 7 Tijl Coosemans freebsd_committer freebsd_triage 2018-04-13 09:37:23 UTC
(In reply to Anton Shterenlikht from comment #6)
Hmm, looks normal.  You have the latest hwloc installed right (1.11.7_2)?
Comment 8 Anton Shterenlikht 2018-04-13 10:33:46 UTC
20474701626e> pkg info -xo hwloc
hwloc-1.11.7_2                 devel/hwloc

But there was a massive ports update a few days ago.
Let me see if I can still reproduce the hang.
Comment 9 Anton Shterenlikht 2018-04-13 10:39:32 UTC
yes, still hangs
Comment 10 Fernando Apesteguía freebsd_committer freebsd_triage 2018-06-16 16:46:13 UTC
(In reply to Anton Shterenlikht from comment #9)

There are several open issues compiling opencoarrays with gcc7. Is it even supported upstream?
Comment 11 Anton Shterenlikht 2018-06-22 15:37:25 UTC
yes, of course.
The upstream's problem is that they test only on GCC,
hence our errors with clang.
I spoke to one of opencoarrays developers last week
about our problems. He promised to pay more attention
to non-linux non-GCC world.

Anyway, I think the issue is not limited to opencoarrays.
It seems running tests as root is asking for trouble,
so I'm keen to not allow this for my port.

But then I'm not sure this fits well with package building, etc.
Comment 12 Anton Shterenlikht 2018-08-17 10:11:26 UTC
I'm now on 11.2-RELEASE-p2.
Running "make test" as root still gives a hang.
Running as an unprivileged user is fine.
The latest port PR is now:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=230689
Comment 13 Fernando Apesteguía freebsd_committer freebsd_triage 2018-08-17 16:24:34 UTC
(In reply to Anton Shterenlikht from comment #12)
It doesn't hang for me in 11.1 or 11.2.
Comment 14 Anton Shterenlikht 2018-08-19 09:27:14 UTC
Something must be wrong on my end then.
I reproduced the hang again this morning.
Comment 15 Walter Schwarzenfeld freebsd_triage 2019-02-11 19:55:03 UTC
11.1 is EOL. GCC_DEFAULT=8. Is this still relevant?
Comment 16 Anton Shterenlikht 2019-02-11 20:10:47 UTC
Actually, I want to drop maintainership of this port.
I recently changed jobs and really have no time
to maintain the port anymore.
Sorry...

As to whether it's relevant, I guess somebody has to see if the problem exists
in 12.0-RELEASE
Comment 17 Anton Shterenlikht 2019-02-12 09:08:05 UTC
Yes, still there
I just reproduced it on a 12.0-RELEASE-p3 laptop.
I think it is very dangerous, and would be great
to sort out. I haven't got the skills.
Comment 18 Fernando Apesteguía freebsd_committer freebsd_triage 2019-02-12 22:01:36 UTC
I'm also there:

uname -a
FreeBSD hammer 12.0-RELEASE-p3 FreeBSD 12.0-RELEASE-p3 GENERIC  amd64

But I still cannot reproduce the issue. make test runs fine as a regular user or as root via sudo. See:

https://www.dropbox.com/s/4syrn53u9zxuf7d/typescript?dl=0

About the other issue. Do you want me to reset your maintainership of the port?
Comment 19 Walter Schwarzenfeld freebsd_triage 2019-02-12 22:12:20 UTC
Is done per ports r492769.
Comment 20 Fernando Apesteguía freebsd_committer freebsd_triage 2019-02-13 18:14:19 UTC
The port has been updated to 2.3.1 in ports r235686.

Build testing to check if I can reproduce with the new version.
Comment 21 Anton Shterenlikht 2019-02-13 20:23:48 UTC
Have you tried using real root, not sudo?
Comment 22 Anton Shterenlikht 2019-02-13 20:33:49 UTC
yes, ordinary user is fine. The problem is with root only
Comment 23 Fernando Apesteguía freebsd_committer freebsd_triage 2019-02-13 22:28:07 UTC
(In reply to Anton Shterenlikht from comment #21)

Yes, poudriere testport gives you a "real" root logged into the jail. The last lines from make test:

...
69/78 Test #69: image_status_test_1 ....................   Passed    0.08 sec
      Start 70: image_fail_test_1
70/78 Test #70: image_fail_test_1 ......................   Passed    0.03 sec
      Start 71: image_fail_and_sync_test_1
71/78 Test #71: image_fail_and_sync_test_1 .............   Passed    0.03 sec
      Start 72: image_fail_and_sync_test_2
72/78 Test #72: image_fail_and_sync_test_2 .............   Passed    0.03 sec
      Start 73: image_fail_and_sync_test_3
73/78 Test #73: image_fail_and_sync_test_3 .............   Passed    0.03 sec
      Start 74: image_fail_and_status_test_1
74/78 Test #74: image_fail_and_status_test_1 ...........   Passed    0.03 sec
      Start 75: image_fail_and_failed_images_test_1
75/78 Test #75: image_fail_and_failed_images_test_1 ....   Passed    0.03 sec
      Start 76: image_fail_and_stopped_images_test_1
76/78 Test #76: image_fail_and_stopped_images_test_1 ...   Passed    0.03 sec
      Start 77: image_fail_and_get_test_1
77/78 Test #77: image_fail_and_get_test_1 ..............   Passed    0.03 sec
      Start 78: test-installation-scripts.sh
78/78 Test #78: test-installation-scripts.sh ...........   Passed    0.20 sec

100% tests passed, 0 tests failed out of 78

Total Test time (real) =   8.23 sec
root@12_0amd64-default:/usr/ports/lang/opencoarrays #
Comment 24 Anton Shterenlikht 2019-02-13 22:46:14 UTC
what can I say...
I just reproduced it again.

I had it with several versions of the port now,
on dell 4-core laptop under 11.(something)-release,
and now on lenovo x240 12.0-release.

% /usr/bin/su -
# cd /usr/ports/lang/opencoarrays
# make test

All tests pass, and about 1 sec later - reboot.

I really now doubt this is me (see above - multiple
port versions, 2 hardware specs, 2 major OS revisions).

But I don't know what else I can do to help.

If nobody else can reproduce it...

But my boxes are completely standard install,
nothing strange or unusual.

But nothing in /var/crash, and only this in /var/log/messages:

...
Feb 13 21:09:10 z wpa_supplicant[681]: wlan0: WPA: Group rekeying completed with c0:05:c2:6e:e8:19 [GTK=CCMP]
Feb 13 22:09:10 z syslogd: last message repeated 1 times
Feb 13 22:35:47 z su[8340]: as to root on /dev/pts/1
Feb 13 22:36:01 z kernel: , 1211.
Feb 13 22:36:01 z kernel: .
Feb 13 22:36:01 z ntpd[1135]: ntpd exiting on signal 15 (Terminated)
Feb 13 22:36:02 z kernel: , 1135.
Feb 13 22:36:02 z syslogd: exiting on signal 15
Feb 13 22:37:03 z syslogd: kernel boot file is /boot/kernel/kernel
Feb 13 22:37:03 z kernel: wlan0: link state changed to DOWN
Feb 13 22:37:03 z kernel: Waiting (max 60 seconds) for system process `vnlru' to stop... done
Feb 13 22:37:03 z kernel: Waiting (max 60 seconds) for system process `syncer' to stop...
Feb 13 22:37:03 z kernel: Syncing disks, vnodes remaining... 1 0 done
Feb 13 22:37:03 z kernel: Waiting (max 60 seconds) for system thread `bufdaemon' to stop... done
Feb 13 22:37:03 z kernel: Waiting (max 60 seconds) for system thread `bufspacedaemon-0' to stop... done
Feb 13 22:37:03 z kernel: Waiting (max 60 seconds) for system thread `bufspacedaemon-1' to stop... done
Feb 13 22:37:03 z kernel: Waiting (max 60 seconds) for system thread `bufspacedaemon-2' to stop... done
Feb 13 22:37:03 z kernel: Waiting (max 60 seconds) for system thread `bufspacedaemon-3' to stop... done
Feb 13 22:37:03 z kernel: All buffers synced.
Feb 13 22:37:03 z kernel: Uptime: 2h16m9s
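
The excerpt above shows ntpd and syslogd exiting on signal 15 shortly before the boot messages. A small sketch for pulling such lines out of a log, so their timestamps can be compared with the end of the test run, might look like this. `signal_exits` is a hypothetical helper, and the default log path is the one named in this comment.

```sh
#!/bin/sh
# Print syslog lines that record a daemon exiting on a signal.
# signal_exits is an illustrative helper name.
signal_exits() {
    grep 'exiting on signal' "$1"
}

log="${1:-/var/log/messages}"
if [ -r "$log" ]; then
    signal_exits "$log" || echo "no 'exiting on signal' lines in $log"
fi
```

This only locates the entries; it does not say which process sent the signal.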
Comment 25 Anton Shterenlikht 2019-02-13 23:34:12 UTC
Just for completeness, can you try all 3 MPI options.
I get:

1. MPICH - all tests pass, silent reboot afterwards.
2. OpenMPI - tests seem to hang.
3. OpenMPI2 - all tests pass, no reboot - all ok.
Comment 26 Fernando Apesteguía freebsd_committer freebsd_triage 2019-02-15 17:23:41 UTC
(In reply to Anton Shterenlikht from comment #25)
Sure.

tl;dr I couldn't reproduce your results, no matter what :S

I tried the 3 options both in a 12 amd64 jail (as root) and on my host system (12.0 amd64), also as root. The not so funny thing is how differently the tests behave. I definitely would not trust software with such volatile results.

Jailed 
------

TEST - 1
Options        :
        MPICH          : on
        OPENMPI        : off
        OPENMPI2       : off
[snip]
68/78 Test #68: issue-515-mimic-mpi-gatherv ............   Passed    0.02 sec
      Start 69: image_status_test_1
69/78 Test #69: image_status_test_1 ....................   Passed    0.02 sec
      Start 70: image_fail_test_1
70/78 Test #70: image_fail_test_1 ......................   Passed    0.02 sec
      Start 71: image_fail_and_sync_test_1
71/78 Test #71: image_fail_and_sync_test_1 .............   Passed    0.02 sec
      Start 72: image_fail_and_sync_test_2
72/78 Test #72: image_fail_and_sync_test_2 .............   Passed    0.02 sec
      Start 73: image_fail_and_sync_test_3
73/78 Test #73: image_fail_and_sync_test_3 .............   Passed    0.02 sec
      Start 74: image_fail_and_status_test_1
74/78 Test #74: image_fail_and_status_test_1 ...........   Passed    0.02 sec
      Start 75: image_fail_and_failed_images_test_1
75/78 Test #75: image_fail_and_failed_images_test_1 ....   Passed    0.02 sec
      Start 76: image_fail_and_stopped_images_test_1
76/78 Test #76: image_fail_and_stopped_images_test_1 ...   Passed    0.02 sec
      Start 77: image_fail_and_get_test_1
77/78 Test #77: image_fail_and_get_test_1 ..............   Passed    0.02 sec
      Start 78: test-installation-scripts.sh
78/78 Test #78: test-installation-scripts.sh ...........   Passed    0.18 sec

100% tests passed, 0 tests failed out of 78

Total Test time (real) =   6.36 sec

TEST - 2
Options        :
        MPICH          : off
        OPENMPI        : on
        OPENMPI2       : off
[snip]
58/69 Test #58: static_event_post_issue_293 ..........   Passed    0.40 sec
      Start 59: co_reduce-factorial
59/69 Test #59: co_reduce-factorial ..................   Passed    0.61 sec
      Start 60: co_reduce-factorial-int8
60/69 Test #60: co_reduce-factorial-int8 .............   Passed    0.52 sec
      Start 61: co_reduce-factorial-int64
61/69 Test #61: co_reduce-factorial-int64 ............   Passed    0.61 sec
      Start 62: issue-493-coindex-slice
62/69 Test #62: issue-493-coindex-slice ..............   Passed    1.05 sec
      Start 63: issue-488-multi-dim-cobounds-true
63/69 Test #63: issue-488-multi-dim-cobounds-true ....   Passed    1.05 sec
      Start 64: issue-488-multi-dim-cobounds-false
64/69 Test #64: issue-488-multi-dim-cobounds-false ...   Passed    1.05 sec
      Start 65: issue-503-multidim-array-broadcast
65/69 Test #65: issue-503-multidim-array-broadcast ...   Passed    0.28 sec
      Start 66: issue-503-non-contig-red-ndarray
66/69 Test #66: issue-503-non-contig-red-ndarray .....   Passed    0.28 sec
      Start 67: issue-552-send_by_ref-singleton
67/69 Test #67: issue-552-send_by_ref-singleton ......   Passed    0.27 sec
      Start 68: issue-515-mimic-mpi-gatherv
68/69 Test #68: issue-515-mimic-mpi-gatherv ..........   Passed    0.27 sec
      Start 69: test-installation-scripts.sh
69/69 Test #69: test-installation-scripts.sh .........   Passed    0.19 sec

100% tests passed, 0 tests failed out of 69

Total Test time (real) =  36.78 sec

TEST - 3
Options        :
        MPICH          : off
        OPENMPI        : off
        OPENMPI2       : on

[snip]
58/69 Test #58: static_event_post_issue_293 ..........   Passed    0.41 sec
      Start 59: co_reduce-factorial
59/69 Test #59: co_reduce-factorial ..................   Passed    0.53 sec
      Start 60: co_reduce-factorial-int8
60/69 Test #60: co_reduce-factorial-int8 .............   Passed    0.53 sec
      Start 61: co_reduce-factorial-int64
61/69 Test #61: co_reduce-factorial-int64 ............   Passed    0.53 sec
      Start 62: issue-493-coindex-slice
62/69 Test #62: issue-493-coindex-slice ..............   Passed    4.17 sec
      Start 63: issue-488-multi-dim-cobounds-true
63/69 Test #63: issue-488-multi-dim-cobounds-true ....   Passed    5.73 sec
      Start 64: issue-488-multi-dim-cobounds-false
64/69 Test #64: issue-488-multi-dim-cobounds-false ...   Passed    5.08 sec
      Start 65: issue-503-multidim-array-broadcast
65/69 Test #65: issue-503-multidim-array-broadcast ...   Passed    0.29 sec
      Start 66: issue-503-non-contig-red-ndarray
66/69 Test #66: issue-503-non-contig-red-ndarray .....   Passed    0.29 sec
      Start 67: issue-552-send_by_ref-singleton
67/69 Test #67: issue-552-send_by_ref-singleton ......   Passed    0.28 sec
      Start 68: issue-515-mimic-mpi-gatherv
68/69 Test #68: issue-515-mimic-mpi-gatherv ..........   Passed    0.28 sec
      Start 69: test-installation-scripts.sh
69/69 Test #69: test-installation-scripts.sh .........   Passed    0.19 sec

100% tests passed, 0 tests failed out of 69

Total Test time (real) =  71.52 sec

HOST
----

TEST - 1
Options        :
        MPICH          : on
        OPENMPI        : off
        OPENMPI2       : off

[snip]
         71 - image_fail_and_sync_test_1 (Failed)
         72 - image_fail_and_sync_test_2 (Failed)
         73 - image_fail_and_sync_test_3 (Failed)
         74 - image_fail_and_status_test_1 (Failed)
         75 - image_fail_and_failed_images_test_1 (Failed)
         76 - image_fail_and_stopped_images_test_1 (Failed)
         77 - image_fail_and_get_test_1 (Failed)
         79 - shellcheck:caf (Failed)
Errors while running CTest
*** Error code 8

Stop.
make: stopped in /usr/ports/lang/opencoarrays
root@hammer:/usr/ports/lang/opencoarrays#

TEST - 2
Options        :
        MPICH          : off
        OPENMPI        : on
        OPENMPI2       : off

[snip]
67/71 Test #67: issue-552-send_by_ref-singleton ......   Passed    0.36 sec
      Start 68: issue-515-mimic-mpi-gatherv
68/71 Test #68: issue-515-mimic-mpi-gatherv ..........   Passed    0.33 sec
      Start 69: test-installation-scripts.sh
69/71 Test #69: test-installation-scripts.sh .........   Passed    0.23 sec
      Start 70: shellcheck:caf
70/71 Test #70: shellcheck:caf .......................***Failed    0.11 sec
      Start 71: shellcheck:cafrun
71/71 Test #71: shellcheck:cafrun ....................   Passed    0.09 sec

99% tests passed, 1 tests failed out of 71

Total Test time (real) =  45.38 sec

The following tests FAILED:
         70 - shellcheck:caf (Failed)
Errors while running CTest
*** Error code 8

Stop.
make: stopped in /usr/ports/lang/opencoarrays
root@hammer:/usr/ports/lang/opencoarrays#

TEST - 3
Options        :
        MPICH          : off
        OPENMPI        : off
        OPENMPI2       : on

[snip]
68/71 Test #68: issue-515-mimic-mpi-gatherv ..........   Passed    0.38 sec
      Start 69: test-installation-scripts.sh
69/71 Test #69: test-installation-scripts.sh .........   Passed    0.22 sec
      Start 70: shellcheck:caf
70/71 Test #70: shellcheck:caf .......................***Failed    0.10 sec
      Start 71: shellcheck:cafrun
71/71 Test #71: shellcheck:cafrun ....................   Passed    0.09 sec

99% tests passed, 1 tests failed out of 71

Total Test time (real) =  83.32 sec

The following tests FAILED:
         70 - shellcheck:caf (Failed)
Errors while running CTest
*** Error code 8

Stop.
make: stopped in /usr/ports/lang/opencoarrays
root@hammer:/usr/ports/lang/opencoarrays#
Comment 27 Anton Shterenlikht 2019-02-15 17:54:34 UTC
I'm glad I'm not the maintainer anymore...

Seriously - I give up.
Separately I asked Steve Kargl to run the tests - he cannot reproduce my behaviour either. So feel free to close this PR.

Many thanks for your time!
Comment 28 Fernando Apesteguía freebsd_committer freebsd_triage 2019-02-17 18:02:29 UTC
OK,

It seems we are unable to reproduce either the hangs or the reboots, despite trying different OS combinations both on the host and in a jail.

Closing this PR as requested by submitter.

Thanks to all involved.