Trying to run a simple program built with MPICH-4.2.1 aborts with this message:
    Abort(1007805711): Fatal error in internal_Init: Other MPI error, error stack:
    internal_Init(48306).: MPI_Init(argc=0x820930afc, argv=0x820930af0) failed
    MPII_Init_thread(189):
    MPII_init_gpu(31)....: gpu_init failed
The same error also occurs when running mpivars:
    $ mpivars
    Abort(135390735): Fatal error in internal_Init_thread: Other MPI error, error stack:
    internal_Init_thread(48399): MPI_Init_thread(argc=0x820e2739c, argv=0x820e27390, required=0, provided=0x820e2738c) failed
    MPII_Init_thread(189)......:
    MPII_init_gpu(31)..........: gpu_init failed
Remark: I checked that there is no mix with the libraries provided by OpenMPI.
Note: the same error has been reported and fixed previously in PR 256244.
You can assign it to me
This seems to work as intended for me:
1) Compiled with default options (I don't have a GPU)
    $ mpivars > /dev/null
    Abort(403826191): Fatal error in internal_Init_thread: Other MPI error, error stack:
    internal_Init_thread(48399): MPI_Init_thread(argc=0x8202ea77c, argv=0x8202ea770, required=0, provided=0x8202ea76c) failed
    MPII_Init_thread(189)......:
    MPII_init_gpu(31)..........: gpu_init failed
    $ MPIR_CVAR_ENABLE_GPU=0 mpivars > /dev/null
    $
I get an error if I try to initialize a GPU that I don't have.
2) Compiled with default options except that I removed L0
    $ mpivars > /dev/null
    $
So the binaries behave as expected. The question that remains is: should the default package crash if no GPU is found, and I'm sure that we agree that the answer is no. Then I would be inclined to undo the change brought by PR 256244 in 88e134883dd2 and remove L0 from the default options. I'll submit this patch. Can confirm that mpich works for you as expected without the L0 option.
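The MPIR_CVAR_ENABLE_GPU=0 workaround shown in this comment can be wrapped in a small shell function for anyone stuck with a build that has L0 enabled. This is only a sketch: the function name run_without_gpu is hypothetical and is not part of MPICH or the port; only the variable name and value come from the comment above.

```shell
#!/bin/sh
# Hypothetical helper (not provided by MPICH or net/mpich):
# run any command with MPICH's GPU initialization disabled, using the
# MPIR_CVAR_ENABLE_GPU=0 setting quoted in the comment above.
run_without_gpu() {
    # env(1) sets the variable only for the child process, so the
    # calling shell's environment is left untouched.
    env MPIR_CVAR_ENABLE_GPU=0 "$@"
}
```

For example, `run_without_gpu mpivars > /dev/null` corresponds to the second command in case 1 above. The exit status of the wrapped command is passed through unchanged, so it can be used in scripts the same way as the original command.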
Created attachment 250548
0001-net-mpich-Fix-crash-when-GPU-not-found.patch
    net/mpich: Fix crash when GPU not found
    Remove L0 option from default
    Reverts 88e134883dd2
    PR: 278869
(In reply to Laurent from comment #2) I meant: can *you* confirm that it works for you without the L0 option?
I can confirm that MPICH is fixed when L0 is deselected, but this is just a workaround. The port should work with or without this option. Note: I have a GPU, but GPU support is disabled in the port (no CUDA). Remark: I cannot assign a PR to you; this field is reserved for the committer.
A commit in branch main references this bug:
URL: https://cgit.FreeBSD.org/ports/commit/?id=817999872ea552463b9eff1de66cb3a17a33e1c7
commit 817999872ea552463b9eff1de66cb3a17a33e1c7
Author:     Laurent Chardon <laurent.chardon@gmail.com>
AuthorDate: 2024-05-09 17:06:52 +0000
Commit:     Thierry Thomas <thierry@FreeBSD.org>
CommitDate: 2024-05-09 17:09:10 +0000
    net/mpich: fix with default options
    MPICH is broken when L0 is selected: remove it from the default options.
    Remark: still broken with L0.
    PR: 278869
 net/mpich/Makefile | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
Before bug 275317 the port had various fixes to make L0=on work without a GPU or with an unsupported GPU. If L0=on still requires patches then upstream isn't dogfooding such a configuration, so it makes sense to disable L0 by default. Besides, L0 cannot be tested anymore on drm-kmod >= 5.15 even on supported GPUs due to https://github.com/FreeBSDDesktop/kms-drm/issues/197 . Mesa doesn't provide L0 drivers (yet?) and anything else is likely to have a worse portability story (see patches in lang/intel-compute-runtime).
Thanks Jan for the explanation! In this case the port should be marked IGNORE when L0 is selected - at least for some time.
Created attachment 250565
0002-net-mpich-IGNORE-L0-option.patch
    net/mpich: IGNORE L0 option
    level-zero is currently broken
    PR 278869
Patch submitted. Should level-zero be marked as BROKEN?
(In reply to Laurent Chardon from comment #10) Your patch seems right: we mark a port BROKEN when it does not build; in this case IGNORE is OK.
(In reply to Thierry Thomas from comment #11) Thanks for checking the patch. My question about BROKEN was really about devel/level-zero. If it can't run (according to https://github.com/FreeBSDDesktop/kms-drm/issues/197), then should it be marked as BROKEN, as per https://docs.freebsd.org/en/books/porters-handbook/book/#dads-noinstall-variables when it "has runtime issues on systems where it is supposed to run fine"? I happily leave this decision in your collective hands as committers.
A commit in branch main references this bug:
URL: https://cgit.FreeBSD.org/ports/commit/?id=51909b06f14d400171e8c0b5ddcdf5b23f337a1a
commit 51909b06f14d400171e8c0b5ddcdf5b23f337a1a
Author:     Laurent Chardon <laurent.chardon@gmail.com>
AuthorDate: 2024-05-10 13:41:08 +0000
Commit:     Thierry Thomas <thierry@FreeBSD.org>
CommitDate: 2024-05-10 13:41:08 +0000
    net/mpich: mark IGNORE when the option L0 is selected
    It builds fine with L0, but does not run.
    PR: 278869
 net/mpich/Makefile | 5 +++++
 1 file changed, 5 insertions(+)
Committed, thanks! Remark: no PORTREVISION bump is needed in this case, since L0 is no longer a default option. I think that there is no value in marking it BROKEN, since it builds fine.
(In reply to Laurent Chardon from comment #12)
- devel/level-zero is the L0 loader. It likely still works fine but is useless without drivers. Besides, marking it as IGNORE without checking consumers may break other ports.
- lang/intel-compute-runtime provides the L0 (and OpenCL) driver. It currently fails on drm-515-kmod and drm-61-kmod but may still work on drm-510-kmod, which is used by default on FreeBSD 13.* (supported until 2026-04-30).
Besides L0, mpich also supports GPU offload via CUDA, which may be possible to use on FreeBSD via libc6-shim (nv-sglrun). At least some other CUDA apps are known to work. I don't have an NVIDIA GPU and don't use mpich, so I am unlikely to work on it.