Bug 278869 - net/mpich: Fatal error in internal_Init_thread
Status: Closed FIXED
Alias: None
Product: Ports & Packages
Classification: Unclassified
Component: Individual Port(s)
Version: Latest
Hardware: Any Any
Importance: --- Affects Only Me
Assignee: Thierry Thomas
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-05-09 12:11 UTC by Thierry Thomas
Modified: 2024-05-11 02:18 UTC

See Also:
bugzilla: maintainer-feedback? (laurent.chardon)


Attachments
0001-net-mpich-Fix-crash-when-GPU-not-found.patch (965 bytes, patch)
2024-05-09 14:29 UTC, Laurent Chardon
0002-net-mpich-IGNORE-L0-option.patch (1.07 KB, patch)
2024-05-10 10:50 UTC, Laurent Chardon

Description Thierry Thomas freebsd_committer freebsd_triage 2024-05-09 12:11:36 UTC
Trying to run a simple program built with MPICH-4.2.1 aborts with this message:

Abort(1007805711): Fatal error in internal_Init: Other MPI error, error stack:
internal_Init(48306).: MPI_Init(argc=0x820930afc, argv=0x820930af0) failed
MPII_Init_thread(189): 
MPII_init_gpu(31)....:  gpu_init failed

The same error also occurs when running mpivars:

$ mpivars
Abort(135390735): Fatal error in internal_Init_thread: Other MPI error, error stack:
internal_Init_thread(48399): MPI_Init_thread(argc=0x820e2739c, argv=0x820e27390, required=0, provided=0x820e2738c) failed
MPII_Init_thread(189)......: 
MPII_init_gpu(31)..........:  gpu_init failed
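
One quick way to tell this failure apart from other MPI_Init errors is to look for the `gpu_init failed` line in the error stack. The sketch below assumes the message format shown above; `is_gpu_init_failure` is a hypothetical helper name, not something provided by MPICH or the port.

```shell
#!/bin/sh
# Hypothetical helper (assumption, not part of MPICH or net/mpich):
# reads text on stdin and succeeds (exit status 0) if it contains the
# "gpu_init failed" line reported by MPII_init_gpu in the error stack.
is_gpu_init_failure() {
    grep -q 'MPII_init_gpu.*gpu_init failed'
}
```

It could be used as `mpivars 2>&1 | is_gpu_init_failure && echo "GPU initialization failed"`.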

Remark: I checked that there is no mix with the libraries provided by OpenMPI.

Note: the same error has been reported and fixed previously in PR 256244.
Comment 1 Laurent Chardon 2024-05-09 12:50:25 UTC
You can assign it to me
Comment 2 Laurent Chardon 2024-05-09 14:04:37 UTC
This seems to work as intended for me:

1) Compiled with default options

  (I don't have a GPU)

  $ mpivars > /dev/null
  Abort(403826191): Fatal error in internal_Init_thread: Other MPI error, error stack:
  internal_Init_thread(48399): MPI_Init_thread(argc=0x8202ea77c, argv=0x8202ea770, required=0, provided=0x8202ea76c) failed
  MPII_Init_thread(189)......:
  MPII_init_gpu(31)..........:  gpu_init failed

  $  MPIR_CVAR_ENABLE_GPU=0 mpivars > /dev/null
  $

  I get an error if I try to initialize a GPU that I don't have.

2) Compiled with default options except that I removed L0

  $ mpivars > /dev/null
  $
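
The workaround from case 1 can be wrapped in a small shell function. This is only a sketch: `run_without_gpu` is a hypothetical name, not something shipped by MPICH or the port, and the only setting it relies on is the `MPIR_CVAR_ENABLE_GPU=0` variable shown above.

```shell
#!/bin/sh
# Hypothetical helper (assumption, not part of MPICH or net/mpich):
# run the given command with MPICH's GPU support disabled.
# env(1) sets the variable for that command only, so the calling
# shell's environment is left unchanged.
run_without_gpu() {
    env MPIR_CVAR_ENABLE_GPU=0 "$@"
}
```

With this in place, `run_without_gpu mpivars > /dev/null` should return silently on a machine without a GPU, as in case 1.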

So the binaries behave as expected. The remaining question is whether the default package should crash if no GPU is found, and I'm sure we agree that the answer is no.

Then I would be inclined to undo the change brought by PR 256244 in 88e134883dd2 and remove L0 from the default options.

I'll submit this patch. Can confirm that mpich works for you as expected without the L0 option.
Comment 3 Laurent Chardon 2024-05-09 14:29:41 UTC
Created attachment 250548
0001-net-mpich-Fix-crash-when-GPU-not-found.patch

net/mpich: Fix crash when GPU not found

    Remove L0 option from default
    Reverts 88e134883dd2

    PR: 278869
Comment 4 Laurent Chardon 2024-05-09 16:37:46 UTC
(In reply to Laurent from comment #2)
I meant: can *you* confirm that it works for you without the L0 option?
Comment 5 Thierry Thomas freebsd_committer freebsd_triage 2024-05-09 16:44:30 UTC
I can confirm that MPICH is fixed when L0 is deselected, but this is just a workaround.

The port should work with or without this option.

Note: I have a GPU, but GPU support is disabled in the port (no CUDA).

Remark: I cannot assign a PR to you; this field is reserved for the committer.
Comment 6 commit-hook freebsd_committer freebsd_triage 2024-05-09 17:09:45 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/ports/commit/?id=817999872ea552463b9eff1de66cb3a17a33e1c7

commit 817999872ea552463b9eff1de66cb3a17a33e1c7
Author:     Laurent Chardon <laurent.chardon@gmail.com>
AuthorDate: 2024-05-09 17:06:52 +0000
Commit:     Thierry Thomas <thierry@FreeBSD.org>
CommitDate: 2024-05-09 17:09:10 +0000

    net/mpich: fix with default options

    MPICH is broken when L0 is selected: remove it from the default options.

    Remark: still broken with L0.

    PR:             278869

 net/mpich/Makefile | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
Comment 7 Jan Beich freebsd_committer freebsd_triage 2024-05-09 17:32:18 UTC
Before bug 275317 the port had various fixes to make L0=on work without GPU or with unsupported GPU. If L0=on still requires patches then upstream isn't dogfooding such a configuration, so it makes sense to disable L0 by default.

Besides, L0 cannot be tested anymore on drm-kmod >= 5.15 even on supported GPUs due to https://github.com/FreeBSDDesktop/kms-drm/issues/197. Mesa doesn't provide L0 drivers (yet?) and anything else is likely to have a worse portability story (see patches in lang/intel-compute-runtime).
Comment 8 Thierry Thomas freebsd_committer freebsd_triage 2024-05-10 07:40:26 UTC
Thanks Jan for the explanation!

In this case the port should be marked IGNORE when L0 is selected - at least for some time.
Comment 9 Laurent Chardon 2024-05-10 10:50:31 UTC
Created attachment 250565
0002-net-mpich-IGNORE-L0-option.patch

net/mpich: IGNORE L0 option

level-zero is currently broken

PR 278869
Comment 10 Laurent Chardon 2024-05-10 10:51:37 UTC
Patch submitted. 

Should level-zero be marked as BROKEN?
Comment 11 Thierry Thomas freebsd_committer freebsd_triage 2024-05-10 11:03:15 UTC
(In reply to Laurent Chardon from comment #10)
Your patch seems right: we mark a port BROKEN when it does not build; in this case IGNORE is OK.
Comment 12 Laurent Chardon 2024-05-10 11:32:11 UTC
(In reply to Thierry Thomas from comment #11)
Thanks for checking the patch.

My question about BROKEN was really about devel/level-zero.
If it can't run (according to https://github.com/FreeBSDDesktop/kms-drm/issues/197), then should it be marked as BROKEN, as per https://docs.freebsd.org/en/books/porters-handbook/book/#dads-noinstall-variables when it "has runtime issues on systems where it is supposed to run fine"?

I happily leave this decision in your collective hands as committers.
Comment 13 commit-hook freebsd_committer freebsd_triage 2024-05-10 13:46:05 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/ports/commit/?id=51909b06f14d400171e8c0b5ddcdf5b23f337a1a

commit 51909b06f14d400171e8c0b5ddcdf5b23f337a1a
Author:     Laurent Chardon <laurent.chardon@gmail.com>
AuthorDate: 2024-05-10 13:41:08 +0000
Commit:     Thierry Thomas <thierry@FreeBSD.org>
CommitDate: 2024-05-10 13:41:08 +0000

    net/mpich: mark IGNORE when the option L0 is selected

    It builds fine with L0, but does not run.

    PR:             278869

 net/mpich/Makefile | 5 +++++
 1 file changed, 5 insertions(+)
Comment 14 Thierry Thomas freebsd_committer freebsd_triage 2024-05-10 13:49:10 UTC
Committed, thanks!

Remark: no PORTREVISION bump is needed in this case, since L0 is no longer a default option.

I think that there is no value in marking it BROKEN, since it builds fine.
Comment 15 Jan Beich freebsd_committer freebsd_triage 2024-05-11 01:59:07 UTC
(In reply to Laurent Chardon from comment #12)
- devel/level-zero is the L0 loader. It likely still works fine but is useless without drivers. Besides, marking it as IGNORE without checking consumers may break other ports.
- lang/intel-compute-runtime provides the L0 (and OpenCL) driver. It currently fails on drm-515-kmod and drm-61-kmod but may still work on drm-510-kmod, which is used by default on FreeBSD 13.* (supported until 2026-04-30).
Comment 16 Jan Beich freebsd_committer freebsd_triage 2024-05-11 02:18:56 UTC
Besides L0, mpich also supports GPU offload via CUDA, which may be usable on FreeBSD via libc6-shim (nv-sglrun). At least some other CUDA apps are known to work. I don't have an NVIDIA GPU and don't use mpich, so I'm unlikely to work on it.