Bug 256244 - net/mpich: enable L0 for GPU support by default
Status: Closed FIXED
Alias: None
Product: Ports & Packages
Classification: Unclassified
Component: Individual Port(s)
Version: Latest
Hardware: Any Any
Importance: --- Affects Only Me
Assignee: Jan Beich
URL:
Keywords: needs-qa, patch
Depends on: 252536
Blocks:
Reported: 2021-05-29 17:40 UTC by Jan Beich
Modified: 2021-09-06 22:26 UTC

See Also:


Attachments
v1 (apply via "git am") (966 bytes, patch)
2021-05-29 17:40 UTC, Jan Beich

Description Jan Beich freebsd_committer 2021-05-29 17:40:21 UTC
Created attachment 225361
v1 (apply via "git am")

After ports commit 697c7df81364, mpich can offload some work to the GPU. This can be disabled at runtime by setting MPIR_CVAR_ENABLE_GPU=0 in environ(7). Currently, only lang/intel-compute-runtime provides an L0 (Level Zero) driver.

Can someone check for regressions?
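
A regression check on a GPU-less machine could be sketched roughly as follows. This is only an illustration, not part of the attached patch: smoke_test is a hypothetical helper name, and ./hello stands for any MPI program you already have. It runs the same command once with the default environment and once with MPIR_CVAR_ENABLE_GPU=0, then prints both exit statuses.

```shell
#!/bin/sh
# Hypothetical helper: run a command twice and report the exit status
# of each run, first with the default environment, then with MPICH's
# GPU support disabled via environ(7).
smoke_test() {
    "$@" >/dev/null 2>&1
    echo "default: exit status $?"
    env MPIR_CVAR_ENABLE_GPU=0 "$@" >/dev/null 2>&1
    echo "MPIR_CVAR_ENABLE_GPU=0: exit status $?"
}

# Example use (./hello is a placeholder for your own MPI program):
# smoke_test mpiexec -n 2 ./hello
```

A non-zero status in either run on a machine that worked before the change would be worth reporting here.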
Comment 1 Thierry Thomas freebsd_committer 2021-05-30 14:43:29 UTC
Thanks for the notification, but I have no GPU-capable device at the moment.
Comment 2 Jan Beich freebsd_committer 2021-05-30 16:02:08 UTC
(In reply to Thierry Thomas from comment #1)
That's why I'm asking to "check for regressions". Enabling L0 shouldn't break GPU-less or GPU-incompatible setups.
Comment 3 Thierry Thomas freebsd_committer 2021-05-30 17:47:27 UTC
(In reply to Jan Beich from comment #2)
OK, everything seems fine on my workstation (without CUDA) and under my workload.

BTW, unrelated to your change, stage-qa displays this error:

Error: /usr/local/bin/hydra_nameserver is linked to /usr/local/lib/libtorque.so.2 from sysutils/torque but it is not declared as a dependency
Warning: you need LIB_DEPENDS+=libtorque.so:sysutils/torque

Maybe Torque should be defined as an option, and disabled when not requested?
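
To confirm the stage-qa finding by hand on an installed system, something like the sketch below might help. The function names mentions_lib and check_binary are made up for illustration. It assumes ldd(1) is available and simply looks for the library name in its output.

```shell
#!/bin/sh
# Hypothetical helper: read ldd(1) output on stdin and succeed if any
# line mentions the given library name (matched as a fixed string).
mentions_lib() {
    grep -qF -- "$1"
}

# Hypothetical helper: report whether a binary links against a library.
# $1: path to the binary, $2: library name, e.g. libtorque.so
check_binary() {
    if ldd "$1" 2>/dev/null | mentions_lib "$2"; then
        echo "$1 links against $2"
    else
        echo "$1 does not link against $2 (or ldd failed)"
    fi
}

# Example use, with the path and library taken from the error above:
# check_binary /usr/local/bin/hydra_nameserver libtorque.so
```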
Comment 4 commit-hook freebsd_committer 2021-05-30 21:44:22 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/ports/commit/?id=88e134883dd2a2a78b909fdee59257513afe0c77

commit 88e134883dd2a2a78b909fdee59257513afe0c77
Author:     Jan Beich <jbeich@FreeBSD.org>
AuthorDate: 2021-05-29 17:29:17 +0000
Commit:     Jan Beich <jbeich@FreeBSD.org>
CommitDate: 2021-05-30 21:43:06 +0000

    net/mpich: enable L0 by default for GPU support

    To disable at runtime set MPIR_CVAR_ENABLE_GPU=0 via environ(7).

    PR:             256244
    Tested by:      thierry

 net/mpich/Makefile | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
Comment 5 Li-Wen Hsu freebsd_committer 2021-06-01 16:56:58 UTC
It looks like this one can be closed?
Comment 6 commit-hook freebsd_committer 2021-09-06 22:26:30 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/ports/commit/?id=b5815e7648a8e5307a20a234befa00e34306319d

commit b5815e7648a8e5307a20a234befa00e34306319d
Author:     Henrik Gulbrandsen <henrik@gulbra.net>
AuthorDate: 2021-08-12 14:35:20 +0000
Commit:     Jan Beich <jbeich@FreeBSD.org>
CommitDate: 2021-09-06 22:25:00 +0000

    net/mpich: unbreak optimized runtime after 88e134883dd2

    Runtime may fail without a L0 driver like intel-compute-runtime e.g.,

    $ mpivars
    Abort(268484367) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init_thread: Other MPI error, error stack:
    MPIR_Init_thread(153):  gpu_init failed
    [unset]: write_line error; fd=-1 buf=:cmd=abort exitcode=268484367
    :
    system msg for write_line failure : Bad file descriptor
    Attempting to use an MPI routine before initializing MPICH

    $ MPIR_CVAR_ENABLE_GPU=0 mpivars
    Abort(2139535) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init_thread: Other MPI error, error stack:
    MPIR_Init_thread(159)......:
    MPID_Init(591).............:
    MPIDI_SHM_mpi_init_hook(22):
    MPIDI_IPC_mpi_init_hook(36):
    MPIDI_GPU_mpi_init_hook(79):  gpu_get_dev_count failed
    [unset]: write_line error; fd=-1 buf=:cmd=abort exitcode=2139535
    :
    system msg for write_line failure : Bad file descriptor
    Abort(2139535) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init_thread: Other MPI error, error stack:
    MPIR_Init_thread(159)......:
    MPID_Init(591).............:
    MPIDI_SHM_mpi_init_hook(22):
    MPIDI_IPC_mpi_init_hook(36):
    MPIDI_GPU_mpi_init_hook(79):  gpu_get_dev_count failed
    [unset]: write_line error; fd=-1 buf=:cmd=abort exitcode=2139535
    :
    system msg for write_line failure : Bad file descriptor
    Segmentation fault

    PR:             256244 (for tracking)

 net/mpich/Makefile                      |  2 +-
 net/mpich/files/patch-l0-fallback (new) | 44 +++++++++++++++++++++++++++++++++
 2 files changed, 45 insertions(+), 1 deletion(-)
Comment 7 commit-hook freebsd_committer 2021-09-06 22:26:31 UTC
A commit in branch 2021Q3 references this bug:

URL: https://cgit.FreeBSD.org/ports/commit/?id=fd490a171c3da0d7bcb9a5f3ee3b4b46075dfa9e

commit fd490a171c3da0d7bcb9a5f3ee3b4b46075dfa9e
Author:     Henrik Gulbrandsen <henrik@gulbra.net>
AuthorDate: 2021-08-12 14:35:20 +0000
Commit:     Jan Beich <jbeich@FreeBSD.org>
CommitDate: 2021-09-06 22:25:57 +0000

    net/mpich: unbreak optimized runtime after 88e134883dd2

    Runtime may fail without a L0 driver like intel-compute-runtime e.g.,

    $ mpivars
    Abort(268484367) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init_thread: Other MPI error, error stack:
    MPIR_Init_thread(153):  gpu_init failed
    [unset]: write_line error; fd=-1 buf=:cmd=abort exitcode=268484367
    :
    system msg for write_line failure : Bad file descriptor
    Attempting to use an MPI routine before initializing MPICH

    $ MPIR_CVAR_ENABLE_GPU=0 mpivars
    Abort(2139535) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init_thread: Other MPI error, error stack:
    MPIR_Init_thread(159)......:
    MPID_Init(591).............:
    MPIDI_SHM_mpi_init_hook(22):
    MPIDI_IPC_mpi_init_hook(36):
    MPIDI_GPU_mpi_init_hook(79):  gpu_get_dev_count failed
    [unset]: write_line error; fd=-1 buf=:cmd=abort exitcode=2139535
    :
    system msg for write_line failure : Bad file descriptor
    Abort(2139535) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init_thread: Other MPI error, error stack:
    MPIR_Init_thread(159)......:
    MPID_Init(591).............:
    MPIDI_SHM_mpi_init_hook(22):
    MPIDI_IPC_mpi_init_hook(36):
    MPIDI_GPU_mpi_init_hook(79):  gpu_get_dev_count failed
    [unset]: write_line error; fd=-1 buf=:cmd=abort exitcode=2139535
    :
    system msg for write_line failure : Bad file descriptor
    Segmentation fault

    PR:             256244 (for tracking)
    (cherry picked from commit b5815e7648a8e5307a20a234befa00e34306319d)

 net/mpich/Makefile                      |  2 +-
 net/mpich/files/patch-l0-fallback (new) | 44 +++++++++++++++++++++++++++++++++
 2 files changed, 45 insertions(+), 1 deletion(-)