Bug 252536 - net/mpich: Build fails on mpl/src/gpu/mpl_gpu_ze.c
Status: Closed Overcome By Events
Alias: None
Product: Ports & Packages
Classification: Unclassified
Component: Individual Port(s)
Version: Latest
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: Thierry Thomas
URL:
Keywords:
Depends on:
Blocks: 256244
Reported: 2021-01-09 09:01 UTC by Nick
Modified: 2021-10-28 00:08 UTC

Attachments
Build output (6.77 KB, text/plain)
2021-01-09 09:01 UTC, Nick

Description Nick 2021-01-09 09:01:45 UTC
Created attachment 221408
Build output

When trying to build net/mpich-3.4, multiple build errors occur while compiling src/mpl/src/gpu/mpl_gpu_ze.lo.  The build output is attached, but it looks like a missing/incorrect dependency.  As an example:

```
src/gpu/mpl_gpu_ze.c:280:19: error: use of undeclared identifier 'device_handles'; did you mean 'dev_handle'?
    *dev_handle = device_handles[dev_id];
                  ^~~~~~~~~~~~~~
                  dev_handle
src/gpu/mpl_gpu_ze.c:278:66: note: 'dev_handle' declared here
int MPL_gpu_get_dev_handle(int dev_id, MPL_gpu_device_handle_t * dev_handle)
                                                                 ^
```
Comment 1 Thierry Thomas freebsd_committer freebsd_triage 2021-01-09 09:14:35 UTC
Could you please describe your platform? (uname -mrU)
Comment 2 Nick 2021-01-09 09:18:47 UTC
(In reply to Thierry Thomas from comment #1)

Of course.  

> uname -mrU
12.2-RELEASE-p1 amd64 1202000
Comment 3 Thierry Thomas freebsd_committer freebsd_triage 2021-01-09 14:04:48 UTC
The error seems to be caused by the file
/usr/local/include/level_zero/ze_api.h
which I am not familiar with: it does not exist on my machines!

Could you please report the output of
pkg which /usr/local/include/level_zero/ze_api.h

The various config.log files could also be useful, especially if you change the options.

For comparison, the output of a build session in a clean jail (poudriere) is available at:
https://people.freebsd.org/~thierry/mpich-3.4.log
and it does not show anything related to the reported error in src/gpu/mpl_gpu_ze.c.
Comment 4 Nick 2021-01-09 15:20:03 UTC
(In reply to Thierry Thomas from comment #3)

> pkg which /usr/local/include/level_zero/ze_api.h
/usr/local/include/level_zero/ze_api.h was installed by package level-zero-1.0.26

I'm not sure where that package came from, since it wasn't a dependency for anything.  Removing it allowed the build to complete as planned.

I did see a block in the configuration stage that refers to level-zero:
> checking level_zero/ze_api.h usability... no
> checking level_zero/ze_api.h presence... no
> checking for level_zero/ze_api.h... no
> checking for zeInit in -lze_loader... no

Seems to be the source of the issue.
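
The check quoted above can be automated. The following is only a minimal sketch: it assumes the terminal output of configure has been saved or piped in (config.log uses a different line format), and `ze_detected` is a hypothetical helper name, not part of MPICH or the ports tree. It reports whether configure found the Level Zero header, which is what switches on the mpl_gpu_ze.c code path.

```shell
#!/bin/sh
# Sketch: report whether configure output shows Level Zero being detected.
# Reads configure's terminal output on stdin.
# ze_detected is a hypothetical helper, not part of MPICH or the ports tree.
ze_detected() {
    # Succeeds only if the final header check answered "yes".
    grep -q '^checking for level_zero/ze_api\.h\.\.\. yes$'
}

# Example input modelled on the configure lines quoted in this comment.
sample='checking level_zero/ze_api.h usability... no
checking level_zero/ze_api.h presence... no
checking for level_zero/ze_api.h... no
checking for zeInit in -lze_loader... no'

if printf '%s\n' "$sample" | ze_detected; then
    echo "level-zero detected: the ze GPU code path will be compiled"
else
    echo "level-zero not detected"
fi
```

With level-zero installed, the same lines would presumably read "yes" and the helper would succeed, which matches the failing build in the original report.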
Comment 5 Jan Beich freebsd_committer freebsd_triage 2021-01-09 16:59:24 UTC
(In reply to Nick from comment #0)
> src/gpu/mpl_gpu_ze.c:280:19: error: use of undeclared identifier 'device_handles'

device_handles doesn't show up in any change under https://github.com/oneapi-src/level-zero. Which version of level-zero is expected by mpich? Does it build on Linux?

For example, src/pm/hydra2/mpl/src/gpu/mpl_gpu_ze.c has "ze_device_handle_t *global_ze_devices_handle;". Maybe device_handles is a leftover from before global_ze_devices_handle was renamed, e.g.:

https://github.com/pmodels/mpich/commit/4c1ed41821b4
Comment 6 commit-hook freebsd_committer freebsd_triage 2021-01-09 17:52:00 UTC
A commit references this bug:

Author: jbeich
Date: Sat Jan  9 17:51:20 UTC 2021
New revision: 560881
URL: https://svnweb.freebsd.org/changeset/ports/560881

Log:
  net/mpich: unbreak with level-zero after r560756

  level-zero is pulled as a build-only dependency of intel-compute-runtime.
  mpich support for level-zero is broken and uses pre-1.0 API (before r545238).

  src/gpu/mpl_gpu_ze.c:123:11: warning: implicit declaration of function 'zeDriverGetMemIpcHandle' is invalid in C99 [-Wimplicit-function-declaration]
      ret = zeDriverGetMemIpcHandle(global_ze_driver_handle, ptr, ipc_handle);
            ^
  src/gpu/mpl_gpu_ze.c:139:9: warning: implicit declaration of function 'zeDriverOpenMemIpcHandle' is invalid in C99 [-Wimplicit-function-declaration]
          zeDriverOpenMemIpcHandle(global_ze_driver_handle,
          ^
  src/gpu/mpl_gpu_ze.c:140:70: error: no member named 'global_dev_id' in 'struct _ze_ipc_mem_handle_t'
                                   global_ze_devices_handle[ipc_handle.global_dev_id],
                                                            ~~~~~~~~~~ ^
  src/gpu/mpl_gpu_ze.c:141:45: error: no member named 'handle' in 'struct _ze_ipc_mem_handle_t'
                                   ipc_handle.handle, ZE_IPC_MEMORY_FLAG_NONE, ptr);
                                   ~~~~~~~~~~ ^
  src/gpu/mpl_gpu_ze.c:141:53: error: use of undeclared identifier 'ZE_IPC_MEMORY_FLAG_NONE'; did you mean 'ZE_IPC_MEMORY_FLAG_TBD'?
                                   ipc_handle.handle, ZE_IPC_MEMORY_FLAG_NONE, ptr);
                                                      ^~~~~~~~~~~~~~~~~~~~~~~
  src/gpu/mpl_gpu_ze.c:156:11: warning: implicit declaration of function 'zeDriverCloseMemIpcHandle' is invalid in C99 [-Wimplicit-function-declaration]
      ret = zeDriverCloseMemIpcHandle(global_ze_driver_handle, ptr);
            ^
  src/gpu/mpl_gpu_ze.c:171:11: warning: implicit declaration of function 'zeDriverGetMemAllocProperties' is invalid in C99 [-Wimplicit-function-declaration]
      ret = zeDriverGetMemAllocProperties(global_ze_driver_handle, ptr, &ptr_attr, &device);
            ^
  src/gpu/mpl_gpu_ze.c:202:25: error: use of undeclared identifier 'ZE_DEVICE_MEM_ALLOC_FLAG_DEFAULT'
      device_desc.flags = ZE_DEVICE_MEM_ALLOC_FLAG_DEFAULT;
                          ^
  src/gpu/mpl_gpu_ze.c:204:17: error: no member named 'version' in 'struct _ze_device_mem_alloc_desc_t'
      device_desc.version = ZE_DEVICE_MEM_ALLOC_DESC_VERSION_CURRENT;
      ~~~~~~~~~~~ ^
  src/gpu/mpl_gpu_ze.c:204:27: error: use of undeclared identifier 'ZE_DEVICE_MEM_ALLOC_DESC_VERSION_CURRENT'
      device_desc.version = ZE_DEVICE_MEM_ALLOC_DESC_VERSION_CURRENT;
                            ^
  src/gpu/mpl_gpu_ze.c:208:11: warning: implicit declaration of function 'zeDriverAllocDeviceMem' is invalid in C99 [-Wimplicit-function-declaration]
      ret = zeDriverAllocDeviceMem(global_ze_driver_handle, &device_desc,
            ^
  src/gpu/mpl_gpu_ze.c:223:23: error: use of undeclared identifier 'ZE_HOST_MEM_ALLOC_FLAG_DEFAULT'
      host_desc.flags = ZE_HOST_MEM_ALLOC_FLAG_DEFAULT;
                        ^
  src/gpu/mpl_gpu_ze.c:224:15: error: no member named 'version' in 'struct _ze_host_mem_alloc_desc_t'
      host_desc.version = ZE_HOST_MEM_ALLOC_DESC_VERSION_CURRENT;
      ~~~~~~~~~ ^
  src/gpu/mpl_gpu_ze.c:224:25: error: use of undeclared identifier 'ZE_HOST_MEM_ALLOC_DESC_VERSION_CURRENT'
      host_desc.version = ZE_HOST_MEM_ALLOC_DESC_VERSION_CURRENT;
                          ^
  src/gpu/mpl_gpu_ze.c:229:11: warning: implicit declaration of function 'zeDriverAllocHostMem' is invalid in C99 [-Wimplicit-function-declaration]
      ret = zeDriverAllocHostMem(global_ze_driver_handle, &host_desc, size, mem_alignment, ptr);
            ^
  src/gpu/mpl_gpu_ze.c:240:11: warning: implicit declaration of function 'zeDriverFreeMem' is invalid in C99 [-Wimplicit-function-declaration]
      ret = zeDriverFreeMem(global_ze_driver_handle, ptr);
            ^
  src/gpu/mpl_gpu_ze.c:251:11: warning: implicit declaration of function 'zeDriverFreeMem' is invalid in C99 [-Wimplicit-function-declaration]
      ret = zeDriverFreeMem(global_ze_driver_handle, ptr);
            ^
  src/gpu/mpl_gpu_ze.c:280:19: error: use of undeclared identifier 'device_handles'; did you mean 'dev_handle'?
      *dev_handle = device_handles[dev_id];
                    ^~~~~~~~~~~~~~

  PR:		252536
  Reported by:	Nick, thierry

Changes:
  head/net/mpich/Makefile
Comment 7 Thierry Thomas freebsd_committer freebsd_triage 2021-05-11 17:09:55 UTC
Nick, could we close this PR?
Comment 8 Nick 2021-05-11 19:46:26 UTC
Yes, I believe so.
Comment 9 Thierry Thomas freebsd_committer freebsd_triage 2021-05-11 20:05:24 UTC
Closed after submitter's feedback.
Thanks!
Comment 10 Edoardo Aprà 2021-10-27 17:13:53 UTC
You might want to have a look at the following MPICH comments/developments, which might be related to this bug entry:

https://github.com/nwchemgit/nwchem/issues/463#issuecomment-953060633

https://github.com/pmodels/mpich/pull/5623
Comment 11 Thierry Thomas freebsd_committer freebsd_triage 2021-10-27 19:46:42 UTC
(In reply to Edoardo Aprà from comment #10)

Thanks for the feedback!

Have you been able to reproduce the reported issue?
Comment 12 Edoardo Aprà 2021-10-27 19:52:08 UTC
I have been able to reproduce the issue reported in https://github.com/nwchemgit/nwchem/issues/463#issue-1034059467 on a VirtualBox image of FreeBSD 13.
It only shows up when the current mpich port is used (no problems with either openmpi or mpich2). Valgrind shows memory issues in the mpich layer even prior to the fatal segv.
Comment 13 commit-hook freebsd_committer freebsd_triage 2021-10-27 22:24:25 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/ports/commit/?id=a46966bb3496e0cf8100f6acdd671d4fb90c9cdb

commit a46966bb3496e0cf8100f6acdd671d4fb90c9cdb
Author:     Jan Beich <jbeich@FreeBSD.org>
AuthorDate: 2021-10-27 21:54:51 +0000
Commit:     Jan Beich <jbeich@FreeBSD.org>
CommitDate: 2021-10-27 22:19:27 +0000

    net/mpich: replace L0 fix with upstream version

    PR:             252536
    Reported by:    Edoardo Aprà

 net/mpich/Makefile                   |  1 -
 net/mpich/files/patch-l0-1.4.1 (new) | 50 ++++++++++++++++++++++++++++++++++++
 2 files changed, 50 insertions(+), 1 deletion(-)
Comment 14 commit-hook freebsd_committer freebsd_triage 2021-10-28 00:08:46 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/ports/commit/?id=dea82318648e46c157874bd1079f50b23c9c08d0

commit dea82318648e46c157874bd1079f50b23c9c08d0
Author:     Jan Beich <jbeich@FreeBSD.org>
AuthorDate: 2021-10-27 23:48:16 +0000
Commit:     Jan Beich <jbeich@FreeBSD.org>
CommitDate: 2021-10-28 00:08:04 +0000

    net/mpich: switch L0=off to --without-ze after 697c7df81364

    https://github.com/pmodels/mpich/commit/67b1e07851fe
    https://github.com/pmodels/mpich/commit/84ae6243139c

    PR:             252536
    Reported by:    Edoardo Aprà

 net/mpich/Makefile | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)