Created attachment 268792 [details]
net/openmpi: introduced ucx option

Hi Laurent,

I prepared a patch introducing an optional UCX port option for net/openmpi to enable integration with the external net/ucx port on FreeBSD.

Build and configure integration work correctly. This required a small FreeBSD-specific adjustment for locating ucx.pc, since it is installed under ${LOCALBASE}/libdata/pkgconfig rather than the usual ${LOCALBASE}/lib/pkgconfig.

With the patch applied:
- the port builds successfully with the UCX option enabled
- ompi_info shows the UCX components (pml/ucx, osc/ucx)
- forcing --mca pml ucx --mca osc ucx confirms that the UCX path is selected at runtime

However, runtime testing also revealed remaining FreeBSD-specific issues inside UCX itself (e.g. Linux-specific network discovery assumptions and subsequent initialization failures). These need to be addressed in net/ucx before the UCX path can be considered fully usable on FreeBSD.

Before continuing further on the UCX side, I wanted to ask how you would prefer UCX integration to be exposed in the port. My current implementation uses a non-default UCX option, but alternatives could include a different option policy or a separate flavor.

I would also appreciate feedback on whether the current adjustment for locating ucx.pc is acceptable, or if you would prefer a different approach.
Hi Rikka,

Thanks for your work on ucx, and for your work on the integration of ucx in openmpi.

The way you did it is the way I was planning on doing it: make it a non-default option. I plan to do the same for mpich. The reason I haven't done so already is that I am waiting for ucx to be tested and stable on FreeBSD. Once ucx passes all unit tests, please let me know and I'll add the option to both MPI implementations.
Hi Laurent,

Congratulations on the new @FreeBSD.org address! I also noticed this only after sending the previous message, which is why I CC’d your Gmail address. I’ll use your FreeBSD address for future correspondence.

On the ports side I’ve started working on unblocking UCX. Once I’ve completed testing and verified that it works properly together with OpenMPI on FreeBSD, I’ll let you know.

Thanks for your feedback on the UCX integration.

Best regards,
Rikka
Thanks Rikka. You're right, the laurent@ address is new from last week!

I can add the ucx option for openmpi and mpich and do the tests if you wish, once you give me the go-ahead. If you wish to do it yourself, that's OK too. In that case I'd suggest a different approach to files/patch-configure. Since it seems that the only change is to replace "${check_package_prefix}/lib/pkgconfig/ucx.pc" with "${check_package_prefix}/libdata/pkgconfig/ucx.pc" in several places, it is simpler to make these changes with a ${REINPLACE_CMD} in the Makefile. It will also be easier to maintain.

Thanks for your great HPC contributions!
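As a rough illustration of the substitution suggested above, the following shell sketch applies the same path rewrite to a scratch file rather than to the real configure script. In the ports framework, ${REINPLACE_CMD} is essentially an in-place sed, so plain `sed -i.bak` stands in for it here. The function name `fix_ucx_pc_path` and the sample line are hypothetical and not taken from the attached patch.

```shell
#!/bin/sh
# Hypothetical helper: rewrite the ucx.pc lookup path in a configure-like file.
# Plain `sed -i.bak` stands in for the ports framework's ${REINPLACE_CMD}.
fix_ucx_pc_path() {
    sed -i.bak \
        -e 's|/lib/pkgconfig/ucx\.pc|/libdata/pkgconfig/ucx.pc|g' \
        "$1"
}

# Demonstrate on a scratch file instead of the real configure script.
# The sample line is an assumption about what configure roughly contains.
demo_file=$(mktemp)
printf '%s\n' \
    'if test -r "${check_package_prefix}/lib/pkgconfig/ucx.pc"; then' \
    > "$demo_file"

fix_ucx_pc_path "$demo_file"
cat "$demo_file"
```

Because the pattern requires the literal `/lib/` directory component, running the helper a second time leaves an already rewritten `libdata` path untouched. In the port itself, the equivalent would presumably be a post-patch target that invokes ${REINPLACE_CMD} on ${WRKSRC}/configure.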
(In reply to Laurent Chardon from comment #3) You can probably use USES= pathfix
(In reply to Daniel Engberg from comment #4)
Yes, thanks.

-		pkgconfig python:build shebangfix tar:bzip2
+		pathfix pkgconfig python:build shebangfix tar:bzip2
+PATHFIX_MAKEFILEIN=	configure

seems to work well.
Hi Laurent,

I wanted to report back with the current status of the UCX work on the FreeBSD side.

While debugging the UCX-backed OpenMPI runtime path on FreeBSD, I identified several underlying issues in net/ucx and split them into a separate UCX bug report: bug #293867

The attached net/ucx patch series addresses multiple FreeBSD-specific runtime problems in UCX itself, including runtime portability fixes, UCM relocation handling, async thread state hardening, and mm signal socket binding.

With these changes in place, standalone UCX smoke tests and reduced UCP/worker tests now succeed on FreeBSD. In particular, ucp_init() and ucp_worker_create() complete successfully in the standalone test cases, including the worker path that previously exercised the problematic async thread handling.

On the OpenMPI side, this also improves the situation: OpenMPI no longer fails in the earlier UCX/UCM runtime path, and the UCX component now loads successfully. However, the UCX-backed OpenMPI path still does not initialize fully.

For example, when forcing the UCX PML with:

OMPI_MCA_pml=ucx OMPI_MCA_osc=sm mpirun -np 1 /tmp/mpi_hello

the verbose output shows:
- component ucx open function successful
- select: initializing pml component ucx
- select: init returned failure for component ucx

From the UCX trace output, the UCX context itself is already created successfully at that point, so the remaining failure appears to happen later, likely in the OpenMPI UCX integration layer rather than in the UCX base runtime itself.

So my current conclusion is:
- the UCX port itself was blocked by several FreeBSD runtime issues, and those have now largely been addressed in the separate UCX patch series
- the remaining failure is in the UCX-backed MPI integration path, not in the earlier UCX initialization path I was debugging before

Because of that, I think it makes sense for you to handle the MPI-side UCX integration in net/openmpi and net/mpich, since you maintain those ports and are much more familiar with the MPI-side internals.

I mainly wanted to report back that the original UCX runtime blockers have been reduced substantially, but that the OpenMPI UCX PML path is still not fully working on FreeBSD.

Best regards,
Rikka
I had a first try at ucx integration with openmpi today. I have a couple of questions:

1) ucx_info -d gives me errors:

$ ucx_info -d | grep ERROR
[1773749512.359150] [freebsd:41529:0] sys.c:425 UCX ERROR failed to get boot id
[1773749512.360539] [freebsd:41529:0] sys.c:425 UCX ERROR failed to get boot id
[1773749512.360657] [freebsd:41529:0] tcp_iface.c:989 UCX ERROR scandir(/sys/class/net) failed: No such file or directory
[1773749512.360789] [freebsd:41529:0] mm_iface.c:705 UCX ERROR Failed to auto-bind unix domain socket: Invalid argument

2) I noticed that UCX’s gtest test framework has been excluded from the build. Do you intend for this to be a permanent change, or is your intention to re-enable it later?
(In reply to Laurent Chardon from comment #7) I forgot to write my question for 1). The question is: is this an issue or can it be ignored? Thanks!
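When comparing such output before and after a patch series, it can help to condense the error lines into a per-source-file count. The sketch below is one possible way to do that, assuming log lines shaped like the ones quoted above; the function name `summarize_ucx_errors` is hypothetical and not part of UCX.

```shell
#!/bin/sh
# Hypothetical helper: count UCX ERROR lines per originating source file.
# Assumes lines shaped like the ucx_info output quoted above:
#   [timestamp] [host:pid:tid] file.c:line UCX ERROR message
summarize_ucx_errors() {
    awk '/UCX +ERROR/ {
        # Find the first field that looks like "file.c:123".
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^[^:]+\.(c|h|cc):[0-9]+$/) {
                split($i, loc, ":")   # loc[1] holds the file name
                count[loc[1]]++
                break
            }
        }
    }
    END { for (f in count) print count[f], f }' | sort -k2
}

# Sample input copied from the report above; in practice one would pipe in
# the output of `ucx_info -d 2>&1` instead.
summarize_ucx_errors <<'EOF'
[1773749512.359150] [freebsd:41529:0] sys.c:425 UCX ERROR failed to get boot id
[1773749512.360539] [freebsd:41529:0] sys.c:425 UCX ERROR failed to get boot id
[1773749512.360657] [freebsd:41529:0] tcp_iface.c:989 UCX ERROR scandir(/sys/class/net) failed: No such file or directory
[1773749512.360789] [freebsd:41529:0] mm_iface.c:705 UCX ERROR Failed to auto-bind unix domain socket: Invalid argument
EOF
```

For the four lines quoted in the report, this prints one count for mm_iface.c, two for sys.c, and one for tcp_iface.c. The summary only groups messages; it says nothing about which of them are harmless.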
Thanks for testing this! That helps a lot to cross-check behavior outside my local setup.

Regarding your first question:

Those errors you are seeing from ucx_info -d are expected with the current unpatched UCX port, and they are not all harmless.

For reference, with the stock package (no patches), I see the same issues:
- failed to get boot id
- scandir(/sys/class/net) failed
- Failed to auto-bind unix domain socket: Invalid argument
- /dev/shm related errors

In particular:
- the /sys/class/net error comes from Linux-specific assumptions and is expected on FreeBSD
- the mm socket bind failure is more problematic: it breaks the shared-memory transport initialization (mm), so this is not something we should ignore
- /dev/shm errors are also expected (Linux-specific), but they indicate missing adaptation for POSIX/shared memory handling on FreeBSD

After applying my patch series from bug #293867, the situation improves significantly:
- the mm socket binding issue is resolved
- UCX is able to initialize transports (e.g. TCP, SYSV) correctly
- ucx_info -d produces a mostly functional output

The remaining messages you see in a run after applying my patches are mostly:
- Linux-specific warnings (e.g. /proc access, somaxconn)
- debug instrumentation from my async-thread probing

Those are not fatal and do not prevent UCX initialization or usage.

Now to your second question (gtest):

The test suite was disabled intentionally for now, but this is not meant as a permanent change.

UCX’s gtest framework relies heavily on Linux-specific infrastructure (e.g. /proc, /sys, certain memory and network assumptions), and a significant portion of the tests either fail or are not meaningful on FreeBSD in their current form.

My plan is:
- first: get the runtime behavior of UCX working correctly (which is what I’m focusing on now)
- then: revisit the test suite, either by
  - patching tests to be portable, or
  - selectively enabling a subset that makes sense on FreeBSD

At the moment, enabling the full gtest suite would produce a large amount of noise without giving a useful validation signal.