Itoyori is a distributed multithreading runtime system for global-view fork-join task parallelism. It is implemented as a C++17 header-only library over MPI (which must have full support of MPI-3 RMA).
This README explains the basic usage of Itoyori for running example programs. For more information, please see the publications listed below.
Itoyori offers a simple, unified programming model for both shared-memory and distributed-memory computers.
Itoyori is a C++17 header-only library located in the `include/` directory. This repository also contains CMake settings to build tests and examples.
To build tests and examples:
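For example, with a typical out-of-source CMake build (the build directory name and options are illustrative; adjust them to your environment):

```sh
cmake -S . -B build
cmake --build build -j
```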
To run test:
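For example, assuming tests are registered with CTest:

```sh
cd build && ctest
```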
Examples (in the `examples/` directory) include Fib, NQueens, and Cilksort. Cilksort involves global memory accesses (over PGAS), while Fib and NQueens do not.
To run Cilksort:
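For example (the binary path and process count are illustrative; adjust them to your build):

```sh
# Path to the built example binary may differ in your build tree
mpirun -n 2 setarch $(uname -m) --addr-no-randomize ./examples/cilksort.out
```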
The command `setarch $(uname -m) --addr-no-randomize` is needed for Itoyori to disable address space layout randomization (ASLR). The argument `$(uname -m)` might not be needed depending on the `setarch` version.
Please see the example programs (e.g., `cilksort.cpp`) for usage.
Profiler-enabled versions are also built by CMake.
To show event statistics:
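For example (binary path is illustrative):

```sh
mpirun setarch $(uname -m) --addr-no-randomize ./examples/<example>_prof_stats.out
```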
To record execution traces:
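A sketch, assuming the trace-enabled binaries follow the same naming convention (`<example>_prof_trace.out`):

```sh
# The _prof_trace.out suffix is an assumption by analogy with _prof_stats.out
mpirun setarch $(uname -m) --addr-no-randomize ./examples/<example>_prof_trace.out
```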
The output trace can be visualized by using MassiveLogger viewer:
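A sketch; the viewer script name and trace file pattern below are assumptions, so check the `massivelogger` submodule for the actual entry point:

```sh
# Script name and log file pattern are assumptions; see the massivelogger submodule
./massivelogger/run_viewer.bash ityr_log_*
```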
The `include/ityr/` directory includes the following subdirectories:

- `common`: common utilities for the following layers
- `ito`: low-level threading layer (fork-join primitives)
- `ori`: low-level PGAS layer (checkout/checkin APIs)
- `pattern`: parallel patterns (e.g., `for_each()`, `reduce()`)
- `container`: containers for global memory objects (e.g., `global_vector`, `global_span`)

The `ito` and `ori` layers are loosely coupled, so that each layer can run independently. These two low-level layers are integrated into high-level parallel patterns and containers, for example by inserting appropriate global memory fences around fork-join calls. Thus, it is highly recommended to use the high-level interfaces (under the `ityr::` namespace), rather than the low-level ones (under the `ityr::ito` or `ityr::ori` namespaces).
Git submodules are used for the following purposes, but they are not required to run Itoyori:

- `doctest`: used for testing
- `massivelogger`: needed to collect execution traces

As Itoyori heavily uses MPI-3 RMA (one-sided communication) for communication between workers, performance can be significantly affected by the MPI implementation being used. Itoyori assumes truly one-sided communication in MPI-3 RMA, and preferably, RMA calls should be offloaded to RDMA.
Truly one-sided communication implies that an RMA operation can be completed without the involvement of the target process. In other words, an RMA operation should make progress even if the target process is busy executing tasks without making any MPI calls.
You can check whether the RMA calls of your MPI installation are truly one-sided by running the `<example>_prof_stats.out` programs, in which statistics profiling is enabled. For instance, you can check a profile of Cilksort by running:
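For example (with two processes on the same computer; binary path is illustrative):

```sh
mpirun -n 2 setarch $(uname -m) --addr-no-randomize ./examples/cilksort_prof_stats.out
```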
Example output of Cilksort with 2 workers on the same computer:
This result implies that RMA calls are not truly one-sided. What is happening is that one worker continuously executes tasks without making any MPI calls, while the other attempts work stealing but ends up blocked due to the lack of progress on the target side. This situation leads to all tasks being executed by just one worker.
The above result was obtained with MPICH (v4.0), which does not seem to support truly one-sided communication. Nevertheless, you can emulate truly one-sided communication by launching asynchronous progress threads with the following setting:
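For example:

```sh
# Export before mpirun so that all MPI processes see it
export MPICH_ASYNC_PROGRESS=1
```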
Then, we will get the following result:
However, MPICH's approach uses active messages to simulate one-sided communication with asynchronous two-sided communication, which does not take full advantage of RDMA. In our experience, Open MPI better offloads RMA operations to RDMA.
We confirmed that Itoyori worked well on RDMA-capable interconnects with the following MPI configurations:
Note that actual MPI behavior will depend on the hardware configuration and driver versions.
Open MPI v5.0.x enables the use of the MCA parameter `osc_ucx_acc_single_intrinsic`, which accelerates the network atomic operations heavily used in Itoyori. A recommended way to run Itoyori with Open MPI is:
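For example (explicitly selecting the UCX one-sided component; the rest of the command line follows the earlier examples):

```sh
mpirun --mca osc ucx --mca osc_ucx_acc_single_intrinsic true \
       setarch $(uname -m) --addr-no-randomize ./examples/cilksort.out
```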
However, local executions with Open MPI (without high-performance network cards) may fall back to an implementation that is not truly one-sided. In addition, Open MPI does not seem to offer an option to launch asynchronous progress threads. Therefore, for debugging purposes on local machines, we recommend using MPICH-based MPI implementations with `MPICH_ASYNC_PROGRESS=1`.
Overview of the Itoyori runtime system and its PGAS implementation:
About the threading layer (formerly called uni-address threads):
About the task scheduler Almost Deterministic Work Stealing (ADWS):
Itoyori is named after the fish *thread*fin bream ("糸撚魚" in Japanese).