This tutorial explains the basics of writing Itoyori programs.
An Itoyori program begins with the Single Program Multiple Data (SPMD) execution model, as it is launched by MPI. Later, it can be switched to the task-parallel execution mode by spawning the root thread.
A sample program:
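(The following is a minimal sketch; the exact sample shipped with Itoyori may differ, and the printed messages are illustrative.)

```cpp
#include "ityr/ityr.hpp"
#include <cstdio>

int main() {
  // Initialize the Itoyori runtime (collectively called by all processes, like MPI_Init())
  ityr::init();

  // SPMD region: this code is executed by every process
  std::printf("MPI rank: %d / %d\n", ityr::my_rank(), ityr::n_ranks());

  // Switch to the task-parallel mode by spawning the root thread;
  // the lambda body is executed by exactly one thread
  ityr::root_exec([] {
    std::printf("Hello, world!\n");
  });

  // Finalize the runtime (like MPI_Finalize())
  ityr::fini();
}
```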
Notes:

- `ityr` is short for Itoyori
- `ityr::init()` and `ityr::fini()` must be called to initialize/finalize the Itoyori runtime system
- `ityr::my_rank()` and `ityr::n_ranks()` correspond to the MPI rank and size of each process
- `ityr::root_exec()` must be collectively called by all processes to spawn the root thread

Suppose that the above program is saved as `sample.cpp`; it can be compiled with:
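(A sketch of the invocation; the include path `/path/to/itoyori/include` is a placeholder for your installation.)

```sh
mpicxx -std=c++17 -I/path/to/itoyori/include -fno-stack-protector sample.cpp -o sample.out
```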
Notes:

- The `-fno-stack-protector` option is necessary to allow dynamic thread migration, which preserves virtual addresses of call stacks across different processes

Example output with 4 MPI processes:
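(Illustrative output for the sketch program above; the order of the rank lines is nondeterministic.)

```
MPI rank: 0 / 4
MPI rank: 2 / 4
MPI rank: 3 / 4
MPI rank: 1 / 4
Hello, world!
```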
The output indicates that the SPMD region is executed by all processes, while the root thread is executed once.
After the root thread is spawned, the user can dynamically spawn any number of lightweight threads. This section explains how to express recursive fork/join parallelism in Itoyori.
Let's take the Fibonacci sequence as an example, which is defined as follows.
$$ \mathit{fib}(n) = \mathit{fib}(n-1) + \mathit{fib}(n-2),\; \mathit{fib}(0) = \mathit{fib}(1) = 1 $$
The following program calculates the n-th Fibonacci number in a naive, exponential way, by recursively solving subproblems in parallel (divide-and-conquer).
Fibonacci program:
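(A sketch; `fib(30)` and the printed message are illustrative.)

```cpp
#include "ityr/ityr.hpp"
#include <cstdio>

long fib(int n) {
  if (n <= 1) return 1;
  // Fork two child threads and join both of them at once;
  // the lambdas capture n by copy, not by reference
  auto [x, y] = ityr::parallel_invoke(
      [=] { return fib(n - 1); },
      [=] { return fib(n - 2); });
  return x + y;
}

int main() {
  ityr::init();
  // The root thread computes fib(30); its return value is shared by all processes
  long result = ityr::root_exec([] { return fib(30); });
  if (ityr::is_master()) {
    std::printf("fib(30) = %ld\n", result);
  }
  ityr::fini();
}
```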
Notes:

- `ityr::parallel_invoke()` forks the given function objects (lambdas in this case) as child threads and joins all of them at once
- `parallel_invoke` is also used in common shared-memory fork/join libraries such as oneTBB (formerly Intel TBB) and Microsoft PPL
- `ityr::root_exec()` can also return a value, which is shared by all processes when switching back to the SPMD mode
- `ityr::is_master()` is equivalent to `ityr::my_rank() == 0`

One important difference from the shared-memory task-parallel model is that objects cannot be passed to child threads by reference (or by raw pointers). In the above example, the lambda expressions for `ityr::parallel_invoke()` should capture values by copy (see Pitfalls for details). In Itoyori, no pointer or reference to local variables in any other thread's stack is allowed.
Arguments can also be passed to child threads as tuples without using lambdas:
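(A sketch, assuming the overload of `ityr::parallel_invoke()` that takes pairs of a callable and an argument tuple.)

```cpp
// Equivalent to the lambda version above: pass each function
// together with its arguments packed in a tuple
auto [x, y] = ityr::parallel_invoke(
    fib, std::make_tuple(n - 1),
    fib, std::make_tuple(n - 2));
```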
Also, `ityr::parallel_invoke()` can accept an arbitrary number of parallel tasks, as shown in the Tribonacci example below (an extension of Fibonacci). The Tribonacci sequence is defined as follows.

$$ \mathit{trib}(n) = \mathit{trib}(n-1) + \mathit{trib}(n-2) + \mathit{trib}(n-3),\; \mathit{trib}(0) = \mathit{trib}(1) = 1,\; \mathit{trib}(2) = 2 $$

Tribonacci example:
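(A sketch following the same pattern as the Fibonacci program.)

```cpp
long trib(int n) {
  if (n <= 1) return 1;
  if (n == 2) return 2;
  // Fork three child threads at once and join all of them
  auto [x, y, z] = ityr::parallel_invoke(
      [=] { return trib(n - 1); },
      [=] { return trib(n - 2); },
      [=] { return trib(n - 3); });
  return x + y + z;
}
```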
Unlike the above Fibonacci examples, practical real-world applications need global memory. Itoyori offers a global address space, which can be accessed through checkout/checkin APIs. In Itoyori, global addresses are represented merely as raw virtual addresses, which can be directly accessed with CPU load/store instructions, but access to such a virtual memory region must be granted through explicit checkout/checkin calls.

In some of the literature, the low-level checkout and checkin APIs are called explicitly for explanatory purposes, but with Itoyori's high-level API, we can use checkout spans (a sort of "smart span") to make sure that checked-out regions are always checked in when they are destroyed.
Usage of checkout spans:
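(A sketch, assuming `p` is an `ityr::ori::global_ptr<int>` pointing to `n` consecutive elements in global memory.)

```cpp
{
  // Check out the region [p, p + n) for both reading and writing
  auto cs = ityr::make_checkout(p, n, ityr::checkout_mode::read_write);

  // Within this scope, `cs` can be accessed like an ordinary span of ints
  for (std::size_t i = 0; i < cs.size(); i++) {
    cs[i]++;
  }
} // `cs` is destroyed here, and the region is automatically checked in
```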
Notes:

- Although a global address can be expressed as a raw pointer (e.g., `int*`), it is recommended to use a wrapper class `ityr::ori::global_ptr<int>` to prevent dereferencing it without checking out
- `ityr::ori::global_ptr` is prefixed with `ityr::ori`, which is the namespace of the low-level global address space layer
- Similarly, it is recommended to use `ityr::global_span` for safety
- The checkout mode (`ityr::checkout_mode`) passed to `ityr::make_checkout()` is either `read`, `read_write`, or `write`, as explained later

About the checkout mode:

- With `read` or `read_write`, the checked-out region has valid data after the checkout call
- With `write`, the region may have indeterminate values because fetching data from remote nodes is skipped, which can be useful for write-only access (e.g., initialization)
- With `read_write` or `write`, the entire checked-out region is treated as modified

In the following, we explain how to write programs with global memory through an example of parallel mergesort, in which the input array is divided into two subarrays and sorted recursively (divide-and-conquer).
Parallel mergesort example:
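(A sketch of the recursive part; `cutoff` and the sequential merge are illustrative, and a parallelized merge is discussed later.)

```cpp
constexpr std::size_t cutoff = 1024;

// Sort the global array `a`, using `b` as a temporary buffer of the same size
void msort(ityr::global_span<int> a, ityr::global_span<int> b) {
  if (a.size() <= cutoff) {
    // Small enough: check out the whole subarray and sort it locally
    auto cs = ityr::make_checkout(a.data(), a.size(), ityr::checkout_mode::read_write);
    std::sort(cs.begin(), cs.end());
  } else {
    std::size_t h = a.size() / 2;

    // Recursively sort the two halves in parallel (divide-and-conquer)
    ityr::parallel_invoke(
        [=] { msort(a.subspan(0, h), b.subspan(0, h)); },
        [=] { msort(a.subspan(h, a.size() - h), b.subspan(h, b.size() - h)); });

    // Merge the two sorted halves through the temporary buffer;
    // `write` mode for `b` skips fetching its old contents
    auto ca = ityr::make_checkout(a.data(), a.size(), ityr::checkout_mode::read_write);
    auto cb = ityr::make_checkout(b.data(), b.size(), ityr::checkout_mode::write);
    std::merge(ca.begin(), ca.begin() + h, ca.begin() + h, ca.end(), cb.begin());
    std::copy(cb.begin(), cb.end(), ca.begin());
  }
}
```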
The parallel mergesort example is written in a data-race-free manner. In fact, Itoyori does not allow any data race; i.e., the same region can be concurrently checked out by multiple processes only in the `ityr::checkout_mode::read` mode.

As Itoyori provides a software cache for global memory accesses, the user can expect both temporal and spatial locality to be exploited by the system. This means that, even if the same or nearby memory regions are checked out multiple times, the cache prevents redundant and fine-grained communication.
Full mergesort program:
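(A condensed sketch; the `ityr::global_vector` constructor arguments follow the notes below, and the random initialization is illustrative.)

```cpp
#include "ityr/ityr.hpp"
#include <algorithm>
#include <cstdio>
#include <random>

// msort() as defined above

int n_input = 1 << 20;

int main() {
  ityr::init();

  // Collectively allocate two global arrays in the SPMD region
  ityr::global_vector<int> a_vec(ityr::global_vector_options{.collective = true}, n_input);
  ityr::global_vector<int> b_vec(ityr::global_vector_options{.collective = true}, n_input);

  // Pass lightweight views instead of copying the vectors themselves
  ityr::global_span<int> a(a_vec.data(), a_vec.size());
  ityr::global_span<int> b(b_vec.data(), b_vec.size());

  ityr::root_exec([=] {
    { // Initialize with random values; `write` mode skips fetching old data
      auto cs = ityr::make_checkout(a.data(), a.size(), ityr::checkout_mode::write);
      std::mt19937 rng(0);
      for (auto&& x : cs) x = static_cast<int>(rng());
    }

    msort(a, b);

    { // Check the result in `read` mode; the array is not modified here
      auto cs = ityr::make_checkout(a.data(), a.size(), ityr::checkout_mode::read);
      std::printf("%s\n", std::is_sorted(cs.begin(), cs.end()) ? "OK" : "NG");
    }
  });

  ityr::fini();
}
```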
Notes:

- `ityr::global_vector` is used to allocate global memory
- In `ityr::global_vector_options`, `.collective = true` means the global memory should be collectively allocated by all processes, which is allowed only outside `ityr::root_exec()` (i.e., in the SPMD region)
- With `.collective = false`, the global memory is allocated in the local memory of each process (noncollective)
- `ityr::global_span` is often used to pass a view of `ityr::global_vector` to other threads, so as not to unnecessarily copy the contents of vectors
- `ityr::checkout_mode::write` is specified for array initialization in order to skip fetching unnecessary data
- `ityr::checkout_mode::read` is specified for checking the result, in which the array is never modified
- Global variables (e.g., `cutoff`) can be accessed without checking them out, but it is the user's responsibility to guarantee that global variables have the same values across all processes
- Virtual addresses must be consistent across processes (e.g., by running the program with `setarch $(uname -m) --addr-no-randomize`, which disables address randomization)

Also note that the above example does not work with an array larger than each process's local cache; in that case, a runtime error will report that checkout requests are too large.
To avoid this, checkout requests must be decomposed into sufficiently small chunks, so that each chunk fits into the cache. Divide-and-conquer parallelization is often a good fit, as it effectively decomposes checkout/checkin operations into smaller ones and also increases parallelism. See Itoyori's Cilksort example for a parallelized merge implementation. For more regular computations, higher-order parallel patterns and parallel loops can be used, as explained in the next section.
When computing on an array that is much larger than each process's local cache, checkout calls have to be made in sufficiently small granularity. If manually written, the code to express a simple for loop would look like the following:
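(A sketch that increments every element of a global array; `block_size` is illustrative.)

```cpp
void increment_all(ityr::global_span<int> a) {
  constexpr std::size_t block_size = 1024;

  for (std::size_t i = 0; i < a.size(); i += block_size) {
    // Check out at most block_size elements at a time
    std::size_t len = std::min(block_size, a.size() - i);
    auto cs = ityr::make_checkout(a.data() + i, len, ityr::checkout_mode::read_write);
    for (auto&& x : cs) {
      x++;
    }
  } // each block is checked in at the end of its iteration
}
```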
This code repeatedly makes checkout calls with a size no larger than `block_size` to prevent checkout requests that are too large.
By using a higher-order function `ityr::for_each()` with global iterators, the same goal can be achieved:
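(A sketch, assuming the policy constructor that takes the checkout count as its argument.)

```cpp
void increment_all(ityr::global_span<int> a) {
  constexpr std::size_t block_size = 1024;

  ityr::for_each(
      // Sequential policy; block_size elements are checked out at a time
      ityr::execution::sequenced_policy(block_size),
      ityr::make_global_iterator(a.begin(), ityr::checkout_mode::read_write),
      ityr::make_global_iterator(a.end(),   ityr::checkout_mode::read_write),
      [](int& x) { x++; });
}
```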
`ityr::for_each()` receives an execution policy, an iterator range, and a user-defined function (lambda) that operates on each element. Global iterators passed to `ityr::for_each()` are automatically checked out internally, and raw references to the corresponding elements are passed to the user function. This allows for more concise and structured code.
Notes:

- `ityr::for_each()` resembles the standard C++ algorithms (like `std::for_each()`)
- When global iterators (created by `ityr::make_global_iterator()`) are passed as arguments, they are automatically checked out with the specified granularity
- `ityr::count_iterator` can be used in combination to get the index of each element (i.e., the loop counter)
- Raw references (not global references like `ityr::ori::global_ref`) are passed to user functions
- `ityr::execution::sequenced_policy` specifies the sequential execution policy
- Its option (`checkout_count`) specifies the number of elements that are internally checked out at one time
- A sequential policy with default parameters is predefined (`ityr::execution::seq`)

This can be easily translated into a parallel for loop:
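(A sketch; the single policy argument is assumed to set both the cutoff and checkout counts.)

```cpp
void increment_all(ityr::global_span<int> a) {
  constexpr std::size_t block_size = 1024;

  ityr::for_each(
      // Parallel policy; block_size serves as both the cutoff and checkout counts
      ityr::execution::parallel_policy(block_size),
      ityr::make_global_iterator(a.begin(), ityr::checkout_mode::read_write),
      ityr::make_global_iterator(a.end(),   ityr::checkout_mode::read_write),
      [](int& x) { x++; });
}
```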
Notes:

- With a parallel policy, `ityr::for_each()` recursively divides the index space into two parts and runs them in parallel
- `ityr::execution::parallel_policy` can additionally accept the `cutoff_count` option, which specifies the cutoff count for the leaf tasks
- It is usually recommended to set the same value to `cutoff_count` and `checkout_count` (the above code sets `block_size` to both counts)
- A parallel policy with default parameters is predefined (`ityr::execution::par`)

In addition, `ityr::for_each()` can accept multiple iterators:
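(A sketch that writes `a[i] = i` for each element, combining a count iterator with a global iterator.)

```cpp
void fill_index(ityr::global_span<int> a) {
  ityr::for_each(
      ityr::execution::par,
      // First range: the loop counter i = 0, 1, ..., n-1
      ityr::count_iterator<int>(0),
      ityr::count_iterator<int>(static_cast<int>(a.size())),
      // Second range: the array elements, checked out in write mode
      ityr::make_global_iterator(a.begin(), ityr::checkout_mode::write),
      [](int i, int& x) { x = i; });
}
```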
`ityr::count_iterator` is a special iterator that counts up its value when incremented. Using it with `ityr::for_each()` corresponds to parallelizing a loop of the form `for (int i = 0; i < n; i++)`.
AXPY is an example that can be concisely written with `ityr::for_each()`. AXPY computes a scalar-vector product and adds the result to another vector:
$$ y \gets a \times x + y $$
where a is a scalar and x and y are vectors.
AXPY example:
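(A sketch with `ityr::for_each()`; `x` is read-only, while `y` is read and updated in place.)

```cpp
void axpy(double a, ityr::global_span<double> xs, ityr::global_span<double> ys) {
  ityr::for_each(
      ityr::execution::parallel_policy(1024),
      ityr::make_global_iterator(xs.begin(), ityr::checkout_mode::read),
      ityr::make_global_iterator(xs.end(),   ityr::checkout_mode::read),
      ityr::make_global_iterator(ys.begin(), ityr::checkout_mode::read_write),
      // y = a * x + y, elementwise
      [=](const double& x, double& y) { y += a * x; });
}
```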
Similarly, the sum of vector elements can be computed in parallel with `ityr::reduce()`.
Calculate sum:
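(A sketch; `ityr::reduce()` defaults to a plus reduction, and plain global iterators suffice here, as explained below.)

```cpp
double sum(ityr::global_span<double> xs) {
  // Sums up all elements of the global array
  return ityr::reduce(ityr::execution::parallel_policy(1024),
                      xs.begin(), xs.end());
}
```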
Note that global iterators created by `ityr::make_global_iterator()` are not needed for `ityr::reduce()`, because the checkout mode is automatically inferred to be `ityr::checkout_mode::read` here. In Itoyori, for specific patterns whose input/output roles are clear (e.g., `ityr::reduce()`, `ityr::transform()`), global pointers are automatically converted to global iterators with appropriate checkout modes, unlike more generic patterns like `ityr::for_each()`.
When performing a reduction, the user can also provide a user-defined function (lambda) to process each element before summation, by using `ityr::transform_reduce()`.
For example, to calculate the L2 norm:
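(A sketch, assuming the reducer-based overload of `ityr::transform_reduce()` and a built-in plus reducer `ityr::reducer::plus`.)

```cpp
#include <cmath>

double l2norm(ityr::global_span<double> xs) {
  double sq_sum = ityr::transform_reduce(
      ityr::execution::parallel_policy(1024),
      xs.begin(), xs.end(),
      ityr::reducer::plus<double>{},    // how to combine the transformed values
      [](double x) { return x * x; });  // applied to each element before summation
  return std::sqrt(sq_sum);
}
```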
In the above code, `x * x` is applied to each element before the elements are summed up.
`ityr::transform_reduce()` supports general reduction operations beyond summation. Users can define their own reducers by providing an associative reduction operator (commutativity is not required in Itoyori) and an identity element (which together constitute a monoid).
TODO: write a document for reducers
Full code example to calculate AXPY and show the result's L2 norm:
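(A condensed sketch combining `axpy()` and `l2norm()` from above; the `ityr::global_vector` constructor with an initial element value is an assumption.)

```cpp
#include "ityr/ityr.hpp"
#include <cmath>
#include <cstdio>

// axpy() and l2norm() as defined above

int main() {
  ityr::init();

  std::size_t n = 1 << 20;
  double a = 0.1;

  // Collectively allocate the vectors, with every element initialized to 1.0
  ityr::global_vector<double> x_vec(ityr::global_vector_options{.collective = true}, n, 1.0);
  ityr::global_vector<double> y_vec(ityr::global_vector_options{.collective = true}, n, 1.0);

  ityr::global_span<double> xs(x_vec.data(), x_vec.size());
  ityr::global_span<double> ys(y_vec.data(), y_vec.size());

  double norm = ityr::root_exec([=] {
    axpy(a, xs, ys);
    return l2norm(ys);
  });

  if (ityr::is_master()) {
    std::printf("L2 norm: %f\n", norm);
  }

  ityr::fini();
}
```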
It is recommended to read Pitfalls for writing Itoyori programs.