Index of algorithms provided by Gloo and their semantics.
Variables used: P denotes the number of processes participating in an algorithm, and N the number of arrays (buffers) per process.
Terms used:
Compute sum of N arrays per process across P processes. This computation happens in place; all input arrays contain the resulting sum after the algorithm completes.
There are 3 phases to each implementation of this algorithm:
1. Local reduction of the N input arrays on each process
2. Allreduce of the local result between processes
3. Broadcast of the final result back into the N arrays on each process
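To make the in-place semantics concrete, here is a minimal serial sketch (plain C++, not Gloo code): every one of the N arrays on each of the P processes ends up holding the elementwise sum of all N*P inputs. The sizes P, N and S are arbitrary example values.

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

int main() {
  const std::size_t P = 4;  // number of processes (simulated here)
  const std::size_t N = 2;  // arrays per process
  const std::size_t S = 8;  // elements per array

  // inputs[p][n] is the n-th array of process p.
  std::vector<std::vector<std::vector<float>>> inputs(
      P, std::vector<std::vector<float>>(N, std::vector<float>(S, 1.0f)));

  // Phases 1 and 2 collapse into a single serial sum: compute the global
  // elementwise sum over all N*P arrays.
  std::vector<float> sum(S, 0.0f);
  for (const auto& proc : inputs) {
    for (const auto& arr : proc) {
      for (std::size_t i = 0; i < S; i++) {
        sum[i] += arr[i];
      }
    }
  }

  // Phase 3: write the result back in place into every input array.
  for (auto& proc : inputs) {
    for (auto& arr : proc) {
      arr = sum;
    }
  }

  std::cout << "each element is now " << inputs[0][0][0] << " (= N*P)\n";
  return 0;
}
```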
For allreduce_ring, phase 2 passes the full buffer around a ring: in the first step every process sends its local reduction to its right neighbor, and in each subsequent step it forwards the buffer it received from its left neighbor. Every received buffer is reduced into the local result, so after P-1 steps every process holds the complete sum.
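The following single-process simulation is a rough sketch of that ring pattern, with the P "processes" represented as in-memory buffers; Gloo's actual implementation performs these steps over its transport layer, which is omitted here.

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

int main() {
  const std::size_t P = 4;
  const std::size_t S = 4;

  // result[p] starts as process p's local reduction (here: all elements == p).
  std::vector<std::vector<float>> result(P, std::vector<float>(S));
  for (std::size_t p = 0; p < P; p++) {
    result[p].assign(S, static_cast<float>(p));
  }

  // inflight[p] is the buffer process p forwards on the next step.
  std::vector<std::vector<float>> inflight = result;

  for (std::size_t step = 0; step + 1 < P; step++) {
    std::vector<std::vector<float>> received(P);
    for (std::size_t p = 0; p < P; p++) {
      received[(p + 1) % P] = inflight[p];  // send to right neighbor
    }
    for (std::size_t p = 0; p < P; p++) {
      for (std::size_t i = 0; i < S; i++) {
        result[p][i] += received[p][i];     // reduce the received buffer
      }
    }
    inflight = received;                    // forward what was just received
  }

  std::cout << "process 0, element 0: " << result[0][0]
            << " (expected " << (P * (P - 1)) / 2 << ")\n";
  return 0;
}
```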
For allreduce_ring_chunked, the buffer is divided into chunks and phase 2 is implemented in 2 sub-phases: a reduce-scatter followed by an allgather.
With 2*P chunks and two sub-phases, we arrive at 4*P communication steps.
These sub-phases are implemented as follows (roughly):
First: chunks are passed around the ring and reduced as they travel, so that when this sub-phase completes every process holds the fully reduced result for its share of the chunks (the reduce-scatter).
Second: the fully reduced chunks are passed around the ring once more and copied into place, so that every process ends up with the complete result (the allgather).
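As an illustration, the sketch below simulates the generic chunked-ring pattern in a single process, using one chunk per rank for brevity (an assumption made here for simplicity; the actual implementation uses 2*P chunks and overlaps sends and receives).

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

int main() {
  const int P = 4;   // ranks
  const int C = P;   // chunks (simplified; the implementation uses 2*P)

  // data[r][c] is rank r's local value for chunk c (scalar chunks for brevity).
  std::vector<std::vector<float>> data(P, std::vector<float>(C));
  for (int r = 0; r < P; r++)
    for (int c = 0; c < C; c++)
      data[r][c] = static_cast<float>(r + 1);

  // Sub-phase 1: reduce-scatter. In step s, rank r sends chunk (r - s) mod P
  // to its right neighbor, which reduces it into its own copy.
  for (int s = 0; s < P - 1; s++) {
    std::vector<float> sent(P);
    for (int r = 0; r < P; r++)
      sent[r] = data[r][((r - s) % P + P) % P];
    for (int r = 0; r < P; r++) {
      int left = (r - 1 + P) % P;
      int c = ((left - s) % P + P) % P;
      data[r][c] += sent[left];
    }
  }
  // Now rank r holds the fully reduced chunk (r + 1) mod P.

  // Sub-phase 2: allgather. The reduced chunks travel around the ring once
  // more and overwrite the stale copies.
  for (int s = 0; s < P - 1; s++) {
    std::vector<float> sent(P);
    for (int r = 0; r < P; r++)
      sent[r] = data[r][((r + 1 - s) % P + P) % P];
    for (int r = 0; r < P; r++) {
      int left = (r - 1 + P) % P;
      int c = ((left + 1 - s) % P + P) % P;
      data[r][c] = sent[left];
    }
  }

  std::cout << "rank 0, chunk 0: " << data[0][0]
            << " (expected " << P * (P + 1) / 2 << ")\n";
  return 0;
}
```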
For allreduce_halving_doubling, phase 2 is implemented in two sub-phases:
First, a reduce-scatter is performed in lg(P) steps using a recursive vector-halving, distance-doubling approach. In the first step of this algorithm processes communicate in pairs (rank 0 with 1, 2 with 3, etc.), sending and receiving for different halves of their input buffer. For example, process 0 sends the second half of its buffer to process 1 and receives and reduces data for the first half of the buffer from process 1. A reduction over the received data is performed before proceeding to the next communication step, where the distance to the destination rank is doubled while the data sent and received is halved. After the reduce-scatter phase is finished, each process has a portion of the final reduced array.
The second sub-phase of Phase 2 performs an allgather. This is again done using a recursive algorithm, retracing the communication steps from the reduce-scatter in reverse, but this time simply concatenating the received data at each step. At each process and step, the portion of the buffer that was being sent in the reduce-scatter is received in the allgather, and the portion that was being received in the reduce-scatter is now sent.
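The sketch below simulates both sub-phases for a power-of-two number of processes, tracking the offset and length of each rank's live segment. It is a single-process illustration of the pattern described above, not Gloo's implementation; all sizes are assumed example values.

```cpp
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <vector>

int main() {
  const int P = 8;  // must be a power of two for this sketch
  const int S = 8;  // elements per buffer (divisible by P)

  std::vector<std::vector<float>> buf(P, std::vector<float>(S));
  for (int r = 0; r < P; r++)
    buf[r].assign(S, static_cast<float>(r + 1));

  // Per-rank segment that is still "live" in the reduce-scatter.
  std::vector<int> off(P, 0), len(P, S);

  // Sub-phase 1: reduce-scatter (lg(P) steps, distance doubling, vector halving).
  for (int dist = 1; dist < P; dist *= 2) {
    auto snap = buf;  // sends within a step use pre-step data
    for (int r = 0; r < P; r++) {
      int peer = r ^ dist;
      int half = len[r] / 2;
      // The lower rank keeps the first half, the higher rank the second half.
      int keepOff = (r < peer) ? off[r] : off[r] + half;
      // Reduce the peer's copy of the kept half into our buffer.
      for (int i = keepOff; i < keepOff + half; i++)
        buf[r][i] += snap[peer][i];
      off[r] = keepOff;
      len[r] = half;
    }
  }
  // Each rank now owns a fully reduced segment of length S / P.

  // Sub-phase 2: allgather, retracing the steps in reverse (distance halving,
  // segment doubling). The half that was sent before is now received.
  for (int dist = P / 2; dist >= 1; dist /= 2) {
    auto snap = buf;
    for (int r = 0; r < P; r++) {
      int peer = r ^ dist;
      // Copy the peer's owned segment, which is adjacent to ours.
      for (int i = off[peer]; i < off[peer] + len[peer]; i++)
        buf[r][i] = snap[peer][i];
    }
    for (int r = 0; r < P; r++) {
      int peer = r ^ dist;
      off[r] = std::min(off[r], off[peer]);
      len[r] *= 2;
    }
  }

  std::cout << "rank 3, element 5: " << buf[3][5]
            << " (expected " << P * (P + 1) / 2 << ")\n";
  return 0;
}
```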
Across the steps of the reduce-scatter, data is received into different buffers and there is no potential for race conditions. However, mirrored steps of the reduce-scatter and allgather (e.g. the last step of the reduce-scatter and the first step of the allgather) write into the same buffers. To prevent race conditions, a notification is sent after data is processed in the reduce-scatter sub-phase. This notification is processed in the allgather sub-phase prior to performing the send. In the majority of cases these notification messages will arrive long before the step of the allgather where they are processed, so their effect on performance should be minimal.
When running on a non-power-of-two number of processes, the algorithm works by breaking execution up into blocks whose sizes are powers of two and communicating between blocks after the intra-block reduce-scatter. Non-power-of-two cases will have some degree of load imbalance compared to the power-of-two case, but cases with few large blocks (e.g. 8 + 4 or 16 + 8) should still perform relatively well.
The halving-doubling / binary-blocks algorithm is described and analyzed in (Thakur et al., Optimization of Collective Communication Operations in MPICH, IJHPCA, 2005).
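For the non-power-of-two case, the snippet below shows one way to decompose a process count into power-of-two blocks, e.g. 12 = 8 + 4 or 11 = 8 + 2 + 1. The helper name is illustrative only, not part of Gloo's API.

```cpp
#include <iostream>
#include <vector>

// Decompose p into power-of-two blocks, largest first.
std::vector<int> binaryBlocks(int p) {
  std::vector<int> blocks;
  for (int bit = 1 << 30; bit > 0; bit >>= 1) {
    if (p & bit) {
      blocks.push_back(bit);
    }
  }
  return blocks;
}

int main() {
  for (int p : {12, 11, 16}) {
    std::cout << p << " =";
    for (int b : binaryBlocks(p)) {
      std::cout << " " << b;
    }
    std::cout << "\n";
  }
  return 0;
}
```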
CUDA-aware implementation of allreduce_ring. GPU side buffers are copied to system memory in parallel, prior to running the local reduction on the CPU. After phase 2 completes, the CPU side result is copied back to the GPU side buffers in parallel.
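The sketch below illustrates that copy/reduce/copy pattern with plain CUDA runtime calls. It is a standalone illustration with assumed buffer counts and sizes, not Gloo's implementation; in practice the host staging buffers would be pinned (cudaMallocHost) so the asynchronous copies can truly overlap.

```cpp
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

int main() {
  const int numBuffers = 2;       // N GPU buffers on this process (assumed)
  const size_t count = 1 << 20;   // elements per buffer (assumed)

  std::vector<float*> gpuBufs(numBuffers);
  std::vector<std::vector<float>> hostBufs(numBuffers,
                                           std::vector<float>(count));
  std::vector<cudaStream_t> streams(numBuffers);

  for (int i = 0; i < numBuffers; i++) {
    cudaMalloc(&gpuBufs[i], count * sizeof(float));
    cudaMemset(gpuBufs[i], 0, count * sizeof(float));
    cudaStreamCreate(&streams[i]);
  }

  // Phase 1: copy GPU buffers to system memory in parallel (one stream each).
  for (int i = 0; i < numBuffers; i++) {
    cudaMemcpyAsync(hostBufs[i].data(), gpuBufs[i], count * sizeof(float),
                    cudaMemcpyDeviceToHost, streams[i]);
  }
  for (int i = 0; i < numBuffers; i++) {
    cudaStreamSynchronize(streams[i]);
  }

  // Local CPU reduction into hostBufs[0]; phase 2 (the CPU-side ring
  // allreduce across processes) would then run on hostBufs[0].
  for (int i = 1; i < numBuffers; i++) {
    for (size_t j = 0; j < count; j++) {
      hostBufs[0][j] += hostBufs[i][j];
    }
  }

  // Phase 3: copy the CPU side result back to all GPU buffers in parallel.
  for (int i = 0; i < numBuffers; i++) {
    cudaMemcpyAsync(gpuBufs[i], hostBufs[0].data(), count * sizeof(float),
                    cudaMemcpyHostToDevice, streams[i]);
  }
  for (int i = 0; i < numBuffers; i++) {
    cudaStreamSynchronize(streams[i]);
    cudaStreamDestroy(streams[i]);
    cudaFree(gpuBufs[i]);
  }

  printf("done\n");
  return 0;
}
```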
CUDA-aware implementation of allreduce_ring_chunked. GPU side buffers are reduced into GPU buffer 0 (using NCCL). The result is copied to system memory asynchronously. After phase 2 completes, the CPU side result is copied back to GPU buffer 0 and then broadcast to the other GPU buffers in parallel (using NCCL).
Both the local reduction in phase 1 and the broadcast in phase 3 are pipelined with the communication steps where this data is needed or becomes available.
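A standalone sketch of the NCCL reduce-to-one / broadcast-from-one pattern for a single process driving multiple GPUs is shown below. The communicator setup and sizes are assumptions for illustration only; Gloo's implementation manages streams, events, and the pipelining described above differently. Requires NCCL 2 for the group calls.

```cpp
#include <cuda_runtime.h>
#include <nccl.h>
#include <cstdio>
#include <vector>

int main() {
  int numGpus = 0;
  cudaGetDeviceCount(&numGpus);
  if (numGpus < 2) return 0;
  const size_t count = 1 << 20;  // elements per buffer (assumed)

  std::vector<ncclComm_t> comms(numGpus);
  std::vector<float*> bufs(numGpus);
  std::vector<cudaStream_t> streams(numGpus);
  ncclCommInitAll(comms.data(), numGpus, nullptr);  // one comm per local GPU

  for (int i = 0; i < numGpus; i++) {
    cudaSetDevice(i);
    cudaMalloc(&bufs[i], count * sizeof(float));
    cudaStreamCreate(&streams[i]);
  }

  // Phase 1: reduce all GPU buffers into GPU buffer 0 (NCCL reduce to root 0).
  ncclGroupStart();
  for (int i = 0; i < numGpus; i++) {
    ncclReduce(bufs[i], i == 0 ? bufs[0] : nullptr, count, ncclFloat, ncclSum,
               /*root=*/0, comms[i], streams[i]);
  }
  ncclGroupEnd();

  // Here the reduced buffer on GPU 0 would be copied to system memory and
  // phase 2 (the CPU-side chunked ring across processes) would run.

  // Phase 3: broadcast the result from GPU buffer 0 to the other GPUs.
  ncclGroupStart();
  for (int i = 0; i < numGpus; i++) {
    ncclBcast(bufs[i], count, ncclFloat, /*root=*/0, comms[i], streams[i]);
  }
  ncclGroupEnd();

  for (int i = 0; i < numGpus; i++) {
    cudaSetDevice(i);
    cudaStreamSynchronize(streams[i]);
    cudaFree(bufs[i]);
    cudaStreamDestroy(streams[i]);
    ncclCommDestroy(comms[i]);
  }
  printf("done\n");
  return 0;
}
```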
CUDA-aware implementation of allreduce_halving_doubling, with no pipelining between the reduction/broadcast steps and the communication.
CUDA-aware implementation of allreduce_halving_doubling, with pipelining between the local reduction/broadcast steps and the communication. The local reduction step is split into two steps (since the first communication step sends half the buffer). The final broadcast is pipelined across lg(P) steps, with each step corresponding to a receive during the allgather phase.
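The sketch below shows the pipelining idea for the final broadcast only: each allgather receive is immediately followed by an asynchronous host-to-device copy of just the received segment, so the copy overlaps the next communication step. The recvSegment helper and the sequential segment offsets are simplifications assumed for this illustration, not Gloo APIs.

```cpp
#include <cuda_runtime.h>
#include <vector>

// Hypothetical placeholder: receive `len` floats for one allgather step into
// `dst` (in a real run this would be a transport receive from a peer).
static void recvSegment(float* dst, size_t len) {
  for (size_t i = 0; i < len; i++) dst[i] = 1.0f;
}

int main() {
  const int P = 8;            // power-of-two number of processes (assumed)
  const size_t S = 1 << 20;   // elements per buffer (assumed)

  std::vector<float> host(S, 0.0f);
  float* devBuf = nullptr;
  cudaMalloc(&devBuf, S * sizeof(float));
  cudaStream_t copyStream;
  cudaStreamCreate(&copyStream);

  // The received segment doubles each allgather step; as soon as a segment
  // has arrived, its device copy is issued asynchronously so it overlaps
  // with the next communication step.
  size_t off = 0, len = S / P;
  for (int step = 0; (1 << step) < P; step++) {
    recvSegment(host.data() + off, len);               // communication step
    cudaMemcpyAsync(devBuf + off, host.data() + off,   // pipelined copy
                    len * sizeof(float), cudaMemcpyHostToDevice, copyStream);
    off += len;
    len *= 2;
  }
  cudaStreamSynchronize(copyStream);

  cudaFree(devBuf);
  cudaStreamDestroy(copyStream);
  return 0;
}
```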
Synchronization point between processes.
In the all-to-all variant, every process sends a notification to every other process and then waits for a notification from every other process.
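As a rough illustration of this message pattern, the sketch below simulates the all-to-all barrier in a single process, with threads standing in for processes and atomic counters standing in for notification messages (both assumptions made for the example, not Gloo's transport mechanism).

```cpp
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
  const int P = 4;
  // received[r] counts notifications that have arrived at "process" r.
  std::vector<std::atomic<int>> received(P);
  for (auto& c : received) c = 0;

  auto barrier = [&](int rank) {
    // Send a notification to every other process.
    for (int r = 0; r < P; r++) {
      if (r != rank) received[r]++;
    }
    // Wait until a notification has arrived from every other process.
    while (received[rank].load() < P - 1) {
      std::this_thread::yield();
    }
    std::printf("rank %d passed the barrier\n", rank);
  };

  std::vector<std::thread> threads;
  for (int r = 0; r < P; r++) threads.emplace_back(barrier, r);
  for (auto& t : threads) t.join();
  return 0;
}
```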
In the rooted variant, non-root processes send a notification to the root and wait for a notification from the root, while the root process waits for notifications from the other P-1 processes and then sends a notification to each of them.
Broadcast the contents of a buffer on one process (the root) to the other P-1 processes.
Non-root processes: receive buffer from root.
Root process: send buffer to P-1 processes.
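A minimal sketch of the one-to-all message pattern is shown below; sendTo and recvFrom are hypothetical stand-ins for point-to-point transport calls (defined as no-op stubs so the example compiles), not Gloo APIs.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical point-to-point calls, stubbed out for illustration.
static void sendTo(int /*dstRank*/, const float* /*buf*/, std::size_t /*len*/) {}
static void recvFrom(int /*srcRank*/, float* /*buf*/, std::size_t /*len*/) {}

// Root sends its buffer to the other P-1 processes; everyone else receives.
void broadcastOneToAll(int rank, int rootRank, int P,
                       float* buf, std::size_t len) {
  if (rank == rootRank) {
    for (int r = 0; r < P; r++) {
      if (r != rootRank) sendTo(r, buf, len);
    }
  } else {
    recvFrom(rootRank, buf, len);
  }
}

int main() {
  std::vector<float> buf(16, 1.0f);
  broadcastOneToAll(/*rank=*/0, /*rootRank=*/0, /*P=*/4, buf.data(), buf.size());
  return 0;
}
```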