# Algorithms
Index of algorithms provided by Gloo and their semantics.
Variables used:
* **P**: Number of processes/machines
* **N**: Number of buffers per process
* **S**: Size of buffer
Terms used:
* **Communication steps**: the number of sequential communication
  rounds the algorithm performs. Every communication step has some
  latency, depending on the transport. Therefore, the fewer steps an
  algorithm uses, the better it is suited towards higher latency
  transports. Lower latency transports better tolerate more
  communication steps.
* **Bytes on the wire**: total number of bytes transmitted per
participating process. The higher this number, the sooner an
algorithm will be bound by the network bandwidth.
## Allreduce
Compute sum of N arrays per process across P processes. This
computation happens in place; all input arrays contain the resulting
sum after the algorithm completes.
There are 3 phases to each implementation of this algorithm:
1. Local reduction of N buffers
2. Allreduce between processes
3. Broadcast result back to N buffers
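As a rough illustration of these semantics, the sketch below simulates all
three phases in a single process, with no real communication. All names are
hypothetical; phase 2 is collapsed into a direct sum, standing in for the
ring exchanges described in the following sections.

```cpp
#include <cstdio>
#include <vector>

// Hypothetical single-process simulation: P processes, each with N buffers
// of S floats. After the three phases, every buffer on every process holds
// the element-wise sum of all P*N input buffers.
int main() {
  const int P = 4, N = 2, S = 3;
  // data[p][n][i] is element i of buffer n on process p.
  std::vector<std::vector<std::vector<float>>> data(
      P, std::vector<std::vector<float>>(N, std::vector<float>(S)));
  for (int p = 0; p < P; p++)
    for (int n = 0; n < N; n++)
      for (int i = 0; i < S; i++)
        data[p][n][i] = p * 100 + n * 10 + i;

  // Phase 1: local reduction of the N buffers into buffer 0.
  for (int p = 0; p < P; p++)
    for (int n = 1; n < N; n++)
      for (int i = 0; i < S; i++)
        data[p][0][i] += data[p][n][i];

  // Phase 2: allreduce between processes (a direct sum here, standing in
  // for the ring exchange done by the real implementations).
  std::vector<float> sum(S, 0.0f);
  for (int p = 0; p < P; p++)
    for (int i = 0; i < S; i++)
      sum[i] += data[p][0][i];

  // Phase 3: broadcast the result back into all N buffers.
  for (int p = 0; p < P; p++)
    for (int n = 0; n < N; n++)
      data[p][n] = sum;

  std::printf("every buffer now holds the same sums, e.g. %.0f\n",
              data[2][1][0]);
  return 0;
}
```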
### allreduce_ring
* Communication steps: P-1
* Bytes on the wire: P*S
Phase 2 is implemented as follows:
1. Transmit local result to right side neighbor
2. Receive buffer from left side neighbor and reduce into local result
3. Transmit incoming buffer to right side neighbor
4. Repeat 2-3 until process has seen all data
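A minimal sketch of these steps, simulating all P processes inside one
function (no real transport; `local[p]` stands for the locally reduced
buffer of process p):

```cpp
#include <vector>

// Sketch of phase 2 of allreduce_ring, simulated in a single process.
// local[p] holds the phase-1 result of process p (S floats). After P-1
// steps, every local[p] holds the sum across all processes.
void ringAllreduce(std::vector<std::vector<float>>& local) {
  const int P = static_cast<int>(local.size());
  const int S = static_cast<int>(local[0].size());
  // outbox[p] is what process p transmits to its right side neighbor.
  std::vector<std::vector<float>> outbox = local;
  for (int step = 0; step < P - 1; step++) {
    // Steps 1/3: every process transmits its outbox to the right side
    // neighbor (modelled here as a copy into the neighbor's inbox).
    std::vector<std::vector<float>> inbox(P);
    for (int p = 0; p < P; p++)
      inbox[(p + 1) % P] = outbox[p];
    for (int p = 0; p < P; p++) {
      // Step 2: receive from the left side neighbor and reduce into the
      // local result.
      for (int i = 0; i < S; i++)
        local[p][i] += inbox[p][i];
      // The incoming buffer is forwarded unchanged in the next iteration.
      outbox[p] = inbox[p];
    }
  }
}
```

After P-1 iterations every process has seen every other process's
contribution exactly once, which matches the step and byte counts listed
above.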
### allreduce_ring_chunked
* Communication steps: 4*P
* Bytes on the wire: 2*S
Phase 2 is implemented in 2 sub-phases:
1. First, the algorithm iterates over the local reduction,
transmitting chunks of the buffer and reducing at every step. The
number of chunks is equal to 2*P, allowing double buffering to be
used. This means there is always one chunk in flight while
reduction is done on another chunk concurrently. At the end of this
phase, every process holds 1/P-th of the reduced result.
2. Second, the algorithm iterates over the local reduction again, now
broadcasting the local results.
With 2*P chunks and two sub-phases, we arrive at 4*P communication
steps.
These sub-phases are implemented as follows (roughly):
First:
1. Compute offset into local reduction buffer based on process rank
2. Transmit chunk at offset to right side neighbor
3. Receive chunk at offset-1 from left side neighbor and reduce into
local result
4. Subtract 1 from offset, wrapping when needed
5. Repeat 2-4 until process has walked entire buffer
Second:
1. Transmit chunk at offset+1 (containing the global reduction) to
right side neighbor
2. Receive chunk at offset from left side neighbor and copy into local
result
3. Subtract 1 from offset, wrapping when needed
4. Repeat 1-3 until process has walked entire buffer
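A simplified sketch of both sub-phases, again simulating all processes in
one function. For clarity it splits the buffer into P chunks and omits the
double buffering; the actual implementation uses 2*P chunks and slightly
different offset bookkeeping.

```cpp
#include <vector>

// chunks[p][c] is chunk c of the locally reduced buffer on process p.
// First sub-phase: reduce-scatter around the ring; afterwards process p
// holds the fully reduced chunk (p + 1) mod P.
// Second sub-phase: allgather; the reduced chunks travel around the ring
// and are copied instead of reduced.
void ringAllreduceChunked(
    std::vector<std::vector<std::vector<float>>>& chunks) {
  const int P = static_cast<int>(chunks.size());
  const int S = static_cast<int>(chunks[0][0].size());
  auto mod = [P](int x) { return ((x % P) + P) % P; };

  // First sub-phase: each process starts at an offset equal to its rank
  // and walks the buffer backwards, reducing the chunk that arrives from
  // its left side neighbor at every step.
  for (int step = 0; step < P - 1; step++) {
    for (int p = 0; p < P; p++) {
      int c = mod(p - step - 1);  // chunk received from the left neighbor
      for (int i = 0; i < S; i++)
        chunks[p][c][i] += chunks[mod(p - 1)][c][i];
    }
  }

  // Second sub-phase: forward the fully reduced chunks around the ring,
  // copying them into the local result.
  for (int step = 0; step < P - 1; step++) {
    for (int p = 0; p < P; p++) {
      int c = mod(p - step);      // already fully reduced upstream
      chunks[p][c] = chunks[mod(p - 1)][c];
    }
  }
}
```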
### cuda_allreduce_ring
CUDA-aware implementation of `allreduce_ring`. GPU side buffers are
copied to system memory in parallel, prior to running the local
reduction on the CPU. After phase 2 completes, the CPU side result is
copied back to the GPU side buffers in parallel.
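A rough host-side sketch of this staging, using the CUDA runtime API. The
function and parameter names are placeholders, not Gloo's API, and the
CPU-side phases themselves are elided.

```cpp
#include <cstddef>
#include <cuda_runtime.h>
#include <vector>

// Copy N GPU buffers to host memory in parallel (one stream per buffer),
// leave room for the CPU-side phases, then copy the result back to every
// GPU buffer. Assumes the reduced CPU-side result ends up in hostPtrs[0].
void stageAndRestore(const std::vector<float*>& devPtrs,   // N device buffers
                     const std::vector<float*>& hostPtrs,  // N pinned host buffers
                     std::size_t count) {
  const std::size_t bytes = count * sizeof(float);
  std::vector<cudaStream_t> streams(devPtrs.size());
  for (std::size_t n = 0; n < devPtrs.size(); n++) {
    cudaStreamCreate(&streams[n]);
    // Device-to-host copies run on independent streams so they overlap.
    cudaMemcpyAsync(hostPtrs[n], devPtrs[n], bytes,
                    cudaMemcpyDeviceToHost, streams[n]);
  }
  for (auto& s : streams) cudaStreamSynchronize(s);

  // ... phase 1 (local CPU reduction) and phase 2 (ring allreduce) would
  // run on the host copies here ...

  // Copy the CPU-side result back to all GPU buffers in parallel.
  for (std::size_t n = 0; n < devPtrs.size(); n++)
    cudaMemcpyAsync(devPtrs[n], hostPtrs[0], bytes,
                    cudaMemcpyHostToDevice, streams[n]);
  for (auto& s : streams) {
    cudaStreamSynchronize(s);
    cudaStreamDestroy(s);
  }
}
```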
### cuda_allreduce_ring_chunked
CUDA-aware implementation of `allreduce_ring_chunked`. GPU side
buffers are reduced into GPU buffer 0 (using NCCL). The result is
copied to system memory asynchronously. After phase 2 completes, the
CPU side result is copied back to GPU buffer 0, and then broadcast to
the other GPU buffers in parallel (using NCCL).
Both the local reduction in phase 1 and the broadcast in phase 3 are
pipelined with the communication steps where this data is needed or
becomes available.
## Barrier
Synchronization point between processes.
### barrier_all_to_all
* Communication steps: 1
* Bytes on the wire: P
Every process sends a notification to every other process.
Then, it waits for a notification from every other process.
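A small single-machine sketch of this pattern, using one thread per
simulated process and an atomic counter per process in place of the
transport's notification buffers:

```cpp
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

// Simulation of barrier_all_to_all: every "process" (thread) posts a
// notification to every other process, then waits until it has received
// one notification from each of the other P-1 processes.
int main() {
  const int P = 4;
  std::vector<std::atomic<int>> notifications(P);
  for (auto& n : notifications) n.store(0);

  auto process = [&](int rank) {
    // Send a notification to every other process...
    for (int peer = 0; peer < P; peer++)
      if (peer != rank) notifications[peer].fetch_add(1);
    // ...then wait for a notification from every other process.
    while (notifications[rank].load() < P - 1) std::this_thread::yield();
    std::printf("rank %d passed the barrier\n", rank);
  };

  std::vector<std::thread> threads;
  for (int r = 0; r < P; r++) threads.emplace_back(process, r);
  for (auto& t : threads) t.join();
  return 0;
}
```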
### barrier_all_to_one
* Communication steps: 2
* Bytes on the wire: 1 for non-root, P for root
_Non-root processes_: send notification to root, wait for notification
from root.
_Root process_: wait for notification from P-1 processes, send
notification to P-1 processes.
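The same kind of single-machine sketch for the rooted variant, with rank 0
acting as root:

```cpp
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

// Simulation of barrier_all_to_one: non-root ranks notify the root and
// wait for its reply; the root waits for P-1 notifications and then
// replies to every non-root rank.
int main() {
  const int P = 4;
  std::atomic<int> arrived{0};                // notifications seen by root
  std::vector<std::atomic<int>> released(P);  // root's reply per rank
  for (auto& r : released) r.store(0);

  auto process = [&](int rank) {
    if (rank == 0) {
      while (arrived.load() < P - 1) std::this_thread::yield();
      for (int peer = 1; peer < P; peer++) released[peer].store(1);
    } else {
      arrived.fetch_add(1);
      while (released[rank].load() == 0) std::this_thread::yield();
    }
    std::printf("rank %d passed the barrier\n", rank);
  };

  std::vector<std::thread> threads;
  for (int r = 0; r < P; r++) threads.emplace_back(process, r);
  for (auto& t : threads) t.join();
  return 0;
}
```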
## Broadcast
Broadcast contents of buffer on one process to other P-1 processes.
### broadcast_one_to_all
* Communication steps: 1
* Bytes on the wire: P*S
_Non-root processes_: receive buffer from root.
_Root process_: send buffer to P-1 processes.
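A trivial single-process sketch of these semantics, with the send/receive
pair modelled as a plain copy:

```cpp
#include <cstdio>
#include <vector>

// Simulation of broadcast_one_to_all: the root's buffer is copied to the
// other P-1 processes in a single communication step.
int main() {
  const int P = 4, S = 3, root = 0;
  std::vector<std::vector<float>> buffers(P, std::vector<float>(S, 0.0f));
  buffers[root] = {1.0f, 2.0f, 3.0f};

  // Root sends its buffer to the P-1 other processes; each non-root
  // process receives it.
  for (int p = 0; p < P; p++)
    if (p != root) buffers[p] = buffers[root];

  std::printf("buffers[3][2] = %.0f\n", buffers[3][2]);
  return 0;
}
```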