More documentation
Summary: TSIA
Reviewed By: andrewwdye
Differential Revision: D4644734
fbshipit-source-id: 50f5fadd2c5cd04e06a025f5538187ed852e669a
diff --git a/README.md b/README.md
index 53faad5..e478f6b 100644
--- a/README.md
+++ b/README.md
@@ -8,20 +8,28 @@
Where applicable, algorithms have an implementation that works with
system memory buffers, and one that works with NVIDIA GPU memory
-buffers. In the latter case, if the InfiniBand transport is used,
-GPUDirect can be used to accelerate cross machine GPU-to-GPU memory
-transfers.
+buffers. In the latter case, it is not necessary to copy memory between
+host and device; this is taken care of by the algorithm implementations.
## Requirements
-Gloo is built to run on Linux and has no hard dependencies other than libc.
+Gloo is built to run on Linux and has no hard dependencies other than libstdc++.
Optional dependencies are:
-* cuda -- for CUDA algorithms, tests, and benchmark
-* googletest -- to build and run tests
-* eigen -- for fast floating point routines
-* hiredis -- for coordinating machine rendezvous through Redis
+* **[CUDA][cuda] and [NCCL][nccl]** -- for CUDA aware algorithms, tests, and benchmark
+* **[Google Test][gtest]** -- to build and run tests
+* **[Eigen][eigen]** -- for fast floating point routines
+* **[Hiredis][hiredis]** -- for coordinating machine rendezvous through Redis
-## Usage
+[cuda]: http://www.nvidia.com/object/cuda_home_new.html
+[nccl]: https://github.com/nvidia/nccl
+[gtest]: https://github.com/google/googletest
+[eigen]: http://eigen.tuxfamily.org
+[hiredis]: https://github.com/redis/hiredis
+
+## Documentation
+Please refer to [docs/](docs/) for detailed documentation.
+
+## Building
You can build Gloo using CMake.
Since it is a library, it is most convenient to vendor it in your own
@@ -45,8 +53,57 @@
ls -l gloo/gloo_{test,benchmark}
```
-## Documentation
-Please refer to [docs/](docs/) for detailed documentation.
+## Benchmarking
+The benchmark tool depends on 1) Eigen for floating point math and 2)
+Redis/Hiredis for rendezvous. The benchmark tool for CUDA algorithms
+obviously also depends on both CUDA and NCCL.
+
+To run a benchmark:
+1. Copy the benchmark tool to all participating machines
+2. Start a Redis server on any host (either a client machine or one of
+ the machines participating in the test).
+3. Determine some unique ID for the benchmark run (e.g. the `uuid`
+ tool or some number).
+4. On each machine, run (or pass `--help` for more options):
+
+ ``` text
+ ./benchmark \
+ --size <number of machines> \
+ --rank <index of this machine, starting at 0> \
+ --redis-host <Redis host> \
+ --redis-port <Redis port> \
+ --prefix <unique identifier for this run> \
+ --transport tcp \
+ --elements <number of elements; -1 for a sweep> \
+ --iteration-time 1s \
+ allreduce_ring_chunked
+ ```
+Example output (running on 4 machines with a 40GbE network):
+
+``` text
+ elements min (us) p50 (us) p99 (us) max (us) samples
+ 1 195 263 342 437 3921
+ 2 195 261 346 462 4039
+ 5 197 261 339 402 3963
+ 10 197 263 338 398 3749
+ 20 199 268 343 395 4146
+ 50 200 265 344 401 3889
+ 100 205 265 351 414 3645
+ 200 197 264 328 387 3960
+ 500 201 264 329 394 4274
+ 1000 200 267 330 380 3344
+ 2000 205 263 323 395 3682
+ 5000 240 335 424 460 3277
+ 10000 271 346 402 457 2721
+ 20000 283 358 392 428 2719
+ 50000 342 438 495 649 1654
+ 100000 413 487 669 799 1687
+ 200000 1113 1450 1837 2801 669
+ 500000 1099 1294 1665 1959 560
+ 1000000 1858 2286 2779 6100 320
+ 2000000 3546 3993 4364 4886 252
+ 5000000 10030 10608 11106 11628 92
+```
## License
Gloo is BSD-licensed. We also provide an additional patent grant.
diff --git a/docs/overview.md b/docs/overview.md
deleted file mode 100644
index 173309e..0000000
--- a/docs/overview.md
+++ /dev/null
@@ -1,2 +0,0 @@
-# Overview
-TBD
diff --git a/docs/readme.md b/docs/readme.md
new file mode 100644
index 0000000..ec1aeb7
--- /dev/null
+++ b/docs/readme.md
@@ -0,0 +1,85 @@
+# Gloo documentation
+
+Documentation is split by domain. This file contains a general
+overview of these domains and how they interact.
+
+## Index
+
+* [Overview](readme.md) -- this file
+
+* [Transport details](transport.md) -- the transport API and its
+ implementations
+
+* [CUDA integration](cuda.md) -- integration of CUDA aware Gloo
+ algorithms with existing CUDA code
+
+* [Latency optimization](latency.md) -- series of tips and tricks to
+ improve algorithm latency
+
+## Overview
+
+Gloo algorithms are collective algoritms, meaning they can run in
+parallel across two or more processes/machines. To be able to execute
+across multiple machines, they first need to find each other. We call
+this _rendezvous_ and it is the first thing to address when
+integrating Gloo into your code base.
+
+Once rendezvous completes, participating machines have setup
+connections to one another, either in a full mesh (every machine has a
+bidirectional communication channel to every other machine), or some
+subset. The required connectivity between machines depends on the type
+of algorithm that is used. For example, a ring algorithm only needs
+communication channels to a machine's neighbors.
+
+Every participating process knows about the number of participating
+processes, and its _rank_ (or 0-based index) within the list of
+participating processes. This state, as well as the state needed to
+store the persistent communication channels, is stored in a
+`gloo::Context` class. Gloo does not maintain global state or
+thread-local state. This means that you can setup as many contexts as
+needed, and introduce as much parallelism as needed by your
+application.
+
+## Rendezvous
+
+The rendezvous process needs to happen exactly once per Gloo context.
+It makes participating Gloo processes exchange details for setting up
+their communication channels. For example, when the TCP transport is
+used, processes exchange IP address and port number details of
+listening sockets.
+
+Rendezvous is abstracted as a key/value interface to a store that is
+accessible by all participating processes. Every process is
+responsible for setting a number of keys and will wait until their
+peers have set their keys. The values stored against these keys hold
+the information that is passed to the transport layer.
+
+This interface is defined in [`store.h`](../gloo/rendezvous/store.h).
+
+### HashStore
+
+The [HashStore](../gloo/rendezvous/hash_store.cc) is an in-process
+implementation of this interface. This is realistically not useful in
+any application but integration tests.
+
+### RedisStore
+
+The [RedisStore](../gloo/rendezvous/redis_store.cc) implementation uses
+the Hiredis library to set/get values against a Redis server. This
+server needs to be accessible to all participating machines.
+
+Since the keys used by the Redis implementation are accessible to any
+process using that server -- which would prevent usage for concurrent
+rendezvous executation -- the
+[PrefixStore](../gloo/rendezvous/prefix_store.cc) can be used to scope
+rendezvous to a particular namespace.
+
+### ...
+
+Any class that inherits from the `gloo::rendezvous::Store` abstract
+base class can be used for rendezvous.
+
+## Anything else?
+
+If you find particular documentation is missing, please consider
+[contributing](../CONTRIBUTING.md).