More documentation

Summary: TSIA

Reviewed By: andrewwdye

Differential Revision: D4644734

fbshipit-source-id: 50f5fadd2c5cd04e06a025f5538187ed852e669a
diff --git a/README.md b/README.md
index 53faad5..e478f6b 100644
--- a/README.md
+++ b/README.md
@@ -8,20 +8,28 @@
 
 Where applicable, algorithms have an implementation that works with
 system memory buffers, and one that works with NVIDIA GPU memory
-buffers. In the latter case, if the InfiniBand transport is used,
-GPUDirect can be used to accelerate cross machine GPU-to-GPU memory
-transfers.
+buffers. In the latter case, it is not necessary to copy memory between
+host and device; this is taken care of by the algorithm implementations.
 
 ## Requirements
-Gloo is built to run on Linux and has no hard dependencies other than libc.
+Gloo is built to run on Linux and has no hard dependencies other than libstdc++.
 
 Optional dependencies are:
-* cuda -- for CUDA algorithms, tests, and benchmark
-* googletest -- to build and run tests
-* eigen -- for fast floating point routines
-* hiredis -- for coordinating machine rendezvous through Redis
+* **[CUDA][cuda] and [NCCL][nccl]** -- for CUDA-aware algorithms, tests, and benchmark
+* **[Google Test][gtest]** -- to build and run tests
+* **[Eigen][eigen]** -- for fast floating point routines
+* **[Hiredis][hiredis]** -- for coordinating machine rendezvous through Redis
 
-## Usage
+[cuda]: http://www.nvidia.com/object/cuda_home_new.html
+[nccl]: https://github.com/nvidia/nccl
+[gtest]: https://github.com/google/googletest
+[eigen]: http://eigen.tuxfamily.org
+[hiredis]: https://github.com/redis/hiredis
+
+## Documentation
+Please refer to [docs/](docs/) for detailed documentation.
+
+## Building
 You can build Gloo using CMake.
 
 Since it is a library, it is most convenient to vendor it in your own
@@ -45,8 +53,57 @@
 ls -l gloo/gloo_{test,benchmark}
 ```
 
-## Documentation
-Please refer to [docs/](docs/) for detailed documentation.
+## Benchmarking
+The benchmark tool depends on 1) Eigen for floating point math and 2)
+Redis/Hiredis for rendezvous. The benchmark tool for CUDA algorithms
+also depends on both CUDA and NCCL.
+
+To run a benchmark:
+1. Copy the benchmark tool to all participating machines.
+2. Start a Redis server on any host (either a client machine or one of
+   the machines participating in the test).
+3. Choose a unique ID for the benchmark run (e.g. generated with the
+   `uuid` tool, or simply some number).
+4. On each machine, run (or pass `--help` for more options):
+
+    ``` text
+    ./benchmark \
+      --size <number of machines> \
+      --rank <index of this machine, starting at 0> \
+      --redis-host <Redis host> \
+      --redis-port <Redis port> \
+      --prefix <unique identifier for this run> \
+      --transport tcp \
+      --elements <number of elements; -1 for a sweep> \
+      --iteration-time 1s \
+      allreduce_ring_chunked
+    ```
+Example output (running on 4 machines with a 40GbE network):
+
+``` text
+   elements   min (us)   p50 (us)   p99 (us)   max (us)    samples
+          1        195        263        342        437       3921
+          2        195        261        346        462       4039
+          5        197        261        339        402       3963
+         10        197        263        338        398       3749
+         20        199        268        343        395       4146
+         50        200        265        344        401       3889
+        100        205        265        351        414       3645
+        200        197        264        328        387       3960
+        500        201        264        329        394       4274
+       1000        200        267        330        380       3344
+       2000        205        263        323        395       3682
+       5000        240        335        424        460       3277
+      10000        271        346        402        457       2721
+      20000        283        358        392        428       2719
+      50000        342        438        495        649       1654
+     100000        413        487        669        799       1687
+     200000       1113       1450       1837       2801        669
+     500000       1099       1294       1665       1959        560
+    1000000       1858       2286       2779       6100        320
+    2000000       3546       3993       4364       4886        252
+    5000000      10030      10608      11106      11628         92
+```
 
 ## License
 Gloo is BSD-licensed. We also provide an additional patent grant.
diff --git a/docs/overview.md b/docs/overview.md
deleted file mode 100644
index 173309e..0000000
--- a/docs/overview.md
+++ /dev/null
@@ -1,2 +0,0 @@
-# Overview
-TBD
diff --git a/docs/readme.md b/docs/readme.md
new file mode 100644
index 0000000..ec1aeb7
--- /dev/null
+++ b/docs/readme.md
@@ -0,0 +1,85 @@
+# Gloo documentation
+
+Documentation is split by domain. This file contains a general
+overview of these domains and how they interact.
+
+## Index
+
+* [Overview](readme.md) -- this file
+
+* [Transport details](transport.md) -- the transport API and its
+  implementations
+
+* [CUDA integration](cuda.md) -- integration of CUDA-aware Gloo
+  algorithms with existing CUDA code
+
+* [Latency optimization](latency.md) -- a series of tips and tricks to
+  improve algorithm latency
+
+## Overview
+
+Gloo algorithms are collective algorithms, meaning they can run in
+parallel across two or more processes/machines. To be able to execute
+across multiple machines, the processes first need to find each other.
+We call this _rendezvous_ and it is the first thing to address when
+integrating Gloo into your code base.
+
+Once rendezvous completes, participating machines have set up
+connections to one another, either in a full mesh (every machine has a
+bidirectional communication channel to every other machine), or some
+subset. The required connectivity between machines depends on the type
+of algorithm that is used. For example, a ring algorithm only needs
+communication channels to a machine's neighbors.
+
+Every participating process knows the number of participating
+processes, and its _rank_ (or 0-based index) within the list of
+participating processes. This state, as well as the state needed to
+store the persistent communication channels, is stored in a
+`gloo::Context` class. Gloo does not maintain global state or
+thread-local state. This means that you can set up as many contexts as
+needed, and introduce as much parallelism as needed by your
+application.
+
+## Rendezvous
+
+The rendezvous process needs to happen exactly once per Gloo context.
+It makes participating Gloo processes exchange details for setting up
+their communication channels. For example, when the TCP transport is
+used, processes exchange IP address and port number details of
+listening sockets.
+
+Rendezvous is abstracted as a key/value interface to a store that is
+accessible by all participating processes. Every process is
+responsible for setting a number of keys and will wait until its
+peers have set their keys. The values stored against these keys hold
+the information that is passed to the transport layer.
+
+This interface is defined in [`store.h`](../gloo/rendezvous/store.h).
+
+### HashStore
+
+The [HashStore](../gloo/rendezvous/hash_store.cc) is an in-process
+implementation of this interface. Realistically, it is only useful in
+integration tests.
+
+### RedisStore
+
+The [RedisStore](../gloo/rendezvous/redis_store.cc) implementation uses
+the Hiredis library to set/get values against a Redis server. This
+server needs to be accessible to all participating machines.
+
+Since the keys used by the Redis implementation are accessible to any
+process using that server -- which would prevent usage for concurrent
+rendezvous execution -- the
+[PrefixStore](../gloo/rendezvous/prefix_store.cc) can be used to scope
+rendezvous to a particular namespace.
+
+### ...
+
+Any class that inherits from the `gloo::rendezvous::Store` abstract
+base class can be used for rendezvous.
+
+## Anything else?
+
+If you find particular documentation is missing, please consider
+[contributing](../CONTRIBUTING.md).