Example workflow for running distributed (sync SGD) ImageNet training in Flow
Summary:
This diff introduces a simplified ImageNet trainer that uses data_parallel_model to parallelize training over GPUs and nodes in a synchronous manner. Flow's gang scheduling is used to launch the nodes, and data_parallel_model handles the synchronization among the gang members.
This example also uses the operator-per-epoch model, where each epoch produces a checkpoint that the follow-up epoch consumes. A rough sketch of this setup follows.
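As context (this code is not part of the diff), here is a minimal sketch of how such a trainer might drive data_parallel_model with a rendezvous, plus an epoch-end checkpoint. Only the rendezvous keys (kv_handler, num_shards, shard_id, engine) are taken from the diff below; the store-handler setup, net names, paths, the tiny FC network, and the builder functions are all illustrative assumptions.

# Minimal sketch, not the actual trainer from this diff. Requires a GPU
# build of Caffe2 with Gloo; anything marked "assumption" is invented here.
from caffe2.python import cnn, core, data_parallel_model, workspace

def add_inputs(model):
    # Placeholder input: constant random data instead of an ImageNet reader.
    model.param_init_net.GaussianFill([], "data", shape=[32, 128])
    model.param_init_net.ConstantFill(
        [], "label", shape=[32], value=0, dtype=core.DataType.INT32)

def forward_pass(model, loss_scale):
    # Tiny stand-in network; a real trainer would build ResNet/AlexNet here.
    fc = model.FC("data", "fc", dim_in=128, dim_out=10)
    softmax, loss = model.SoftmaxWithLoss([fc, "label"], ["softmax", "loss"])
    loss = model.Scale(loss, "loss", scale=loss_scale)
    return [loss]

def param_update(model):
    # Plain SGD via WeightedSum: param += (-lr) * grad.
    neg_lr = model.param_init_net.ConstantFill(
        [], "neg_lr", shape=[1], value=-0.01)
    one = model.param_init_net.ConstantFill([], "one", shape=[1], value=1.0)
    for param in model.GetParams():
        grad = model.param_to_grad[param]
        model.net.WeightedSum([param, one, grad, neg_lr], param)

# KV store handler for the rendezvous; a shared-filesystem store is one
# option (the path is an assumption).
store_net = core.Net("store_init")
store_net.FileStoreHandlerCreate([], "store_handler", path="/tmp/rdv")
workspace.RunNetOnce(store_net)

model = cnn.CNNModelHelper(order="NCHW", name="imagenet_sketch")

rendezvous = {
    # These keys match the ones read from `rendezvous` in the diff below.
    "kv_handler": "store_handler",
    "shard_id": 0,       # this node's rank in the gang
    "num_shards": 2,     # gang size (one shard per node)
    "engine": "GLOO",
}

data_parallel_model.Parallelize_GPU(
    model,
    input_builder_fun=add_inputs,
    forward_pass_builder_fun=forward_pass,
    param_update_builder_fun=param_update,
    devices=[0, 1],
    rendezvous=rendezvous,
)

workspace.RunNetOnce(model.param_init_net)
workspace.CreateNet(model.net)
workspace.RunNet(model.net)  # one iteration of the sketch's "epoch"

# Operator-per-epoch flavor: persist the params so the next epoch's
# operator can Load them and continue (the path is illustrative).
save_net = core.Net("save_checkpoint")
save_net.Save(model.GetParams(), [], absolute_path=True,
              db="/tmp/imagenet_epoch_0000", db_type="minidb")
workspace.RunNetOnce(save_net)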
Reviewed By: salexspb
Differential Revision: D4223384
fbshipit-source-id: 8c2c73f4f6b2fdadb98511075ebbd8426c91eadb
diff --git a/caffe2/python/data_parallel_model.py b/caffe2/python/data_parallel_model.py
index 44fb524..5b75f94 100644
--- a/caffe2/python/data_parallel_model.py
+++ b/caffe2/python/data_parallel_model.py
@@ -231,7 +231,7 @@
comm_world = net.CreateCommonWorld(
rendezvous['kv_handler'],
"iter_cw",
- name="iter_cw_op",
+ name=net.Proto().name + ".iter_cw_op",
size=rendezvous['num_shards'],
rank=rendezvous['shard_id'],
engine=rendezvous['engine'],
@@ -252,7 +252,7 @@
comm_world = net.CreateCommonWorld(
rendezvous['kv_handler'],
"{}_cw".format(param_name),
- name="{}_cw_op".format(param_name),
+ name=net.Proto().name + ".{}_cw_op".format(param_name),
size=rendezvous['num_shards'],
rank=rendezvous['shard_id'],
engine=rendezvous['engine'],
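The two hunks above make the CreateCommonWorld op names unique per net: with one net per epoch, a fixed name like "iter_cw_op" would otherwise collide across nets. A brief illustration of the effect (the net name and store handler blob are invented for the example):

from caffe2.python import core

net = core.Net("epoch_0_train")  # hypothetical per-epoch net name
comm_world = net.CreateCommonWorld(
    "store_handler",   # KV store handler blob, as in rendezvous['kv_handler']
    "iter_cw",
    name=net.Proto().name + ".iter_cw_op",  # -> "epoch_0_train.iter_cw_op"
    size=2,            # rendezvous['num_shards']
    rank=0,            # rendezvous['shard_id']
    engine="GLOO",
)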