Example workflow for running distributed (sync SGD) ImageNet training in Flow
Summary:
This diff introduces a simplified ImageNet trainer that uses data_parallel_model to parallelize training over GPUs and nodes in a synchronous manner. Flow's gang scheduling is used to launch the nodes, and data_parallel_model handles the synchronization among the gang members.
This example also uses the operator-per-epoch model, where each epoch produces a checkpoint that the follow-up epoch consumes. A rough sketch of this setup follows.
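As context (this code is not part of the diff), here is a minimal sketch of how such a trainer might drive data_parallel_model with a rendezvous, plus an epoch-end checkpoint. Only the rendezvous keys (kv_handler, num_shards, shard_id, engine) are taken from the diff below; the store-handler setup, net names, paths, the tiny FC network, and the builder functions are all illustrative assumptions.

# Minimal sketch, not the actual trainer from this diff. Requires a GPU
# build of Caffe2 with Gloo; anything marked "assumption" is invented here.
from caffe2.python import cnn, core, data_parallel_model, workspace

def add_inputs(model):
    # Placeholder input: constant random data instead of an ImageNet reader.
    model.param_init_net.GaussianFill([], "data", shape=[32, 128])
    model.param_init_net.ConstantFill(
        [], "label", shape=[32], value=0, dtype=core.DataType.INT32)

def forward_pass(model, loss_scale):
    # Tiny stand-in network; a real trainer would build ResNet/AlexNet here.
    fc = model.FC("data", "fc", dim_in=128, dim_out=10)
    softmax, loss = model.SoftmaxWithLoss([fc, "label"], ["softmax", "loss"])
    loss = model.Scale(loss, "loss", scale=loss_scale)
    return [loss]

def param_update(model):
    # Plain SGD via WeightedSum: param += (-lr) * grad.
    neg_lr = model.param_init_net.ConstantFill(
        [], "neg_lr", shape=[1], value=-0.01)
    one = model.param_init_net.ConstantFill([], "one", shape=[1], value=1.0)
    for param in model.GetParams():
        grad = model.param_to_grad[param]
        model.net.WeightedSum([param, one, grad, neg_lr], param)

# KV store handler for the rendezvous; a shared-filesystem store is one
# option (the path is an assumption).
store_net = core.Net("store_init")
store_net.FileStoreHandlerCreate([], "store_handler", path="/tmp/rdv")
workspace.RunNetOnce(store_net)

model = cnn.CNNModelHelper(order="NCHW", name="imagenet_sketch")

rendezvous = {
    # These keys match the ones read from `rendezvous` in the diff below.
    "kv_handler": "store_handler",
    "shard_id": 0,       # this node's rank in the gang
    "num_shards": 2,     # gang size (one shard per node)
    "engine": "GLOO",
}

data_parallel_model.Parallelize_GPU(
    model,
    input_builder_fun=add_inputs,
    forward_pass_builder_fun=forward_pass,
    param_update_builder_fun=param_update,
    devices=[0, 1],
    rendezvous=rendezvous,
)

workspace.RunNetOnce(model.param_init_net)
workspace.CreateNet(model.net)
workspace.RunNet(model.net)  # one iteration of the sketch's "epoch"

# Operator-per-epoch flavor: persist the params so the next epoch's
# operator can Load them and continue (the path is illustrative).
save_net = core.Net("save_checkpoint")
save_net.Save(model.GetParams(), [], absolute_path=True,
              db="/tmp/imagenet_epoch_0000", db_type="minidb")
workspace.RunNetOnce(save_net)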
Reviewed By: salexspb
Differential Revision: D4223384
fbshipit-source-id: 8c2c73f4f6b2fdadb98511075ebbd8426c91eadb
diff --git a/caffe2/python/data_parallel_model.py b/caffe2/python/data_parallel_model.py
index 44fb524..5b75f94 100644
--- a/caffe2/python/data_parallel_model.py
+++ b/caffe2/python/data_parallel_model.py
@@ -231,7 +231,7 @@
comm_world = net.CreateCommonWorld(
rendezvous['kv_handler'],
"iter_cw",
- name="iter_cw_op",
+ name=net.Proto().name + ".iter_cw_op",
size=rendezvous['num_shards'],
rank=rendezvous['shard_id'],
engine=rendezvous['engine'],
@@ -252,7 +252,7 @@
comm_world = net.CreateCommonWorld(
rendezvous['kv_handler'],
"{}_cw".format(param_name),
- name="{}_cw_op".format(param_name),
+ name=net.Proto().name + ".{}_cw_op".format(param_name),
size=rendezvous['num_shards'],
rank=rendezvous['shard_id'],
engine=rendezvous['engine'],
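The two hunks above make the CreateCommonWorld op names unique per net: with one net per epoch, a fixed name like "iter_cw_op" would otherwise collide across nets. A brief illustration of the effect (the net name and store handler blob are invented for the example):

from caffe2.python import core

net = core.Net("epoch_0_train")  # hypothetical per-epoch net name
comm_world = net.CreateCommonWorld(
    "store_handler",   # KV store handler blob, as in rendezvous['kv_handler']
    "iter_cw",
    name=net.Proto().name + ".iter_cw_op",  # -> "epoch_0_train.iter_cw_op"
    size=2,            # rendezvous['num_shards']
    rank=0,            # rendezvous['shard_id']
    engine="GLOO",
)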