Changes default DDP behavior to divide sparse grad by world size before allreduce, not after (#61814)
Summary:
I appreciate https://github.com/pytorch/pytorch/pull/61379, which restores the fusion of div-by-world-size and copy-to-allreduce-buffer for dense gradients. But I noticed that in the wake of https://github.com/pytorch/pytorch/pull/61379 dense and sparse gradients are treated inconsistently. Specifically, dense gradients are divided by world size before the allreduce, while sparse gradients are divided by world size after the allreduce. On paper you wouldn't expect that to matter, but for cluster-scale DDP training with AMP gradient scaling and allreduces of FP16 grads, we've noticed several cases where post-dividing grads by world size caused nonconvergence while pre-dividing worked. I'm not aware of any cases where the reverse was true.
This PR changes the treatment of sparse gradients to match the treatment of dense gradients (both are now divided by world size before the allreduce).
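
The following is a minimal single-process sketch (not part of the PR) of why the order matters for FP16: summing unscaled FP16 grads across many ranks can overflow before the division ever happens, whereas pre-dividing keeps the values in range. The world size, gradient values, and the emulation of a SUM allreduce by scalar multiplication are illustrative assumptions, not DDP internals.

```python
import torch

world_size = 64
grad = torch.full((4,), 2000.0, dtype=torch.float16)  # hypothetical per-rank gradient

# Post-divide: sum across ranks first (emulated here by scaling), then divide.
summed = grad * world_size          # 2000 * 64 = 128000 overflows fp16 (max ~65504) -> inf
post_divided = summed / world_size  # inf / 64 is still inf
print(post_divided)                 # tensor([inf, inf, inf, inf], dtype=torch.float16)

# Pre-divide: divide each rank's gradient first, then sum.
pre_divided = (grad / world_size) * world_size  # 31.25 * 64 = 2000, stays finite
print(pre_divided)                  # tensor([2000., 2000., 2000., 2000.], dtype=torch.float16)
```

In a real gradient-scaling setup the scaler would usually back off before a hard overflow, but the sketch illustrates the headroom argument for pre-dividing.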
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61814
Reviewed By: mrshenli
Differential Revision: D29772444
Pulled By: rohan-varma
fbshipit-source-id: 033a17d5c019511889d908876282c6624fb26a2d
diff --git a/torch/csrc/distributed/c10d/reducer.cpp b/torch/csrc/distributed/c10d/reducer.cpp
index 31192f5..676ade2 100644
--- a/torch/csrc/distributed/c10d/reducer.cpp
+++ b/torch/csrc/distributed/c10d/reducer.cpp
@@ -439,6 +439,11 @@
// struct are empty, and there is no pre-existing accumulation tensor.
// Directly assign the sparse tensor to the `contents` field.
replica.contents = grad;
+ // If no DDP comm hook is registered,
+ // the allreduce only sums up the value, and a separate division is required.
+ if (comm_hook_ == nullptr) {
+ replica.contents.div_(div_factor_);
+ }
// The grad is modified in place and needs to be written back.
return true;
});
@@ -1452,11 +1457,6 @@
for (const auto i : c10::irange(future_result.size())) {
auto& replica = bucket.replicas[i];
if (bucket.expect_sparse_gradient) {
- // If no DDP comm hook is registered,
- // the allreduce only sums up the value, and a separate division is required.
- if (comm_hook_ == nullptr) {
- future_result[i].div_(div_factor_);
- }
replica.contents.copy_(future_result[i]);
} else {
// Reinitialize only `bucket_views_out` with the future_result by