Changes default DDP behavior to divide sparse grad by world size before allreduce, not after (#61814)
Summary:
I appreciate https://github.com/pytorch/pytorch/pull/61379, which restores the fusion of div-by-world-size and copy-to-allreduce-buffer for dense gradients. But I noticed that in the wake of https://github.com/pytorch/pytorch/pull/61379 dense and sparse gradients are treated inconsistently. Specifically, dense gradients are divided by world size before the allreduce, while sparse gradients are divided by world size after the allreduce. On paper you wouldn't expect that to matter, but for cluster-scale DDP training with AMP gradient scaling and allreduces of FP16 grads, we've noticed several cases where post-dividing grads by world size caused nonconvergence while pre-dividing worked. I'm not aware of any cases where the reverse was true.
This PR changes the treatment of sparse gradients to match the treatment of dense gradients (both are now divided by world size before the allreduce).
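
The following is a minimal single-process sketch (not part of the PR) of why the order matters for FP16: summing unscaled FP16 grads across many ranks can overflow before the division ever happens, whereas pre-dividing keeps the values in range. The world size, gradient values, and the emulation of a SUM allreduce by scalar multiplication are illustrative assumptions, not DDP internals.

```python
import torch

world_size = 64
grad = torch.full((4,), 2000.0, dtype=torch.float16)  # hypothetical per-rank gradient

# Post-divide: sum across ranks first (emulated here by scaling), then divide.
summed = grad * world_size          # 2000 * 64 = 128000 overflows fp16 (max ~65504) -> inf
post_divided = summed / world_size  # inf / 64 is still inf
print(post_divided)                 # tensor([inf, inf, inf, inf], dtype=torch.float16)

# Pre-divide: divide each rank's gradient first, then sum.
pre_divided = (grad / world_size) * world_size  # 31.25 * 64 = 2000, stays finite
print(pre_divided)                  # tensor([2000., 2000., 2000., 2000.], dtype=torch.float16)
```

In a real gradient-scaling setup the scaler would usually back off before a hard overflow, but the sketch illustrates the headroom argument for pre-dividing.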
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61814
Reviewed By: mrshenli
Differential Revision: D29772444
Pulled By: rohan-varma
fbshipit-source-id: 033a17d5c019511889d908876282c6624fb26a2d
diff --git a/torch/csrc/distributed/c10d/reducer.cpp b/torch/csrc/distributed/c10d/reducer.cpp
index 31192f5..676ade2 100644
--- a/torch/csrc/distributed/c10d/reducer.cpp
+++ b/torch/csrc/distributed/c10d/reducer.cpp
@@ -439,6 +439,11 @@
// struct are empty, and there is no pre-existing accumulation tensor.
// Directly assign the sparse tensor to the `contents` field.
replica.contents = grad;
+ // If no DDP comm hook is registered,
+ // the allreduce only sums up the value, and a separate division is required.
+ if (comm_hook_ == nullptr) {
+ replica.contents.div_(div_factor_);
+ }
// The grad is modified in place and needs to be written back.
return true;
});
@@ -1452,11 +1457,6 @@
for (const auto i : c10::irange(future_result.size())) {
auto& replica = bucket.replicas[i];
if (bucket.expect_sparse_gradient) {
- // If no DDP comm hook is registered,
- // the allreduce only sums up the value, and a separate division is required.
- if (comm_hook_ == nullptr) {
- future_result[i].div_(div_factor_);
- }
replica.contents.copy_(future_result[i]);
} else {
// Reinitialize only `bucket_views_out` with the future_result by