Modify nccl_dependency to take dev mode (#79169)
Summary:
Modify nccl_dependency to accept a dev mode. The default is still the tp2 version.
Suggestions from D35919342 are incorporated in this change.
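For illustration only, the selection described above could be sketched as follows. This is not the actual `get_nccl_dependency` implementation: the function name `select_nccl_target`, the target labels, and the error handling are all hypothetical. Only the accepted values (`tp2`, `dev`, `dev_v2.10.3-1`) and the tp2 default come from this change.

```python
# Hypothetical sketch; the real get_nccl_dependency may differ in every detail.
DEFAULT_NCCL = "tp2"  # default stated in the summary


def select_nccl_target(use_nccl=None):
    """Map an hpc_comms.use_nccl value to a list of dependency labels.

    The labels returned here are placeholders, not real build targets.
    """
    value = use_nccl or DEFAULT_NCCL
    if value == "tp2":
        return ["//hypothetical/third-party:nccl"]
    if value == "dev" or value.startswith("dev_"):
        # e.g. "dev" or "dev_v2.10.3-1" selects a development build
        return ["//hypothetical/nccl/" + value + ":nccl"]
    raise ValueError("unsupported hpc_comms.use_nccl value: " + repr(value))
```

Returning a list lets the result be concatenated onto an existing dependency list with `+`, which is how the call site in the diff uses it.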
Test Plan:
NCCL TESTS
Using version dev:
Build:
hpc_comms.use_nccl=dev
```
buck build mode/opt -c hpc_comms.use_nccl=dev -c fbcode.platform=platform009 -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=v100,a100 //param_bench/train/comms/cpp/nccl-tests/src:nccl_allreduce_perf --show-full-output --verbose 1
```
Build done successfully
Running test on devgpu:
```
/usr/local/fbcode/platform009/bin/mpirun -np 8 -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,GRAPH,TUNING,ENV,NET ./buck-out/gen/param_bench/train/comms/cpp/nccl-tests/src/nccl_allreduce_perf -b 8 -e 128M -f 2
```
Result: P507192135 - nccl version from logs "NCCL version 2.10.3dev+cudaCUDA_MAJOR.CUDA_MINOR"
--------
Using version dev_v2.10.3-1:
Build:
hpc_comms.use_nccl=dev_v2.10.3-1
```
buck kill && buck clean && buck build mode/opt -c hpc_comms.use_nccl=dev_v2.10.3-1 -c fbcode.platform=platform009 -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=v100,a100 //param_bench/train/comms/cpp/nccl-tests/src:nccl_allreduce_perf --show-full-output --verbose 1
```
Build done successfully
Running test on devgpu:
```
/usr/local/fbcode/platform009/bin/mpirun -np 8 -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,GRAPH,TUNING,ENV,NET ./buck-out/gen/param_bench/train/comms/cpp/nccl-tests/src/nccl_allreduce_perf -b 8 -e 128M -f 2
```
Result: P507194570 - nccl version from logs "NCCL version 2.10.3dev+cudaCUDA_MAJOR.CUDA_MINOR"
--------
Using version tp2:
Build:
hpc_comms.use_nccl=tp2
```
buck kill && buck clean && buck build mode/opt -c hpc_comms.use_nccl=tp2 -c fbcode.platform=platform009 -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=v100,a100 //param_bench/train/comms/cpp/nccl-tests/src:nccl_allreduce_perf --show-full-output --verbose 1
```
Build done successfully
Running test on devgpu:
```
/usr/local/fbcode/platform009/bin/mpirun -np 8 -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,GRAPH,TUNING,ENV,NET ./buck-out/gen/param_bench/train/comms/cpp/nccl-tests/src/nccl_allreduce_perf -b 8 -e 128M -f 2
```
Result: P507195497 - nccl version from logs "NCCL version 2.10.3+cudaCUDA_MAJOR.CUDA_MINOR"
--------
Using version default:
Build:
hpc_comms.use_nccl not set (defaults to tp2)
```
buck kill && buck clean && buck build mode/opt -c fbcode.platform=platform009 -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=v100,a100 //param_bench/train/comms/cpp/nccl-tests/src:nccl_allreduce_perf --show-full-output --verbose 1
```
Build done successfully
Running test on devgpu:
```
/usr/local/fbcode/platform009/bin/mpirun -np 8 -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,GRAPH,TUNING,ENV,NET ./buck-out/gen/param_bench/train/comms/cpp/nccl-tests/src/nccl_allreduce_perf -b 8 -e 128M -f 2
```
Result: P507207374 - nccl version from logs "NCCL version 2.10.3+cudaCUDA_MAJOR.CUDA_MINOR"
--------
RUNNING PARAM COMMS TO TEST CAFFE TORCH INTEGRATION WITH NCCL DEV LIB
Using version dev:
Build:
hpc_comms.use_nccl=dev
```
buck kill && buck clean && buck build mode/opt -c fbcode.platform=platform009 -c hpc_comms.use_nccl=dev -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=v100,a100 //param_bench/train/comms/pt:comms --show-full-output --verbose 1
```
Build done successfully
Running test on devgpu:
```
sh ai_codesign/comms/scripts/test_param_local_no_mpi.sh -s 8 --backend nccl --coll all_reduce
```
Result: P507214467 - nccl version from logs "NCCL version 2.10.3dev+cudaCUDA_MAJOR.CUDA_MINOR"
--------
Using version dev_v2.10.3-1:
Build:
hpc_comms.use_nccl=dev_v2.10.3-1
```
buck kill && buck clean && buck build mode/opt -c fbcode.platform=platform009 -c hpc_comms.use_nccl=dev_v2.10.3-1 -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=v100,a100 //param_bench/train/comms/pt:comms --show-full-output --verbose 1
```
Build done successfully
Running test on devgpu:
```
sh ai_codesign/comms/scripts/test_param_local_no_mpi.sh -s 8 --backend nccl --coll all_reduce
```
Result: P507247559 - nccl version from logs "NCCL version 2.10.3dev+cudaCUDA_MAJOR.CUDA_MINOR"
--------
Using version tp2:
Build:
hpc_comms.use_nccl=tp2
```
buck kill && buck clean && buck build mode/opt -c fbcode.platform=platform009 -c hpc_comms.use_nccl=tp2 -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=v100,a100 //param_bench/train/comms/pt:comms --show-full-output --verbose 1
```
Build done successfully
Running test on devgpu:
```
sh ai_codesign/comms/scripts/test_param_local_no_mpi.sh -s 8 --backend nccl --coll all_reduce
```
Result: P507251808 - nccl version from logs "NCCL version 2.10.3+cudaCUDA_MAJOR.CUDA_MINOR"
--------
Using version default:
Build:
hpc_comms.use_nccl not set (defaults to tp2)
```
buck kill && buck clean && buck build mode/opt -c fbcode.platform=platform009 -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=v100,a100 //param_bench/train/comms/pt:comms --show-full-output --verbose 1
```
Build done successfully
Running test on devgpu:
```
sh ai_codesign/comms/scripts/test_param_local_no_mpi.sh -s 8 --backend nccl --coll all_reduce
```
Result: P507256357 - nccl version from logs "NCCL version 2.10.3+cudaCUDA_MAJOR.CUDA_MINOR"
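The version checks above were done by reading the logs. A minimal sketch of how that check could be scripted, assuming the version line has the same shape as the strings quoted in the results (the helper `parse_nccl_version` is hypothetical and not part of this change):

```python
import re

# Matches lines such as "NCCL version 2.10.3dev+cuda..." or "NCCL version 2.10.3+cuda...".
NCCL_VERSION_RE = re.compile(r"NCCL version (\d+\.\d+\.\d+)(dev)?\+cuda")


def parse_nccl_version(log_text):
    """Return (version, is_dev) from the first NCCL version line, or None."""
    match = NCCL_VERSION_RE.search(log_text)
    if match is None:
        return None
    return match.group(1), match.group(2) is not None


# A dev build is expected to report the "dev" suffix; tp2 is not.
dev_line = 'NCCL version 2.10.3dev+cudaCUDA_MAJOR.CUDA_MINOR'
tp2_line = 'NCCL version 2.10.3+cudaCUDA_MAJOR.CUDA_MINOR'
assert parse_nccl_version(dev_line) == ("2.10.3", True)
assert parse_nccl_version(tp2_line) == ("2.10.3", False)
```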
Differential Revision: D36873694
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79169
Approved by: https://github.com/kingchc, https://github.com/kwen2501
diff --git a/tools/target_definitions.bzl b/tools/target_definitions.bzl
index be227e2..cecc12b 100644
--- a/tools/target_definitions.bzl
+++ b/tools/target_definitions.bzl
@@ -253,13 +253,13 @@
"//caffe2/torch/lib/libshm:libshm",
"//gloo:gloo_gpu_cuda",
"//tensorpipe:tensorpipe_cuda",
- ],
+ ] + get_nccl_dependency(),
exported_external_deps = [
("cudnn", None, "cudnn-lazy"),
("cuda", None, "nvToolsExt-lazy"),
("cuda", None, "nvrtc-lazy"),
("cuda", None, "nvrtc-builtins-lazy"),
- ] + get_nccl_dependency(),
+ ],
compiler_flags = compiler_flags_cpu + compiler_flags_cuda,
**common_flags
)