Modify nccl_dependency to take dev mode (#79169)

Summary:
Modify nccl_dependency to take dev mode. The default is still the tp2 version.

Suggestions from D35919342 are incorporated.
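
The mode selection described in this summary can be modeled roughly as follows. This is an illustrative Python sketch (Starlark is a Python dialect), not the actual `get_nccl_dependency` implementation, which is not part of this diff. The function name `resolve_nccl_dependency` and all target labels are hypothetical placeholders.

```python
# Hypothetical model of mapping a `hpc_comms.use_nccl` value to dependencies.
# The labels below are placeholders, not real build targets.
DEFAULT_NCCL_MODE = "tp2"


def resolve_nccl_dependency(use_nccl=None):
    """Return a list of placeholder dependency labels for an NCCL mode."""
    # An unset config value falls back to the tp2 default.
    mode = use_nccl or DEFAULT_NCCL_MODE
    if mode == "tp2":
        return ["<tp2-nccl-target>"]
    if mode == "dev":
        # Dev mode without a pinned version.
        return ["<dev-nccl-target>"]
    if mode.startswith("dev_"):
        # Dev mode pinned to a version, e.g. "dev_v2.10.3-1".
        version = mode[len("dev_"):]
        return ["<dev-nccl-target>:" + version]
    raise ValueError("unsupported hpc_comms.use_nccl value: " + mode)
```

Under this model, omitting `-c hpc_comms.use_nccl=...` and passing `tp2` explicitly yield the same dependency list, which is what the "default" and "tp2" runs in the test plan compare.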

Test Plan:
NCCL TESTS

Using version dev:
Build:
hpc_comms.use_nccl=dev

```
buck build mode/opt -c hpc_comms.use_nccl=dev -c fbcode.platform=platform009 -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=v100,a100 //param_bench/train/comms/cpp/nccl-tests/src:nccl_allreduce_perf --show-full-output --verbose 1
```
Build done successfully

Running test on devgpu:

```
/usr/local/fbcode/platform009/bin/mpirun -np 8 -x NCCL_DEBUG=INFO -x  NCCL_DEBUG_SUBSYS=INIT,GRAPH,TUNING,ENV,NET ./buck-out/gen/param_bench/train/comms/cpp/nccl-tests/src/nccl_allreduce_perf -b 8 -e 128M -f 2
```
Result: P507192135 - nccl version from logs "NCCL version 2.10.3dev+cudaCUDA_MAJOR.CUDA_MINOR"

--------

Using version dev_v2.10.3-1:
Build:
hpc_comms.use_nccl=dev_v2.10.3-1

```
buck kill && buck clean && buck build mode/opt -c hpc_comms.use_nccl=dev_v2.10.3-1 -c fbcode.platform=platform009 -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=v100,a100 //param_bench/train/comms/cpp/nccl-tests/src:nccl_allreduce_perf --show-full-output --verbose 1
```
Build done successfully

Running test on devgpu:

```
/usr/local/fbcode/platform009/bin/mpirun -np 8 -x NCCL_DEBUG=INFO -x  NCCL_DEBUG_SUBSYS=INIT,GRAPH,TUNING,ENV,NET ./buck-out/gen/param_bench/train/comms/cpp/nccl-tests/src/nccl_allreduce_perf -b 8 -e 128M -f 2
```
Result: P507194570 - nccl version from logs "NCCL version 2.10.3dev+cudaCUDA_MAJOR.CUDA_MINOR"

--------

Using version tp2:
Build:
hpc_comms.use_nccl=tp2

```
buck kill && buck clean && buck build mode/opt -c hpc_comms.use_nccl=tp2 -c fbcode.platform=platform009 -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=v100,a100 //param_bench/train/comms/cpp/nccl-tests/src:nccl_allreduce_perf --show-full-output --verbose 1
```
Build done successfully

Running test on devgpu:

```
/usr/local/fbcode/platform009/bin/mpirun -np 8 -x NCCL_DEBUG=INFO -x  NCCL_DEBUG_SUBSYS=INIT,GRAPH,TUNING,ENV,NET ./buck-out/gen/param_bench/train/comms/cpp/nccl-tests/src/nccl_allreduce_perf -b 8 -e 128M -f 2
```
Result: P507195497 - nccl version from logs "NCCL version 2.10.3+cudaCUDA_MAJOR.CUDA_MINOR"

--------

Using version default:
Build:
hpc_comms.use_nccl not set (defaults to tp2)

```
buck kill && buck clean && buck build mode/opt -c fbcode.platform=platform009 -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=v100,a100 //param_bench/train/comms/cpp/nccl-tests/src:nccl_allreduce_perf --show-full-output --verbose 1
```
Build done successfully

Running test on devgpu:

```
/usr/local/fbcode/platform009/bin/mpirun -np 8 -x NCCL_DEBUG=INFO -x  NCCL_DEBUG_SUBSYS=INIT,GRAPH,TUNING,ENV,NET ./buck-out/gen/param_bench/train/comms/cpp/nccl-tests/src/nccl_allreduce_perf -b 8 -e 128M -f 2
```
Result: P507207374 - nccl version from logs "NCCL version 2.10.3+cudaCUDA_MAJOR.CUDA_MINOR"

--------

RUNNING PARAM COMMS TO TEST CAFFE TORCH INTEGRATION WITH NCCL DEV LIB

Using version dev:
Build:
hpc_comms.use_nccl=dev

```
buck kill && buck clean && buck build mode/opt -c fbcode.platform=platform009 -c hpc_comms.use_nccl=dev -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=v100,a100 //param_bench/train/comms/pt:comms --show-full-output --verbose 1
```
Build done successfully

Running test on devgpu:

```
sh ai_codesign/comms/scripts/test_param_local_no_mpi.sh -s 8 --backend nccl --coll all_reduce
```
Result: P507214467 - nccl version from logs "NCCL version 2.10.3dev+cudaCUDA_MAJOR.CUDA_MINOR"

--------

Using version dev_v2.10.3-1:
Build:
hpc_comms.use_nccl=dev_v2.10.3-1

```
buck kill && buck clean && buck build mode/opt -c fbcode.platform=platform009 -c hpc_comms.use_nccl=dev_v2.10.3-1 -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=v100,a100 //param_bench/train/comms/pt:comms --show-full-output --verbose 1
```
Build done successfully

Running test on devgpu:

```
sh ai_codesign/comms/scripts/test_param_local_no_mpi.sh -s 8 --backend nccl --coll all_reduce
```
Result: P507247559 - nccl version from logs "NCCL version 2.10.3dev+cudaCUDA_MAJOR.CUDA_MINOR"

--------

Using version tp2:
Build:
hpc_comms.use_nccl=tp2

```
buck kill && buck clean && buck build mode/opt -c fbcode.platform=platform009 -c hpc_comms.use_nccl=tp2 -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=v100,a100 //param_bench/train/comms/pt:comms --show-full-output --verbose 1
```
Build done successfully

Running test on devgpu:

```
sh ai_codesign/comms/scripts/test_param_local_no_mpi.sh -s 8 --backend nccl --coll all_reduce
```
Result: P507251808 - nccl version from logs "NCCL version 2.10.3+cudaCUDA_MAJOR.CUDA_MINOR"

--------

Using version default:
Build:
hpc_comms.use_nccl not set (defaults to tp2)

```
buck kill && buck clean && buck build mode/opt -c fbcode.platform=platform009 -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=v100,a100 //param_bench/train/comms/pt:comms --show-full-output --verbose 1
```
Build done successfully

Running test on devgpu:

```
sh ai_codesign/comms/scripts/test_param_local_no_mpi.sh -s 8 --backend nccl --coll all_reduce
```
Result: P507256357 - nccl version from logs "NCCL version 2.10.3+cudaCUDA_MAJOR.CUDA_MINOR"
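
The check applied to each result in this test plan (whether the logged version string carries the `dev` suffix) could be automated along these lines. This is a minimal sketch: the helper name `parse_nccl_version` is hypothetical, and the pattern assumes the log format quoted in the results, i.e. `NCCL version <major>.<minor>.<patch>` optionally followed by `dev`.

```python
import re

# Matches e.g. "NCCL version 2.10.3dev+cuda..." or "NCCL version 2.10.3+cuda...".
_NCCL_VERSION_RE = re.compile(r"NCCL version (\d+\.\d+\.\d+)(dev)?")


def parse_nccl_version(log_text):
    """Return (version, is_dev) from NCCL log text, or None if not found."""
    match = _NCCL_VERSION_RE.search(log_text)
    if match is None:
        return None
    # Group 2 is present only when the version carries the "dev" suffix.
    return match.group(1), match.group(2) is not None
```

With the strings quoted in this test plan, the `dev` and `dev_v2.10.3-1` runs would be reported as dev builds, and the `tp2` and default runs as non-dev builds.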

Differential Revision: D36873694

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79169
Approved by: https://github.com/kingchc, https://github.com/kwen2501
diff --git a/tools/target_definitions.bzl b/tools/target_definitions.bzl
index be227e2..cecc12b 100644
--- a/tools/target_definitions.bzl
+++ b/tools/target_definitions.bzl
@@ -253,13 +253,13 @@
             "//caffe2/torch/lib/libshm:libshm",
             "//gloo:gloo_gpu_cuda",
             "//tensorpipe:tensorpipe_cuda",
-        ],
+        ] + get_nccl_dependency(),
         exported_external_deps = [
             ("cudnn", None, "cudnn-lazy"),
             ("cuda", None, "nvToolsExt-lazy"),
             ("cuda", None, "nvrtc-lazy"),
             ("cuda", None, "nvrtc-builtins-lazy"),
-        ] + get_nccl_dependency(),
+        ],
         compiler_flags = compiler_flags_cpu + compiler_flags_cuda,
         **common_flags
     )