[CI] Use jemalloc for CUDA builds (#116900)

According to @ptrblck, this will likely mitigate a non-deterministic NVCC bug.
See https://github.com/pytorch/pytorch/issues/116289 for more details.

Test plan: SSH into one of the CUDA builds and confirm that `LD_PRELOAD` is set for the top-level `make` command.
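
One possible way to run the check from the test plan, as a minimal sketch rather than the exact procedure used for this PR: on Linux, the environment a process was started with can be read from `/proc/<pid>/environ`. The helper name `proc_env_var` is hypothetical, and the sketch assumes the caller is allowed to read that file (same user or root).

```shell
# Hypothetical helper (not part of this change): print the value of an
# environment variable as seen by a running process. Linux only.
proc_env_var() {
  # $1 = PID, $2 = variable name
  # /proc/<pid>/environ is NUL-separated, so turn it into one entry per line,
  # then print only the value of the requested variable (nothing if unset).
  tr '\0' '\n' < "/proc/$1/environ" | sed -n "s/^$2=//p"
}

# Example usage on a build machine (assumes a process named exactly "make"
# is running; pgrep -o picks the oldest match, i.e. the top-level one):
#   proc_env_var "$(pgrep -o -x make)" LD_PRELOAD
```

If the preload is in effect, the usage line should print the jemalloc path exported in `build.sh`; empty output would suggest the variable did not reach `make`.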

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116900
Approved by: https://github.com/atalman
diff --git a/.ci/docker/common/install_base.sh b/.ci/docker/common/install_base.sh
index 4550de8..e3568b2 100755
--- a/.ci/docker/common/install_base.sh
+++ b/.ci/docker/common/install_base.sh
@@ -61,6 +61,7 @@
     ${maybe_libiomp_dev} \
     libyaml-dev \
     libz-dev \
+    libjemalloc2 \
     libjpeg-dev \
     libasound2-dev \
     libsndfile-dev \
diff --git a/.ci/pytorch/build.sh b/.ci/pytorch/build.sh
index c77913d..b72461b 100755
--- a/.ci/pytorch/build.sh
+++ b/.ci/pytorch/build.sh
@@ -28,6 +28,8 @@
 env
 
 if [[ "$BUILD_ENVIRONMENT" == *cuda* ]]; then
+  # Use jemalloc during compilation to mitigate https://github.com/pytorch/pytorch/issues/116289
+  export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2
   echo "NVCC version:"
   nvcc --version
 fi