[profiler] Add kineto init delay when used in daemon mode (#120276)

Fixes #112389

## About

PyTorch (Kineto) profiler registers with the profiling daemon Dynolog to enable on-demand profiling. The user should only need to set the env variable `KINETO_USE_DAEMON`. To enable this we need to initialize kineto library early rather than lazily on a PyTorch profiler call. This initialization happens in a static initializer.

- Kineto init function basically registers a callback using the CUDA CUPTI library https://github.com/pytorch/kineto/blob/main/libkineto/src/init.cpp#L130-L148
- However, the above needs the dynamic linking to libcupti.so to have taken place.
- I understand now that static initializations of compilation units will be called before the dynamic linking leading to a segfault in #112389

![image](https://github.com/pytorch/pytorch/assets/6922212/29c9e79b-8080-4198-aaae-8a5696dccaec)

## Workaround

We add a delay in the initialization that can be configured using the env variable 'KINETO_DAEMON_INIT_DELAY_S'. May not be the best but it could help resolve the issue.

## Testing

Tested this out with [linear_model_example.py](https://github.com/facebookincubator/dynolog/blob/main/scripts/pytorch/linear_model_example.py)

First export the daemon env variable

### Without any delay
```
>$ python3 linear_model_example.py
INFO:2024-02-21 19:34:50 2366287:2366287 init.cpp:131] Registering daemon config loader, cpuOnly = 1
INFO:2024-02-21 19:34:50 2366287:2366287 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-02-21 19:34:50 2366287:2366287 IpcFabricConfigClient.cpp:93] Setting up IPC Fabric at endpoint: dynoconfigclientb8f91363-d8d6-47a7-9103-197661e28397 status = initialized
INFO:2024-02-21 19:34:50 2366287:2366287 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-02-21 19:34:50 2366287:2366287 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
cpu
99 1385.468505859375
```

### With 5 seconds delay
```
>$ KINETO_DAEMON_INIT_DELAY_S=5 python3 linear_model_example.py
cpu
99 284.82305908203125
10099 8.817167282104492
INFO:2024-02-21 19:34:26 2359155:2359214 init.cpp:131] Registering daemon config loader, cpuOnly = 1
ERROR: External init callback must run in same thread as registerClient (1782580992 != -1922169024)
INFO:2024-02-21 19:34:26 2359155:2359214 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-02-21 19:34:26 2359155:2359214 IpcFabricConfigClient.cpp:93] Setting up IPC Fabric at endpoint: dynoconfigclient49270a3f-e913-4ea6-b9e0-cc90a853a869 status = initialized
INFO:2024-02-21 19:34:26 2359155:2359214 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-02-21 19:34:26 2359155:2359214 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
20099 8.817167282104492
```

### With an invalid delay
```
>$ KINETO_DAEMON_INIT_DELAY_S=abc python3 linear_model_example.py
INFO:2024-02-21 19:35:02 2369647:2369647 init.cpp:131] Registering daemon config loader, cpuOnly = 1
INFO:2024-02-21 19:35:02 2369647:2369647 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-02-21 19:35:02 2369647:2369647 IpcFabricConfigClient.cpp:93] Setting up IPC Fabric at endpoint: dynoconfigclient0e12a349-af7b-4322-901d-1ff22f91fd4c status = initialized
INFO:2024-02-21 19:35:02 2369647:2369647 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-02-21 19:35:02 2369647:2369647 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
cpu
```

### Unit test updated as well.

## Impact

This should not impact any general user. The initialization only occurs if `KINETO_USE_DAEMON` is set in the environment.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120276
Approved by: https://github.com/anupambhatnagar, https://github.com/aaronenyeshi

commit: b88621040a9fb6402ed02d2bb5e4ae6d2e4c1704
author: briancoutinho <bcoutinho@meta.com> Thu Feb 22 18:17:24 2024 +0000
committer: PyTorch MergeBot <pytorchmergebot@users.noreply.github.com> Thu Feb 22 18:17:33 2024 +0000
tree: 7cf00e74424e073e2c310536737bb7ceb9272208
parent: be0ee934674bc846c5463095f8861a3b58ea6113
diff --git a/test/profiler/test_profiler.py b/test/profiler/test_profiler.py
index 098735c..f4846b8 100644
--- a/test/profiler/test_profiler.py
+++ b/test/profiler/test_profiler.py

@@ -1661,6 +1661,7 @@
                 self.assertTrue(len(e.input_shapes[0]) > 0)
 
     @patch.dict(os.environ, {"KINETO_USE_DAEMON": "1"})
+    @patch.dict(os.environ, {"KINETO_DAEMON_INIT_DELAY_S": "1"})
     def test_kineto_profiler_with_environment_variable(self):
         script = """
 import torch

diff --git a/torch/csrc/profiler/kineto_client_interface.cpp b/torch/csrc/profiler/kineto_client_interface.cpp
index 2b32c5e..bf4b8f2 100644
--- a/torch/csrc/profiler/kineto_client_interface.cpp
+++ b/torch/csrc/profiler/kineto_client_interface.cpp

@@ -2,6 +2,8 @@
 #include <ATen/Context.h>
 #include <libkineto.h>
 #include <torch/csrc/autograd/profiler_kineto.h>
+#include <chrono>
+#include <thread>
 
 // Ondemand tracing is not supported on Apple or edge platform
 #if defined(__APPLE__) || defined(EDGE_PROFILER_USE_KINETO)
@@ -73,18 +75,43 @@
 #if ENABLE_GLOBAL_OBSERVER
 namespace {
 
+int get_init_delay() {
+  const char* delay_c = std::getenv("KINETO_DAEMON_INIT_DELAY_S");
+  if (!delay_c) {
+    return -1;
+  }
+  std::string delay_s{delay_c};
+  try {
+    return std::stoi(delay_s);
+  } catch (const std::invalid_argument& _) {
+    return -1;
+  }
+}
+
 struct RegisterLibKinetoClient {
   RegisterLibKinetoClient() {
     static profiler::impl::LibKinetoClient client;
+    libkineto::api().registerClient(&client);
 
-    if (std::getenv("KINETO_USE_DAEMON") != nullptr) {
+    auto kineto_init = []() {
       libkineto_init(
           /*cpuOnly=*/!(at::hasCUDA() || at::hasXPU() || at::hasMTIA()),
           /*logOnError=*/true);
       libkineto::api().suppressLogMessages();
-    }
+    };
 
-    libkineto::api().registerClient(&client);
+    if (std::getenv("KINETO_USE_DAEMON") != nullptr) {
+      int init_delay_s = get_init_delay();
+      if (init_delay_s > 0) {
+        std::thread t([init_delay_s, kineto_init]() {
+          std::this_thread::sleep_for(std::chrono::seconds(init_delay_s));
+          kineto_init();
+        });
+        t.detach();
+      } else {
+        kineto_init();
+      }
+    }
   }
 } register_libkineto_client;
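
Note on `get_init_delay()` in the diff above: `std::stoi` can also throw `std::out_of_range` for very large values, which the `catch` for `std::invalid_argument` does not cover, and it accepts trailing characters (for example `"5s"` parses as 5). A stricter parse could look like the sketch below. This is not part of the commit; it assumes C++17 for `std::from_chars`, and the names `parse_init_delay_s` and `get_init_delay_strict` are hypothetical.

```cpp
#include <charconv>
#include <cstdlib>
#include <cstring>
#include <system_error>

// Hypothetical helper, not part of the commit: parse a delay in seconds.
// Returns -1 when the value is missing, empty, non-numeric, followed by
// trailing characters, too large for int, or negative.
int parse_init_delay_s(const char* raw) {
  if (raw == nullptr || *raw == '\0') {
    return -1;
  }
  int value = 0;
  const char* end = raw + std::strlen(raw);
  // from_chars never throws; failures are reported through `ec`.
  auto [ptr, ec] = std::from_chars(raw, end, value);
  if (ec != std::errc() || ptr != end || value < 0) {
    return -1;
  }
  return value;
}

// Hypothetical wrapper that reads the same env variable as the diff.
int get_init_delay_strict() {
  return parse_init_delay_s(std::getenv("KINETO_DAEMON_INIT_DELAY_S"));
}
```

As in the commit, a return value of -1 (or 0) would fall through to the immediate, non-delayed initialization path.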