[profiler] Add kineto init delay when used in daemon mode (#120276)

Fixes #112389

## About

The PyTorch (Kineto) profiler registers with the profiling daemon Dynolog to enable on-demand profiling. The user should only need to set the env variable `KINETO_USE_DAEMON`. For this to work, we need to initialize the Kineto library early rather than lazily on the first PyTorch profiler call. This initialization happens in a static initializer.
- The Kineto init function registers a callback using the CUDA CUPTI library: https://github.com/pytorch/kineto/blob/main/libkineto/src/init.cpp#L130-L148
- However, this requires that the dynamic linking to libcupti.so has already taken place.
- As I now understand it, static initializers of compilation units can run before that dynamic linking has happened, which leads to the segfault in #112389.
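
To make the "static initializer" mechanism above concrete, here is a minimal, self-contained sketch of the registration pattern. It is not the PyTorch or Kineto code: `init_log`, `RegisterAtLoad`, and `was_registered` are hypothetical names used only for illustration. The point is that the constructor of a namespace-scope object runs when the library is loaded, without any explicit call from the user, so anything it depends on must already be usable at that moment.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a registry such as a profiler client list.
// A function-local static is used so the log is constructed on first use,
// independent of the order in which namespace-scope objects are initialized.
std::vector<std::string>& init_log() {
  static std::vector<std::string> log;
  return log;
}

// Hypothetical registrar: its constructor does the "early" work.
struct RegisterAtLoad {
  RegisterAtLoad() {
    // Runs during static initialization of this translation unit,
    // before any function defined in this file is first called.
    init_log().push_back("registered");
  }
};

// Namespace-scope object: constructing it is the static initializer.
static RegisterAtLoad register_at_load;

// Helper to observe that registration already happened.
bool was_registered() {
  return !init_log().empty() && init_log().front() == "registered";
}
```

If the constructor body called into another shared library that is not ready yet, the failure would occur at load time, which matches the shape of the problem described above.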

![image](https://github.com/pytorch/pytorch/assets/6922212/29c9e79b-8080-4198-aaae-8a5696dccaec)

## Workaround
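We add a delay to the initialization, configurable in seconds through the env variable `KINETO_DAEMON_INIT_DELAY_S`. When the value is a positive integer, initialization runs on a detached thread after sleeping for that many seconds; if the variable is unset, not a valid integer, or not positive, initialization runs immediately as before. This may not be the ideal fix, but it could help resolve the issue.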
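
The workaround can be sketched in isolation as follows. This is a simplified illustration, not the code in `kineto_client_interface.cpp`: `parse_delay_seconds` and `run_with_optional_delay` are hypothetical names, and the `init` callback stands in for the real Kineto initialization. As an assumption beyond the description above, the sketch also treats values that overflow `int` as invalid.

```cpp
#include <chrono>
#include <cstdlib>
#include <functional>
#include <stdexcept>
#include <string>
#include <thread>
#include <utility>

// Parse a delay in seconds from a raw env value (hypothetical helper).
// Returns -1 when the value is missing, not an integer, or out of range.
int parse_delay_seconds(const char* raw) {
  if (raw == nullptr) {
    return -1;
  }
  try {
    return std::stoi(std::string{raw});
  } catch (const std::invalid_argument&) {
    return -1;  // e.g. "abc"
  } catch (const std::out_of_range&) {
    return -1;  // assumption: treat overflow like an invalid value
  }
}

// Run `init` immediately, or after `delay_s` seconds on a detached thread.
// Returns true when the call was deferred (hypothetical helper).
bool run_with_optional_delay(int delay_s, std::function<void()> init) {
  if (delay_s > 0) {
    std::thread([delay_s, init = std::move(init)]() {
      std::this_thread::sleep_for(std::chrono::seconds(delay_s));
      init();
    }).detach();
    return true;
  }
  init();
  return false;
}
```

A caller would combine the two as `run_with_optional_delay(parse_delay_seconds(std::getenv("KINETO_DAEMON_INIT_DELAY_S")), init_fn)`. Because the thread is detached, the deferred callback runs on a different thread than the caller, and it never runs if the process exits before the delay elapses.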

## Testing
Tested with [linear_model_example.py](https://github.com/facebookincubator/dynolog/blob/main/scripts/pytorch/linear_model_example.py).
First, export the daemon env variable `KINETO_USE_DAEMON`.

### Without any delay
```
>$ python3 linear_model_example.py

INFO:2024-02-21 19:34:50 2366287:2366287 init.cpp:131] Registering daemon config loader, cpuOnly =  1
INFO:2024-02-21 19:34:50 2366287:2366287 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-02-21 19:34:50 2366287:2366287 IpcFabricConfigClient.cpp:93] Setting up IPC Fabric at endpoint: dynoconfigclientb8f91363-d8d6-47a7-9103-197661e28397 status = initialized
INFO:2024-02-21 19:34:50 2366287:2366287 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-02-21 19:34:50 2366287:2366287 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
cpu
99 1385.468505859375
```

### With 5 seconds delay
```
>$ KINETO_DAEMON_INIT_DELAY_S=5 python3 linear_model_example.py

cpu
99 284.82305908203125
10099 8.817167282104492
INFO:2024-02-21 19:34:26 2359155:2359214 init.cpp:131] Registering daemon config loader, cpuOnly =  1
ERROR: External init callback must run in same thread as registerClient (1782580992 != -1922169024)
INFO:2024-02-21 19:34:26 2359155:2359214 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-02-21 19:34:26 2359155:2359214 IpcFabricConfigClient.cpp:93] Setting up IPC Fabric at endpoint: dynoconfigclient49270a3f-e913-4ea6-b9e0-cc90a853a869 status = initialized
INFO:2024-02-21 19:34:26 2359155:2359214 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-02-21 19:34:26 2359155:2359214 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
20099 8.817167282104492
```

### With an invalid delay
```
>$ KINETO_DAEMON_INIT_DELAY_S=abc python3 linear_model_example.py

INFO:2024-02-21 19:35:02 2369647:2369647 init.cpp:131] Registering daemon config loader, cpuOnly =  1
INFO:2024-02-21 19:35:02 2369647:2369647 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-02-21 19:35:02 2369647:2369647 IpcFabricConfigClient.cpp:93] Setting up IPC Fabric at endpoint: dynoconfigclient0e12a349-af7b-4322-901d-1ff22f91fd4c status = initialized
INFO:2024-02-21 19:35:02 2369647:2369647 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-02-21 19:35:02 2369647:2369647 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
cpu
```

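### Unit test
The unit test `test_kineto_profiler_with_environment_variable` is updated as well: it now also sets `KINETO_DAEMON_INIT_DELAY_S` to `1`.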

## Impact
This should not affect general users: the initialization only occurs if `KINETO_USE_DAEMON` is set in the environment.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120276
Approved by: https://github.com/anupambhatnagar, https://github.com/aaronenyeshi
diff --git a/test/profiler/test_profiler.py b/test/profiler/test_profiler.py
index 098735c..f4846b8 100644
--- a/test/profiler/test_profiler.py
+++ b/test/profiler/test_profiler.py
@@ -1661,6 +1661,7 @@
                 self.assertTrue(len(e.input_shapes[0]) > 0)
 
     @patch.dict(os.environ, {"KINETO_USE_DAEMON": "1"})
+    @patch.dict(os.environ, {"KINETO_DAEMON_INIT_DELAY_S": "1"})
     def test_kineto_profiler_with_environment_variable(self):
         script = """
 import torch
diff --git a/torch/csrc/profiler/kineto_client_interface.cpp b/torch/csrc/profiler/kineto_client_interface.cpp
index 2b32c5e..bf4b8f2 100644
--- a/torch/csrc/profiler/kineto_client_interface.cpp
+++ b/torch/csrc/profiler/kineto_client_interface.cpp
@@ -2,6 +2,8 @@
 #include <ATen/Context.h>
 #include <libkineto.h>
 #include <torch/csrc/autograd/profiler_kineto.h>
+#include <chrono>
+#include <thread>
 
 // Ondemand tracing is not supported on Apple or edge platform
 #if defined(__APPLE__) || defined(EDGE_PROFILER_USE_KINETO)
@@ -73,18 +75,43 @@
 #if ENABLE_GLOBAL_OBSERVER
 namespace {
 
+int get_init_delay() {
+  const char* delay_c = std::getenv("KINETO_DAEMON_INIT_DELAY_S");
+  if (!delay_c) {
+    return -1;
+  }
+  std::string delay_s{delay_c};
+  try {
+    return std::stoi(delay_s);
+  } catch (const std::invalid_argument& _) {
+    return -1;
+  }
+}
+
 struct RegisterLibKinetoClient {
   RegisterLibKinetoClient() {
     static profiler::impl::LibKinetoClient client;
+    libkineto::api().registerClient(&client);
 
-    if (std::getenv("KINETO_USE_DAEMON") != nullptr) {
+    auto kineto_init = []() {
       libkineto_init(
           /*cpuOnly=*/!(at::hasCUDA() || at::hasXPU() || at::hasMTIA()),
           /*logOnError=*/true);
       libkineto::api().suppressLogMessages();
-    }
+    };
 
-    libkineto::api().registerClient(&client);
+    if (std::getenv("KINETO_USE_DAEMON") != nullptr) {
+      int init_delay_s = get_init_delay();
+      if (init_delay_s > 0) {
+        std::thread t([init_delay_s, kineto_init]() {
+          std::this_thread::sleep_for(std::chrono::seconds(init_delay_s));
+          kineto_init();
+        });
+        t.detach();
+      } else {
+        kineto_init();
+      }
+    }
   }
 } register_libkineto_client;