[profiler] Add kineto init delay when used in daemon mode (#120276)
Fixes #112389
## About
The PyTorch (Kineto) profiler registers with the profiling daemon Dynolog to enable on-demand profiling. The user should only need to set the env variable `KINETO_USE_DAEMON`. To enable this, we need to initialize the Kineto library early rather than lazily on a PyTorch profiler call; this initialization happens in a static initializer.
- The Kineto init function registers a callback using the CUDA CUPTI library: https://github.com/pytorch/kineto/blob/main/libkineto/src/init.cpp#L130-L148
- However, this requires that dynamic linking against libcupti.so has already taken place.
- Static initializers of compilation units run before that dynamic linking completes, which leads to the segfault in #112389 (see the sketch below).
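To make the ordering issue concrete, here is a minimal, self-contained sketch (not PyTorch code) of the pattern; `register_profiling_callback` and `RegisterEagerly` are hypothetical stand-ins for the CUPTI callback registration done in libkineto's init path:

```cpp
#include <cstdio>

// Hypothetical stand-in for the CUPTI callback registration in libkineto's
// init.cpp; in the real crash this call lands in code that needs
// libcupti.so to be fully linked already.
void register_profiling_callback() {
  std::printf("registering profiling callback\n");
}

struct RegisterEagerly {
  RegisterEagerly() {
    // Runs during static initialization, before main(). If the callee lives
    // in a shared library that has not finished dynamic linking yet, this is
    // where a segfault like #112389 would occur.
    register_profiling_callback();
  }
};

// Constructed as part of this translation unit's static initialization.
static RegisterEagerly register_eagerly;

int main() {
  std::printf("main() starts only after all static initializers ran\n");
  return 0;
}
```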

## Workaround
We add a delay to the initialization, configurable via the env variable `KINETO_DAEMON_INIT_DELAY_S`. This may not be the ideal fix, but it helps resolve the issue.
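In outline, the workaround reads the delay from the environment and, when it is a positive integer, runs the Kineto initialization on a detached thread after sleeping for that many seconds; otherwise it initializes immediately. Below is a condensed, self-contained sketch of that logic (the real change is in the diff at the end of this description); `kineto_init` is stubbed out here since the real code calls into libkineto:

```cpp
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <stdexcept>
#include <string>
#include <thread>

// Parse KINETO_DAEMON_INIT_DELAY_S; return -1 if it is unset or not an integer.
int get_init_delay() {
  const char* delay_c = std::getenv("KINETO_DAEMON_INIT_DELAY_S");
  if (!delay_c) {
    return -1;
  }
  try {
    return std::stoi(std::string{delay_c});
  } catch (const std::invalid_argument&) {
    return -1;
  }
}

int main() {
  // Stand-in for libkineto_init() + suppressLogMessages() in the real code.
  auto kineto_init = []() { std::puts("kineto initialized"); };

  if (std::getenv("KINETO_USE_DAEMON") != nullptr) {
    int init_delay_s = get_init_delay();
    if (init_delay_s > 0) {
      // Defer initialization so dynamic linking (e.g. of libcupti.so) can
      // finish first; detach so we do not block startup.
      std::thread t([init_delay_s, kineto_init]() {
        std::this_thread::sleep_for(std::chrono::seconds(init_delay_s));
        kineto_init();
      });
      t.detach();
    } else {
      kineto_init();
    }
  }

  // PyTorch is a long-lived process; keep this demo alive long enough for
  // the detached thread to run.
  std::this_thread::sleep_for(std::chrono::seconds(10));
  return 0;
}
```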
## Testing
Tested this out with [linear_model_example.py](https://github.com/facebookincubator/dynolog/blob/main/scripts/pytorch/linear_model_example.py)
First, export the daemon env variable (e.g. `export KINETO_USE_DAEMON=1`).
### Without any delay
```
>$ python3 linear_model_example.py
INFO:2024-02-21 19:34:50 2366287:2366287 init.cpp:131] Registering daemon config loader, cpuOnly = 1
INFO:2024-02-21 19:34:50 2366287:2366287 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-02-21 19:34:50 2366287:2366287 IpcFabricConfigClient.cpp:93] Setting up IPC Fabric at endpoint: dynoconfigclientb8f91363-d8d6-47a7-9103-197661e28397 status = initialized
INFO:2024-02-21 19:34:50 2366287:2366287 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-02-21 19:34:50 2366287:2366287 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
cpu
99 1385.468505859375
```
### With 5 seconds delay
```
>$ KINETO_DAEMON_INIT_DELAY_S=5 python3 linear_model_example.py
cpu
99 284.82305908203125
10099 8.817167282104492
INFO:2024-02-21 19:34:26 2359155:2359214 init.cpp:131] Registering daemon config loader, cpuOnly = 1
ERROR: External init callback must run in same thread as registerClient (1782580992 != -1922169024)
INFO:2024-02-21 19:34:26 2359155:2359214 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-02-21 19:34:26 2359155:2359214 IpcFabricConfigClient.cpp:93] Setting up IPC Fabric at endpoint: dynoconfigclient49270a3f-e913-4ea6-b9e0-cc90a853a869 status = initialized
INFO:2024-02-21 19:34:26 2359155:2359214 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-02-21 19:34:26 2359155:2359214 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
20099 8.817167282104492
```
### With an invalid delay
```
>$ KINETO_DAEMON_INIT_DELAY_S=abc python3 linear_model_example.py
INFO:2024-02-21 19:35:02 2369647:2369647 init.cpp:131] Registering daemon config loader, cpuOnly = 1
INFO:2024-02-21 19:35:02 2369647:2369647 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-02-21 19:35:02 2369647:2369647 IpcFabricConfigClient.cpp:93] Setting up IPC Fabric at endpoint: dynoconfigclient0e12a349-af7b-4322-901d-1ff22f91fd4c status = initialized
INFO:2024-02-21 19:35:02 2369647:2369647 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-02-21 19:35:02 2369647:2369647 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
cpu
```
### Unit test updated as well.
## Impact
This should not impact general users: the early initialization only occurs if `KINETO_USE_DAEMON` is set in the environment.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120276
Approved by: https://github.com/anupambhatnagar, https://github.com/aaronenyeshi
diff --git a/test/profiler/test_profiler.py b/test/profiler/test_profiler.py
index 098735c..f4846b8 100644
--- a/test/profiler/test_profiler.py
+++ b/test/profiler/test_profiler.py
@@ -1661,6 +1661,7 @@
         self.assertTrue(len(e.input_shapes[0]) > 0)
 
     @patch.dict(os.environ, {"KINETO_USE_DAEMON": "1"})
+    @patch.dict(os.environ, {"KINETO_DAEMON_INIT_DELAY_S": "1"})
     def test_kineto_profiler_with_environment_variable(self):
         script = """
 import torch
diff --git a/torch/csrc/profiler/kineto_client_interface.cpp b/torch/csrc/profiler/kineto_client_interface.cpp
index 2b32c5e..bf4b8f2 100644
--- a/torch/csrc/profiler/kineto_client_interface.cpp
+++ b/torch/csrc/profiler/kineto_client_interface.cpp
@@ -2,6 +2,8 @@
 #include <ATen/Context.h>
 #include <libkineto.h>
 #include <torch/csrc/autograd/profiler_kineto.h>
+#include <chrono>
+#include <thread>
 
 // Ondemand tracing is not supported on Apple or edge platform
 #if defined(__APPLE__) || defined(EDGE_PROFILER_USE_KINETO)
@@ -73,18 +75,43 @@
 #if ENABLE_GLOBAL_OBSERVER
 namespace {
 
+int get_init_delay() {
+  const char* delay_c = std::getenv("KINETO_DAEMON_INIT_DELAY_S");
+  if (!delay_c) {
+    return -1;
+  }
+  std::string delay_s{delay_c};
+  try {
+    return std::stoi(delay_s);
+  } catch (const std::invalid_argument& _) {
+    return -1;
+  }
+}
+
 struct RegisterLibKinetoClient {
   RegisterLibKinetoClient() {
     static profiler::impl::LibKinetoClient client;
+    libkineto::api().registerClient(&client);
 
-    if (std::getenv("KINETO_USE_DAEMON") != nullptr) {
+    auto kineto_init = []() {
       libkineto_init(
           /*cpuOnly=*/!(at::hasCUDA() || at::hasXPU() || at::hasMTIA()),
           /*logOnError=*/true);
       libkineto::api().suppressLogMessages();
-    }
+    };
 
-    libkineto::api().registerClient(&client);
+    if (std::getenv("KINETO_USE_DAEMON") != nullptr) {
+      int init_delay_s = get_init_delay();
+      if (init_delay_s > 0) {
+        std::thread t([init_delay_s, kineto_init]() {
+          std::this_thread::sleep_for(std::chrono::seconds(init_delay_s));
+          kineto_init();
+        });
+        t.detach();
+      } else {
+        kineto_init();
+      }
+    }
   }
 } register_libkineto_client;