.. currentmodule:: torch.cuda._sanitizer

CUDA Stream Sanitizer
=====================

.. note::
    This is a prototype feature, which means it is at an early stage
    for feedback and testing, and its components are subject to change.

Overview
--------

.. automodule:: torch.cuda._sanitizer


Usage
-----

Here is an example of a simple synchronization error in PyTorch:

::

    import torch

    a = torch.rand(4, 2, device="cuda")

    with torch.cuda.stream(torch.cuda.Stream()):
        torch.mul(a, 5, out=a)
The ``a`` tensor is initialized on the default stream and, without any synchronization
methods, modified on a new stream. The two kernels may run concurrently on the same tensor,
so the second kernel might read uninitialized data before the first one has written it,
or the first kernel might overwrite part of the result of the second. When this script is
run on the command line with:

::

    TORCH_CUDA_SANITIZER=1 python example_error.py

the following output is printed by CSAN:

::

    ============================
    CSAN detected a possible data race on tensor with data pointer 139719969079296
    Access by stream 94646435460352 during kernel:
    aten::mul.out(Tensor self, Tensor other, *, Tensor(a!) out) -> Tensor(a!)
    writing to argument(s) self, out, and to the output
    With stack trace:
      File "example_error.py", line 6, in <module>
        torch.mul(a, 5, out=a)
      ...
      File "pytorch/torch/cuda/_sanitizer.py", line 364, in _handle_kernel_launch
        stack_trace = traceback.StackSummary.extract(

    Previous access by stream 0 during kernel:
    aten::rand(int[] size, *, int? dtype=None, Device? device=None) -> Tensor
    writing to the output
    With stack trace:
      File "example_error.py", line 3, in <module>
        a = torch.rand(4, 2, device="cuda")
      ...
      File "pytorch/torch/cuda/_sanitizer.py", line 364, in _handle_kernel_launch
        stack_trace = traceback.StackSummary.extract(

    Tensor was allocated with stack trace:
      File "example_error.py", line 3, in <module>
        a = torch.rand(4, 2, device="cuda")
      ...
      File "pytorch/torch/cuda/_sanitizer.py", line 420, in _handle_memory_allocation
        traceback.StackSummary.extract(

This gives extensive insight into the origin of the error:

- A tensor was incorrectly accessed from streams with ids: 0 (default stream) and 94646435460352 (new stream)
- The tensor was allocated by invoking ``a = torch.rand(4, 2, device="cuda")``
- The faulty accesses were caused by operators

  - ``a = torch.rand(4, 2, device="cuda")`` on stream 0
  - ``torch.mul(a, 5, out=a)`` on stream 94646435460352

- The error message also displays the schemas of the invoked operators, along with a note
  showing which arguments of the operators correspond to the affected tensor.

  - In the example, it can be seen that tensor ``a`` corresponds to the arguments ``self`` and ``out``,
    and to the return value, of the invoked operator ``torch.mul``.

.. seealso::
    The list of supported torch operators and their schemas can be viewed
    :doc:`here <torch>`.
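
The operator schema quoted in a CSAN report can also be inspected directly from Python,
via the ``torch.ops`` namespace. A minimal sketch, assuming a PyTorch build that exposes
``torch.ops.aten`` overloads:

::

    import torch

    # torch.ops.aten.mul.out is the ATen overload behind torch.mul(..., out=...);
    # its _schema is the same signature that appears in the sanitizer report.
    schema = torch.ops.aten.mul.out._schema
    print(schema)
    # e.g. aten::mul.out(Tensor self, Tensor other, *, Tensor(a!) out) -> Tensor(a!)

The ``(a!)`` annotations mark arguments that the operator mutates, which is how CSAN
knows the access was a write.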

The bug can be fixed by forcing the new stream to wait for the default stream:

::

    with torch.cuda.stream(torch.cuda.Stream()):
        torch.cuda.current_stream().wait_stream(torch.cuda.default_stream())
        torch.mul(a, 5, out=a)
When the script is run again, no errors are reported.
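
An equivalent fix, sketched here as an alternative not taken from the original example,
uses a ``torch.cuda.Event`` recorded on the default stream to order the two kernels:

::

    import torch

    if torch.cuda.is_available():  # the example requires a CUDA device
        a = torch.rand(4, 2, device="cuda")

        # Mark the point on the default stream at which `a` is fully initialized...
        done = torch.cuda.Event()
        done.record(torch.cuda.default_stream())

        with torch.cuda.stream(torch.cuda.Stream()):
            # ...and make the current (side) stream wait for that point
            # before launching any kernel that touches `a`.
            done.wait()
            torch.mul(a, 5, out=a)

        torch.cuda.synchronize()

Events are useful when the producing stream should not be blocked: ``wait_stream``
orders against all prior work on the other stream, while an event orders against
only the work recorded before it.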

API Reference
-------------

.. autofunction:: enable_cuda_sanitizer
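
As an alternative to setting the ``TORCH_CUDA_SANITIZER`` environment variable, the
sanitizer can be enabled programmatically; a minimal sketch, to be run before any CUDA
kernels are launched:

::

    import torch.cuda._sanitizer as csan

    # Install the sanitizer's hooks for the rest of the process.
    csan.enable_cuda_sanitizer()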