Add THCCachingAllocator_recordStream()
This is similar to THCCachingHostAllocator_recordEvent(), but for CUDA
allocations. It's useful for overlapping copies with computation. The
workflow is approximately:
0. allocate dst tensor on copy stream
1. copy from CPU to GPU on copy stream
2. synchronize the main stream with the copy stream via
cudaStreamWaitEvent
3. THCCachingAllocator_recordStream(dst, main_stream)
The recordStream() call is necessary to prevent the dst tensor from
being reused on the copy stream before the main stream finishes its work.
Previously, you would need to insert a second cudaStreamWaitEvent before
dst is freed to force the copy stream to wait on the main stream.
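A rough sketch of the workflow above, not taken from the patch itself. The
THCCachingAllocator_recordStream() signature (allocation pointer plus a
THCStream*) and the presence of a separate cudaStream_t handle for each
stream are assumptions made for illustration; the host buffer should be
pinned for the async copy to actually overlap.

    #include <cuda_runtime.h>

    void h2d_copy_overlapped(const float *src_cpu, float *dst_gpu, size_t bytes,
                             cudaStream_t copy_cs, cudaStream_t main_cs,
                             THCStream *main_stream,   /* wraps main_cs (assumed type) */
                             cudaEvent_t copy_done)
    {
      /* 0/1. dst_gpu was allocated on the copy stream; enqueue the
       *      host-to-device copy on that same stream. src_cpu should be
       *      pinned memory so the copy is truly asynchronous. */
      cudaMemcpyAsync(dst_gpu, src_cpu, bytes, cudaMemcpyHostToDevice, copy_cs);

      /* 2. Record an event on the copy stream and make the main stream
       *    wait on it before touching dst_gpu. */
      cudaEventRecord(copy_done, copy_cs);
      cudaStreamWaitEvent(main_cs, copy_done, 0);

      /* 3. Tell the caching allocator that dst_gpu is also in use on the
       *    main stream, so its block is not handed back to the copy stream
       *    until the main stream's outstanding work on it completes. */
      THCCachingAllocator_recordStream(dst_gpu, main_stream);
    }

Without step 3, freeing dst_gpu would return its block to the copy stream's
free list immediately, even though the main stream may still be reading it.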