Add THCCachingAllocator_recordStream()
This is similar to THCCachingHostAllocator_recordEvent(), but for CUDA
allocations. It's useful for overlapping copies with computation. The
workflow is approximately:
0. allocate dst tensor on copy stream
1. copy from CPU to GPU on copy stream
2. synchronize the main stream with the copy stream via
cudaStreamWaitEvent
3. THCCachingAllocator_recordStream(dst, main_stream)
The recordStream() call is necessary to prevent the dst tensor from
being reused on the copy stream before the main stream finishes its work.
Previously, you would need to insert a second cudaStreamWaitEvent before
dst is freed to force the copy stream to wait on the main stream.
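A rough sketch of the workflow above, not taken from the patch itself. The
THCCachingAllocator_recordStream() signature (allocation pointer plus a
THCStream*) and the presence of a separate cudaStream_t handle for each
stream are assumptions made for illustration; the host buffer should be
pinned for the async copy to actually overlap.

    #include <cuda_runtime.h>

    void h2d_copy_overlapped(const float *src_cpu, float *dst_gpu, size_t bytes,
                             cudaStream_t copy_cs, cudaStream_t main_cs,
                             THCStream *main_stream,   /* wraps main_cs (assumed type) */
                             cudaEvent_t copy_done)
    {
      /* 0/1. dst_gpu was allocated on the copy stream; enqueue the
       *      host-to-device copy on that same stream. src_cpu should be
       *      pinned memory so the copy is truly asynchronous. */
      cudaMemcpyAsync(dst_gpu, src_cpu, bytes, cudaMemcpyHostToDevice, copy_cs);

      /* 2. Record an event on the copy stream and make the main stream
       *    wait on it before touching dst_gpu. */
      cudaEventRecord(copy_done, copy_cs);
      cudaStreamWaitEvent(main_cs, copy_done, 0);

      /* 3. Tell the caching allocator that dst_gpu is also in use on the
       *    main stream, so its block is not handed back to the copy stream
       *    until the main stream's outstanding work on it completes. */
      THCCachingAllocator_recordStream(dst_gpu, main_stream);
    }

Without step 3, freeing dst_gpu would return its block to the copy stream's
free list immediately, even though the main stream may still be reading it.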