Commit 729e39b1 authored Aug 29, 2018 by A. Unique TensorFlower Committed by TensorFlower Gardener Aug 29, 2018

Improve the GPU memory use discipline of CollectiveReduce.

GPU memory allocation can be done in one of two modes: efficient (but
complex and therefore somewhat risky) or conservative (simpler, but less
efficient). The main difference is that 'efficient' allocation allows
the same memory area to be allocated to multiple independent uses
simultaneously, on the expectation that those uses will in
fact be serial and thus temporally disjoint, while 'conservative'
allocation always obeys the invariant that one piece of memory is
allocated to at most one use at any point in time.

If GPUDevice::RequiresRecordingAccessedTensors() returns false, then
the TF runtime uses efficient memory allocation for GPU ops. That is, GPU
ops are nominally synchronous and their tensor Refs are deleted
immediately after the op returns, although really the corresponding GPU
kernel is only guaranteed to have been enqueued on the compute stream
and may not yet have begun execution.

If RequiresRecordingAccessedTensors() returns true, then conservative
memory allocation is used, i.e. Refs on the tensors accessed by a GPU op
are held until the corresponding kernel is guaranteed to have completed
execution and no part of the op will touch them again.
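
The difference between the two modes can be sketched with a toy model.
This is only an illustration, not TensorFlow code: ToyStream and Launch
are hypothetical names, a std::shared_ptr stands in for a tensor Ref,
and a FIFO of closures stands in for the GPU compute stream.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <memory>
#include <utility>

// Hypothetical stand-in for a GPU compute stream: work is enqueued now
// and executed later, in FIFO order, when Drain() is called.
class ToyStream {
 public:
  void Enqueue(std::function<void()> fn) { pending_.push_back(std::move(fn)); }

  void Drain() {
    while (!pending_.empty()) {
      std::function<void()> fn = std::move(pending_.front());
      pending_.pop_front();
      fn();
      // fn is destroyed here, releasing anything the closure captured.
    }
  }

 private:
  std::deque<std::function<void()>> pending_;
};

// Models one GPU op launch. Returns a weak observer of the op's buffer so
// the caller can check whether any reference to it is still alive.
//   hold_ref_until_done == false: "efficient" mode, the reference is
//     dropped as soon as the op returns (kernel merely enqueued).
//   hold_ref_until_done == true: "conservative" mode, the queued work
//     keeps a reference until the kernel body has actually run.
std::weak_ptr<int> Launch(ToyStream& stream, bool hold_ref_until_done) {
  auto buf = std::make_shared<int>(0);
  std::weak_ptr<int> observer = buf;
  if (hold_ref_until_done) {
    stream.Enqueue([buf] { *buf = 1; });  // closure owns a reference
  } else {
    stream.Enqueue([] { /* kernel would touch the raw memory here */ });
  }
  return observer;  // buf goes out of scope: the op has "returned"
}
```

In this model the efficient-mode buffer is already unreferenced when
Launch returns, while the conservative-mode buffer stays referenced
until the stream has drained.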

Efficient GPU memory allocation should be safe when the following criteria
are all met:

1. All GPU kernels are executed serially on a single compute stream.
2. All GPU kernel outputs and temp buffers are allocated by
the GPU Op in the executor thread in which it is originally called.
3. Any read of a GPU tensor computed by a GPU kernel, other than a read
by another kernel on that same GPU, first synchronizes on
the compute stream that produced it.
4. Any read by a GPU kernel of a value that was not produced by another
GPU kernel first synchronizes on the entity that produced it,
e.g. a copy stream.
5. All direct allocations of GPU memory that are not for kernel outputs
or temp buffers are conservative in duration.
6. Any use of directly allocated GPU memory that is not part of a kernel
execution first synchronizes on the compute stream to ensure that
any prior granted uses of the same region have expired before this new use.

These conditions together should be sufficient for safety, and
correspond to established practice, though it may be possible to
contrive other sets of rules that are also sufficient.
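
Why condition 6 matters can be shown with a small deterministic model.
It is a sketch under stated assumptions rather than the runtime's actual
mechanism: ToyComputeStream and DirectUse are hypothetical names, a
single int stands in for a reused GPU memory region, and the real hazard
is a race, not the fixed ordering shown here.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <utility>

// Hypothetical stand-in for the compute stream: kernels run only when
// BlockUntilIdle() drains the queue.
struct ToyComputeStream {
  std::queue<std::function<void()>> pending;

  void Enqueue(std::function<void()> kernel) {
    pending.push(std::move(kernel));
  }

  void BlockUntilIdle() {
    while (!pending.empty()) {
      std::function<void()> kernel = std::move(pending.front());
      pending.pop();
      kernel();
    }
  }
};

// Models a direct (non-kernel) use of a region that an earlier kernel may
// still be about to write. Returns the value seen once everything that
// was pending has finished.
int DirectUse(ToyComputeStream& stream, int& region, bool sync_first) {
  if (sync_first) {
    stream.BlockUntilIdle();  // condition 6: let prior uses expire first
  }
  region = 42;                // the new, directly issued use of the region
  stream.BlockUntilIdle();    // anything still pending runs after our write
  return region;
}

// Runs the scenario: an earlier kernel that writes 7 is still pending
// when the direct use begins.
int RunScenario(bool sync_first) {
  ToyComputeStream stream;
  int region = 0;
  stream.Enqueue([&region] { region = 7; });
  return DirectUse(stream, region, sync_first);
}
```

In this model RunScenario(true) yields the value written by the direct
use, while RunScenario(false) yields the stale kernel's value, because
the earlier use had not yet expired when the region was reused.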

Collective Ops for GPUs are unusual in that they are async (as TF
Ops) and they can directly allocate GPU memory in CPU threads that are
asynchronous to the launching executor thread. This CL corrects a
couple of subtle misuse errors related to conditions 2 and 6.

PiperOrigin-RevId: 210841522

parent b7c2e787
