Commit 729e39b1, authored by A. Unique TensorFlower, committed by TensorFlower Gardener

Improve the GPU memory use discipline of CollectiveReduce.

GPU memory allocation can be done in one of two modes: efficient (but
complex and therefore somewhat risky) or conservative (simpler, but less
efficient).  The main difference is that 'efficient' allocation allows
the same memory area to be allocated to multiple independent uses
simultaneously, on the expectation that those uses will in
fact be serial and thus temporally disjoint, while 'conservative'
allocation always obeys the invariant that one piece of memory is
allocated to at most one use at any point in time.

If GPUDevice::RequiresRecordingAccessedTensors() returns false, then
the TF runtime uses efficient memory allocation for GPU ops.  That is, GPU
ops are nominally synchronous and their tensor Refs are deleted
immediately after the op returns, although really the corresponding GPU
kernel is only guaranteed to have been enqueued on the compute stream
and may not yet have begun execution.

If RequiresRecordingAccessedTensors() returns true, then conservative
memory allocation is used, i.e. Refs on the tensors accessed by a GPU op
are held until the corresponding kernel is guaranteed to have completed
execution and no part of the op will touch them again.
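
The difference between the two modes can be sketched with a toy model.
This is not TensorFlow code: ToyStream, ToyAllocator, op_returned and
kernel_completed are hypothetical names invented for illustration, and
the single-region pool is an assumption chosen to make the reuse visible.

```python
from collections import deque


class ToyStream:
    """Hypothetical stand-in for a compute stream: kernels run in FIFO order."""

    def __init__(self):
        self._queue = deque()

    def enqueue(self, kernel, region):
        # The kernel is only queued here; it has not started executing.
        self._queue.append((kernel, region))

    def pending_regions(self):
        return [region for _, region in self._queue]


class ToyAllocator:
    """Hypothetical allocator with an 'efficient' and a 'conservative' mode."""

    def __init__(self, regions, conservative):
        self.free = list(regions)
        self.conservative = conservative
        self.held = set()  # regions whose kernels may still be running

    def allocate(self):
        # Returns None when no region is currently free.
        return self.free.pop(0) if self.free else None

    def op_returned(self, region):
        # The op has returned on the host; its kernel is merely enqueued.
        if self.conservative:
            self.held.add(region)      # keep the region until completion
        else:
            self.free.append(region)   # release immediately (efficient mode)

    def kernel_completed(self, region):
        # Only meaningful in conservative mode: release a held region.
        if region in self.held:
            self.held.remove(region)
            self.free.append(region)


def launch_two_kernels(conservative):
    stream = ToyStream()
    alloc = ToyAllocator(["region0"], conservative)
    first = alloc.allocate()
    stream.enqueue("k1", first)
    alloc.op_returned(first)
    # Is the region granted again while k1 is still pending on the stream?
    second = alloc.allocate()
    return first, second, stream.pending_regions()
```

In this model, efficient mode grants region0 a second time while k1 is
still queued, which is safe only because the stream is serial.
Conservative mode refuses the second grant until kernel_completed is
called.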

Efficient GPU memory allocation should be safe when the following criteria
are all met:

1. All GPU kernels are executed serially on a single compute stream.
2. All GPU kernel outputs and temp buffers are allocated by
   the GPU Op in the executor thread in which it is originally called.
3. Any read of a GPU tensor computed by a GPU kernel, other than a read
   by another kernel on that same GPU, first synchronizes on
   the compute stream that produced it.
4. Any read by a GPU kernel of a value that was not produced by another
   GPU kernel first synchronizes on the entity that produced it,
   e.g. a copy stream.
5. All direct allocations of GPU memory that are not for kernel outputs
   or temp buffers are conservative in duration.
6. Any use of directly allocated GPU memory that is not part of a kernel
   execution first synchronizes on the compute stream to ensure that
   any previously granted uses of the same region have expired before this new use.
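
Condition 6 can be illustrated with a similar toy model. Again this is
only a sketch under stated assumptions: ToyComputeStream and direct_use
are hypothetical names, and synchronize() simply drains the queue rather
than waiting on real device events.

```python
from collections import deque


class ToyComputeStream:
    """Hypothetical serial compute stream that records what ran, in order."""

    def __init__(self):
        self._pending = deque()
        self.log = []

    def enqueue(self, kernel, region):
        self._pending.append((kernel, region))

    def synchronize(self):
        # Model "wait for the stream": run every kernel enqueued so far.
        while self._pending:
            kernel, region = self._pending.popleft()
            self.log.append(f"{kernel} ran on {region}")

    def has_pending_use(self, region):
        return any(r == region for _, r in self._pending)


def direct_use(stream, region, sync_first):
    """Use directly allocated memory outside of any kernel execution."""
    if sync_first:
        stream.synchronize()
    if stream.has_pending_use(region):
        # An earlier grant of this region has not expired yet.
        return "hazard"
    stream.log.append(f"direct use of {region}")
    return "ok"
```

Without the synchronization step, the direct use overlaps with a kernel
that was granted the same region earlier and has not yet run. With it,
the earlier use is guaranteed to have expired first.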

These conditions together should be sufficient for safety, and
correspond to established practice, though it may be possible to
contrive other sets of rules that are also sufficient.

Collective Ops for GPUs are unusual in that they are async (as TF
Ops) and they can directly allocate GPU memory in CPU threads that are
asynchronous to the launching executor thread.  This CL corrects a
couple of subtle misuse errors related to conditions 2 and 6.

PiperOrigin-RevId: 210841522
parent b7c2e787