Commit 2dcb0a07 authored Feb 06, 2019 by A. Unique TensorFlower Committed by TensorFlower Gardener Feb 06, 2019

New Timestamped BFCAllocator and GPUKernelTracker.

The first part of this change extends BFCAllocator with an optional
timing counter for recording the time at which each Chunk is freed.
This has no effect on conventional memory management (as
applied to CPU RAM), but enables new behavior when applied
to GPU RAM management. The default TensorFlow memory allocation
convention for GPU RAM is to Unref the tensors Ref'd by a GPU Op as
soon as the Op has queued its kernel (and before that kernel is known
to have completed execution). This is safe if the memory is
subsequently allocated to another GPU Op (the usual case) because that
second Op will be sequential on the single GPU compute stream and
hence won't touch the memory until the prior kernel has completed.
But this practice is unsafe if the memory is used for I/O or for an Op
queued on a different compute stream unless some further
synchronization is inserted.

Currently, I/O between a GPU and another device is made safe by
inserting stream dependencies. Multi-compute-stream computation is
made safe by delaying the Unref of Ref'd tensors until the kernel is
known to have completed, via callback through the GPU-specific
EventMgr. RDMA networking using GPUDirect is another difficult case
where stream synchronization is not possible and it is necessary to
wait until kernels are known to have completed before allowing
reallocation of the used memory.

Simply delaying the deallocation of memory until kernels are known to
have completed is unsatisfactory because it substantially raises the
high-water memory requirements of a program, drastically affecting the
model architectures that are feasible on a particular GPU model. The
new freed-at count on BFCAllocator::Chunk is part of a strategy
for maintaining the high-water size efficiency of our current
single-compute-stream GPU memory allocation strategy while reducing
synchronization stalls in I/O uses of GPU RAM. In the future it
may also be applied to multi-compute-stream execution.

The key idea is that when a request to allocate GPU memory is made we
can also pass along a 'freed-by' count, and the allocator is free
to return any Chunk whose freed_count is <= that threshold.
This way we can continue to allocate freed GPU RAM early, without
restriction, to GPU kernels that will execute on the single compute
stream, while still satisfying the correctness constraints
needed for off-stream use.
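
The freed-by rule above can be illustrated with a small toy model. This
is only a sketch under stated assumptions, not the BFCAllocator code:
ToyTimestampedAllocator, Chunk's fields, kNoRestriction and the
first-fit search are all hypothetical names and simplifications chosen
for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical toy chunk: records the counter value at which it was freed.
struct Chunk {
  std::size_t size;
  bool in_use;
  std::uint64_t freed_at_count;  // 0 means "never used, always safe"
};

// Assumption of this sketch: callers with no timing constraint (ordinary
// single-compute-stream kernels) pass the maximum count as the threshold.
const std::uint64_t kNoRestriction =
    std::numeric_limits<std::uint64_t>::max();

class ToyTimestampedAllocator {
 public:
  explicit ToyTimestampedAllocator(const std::vector<std::size_t>& sizes) {
    for (std::size_t s : sizes) chunks_.push_back(Chunk{s, false, 0});
  }

  // First-fit search: returns the index of a free chunk that is large
  // enough AND was freed at a count <= freed_by_count, or -1 if none.
  int Allocate(std::size_t num_bytes, std::uint64_t freed_by_count) {
    for (std::size_t i = 0; i < chunks_.size(); ++i) {
      Chunk& c = chunks_[i];
      if (c.in_use || c.size < num_bytes) continue;
      if (c.freed_at_count > freed_by_count) continue;  // freed too recently
      c.in_use = true;
      return static_cast<int>(i);
    }
    return -1;
  }

  // Marks the chunk free and stamps it with the current counter value.
  void Free(int index, std::uint64_t now_count) {
    Chunk& c = chunks_.at(static_cast<std::size_t>(index));
    assert(c.in_use);
    c.in_use = false;
    c.freed_at_count = now_count;
  }

 private:
  std::vector<Chunk> chunks_;
};
```

In this model an on-stream kernel would request memory with
kNoRestriction and may reuse a chunk the moment it is freed, while an
off-stream user passes a smaller threshold and is refused any chunk
freed after it.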

GPUKernelTracker is the other component needed to make this new
strategy work. It keeps track of the stream queuing and real
completion times of GPU kernels, thus making it possible to pick the
largest safe freed-by count when making a request for GPU memory
that must be immediately unencumbered by other uses. A secondary
capability of the GPUKernelTracker is that it enables capping the
number of GPU kernels queued on a stream. Without this cap some TF
models can experience moments when hundreds of kernels are queued on
the single compute stream. Those queued-but-not-executing kernels can
tie up memory that could be used for other purposes before it's really
needed, and can delay I/O operations that are queued later and must,
for safety, wait for the compute stream to clear.
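
As a rough illustration of the two tracker roles described above, the
sketch below picks a safe freed-by count and caps the number of pending
kernels. It is a hypothetical toy, not the GPUKernelTracker interface:
ToyKernelTracker and its method names are invented here, and it assumes
kernels on the single compute stream complete in the order they were
queued and that frees draw their stamps from the same counter.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>

// Hypothetical toy tracker for one compute stream.
class ToyKernelTracker {
 public:
  explicit ToyKernelTracker(std::size_t max_pending)
      : max_pending_(max_pending), counter_(0) {}

  // Advances the shared counter; a free event would use this as its stamp.
  std::uint64_t NextCount() { return ++counter_; }

  // Pending-kernel cap: callers should wait while this returns false.
  bool CanQueue() const { return pending_.size() < max_pending_; }

  // Records that a kernel was queued and returns its queue-time count.
  std::uint64_t RecordQueued() {
    assert(CanQueue());
    pending_.push_back(++counter_);
    return pending_.back();
  }

  // Records completion of the oldest pending kernel (in-order assumption).
  void RecordOldestTerminated() {
    assert(!pending_.empty());
    pending_.pop_front();
  }

  // Largest count such that every kernel queued at or before it is done.
  // A chunk freed at or before this count can no longer be touched by a
  // still-pending kernel that held it when it was queued.
  std::uint64_t SafeFreedByCount() const {
    return pending_.empty() ? counter_ : pending_.front() - 1;
  }

 private:
  std::size_t max_pending_;
  std::uint64_t counter_;
  std::deque<std::uint64_t> pending_;  // queue-time counts, oldest first
};
```

Under these assumptions, a chunk freed right after a kernel is queued
becomes eligible for off-stream use only once that kernel has been
recorded as terminated, while later kernels do not hold it back.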

The new timestamped memory allocation strategy and pending-kernel
capping are considered experimental features and are off by default
for now, until more experience is gained.

PiperOrigin-RevId: 232705088

parent 4e928af5
