New Timestamped BFCAllocator and GPUKernelTracker.
The first part of this change extends BFCAllocator with an optional timing counter that records the time at which each Chunk is freed. This has no effect on conventional memory management (as applied to CPU RAM), but enables a new behavior when applied to GPU RAM management.

The default TensorFlow memory allocation convention for GPU RAM is to Unref the tensors Ref'd by a GPU Op as soon as the Op has queued its kernel (and before that kernel is known to have completed execution). This is safe if the memory is subsequently allocated to another GPU Op (the usual case), because that second Op will be sequential on the single GPU compute stream and hence won't touch the memory until the prior kernel has completed. But this practice is unsafe if the memory is used for I/O, or for an Op queued on a different compute stream, unless some further synchronization is inserted.

Currently, I/O between a GPU and another device is made safe by inserting stream dependencies. Multi-compute-stream computation is made safe by delaying the Unref of Ref'd tensors until the kernel is known to have completed, via callback through the GPU-specific EventMgr. RDMA networking using GPUDirect is another difficult case: stream synchronization is not possible, so it is necessary to wait until kernels are known to have completed before allowing reallocation of the used memory.

Simply delaying the deallocation of memory until kernels are known to have completed is unsatisfactory because it substantially raises the high-water memory requirements of a program, drastically affecting the model architectures that are feasible on a particular GPU model. The new freed-at count on BFCAllocator::Chunk is part of a strategy for maintaining the high-water size efficiency of our current single-compute-stream GPU memory allocation strategy while reducing synchronization stalls in I/O uses of GPU RAM. In the future it may also be applied to multi-compute-stream execution.
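
To make the freed-at count concrete, here is a minimal sketch of a free list whose chunks are stamped from a monotonically increasing counter when they are freed, and whose allocation call accepts a threshold on that stamp. All names (ToyChunk, ToyTimestampedAllocator, Allocate, Free) are hypothetical and chosen for illustration; this is not the BFCAllocator interface, and it ignores splitting, coalescing, and binning entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for an allocator chunk.
struct ToyChunk {
  std::size_t size;
  bool in_use;
  std::uint64_t freed_at;  // counter value at the last free; 0 = never freed
};

// Toy fixed-chunk allocator illustrating freed-at stamping only.
class ToyTimestampedAllocator {
 public:
  explicit ToyTimestampedAllocator(const std::vector<std::size_t>& sizes) {
    for (std::size_t s : sizes) {
      ToyChunk c;
      c.size = s;
      c.in_use = false;
      c.freed_at = 0;
      chunks_.push_back(c);
    }
  }

  // Returns the index of the first free chunk that is large enough and
  // whose freed_at stamp is <= freed_by, or -1 if no chunk qualifies.
  int Allocate(std::size_t bytes, std::uint64_t freed_by) {
    for (std::size_t i = 0; i < chunks_.size(); ++i) {
      ToyChunk& c = chunks_[i];
      if (!c.in_use && c.size >= bytes && c.freed_at <= freed_by) {
        c.in_use = true;
        return static_cast<int>(i);
      }
    }
    return -1;
  }

  // Marks the chunk free and stamps it with the next counter value.
  std::uint64_t Free(int index) {
    ToyChunk& c = chunks_.at(static_cast<std::size_t>(index));
    c.in_use = false;
    c.freed_at = ++counter_;
    return c.freed_at;
  }

  // Current value of the timing counter.
  std::uint64_t counter() const { return counter_; }

 private:
  std::vector<ToyChunk> chunks_;
  std::uint64_t counter_ = 0;
};
```

In this sketch a caller that passes the current counter value as the threshold sees every free chunk, which corresponds to the unrestricted early-reuse case; a caller that passes a smaller value sees only chunks freed no later than that count.
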
The key idea is that when a request to allocate GPU memory is made, we can also pass along a 'freed-by' count, and the allocator is free to return any Chunk whose freed-at count is <= that threshold. This way we can continue to early-allocate GPU RAM, without restriction, to GPU kernels that will be executed on the single compute stream, while simultaneously satisfying the correctness constraints needed for off-stream use.

GPUKernelTracker is the other component needed to make this new strategy work. It keeps track of the stream queuing and real completion times of GPU kernels, thus making it possible to pick the largest safe freed-by count when making a request for GPU memory that must be immediately unencumbered by other uses.

A secondary capability of the GPUKernelTracker is that it enables capping the number of GPU kernels queued on a stream. Without this cap, some TF models can experience moments when hundreds of kernels are queued on the single compute stream. Those queued-but-not-executing kernels can tie up memory, before it's really needed, that could be used for other purposes, and can delay I/O operations that are queued later and must wait for the compute stream to clear, for safety.

The new timestamped memory allocation strategy and pending-kernel capping are considered experimental features and are off by default for now, until more experience is gained.

PiperOrigin-RevId: 232705088
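
As an illustration of how a tracker could derive the largest safe freed-by count, the sketch below assumes a single compute stream (so kernels complete in the order they were queued) and an allocator counter that is incremented just before each free is stamped. Under those assumptions, every chunk stamped at or below the counter value seen when the oldest still-pending kernel was queued can only have been touched by kernels that have already completed. The class and method names are hypothetical and are not the GPUKernelTracker interface.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>

// Hypothetical tracker for kernels on one FIFO compute stream.
class ToyKernelTracker {
 public:
  // max_pending == 0 means "no cap on queued kernels".
  explicit ToyKernelTracker(std::size_t max_pending)
      : max_pending_(max_pending) {}

  // True if another kernel may be queued without exceeding the cap.
  bool CanQueue() const {
    return max_pending_ == 0 || pending_.size() < max_pending_;
  }

  // Records a kernel queued while the allocator counter read queued_count.
  void RecordQueued(std::uint64_t queued_count) {
    pending_.push_back(queued_count);
  }

  // Records completion of the oldest pending kernel (FIFO assumption).
  void RecordOldestCompleted() {
    if (!pending_.empty()) pending_.pop_front();
  }

  // Largest freed-by count that is safe for memory needed immediately.
  // With nothing pending, every freed chunk is safe, so the current
  // counter is returned; otherwise only chunks freed before the oldest
  // pending kernel was queued are safe.
  std::uint64_t SafeFreedBy(std::uint64_t current_count) const {
    return pending_.empty() ? current_count : pending_.front();
  }

  // Number of kernels queued but not yet known to have completed.
  std::size_t num_pending() const { return pending_.size(); }

 private:
  std::size_t max_pending_;
  std::deque<std::uint64_t> pending_;  // queue-time counter values, oldest first
};
```

An I/O path that needs memory no kernel can still touch would pass SafeFreedBy(...) as its threshold, while ordinary on-stream allocations would pass the current counter. A scheduler honoring the cap would check CanQueue() before launching and wait for a completion otherwise.
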