Commit 9c95751e, authored Feb 25, 2019 by Chris Leary; committed by TensorFlower Gardener Feb 25, 2019

[XLA:GPU] Add NCCL-based AllReduce replica support to XLA.

This requires a CUDA-config build to enable, as the NCCL library can
only be built in a CUDA-enabled build. In non-CUDA-config builds the NCCL
thunk returns an error.
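
As a rough illustration of the build gating described above, here is a
minimal sketch. It assumes a hypothetical macro SKETCH_HAS_NCCL and a
hypothetical SketchStatus type; neither is the macro or status type the
XLA build actually uses, and the function name is invented for this
example.

```cpp
#include <string>

// Hypothetical stand-in for a status type; this is not XLA's Status.
struct SketchStatus {
  bool ok;
  std::string message;
};

// SKETCH_HAS_NCCL is an illustrative macro, assumed for this sketch only.
// It is not the flag that the real build configuration defines.
SketchStatus RunAllReduceSketch() {
#if defined(SKETCH_HAS_NCCL)
  // A CUDA-config build would issue the NCCL all-reduce here.
  return {true, ""};
#else
  // Without NCCL compiled in, the call fails instead of silently no-oping.
  return {false, "NCCL all-reduce requires a CUDA-config build"};
#endif
}
```

Returning an error from the stub, rather than omitting the symbol, keeps
callers identical across both build configurations.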

This uses a very conservative, and quite likely overkill, concurrency
approach. A followup CL should optimize for the common case, where we
enqueue many operations with the same replica count onto a stream in a
non-synchronizing fashion, and force thread synchronization only when the
number of replicas changes.
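
The followup idea, forcing synchronization only when the replica count
changes, could look roughly like the sketch below. This is not what this
commit implements; ReplicaSyncPolicy and NeedsSync are hypothetical names
introduced for illustration, and the real change would also have to
handle stream ordering and communicator setup.

```cpp
#include <mutex>

// Hypothetical helper, not part of XLA: decides whether an enqueue must
// force a thread synchronization before proceeding.
class ReplicaSyncPolicy {
 public:
  // Returns true on the first call and whenever replica_count differs
  // from the previous call, meaning participants should rendezvous.
  bool NeedsSync(int replica_count) {
    std::lock_guard<std::mutex> lock(mu_);
    const bool changed = replica_count != last_replica_count_;
    last_replica_count_ = replica_count;
    return changed;
  }

 private:
  std::mutex mu_;
  int last_replica_count_ = -1;  // -1 means no operation has been seen yet.
};
```

With such a policy, a run of operations that share one replica count
would pay for synchronization only once, on the first enqueue.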

In the future this should likely be unified with NcclManager in
tensorflow/core/nccl. For now it is kept separate because XLA does not
use TensorFlow's EventMgr-style memory allocation strategy, so that
library would likely need its memory strategy parameterized first. At
that point it should be reasonable to replace the ~200-line
implementation in the cc file with the NcclManager abstraction and unify
the two implementations.

PiperOrigin-RevId: 235632126

parent 5bce34fe
