[XLA:GPU] Add NCCL-based AllReduce replica support to XLA.
This requires a CUDA-config build to enable, because the NCCL library can only be built in a CUDA-enabled build. In non-CUDA-config builds the NCCL thunk returns an error.

This uses a very conservative, and quite likely overkill, concurrency approach. In a followup CL it would be better to optimize for the common case where we're enqueueing many operations with the same replica count onto a stream in a non-synchronizing fashion, and only force thread synchronization if the number of replicas changes.

In the future this should likely be unified with NcclManager in tensorflow/core/nccl. For now it is separate because XLA does not use TensorFlow's EventMgr-style memory allocation strategy, so that library will likely need to be parameterized on the memory strategy being used. At that point it should be reasonable to scoop out the ~200-line implementation in the cc file and replace it with the NcclManager abstraction, unifying the two implementations.

PiperOrigin-RevId: 235632126
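For illustration only, the kind of conservative thread synchronization described above can be sketched as a host-side rendezvous: each replica thread blocks until every participant has arrived, and only then does any of them proceed with the reduced result. This is not the code in this CL and does not call NCCL; the names `Rendezvous`, `AllReduceSum`, and `RunReplicas` are hypothetical, and the reduction is a plain CPU sum standing in for the collective.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical one-shot rendezvous (an assumption for illustration, not the
// XLA implementation): every participant blocks until all replicas arrive.
class Rendezvous {
 public:
  explicit Rendezvous(std::size_t num_replicas)
      : num_replicas_(num_replicas) {}

  // Adds this replica's buffer into the shared sum, waits for all other
  // replicas, then overwrites the buffer with the element-wise sum.
  void AllReduceSum(std::vector<float>* buffer) {
    std::unique_lock<std::mutex> lock(mu_);
    if (arrived_ == 0) sum_.assign(buffer->size(), 0.0f);
    assert(buffer->size() == sum_.size());
    for (std::size_t i = 0; i < sum_.size(); ++i) sum_[i] += (*buffer)[i];
    ++arrived_;
    if (arrived_ == num_replicas_) {
      // Last arrival: release everyone.
      done_ = true;
      cv_.notify_all();
    } else {
      // Conservative path: always block until the full set has arrived.
      cv_.wait(lock, [this] { return done_; });
    }
    *buffer = sum_;
  }

 private:
  const std::size_t num_replicas_;
  std::mutex mu_;
  std::condition_variable cv_;
  std::size_t arrived_ = 0;
  bool done_ = false;
  std::vector<float> sum_;
};

// Runs one thread per replica and returns each replica's reduced buffer.
std::vector<std::vector<float>> RunReplicas(
    std::vector<std::vector<float>> inputs) {
  Rendezvous rendezvous(inputs.size());
  std::vector<std::thread> threads;
  for (auto& buffer : inputs) {
    threads.emplace_back(
        [&rendezvous, &buffer] { rendezvous.AllReduceSum(&buffer); });
  }
  for (auto& t : threads) t.join();
  return inputs;
}
```

Blocking every participant on every operation is simple to reason about but serializes work that could otherwise be enqueued asynchronously, which is the cost the followup optimization mentioned above would aim to avoid.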