Add support for multi-node collectives in NcclManager.

NCCL 2 enables collective communication across workers. This change introduces a multi-worker capable NcclManager.

The main API change is that the caller first generates a ncclUniqueId wrapped in a communicator key, and then passes this unique id to every collective call.

NCCL works best (in particular, it avoids deadlocks) when workers enqueue collectives on GPU streams in the same order. A caller of the NCCL manager can prepare multiple collectives concurrently, but to achieve lockstep synchronization the caller must signal, in the same order on every worker, that a collective is ready to execute across all workers. This is exposed via SignalMultiNodeReady.

PiperOrigin-RevId: 226076894
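
The ordering requirement described above can be illustrated with a toy model. This is only a sketch: ToyWorker, Prepare, SignalReady and StreamsMatch are hypothetical names invented for illustration, not the NcclManager API, and no NCCL or GPU calls are made. It assumes a collective is appended to a worker's (simulated) stream only when it is signaled ready, so workers that prepare collectives in different orders still end up with identical stream orders as long as they signal in one agreed order.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical toy model of a single worker. It only records ordering;
// it is not the real NcclManager and performs no communication.
class ToyWorker {
 public:
  // Preparing a collective may happen in any order on each worker.
  void Prepare(const std::string& collective_key) {
    prepared_.insert(collective_key);
  }

  // Signaling readiness is the point at which the collective is appended
  // to the simulated GPU stream. The key must have been prepared first.
  void SignalReady(const std::string& collective_key) {
    assert(prepared_.count(collective_key) == 1);
    prepared_.erase(collective_key);
    stream_.push_back(collective_key);
  }

  const std::vector<std::string>& stream() const { return stream_; }

 private:
  std::set<std::string> prepared_;    // prepared but not yet launched
  std::vector<std::string> stream_;   // launch order on the simulated stream
};

// Two workers prepare the same collectives in different orders, then
// signal them in one agreed order. Returns true if both simulated
// streams end up with the same launch order.
bool StreamsMatch() {
  ToyWorker w0, w1;
  w0.Prepare("allreduce_a");
  w0.Prepare("allreduce_b");
  w1.Prepare("allreduce_b");  // different preparation order on worker 1
  w1.Prepare("allreduce_a");

  const std::vector<std::string> agreed_order = {"allreduce_a", "allreduce_b"};
  for (const std::string& key : agreed_order) {
    w0.SignalReady(key);
    w1.SignalReady(key);
  }
  return w0.stream() == w1.stream();
}
```

In this model the preparation order never reaches the stream; only the signal order does, which is why a single agreed signal order is sufficient for the streams to match.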