Commit 44f117ce authored by Ayush Dubey's avatar Ayush Dubey Committed by TensorFlower Gardener
Browse files

Add support for multi-node collectives in NcclManager.

NCCL 2 enables collective communication across workers.  This change introduces
a multi-worker capable NcclManager.  The main API change is to first generate a
ncclUniqueId wrapped in a communicator key, and then pass in this unique id to
every collective call.

NCCL works best (no deadlocks) if workers enqueue collectives on GPU streams in
the same order.  The NCCL manager callee can prepare multiple collectives
concurrently, but to achieve lockstep synchronization the callee needs to signal
that a collective is ready to execute across all workers in the same order.
This is exposed via SignalMultiNodeReady.

PiperOrigin-RevId: 226076894
parent 2efcb526
Loading
Loading
Loading
Loading
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Please to comment