Hierarchical broadcast collective op.
Before this CL, we had a binary tree based implementation. We would essentially form an arbitrary, logical binary tree over all devices and then send the tensor from each node to its children in the tree. This change introduces a topology-aware hierarchical implementation. First, we perform a binary tree broadcast with only 1 device per machine. Then for each host, the device that participated in the first inter-machine broadcast is the source for an intra-machine binary tree broadcast. PiperOrigin-RevId: 206771821
Loading
Please sign in to comment