Change RecvBufRespExtra.tensor_content to a repeated string and fill it with many small strings instead of one large one when using gRPC.

Typing tensor_content as a Cord instead of a single string leads to roughly a 20% speedup in a 2-worker (8 V100 GPUs each) benchmark training of resnet50 using collective all-reduce for gradient reduction and gRPC for all inter-worker transport. The hypothesis is that, without the Cord type, gRPC stalls incoming RecvBuf RPCs as it repeatedly reallocates and copies the strings. Receiving the value into a Cord leads to much better flow control.

Unfortunately, proto3 does not yet support [ctype=CORD], so we can't use that simple and effective optimization. Instead, this CL changes tensor_content to a sequence of strings and sets a maximum single-string size of 4KB, the likely page size. (This default can be changed via ConfigProto.experimental.recv_buf_max_chunk.) It achieves roughly a 12% speedup on the same benchmark.

The speedups are highly dependent on topology and network weather, since the major effect is believed to be on flow control.

PiperOrigin-RevId: 219322231
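
For illustration only, the chunking idea described above can be sketched in Python. This is not the code in this CL: the helper names split_into_chunks and join_chunks and the constant DEFAULT_MAX_CHUNK_BYTES are hypothetical, and the only values taken from the description are the 4KB default and the fact that the limit is configurable.

```python
from typing import List

# 4KB default from the description above; the constant name is hypothetical.
DEFAULT_MAX_CHUNK_BYTES = 4096


def split_into_chunks(payload: bytes,
                      max_chunk_bytes: int = DEFAULT_MAX_CHUNK_BYTES) -> List[bytes]:
    """Split one large buffer into pieces of at most max_chunk_bytes each."""
    if max_chunk_bytes <= 0:
        raise ValueError("max_chunk_bytes must be positive")
    return [payload[offset:offset + max_chunk_bytes]
            for offset in range(0, len(payload), max_chunk_bytes)]


def join_chunks(chunks: List[bytes]) -> bytes:
    """Reassemble the original buffer on the receiving side."""
    return b"".join(chunks)


if __name__ == "__main__":
    payload = bytes(10000)  # stand-in for serialized tensor bytes
    chunks = split_into_chunks(payload)
    print([len(c) for c in chunks])
    assert join_chunks(chunks) == payload
```

Each element of the returned list would correspond to one entry of the repeated field, so the receiver never has to grow a single large string while the message arrives.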