Commit 79039059 authored Nov 30, 2018 by Bairen Yi

Implement async TensorFromTransportOptions for GDR



Instead of blocking on completion of an RDMA op, the RecvTensor client
now posts a work request to the NIC send queue and returns immediately.
The GDR background polling thread invokes the callback once the
corresponding RDMA op completes, i.e. once its completion is polled from
the completion queue on the NIC. The old epoll-based mechanism is
removed, trading higher CPU usage for improved throughput and lower
latency for RDMA ops.
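
The post-and-poll flow described above can be sketched roughly as
follows. This is a minimal illustration only, not the code in this
commit: FakeCompletionQueue and AsyncReader are hypothetical names, and
the queue is an in-memory stand-in rather than a real NIC completion
queue accessed through ibverbs.

```cpp
#include <atomic>
#include <cstdint>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <unordered_map>
#include <utility>

// Hypothetical in-memory stand-in for a NIC completion queue.
class FakeCompletionQueue {
 public:
  // Simulates the NIC reporting that work request `wr_id` has finished.
  void Push(uint64_t wr_id) {
    std::lock_guard<std::mutex> lock(mu_);
    done_.push_back(wr_id);
  }
  // Non-blocking poll; returns false when no completion is available.
  bool Poll(uint64_t* wr_id) {
    std::lock_guard<std::mutex> lock(mu_);
    if (done_.empty()) return false;
    *wr_id = done_.front();
    done_.pop_front();
    return true;
  }

 private:
  std::mutex mu_;
  std::deque<uint64_t> done_;
};

// Hypothetical reader: Post() returns at once, and a background thread
// busy-polls the queue and runs the callback of each completed request.
class AsyncReader {
 public:
  using Callback = std::function<void()>;

  explicit AsyncReader(FakeCompletionQueue* cq)
      : cq_(cq), poller_([this] { PollLoop(); }) {}

  ~AsyncReader() {
    stop_ = true;
    poller_.join();
  }

  // Records the callback under a fresh work request id without blocking.
  uint64_t Post(Callback done) {
    std::lock_guard<std::mutex> lock(mu_);
    const uint64_t id = next_id_++;
    pending_[id] = std::move(done);
    return id;
  }

 private:
  void PollLoop() {
    while (!stop_) {
      uint64_t id = 0;
      if (!cq_->Poll(&id)) {
        std::this_thread::yield();  // Busy polling: costs CPU, avoids wakeups.
        continue;
      }
      Callback cb;
      {
        std::lock_guard<std::mutex> lock(mu_);
        auto it = pending_.find(id);
        if (it == pending_.end()) continue;
        cb = std::move(it->second);
        pending_.erase(it);
      }
      cb();  // Run outside the lock so callbacks may call Post() again.
    }
  }

  FakeCompletionQueue* cq_;
  std::mutex mu_;
  std::unordered_map<uint64_t, Callback> pending_;
  uint64_t next_id_ = 0;
  std::atomic<bool> stop_{false};
  std::thread poller_;  // Declared last so it starts after the other members.
};
```

The caller never waits on the op itself; the only thread that spins is
the poller, which is where the extra CPU usage mentioned above comes
from.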

The maximum number of work requests (WRs) in the send/recv queues on
the NIC is increased to accommodate the larger number of concurrent
RDMA ops. The tensor size threshold below which the tensor content is
passed in metadata is also increased, to reduce pressure on the NIC
send/recv queues.
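
The size-based choice between the two paths can be pictured as below.
This is a sketch under assumptions: the names are hypothetical and the
4096-byte value is a placeholder, since the commit message does not
state the actual threshold.

```cpp
#include <cstddef>

// Placeholder value for illustration; the real threshold is not given here.
constexpr std::size_t kInlineThresholdBytes = 4096;

enum class TransportPath { kInlineInMetadata, kRdmaRead };

// Tensors strictly below the threshold travel inside the metadata, so they
// consume no work request slot in the NIC send/recv queues.
constexpr TransportPath ChoosePath(
    std::size_t tensor_bytes,
    std::size_t threshold = kInlineThresholdBytes) {
  return tensor_bytes < threshold ? TransportPath::kInlineInMetadata
                                  : TransportPath::kRdmaRead;
}
```

Raising the threshold moves more small tensors onto the metadata path,
leaving queue slots for the larger transfers.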

This fixes #23933.

Signed-off-by: Bairen Yi <byronyi@clustar.ai>

parent b5ca1e4a
