Commit 79039059 authored by Bairen Yi

Implement async TensorFromTransportOptions for GDR



Instead of blocking until an RDMA op completes, the RecvTensor client
now posts a work request to the NIC send queue and returns immediately.
The GDR background polling thread invokes the callback once the
corresponding RDMA op has completed, i.e. once its completion has been
polled from the completion queue on the NIC. The old epoll-based
mechanism is removed, trading higher CPU usage for improved throughput
and lower latency for RDMA ops.
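
The post-and-return flow described above can be sketched roughly as
follows. This is a minimal illustration, not the code in this commit:
FakeNic, AsyncReader, ReadAsync, and RunDemo are hypothetical names, and
the NIC is mocked with an in-memory queue rather than the verbs API, so
every posted work request "completes" as soon as it is posted.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for a NIC: posting never blocks, and the
// completion becomes visible to a later poll of the completion queue.
class FakeNic {
 public:
  // Non-blocking: record the work request and return immediately.
  void PostSend(uint64_t wr_id) {
    std::lock_guard<std::mutex> lock(mu_);
    completed_.push_back(wr_id);
  }

  // Non-blocking poll: true (and *wr_id set) if a completion is ready.
  bool PollCq(uint64_t* wr_id) {
    std::lock_guard<std::mutex> lock(mu_);
    if (completed_.empty()) return false;
    *wr_id = completed_.front();
    completed_.pop_front();
    return true;
  }

 private:
  std::mutex mu_;
  std::deque<uint64_t> completed_;
};

// Hypothetical client: callers hand over a callback and return at once;
// a background thread polls for completions and runs the callbacks.
class AsyncReader {
 public:
  using Callback = std::function<void()>;

  AsyncReader() : poller_([this] { PollLoop(); }) {}

  ~AsyncReader() {
    stop_.store(true);
    poller_.join();  // the poller drains remaining completions first
  }

  // Registers the callback, posts the work request, and returns.
  void ReadAsync(Callback done) {
    uint64_t id;
    {
      std::lock_guard<std::mutex> lock(mu_);
      id = next_id_++;
      pending_.emplace(id, std::move(done));
    }
    nic_.PostSend(id);
  }

 private:
  // Busy-polls the completion queue instead of sleeping in epoll.
  void PollLoop() {
    while (true) {
      uint64_t id;
      if (nic_.PollCq(&id)) {
        Callback cb;
        {
          std::lock_guard<std::mutex> lock(mu_);
          auto it = pending_.find(id);
          assert(it != pending_.end());
          cb = std::move(it->second);
          pending_.erase(it);
        }
        cb();  // run outside the lock so callbacks may post more reads
      } else if (stop_.load()) {
        return;
      } else {
        std::this_thread::yield();
      }
    }
  }

  FakeNic nic_;
  std::mutex mu_;
  std::unordered_map<uint64_t, Callback> pending_;
  uint64_t next_id_ = 0;
  std::atomic<bool> stop_{false};
  std::thread poller_;  // declared last so it starts after the other members
};

// Posts eight reads and returns how many callbacks ran.
int RunDemo() {
  std::atomic<int> done{0};
  {
    AsyncReader reader;
    for (int i = 0; i < 8; ++i) reader.ReadAsync([&done] { ++done; });
  }  // destructor joins the poller after it has drained the queue
  return done.load();
}
```

Calling RunDemo() from main exercises the sketch. The polling loop never
sleeps while work is outstanding, which is the CPU cost that the change
above accepts in exchange for lower latency.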

The maximum number of work requests (WRs) in the send/recv queues on
the NIC is increased to accommodate the larger number of concurrent
RDMA ops. The tensor size threshold below which the tensor content is
passed in metadata is also increased, to reduce the pressure on the
send/recv queues on the NIC.
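
The size-based choice between the two paths might look like the sketch
below. The names (Transport, ChooseTransport, kInlineThresholdBytes) are
hypothetical, and the 4096-byte value is an assumption for illustration
only; this message does not state the actual threshold.

```cpp
#include <cstddef>

// Hypothetical labels for the two ways a tensor can reach the receiver.
enum class Transport { kInlineMetadata, kRdmaRead };

// Assumed value for illustration; not the threshold used in this commit.
constexpr std::size_t kInlineThresholdBytes = 4096;

// Tensors strictly below the threshold travel inside the metadata and
// consume no work request; larger ones need an RDMA op.
constexpr Transport ChooseTransport(
    std::size_t tensor_bytes,
    std::size_t threshold = kInlineThresholdBytes) {
  return tensor_bytes < threshold ? Transport::kInlineMetadata
                                  : Transport::kRdmaRead;
}
```

Raising the threshold moves more small tensors onto the metadata path,
so fewer work requests compete for slots in the send/recv queues.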

This fixes #23933.

Signed-off-by: Bairen Yi <byronyi@clustar.ai>
parent b5ca1e4a