Commit 8ba8051d authored by eladweiss, committed by drpngx

Verbs w 0 copies (#16005)



* Add RDMA_LOG macros.

Will be used to quickly switch between log levels when debugging the protocol.
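A minimal sketch of what such a level-switching log macro could look like. It is
illustrative only: RdmaLogThreshold() and RdmaLogSink() are hypothetical helpers,
and the macro body here is an assumption, not the definition added by this patch.

```cpp
#include <iostream>
#include <sstream>

// Hypothetical run-time threshold: messages with level <= threshold are emitted.
inline int& RdmaLogThreshold() {
  static int threshold = 0;
  return threshold;
}

// Hypothetical output sink, defaulting to stderr.
inline std::ostream*& RdmaLogSink() {
  static std::ostream* sink = &std::cerr;
  return sink;
}

// Assumed macro shape: the streamed expression is skipped entirely when the
// message level is above the current threshold.
#define RDMA_LOG(level)                  \
  if ((level) > RdmaLogThreshold()) {    \
  } else                                 \
    (*RdmaLogSink()) << "[rdma:" << (level) << "] "
```

With this shape, raising or lowering RdmaLogThreshold() is enough to switch how
much protocol detail is printed, e.g. RDMA_LOG(2) << "sending meta-data";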

Signed-off-by: Elad Weiss <eladw@ezchip.com>

* Verbs with 0 copies - Phase 1

Changing the verbs implementation to use the 0 copies approach.
For full details and design see 'patch_notes_verbs_with_0_copies.md'

Signed-off-by: Elad Weiss <eladw@ezchip.com>

* Verbs with 0 copies - Phase 2 - Remove RdmaAckBuffer

Remove the RdmaAckBuffer completely, as it is no longer required.
An Ack is now an empty RDMA write with immediate value 0x80000000.

Signed-off-by: Elad Weiss <eladw@ezchip.com>

* Verbs with 0 copies - Phase 2 - Remove RDMA_MESSAGE_BUFFER_IDLE

Remove the RDMA_MESSAGE_BUFFER_IDLE message completely.
It is no longer required, since we no longer send the Tensor to a shared buffer.

Signed-off-by: Elad Weiss <eladw@ezchip.com>

* Verbs with 0 copies - Phase 2 - Remove RDMA_MESSAGE_ACK/RDMA_MESSAGE_TENSOR_WRITE

The messages are no longer required. Use the immediate value instead.
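A rough sketch of how a receiver might tell an Ack from a tensor write using only
the immediate value. The Ack value 0x80000000 is the one stated earlier in this
message. Everything else is an assumption for illustration: the names are
hypothetical, and treating any other immediate as the index of the request being
answered is a guess at the scheme, not a description of the patch.

```cpp
#include <cstdint>

// Ack immediate value as stated in this commit message (name is hypothetical).
constexpr std::uint32_t kRdmaImmAck = 0x80000000u;

enum class RdmaWriteKind { kAck, kTensorWrite };

struct RdmaImmInfo {
  RdmaWriteKind kind;
  std::uint32_t request_index;  // Meaningful only for kTensorWrite.
};

// Assumption: every immediate other than the Ack value identifies the request
// whose tensor content has just been written.
inline RdmaImmInfo DecodeImmediate(std::uint32_t imm) {
  if (imm == kRdmaImmAck) return {RdmaWriteKind::kAck, 0};
  return {RdmaWriteKind::kTensorWrite, imm};
}
```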

Signed-off-by: Elad Weiss <eladw@ezchip.com>

* Verbs with 0 copies - Phase 2 - Rename RDMA_MESSAGE_BUFFER_REQUEST/RESPONSE.

RDMA_MESSAGE_BUFFER_REQUEST ==> RDMA_MESSAGE_META_DATA_UPDATE.
RDMA_MESSAGE_BUFFER_RESPONSE ==> RDMA_MESSAGE_TENSOR_RE_REQUEST.

Signed-off-by: Elad Weiss <eladw@ezchip.com>

* Verbs with 0 copies - Phase 2 - Add data validation.

Data validation can be enabled by compiling with -DRDMA_DATA_VALIDATION.
The validation is done as follows:
1. Calculate checksum of the source Tensor on the sender side.
2. Send the checksum value in the META_DATA_RESPONSE message. The message will
   be sent for every request.
3. The receiver side receives the message and saves the checksum value.
4. When the Tensor content arrives on the receiver side, the receiver calculates
   its checksum right before invoking done(). If the value differs from the
   stored checksum value, the validation fails.
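The four steps above can be sketched roughly as follows. This is not the code
from the patch: the checksum function (a plain 64-bit byte sum), the
ChecksumValidator class, and keying stored checksums by a request index are all
assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Illustrative checksum only; the algorithm used by the patch is not stated here.
inline std::uint64_t Checksum(const void* data, std::size_t size) {
  const unsigned char* bytes = static_cast<const unsigned char*>(data);
  std::uint64_t sum = 0;
  for (std::size_t i = 0; i < size; ++i) sum += bytes[i];
  return sum;
}

// Hypothetical receiver-side helper.
class ChecksumValidator {
 public:
  // Step 3: save the checksum that arrived with the meta-data message.
  void OnMetaData(std::uint32_t request_index, std::uint64_t checksum) {
    expected_[request_index] = checksum;
  }

  // Step 4: call right before done(); returns false on a mismatch or when no
  // checksum was stored for this request.
  bool ValidateBeforeDone(std::uint32_t request_index, const void* data,
                          std::size_t size) {
    auto it = expected_.find(request_index);
    if (it == expected_.end()) return false;
    const bool ok = (it->second == Checksum(data, size));
    expected_.erase(it);
    return ok;
  }

 private:
  std::unordered_map<std::uint32_t, std::uint64_t> expected_;
};
```

On the sender side (steps 1 and 2), the same Checksum() would be computed over the
source Tensor bytes and carried in the meta-data message.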

Signed-off-by: Elad Weiss <eladw@ezchip.com>

* Verbs with 0 copies - Phase 2 - Some code cleanup.

1. Remove some unused code and old comments.
2. Remove some parameters from PostCopyOperations.

Signed-off-by: Elad Weiss <eladw@ezchip.com>

* Verbs with 0 copies - Phase 2 - Update README.md with the new design.

Signed-off-by: Elad Weiss <eladw@ezchip.com>

* Verbs with 0 copies - Phase 2 - Encapsulate sender logic under RdmaTensorResponse.

- Move all the meta-data and content sending logic to RdmaTensorResponse methods.
- Remove RdmaTensorBuffer.
- Remove TensorBuffer base class and buffer types.
- Remove ReItem. Delayed tensor is now saved inside the response object.

Signed-off-by: Elad Weiss <eladw@ezchip.com>

* Fix a synchronization issue when allocating a GPU result tensor.

Signed-off-by: Elad Weiss <eladw@ezchip.com>

* Move verbs_util.h inclusion (for debug purposes).

Signed-off-by: Elad Weiss <eladw@ezchip.com>

* [Verbs] - Solve a race condition issue when attempting to setup channels.

The problem started after merging with the latest master. About 50% of the time, the run
would fail when executing the gRPC GetRemoteAddress() call and return an "OS Error" message.
This looks like a race condition between the stations. For now, added a while loop with N retries.
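A bounded retry loop of the kind described above might look like the sketch below.
RetryN is a hypothetical helper written for illustration; the actual patch calls
GetRemoteAddress() inside its own loop, and the retry count and delay are not
given in this message.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Calls `attempt` up to `max_retries` times, sleeping `delay` between failed
// attempts. Returns true as soon as one attempt succeeds, false otherwise.
inline bool RetryN(const std::function<bool()>& attempt, int max_retries,
                   std::chrono::microseconds delay) {
  for (int i = 0; i < max_retries; ++i) {
    if (attempt()) return true;
    // No point in sleeping after the final failed attempt.
    if (i + 1 < max_retries) std::this_thread::sleep_for(delay);
  }
  return false;
}
```

A caller would wrap the channel setup call in the lambda and surface an error once
RetryN returns false.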

Signed-off-by: Elad Weiss <eladw@ezchip.com>

* [Verbs] - PR review comment - Use SchedClosure() instead of WorkerEnv::compute_pool

Signed-off-by: Elad Weiss <eladw@ezchip.com>

* [Verbs] - PR review comment - Define and use RDMA_MAX_REQUEST_ID.

It was also requested internally to increase the number from 2G to 4G - 2.

Signed-off-by: Elad Weiss <eladw@ezchip.com>

* [Verbs] - PR review comment - Remove old/unused code & comments.

Signed-off-by: Elad Weiss <eladw@ezchip.com>

* [Verbs] - PR review comment - Change usleep() to Env::SleepForMicroseconds().

Signed-off-by: Elad Weiss <eladw@ezchip.com>

* [Verbs] - PR review comment - Propagate error statuses to the higher level.

Signed-off-by: Elad Weiss <eladw@ezchip.com>

* [Verbs] - Nicify connection messages.

Signed-off-by: Elad Weiss <eladw@ezchip.com>

* [Verbs] - Dispose of SchedClosure.

Using SchedClosure causes a real performance degradation (10-15% on inception3
and resnet152). Instead we will use synchronous calls for now, since ops are
non-blocking anyway.

Signed-off-by: Elad Weiss <eladw@ezchip.com>

* [Verbs] - Enable sending content directly from source GPU tensor.

This is a 0 copies requirement. It was implemented in the original prototype,
however committing it was delayed because:
1. It doesn't really affect performance very much.
2. It requires the StreamGPUOp() function, which in the prototype was
implemented under GPUUtil, but in the mainstream should be kept under contrib
code. I had a lot of technical difficulties including "gpu_context.h" in my
code with the current Bazel configuration, so eventually I re-implemented
it as an empty GPU-to-CPU copy. It is actually quite elegant, fully reusing
existing code.

Signed-off-by: Elad Weiss <eladw@ezchip.com>

* [Verbs] - Replace the blocking Sync() call after GPU tensor allocation.

Instead, queue the next operation on the GPU stream.

Signed-off-by: Elad Weiss <eladw@ezchip.com>
parent 7548989a