Verbs w 0 copies (#16005)
* Add RDMA_LOG macros.

Will be used to quickly switch between log levels when debugging the protocol.

Signed-off-by: Elad Weiss <eladw@ezchip.com>

* Verbs with 0 copies - Phase 1

Changing the verbs implementation to use the 0 copies approach.
For full details and design see 'patch_notes_verbs_with_0_copies.md'

Signed-off-by: Elad Weiss <eladw@ezchip.com>

* Verbs with 0 copies - Phase 2 - Remove RdmaAckBuffer

Remove the RdmaAckBuffer completely, as it is no longer required.
An Ack is now an empty RDMA write with immediate value 0x80000000.

Signed-off-by: Elad Weiss <eladw@ezchip.com>

* Verbs with 0 copies - Phase 2 - Remove RDMA_MESSAGE_BUFFER_IDLE

Remove the RDMA_MESSAGE_BUFFER_IDLE message completely.
It is no longer required, since we no longer send the Tensor to a shared buffer.

Signed-off-by: Elad Weiss <eladw@ezchip.com>

* Verbs with 0 copies - Phase 2 - Remove RDMA_MESSAGE_ACK/RDMA_MESSAGE_TENSOR_WRITE

The messages are no longer required. Use the immediate value instead.

Signed-off-by: Elad Weiss <eladw@ezchip.com>

* Verbs with 0 copies - Phase 2 - Rename RDMA_MESSAGE_BUFFER_REQUEST/RESPONSE.

RDMA_MESSAGE_BUFFER_REQUEST ==> RDMA_MESSAGE_META_DATA_UPDATE.
RDMA_MESSAGE_BUFFER_RESPONSE ==> RDMA_MESSAGE_TENSOR_RE_REQUEST.

Signed-off-by: Elad Weiss <eladw@ezchip.com>

* Verbs with 0 copies - Phase 2 - Add data validation.

Data validation can be enabled by compiling with -DRDMA_DATA_VALIDATION.
The validation is done as follows:
1. Calculate the checksum of the source Tensor on the sender side.
2. Send the checksum value in the META_DATA_RESPONSE message. The message
   will be sent for every request.
3. The receiver side receives the message and saves the checksum value.
4. When the Tensor content arrives on the receiver side, the receiver
   calculates its checksum right before invoking done(). If the value
   differs from the stored checksum value, the validation has failed.

Signed-off-by: Elad Weiss <eladw@ezchip.com>

* Verbs with 0 copies - Phase 2 - Some code cleanup.

1. Remove some unused code and old comments.
2. Remove some parameters from PostCopyOperations.

Signed-off-by: Elad Weiss <eladw@ezchip.com>

* Verbs with 0 copies - Phase 2 - Update README.md with the new design.

Signed-off-by: Elad Weiss <eladw@ezchip.com>

* Verbs with 0 copies - Phase 2 - Encapsulate sender logic under RdmaTensorResponse.

- Move all the meta-data and content sending logic to RdmaTensorResponse methods.
- Remove RdmaTensorBuffer.
- Remove TensorBuffer base class and buffer types.
- Remove ReItem. The delayed tensor is now saved inside the response object.

Signed-off-by: Elad Weiss <eladw@ezchip.com>

* Fix a synchronization issue when allocating a GPU result tensor.

Signed-off-by: Elad Weiss <eladw@ezchip.com>

* Move verbs_util.h inclusion (for debug purposes).

Signed-off-by: Elad Weiss <eladw@ezchip.com>

* [Verbs] - Solve a race condition issue when attempting to set up channels.

The problem started when merging to latest master. The run would fail about
50% of the time when trying to execute Grpc GetRemoteAddress(), and return an
"OS Error" message. It seems to be a race condition between the stations.
For now, added a while loop with N retries.

Signed-off-by: Elad Weiss <eladw@ezchip.com>

* [Verbs] - PR review comment - Use SchedClosure() instead of WorkerEnv::compute_pool

Signed-off-by: Elad Weiss <eladw@ezchip.com>

* [Verbs] - PR review comment - Define and use RDMA_MAX_REQUEST_ID.

Also requested internally to increase the number from 2G to 4G - 2.

Signed-off-by: Elad Weiss <eladw@ezchip.com>

* [Verbs] - PR review comment - Remove old/unused code & comments.

Signed-off-by: Elad Weiss <eladw@ezchip.com>

* [Verbs] - PR review comment - Change usleep() to Env::SleepForMicroseconds().

Signed-off-by: Elad Weiss <eladw@ezchip.com>

* [Verbs] - PR review comment - Propagate error statuses to the higher level.

Signed-off-by: Elad Weiss <eladw@ezchip.com>

* [Verbs] - Nicify connection messages.

Signed-off-by: Elad Weiss <eladw@ezchip.com>

* [Verbs] - Dispose of SchedClosure.

Using SchedClosure causes a real performance degradation (10-15% on inception3
and resnet152). Instead we will use synchronous calls for now, since ops are
non-blocking anyway.

Signed-off-by: Elad Weiss <eladw@ezchip.com>

* [Verbs] - Enable sending content directly from source GPU tensor.

This is a 0 copies requirement. It was implemented in the original prototype,
but committing it was delayed because:
1. It doesn't really affect performance very much.
2. It requires the StreamGPUOp() function, which in the prototype was
   implemented under GPUUtil, but in the mainstream should be kept under
   contrib code. I had a lot of technical difficulties including
   "gpu_context.h" in my code with the current Bazel configuration, so
   eventually I re-implemented it as an empty GPU-to-CPU copy. It is
   actually quite elegant, fully reusing existing code.

Signed-off-by: Elad Weiss <eladw@ezchip.com>

* [Verbs] - Replace the blocking Sync() call after GPU tensor allocation.

Instead, queue the next operation on the GPU stream.

Signed-off-by: Elad Weiss <eladw@ezchip.com>
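
Note on the data validation step: the four-step flow described in the "Add data validation" entry can be sketched roughly as below. This is an illustration only, not the code from this change. The commit does not say which checksum is used, so the byte-wise additive sum is an assumption, and `Checksum`, `MetaDataMessage`, `BuildMetaData` and `PendingRequest` are hypothetical names invented for the sketch. In the real code the logic would presumably be guarded by RDMA_DATA_VALIDATION.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical checksum: a plain 64-bit sum of the tensor bytes.
// The actual checksum used by the verbs code is not stated in the commit.
uint64_t Checksum(const uint8_t* data, size_t size) {
  assert(data != nullptr || size == 0);
  uint64_t sum = 0;
  for (size_t i = 0; i < size; ++i) sum += data[i];
  return sum;
}

// Hypothetical stand-in for the meta-data message that carries the checksum.
struct MetaDataMessage {
  uint32_t request_id;
  uint64_t checksum;
};

// Sender side (steps 1-2): compute the checksum of the source tensor and
// attach it to the meta-data message sent for the request.
MetaDataMessage BuildMetaData(uint32_t request_id,
                              const std::vector<uint8_t>& tensor_bytes) {
  return MetaDataMessage{
      request_id, Checksum(tensor_bytes.data(), tensor_bytes.size())};
}

// Receiver side (steps 3-4): remember the checksum from the message, then
// recompute it over the received bytes right before done() would be invoked.
class PendingRequest {
 public:
  void OnMetaData(const MetaDataMessage& msg) {
    expected_checksum_ = msg.checksum;
  }

  // Returns true when the received content matches the stored checksum.
  bool ValidateBeforeDone(const std::vector<uint8_t>& received) const {
    return Checksum(received.data(), received.size()) == expected_checksum_;
  }

 private:
  uint64_t expected_checksum_ = 0;
};
```

An additive sum is cheap but cannot detect reordered bytes, so it only suits a debug-time sanity check like the one described here.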
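
Note on the channel setup workaround: the "while loop with N retries" mentioned for the GetRemoteAddress() race could look roughly like the sketch below. It is an assumption-laden illustration, not the code from this change. `RetryWithDelay` is a hypothetical helper, the attempt is modelled as a callable returning true on success, and `std::this_thread::sleep_for` stands in for the sleep call (a later entry switches the real code from usleep() to Env::SleepForMicroseconds()).

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper: calls `attempt` once, then up to `max_retries` more
// times while it keeps failing, sleeping for `delay` between attempts.
// Returns true as soon as an attempt succeeds, false if all attempts fail.
bool RetryWithDelay(const std::function<bool()>& attempt, int max_retries,
                    std::chrono::microseconds delay) {
  assert(max_retries >= 0);
  int retries_left = max_retries;
  while (!attempt()) {
    if (retries_left-- == 0) return false;  // Retries exhausted.
    std::this_thread::sleep_for(delay);
  }
  return true;
}
```

A caller would wrap the remote call in the callable and, in line with the later "Propagate error statuses" entry, turn a false result into an error status for the layer above.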