Additional design notes
- When the sender receives a tensor request, the source tensor may or may not be ready yet. The situation is handled through a process of tag matching:
- If the request arrives before the tensor is ready, then a callback is put in a local table, and will be invoked once the tensor arrives.
- If the tensor is ready before the request arrives, than the tensor is put in a local table. When the request arrives, it will invoke the callback immediately. In code it is done by calling RecvLocalAsync(), which receives the tensor's key, step-id, and the callback.
- When the callback is invoked, the relevant tensor is removed from the tag matching table. In the case where we need to send the tensor's meta-data, the RdmaTensorResponse will store a copy of the tensor until the re-request arrives.
- The sending of protocol messages (RDMA_MESSAGE_TENSOR_REQUEST, RDMA_MESSAGE_META_DATA_RESPONSE and RDMA_MESSAGE_TENSOR_RE_REQUEST) is done by the class RdmaMessageBuffer. All messages are sent using RDMA writes from/to fixed messages buffers. This implies that we cannot send on a specific channel more than one message at a time. In order to synchronize the messages, the RdmaMessageBuffer holds the a local and remote buffer statuses which can be either busy or idle. When a write is issued, both statuses will be changed to busy. When the write-complete event is received, the local status is changed to idle. When the write is received on the remote side, the remote side will parse the message, and return an ACK back to the sending side on which the sending side will update the remote status to idle. When both the local and remote statuses are idle, the next message can be sent.
- ACK writes are empty writes (hence they require no buffer) with immediate value 0xFFFFFFFE. Message writes have the immediate value 0xFFFFFFFF. All other writes are tensor-content writes whose immediate value is the request-index.
RDMA components
- enum RdmaImmDataType - Immediate types to distinguish between different RDMA writes on the remote side. Ack writes and control-message writes have a fixed immediate value. The rest of the writes are tensor writes and the immediate value is the relevant request index.
- enum RdmaWriteIDType - Types to distinguish between different RDMA write-complete events: Ack, control message and tensor writes.
- class RdmaWriteID - Context for RDMA write complete events. Holds the RdmaWriteIDType and additional data.
- class RdmaTensorMetaData - Meta-data for a tensor (type, shape, is_dead, proto_size).
- class RdmaMemoryMgr - Manages the meta-data cache, and the registered memory regions.
- class RdmaTensorRequest - Holds and manages information for a single tensor request throughout the entire receive cycle. API:
- Start() - Start the request sequence.
- Allocate the result tensor (and proxy tensor if required).
- Send RDMA_MESSAGE_TENSOR_REQUEST to the remote side.
- RecvTensorMetaData() - Receive meta-data from the remote side.
- Update the local meta-data cache.
- Reallocate the result tensor (and proxy tensor if required).
- Re-send the request to the remote side.
- RecvTensorContent() - Receive tensor content from the remote side (RDMA write was completed).
- Decode proto if required and/or move to GPU if the content was not written to it directly (GPU direct is not available).
- Invoke the done callback.
- class RdmaTensorResponse - Holds and manages information for a single tensor response throughout the entire send cycle. API:
- Start() - Start the response sequence.
- Find the tensor in the local tag-match table.
- Compare the tensor‘s meta-data to the meta-data in the message (taken from the requester’s local cache).
- If meta-data changed:
- Clone the tensor to be sent later.
- Send a meta-data update message and wait for re-request.
- Else:
- Send the tensor's content (using direct RDMA write).
- Resume() - Resume the response sequence after a re-request. Send the tensor's content that was cloned earlier.
- Destroy() - Destroy the response's resources and remove it form the pending list.
- class RdmaAdapter - The base for RDMA communications. It may contain multiple channels and buffers. It is responsible for handling various incoming RDMA messages.
- class RdmaChannel - Responsible for RDMA connection to a particular node. It manages messagee buffers. A channel has a request table which stores all the pending tensor requests.
- class RdmaMessageBuffer - Responsible for sending or receiving messages. It has a fixed size memory to store the data. It has a queue to store the pending jobs. A channel has two message buffers one for tx and one for rx.
- class RdmaMgr - Manages the adapter and channels, including channel creation, channel setup via GRPC service, channel lookup, etc.
- class RdmaRendezvousMgr - Manages multiple rdma rendezvous.
- class RdmaRemoteRendezvous - A derived class of BaseRemoteRendezvous. This class is the back end for “send” and “recv” ops. When the sendrecv_op wants to send or receive a tensor, it calls the rendezvous' “send” and “recv” functions respectively. Rendezvous are identified by “step_id”, a random number, so that tensors for different iterations don't get mixed up.
Message structure:
type | name_size | name | step_id | request_index | remote_addr/checksum | rkey | is_dead | data_type | tensor_shape | tensor_bytes | error_status |
---|
1B | 2B | 512 | 8B | 8B | 8B | 4B | 1B | XB | XB | 8B | Size - 4B, proto - XB |
- RDMA_MESSAGE_TENSOR_REQUEST - (receiver ==> sender) The original tensor request.
- type - The message type.
- name (name_size) - Name of the requested tensor.
- step_id - Step ID.
- request_index - Request index.
- remote_addr/rkey - Address/rkey of the result/proxy tensor. Irrelevant for first-time request.
- is_dead/data_type/tensor_shape/tensor_bytes - The current meta-data as stored in the receiver local cache. The sender will use that information to know if the receiver's cache requires updating.
- RDMA_MESSAGE_META_DATA_RESPONSE - (sender ==> receiver) The meta-data update message in case meta-data had changed (or if it is the first time the tensor is requested).
- type - The message type.
- request_index - Request index.
- is_dead/data_type/tensor_shape/tensor_bytes - The up-to-date meta-data.
- checksum - In data validation mode, this will hold the checksum of the source tensor.
- RDMA_MESSAGE_TENSOR_RE_REQUEST - (receiver ==> sender) Tensor re-request after meta-data update and reallocation of result/proxy tensors.
- type - The message type.
- name (name_size) - Name of the requested tensor.
- step_id - Step ID.
- request_index - Request index.
- remote_addr/rkey - Address/rkey of the reallocated result/proxy tensor.
- RDMA_MESSAGE_ERROR_STATUS - (sender ==> receiver) Notify the receiver that an error had occurred on the sender side, so it can propagate it to the upper levels.
- type - The message type.
- name (name_size) - Name of the requested tensor.
- step_id - Step ID.
- request_index - Request index.
- error_status - The error status (code, message, details).