Fix use-after-free race condition in RpcRendezvousMgr and GdrRendezvousMgr.

To summarize the previous buggy behavior:

0. During a RunGraph request, the `RpcRemoteRendezvous` borrows a pointer to the current `WorkerSession`.
1. The `RpcRemoteRendezvous::RecvFromRemoteAsync(..., DoneCallback)` method is invoked to receive a tensor from a remote worker, as part of the same RunGraph request.
2. The method completes and calls the DoneCallback in thread T1.
3. The DoneCallback causes graph execution to complete, which sends a RunGraph response back to the master, and a RunStep response back to the client.
4. Thread T1 suspends before returning from the DoneCallback.
5. The client closes the session, which deletes the `WorkerSession` and leaves a dangling pointer in the `RpcRemoteRendezvous`.
6. Thread T1 resumes and executes code that uses the dangling `WorkerSession` pointer, leading to undefined behavior.

The change ensures that the `RpcRemoteRendezvous` does not use any borrowed state after the `DoneCallback` is invoked. It also makes a similar fix in GdrRendezvousMgr.

PiperOrigin-RevId: 234726984
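
The rule described above (no use of borrowed state once the `DoneCallback` has been invoked) can be illustrated with a minimal, single-threaded sketch. This is not the code from this change: `Session`, `FinishRecv`, and `SimulateSessionClose` are hypothetical stand-ins, and the callback destroys the session directly to mimic the session being closed while T1 is still inside the callback.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <string>

// Hypothetical stand-in for the borrowed state; not the real WorkerSession.
struct Session {
  std::string worker_name;
};

using DoneCallback = std::function<void()>;

// Hypothetical helper that finishes a receive. Everything needed from the
// borrowed session is copied before `done` runs, because `done` may allow
// the owner to destroy the session.
std::string FinishRecv(const Session* session, DoneCallback done) {
  assert(session != nullptr);
  std::string worker_name = session->worker_name;  // last read of borrowed state
  session = nullptr;  // the borrowed pointer must not be used past this point
  done();
  return "recv finished on " + worker_name;  // uses only the local copy
}

// Usage sketch: the callback deletes the session, as in step 5 above.
std::string SimulateSessionClose() {
  auto session = std::make_unique<Session>(Session{"worker0"});
  const Session* borrowed = session.get();
  return FinishRecv(borrowed, [&session] { session.reset(); });
}
```

Reading `session->worker_name` after `done()` in this sketch would be the use-after-free described in steps 4 to 6; copying the value first keeps the code after the callback independent of the session's lifetime.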