Commit 95147fbb authored by Derek Murray, committed by TensorFlower Gardener

Fix use-after-free race condition in RpcRendezvousMgr and GdrRendezvousMgr.

To summarize the previous buggy behavior:
0. During a RunGraph request, the `RpcRemoteRendezvous` borrows a pointer to the current `WorkerSession`.
1. The `RpcRemoteRendezvous::RecvFromRemoteAsync(..., DoneCallback)` method is invoked to receive a tensor from a remote worker, as part of the same RunGraph request.
2. The method completes and calls the `DoneCallback` in thread T1.
3. The `DoneCallback` causes graph execution to complete, which sends a RunGraph response back to the master, and a RunStep response back to the client.
4. Thread T1 suspends before returning from the `DoneCallback`.
5. The client closes the session, which deletes the `WorkerSession` and leaves a dangling pointer in the `RpcRemoteRendezvous`.
6. Thread T1 resumes and executes code that uses the dangling `WorkerSession` pointer, leading to undefined behavior.

This change ensures that the `RpcRemoteRendezvous` does not use any borrowed state after the `DoneCallback` is invoked. It also makes the equivalent fix in `GdrRendezvousMgr`.
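
The ordering rule can be illustrated with a minimal, single-threaded sketch. It is not the code from this change: `Session`, `Rendezvous`, `RecvAsync`, and `Demo` are hypothetical stand-ins for `WorkerSession` and `RpcRemoteRendezvous`, and the callback frees the session directly to simulate the client closing it while T1 is still inside the callback. The point it shows is that everything needed from the borrowed pointer is copied out before the callback runs, and nothing borrowed is touched afterwards.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <string>

// Hypothetical stand-in for WorkerSession (not the TensorFlow class).
struct Session {
  std::string worker_name;
};

// Hypothetical stand-in for a rendezvous that borrows a Session pointer
// it does not own.
class Rendezvous {
 public:
  explicit Rendezvous(const Session* session) : session_(session) {}

  // Safe ordering: copy what is needed from the borrowed session first,
  // then invoke the callback. After done() returns, session_ may dangle,
  // so only local state is used from that point on.
  std::string RecvAsync(const std::function<void()>& done) {
    std::string name = session_->worker_name;  // read borrowed state early
    done();                                    // may destroy the session
    return name;                               // local copy only
  }

 private:
  const Session* session_;  // borrowed, not owned
};

// Simulates steps 2 to 6 above on one thread: the callback destroys the
// session before RecvAsync has returned.
std::string Demo() {
  auto session = std::make_unique<Session>(Session{"worker0"});
  Rendezvous rendezvous(session.get());
  std::string name = rendezvous.RecvAsync([&session] { session.reset(); });
  assert(session == nullptr);  // the session really was freed in the callback
  return name;
}
```

Had `RecvAsync` read `session_->worker_name` after calling `done()`, the same sequence would dereference freed memory, which is the undefined behavior described in step 6.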

PiperOrigin-RevId: 234726984
parent 8714fa2c