Bugfix: Never use env_->device_mgr
The base_rendezvous_mgr handles transferring a tensor using DMAs to non-"host" devices such as GPUs in the SameWorkerRecvDone function. This function would use the worker_env's device_mgr to obtain a pointer to the relevant Device (using the LookupDevice call). In the ClusterSpec-propagation world, using the environment's device_set means that the devices are not renamed, which often results in devices not being found.

This change updates BaseRemoteRendezvous to use the WorkerSession stored when the BaseRemoteRendezvous is initialized. The WorkerSession has a pointer to a DeviceMgr that contains the appropriately renamed devices for the session the Rendezvous is associated with.

Note: because we have a fast-path host-device-only copy, the original bug does not show up when using two CPU devices. I have added a test to ensure that transferring between two CPU devices works in a ClusterSpec propagation session, but this test does not actually reproduce the motivating bug.

In the process of writing a test for the original bug, I discovered another latent bug in ClusterSpec propagation: if there were two CPU devices (e.g. due to explicit server configuration to have two CPU devices), a DCHECK could be triggered. Master::CreateSession would call `device_set->set_client_device` multiple times (once for each CPU device).

PiperOrigin-RevId: 162680163
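The lookup failure described above can be illustrated with a toy model. This is a minimal sketch, not the TensorFlow code: the `Device`, `DeviceMgr`, `WorkerEnv`, and `WorkerSession` types below are simplified stand-ins that only borrow the names, the helper functions are hypothetical, and the device name strings are made up for illustration.

```cpp
#include <cassert>
#include <map>
#include <string>

// Toy stand-ins. These are NOT the TensorFlow classes of the same names.
struct Device {
  std::string name;
};

class DeviceMgr {
 public:
  void Add(const std::string& name) {
    assert(!name.empty());
    devices_[name] = Device{name};
  }
  // Returns nullptr when no device is registered under this name.
  const Device* LookupDevice(const std::string& name) const {
    auto it = devices_.find(name);
    return it == devices_.end() ? nullptr : &it->second;
  }

 private:
  std::map<std::string, Device> devices_;
};

// Process-wide environment: devices keep their original names.
struct WorkerEnv {
  DeviceMgr device_mgr;
};

// Per-session state: devices are registered under session-specific names.
struct WorkerSession {
  DeviceMgr device_mgr;
};

// Hypothetical helpers that build the two views of the same physical GPU.
WorkerEnv MakeEnv() {
  WorkerEnv env;
  env.device_mgr.Add("/job:localhost/replica:0/task:0/device:GPU:0");
  return env;
}

WorkerSession MakeSession() {
  WorkerSession session;
  session.device_mgr.Add("/job:worker/replica:0/task:1/device:GPU:0");
  return session;
}

// Old behavior in this sketch: resolve through the environment.
const Device* LookupViaEnv(const WorkerEnv& env, const std::string& name) {
  return env.device_mgr.LookupDevice(name);
}

// New behavior in this sketch: resolve through the session.
const Device* LookupViaSession(const WorkerSession& session,
                               const std::string& name) {
  return session.device_mgr.LookupDevice(name);
}
```

In this model, a rendezvous that receives the renamed device name gets `nullptr` from `LookupViaEnv` but a valid pointer from `LookupViaSession`, which mirrors why the lookup is moved to the session's DeviceMgr.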