Commit 47691396, authored by Brennan Saeta, committed by Amit Patankar

Bugfix: Never use env_->device_mgr

The base_rendezvous_mgr handles transferring a tensor using DMAs to non-"host"
devices such as GPUs in the SameWorkerRecvDone function. This function used
the worker_env's device_mgr to obtain a pointer to the relevant Device (via
the LookupDevice call). In the ClusterSpec-propagation world, the
environment's device_mgr holds devices that have not been renamed for the
session, so the lookup often fails to find the device.
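
The name mismatch can be illustrated with a minimal, self-contained sketch.
Everything here is hypothetical: ToyDevice, ToyDeviceMgr, the two Make*
helpers, and the device names are stand-ins invented for illustration, not
the real TensorFlow classes or the names used in any actual cluster.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for a device; only the full name matters here.
struct ToyDevice {
  std::string name;
};

// Hypothetical stand-in for a device manager keyed by full device name.
class ToyDeviceMgr {
 public:
  void Add(const std::string& name) {
    assert(!name.empty());
    devices_[name] = ToyDevice{name};
  }

  // Returns nullptr when no device is registered under `name`.
  const ToyDevice* LookupDevice(const std::string& name) const {
    auto it = devices_.find(name);
    return it == devices_.end() ? nullptr : &it->second;
  }

 private:
  std::map<std::string, ToyDevice> devices_;
};

// Environment-wide manager: the device keeps the name it was created with.
inline ToyDeviceMgr MakeEnvDeviceMgr() {
  ToyDeviceMgr mgr;
  mgr.Add("/job:localhost/replica:0/task:0/device:GPU:0");
  return mgr;
}

// Per-session manager: the same device under its (assumed) renamed form.
inline ToyDeviceMgr MakeSessionDeviceMgr() {
  ToyDeviceMgr mgr;
  mgr.Add("/job:worker/replica:0/task:1/device:GPU:0");
  return mgr;
}
```

Looking up the renamed name in the environment-wide manager returns nullptr,
while the per-session manager resolves it.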

This change updates BaseRemoteRendezvous to use the WorkerSession stored when
the BaseRemoteRendezvous is initialized. The WorkerSession has a pointer to a
DeviceMgr that contains the appropriately renamed devices for the given
session the Rendezvous is associated with.
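
The shape of this change can be sketched roughly as follows. This is not the
actual BaseRemoteRendezvous code: ToyDeviceMgr, ToyWorkerSession,
ToyRendezvous, and CanResolve are hypothetical names, and the sketch only
shows a rendezvous that keeps the session it was initialized with and
resolves device names through that session's device manager.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>

// Hypothetical device manager: just the set of device names it knows.
struct ToyDeviceMgr {
  std::unordered_set<std::string> names;

  bool Has(const std::string& name) const { return names.count(name) > 0; }
};

// Hypothetical session that owns the renamed devices for one session.
struct ToyWorkerSession {
  ToyDeviceMgr device_mgr;
};

// Hypothetical rendezvous that remembers its session at initialization.
class ToyRendezvous {
 public:
  void Initialize(const ToyWorkerSession* session) {
    assert(session != nullptr);
    session_ = session;
  }

  // Resolves names against the stored session's device manager only.
  // Returns false if the rendezvous has not been initialized yet.
  bool CanResolve(const std::string& device_name) const {
    if (session_ == nullptr) return false;
    return session_->device_mgr.Has(device_name);
  }

 private:
  const ToyWorkerSession* session_ = nullptr;
};
```

Because the lookup goes through the stored session, a name that exists only
in its un-renamed form is reported as unresolved rather than silently matched.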

Note: because we have a fast-path host-device-only copy, the original bug does
not show up when using 2 CPU devices. I have added a test to ensure that
transferring between 2 CPU devices works in a ClusterSpec propagation session,
but note that this test does not actually reproduce the motivating bug.

While writing a test for the original bug, I discovered another latent bug in
ClusterSpec propagation: with 2 CPU devices (e.g. due to explicit server
configuration to have 2 CPU devices), a DCHECK could be triggered because
Master::CreateSession would call `device_set->set_client_device` multiple
times (once for each CPU device).
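
One possible guard against the repeated call is sketched below. This message
does not describe how the call site was changed, so treat this as an
assumption rather than the actual fix. ToyDevice, ToyDeviceSet, and
PickClientDevice are hypothetical names, and the assert stands in for the
DCHECK.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical device with a name and a type such as "CPU" or "GPU".
struct ToyDevice {
  std::string name;
  std::string type;
};

// Hypothetical device set whose client device may be set only once.
class ToyDeviceSet {
 public:
  void set_client_device(const ToyDevice* device) {
    assert(client_device_ == nullptr);  // stands in for the DCHECK
    client_device_ = device;
  }

  const ToyDevice* client_device() const { return client_device_; }

 private:
  const ToyDevice* client_device_ = nullptr;
};

// Uses the first CPU device as the client device and skips later CPUs,
// so set_client_device is called at most once.
inline void PickClientDevice(const std::vector<ToyDevice>& devices,
                             ToyDeviceSet* device_set) {
  for (const ToyDevice& d : devices) {
    if (d.type == "CPU" && device_set->client_device() == nullptr) {
      device_set->set_client_device(&d);
    }
  }
}
```

With two CPU devices in the list, only the first becomes the client device
and the assert is never hit.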

PiperOrigin-RevId: 162680163
parent 82d72520