Commit 8b633af6 authored by Igor Ganichev, committed by TensorFlower Gardener

Track requested and assigned devices separately in Placer.

Before this change, Placer treated assigned and requested devices
almost identically. Placer::Member::device_name was initialized
with either the assigned or the requested device. During the placement
process, `device_name` could be overridden to satisfy the resource
constraint (all resource-touching ops run on the resource's device). Once
all colocation groups were computed and the devices to assign were chosen,
Placer simply skipped the nodes with assigned devices.

This behavior can result in various violations of colocation constraints.
For example, the following would be placed successfully but raise an
error at runtime:
  VarHandleOp (requested on CPU)
      |
      V
    Read (assigned to GPU)
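
The failure mode above can be reproduced with a toy model. This is only a
sketch: `place_old` and `violated_resource_edges` are hypothetical names, the
devices are plain strings, and the intermediate `device_name` overrides are
ignored. It is not Placer's actual code.

```python
# Toy model of the pre-change behavior; names are illustrative, not Placer's API.
def place_old(nodes):
    """nodes maps name -> (requested_device, assigned_device); either may be None."""
    placement = {}
    for name, (requested, assigned) in nodes.items():
        # Nodes with an assigned device were skipped, so the assignment stood as-is.
        placement[name] = assigned if assigned is not None else requested
    return placement


def violated_resource_edges(placement, resource_edges):
    """Return the (src, dst) resource edges whose endpoints landed on different devices."""
    return [(src, dst) for src, dst in resource_edges if placement[src] != placement[dst]]


nodes = {"VarHandleOp": ("CPU", None), "Read": (None, "GPU")}
placement = place_old(nodes)
violations = violated_resource_edges(placement, [("VarHandleOp", "Read")])
```

Placement completes without an error, yet `violations` contains the
VarHandleOp -> Read edge, which is the inconsistency that only surfaced at
runtime.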

Another issue with the Placer before this change is the logic in the
VerifyResourceAndRefInputsCanBeColocated method. Given a resource edge
from a src node to a dst node that connected incompatible colocation groups,
Placer would normally override the destination's device_name. Before
overriding, Placer would check whether the device_name of every other input
was compatible with the new value.

The semantics of this check are fairly arbitrary. For example, the following
would be placed successfully by overriding the requested device of Add:

CPU resource    CPU resource
          \     /
           v   v
        Add (requested on GPU)

On the other hand, the following graph would be rejected even though it
can be placed by overriding the requested device of Identity, which is
logically the same operation as above:

               VarHandleOp (unplaced)
                 |
                 v
CPU resource   Identity (requested on GPU)
          \     /
           v   v
            Add

This change treats assigned and requested devices separately. Overriding
requested devices to satisfy resource constraints is always permitted.
Overriding assigned devices is permitted only when soft placement is allowed.
Colocation group constraints are always respected (or an error is raised),
even with assigned devices and soft placement.
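
The override rule described above can be sketched as follows. This is a
simplified illustration under stated assumptions: `Member` and
`resolve_device` are hypothetical names, devices are compared as plain
strings rather than as partial device specifications, and colocation groups
are not modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Member:
    """Hypothetical stand-in for a placement member; not the real C++ class."""
    requested_device: Optional[str] = None
    assigned_device: Optional[str] = None


def resolve_device(member: Member, resource_device: str,
                   allow_soft_placement: bool) -> str:
    """Return the device a resource-touching node must use, or raise on conflict."""
    assigned = member.assigned_device
    if assigned is not None and assigned != resource_device:
        # An assigned device may be overridden only when soft placement is allowed.
        if not allow_soft_placement:
            raise ValueError(
                f"assigned device {assigned} conflicts with resource device "
                f"{resource_device}")
        return resource_device
    # A requested device (or no device at all) may always be overridden.
    return resource_device
```

For instance, `resolve_device(Member(requested_device="GPU"), "CPU", False)`
yields the resource's device, while the same call with
`Member(assigned_device="GPU")` raises unless soft placement is enabled.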

Finally, an emergent property of VerifyResourceAndRefInputsCanBeColocated and
the surrounding logic was that requested devices of resource-producing ops
were always respected (and an error was raised if they resulted in a
conflict). This change preserves this behavior but makes it explicit:
requested devices on resource-generating nodes are treated as assigned
devices.
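
One way to picture this last rule, as a sketch only:
`effective_assigned_device` and its `generates_resource` flag are
hypothetical and do not appear in the change itself.

```python
from typing import Optional


def effective_assigned_device(requested: Optional[str],
                              assigned: Optional[str],
                              generates_resource: bool) -> Optional[str]:
    """Return the device that placement must treat as fixed, if any."""
    if assigned is not None:
        return assigned
    if generates_resource:
        # A resource-generating node's requested device is promoted to assigned.
        return requested
    # Any other node's requested device stays overridable.
    return None
```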

PiperOrigin-RevId: 232715063
parent 982c0bd4