Commit d70ecce4 authored by Igor Ganichev, committed by TensorFlower Gardener

Cleanly violate colocation constraints with soft placement on

This change is a merge of 2 changes that were previously reverted and the
following fix. The commit messages of the previous changes are included below.

Before this change, user-specified colocation constraints (i.e. those using the
"_class" attribute in NodeDef) were incorporated before resource-based
constraints. When a conflict arose while merging assigned devices during the
merge of two colocation graphs, we could not handle it even with soft placement
on, because (1) we can't violate assigned devices (the function library runtime
relies on arguments being on the devices it asked for) and (2) we can't violate
resource-edge-based constraints (that would lead to an error at runtime).

This change switches the order of incorporating user-specified colocation
constraints and resource-based constraints. If an error occurs at the resource
stage, we can error out as there is truly nothing we can safely violate and
still place this graph. We know this because no user-specified colocation
constraints were incorporated yet. If an error happens due to user-specified
colocation constraints, we can simply ignore this constraint if soft placement
is on.
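
The ordering described above can be illustrated with a minimal Python sketch.
It is not the Placer implementation: the names ColocationSketch, build_groups
and PlacementError are hypothetical, devices are plain strings rather than
partially specified device names, and groups are tracked with a toy union-find.
It only shows why a stage-1 conflict is fatal while a stage-2 conflict can be
dropped when soft placement is on.

```python
class PlacementError(Exception):
    """Raised when two groups with different devices must be merged."""


class ColocationSketch:
    """Toy union-find over node names; each group carries at most one device."""

    def __init__(self, devices):
        # devices: node name -> device string, or None if unconstrained.
        self.parent = {name: name for name in devices}
        self.device = dict(devices)

    def find(self, name):
        while self.parent[name] != name:
            self.parent[name] = self.parent[self.parent[name]]
            name = self.parent[name]
        return name

    def colocate(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        da, db = self.device[ra], self.device[rb]
        if da is not None and db is not None and da != db:
            # Raise before mutating anything, so a caller may ignore the
            # constraint and keep using the graph.
            raise PlacementError(f"{a} ({da}) conflicts with {b} ({db})")
        self.parent[rb] = ra
        self.device[ra] = da if da is not None else db


def build_groups(devices, resource_edges, user_colocations, allow_soft_placement):
    graph = ColocationSketch(devices)
    # Stage 1: resource-based constraints. No user constraint has been
    # incorporated yet, so a conflict here cannot be worked around.
    for src, dst in resource_edges:
        graph.colocate(src, dst)
    # Stage 2: user-specified constraints. A conflicting constraint is
    # ignored when soft placement is on, and is an error otherwise.
    ignored = []
    for a, b in user_colocations:
        try:
            graph.colocate(a, b)
        except PlacementError:
            if not allow_soft_placement:
                raise
            ignored.append((a, b))
    return graph, ignored
```

In this sketch, calling build_groups with a resource edge between a CPU node
and a GPU node raises regardless of allow_soft_placement, whereas a
user-specified pair that conflicts is returned in the ignored list when
allow_soft_placement is True.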

This change also removes soft merging of assigned devices. This was never intended.

This change also removes some unused code.

---------------------------------------------------
Merged change 1:
Move ColocationGraph and Member out of placer.cc

This is a pure refactoring change. There is no behavior change.
Placer is getting more complex. This split makes the diff of following
CLs cleaner.

---------------------------------------------------
Merged change 2:
Track requested and assigned devices separately in Placer.

Before this change Placer treated assigned and requested devices
almost identically. The Placer::Member::device_name was initialized
either with assigned or requested device. During the placement process,
`device_name` could be overridden to satisfy the resource constraint
(all resource-touching ops are run on the resource's device). When all
colocation groups are computed and devices to assign are chosen, Placer
simply skipped the nodes with assigned devices.

This behavior can result in various violations of colocation constraints.
For example, the following would be placed successfully but raise an
error at runtime:
  VarHandleOp (requested on CPU)
      |
      V
    Read (assigned to GPU)

Another issue with the Placer before this change is the logic in
VerifyResourceAndRefInputsCanBeColocated method. Given a resource edge
from src to dst nodes, that connected incompatible colocation groups,
Placer would normally override the destination's device_name. Before
overriding, Placer would check if device_name of all the other inputs
is compatible with the new value.

The semantics of this check are fairly arbitrary. For example, the following
would be placed successfully by overriding the requested device of Add:

CPU resource    CPU resource
          \     /
           v   v
        Add (requested on GPU)

On the other hand, the following graph would be rejected even though it
can be placed by overriding the requested device of Identity, which is
logically the same operation as above:

               VarHandleOp (unplaced)
                 |
                 v
CPU resource   Identity (requested on GPU)
          \     /
           v   v
            Add

This change treats assigned and requested devices separately. Overriding
requested devices to satisfy resource constraints is always permitted.
Overriding assigned devices is permitted only when soft placement is allowed.
Colocation group constraints are always respected (or an error is raised),
even with assigned devices and soft placement.

Finally, the emergent property of VerifyResourceAndRefInputsCanBeColocated and
surrounding logic was that requested devices of resource producing ops were
always respected (and an error was raised if they resulted in a conflict). This
change preserves this behavior but makes it explicit. Requested devices on
resource-generating nodes are treated as assigned devices.
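
The rules of this merged change can be sketched in Python as follows. This is
an illustration under assumptions, not the Placer code: Member here is a
hypothetical dataclass, produces_resource is a made-up flag standing in for
"resource generating node", and devices are plain strings. Note that the
top-level message of this commit says soft merging of assigned devices was
subsequently removed, so the soft-placement branch reflects only the
description of merged change 2.

```python
from dataclasses import dataclass
from typing import Optional


class PlacementError(Exception):
    """Raised when a resource edge cannot be satisfied."""


@dataclass
class Member:
    requested: Optional[str] = None  # device the user asked for
    assigned: Optional[str] = None   # device fixed before placement ran
    produces_resource: bool = False  # hypothetical flag, e.g. VarHandleOp

    def pinned(self) -> Optional[str]:
        # Requested devices on resource-generating nodes are treated as
        # assigned devices.
        if self.assigned is not None:
            return self.assigned
        return self.requested if self.produces_resource else None


def resolve_resource_edge(src: Member, dst: Member,
                          allow_soft_placement: bool) -> Optional[str]:
    """Return the device dst should use, given a resource edge src -> dst."""
    resource_device = src.assigned or src.requested
    if resource_device is None:
        # Nothing to enforce yet; keep whatever dst already has.
        return dst.assigned or dst.requested
    dst_pinned = dst.pinned()
    if dst_pinned is not None and dst_pinned != resource_device:
        # Overriding an assigned device needs soft placement.
        if not allow_soft_placement:
            raise PlacementError(
                f"dst pinned to {dst_pinned}, resource is on {resource_device}")
    # Overriding a merely requested device is always permitted.
    return resource_device
```

Under these assumptions, the VarHandleOp (requested on CPU) to Read (assigned
to GPU) example is rejected at placement time when soft placement is off,
while Add (requested on GPU) fed by a CPU resource is moved to CPU.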

---------------------------------------------------

PiperOrigin-RevId: 234209688
parent ffd53161