Commit 3a1df26a authored Jul 31, 2018 by A. Unique TensorFlower Committed by TensorFlower Gardener Jul 31, 2018

Pass the device ordinal to use for execution to the XLA compiler for auto-tuning.

Previously, when compiling a graph for multiple devices concurrently, XLA
would use the default device for auto-tuning. With this patch, tf_cnn_benchmark
with the resnet50 model runs to completion on 8 V100s at batch size 128 and
gets a speedup of ~20% over a single V100; the next steps are to get it to run
at batch size 256 and to scale well.

PiperOrigin-RevId: 206720140

parent 78f58629
