Pass the device ordinal to use for execution to the XLA compiler for auto-tuning.

Previously, when compiling a graph for multiple devices concurrently, XLA would use the default device for auto-tuning.

With this patch, tf_cnn_benchmark with the resnet50 model runs to completion on 8 V100s at batch size 128 and gets a speedup of ~20% over a single V100. The next steps are to get it to run at batch size 256 and to scale well.

PiperOrigin-RevId: 206720140