Commit ec81825a authored by Reed Wanderman-Milne, committed by TensorFlower Gardener

Add GPU explicit padding to tf.nn.conv2d.

Benchmark results:

All benchmarks were run on a Z840 with a Titan V, using internal TensorFlow.

1. Resnet50 Eager results
The internal Resnet50 Eager benchmarks were run to check that the extra Python overhead this change adds does not cause regressions in Resnet50 in Eager mode. The benchmarks run are here: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/eager/python/examples/resnet50/resnet50_test.py. Each row was run 150 times and the average was taken. Note that none of these benchmarks use explicit padding. Numbers represent time, so lower is better.

Benchmark Name                                        After   Before  % diff
apply_async_gpu_batch_64_channels_first               0.0726  0.0726  -0.06%
apply_gpu_batch_64_channels_first                     0.0725  0.0725  -0.06%
apply_with_defun_gpu_batch_64_channels_first          0.0755  0.0756   0.07%
train_async_gpu_batch_16_channels_first               0.0776  0.0778   0.27%
train_async_gpu_batch_32_channels_first               0.1268  0.1271   0.23%
train_dataset_gpu_batch_16_channels_first             0.1085  0.1094   0.77%
train_dataset_gpu_batch_32_channels_first             0.1473  0.1477   0.28%
train_dataset_with_defun_gpu_batch_16_channels_first  0.0800  0.0803   0.37%
train_dataset_with_defun_gpu_batch_32_channels_first  0.1325  0.1326   0.09%
train_gpu_batch_16_channels_first                     0.0812  0.0813   0.18%
train_gpu_batch_32_channels_first                     0.1329  0.1325  -0.32%
train_with_defun_gpu_batch_16_channels_first          0.0789  0.0791   0.26%
train_with_defun_gpu_batch_32_channels_first          0.1325  0.1325  -0.02%

There is minimal impact on Eager performance.

2. tf_cnn_benchmarks
tf_cnn_benchmarks was run internally with the following flags:

--batch_size=128 --model=resnet50

It was run 60 times with and without this change. With this change, tf_cnn_benchmarks had all instances of a tf.pad followed by Conv2D replaced with an explicitly padded Conv2D. It got 330.96 images/sec with this change and 330.80 without; the difference is likely noise. Therefore, this change does not improve tf_cnn_benchmarks performance.
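As an illustration of the replacement described above, here is a minimal sketch of how the explicit padding argument could be built. The helper name `explicit_paddings` and its parameters are hypothetical and not part of this change; the sketch only assumes that tf.nn.conv2d accepts explicit padding as a list of [before, after] pairs, one per dimension, with the batch and channel dimensions left unpadded.

```python
def explicit_paddings(pad_top, pad_bottom, pad_left, pad_right,
                      data_format="NHWC"):
    """Return a nested list of [before, after] pairs, one per dimension.

    Hypothetical helper for illustration only. The batch and channel
    dimensions always get [0, 0]; only height and width are padded.
    """
    spatial = [[pad_top, pad_bottom], [pad_left, pad_right]]
    if data_format == "NHWC":
        # Dimension order: batch, height, width, channels.
        return [[0, 0]] + spatial + [[0, 0]]
    if data_format == "NCHW":
        # Dimension order: batch, channels, height, width.
        return [[0, 0], [0, 0]] + spatial
    raise ValueError("Unsupported data_format: %r" % (data_format,))


# Padding of 3 on every spatial side, in both layouts.
nhwc = explicit_paddings(3, 3, 3, 3, data_format="NHWC")
nchw = explicit_paddings(3, 3, 3, 3, data_format="NCHW")
print(nhwc)
print(nchw)
```

Under that assumption, the resulting list would be passed as the padding argument of tf.nn.conv2d, taking the place of a tf.pad with the same amounts followed by a VALID convolution.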

3. Conv2D benchmarks

The benchmarks added to conv_ops_test.py were run with this change, each 400 times, and the average was taken. They were not run without this change. The table groups the 8 benchmarks into 4 pairs, with each pair running two similar benchmarks: one with explicit padding and one without.

Benchmark name                Explicit  Non-explicit  % diff
explicit/manual pad forward   0.001815  0.002006      10.56%
explicit/manual pad backward  0.006261  0.006937      10.79%
eager explicit/same pad       0.039320  0.038403      -2.33%
graph explicit/same pad       0.037039  0.037034      -0.11%

The first two rows show there are theoretical performance gains to using explicit padding over a manual tf.pad followed by the convolution. On Resnet50, we were not able to achieve this performance gain in practice, as tf_cnn_benchmarks saw no improvement. On models that use larger paddings than Resnet50, the performance gain will be less negligible. The last two rows compare explicit padding to the equivalent SAME padding, to see if explicit padding adds any overhead. In Graph mode, there is no overhead, but Eager mode has some overhead with explicit padding over SAME padding.
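To make "the equivalent SAME padding" concrete, the following sketch computes the per-dimension padding amounts that SAME padding implies, using the standard SAME rule (output size is ceil(input / stride), with any odd padding element going at the end). The function name `same_padding_1d` is hypothetical, and the sizes used are illustrative rather than taken from the benchmarks.

```python
def same_padding_1d(input_size, kernel_size, stride=1):
    """Return (pad_before, pad_after) implied by SAME padding along one axis.

    Hypothetical helper for illustration only; dilation is assumed to be 1.
    """
    # SAME padding produces ceil(input_size / stride) outputs.
    output_size = -(-input_size // stride)
    # Total padding needed so the last window fits inside the padded input.
    total = max((output_size - 1) * stride + kernel_size - input_size, 0)
    pad_before = total // 2
    # When total is odd, the extra element is placed at the end.
    pad_after = total - pad_before
    return pad_before, pad_after


# A 3-wide kernel with stride 1 needs one element on each side.
print(same_padding_1d(5, 3, 1))
# An even input with stride 2 gives asymmetric padding.
print(same_padding_1d(224, 7, 2))
```

Explicit padding with these amounts on the height and width dimensions should produce the same output shape as SAME padding, which is the kind of pairing the last two table rows compare.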

PiperOrigin-RevId: 228439591
parent c8191270