Commit 3021eb0b, authored Jan 05, 2018 by Reed Wanderman-Milne; committed by TensorFlower Gardener Jan 05, 2018

Fixed fused batch norm performance regression.

The regression was introduced by 12a4c9b8. I suspect the cause was calling cudaMemset without setting the CUDA stream. Using the SetZeroFunctor (or Eigen) handles this type of initialization for us.

Benchmarks were run with tf_cnn_benchmarks on a Volta DGX1, taking the average of 3 iterations, with arguments: --optimizer=sgd --staged_vars=False --num_gpus=$GPU --variable_update=$VAR_UPDATE --use_fp16=True --batch_size=128 --model=$MODEL

model       gpu  var_update        im/sec after  im/sec before  percent diff
resnet50    1    replicated           680.37333      640.10333      6.29117%
resnet50    8    parameter_server    4046.04000     1282.28667    215.53319%
resnet50    8    replicated          4157.30667     1634.22667    154.38984%
inception3  1    replicated           463.88667      440.94333      5.20324%
inception3  8    parameter_server    2655.55000      902.22333    194.33400%
inception3  8    replicated          3034.81000     1033.43667    193.66192%
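
As a rough check on the table above, the "percent diff" column appears to be (after - before) / before * 100. The short Python sketch below recomputes it from the rounded im/sec figures and also reports the speedup factor. The helper names percent_diff and report are illustrative only and are not part of tf_cnn_benchmarks or this commit.

```python
# Throughput figures (images/sec) copied from the benchmark table:
# (model, gpus, var_update, im/sec after, im/sec before)
ROWS = [
    ("resnet50", 1, "replicated", 680.37333, 640.10333),
    ("resnet50", 8, "parameter_server", 4046.04000, 1282.28667),
    ("resnet50", 8, "replicated", 4157.30667, 1634.22667),
    ("inception3", 1, "replicated", 463.88667, 440.94333),
    ("inception3", 8, "parameter_server", 2655.55000, 902.22333),
    ("inception3", 8, "replicated", 3034.81000, 1033.43667),
]


def percent_diff(after, before):
    """Relative throughput change in percent, measured against 'before'."""
    return (after - before) / before * 100.0


def report(rows):
    """Format one line per benchmark row: percent diff and speedup factor."""
    lines = []
    for model, gpus, var_update, after, before in rows:
        diff = percent_diff(after, before)
        speedup = after / before
        lines.append(
            f"{model:<11}{gpus:<3}{var_update:<18}{diff:10.5f}%  {speedup:.2f}x"
        )
    return lines


if __name__ == "__main__":
    for line in report(ROWS):
        print(line)
```

Read this way, the single-GPU rows improve by a few percent, while the 8-GPU rows run at roughly 2.5x to 3.2x their previous throughput.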

PiperOrigin-RevId: 180980799

parent 63a4f8de
