Commit 3021eb0b authored by Reed Wanderman-Milne, committed by TensorFlower Gardener

Fixed fused batch norm performance regression.

The regression was introduced by 12a4c9b8. I suspect the cause was that it called cudaMemset without specifying the CUDA stream. SetZeroFunctor (or Eigen) handles this type of initialization for us.

Benchmarks were run with tf_cnn_benchmarks on a Volta DGX1. Each figure is the average of 3 iterations, with arguments: --optimizer=sgd --staged_vars=False --num_gpus=$GPU --variable_update=$VAR_UPDATE --use_fp16=True --batch_size=128 --model=$MODEL

model       gpu  var_update        im/sec after  im/sec before  percent diff
resnet50    1    replicated        680.37333     640.10333      6.29117%
resnet50    8    parameter_server  4046.04000    1282.28667     215.53319%
resnet50    8    replicated        4157.30667    1634.22667     154.38984%
inception3  1    replicated        463.88667     440.94333      5.20324%
inception3  8    parameter_server  2655.55000    902.22333      194.33400%
inception3  8    replicated        3034.81000    1033.43667     193.66192%
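
The "percent diff" column appears to be the relative speedup of the "after" throughput over the "before" throughput. A minimal sketch of that arithmetic, assuming the formula (after - before) / before * 100; the function name percent_diff is illustrative and not part of tf_cnn_benchmarks:

```python
def percent_diff(after: float, before: float) -> float:
    """Relative change of `after` over `before`, in percent (assumed formula)."""
    return (after - before) / before * 100.0


# (model, gpus, var_update, im/sec after, im/sec before), copied from the table above.
ROWS = [
    ("resnet50", 1, "replicated", 680.37333, 640.10333),
    ("resnet50", 8, "parameter_server", 4046.04000, 1282.28667),
    ("resnet50", 8, "replicated", 4157.30667, 1634.22667),
    ("inception3", 1, "replicated", 463.88667, 440.94333),
    ("inception3", 8, "parameter_server", 2655.55000, 902.22333),
    ("inception3", 8, "replicated", 3034.81000, 1033.43667),
]

if __name__ == "__main__":
    for model, gpus, var_update, after, before in ROWS:
        diff = percent_diff(after, before)
        print(f"{model:<11} {gpus:<4} {var_update:<17} {diff:.5f}%")
```

Under this reading, a value above 100% means the throughput more than doubled, which is the case for every 8-GPU configuration in the table.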

PiperOrigin-RevId: 180980799
parent 63a4f8de