Fixed fused batch norm performance regression.

The regression was introduced by 12a4c9b8. I suspect the cause was calling cudaMemset without setting the CUDA stream. Using the SetZeroFunctor (or using Eigen) handles this type of initialization for us.

Benchmarks on tf_cnn_benchmarks, on a Volta DGX1, average of 3 iterations taken, with arguments:
--optimizer=sgd --staged_vars=False --num_gpus=$GPU --variable_update=$VAR_UPDATE --use_fp16=True --batch_size=128 --model=$MODEL

model       gpu  var_update        im/sec after  im/sec before  percent diff
resnet50    1    replicated           680.37333      640.10333      6.29117%
resnet50    8    parameter_server    4046.04000     1282.28667    215.53319%
resnet50    8    replicated          4157.30667     1634.22667    154.38984%
inception3  1    replicated           463.88667      440.94333      5.20324%
inception3  8    parameter_server    2655.55000      902.22333    194.33400%
inception3  8    replicated          3034.81000     1033.43667    193.66192%

PiperOrigin-RevId: 180980799
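
For readers checking the numbers, the "percent diff" column in the benchmark table above appears to be the relative change of "im/sec after" over "im/sec before". The sketch below recomputes it from the published values under that assumption. The names `ROWS` and `percent_diff` are illustrative only and are not part of tf_cnn_benchmarks.

```python
# Rows copied from the benchmark table:
# (model, gpus, var_update, im/sec after, im/sec before, reported percent diff)
ROWS = [
    ("resnet50", 1, "replicated", 680.37333, 640.10333, 6.29117),
    ("resnet50", 8, "parameter_server", 4046.04000, 1282.28667, 215.53319),
    ("resnet50", 8, "replicated", 4157.30667, 1634.22667, 154.38984),
    ("inception3", 1, "replicated", 463.88667, 440.94333, 5.20324),
    ("inception3", 8, "parameter_server", 2655.55000, 902.22333, 194.33400),
    ("inception3", 8, "replicated", 3034.81000, 1033.43667, 193.66192),
]


def percent_diff(after, before):
    """Relative throughput change of `after` versus `before`, in percent.

    Assumed formula: (after - before) / before * 100.
    """
    return (after - before) / before * 100.0


# Print the recomputed value next to the reported one for each row.
for model, gpus, var_update, after, before, reported in ROWS:
    computed = percent_diff(after, before)
    print(f"{model:<11}{gpus:>2}  {var_update:<17}"
          f"{computed:10.5f}%  (reported {reported:.5f}%)")
```

Read this way, the single-GPU runs improve by roughly 5 to 6 percent, while the 8-GPU runs improve by roughly 2.5x to 3.2x in absolute throughput.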