Optimize batch matrix transposition for narrow matrices. (#13049)
* Specialize implementations of batch matrix transposition when the matrix is narrow.

* Reduce compilation time by removing unnecessary type specializations.

* 1. Remove macros.
  2. Use a single definition of frontier.
  3. Fix various issues.
  4. Use clang-format 5.0.

* Improve algorithm dispatcher and provide performance note.

  1. Ensure static errors when requesting tile size combinations outside the performant subspace.

  Performance Note:

  We define the _large problem size_ exploration precisely as:

      batch_num = [2**i for i in range(5, 13)]
      matrix_height = range(96, 2048, 16)
      matrix_width = range(2, 16)

  which consists of 13664 data points.

  We define the _small problem size_ exploration precisely as:

      batch_num = [2**i for i in range(5, 13, 2)]
      matrix_height = range(96, 2048, 128)
      matrix_width = range(2, 16, 2)

  which consists of 3472 data points.

  We define the on par or better percentage (OPB%) as the percentage of collected execution times that are within 10% of, or better than, the baseline implementation. Average speedup is measured across all execution times collected.

  We present our findings as follows:

      Arch    Dtype     PS      AvgSpeedup    OPB%
      K40     float4    small   1.15          99.3
      K40     uint64    small   1.05          92.8
      K40     float     large   1.15          87.1
      K40     uint16    small   1.25          86.8
      K40     uint8     small   1.28          89.3
      P100    float     large   1.81          99.5

* 1. Improve description.
  2. Add more tests.

* 1. Fix comments here and there.

* 1. Fix a sentence.

* 1. Fix a bug that causes redundant kernel executions.

* Optimize CUDA kernels.

  1. Loop unrolling.
  2. Special-case full tile execution.
  3. Reduce integer calculation instructions.
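The metrics in the performance note can be sketched in a few lines of Python. This is only an illustration, not the benchmarking script used for this change. The function names (`large_problem_grid`, `opb_percent`, `average_speedup`) are hypothetical. It also assumes that "within 10%" means the new time is at most 1.10 times the baseline time, and that average speedup is the arithmetic mean of the per-point `baseline / new` ratios. The note does not state either detail.

```python
from itertools import product


def large_problem_grid():
    """Enumerate the large problem size exploration from the performance note."""
    batch_num = [2**i for i in range(5, 13)]
    matrix_height = range(96, 2048, 16)
    matrix_width = range(2, 16)
    # One data point per (batch, height, width) combination.
    return list(product(batch_num, matrix_height, matrix_width))


def opb_percent(baseline_times, new_times, tolerance=0.10):
    """Percentage of points where the new time is on par with or better than baseline.

    Assumption: "on par" means new_time <= baseline_time * (1 + tolerance).
    """
    on_par = sum(
        1 for b, n in zip(baseline_times, new_times) if n <= b * (1.0 + tolerance)
    )
    return 100.0 * on_par / len(baseline_times)


def average_speedup(baseline_times, new_times):
    """Assumption: arithmetic mean of per-point baseline/new ratios."""
    ratios = [b / n for b, n in zip(baseline_times, new_times)]
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    print(len(large_problem_grid()))
    # Made-up timings, purely to exercise the two metrics.
    base = [1.0, 1.0, 1.0, 1.0]
    new = [0.5, 1.05, 1.2, 1.0]
    print(opb_percent(base, new), average_speedup(base, new))
```

A geometric mean of the ratios would be an equally plausible reading of "average speedup". The arithmetic mean is used here only because it is the simpler of the two.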