Commit a4bbf33e authored by Tian Jin, committed by drpngx

Optimize batch matrix transposition for narrow matrices. (#13049)

* Specialize implementations of batch matrix transposition when the matrix is narrow.

* Reduce compilation time by removing unnecessary type specializations.

* 1. Remove macros.
2. Use single definition of frontier.
3. Fix various issues.
4. Use clang-format 5.0

* Improve algorithm dispatcher and provide performance note.

1. Ensure static errors when requesting tile size combinations outside performant subspace.

Performance Note:

We define the _large problem size_ exploration precisely as:

batch_num = [2**i for i in range(5, 13)]
matrix_height = range(96, 2048, 16)
matrix_width  = range(2, 16)

which consists of 13664 data points.
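
As a quick sanity check on the count above, the grid can be enumerated directly in Python. This is only a sketch: the three range definitions are copied from the note, while `large_grid` is a name introduced here for illustration and is not part of the change itself.

```python
from itertools import product

# Large problem size grid, copied from the definition above.
batch_num = [2**i for i in range(5, 13)]   # 8 batch sizes: 32 .. 4096
matrix_height = range(96, 2048, 16)        # 122 heights: 96 .. 2032
matrix_width = range(2, 16)                # 14 widths: 2 .. 15

# Hypothetical name: one (batch, height, width) tuple per data point.
large_grid = list(product(batch_num, matrix_height, matrix_width))

print(len(large_grid))  # 13664
```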

We define the _small problem size_ exploration precisely as:

batch_num = [2**i for i in range(5, 13, 2)]
matrix_height = range(96, 2048, 128)
matrix_width  = range(2, 16, 2)

which consists of 3472 data points.

We define the on-par-or-better percentage (OPB%) as the percentage of collected execution times that are either better than the baseline implementation or within 10% of it. Average speedup is measured across all execution times collected. Our findings are as follows:

Arch    Dtype   ProblemSize  AvgSpeedup  OPB%
K40     float4  small        1.15        99.3
K40     uint64  small        1.05        92.8
K40     float   large        1.15        87.1
K40     uint16  small        1.25        86.8
K40     uint8   small        1.28        89.3
P100    float   large        1.81        99.5
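
The two metrics in the table could be computed from paired timings roughly as follows. This is a minimal sketch under stated assumptions, not the script used to produce the numbers: the function name `summarize` and its parameters are hypothetical, the speedup of one data point is taken to be baseline time divided by new time, the average is an arithmetic mean, and "within 10%" is read as the new time being at most 1.10 times the baseline time.

```python
def summarize(baseline_times, new_times, tolerance=0.10):
    """Return (average speedup, OPB%) for paired execution times.

    Assumptions (not stated in the note): speedup = baseline / new,
    averaged arithmetically; a data point is "on par or better" when
    new <= baseline * (1 + tolerance).
    """
    if len(baseline_times) != len(new_times) or not baseline_times:
        raise ValueError("need two non-empty sequences of equal length")

    pairs = list(zip(baseline_times, new_times))

    # Per-data-point speedup over the baseline implementation.
    speedups = [b / n for b, n in pairs]
    avg_speedup = sum(speedups) / len(speedups)

    # Count data points that are faster or at most `tolerance` slower.
    on_par = sum(1 for b, n in pairs if n <= b * (1 + tolerance))
    opb_percent = 100.0 * on_par / len(pairs)

    return avg_speedup, opb_percent


# Made-up timings for illustration only (not measurements from this change).
avg, opb = summarize([1.0, 2.0, 4.0, 1.0], [0.5, 2.0, 2.0, 1.25])
print(avg, opb)
```

In this made-up example three of the four data points are on par or better, so OPB% is 75.0, even though the fourth point is 25% slower than the baseline.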

* 1. Improve description.
2. Add more tests.

* 1. Fixing comments here and there.

* 1. Fix a sentence.

* 1. Fix a bug that causes redundant kernel executions.

* Optimize CUDA kernels.
1. Loop unrolling.
2. Special-case full tile execution.
3. Reduce integer calculation instructions.
parent fcf8e590