Commit 0b5cce36, authored by Eugene Brevdo, committed by TensorFlower Gardener

Get TopK op working on GPU again. Extend using cub's radix sort.

1. Undo rollback of Andreas Kirsch's initial implementation.
2. Use cub segmented radix sort instead of Andreas' heap-based impl
   for large k and small num_cols (thresholds of k=100, n=1000
   determined empirically).
3. Use cub segmented radix sort if k == num_cols (this case is always faster).
4. Add benchmarks.
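
The dispatch described in items 2 and 3 can be sketched in plain Python. This is
an illustration only, not the kernel code from this commit: the function names
are hypothetical, heapq and sorted stand in for the heap-based and cub radix
sort kernels, and the way the two thresholds combine (either one alone selects
the sort path) is an assumption, since the message does not spell it out.

```python
import heapq

# Thresholds quoted in the commit message. How the two combine is an
# assumption of this sketch: either one alone selects the sort path.
K_THRESHOLD = 100
NUM_COLS_THRESHOLD = 1000


def use_sort_path(k, num_cols):
    """Hypothetical dispatch rule: True means 'sort the whole row'."""
    if k == num_cols:  # item 3: the sorted row is already the full answer
        return True
    return k >= K_THRESHOLD or num_cols <= NUM_COLS_THRESHOLD


def topk_heap(row, k):
    """Heap-based selection: (value, index) pairs, largest value first."""
    pairs = heapq.nlargest(k, enumerate(row), key=lambda p: p[1])
    return [(value, index) for index, value in pairs]


def topk_sort(row, k):
    """Sort-based selection: sort every element, then keep the first k."""
    order = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
    return [(row[i], i) for i in order[:k]]


def topk(row, k):
    """Pick an implementation per row shape and run it."""
    impl = topk_sort if use_sort_path(k, len(row)) else topk_heap
    return impl(row, k)
```

Both paths return the same result for a given row; the rule only decides which
one is expected to be cheaper for that shape.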

Benchmarks show that the GPU implementation is up to about 5x slower than the
CPU for small k and num_cols, but up to about 7x faster for large num_cols and k.

Benchmarks:

Benchmark: m_128_n_10_k_5_use_gpu_False           wall_time: 0.000166 s  Throughput: 0.0077 GB/s
Benchmark: m_128_n_10_k_5_use_gpu_True            wall_time: 0.000796 s  Throughput: 0.00161 GB/s
Benchmark: m_128_n_10_k_9_use_gpu_False           wall_time: 0.00017 s   Throughput: 0.00751 GB/s
Benchmark: m_128_n_10_k_9_use_gpu_True            wall_time: 0.000796 s  Throughput: 0.00161 GB/s
Benchmark: m_128_n_10_k_10_use_gpu_False          wall_time: 0.00017 s   Throughput: 0.00753 GB/s
Benchmark: m_128_n_10_k_10_use_gpu_True           wall_time: 0.000775 s  Throughput: 0.00165 GB/s
Benchmark: m_128_n_100_k_1_use_gpu_False          wall_time: 0.000155 s  Throughput: 0.0826 GB/s
Benchmark: m_128_n_100_k_1_use_gpu_True           wall_time: 0.000796 s  Throughput: 0.0161 GB/s
Benchmark: m_128_n_100_k_50_use_gpu_False         wall_time: 0.000247 s  Throughput: 0.0519 GB/s
Benchmark: m_128_n_100_k_50_use_gpu_True          wall_time: 0.0008 s    Throughput: 0.016 GB/s
Benchmark: m_128_n_100_k_99_use_gpu_False         wall_time: 0.000261 s  Throughput: 0.049 GB/s
Benchmark: m_128_n_100_k_99_use_gpu_True          wall_time: 0.000794 s  Throughput: 0.0161 GB/s
Benchmark: m_128_n_100_k_100_use_gpu_False        wall_time: 0.000239 s  Throughput: 0.0536 GB/s
Benchmark: m_128_n_100_k_100_use_gpu_True         wall_time: 0.000777 s  Throughput: 0.0165 GB/s
Benchmark: m_128_n_1000_k_1_use_gpu_False         wall_time: 0.000324 s  Throughput: 0.395 GB/s
Benchmark: m_128_n_1000_k_1_use_gpu_True          wall_time: 0.000916 s  Throughput: 0.14 GB/s
Benchmark: m_128_n_1000_k_10_use_gpu_False        wall_time: 0.00042 s   Throughput: 0.305 GB/s
Benchmark: m_128_n_1000_k_10_use_gpu_True         wall_time: 0.000902 s  Throughput: 0.142 GB/s
Benchmark: m_128_n_1000_k_500_use_gpu_False       wall_time: 0.0011 s    Throughput: 0.116 GB/s
Benchmark: m_128_n_1000_k_500_use_gpu_True        wall_time: 0.00097 s   Throughput: 0.132 GB/s
Benchmark: m_128_n_1000_k_990_use_gpu_False       wall_time: 0.00133 s   Throughput: 0.0962 GB/s
Benchmark: m_128_n_1000_k_990_use_gpu_True        wall_time: 0.000993 s  Throughput: 0.129 GB/s
Benchmark: m_128_n_1000_k_1000_use_gpu_False      wall_time: 0.00102 s   Throughput: 0.126 GB/s
Benchmark: m_128_n_1000_k_1000_use_gpu_True       wall_time: 0.000964 s  Throughput: 0.133 GB/s
Benchmark: m_128_n_10000_k_10_use_gpu_False       wall_time: 0.002 s     Throughput: 0.64 GB/s
Benchmark: m_128_n_10000_k_10_use_gpu_True        wall_time: 0.00288 s   Throughput: 0.445 GB/s
Benchmark: m_128_n_10000_k_100_use_gpu_False      wall_time: 0.00233 s   Throughput: 0.549 GB/s
Benchmark: m_128_n_10000_k_100_use_gpu_True       wall_time: 0.00325 s   Throughput: 0.394 GB/s
Benchmark: m_128_n_10000_k_5000_use_gpu_False     wall_time: 0.0127 s    Throughput: 0.101 GB/s
Benchmark: m_128_n_10000_k_5000_use_gpu_True      wall_time: 0.00381 s   Throughput: 0.336 GB/s
Benchmark: m_128_n_10000_k_9900_use_gpu_False     wall_time: 0.015 s     Throughput: 0.0853 GB/s
Benchmark: m_128_n_10000_k_9900_use_gpu_True      wall_time: 0.00438 s   Throughput: 0.292 GB/s
Benchmark: m_128_n_10000_k_10000_use_gpu_False    wall_time: 0.0104 s    Throughput: 0.123 GB/s
Benchmark: m_128_n_10000_k_10000_use_gpu_True     wall_time: 0.00427 s   Throughput: 0.3 GB/s
Benchmark: m_128_n_100000_k_100_use_gpu_False     wall_time: 0.0148 s    Throughput: 0.865 GB/s
Benchmark: m_128_n_100000_k_100_use_gpu_True      wall_time: 0.0262 s    Throughput: 0.488 GB/s
Benchmark: m_128_n_100000_k_1000_use_gpu_False    wall_time: 0.0201 s    Throughput: 0.636 GB/s
Benchmark: m_128_n_100000_k_1000_use_gpu_True     wall_time: 0.0263 s    Throughput: 0.486 GB/s
Benchmark: m_128_n_100000_k_50000_use_gpu_False   wall_time: 0.214 s     Throughput: 0.0599 GB/s
Benchmark: m_128_n_100000_k_50000_use_gpu_True    wall_time: 0.0322 s    Throughput: 0.398 GB/s
Benchmark: m_128_n_100000_k_99000_use_gpu_False   wall_time: 0.262 s     Throughput: 0.0489 GB/s
Benchmark: m_128_n_100000_k_99000_use_gpu_True    wall_time: 0.0377 s    Throughput: 0.34 GB/s
Benchmark: m_128_n_100000_k_100000_use_gpu_False  wall_time: 0.118 s     Throughput: 0.108 GB/s
Benchmark: m_128_n_100000_k_100000_use_gpu_True   wall_time: 0.0365 s    Throughput: 0.351 GB/s
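
To compare the CPU and GPU rows above pairwise, a small script along these
lines may help. It is a sketch, not part of the benchmark code in this commit:
the function name gpu_speedups and the regular expression are assumptions that
only match the line format printed above, and the sample holds four of the
lines copied verbatim.

```python
import re

# Matches one "Benchmark: m_<m>_n_<n>_k_<k>_use_gpu_<bool>  wall_time: <t> s" line.
LINE_RE = re.compile(
    r"Benchmark:\s+m_(\d+)_n_(\d+)_k_(\d+)_use_gpu_(True|False)\s+"
    r"wall_time:\s+([0-9.eE+-]+)\s+s"
)


def gpu_speedups(text):
    """Return {(m, n, k): cpu_wall_time / gpu_wall_time} for paired rows."""
    times = {}
    for m, n, k, gpu, wall in LINE_RE.findall(text):
        times[(int(m), int(n), int(k), gpu == "True")] = float(wall)
    speedups = {}
    for (m, n, k, gpu), cpu_time in times.items():
        # Only start from CPU rows, and skip shapes with no GPU counterpart.
        if not gpu and (m, n, k, True) in times:
            speedups[(m, n, k)] = cpu_time / times[(m, n, k, True)]
    return speedups


sample = """
Benchmark: m_128_n_100_k_1_use_gpu_False  wall_time: 0.000155 s  Throughput: 0.0826 GB/s
Benchmark: m_128_n_100_k_1_use_gpu_True  wall_time: 0.000796 s  Throughput: 0.0161 GB/s
Benchmark: m_128_n_100000_k_99000_use_gpu_False  wall_time: 0.262 s  Throughput: 0.0489 GB/s
Benchmark: m_128_n_100000_k_99000_use_gpu_True  wall_time: 0.0377 s  Throughput: 0.34 GB/s
"""

speedups = gpu_speedups(sample)
```

A ratio above 1 means the GPU run was faster. For the two shapes in the sample
the ratios are roughly 0.19 (GPU slower) and 6.9 (GPU faster).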

END_PUBLIC

BEGIN_PUBLIC
Automated g4 rollback of changelist 157169178

PiperOrigin-RevId: 161124193
parent 0597c418