Commit 5a8c4707 authored by A. Unique TensorFlower, committed by TensorFlower Gardener

Optimized C++ and CUDA kernels for transposition.

 * Shard fallback CPU implementation.
 * Optimize index calculations by replacing one modulo with one subtraction and one multiply (which together have much lower latency).
 * Add optimized GPU kernels for on-the-fly conjugate transposition.

PiperOrigin-RevId: 172167514
parent 8fe6ea5f