Commit 5a8c4707 authored Oct 13, 2017 by A. Unique TensorFlower Committed by TensorFlower Gardener Oct 13, 2017

Optimized C++ and CUDA kernels for transposition.

 * Shard fallback CPU implementation.
 * Optimize index calculations by replacing 1 mod with 1 subtraction and 1 multiply (which together have much lower latency than the mod).
 * Add optimized GPU kernels for on-the-fly conjugate transposition.
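
The index-calculation change in the second bullet can be illustrated with a minimal sketch. This is not the code from this commit: the names `ComputeStrides`, `InputIndex`, and `Transpose` are hypothetical, and the sketch assumes a row-major layout with `out.dim[i] == in.dim[perm[i]]`. The point is that once the quotient `t / stride` has been computed, the remainder follows from `t - ratio * stride`, so no separate mod is needed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Row-major strides for the given dimensions (hypothetical helper).
std::vector<int64_t> ComputeStrides(const std::vector<int64_t>& dims) {
  std::vector<int64_t> strides(dims.size(), 1);
  for (int i = static_cast<int>(dims.size()) - 2; i >= 0; --i) {
    strides[i] = strides[i + 1] * dims[i + 1];
  }
  return strides;
}

// Maps a flat output index to the matching flat input index.
// Assumes out.dim[i] == in.dim[perm[i]].
int64_t InputIndex(int64_t out_index, const std::vector<int64_t>& in_strides,
                   const std::vector<int64_t>& out_strides,
                   const std::vector<int>& perm) {
  int64_t in_index = 0;
  int64_t t = out_index;
  for (std::size_t i = 0; i < perm.size(); ++i) {
    const int64_t ratio = t / out_strides[i];
    // Remainder via one multiply and one subtraction,
    // equivalent to `t %= out_strides[i]` for non-negative t.
    t -= ratio * out_strides[i];
    in_index += ratio * in_strides[perm[i]];
  }
  return in_index;
}

// Straightforward (unsharded, CPU-only) transpose built on InputIndex.
template <typename T>
std::vector<T> Transpose(const std::vector<T>& in,
                         const std::vector<int64_t>& in_dims,
                         const std::vector<int>& perm) {
  assert(perm.size() == in_dims.size());
  std::vector<int64_t> out_dims(in_dims.size());
  for (std::size_t i = 0; i < perm.size(); ++i) {
    out_dims[i] = in_dims[perm[i]];
  }
  const std::vector<int64_t> in_strides = ComputeStrides(in_dims);
  const std::vector<int64_t> out_strides = ComputeStrides(out_dims);
  std::vector<T> out(in.size());
  for (int64_t o = 0; o < static_cast<int64_t>(out.size()); ++o) {
    out[o] = in[InputIndex(o, in_strides, out_strides, perm)];
  }
  return out;
}
```

Because each output element is computed independently, a loop of this shape could also be split into contiguous ranges of `o` and handed to worker threads, which is one plausible reading of "shard" in the first bullet.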

PiperOrigin-RevId: 172167514

parent 8fe6ea5f
