Optimized C++ and CUDA kernels for transposition.
* Shard fallback CPU implementation.
* Optimize index calculations by trading 1 mod for 1 subtraction and 1 multiply (which have much lower combined latency).
* Add optimized GPU kernels for on-the-fly conjugate transposition.

PiperOrigin-RevId: 172167514
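
Illustration only, not the code from this change: a minimal single-threaded CPU sketch of the index calculation described above, assuming row-major layouts. The helper names (`RowMajorStrides`, `InputIndex`, `ConjugateTranspose`) are hypothetical. After the division, the remainder is recovered with one multiply and one subtraction instead of a separate mod, and the conjugate is applied while copying rather than in a second pass.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: row-major strides for a given shape.
std::vector<int64_t> RowMajorStrides(const std::vector<int64_t>& shape) {
  std::vector<int64_t> strides(shape.size(), 1);
  for (int i = static_cast<int>(shape.size()) - 2; i >= 0; --i) {
    strides[i] = strides[i + 1] * shape[i + 1];
  }
  return strides;
}

// Maps a flat output index to the flat input index it reads from, for
// out = transpose(in, perm).
int64_t InputIndex(int64_t out_idx, const std::vector<int64_t>& out_strides,
                   const std::vector<int64_t>& in_strides,
                   const std::vector<int>& perm) {
  int64_t in_idx = 0;
  int64_t t = out_idx;
  for (std::size_t i = 0; i < perm.size(); ++i) {
    const int64_t ratio = t / out_strides[i];
    in_idx += ratio * in_strides[perm[i]];
    // Equivalent to t %= out_strides[i], but reuses the quotient:
    // one multiply and one subtraction instead of a second division.
    t -= ratio * out_strides[i];
  }
  return in_idx;
}

// Transposes and conjugates in a single pass (assumed element type:
// std::complex<float>).
std::vector<std::complex<float>> ConjugateTranspose(
    const std::vector<std::complex<float>>& in,
    const std::vector<int64_t>& in_shape, const std::vector<int>& perm) {
  assert(in_shape.size() == perm.size());
  std::vector<int64_t> out_shape(perm.size());
  for (std::size_t i = 0; i < perm.size(); ++i) {
    out_shape[i] = in_shape[perm[i]];
  }
  const std::vector<int64_t> in_strides = RowMajorStrides(in_shape);
  const std::vector<int64_t> out_strides = RowMajorStrides(out_shape);
  std::vector<std::complex<float>> out(in.size());
  for (int64_t o = 0; o < static_cast<int64_t>(out.size()); ++o) {
    out[o] = std::conj(in[InputIndex(o, out_strides, in_strides, perm)]);
  }
  return out;
}
```

Because each output element is computed independently from its flat index, the outer loop in a sketch like this could be split into contiguous ranges and handed to separate threads, which is one way a sharded CPU fallback might be structured.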