Optimize Spatial and Cuboid backward kernel convolutions.
Without the shuffle, TensorExecutor uses the optimized (specialized) gemm_pack_rhs to pack memory before contraction. The custom rhs packer is much faster than contracting by the inner dimension with the default packer.

1. CuboidConvolutionBwdKernel: ~10x-25x speedup
2. SpatialConvolutionBwdKernel: ~2x-10x speedup

PiperOrigin-RevId: 212506483