Add GPU support for float16 batched matmul (#18436)
* Add GPU support for float16 batched matmul

  - Uses cublasGemmBatchedEx introduced in CUDA 9.1.
  - Includes support for Tensor Op math.
  - Falls back to a loop over non-batched gemm calls on older CUDA versions or GPU architectures.

* Refactor GPU batched gemm into one internal func
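
The dispatch described in this commit message can be sketched roughly as below. This is a CPU-only illustration, not the code from this change: `GemmPath`, `ChooseHalfBatchedGemmPath`, `Gemm`, and `LoopedGemm` are hypothetical names, `float` stands in for float16, and the minimum compute capability is an assumed value, since the message only says "older CUDA versions or GPU architectures". The one grounded detail is the CUDA 9.1 requirement, written here in the `CUDA_VERSION` encoding (9010).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical names throughout; this is not the cuBLAS or TensorFlow API.
enum class GemmPath { kBatched, kLoopFallback };

// CUDA 9.1 in the CUDA_VERSION encoding (major * 1000 + minor * 10).
constexpr int kMinCudaVersion = 9010;
// ASSUMPTION: the commit message does not state the architecture cutoff.
constexpr int kMinComputeMajor = 5;

// Pick the single batched call only when both the toolkit and the GPU
// are new enough; otherwise use the per-matrix loop.
GemmPath ChooseHalfBatchedGemmPath(int cuda_version, int compute_major) {
  if (cuda_version >= kMinCudaVersion && compute_major >= kMinComputeMajor) {
    return GemmPath::kBatched;
  }
  return GemmPath::kLoopFallback;
}

// Row-major C = A * B for one (m x k) by (k x n) product.
// Stand-in for a single non-batched gemm call.
void Gemm(const float* a, const float* b, float* c, int m, int n, int k) {
  for (int i = 0; i < m; ++i) {
    for (int j = 0; j < n; ++j) {
      float acc = 0.0f;
      for (int p = 0; p < k; ++p) {
        acc += a[i * k + p] * b[p * n + j];
      }
      c[i * n + j] = acc;
    }
  }
}

// Fallback path: one non-batched gemm per matrix in the batch.
void LoopedGemm(const std::vector<const float*>& a,
                const std::vector<const float*>& b,
                const std::vector<float*>& c, int m, int n, int k) {
  assert(a.size() == b.size() && a.size() == c.size());
  for (std::size_t i = 0; i < a.size(); ++i) {
    Gemm(a[i], b[i], c[i], m, n, k);
  }
}
```

On a real GPU the `kBatched` branch would issue one `cublasGemmBatchedEx` call over arrays of device pointers instead of looping; that call is omitted here so the sketch stays runnable without a CUDA toolchain.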