Parallelize inner matrix multiplications of BatchMatMul on CPU when appropriate.
* Uses simple heuristics to choose between parallelizing outer (batch), inner (matmul) or both.
* Adds benchmarks for BatchMatMul.
* Switches matmul benchmark to use real time so GFlops reported are w.r.t. walltime and measure the effect of multi-threading.
* Fixes bug in cost_per_unit calculation. The old code calculated B*M*N instead of M*N*K.

Change: 134025273
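
For illustration only, here is a minimal Python sketch of the kind of decision described above. It is not the code from this change: the function names, the threshold value, and the exact branching are assumptions made for the example. The one detail taken from the commit message is that the per-matmul cost is M*N*K (one multiply-add per output element per step along the shared dimension K), not B*M*N.

```python
# Hypothetical sketch; names, threshold and branching are assumptions,
# not the heuristic implemented in the actual change.

# Assumed cutoff above which a single matmul is worth splitting across threads.
INNER_PARALLEL_MIN_COST = 1 << 20


def matmul_cost(m, k, n):
    """Multiply-add count for one (m x k) @ (k x n) product: M*N*K."""
    return m * n * k


def choose_strategy(batch, m, k, n, num_threads):
    """Pick where to spend threads: 'serial', 'outer', 'inner' or 'both'."""
    if num_threads <= 1:
        return "serial"
    large = matmul_cost(m, k, n) >= INNER_PARALLEL_MIN_COST
    if batch == 1:
        # Nothing to split along the batch dimension.
        return "inner" if large else "serial"
    if not large:
        # Small matmuls: shard the batch, run each product on one thread.
        return "outer"
    if batch >= num_threads:
        # Enough batch entries to keep every thread busy on its own.
        return "outer"
    # Few, large matmuls: split the batch and each product.
    return "both"
```

With these assumed values, a batch of 64 products of shape 8x8x8 is sharded across the batch ("outer"), while a single 1024x1024x1024 product is parallelized internally ("inner"). The difference between the two cost formulas also shows up clearly here: for B=64, M=N=8, K=1024, B*M*N gives 4096 whereas M*N*K gives 65536, so the old formula could underestimate the work per batch entry when K is large.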