[XLA:GPU] Unroll multi-output loop fusions
This is easier than I thought because we can assume that all tuple members have the same number of elements.

LLVM doesn't do a great job of vectorizing the resulting stores, but otherwise this is working fine.

PiperOrigin-RevId: 197019718
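
A rough illustration of why the equal-element-count assumption helps: when every tuple member has the same number of elements, one linear index can address all outputs, so a single unrolled loop body can store to each of them. The sketch below is plain C++, not the LLVM IR that XLA emits; `Outputs`, `RunUnrolledMultiOutput`, and `kUnrollFactor` are hypothetical names, and the unroll factor of 4 is an assumption chosen for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a multi-output fusion that produces a 2-tuple.
// Both tuple members have the same number of elements.
struct Outputs {
  std::vector<float> sum;
  std::vector<float> product;
};

// Assumed unroll factor, chosen only for illustration.
constexpr std::size_t kUnrollFactor = 4;

Outputs RunUnrolledMultiOutput(const std::vector<float>& a,
                               const std::vector<float>& b) {
  assert(a.size() == b.size());
  const std::size_t n = a.size();
  Outputs out{std::vector<float>(n), std::vector<float>(n)};

  std::size_t i = 0;
  // Main loop: each iteration handles kUnrollFactor consecutive elements.
  // The same linear index is valid for every output because all tuple
  // members share one element count.
  for (; i + kUnrollFactor <= n; i += kUnrollFactor) {
    for (std::size_t u = 0; u < kUnrollFactor; ++u) {
      const std::size_t idx = i + u;
      out.sum[idx] = a[idx] + b[idx];
      out.product[idx] = a[idx] * b[idx];
    }
  }
  // Remainder loop for element counts not divisible by kUnrollFactor.
  for (; i < n; ++i) {
    out.sum[i] = a[i] + b[i];
    out.product[i] = a[i] * b[i];
  }
  return out;
}
```

Each unrolled step issues one store per output, so the stores to different tuple members are interleaved; this is the store pattern the message above notes that LLVM does not vectorize well.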