[XLA:GPU] Make the input-fused reduce emitter work on 16-bit types

There's a bunch of things going on here:
- BuildInitializerThunk threw away half of 16-bit init values. Fix that.
- Make HandleFusion verify that it gets input-fusible reduces.
- Fuse BF16 again in multi-output fusion. Excluding it was a workaround for
  the initializer bug.
- Drop the 32-bit requirement from unfused reduce emission. It is really
  confusing to have different code paths for fused and unfused reduces.
- Emit 8/16-bit integer add/min/max as CAS.

This is somewhat covered by existing tests.

PiperOrigin-RevId: 202125572
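
For readers unfamiliar with the CAS approach mentioned above, here is a minimal host-side sketch of the general idea, not the XLA emitter itself: a 16-bit add is performed by compare-and-swap on the 32-bit word that contains the value. The names ReplicateTo32 and AtomicAdd16ViaCas are hypothetical and do not appear in XLA. ReplicateTo32 is likewise only an assumption about how a 16-bit init value could be widened to a 32-bit fill pattern without dropping bits.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical helper: copies a 16-bit value into both halves of a 32-bit
// word, so a 32-bit fill writes the full 16-bit value to every element.
inline uint32_t ReplicateTo32(uint16_t v) {
  return static_cast<uint32_t>(v) | (static_cast<uint32_t>(v) << 16);
}

// Hypothetical helper: atomically adds `delta` to one 16-bit lane of a
// 32-bit word with a CAS loop. half_index 0 is the low 16 bits, 1 the high
// 16 bits. Returns the lane's value before the add.
inline uint16_t AtomicAdd16ViaCas(std::atomic<uint32_t>& word, int half_index,
                                  uint16_t delta) {
  assert(half_index == 0 || half_index == 1);
  const int shift = half_index * 16;
  const uint32_t mask = uint32_t{0xFFFF} << shift;

  uint32_t old_word = word.load(std::memory_order_relaxed);
  while (true) {
    const uint16_t old_lane =
        static_cast<uint16_t>((old_word & mask) >> shift);
    // The add wraps modulo 2^16; the neighbouring lane is left untouched.
    const uint16_t new_lane = static_cast<uint16_t>(old_lane + delta);
    const uint32_t new_word =
        (old_word & ~mask) | (static_cast<uint32_t>(new_lane) << shift);
    // On failure, compare_exchange_weak reloads old_word and we retry.
    if (word.compare_exchange_weak(old_word, new_word,
                                   std::memory_order_relaxed)) {
      return old_lane;
    }
  }
}
```

Min and max would follow the same loop with the add swapped for the corresponding comparison. The masking is what keeps a carry out of one lane from corrupting its neighbour.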