Update the checkpoints index file in CheckpointProto before actually deleting files.

Current order of execution is:
  1. Save the new checkpoint
  2. Delete old checkpoint files
  3. Update the checkpoint proto

If the job is preempted after 2, then the checkpoint proto is left
pointing to a deleted file. It is better to update the checkpoint proto first:
  1. Write the new checkpoint
  2. Update the checkpoint proto
  3. Delete the old checkpoint

I added tests to cover the checkpoint proto in 168744975 and they are not
failing with this change.

PiperOrigin-RevId: 169161095
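
For illustration only, the proposed ordering can be sketched as below. This is not the code from this change: the helper name save_checkpoint, the JSON index file checkpoint_index.json (a stand-in for the checkpoint proto), and the max_to_keep parameter are all hypothetical, and only the Python standard library is used.

```python
import json
import os
import tempfile


def save_checkpoint(directory, step, payload, max_to_keep=1):
    """Hypothetical sketch: write, update the index, then delete."""
    if max_to_keep < 1:
        raise ValueError("max_to_keep must be at least 1")
    # Hypothetical JSON index standing in for the checkpoint proto.
    index_path = os.path.join(directory, "checkpoint_index.json")
    if os.path.exists(index_path):
        with open(index_path) as f:
            known = json.load(f)["all_paths"]
    else:
        known = []

    # 1. Write the new checkpoint file.
    new_path = os.path.join(directory, "ckpt-%d" % step)
    with open(new_path, "w") as f:
        f.write(payload)

    # 2. Update the index so it lists only files that will survive step 3.
    all_paths = known + [new_path]
    keep = all_paths[-max_to_keep:]
    stale = all_paths[:-max_to_keep]
    tmp_path = index_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"latest": new_path, "all_paths": keep}, f)
    # Swap the index in with a rename so readers never see a partial file.
    os.replace(tmp_path, index_path)

    # 3. Only now delete the old checkpoint files.
    for path in stale:
        if os.path.exists(path):
            os.remove(path)
    return keep


if __name__ == "__main__":
    d = tempfile.mkdtemp()
    save_checkpoint(d, 1, "weights-v1")
    print(save_checkpoint(d, 2, "weights-v2"))
```

With this ordering, a preemption between steps 2 and 3 leaves stale files on disk, but the index never references a file that has already been deleted.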