Commit 662a2e6c authored by Igor Saprykin, committed by TensorFlower Gardener

Update the checkpoints index file in CheckpointProto before actually deleting files.

Current order of execution is:
1. Save the new checkpoint
2. Delete old checkpoint files
3. Update the checkpoint proto

If the job is preempted after step 2, the checkpoint proto is left pointing to a deleted file.

It is better to update the checkpoint proto first:
1. Save the new checkpoint
2. Update the checkpoint proto
3. Delete old checkpoint files
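
The reordered steps can be sketched in isolation as follows. This is a minimal illustration, not the code changed by this commit: the function `save_checkpoint`, the JSON index file `checkpoint_index.json`, and the `max_to_keep` parameter are hypothetical stand-ins for the saver and the checkpoint proto, and it assumes `max_to_keep >= 1`.

```python
import json
import os


def save_checkpoint(directory, name, data, max_to_keep=1):
    """Hypothetical helper: save `data`, update the index, then prune old files."""
    # Stand-in for the checkpoint proto (assumed file name, JSON for simplicity).
    index_path = os.path.join(directory, "checkpoint_index.json")

    # Load the names of known checkpoints (oldest first), if an index exists.
    known = []
    if os.path.exists(index_path):
        with open(index_path) as f:
            known = json.load(f)["all_checkpoints"]

    # 1. Save the new checkpoint.
    with open(os.path.join(directory, name), "w") as f:
        f.write(data)

    # 2. Update the index so it names only files that will survive step 3.
    all_names = [n for n in known if n != name] + [name]
    kept, stale = all_names[-max_to_keep:], all_names[:-max_to_keep]
    tmp_path = index_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"latest": name, "all_checkpoints": kept}, f)
    os.replace(tmp_path, index_path)  # swap the new index in as one rename

    # 3. Delete old checkpoint files only after the index stopped naming them.
    for old in stale:
        path = os.path.join(directory, old)
        if os.path.exists(path):
            os.remove(path)
    return kept
```

With this ordering, a preemption between steps 2 and 3 leaves stale files on disk, but the index never names a file that has already been deleted.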

I added tests covering the checkpoint proto in 168744975, and they still pass with this change.

PiperOrigin-RevId: 169161095
parent 7de939bb