Commit 3f7c05cc authored Jan 17, 2018 by Igor Saprykin Committed by TensorFlower Gardener Jan 17, 2018

Make `replicate_model_fn` friendlier to distributed training.

I verified that async distributed training works as is. One quirk: when replicating over a single GPU, variables end up placed on /gpu:0 on the parameter servers (PSs), which works only because allow_soft_placement=True.
For sync distributed training with SyncReplicasOptimizer, the only quirk is that SyncReplicasOptimizerHook requires SyncReplicasOptimizer.apply_gradients to have been called. That call happens only in the last tower, yet any tower could create the hook. To accommodate this requirement, this CL takes hooks from the last tower; previously, hooks from the first tower were taken.
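
The hook selection described above can be sketched in plain Python. This is an illustration only, not the code in this CL: `TowerSpec`, `build_towers`, and `select_training_hooks` are hypothetical names, and the hooks are represented as strings rather than real `SyncReplicasOptimizerHook` objects.

```python
from typing import List, NamedTuple


class TowerSpec(NamedTuple):
    """Hypothetical stand-in for the spec returned by one tower."""
    name: str
    training_hooks: List[str]
    calls_apply_gradients: bool


def build_towers(num_towers: int) -> List[TowerSpec]:
    """Builds toy tower specs; only the last one calls apply_gradients."""
    towers = []
    for i in range(num_towers):
        is_last = i == num_towers - 1
        towers.append(TowerSpec(
            name="tower_%d" % i,
            training_hooks=["sync_replicas_hook_from_tower_%d" % i],
            calls_apply_gradients=is_last))
    return towers


def select_training_hooks(tower_specs: List[TowerSpec]) -> List[str]:
    """Takes hooks from the last tower, the one that applies gradients."""
    return list(tower_specs[-1].training_hooks)


towers = build_towers(3)
hooks = select_training_hooks(towers)
print(hooks)
```

Taking `tower_specs[0]` instead would return a hook created in a tower where apply_gradients was never called, which is the situation the hook rejects.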

SyncReplicasOptimizer doesn't behave perfectly in tests. The queue keeps hanging, waiting for a new token to arrive, until `stop_grace_period_secs` elapses, which is set to 120 seconds. That setting isn't exposed through the Estimator interface, so the test is slower than it needs to be.
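
The slowdown can be illustrated with a toy model that uses only the standard library. It is a sketch under assumptions, not TensorFlow's implementation: `wait_for_token` and `stop_with_grace_period` are hypothetical names, and the grace period is shortened to 0.2 seconds so the example finishes quickly.

```python
import queue
import threading
import time


def wait_for_token(tokens: queue.Queue) -> None:
    """Blocks until a token arrives, like a replica waiting on the queue."""
    tokens.get()


def stop_with_grace_period(thread: threading.Thread,
                           grace_period_secs: float) -> float:
    """Waits up to the grace period for the thread; returns seconds waited."""
    start = time.monotonic()
    thread.join(timeout=grace_period_secs)
    return time.monotonic() - start


tokens = queue.Queue()
worker = threading.Thread(target=wait_for_token, args=(tokens,), daemon=True)
worker.start()

# No token is ever enqueued, so shutdown waits out the whole grace period.
elapsed = stop_with_grace_period(worker, grace_period_secs=0.2)
still_blocked = worker.is_alive()

# Clean up: hand over a token so the worker can exit.
tokens.put("token")
worker.join()
```

With a 120-second grace period and no way to lower it, the same wait is paid in full by the test.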

PiperOrigin-RevId: 182245657

parent a41ab15a
