Commit 3f7c05cc authored by Igor Saprykin, committed by TensorFlower Gardener

Make `replicate_model_fn` friendlier to distributed training.

I verified that async distributed training works as is.  One quirk: when replicating over a single GPU, variables end up placed on /gpu:0 on the parameter servers (PSs), which works correctly only because allow_soft_placement=True.
For sync distributed training using SyncReplicasOptimizer, the only quirk is that SyncReplicasOptimizerHook requires SyncReplicasOptimizer.apply_gradients to have been called.  That call happens only in the last tower, yet any tower could create the hook.  To accommodate that requirement, this CL takes hooks from the last tower.  Previously, hooks from the first tower were taken.
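
The hook selection described above can be sketched roughly as follows. This is an illustration only, not the code in this CL: `TowerSpec` and `select_training_hooks` are hypothetical names, and the hooks are plain strings standing in for real hook objects.

```python
from collections import namedtuple

# Hypothetical stand-in for a per-tower spec; only the field relevant
# to this change is modeled.
TowerSpec = namedtuple("TowerSpec", ["tower_name", "training_hooks"])


def select_training_hooks(tower_specs):
    """Return the training hooks of the last tower.

    apply_gradients is called only in the last tower, so only that
    tower's hooks are tied to an optimizer that has applied gradients.
    """
    if not tower_specs:
        raise ValueError("at least one tower spec is required")
    return list(tower_specs[-1].training_hooks)


# Placeholder strings stand in for real hook objects.
tower_specs = [
    TowerSpec("tower_0", ["sync_hook_from_tower_0"]),
    TowerSpec("tower_1", ["sync_hook_from_tower_1"]),
]
selected_hooks = select_training_hooks(tower_specs)
```

Taking the first tower's hooks instead would return a hook created before apply_gradients had been called, which is the case SyncReplicasOptimizerHook rejects.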

SyncReplicasOptimizer doesn't behave perfectly in tests.  The queue keeps hanging, waiting for a new token to arrive, until `stop_grace_period_secs` elapses, which is set to 120 seconds.  That setting isn't exposed through the Estimator interface, so the test is slower than it should be.

PiperOrigin-RevId: 182245657
parent a41ab15a