- FisherEstimator now supports computing products with arbitrary matrix powers of the approximate Fisher
- Added multi-tower support to multi/RNN fully connected layers
- All op creation is now done inside functions that explicitly create ops, thus allowing fine control of their placement. One result of this is that we no longer need any colocation statements (and these have been removed)
- Multi-tower computations are now handled using the PartitionedTensor class, which appears to be a single tensor to the FisherFactors but actually contains a list of tensors.
- To achieve the above, damping values are passed around as special functions that are packaged along with "ids" that can be used to uniquely identify the computation they perform. Topohash might provide a better solution for this in the future.
- Variable creation in the factors is now done via special methods so we can have fine control over where these are placed
- FisherEstimator now has special functions to create ops and variables using different placement strategies (currently: no strategy, round-robin, and as thunks). By default this will use the round-robin strategy and manufacture the usual convenience properties ("inv_update_ops", etc). This default behavior is to preserve backwards compatibility but in the future we should deprecate this and require the user to ask for an explicit strategy.
- LossFunctions no longer make any ops in their constructors. They only make ops when evaluated. LayerCollection maintains a list of tensors/ops which we can colocate LossFunction computations with (typically their inputs)
- LossFunctions no longer support multi-tower/mini-batches directly. Instead LayerCollection maintains a list of these objects, one for each tower. This solution is better since now the loss function related computations can take place exclusively on the corresponding tower.
- All loss functions now support multiple towers/minibatches (via LayerCollection).
- tf.gradients is now passed a list of loss function values instead of their sum, which prevents extraneous gradient ops from being placed on arbitrary devices. Hopefully, with this change and the loss function change above, all ops associated with gradient computations (for computing stats) will be placed entirely on the device that defines that part of the graph, e.g. this will do the right thing for multiple towers
- I've also made sure that sensible colocation occurs for the extra ops needed by the curvature_propagation and exact estimation modes.
- Variables and ops made by FisherEstimator are now placed inside of name scopes (based on the name given to FisherEstimator)
- Restored old variable use count tracker implementation, thus fixing the issue with how generic registrations were handled by check_registration().
- Restored interface to FisherEstimator (which was changed in the previous CL).
- Fixed bug in LazyKFacOptimizer: optional/named arguments weren't being passed in properly
- Lots of other minor refactors/improvements
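
For illustration only, the round-robin placement of op-creating thunks described above can be sketched in plain Python. The names `round_robin_place` and `make_thunk`, the device strings, and the op names are all hypothetical and are not the FisherEstimator API; the sketch only shows the idea of deferring op creation to functions and cycling those functions over devices.

```python
import itertools


def round_robin_place(thunks, devices):
    """Pair each op-creating thunk with a device, cycling through devices."""
    device_cycle = itertools.cycle(devices)
    return [(device, thunk) for thunk, device in zip(thunks, device_cycle)]


def make_thunk(name):
    # Hypothetical stand-in for a function that would build one op when called.
    return lambda: "built " + name


thunks = [make_thunk(n) for n in ("cov_update_0", "cov_update_1", "inv_update_0")]
placement = round_robin_place(thunks, ["/gpu:0", "/gpu:1"])

for device, thunk in placement:
    # In TensorFlow, this call would sit inside `with tf.device(device):`
    # so that the op is only created once its device is known.
    print(device, thunk())
```

Because nothing is built until a thunk is called, the caller decides where each op lives, which is presumably why explicit colocation statements are no longer needed.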
PiperOrigin-RevId: 188310846